Transcription

40-414 Compiler DesignImplementation of Lexical AnalysisLecture 31

Notation There is variation in regular expressionnotation At least one: A Union: A BOption: A Range: ‘a’ ’b’ ’z’Excluded range:complement of [a-z]Prof. Aiken AA* A B A? [a-z] [ a-z]2

Regular Expressions in Lexical Specification Last Lecture: a specification for the predicates L(R)Set of strings But a yes/no answer is not enough! Instead: partition the input into tokensc1c2c3 c4c5c6c7 We adapt regular expressions to this goalProf. Aiken3

Regular Expressions Lexical Spec. (1)1. Write a rexp for the lexemes of each token Number digit Keyword ‘if’ ‘else’ Identifier letter (letter digit)*OpenPar ‘(‘ Prof. Aiken4

Regular Expressions Lexical Spec. (2)2. Construct R, matching all lexemes for alltokensR Keyword Identifier Number R1 R2 Prof. Aiken5

Regular Expressions Lexical Spec. (3)3. Let input be x1 xnFor 1 i n checkx1 xi L(R)4. If success, then we know thatx1 xi L(Rj) for some j5. Remove x1 xi from input and go to (3)Prof. Aiken6

Ambiguities (1) There are ambiguities in the algorithm How much input is used? What if x1 xi L(R) and also x1 xK L(R)k i Rule: Pick longest possible string in L(R)– The “maximal munch”Prof. Aiken7

Ambiguities (2) Which token is used? What if x1 xi L(Rj) and also x1 xi L(Rk)R R1 R2 R3 k iKeyword ‘if’ ‘else’ Identifier letter (letter digit)* Rule: use rule listed first (j if j k)– Treats “if” as a keyword, not an identifierProf. Aiken8

Error Handling What ifNo rule matches a prefix of input ?x1 xi L(Rj) Problem: Can’t just get stuck Solution:– Write a rule matching all “bad” strings– Put it last (lowest priority)Prof. Aiken9

Summary Regular expressions provide a concise notation forstring patterns Use in lexical analysis requires small extensions– To resolve ambiguities– To handle errors Good algorithms known– Require only single pass over the input– Few operations per character (table lookup)Prof. Aiken10

Finite Automata Regular expressions specification Finite automata implementation A finite automaton consists of–––––An input alphabet A finite set of states SA start state nA set of accepting states F SA set of transitions state input stateProf. Aiken11

Finite Automata Transitions1 a s2 Is readIn state s1 on input “a” go to state s2 If end of input and in accepting state accept Otherwise reject Prof. AikenTerminates in a state s that isNOT an accepting state (s F)Gets stuck12

Finite Automata State Graphs A state The start state An accepting statea A transitionProf. Aiken13

A Simple Example A finite automaton that accepts only “1”1A Accepts ‘1’ : 1, Rejects ‘0’ : 0 Rejects ’10’ : 1,B1 1 0Prof. Aiken14

Another Simple Example A finite automaton accepting any number of 1’sfollowed by a single 0 Alphabet: {0,1}1A0B Accepts ‘110’: 110, 1 10, 11 0, Rejects ‘100’ : 100, 1 00,Prof. Aiken110 10 015

And Another Example Alphabet {0,1} What language does this recognize?010011Prof. Aiken16

And Another ExampleSelect the regularlanguage that denotesthe same language asthis finite automaton0101(0 1)*01(1* 0)(1 0)1* (01)* (001)* (000*1)*(0 1)*00Prof. Aiken17

And Another ExampleSelect the regularlanguage that denotesthe same language asthis finite automaton0101(0 1)*01(1* 0)(1 0)1* (01)* (001)* (000*1)*(0 1)*0018

Epsilon Moves Another kind of transition: -moves AB Machine can move from state A to state Bwithout reading inputProf. Aiken19

Deterministic and Nondeterministic Automata Deterministic Finite Automata (DFA)– One transition per input per state– No -moves Nondeterministic Finite Automata (NFA)– Can have multiple transitions for one input in agiven state– Can have -movesProf. Aiken20

Execution of Finite Automata A DFA can take only one path through thestate graph– Completely determined by input NFAs can choose– Whether to make -moves– Which of multiple transitions for a single input totakeProf. Aiken21

Acceptance of NFAs An NFA can get into multiple states1000 Input: Possible States:Rule: NFA accepts if it can get to a final state22Prof. Aiken

Acceptance of NFAs An NFA can get into multiple states100A01 Input: Possible States:Rule: NFA accepts if it can get to a final state23Prof. Aiken

Acceptance of NFAs An NFA can get into multiple states100A01 Input: Possible States: {A}Rule: NFA accepts if it can get to a final state24Prof. Aiken

Acceptance of NFAs An NFA can get into multiple states100A01 0 Input: Possible States: {A}Rule: NFA accepts if it can get to a final state25Prof. Aiken

Acceptance of NFAs An NFA can get into multiple states100BA01 0 Input:{A, B} Possible States: {A}Rule: NFA accepts if it can get to a final state26Prof. Aiken

Acceptance of NFAs An NFA can get into multiple states100BA01 0 0 Input:{A, B} Possible States: {A}Rule: NFA accepts if it can get to a final state27Prof. Aiken

Acceptance of NFAs An NFA can get into multiple states100BAC01 0 0 Input:{A, B, C}{A, B} Possible States: {A}Rule: NFA accepts if it can get to a final state28Prof. Aiken

NFA vs. DFA (1) NFAs and DFAs recognize the same set oflanguages (regular languages) DFAs are faster to execute– There are no choices to consider NFAs are, in general, smaller–Sometimes exponentially smallerProf. Aiken29

NFA vs. DFA (2) For a given language NFA can be simpler thanDFA1NFA0001DFA00011 DFA can be exponentially larger than NFAProf. Aiken30

Regular Expressions to Finite Automata High-level Table-drivenImplementation of DFAProf. Aiken31

Regular Expressions to NFA (1) For each kind of rexp, define an NFA– Notation: NFA for rexp MM For For input aaProf. Aiken32

Regular Expressions to NFA (2) For AB AB For A B B AProf. Aiken33

Regular Expressions to NFA (3) For A* A Prof. Aiken34

Example of RegExp - NFA conversion Consider the regular expression(1 0)*1 The NFA is A B C 1 0DF E G H I1J Prof. Aiken35

NFA to DFA: The Trick Simulate the NFA Each state of DFA a non-empty subset of states of the NFA Start state -closure of the start state of NFA Add a transition S a S’ to DFA iff– S’ is the set of NFA states reachable from any statein S after seeing the input a, considering -moves aswell Final states Subsets that include at least one final state of NFA36Prof. Aiken

-closure of a state -closure(B) {B,C,D} -closure(G) {A,B,C,D,G,H,I}Prof. Aiken37

NFA - DFA Example A B C 1 0DF E G H I1J Prof. Aiken38

NFA - DFA Example A B C 1 0DF E G H I1J Prof. Aiken39

NFA - DFA Example A B C 1 0DF E G H I1J Prof. Aiken40

NFA - DFA Example A B C 1 0DF E G H I1J ABCDHIProf. Aiken41

NFA - DFA Example A B C 1 0DF E G H I1J 0ABCDHIProf. Aiken42

NFA - DFA Example A B C 1 0DF E G H I1J 0ABCDHIProf. Aiken43

NFA - DFA Example A B C 1 0DF E G H I1J 0FGHIABCDABCDHIProf. Aiken44

NFA - DFA Example A B C 1 0DF E G H I1J 0FGHIABCDABCDHI1Prof. Aiken45

NFA - DFA Example A B C 1 0DF E G H I1J 0FGHIABCDABCDHI1Prof. Aiken46

NFA - DFA Example A B C 1 0DF E G H I1J 0FGHIABCD1EJGHIABCDABCDHIProf. Aiken47

NFA - DFA Example A B C 1 0DF E G H I1J 0ABCDHI1FGHIABCD0EJGHIABCDProf. Aiken48

NFA - DFA Example A B C 1 0DF E G H I1J 0ABCDHI1FGHIABCD10EJGHIABCDProf. Aiken49

NFA - DFA Example A B C 1 0DF E G H I1J 0ABCDHI10FGHIABCD10EJGHIABCDProf. Aiken50

NFA - DFA Example A B C 1 0DF E G H I1J 0ABCDHI10FGHIABCD10EJGHIABCDProf. Aiken151

Implementation A DFA can be implemented by a 2D table T– One dimension is “states”– Other dimension is “input symbol”– For every transition Si a Sk define T[i,a] k DFA “execution”– If in state Si and input a, read T[i,a] k and skip tostate Sk– Very efficientProf. Aiken52

Table Implementation of a DFA0S0T101USTU0TTTProf. Aiken11UUU53

Implementation (Cont.) NFA - DFA conversion is at the heart of toolssuch as flex But, DFAs can be huge In practice, flex-like tools trade off speedfor space in the choice of NFA and DFArepresentationsProf. Aiken54

DFA for recognizing two relational operatorsstart0 6 7other8return(SYMBOL, )* return(SYMBOL, )We’ve accepted “ ” and have read “other” character that must beunread. That is moving the input pointer one character back.55

DFA of Pascal relational operatorsstart0 1 2return(SYMBOL, ) 3return(SYMBOL, )other 45*return(SYMBOL, )return(SYMBOL, ) 6 7other8return(SYMBOL, )*return(SYMBOL, )56

DFA for recognizing id and keywordletter or digitstart0letter9other10*return(get token(), install id())returns either a KEYWORD or ID based onthe type of the tokenIf the token is an ID, its lexemeis inserted into the symboltable (only one record for eachlexeme); and lexeme of thetoken is returned.57

DFA of Pascal Unsigned E14 -15digit16other17*digitEotherreturn(NUM, lexeme of the number)58

Lexical errors Some errors are out of power of lexicalanalyzer to recognize:fi (a f(x)) However, it may be able to recognize errorslike: d 2r Such errors are recognized when no patternfor tokens matches a character sequence 59

Error recovery Panic mode: successive characters are ignoreduntil we reach to a well formed token Delete one character from the remaining input Insert a missing character into the remaininginput Replace a character by another character Transpose two adjacent characters60

Lexical Analyzer in Perspective (Revisited)Symbol Tablekeylexemetype 1positionreal 2initialreal 3ratereal sourceprogramtoken type, attribute lexicalanalyzeroutputparserget nexttokensymboltableposition initial rate * 6061 id, 1 op, id, 2 op, id, 3 op, * num, 60

Using Buffer to Enhance EfficiencyCurrent tokenE M *C * * 2 eoflexeme beginningif forward at end of first half then beginreload second half ;Block I/Oforward : forward 1endelse if forward at end of second half then beginreload first half ;Block I/Omove forward to biginning of first halfendelse forward : forward 1 ;forward (scansahead to findpattern match)62

Algorithm: Buffered I/O with SentinelsCurrent tokenE M * eof C * * 2 eoflexeme beginningforward : forward 1 ;if forward is at eof then beginif forward at end of first half then beginreload second half ;Block I/Oforward : forward 1endelse if forward at end of second half then beginreload first half ; Block I/Omove forward to biginning of first halfendelse / * eof within buffer signifying end of input * /terminate lexical analysisend2nd eof no more input !eofforward (scansahead to findpattern match)63

Question?Consider the string abbbaacc . Which of the following lexicalspecifications produces the tokenization:ab/bb/a/accChoose all that applyc*a(b c*)abb b b b ab*ac*ac*abac*Prof. Aiken64

Question?Using the lexical specification below, how is the string “dictatorial”tokenized?Choose all that apply1, 33dict (1)dictator (2)[a-z]* (3)dictatorial (4)42, 3Prof. Aiken65

Question?Given the following lexical specification:a(ba)*b*(ab)*abdd Which of the followingstatements is true?Choose all that applybabad will be tokenized as: bab/a/dababdddd will be tokenized as: abab/dddddddabbabab will be tokenized as: ddd/a/bbababababddababa will be tokenized as: ab/abd/d/ababaProf. Aiken66

Question?Given the following lexical specification:(00)*01 10 Which strings are NOTsuccessfully processed by thisspecification?Choose all that apply01111001100100011001100001101Prof. Aiken67

Question?Which of the following regularexpressions generate thesame language as the onerecognized by this NFA?Choose all that apply(000)*(01) 0(000)*1(01)*(000)*(10) 0(00)*(10)*0(000)*(01)*Prof. Aiken68

Question?Which of the following automata are DFA?Choose all that apply69Prof. Aiken

Question?Which of the following automata are NFA?Choose all that apply70Prof. Aiken

Lexical errors Some errors are out of power of lexical analyzer to recognize: fi (a f(x)) However, it may be able to recognize errors like: d 2r Such errors are recognize