
Transcription
40-414 Compiler DesignImplementation of Lexical AnalysisLecture 31
Notation There is variation in regular expressionnotation At least one: A Union: A BOption: A Range: ‘a’ ’b’ ’z’Excluded range:complement of [a-z]Prof. Aiken AA* A B A? [a-z] [ a-z]2
Regular Expressions in Lexical Specification Last Lecture: a specification for the predicates L(R)Set of strings But a yes/no answer is not enough! Instead: partition the input into tokensc1c2c3 c4c5c6c7 We adapt regular expressions to this goalProf. Aiken3
Regular Expressions Lexical Spec. (1)1. Write a rexp for the lexemes of each token Number digit Keyword ‘if’ ‘else’ Identifier letter (letter digit)*OpenPar ‘(‘ Prof. Aiken4
Regular Expressions Lexical Spec. (2)2. Construct R, matching all lexemes for alltokensR Keyword Identifier Number R1 R2 Prof. Aiken5
Regular Expressions Lexical Spec. (3)3. Let input be x1 xnFor 1 i n checkx1 xi L(R)4. If success, then we know thatx1 xi L(Rj) for some j5. Remove x1 xi from input and go to (3)Prof. Aiken6
Ambiguities (1) There are ambiguities in the algorithm How much input is used? What if x1 xi L(R) and also x1 xK L(R)k i Rule: Pick longest possible string in L(R)– The “maximal munch”Prof. Aiken7
Ambiguities (2) Which token is used? What if x1 xi L(Rj) and also x1 xi L(Rk)R R1 R2 R3 k iKeyword ‘if’ ‘else’ Identifier letter (letter digit)* Rule: use rule listed first (j if j k)– Treats “if” as a keyword, not an identifierProf. Aiken8
Error Handling What ifNo rule matches a prefix of input ?x1 xi L(Rj) Problem: Can’t just get stuck Solution:– Write a rule matching all “bad” strings– Put it last (lowest priority)Prof. Aiken9
Summary Regular expressions provide a concise notation forstring patterns Use in lexical analysis requires small extensions– To resolve ambiguities– To handle errors Good algorithms known– Require only single pass over the input– Few operations per character (table lookup)Prof. Aiken10
Finite Automata Regular expressions specification Finite automata implementation A finite automaton consists of–––––An input alphabet A finite set of states SA start state nA set of accepting states F SA set of transitions state input stateProf. Aiken11
Finite Automata Transitions1 a s2 Is readIn state s1 on input “a” go to state s2 If end of input and in accepting state accept Otherwise reject Prof. AikenTerminates in a state s that isNOT an accepting state (s F)Gets stuck12
Finite Automata State Graphs A state The start state An accepting statea A transitionProf. Aiken13
A Simple Example A finite automaton that accepts only “1”1A Accepts ‘1’ : 1, Rejects ‘0’ : 0 Rejects ’10’ : 1,B1 1 0Prof. Aiken14
Another Simple Example A finite automaton accepting any number of 1’sfollowed by a single 0 Alphabet: {0,1}1A0B Accepts ‘110’: 110, 1 10, 11 0, Rejects ‘100’ : 100, 1 00,Prof. Aiken110 10 015
And Another Example Alphabet {0,1} What language does this recognize?010011Prof. Aiken16
And Another ExampleSelect the regularlanguage that denotesthe same language asthis finite automaton0101(0 1)*01(1* 0)(1 0)1* (01)* (001)* (000*1)*(0 1)*00Prof. Aiken17
And Another ExampleSelect the regularlanguage that denotesthe same language asthis finite automaton0101(0 1)*01(1* 0)(1 0)1* (01)* (001)* (000*1)*(0 1)*0018
Epsilon Moves Another kind of transition: -moves AB Machine can move from state A to state Bwithout reading inputProf. Aiken19
Deterministic and Nondeterministic Automata Deterministic Finite Automata (DFA)– One transition per input per state– No -moves Nondeterministic Finite Automata (NFA)– Can have multiple transitions for one input in agiven state– Can have -movesProf. Aiken20
Execution of Finite Automata A DFA can take only one path through thestate graph– Completely determined by input NFAs can choose– Whether to make -moves– Which of multiple transitions for a single input totakeProf. Aiken21
Acceptance of NFAs An NFA can get into multiple states1000 Input: Possible States:Rule: NFA accepts if it can get to a final state22Prof. Aiken
Acceptance of NFAs An NFA can get into multiple states100A01 Input: Possible States:Rule: NFA accepts if it can get to a final state23Prof. Aiken
Acceptance of NFAs An NFA can get into multiple states100A01 Input: Possible States: {A}Rule: NFA accepts if it can get to a final state24Prof. Aiken
Acceptance of NFAs An NFA can get into multiple states100A01 0 Input: Possible States: {A}Rule: NFA accepts if it can get to a final state25Prof. Aiken
Acceptance of NFAs An NFA can get into multiple states100BA01 0 Input:{A, B} Possible States: {A}Rule: NFA accepts if it can get to a final state26Prof. Aiken
Acceptance of NFAs An NFA can get into multiple states100BA01 0 0 Input:{A, B} Possible States: {A}Rule: NFA accepts if it can get to a final state27Prof. Aiken
Acceptance of NFAs An NFA can get into multiple states100BAC01 0 0 Input:{A, B, C}{A, B} Possible States: {A}Rule: NFA accepts if it can get to a final state28Prof. Aiken
NFA vs. DFA (1) NFAs and DFAs recognize the same set oflanguages (regular languages) DFAs are faster to execute– There are no choices to consider NFAs are, in general, smaller–Sometimes exponentially smallerProf. Aiken29
NFA vs. DFA (2) For a given language NFA can be simpler thanDFA1NFA0001DFA00011 DFA can be exponentially larger than NFAProf. Aiken30
Regular Expressions to Finite Automata High-level Table-drivenImplementation of DFAProf. Aiken31
Regular Expressions to NFA (1) For each kind of rexp, define an NFA– Notation: NFA for rexp MM For For input aaProf. Aiken32
Regular Expressions to NFA (2) For AB AB For A B B AProf. Aiken33
Regular Expressions to NFA (3) For A* A Prof. Aiken34
Example of RegExp - NFA conversion Consider the regular expression(1 0)*1 The NFA is A B C 1 0DF E G H I1J Prof. Aiken35
NFA to DFA: The Trick Simulate the NFA Each state of DFA a non-empty subset of states of the NFA Start state -closure of the start state of NFA Add a transition S a S’ to DFA iff– S’ is the set of NFA states reachable from any statein S after seeing the input a, considering -moves aswell Final states Subsets that include at least one final state of NFA36Prof. Aiken
-closure of a state -closure(B) {B,C,D} -closure(G) {A,B,C,D,G,H,I}Prof. Aiken37
NFA - DFA Example A B C 1 0DF E G H I1J Prof. Aiken38
NFA - DFA Example A B C 1 0DF E G H I1J Prof. Aiken39
NFA - DFA Example A B C 1 0DF E G H I1J Prof. Aiken40
NFA - DFA Example A B C 1 0DF E G H I1J ABCDHIProf. Aiken41
NFA - DFA Example A B C 1 0DF E G H I1J 0ABCDHIProf. Aiken42
NFA - DFA Example A B C 1 0DF E G H I1J 0ABCDHIProf. Aiken43
NFA - DFA Example A B C 1 0DF E G H I1J 0FGHIABCDABCDHIProf. Aiken44
NFA - DFA Example A B C 1 0DF E G H I1J 0FGHIABCDABCDHI1Prof. Aiken45
NFA - DFA Example A B C 1 0DF E G H I1J 0FGHIABCDABCDHI1Prof. Aiken46
NFA - DFA Example A B C 1 0DF E G H I1J 0FGHIABCD1EJGHIABCDABCDHIProf. Aiken47
NFA - DFA Example A B C 1 0DF E G H I1J 0ABCDHI1FGHIABCD0EJGHIABCDProf. Aiken48
NFA - DFA Example A B C 1 0DF E G H I1J 0ABCDHI1FGHIABCD10EJGHIABCDProf. Aiken49
NFA - DFA Example A B C 1 0DF E G H I1J 0ABCDHI10FGHIABCD10EJGHIABCDProf. Aiken50
NFA - DFA Example A B C 1 0DF E G H I1J 0ABCDHI10FGHIABCD10EJGHIABCDProf. Aiken151
Implementation A DFA can be implemented by a 2D table T– One dimension is “states”– Other dimension is “input symbol”– For every transition Si a Sk define T[i,a] k DFA “execution”– If in state Si and input a, read T[i,a] k and skip tostate Sk– Very efficientProf. Aiken52
Table Implementation of a DFA0S0T101USTU0TTTProf. Aiken11UUU53
Implementation (Cont.) NFA - DFA conversion is at the heart of toolssuch as flex But, DFAs can be huge In practice, flex-like tools trade off speedfor space in the choice of NFA and DFArepresentationsProf. Aiken54
DFA for recognizing two relational operatorsstart0 6 7other8return(SYMBOL, )* return(SYMBOL, )We’ve accepted “ ” and have read “other” character that must beunread. That is moving the input pointer one character back.55
DFA of Pascal relational operatorsstart0 1 2return(SYMBOL, ) 3return(SYMBOL, )other 45*return(SYMBOL, )return(SYMBOL, ) 6 7other8return(SYMBOL, )*return(SYMBOL, )56
DFA for recognizing id and keywordletter or digitstart0letter9other10*return(get token(), install id())returns either a KEYWORD or ID based onthe type of the tokenIf the token is an ID, its lexemeis inserted into the symboltable (only one record for eachlexeme); and lexeme of thetoken is returned.57
DFA of Pascal Unsigned E14 -15digit16other17*digitEotherreturn(NUM, lexeme of the number)58
Lexical errors Some errors are out of power of lexicalanalyzer to recognize:fi (a f(x)) However, it may be able to recognize errorslike: d 2r Such errors are recognized when no patternfor tokens matches a character sequence 59
Error recovery Panic mode: successive characters are ignoreduntil we reach to a well formed token Delete one character from the remaining input Insert a missing character into the remaininginput Replace a character by another character Transpose two adjacent characters60
Lexical Analyzer in Perspective (Revisited)Symbol Tablekeylexemetype 1positionreal 2initialreal 3ratereal sourceprogramtoken type, attribute lexicalanalyzeroutputparserget nexttokensymboltableposition initial rate * 6061 id, 1 op, id, 2 op, id, 3 op, * num, 60
Using Buffer to Enhance EfficiencyCurrent tokenE M *C * * 2 eoflexeme beginningif forward at end of first half then beginreload second half ;Block I/Oforward : forward 1endelse if forward at end of second half then beginreload first half ;Block I/Omove forward to biginning of first halfendelse forward : forward 1 ;forward (scansahead to findpattern match)62
Algorithm: Buffered I/O with SentinelsCurrent tokenE M * eof C * * 2 eoflexeme beginningforward : forward 1 ;if forward is at eof then beginif forward at end of first half then beginreload second half ;Block I/Oforward : forward 1endelse if forward at end of second half then beginreload first half ; Block I/Omove forward to biginning of first halfendelse / * eof within buffer signifying end of input * /terminate lexical analysisend2nd eof no more input !eofforward (scansahead to findpattern match)63
Question?Consider the string abbbaacc . Which of the following lexicalspecifications produces the tokenization:ab/bb/a/accChoose all that applyc*a(b c*)abb b b b ab*ac*ac*abac*Prof. Aiken64
Question?Using the lexical specification below, how is the string “dictatorial”tokenized?Choose all that apply1, 33dict (1)dictator (2)[a-z]* (3)dictatorial (4)42, 3Prof. Aiken65
Question?Given the following lexical specification:a(ba)*b*(ab)*abdd Which of the followingstatements is true?Choose all that applybabad will be tokenized as: bab/a/dababdddd will be tokenized as: abab/dddddddabbabab will be tokenized as: ddd/a/bbababababddababa will be tokenized as: ab/abd/d/ababaProf. Aiken66
Question?Given the following lexical specification:(00)*01 10 Which strings are NOTsuccessfully processed by thisspecification?Choose all that apply01111001100100011001100001101Prof. Aiken67
Question?Which of the following regularexpressions generate thesame language as the onerecognized by this NFA?Choose all that apply(000)*(01) 0(000)*1(01)*(000)*(10) 0(00)*(10)*0(000)*(01)*Prof. Aiken68
Question?Which of the following automata are DFA?Choose all that apply69Prof. Aiken
Question?Which of the following automata are NFA?Choose all that apply70Prof. Aiken
Lexical errors Some errors are out of power of lexical analyzer to recognize: fi (a f(x)) However, it may be able to recognize errors like: d 2r Such errors are recognize