PLP: Lecture 3
Lexical Analysis
Restrictions on Regular Expressions
REs cannot contain recursive definitions, which limits the expressiveness of the languages they can describe. One example is that comments cannot be nested in C or Java, e.g., /* … /* … */ … */ is illegal.
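To see why nesting is out of reach for an RE: matching nested delimiters means counting how deep the nesting currently is, and that count has no upper bound. A minimal sketch, assuming a hypothetical language that did allow nested comments; the class and method names are made up for illustration:

```java
// Hypothetical helper for a language that allowed nested block comments.
// The unbounded depth counter is exactly what an RE (or DFA) cannot express.
final class NestedCommentSkipper {
    // src:   the source text
    // start: index of the opening comment delimiter (assumed to be there)
    // Returns the index just past the matching close, or -1 if unterminated.
    static int skip(String src, int start) {
        int depth = 0;
        int i = start;
        while (i + 1 < src.length()) {
            char a = src.charAt(i);
            char b = src.charAt(i + 1);
            if (a == '/' && b == '*') {        // another comment opens
                depth++;
                i += 2;
            } else if (a == '*' && b == '/') { // innermost open comment closes
                depth--;
                i += 2;
                if (depth == 0) {
                    return i;
                }
            } else {
                i++;
            }
        }
        return -1; // ran out of input before the outermost comment closed
    }
}
```

A scanner for such a language would call a helper like this by hand, outside the RE-based token rules.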
Recognizing Tokens
REs specify tokens. The lexical analysis phase of compilation uses a Deterministic Finite Automaton (DFA) to recognize the language described by an RE.
DFA := Finite State Machine with exactly one transition per state and input symbol
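As a rough illustration, here is a DFA for the RE letter (letter | digit)*, a simplified identifier. The RE, the state names, and the class name are assumptions chosen for this sketch, not something fixed by the lecture:

```java
// Illustrative DFA for the RE  letter (letter | digit)*  (a simple identifier).
final class IdentifierDfa {
    enum State { START, IN_IDENT, REJECT }

    // Transition function: exactly one next state per (state, character) pair.
    static State step(State s, char c) {
        switch (s) {
            case START:
                return Character.isLetter(c) ? State.IN_IDENT : State.REJECT;
            case IN_IDENT:
                return Character.isLetterOrDigit(c) ? State.IN_IDENT : State.REJECT;
            default:
                return State.REJECT; // REJECT is a trap state: no way back out
        }
    }

    // The input is accepted if the DFA finishes in the accepting state IN_IDENT.
    static boolean accepts(String input) {
        State s = State.START;
        for (int i = 0; i < input.length(); i++) {
            s = step(s, input.charAt(i));
        }
        return s == State.IN_IDENT;
    }

    public static void main(String[] args) {
        System.out.println(accepts("x1")); // starts with a letter
        System.out.println(accepts("1x")); // starts with a digit
    }
}
```

The machine keeps no memory beyond its current state, which is why it cannot count nesting depth.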
Scanners are responsible for tokenizing the source, as well as removing comments, handling pragmas, etc.
Whitespace is handled naturally: if it is encountered while scanning a token, the token ends there. The scanner may also need to recognize delimiters.
Java Implementation details of a Scanner
- Scanner class
- Enums for token types
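A minimal sketch of how these two pieces might fit together. The token kinds, the Token class, and the name SimpleScanner are all assumptions for a tiny made-up language, not the course's actual scanner:

```java
import java.util.ArrayList;
import java.util.List;

// Token kinds for a tiny illustrative language (an assumption of this sketch).
enum Kind { IDENT, INT_LIT, PLUS, SEMI, EOF }

// A token: its kind, its text, and where it starts in the source.
final class Token {
    final Kind kind;
    final String text;
    final int pos;

    Token(Kind kind, String text, int pos) {
        this.kind = kind;
        this.text = text;
        this.pos = pos;
    }

    @Override
    public String toString() {
        return kind + "(" + text + ")@" + pos;
    }
}

final class SimpleScanner {
    private final String src;
    private int pos = 0;

    SimpleScanner(String src) {
        this.src = src;
    }

    List<Token> scan() {
        List<Token> tokens = new ArrayList<>();
        while (true) {
            // Whitespace only separates tokens, so skip it between them.
            while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
            if (pos >= src.length()) {
                tokens.add(new Token(Kind.EOF, "", pos));
                return tokens;
            }
            int start = pos;
            char c = src.charAt(pos);
            if (Character.isLetter(c)) {
                // Identifier: stops at whitespace, a delimiter, or end of input.
                while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) pos++;
                tokens.add(new Token(Kind.IDENT, src.substring(start, pos), start));
            } else if (Character.isDigit(c)) {
                while (pos < src.length() && Character.isDigit(src.charAt(pos))) pos++;
                tokens.add(new Token(Kind.INT_LIT, src.substring(start, pos), start));
            } else if (c == '+') {
                pos++;
                tokens.add(new Token(Kind.PLUS, "+", start));
            } else if (c == ';') {
                pos++;
                tokens.add(new Token(Kind.SEMI, ";", start));
            } else {
                throw new IllegalArgumentException("Unexpected character '" + c + "' at " + start);
            }
        }
    }
}
```

Calling new SimpleScanner("x1 + 42;").scan() yields an identifier, a plus, an integer literal, a semicolon, and a final EOF token. Note that "a+b" scans the same way as "a + b": the delimiter ends the identifier even when no whitespace is present.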
Pragmas are comments that may contain information useful to the compiler or other tools, e.g., hints about potential for vectorization.
Closing remarks
Reserved words and keywords are matched by the same REs as identifiers, so the scanner has to tell them apart after the match. Which words are reserved is specific to the language. Keywords are reserved words that actually have some function. Store them in a hash map for fast lookup.
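The hash map lookup could look roughly like this. The keyword set, the kind names, and the class name KeywordTable are hypothetical; a real language defines its own list:

```java
import java.util.HashMap;
import java.util.Map;

final class KeywordTable {
    // Hypothetical token kinds for this sketch.
    enum Kind { IDENT, KW_IF, KW_WHILE, KW_RETURN }

    private static final Map<String, Kind> KEYWORDS = new HashMap<>();
    static {
        KEYWORDS.put("if", Kind.KW_IF);
        KEYWORDS.put("while", Kind.KW_WHILE);
        KEYWORDS.put("return", Kind.KW_RETURN);
    }

    // Called once the identifier RE has matched the whole lexeme:
    // a hit in the map means keyword, a miss means ordinary identifier.
    static Kind classify(String lexeme) {
        return KEYWORDS.getOrDefault(lexeme, Kind.IDENT);
    }
}
```

Because the lookup runs on the complete lexeme, a name such as whilex is classified as an identifier rather than as the keyword while followed by x.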