Lexical Analysis

Restrictions on Regular Expressions

Regular expressions cannot contain recursive definitions, which limits the languages they can describe. For example, comments cannot nest in C or Java: the comment ends at the first */, so /* … /* … */ … */ is illegal.
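Recognizing nested comments means counting how many openers are still unmatched, and a finite automaton has no unbounded counter. A scanner that did support nesting would need to track the depth explicitly. The following is a minimal sketch of that idea; NestedCommentSkipper and skip are hypothetical names for illustration, not part of any real compiler.

```java
// Hypothetical helper: skips a possibly nested /* ... */ comment
// by counting depth, which a plain regular expression cannot do.
final class NestedCommentSkipper {

    // Returns the index just past the comment that starts at `start`,
    // or -1 if the comment is never closed.
    static int skip(String src, int start) {
        if (!src.startsWith("/*", start)) {
            throw new IllegalArgumentException("no comment at index " + start);
        }
        int depth = 0;
        int i = start;
        while (i < src.length()) {
            if (src.startsWith("/*", i)) {
                depth++;            // one more unmatched opener
                i += 2;
            } else if (src.startsWith("*/", i)) {
                depth--;            // closes the innermost open comment
                i += 2;
                if (depth == 0) {
                    return i;       // outermost comment is now closed
                }
            } else {
                i++;
            }
        }
        return -1;                  // ran out of input while still inside a comment
    }

    public static void main(String[] args) {
        System.out.println(skip("/* a /* b */ c */ x", 0)); // prints 17
    }
}
```

The depth variable is exactly the piece of state that has no fixed upper bound, which is why this cannot be folded into a finite set of DFA states.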

Recognizing Tokens

REs specify tokens. Lexical analysis uses a Deterministic Finite Automaton (DFA) to recognize the language generated by an RE over the input alphabet.

DFA := Finite State Machine in which each state has exactly one transition per input symbol
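As an illustration, here is a small hand-written DFA for the RE [A-Za-z_][A-Za-z0-9_]*, a typical identifier pattern. The RE, the state names, and the class IdentifierDfa are assumptions chosen for this sketch; real scanners are often generated from the REs rather than written by hand.

```java
// Hypothetical DFA for the RE [A-Za-z_][A-Za-z0-9_]* (an identifier).
final class IdentifierDfa {
    enum State { START, IN_IDENT, REJECT }

    // Transition function: exactly one next state per (state, character) pair.
    static State step(State s, char c) {
        boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
        boolean digit = c >= '0' && c <= '9';
        switch (s) {
            case START:
                return letter ? State.IN_IDENT : State.REJECT;
            case IN_IDENT:
                return (letter || digit) ? State.IN_IDENT : State.REJECT;
            default:
                return State.REJECT; // REJECT is a trap state: no way out
        }
    }

    // Runs the DFA over the whole input; IN_IDENT is the only accepting state.
    static boolean accepts(String input) {
        State s = State.START;
        for (int i = 0; i < input.length(); i++) {
            s = step(s, input.charAt(i));
        }
        return s == State.IN_IDENT;
    }

    public static void main(String[] args) {
        System.out.println(accepts("count_1")); // prints true
        System.out.println(accepts("1count"));  // prints false
    }
}
```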

Scanners are responsible for tokenizing the source, as well as removing comments, handling pragmas, etc.

Whitespace is handled naturally: if it is encountered while scanning a token, the token ends there. The scanner may also need to recognize delimiters, which end a token in the same way.

Java Implementation details of a Scanner

  • Scanner class
  • Enums for token types
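A minimal sketch of how these two pieces might fit together is shown below. The names TokenType, Token, and SimpleScanner, as well as the particular token categories, are assumptions made for illustration; an actual scanner would cover far more token types and report errors.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical token categories; a real language would define many more.
enum TokenType { IDENTIFIER, INTEGER, SYMBOL, EOF }

final class Token {
    final TokenType type;
    final String text;

    Token(TokenType type, String text) {
        this.type = type;
        this.text = text;
    }

    @Override
    public String toString() {
        return type + "(" + text + ")";
    }
}

final class SimpleScanner {
    private final String src;
    private int pos = 0;

    SimpleScanner(String src) {
        this.src = src;
    }

    // Returns the next token, skipping any whitespace in front of it.
    Token next() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) {
            pos++;
        }
        if (pos >= src.length()) {
            return new Token(TokenType.EOF, "");
        }
        int start = pos;
        char c = src.charAt(pos);
        if (Character.isLetter(c) || c == '_') {
            // Keep consuming until whitespace or a delimiter ends the token.
            while (pos < src.length()
                    && (Character.isLetterOrDigit(src.charAt(pos)) || src.charAt(pos) == '_')) {
                pos++;
            }
            return new Token(TokenType.IDENTIFIER, src.substring(start, pos));
        }
        if (Character.isDigit(c)) {
            while (pos < src.length() && Character.isDigit(src.charAt(pos))) {
                pos++;
            }
            return new Token(TokenType.INTEGER, src.substring(start, pos));
        }
        pos++; // any other character becomes a one-character symbol
        return new Token(TokenType.SYMBOL, src.substring(start, pos));
    }

    // Convenience method: scans the whole input, including the final EOF token.
    static List<Token> scanAll(String src) {
        SimpleScanner scanner = new SimpleScanner(src);
        List<Token> tokens = new ArrayList<>();
        Token t;
        do {
            t = scanner.next();
            tokens.add(t);
        } while (t.type != TokenType.EOF);
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scanAll("x1 = 42;"));
    }
}
```

Using an enum for the token type lets later compiler phases switch over token kinds without comparing strings.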

Pragmas are comments that may contain information useful to the compiler or other tools, e.g., hints about the potential for vectorization.

Closing remarks

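Reserved words and keywords are matched by the same REs as identifiers; which words are reserved is specific to the language. Keywords are reserved words that actually have some function in the language. Store them in a hash map so the scanner can quickly check whether a scanned identifier is really a keyword.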
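The hash map lookup can be sketched as follows: scan the lexeme with the identifier RE first, then consult the table. KeywordTable, Kind, and the four sample keywords are hypothetical choices for this example, not a complete keyword list for any language.

```java
import java.util.HashMap;
import java.util.Map;

final class KeywordTable {
    // Hypothetical token kinds: a few keywords plus the identifier fallback.
    enum Kind { IF, ELSE, WHILE, RETURN, IDENTIFIER }

    private static final Map<String, Kind> KEYWORDS = new HashMap<>();
    static {
        KEYWORDS.put("if", Kind.IF);
        KEYWORDS.put("else", Kind.ELSE);
        KEYWORDS.put("while", Kind.WHILE);
        KEYWORDS.put("return", Kind.RETURN);
    }

    // Classifies a lexeme that has already matched the identifier RE:
    // a keyword if it is in the table, an ordinary identifier otherwise.
    static Kind classify(String lexeme) {
        return KEYWORDS.getOrDefault(lexeme, Kind.IDENTIFIER);
    }

    public static void main(String[] args) {
        System.out.println(classify("while"));  // prints WHILE
        System.out.println(classify("whilst")); // prints IDENTIFIER
    }
}
```

This keeps the DFA small, since it needs only one identifier pattern instead of a separate path for every keyword.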