Compilation Overview

  1. Character stream -> Scanner (lexical analysis) ->
  2. Token stream -> Parser (syntax analysis) ->
  3. Parse tree -> Semantic analysis and intermediate code generation ->
  4. Abstract syntax tree or other intermediate form -> Machine-independent code improvement (optional) ->
  5. Modified intermediate form -> Target code generation ->
  6. Target language (e.g., assembler) -> Machine-specific code improvement (optional) ->
  7. Modified target language

1-3 represent the front-end and 4-7 represent the back-end. Java byte-code is an “intermediate form” from 4. You can mix and match different front-ends and back-ends. Creates a symbol table to map addresses/other info for variables.

Interpretation Overview

  1. Character stream -> Scanner (lexical analysis) ->
  2. Token stream -> Parser (syntax analysis) ->
  3. Parse tree -> Semantic analysis and intermediate code generation ->
  4. Abstract syntax tree or other intermediate form -> Tree-walk routines
  5. Program input -> Program output (at run-time)

Lexical Analysis

Reference: Scott, CH 2 sections 2.1.1 and 2.2, need to skim 2.2.3

Lexers and scanners are synonymous. Note that syntax describes the structure of a language, while semantics describes the “meaning”.

Regular Expressions

Gives us a formal notation for lexical structure.

Formal definition of RE

A minimal RE definition is:

  1. An atom, which is an element of a finite alphabet $\Sigma$
  2. $\epsilon$ which denotes the set containing the empty string
  3. Concatentation denotes the set containing a string from the first RE concatenated with a string from the second RE
  4. Alternative, represented by ‘|’, which denotes the union of two sets
  5. Kleene closure, which is a regular expression followed by a Kleene star * meaning the set of strings obtained by concatenation of zero or more strings from the set denoted by the regular expression in front of the star
  6. Positive closure is a RE followed by + which means the concatenation of one or more strings generated by the regular expression in front of the star
  7. Not(A) is $\Sigma$\A

Examples: Specifying tokens of a programming language

Can name regular expressions and use them in other regular expressions as if they were symbols.

  • <op/> ::= + | == | *
  • <digit> ::= 0 | <nonzero digit>
  • <nonzero digit> ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
  • Identifiers
    • <identifier> ::= <identifier start> <identifier part>*
    • <identifier start> ::= A .. Z | a .. z | $
    • <identifier part> ::= <identifier start> | <digit>
  • <numeric literal/> ::= <nonzero digit> <digit/> * | 0
  • <token/> ::= <identifier/> | <numeric literal> | <op/>
  • <comment> ::= //(Not(Eol))*Eol