PLP: Lecture 2
Compilation Overview
- Character stream -> Scanner (lexical analysis) ->
- Token stream -> Parser (syntax analysis) ->
- Parse tree -> Semantic analysis and intermediate code generation ->
- Abstract syntax tree or other intermediate form -> Machine-independent code improvement (optional) ->
- Modified intermediate form -> Target code generation ->
- Target language (e.g., assembler) -> Machine-specific code improvement (optional) ->
- Modified target language
1-3 represent the front-end and 4-7 represent the back-end. Java byte-code is an “intermediate form” from 4. You can mix and match different front-ends and back-ends. Creates a symbol table to map addresses/other info for variables.
Interpretation Overview
- Character stream -> Scanner (lexical analysis) ->
- Token stream -> Parser (syntax analysis) ->
- Parse tree -> Semantic analysis and intermediate code generation ->
- Abstract syntax tree or other intermediate form -> Tree-walk routines
- Program input -> Program output (at run-time)
Lexical Analysis
Reference: Scott, CH 2 sections 2.1.1 and 2.2, need to skim 2.2.3
Lexers and scanners are synonymous. Note that syntax describes the structure of a language, while semantics describes the “meaning”.
Regular Expressions
Gives us a formal notation for lexical structure.
Formal definition of RE
A minimal RE definition is:
- An atom, which is an element of a finite alphabet $\Sigma$
- $\epsilon$ which denotes the set containing the empty string
- Concatentation denotes the set containing a string from the first RE concatenated with a string from the second RE
- Alternative, represented by ‘|’, which denotes the union of two sets
- Kleene closure, which is a regular expression followed by a Kleene star * meaning the set of strings obtained by concatenation of zero or more strings from the set denoted by the regular expression in front of the star
- Positive closure is a RE followed by + which means the concatenation of one or more strings generated by the regular expression in front of the star
- Not(A) is $\Sigma$\A
Examples: Specifying tokens of a programming language
Can name regular expressions and use them in other regular expressions as if they were symbols.
- <op/> ::= + | == | *
- <digit> ::= 0 | <nonzero digit>
- <nonzero digit> ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
- Identifiers
- <identifier> ::= <identifier start> <identifier part>*
- <identifier start> ::= A .. Z | a .. z | $
- <identifier part> ::= <identifier start> | <digit>
- <numeric literal/> ::= <nonzero digit> <digit/> * | 0
- <token/> ::= <identifier/> | <numeric literal> | <op/>
- <comment> ::= //(Not(Eol))*Eol