PLP: Lecture 2

Compilation Overview

Character stream -> Scanner (lexical analysis) ->
Token stream -> Parser (syntax analysis) ->
Parse tree -> Semantic analysis and intermediate code generation ->
Abstract syntax tree or other intermediate form -> Machine-independent code improvement (optional) ->
Modified intermediate form -> Target code generation ->
Target language (e.g., assembler) -> Machine-specific code improvement (optional) ->
Modified target language

1-3 represent the front-end and 4-7 represent the back-end. Java byte-code is an “intermediate form” from 4. You can mix and match different front-ends and back-ends. Creates a symbol table to map addresses/other info for variables.

Interpretation Overview

Character stream -> Scanner (lexical analysis) ->
Token stream -> Parser (syntax analysis) ->
Parse tree -> Semantic analysis and intermediate code generation ->
Abstract syntax tree or other intermediate form -> Tree-walk routines
Program input -> Program output (at run-time)

Lexical Analysis

Reference: Scott, CH 2 sections 2.1.1 and 2.2, need to skim 2.2.3

Lexers and scanners are synonymous. Note that syntax describes the structure of a language, while semantics describes the “meaning”.

Regular Expressions

Gives us a formal notation for lexical structure.

Formal definition of RE

A minimal RE definition is:

An atom, which is an element of a finite alphabet $\Sigma$
$\epsilon$ which denotes the set containing the empty string
Concatentation denotes the set containing a string from the first RE concatenated with a string from the second RE
Alternative, represented by ‘|’, which denotes the union of two sets
Kleene closure, which is a regular expression followed by a Kleene star * meaning the set of strings obtained by concatenation of zero or more strings from the set denoted by the regular expression in front of the star
Positive closure is a RE followed by + which means the concatenation of one or more strings generated by the regular expression in front of the star
Not(A) is $\Sigma$\A

Examples: Specifying tokens of a programming language

Can name regular expressions and use them in other regular expressions as if they were symbols.

<op/> ::= + | == | *
<digit> ::= 0 | <nonzero digit>
<nonzero digit> ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Identifiers
- <identifier> ::= <identifier start> <identifier part>*
- <identifier start> ::= A .. Z | a .. z | $
- <identifier part> ::= <identifier start> | <digit>
<numeric literal/> ::= <nonzero digit> <digit/> * | 0
<token/> ::= <identifier/> | <numeric literal> | <op/>
<comment> ::= //(Not(Eol))*Eol