Notes
Outline
Compilation and Compilers
Parts of a compiler
Lexical analyzer
Parser
Code generator
Languages
Grammars and parse trees
Syntax diagrams
   Parts of a Compiler
Compilers divide their problem into steps or passes to conquer it
Initial pass takes as input the source program
Last pass outputs the code for execution
Driver programs such as cc are often used to invoke programs that cause subsequent passes to take place
Compiler
A program or set of programs that translates one language into another
Four pass compiler
   Passes
Pass 1: Preprocessor
Macro and constant substitution
Strip comments from source code
Pass 2: Lexical analyzer, Parser, Code generator
Heart of the compiler
Translates source into a platform independent language much like assembler (Intermediate code 1)
Pass 3: Optimizer
Improves the quality of the intermediate code
Pass 4: Back end
Translates the optimized code to real assembler  language or directly to to some form of binary executable code
Provides target independence for earlier phases
Java, UCSD Pascal (Apple II): produces byte-code, which still needs to be interpreted (or compiled again) when executing
About Passes of a Compiler
Each pass requires the traversal of the current representation of the source
Slow, especially if the representation is not all in memory (huge source files, ......)
Modern compilers may accomplish the work associated with more than one pass in one pass over the code
Turbo Pascal in the 1980’s combined the work from all passes into a single pass
Advantage was speed
Disadvantage is that you have trouble dealing with more complex language features and problems generating efficient code
Lexical Analyzer
(Scanner or Tokenizer)
Translates source into a form more usable by the compiler
Converts the incoming source into a series of basic language elements called tokens
A = BB + 3, has 5 tokens A, =, BB, +, 3
Tokens have meaning and are indivisible
In C, while and for are tokens, you can’t say wh ile
Can be placed in a symbol table and have information associated with them
For example type, value, name of identifier, relationship to other structures
Can be referenced by unique integer in later intermediate representations of source
Original string that comprises a token is called a lexeme
Need to map each lexeme to a token
In general a lexical analyzer recognizes/chooses the token that matches the longest lexeme (as required by language specifications)
For ex:  “>>“ is  interpreted as output in C++ not  greater than greater than
Choice of token set is an import aspect of compiler design
For ex: are >, >=, >>, and >>= four comparison operators, or a single comparison operator with further type information?
   Parser
The parser analyses the source grammatically to determine whether it meets the language specification and to develop a representation better suited to code generation
Parser invokes the lexical analyzer to get the next token (reference into symbol table) and its corresponding lexeme (value)
Note: lexical analyzer is separate and can be replaced
Example of checking syntax and semantics of a sentence
Syntax diagram relate parts of a sentence
Jane  Sees  Spot
Semantics gives the parse tree
Jane sees Spot run
sentence
subject    predicate
noun       verb           object
Jane        sees            noun       participle
                                 Spot        run
Syntax and Parse Trees for A*B + C*D
Syntax tree
                                +
                        *             *
                   A     B      C      D
Parse tree
                               expression
                       factor      +      factor
               factor * factor   factor * factor
                   A        B           C           D
Some compilers have an explicit representation of the tree, others do not
Physical representation would include pointers etc.
The parse tree breaks up the program source into a hierarchical representation for valid programs
Sentence (program) at the top
Tokens at the bottom
To summarize
Parser breaks the token stream into a parse tree
Parse tree is a structural representation of the sentence or program being parsed
   Code Generator
Last part of the compiler proper
Most compilers generate the output of the code generator as the parse progresses instead of leaving it until after a parse tree is built
Small parts of the parse tree fill in code templates that are generated by the code generator
An if, then, else will have compare, branch, and label components in the corresponding code
Code generator can generate
Executable code
Advantage: fast
Some aspects of optimization can still take place by observing the final linear instruction stream
OR
Intermediate language representation that is close to assembler but has additional information
Makes it easier for optimizers to perform further optimizations to generate faster code
Intermediate
Languages
Give flexibility
One intermediate language can be mapped onto many targets (Intel, Motorola, etc.)
The targets only need separate back ends
Can be processed by interpreters
Interpreters read the instructions and have to map them onto machine executable instructions
MOV(A,B) becomes a subroutine in an interpreter
JAVA is like C++ but is compiled into a byte code (which is an assembly language for a virtual machine)
This byte code can then be interpreted on different platforms
Offers platform independence but is slow
Some vendors, translate the byte code into a target dependent machine code to make it run faster
Translation to machine code can be done on the fly (Just-In-Time compilers)
   Languages
Computer languages are chosen in a way that makes them easier to compile
They are described using grammars
Grammars
Backus Naur (also called Backus-Normal) form often used to describe grammars for languages
Set of tokens called terminal symbols
For things like numbers, key words, predefined symbols
Set of definitions called non-terminal symbols
For ex:   a -> b | c,    reads as a is either b or c
Definitions create a system in which every legal structure can be represented
Grammars are typically recursive, so recursion can be used to parse the grammar
Language for Simple Expression Grammar
Terminals: end-of-input, epsilon, number, +, *, (, ), ;
end-of-input represents end of file
epsilon is an empty string needed to match an expression, it is not a token, it represents the absence of a token
+, * are the legal operators in this language
Non-terminal symbols:
statements -> end-of-input | expression; statements
expression -> term expression’
expression’ -> + term expression’ | epsilon
term -> factor term’
term’ -> * factor term’ | epsilon
factor -> number | (expression)
Consider a parse tree of “1+2;”
                                         statements
                      expression                           ;       statements
         term               expression’                       end-of-input
     factor    term’   +    term     expression’
  1               epsilon      factor  term’   epsilon
                                     2       epsilon
BNF grammar for a subset of Pascal
prog -> PROGRAM prog-name VAR dec-list BEGIN stmt-list END.
prog-name -> identifier
dec-list -> dec | dec-list ; dec
dec -> id-list : type
type -> INTEGER
id-list -> identifier | id-list , identifier
stmt-list -> stmt | stmt-list ; stmt
stmt -> assign | read | write | for
assign -> identifier := exp
exp -> term | exp + term | exp - term
term -> factor | term * factor | term DIV factor
factor - > identifier | integer | ( exp )
read -> READ ( id-list )
write -> WRITE  ( id-list )
for -> FOR index-exp DO body
index-exp -> identifier := exp TO exp
body -> stmt | BEGIN stmt-list END
Reserved keywords (terminals): PROGRAM, VAR, BEGIN, END, ., ;, :, INTEGER, :=, +, -, *, DIV, (, ), READ, WRITE, FOR, DO, TO
Also assume that tokenizer recognizes IDENTIFIERS and INTEGERS
   Notes on Compilers
Compilers can recognize when templates or objects are instantiated and destroyed
It can do so because these are part of the language definition
Once the pattern is matched, it can output intermediate level code to support these operations
Template parameters can be filled in
Calls are made to appropriate routines to construct/destroy objects
This processing is often called a middle pass and can be supported by the parser in combination with the code generator and a set of supporting routines invoked during program execution
The subroutines  invoked during program execution are part of  the runtime environment of the program and are provided by the compilation/execution environment
Interpreters
Give flexibility but are slower
Can modify the interpreted program on the fly and see the impact immediately without a  regeneration of code
Interpretation can be at the source program level, or at an intermediate language level