|
|
|
|
Parts of a compiler |
|
Lexical analyzer |
|
Parser |
|
Code generator |
|
Languages |
|
Grammars and parse trees |
|
Syntax diagrams |
|
|
|
|
|
Compilers divide their problem into steps or passes
to conquer it |
|
Initial pass takes as input the source program |
|
Last pass outputs the code for execution |
|
Driver programs such as cc are often used to
invoke programs that cause subsequent passes to take place |
|
Compiler |
|
A program or set of programs that translates one
language into another |
|
Four pass compiler |
|
|
|
|
|
Pass 1: Preprocessor |
|
Macro and constant substitution |
|
Strip comments from source code |
|
Pass 2: Lexical analyzer, Parser, Code generator |
|
Heart of the compiler |
|
Translates source into a platform independent
language much like assembler (Intermediate code 1) |
|
Pass 3: Optimizer |
|
Improves the quality of the intermediate code |
|
Pass 4: Back end |
|
Translates the optimized code to real
assembler language or directly to
to some form of binary executable code |
|
Provides target independence for earlier phases |
|
Java, UCSD Pascal (Apple II): produces
byte-code, which still needs to be interpreted (or compiled again) when
executing |
|
|
|
|
|
|
Each pass requires the traversal of the current
representation of the source |
|
Slow, especially if the representation is not
all in memory (huge source files, ......) |
|
Modern compilers may accomplish the work
associated with more than one pass in one pass over the code |
|
Turbo Pascal in the 1980’s combined the work
from all passes into a single pass |
|
Advantage was speed |
|
Disadvantage is that you have trouble dealing
with more complex language features and problems generating efficient code |
|
|
|
|
|
|
Translates source into a form more usable by the
compiler |
|
Converts the incoming source into a series of
basic language elements called tokens |
|
A = BB + 3, has 5 tokens A, =, BB, +, 3 |
|
Tokens have meaning and are indivisible |
|
In C, while and for are tokens, you can’t say wh
ile |
|
Can be placed in a symbol table and have
information associated with them |
|
For example type, value, name of identifier,
relationship to other structures |
|
Can be referenced by unique integer in later
intermediate representations of source |
|
Original string that comprises a token is called
a lexeme |
|
Need to map each lexeme to a token |
|
In general a lexical analyzer recognizes/chooses
the token that matches the longest lexeme (as required by language
specifications) |
|
For ex:
“>>“ is interpreted as
output in C++ not greater than
greater than |
|
Choice of token set is an import aspect of
compiler design |
|
For ex: are >, >=, >>, and >>=
four comparison operators, or a single comparison operator with further
type information? |
|
|
|
|
|
The parser analyses the source grammatically to
determine whether it meets the language specification and to develop a
representation better suited to code generation |
|
Parser invokes the lexical analyzer to get the
next token (reference into symbol table) and its corresponding lexeme
(value) |
|
Note: lexical analyzer is separate and can be
replaced |
|
Example of checking syntax and semantics of a
sentence |
|
Syntax diagram relate parts of a sentence |
|
Jane Sees Spot |
|
Semantics gives the parse tree |
|
Jane sees Spot run |
|
sentence |
|
|
|
subject
predicate |
|
|
|
noun
verb object |
|
|
|
Jane
sees noun participle |
|
|
|
Spot run |
|
|
|
|
|
Syntax tree |
|
+ |
|
* * |
|
A
B C D |
|
Parse tree |
|
expression |
|
factor + factor |
|
factor * factor factor *
factor |
|
A
B C D |
|
Some compilers have an explicit representation
of the tree, others do not |
|
Physical representation would include pointers
etc. |
|
The parse tree breaks up the program source into
a hierarchical representation for valid programs |
|
Sentence (program) at the top |
|
Tokens at the bottom |
|
To summarize |
|
Parser breaks the token stream into a parse tree |
|
Parse tree is a structural representation of the
sentence or program being parsed |
|
|
|
|
|
Last part of the compiler proper |
|
Most compilers generate the output of the code
generator as the parse progresses instead of leaving it until after a parse
tree is built |
|
Small parts of the parse tree fill in code
templates that are generated by the code generator |
|
An if, then, else will have compare, branch, and
label components in the corresponding code |
|
Code generator can generate |
|
Executable code |
|
Advantage: fast |
|
Some aspects of optimization can still take
place by observing the final linear instruction stream |
|
OR |
|
Intermediate language representation that is
close to assembler but has additional information |
|
Makes it easier for optimizers to perform
further optimizations to generate faster code |
|
|
|
|
|
|
Give flexibility |
|
One intermediate language can be mapped onto
many targets (Intel, Motorola, etc.) |
|
The targets only need separate back ends |
|
Can be processed by interpreters |
|
Interpreters read the instructions and have to
map them onto machine executable instructions |
|
MOV(A,B) becomes a subroutine in an interpreter |
|
JAVA is like C++ but is compiled into a byte
code (which is an assembly language for a virtual machine) |
|
This byte code can then be interpreted on
different platforms |
|
Offers platform independence but is slow |
|
Some vendors, translate the byte code into a
target dependent machine code to make it run faster |
|
Translation to machine code can be done on the
fly (Just-In-Time compilers) |
|
|
|
|
|
|
|
Computer languages are chosen in a way that
makes them easier to compile |
|
They are described using grammars |
|
Grammars |
|
Backus Naur (also called Backus-Normal) form
often used to describe grammars for languages |
|
Set of tokens called terminal symbols |
|
For things like numbers, key words, predefined
symbols |
|
Set of definitions called non-terminal symbols |
|
For ex:
a -> b | c, reads as a
is either b or c |
|
Definitions create a system in which every legal
structure can be represented |
|
Grammars are typically recursive, so recursion
can be used to parse the grammar |
|
|
|
|
|
Terminals: end-of-input, epsilon, number, +, *,
(, ), ; |
|
end-of-input represents end of file |
|
epsilon is an empty string needed to match an
expression, it is not a token, it represents the absence of a token |
|
+, * are the legal operators in this language |
|
Non-terminal symbols: |
|
statements -> end-of-input | expression;
statements |
|
expression -> term expression’ |
|
expression’ -> + term expression’ | epsilon |
|
term -> factor term’ |
|
term’ -> * factor term’ | epsilon |
|
factor -> number | (expression) |
|
Consider a parse tree of “1+2;” |
|
statements |
|
expression ; statements |
|
term
expression’
end-of-input |
|
factor term’ +
term expression’ |
|
1 epsilon factor term’ epsilon |
|
2 epsilon |
|
|
|
|
prog -> PROGRAM prog-name VAR dec-list BEGIN
stmt-list END. |
|
prog-name -> identifier |
|
dec-list -> dec | dec-list ; dec |
|
dec -> id-list : type |
|
type -> INTEGER |
|
id-list -> identifier | id-list , identifier |
|
stmt-list -> stmt | stmt-list ; stmt |
|
stmt -> assign | read | write | for |
|
assign -> identifier := exp |
|
exp -> term | exp + term | exp - term |
|
term -> factor | term * factor | term DIV
factor |
|
factor - > identifier | integer | ( exp ) |
|
read -> READ ( id-list ) |
|
write -> WRITE ( id-list ) |
|
for -> FOR index-exp DO body |
|
index-exp -> identifier := exp TO exp |
|
body -> stmt | BEGIN stmt-list END |
|
|
|
Reserved keywords (terminals): PROGRAM, VAR,
BEGIN, END, ., ;, :, INTEGER, :=, +, -, *, DIV, (, ), READ, WRITE, FOR, DO,
TO |
|
|
|
Also assume that tokenizer recognizes
IDENTIFIERS and INTEGERS |
|
|
|
|
|
|
Compilers can recognize when templates or
objects are instantiated and destroyed |
|
It can do so because these are part of the
language definition |
|
Once the pattern is matched, it can output
intermediate level code to support these operations |
|
Template parameters can be filled in |
|
Calls are made to appropriate routines to
construct/destroy objects |
|
This processing is often called a middle pass
and can be supported by the parser in combination with the code generator
and a set of supporting routines invoked during program execution |
|
The subroutines
invoked during program execution are part of the runtime environment of the program
and are provided by the compilation/execution environment |
|
Interpreters |
|
Give flexibility but are slower |
|
Can modify the interpreted program on the fly
and see the impact immediately without a
regeneration of code |
|
Interpretation can be at the source program
level, or at an intermediate language level |
|