CS230
Compiler Design
Project Assignment #1
Lexical Analysis
The following is an extended BNF for a subset of Pascal:
<program> → program <id> ; <compound_statement> .
<compound_statement> → begin <optional_statements>? end
<optional_statements> → <statement_list>
<statement_list> → <statement> ( ; <statement> )*
<statement> → <variable> <assignop> <expression>
<variable> → <id>
<expression> → <simple_expression>
<simple_expression> → <term> ( <addop> <term> )*
<term> → <factor> ( <mulop> <factor> )*
<factor> → <id> | ( <expression> ) | <num>
<id> → <letter> ( <letter> | <digit> )*
<mulop> → * | / | mod | div
<num> → <digit> <digit>*
<digit> → [0-9]
<letter> → [a-zA-Z]
<assignop> → :=
<addop> → + | -
Problems to Solve
1. Write down all tokens (i.e., classes of tokens) and their possible values
(i.e., lexemes, or a description of them if there are infinitely many) defined in the grammar.
2. Write down a transition diagram for each token. Taken together, the diagrams
constitute a nondeterministic finite automaton for recognizing tokens in
the language.
3. Write a program to lexically analyze programs in the language described.
Keep in mind that you will be gradually expanding this grammar to include
a larger subset of Pascal. The program should echo each line read. For
each token, print a line giving the class of the token and its value. Ignore
the symbol table for now. You may, however, want to define a table
of reserved words and the token values they can have. This table should
be modular and easy to alter.
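As a rough illustration of the second and third problems, the sketch below hand-codes the transition diagrams for <id> and <num> as simple loops and keeps the reserved words in a separate table. It is a minimal sketch under stated assumptions, not the expected solution: the class name SketchLexer, the token class names (ID, NUM, SEMICOLON, and so on), and the decision to match reserved words case-sensitively are all hypothetical choices that the assignment does not fix.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hand-written scanner sketch; every name in this class is hypothetical. */
final class SketchLexer {

    /** Reserved words and their token classes; extend the language by editing this table. */
    static final Map<String, String> RESERVED = new HashMap<>();
    static {
        RESERVED.put("program", "PROGRAM");
        RESERVED.put("begin", "BEGIN");
        RESERVED.put("end", "END");
        RESERVED.put("mod", "MULOP");
        RESERVED.put("div", "MULOP");
    }

    static boolean isLetter(char c) { return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'); }

    static boolean isDigit(char c) { return c >= '0' && c <= '9'; }

    /** Token class of a single-character lexeme (assumed class names). */
    static String punctuationClass(char c) {
        switch (c) {
            case '+': case '-': return "ADDOP";
            case '*': case '/': return "MULOP";
            case ';': return "SEMICOLON";
            case '.': return "PERIOD";
            case '(': return "LPAREN";
            case ')': return "RPAREN";
            default:  return "ERROR";
        }
    }

    /** Returns one "CLASS lexeme" string per token found in the line. */
    public static List<String> scanLine(String line) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < line.length()) {
            char c = line.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            int start = i;
            if (isLetter(c)) {
                // <id> diagram: a letter, then loop on letters or digits.
                while (i < line.length() && (isLetter(line.charAt(i)) || isDigit(line.charAt(i)))) i++;
                String word = line.substring(start, i);
                tokens.add(RESERVED.getOrDefault(word, "ID") + " " + word);
            } else if (isDigit(c)) {
                // <num> diagram: a digit, then loop on digits.
                while (i < line.length() && isDigit(line.charAt(i))) i++;
                tokens.add("NUM " + line.substring(start, i));
            } else if (c == ':' && i + 1 < line.length() && line.charAt(i + 1) == '=') {
                i += 2;
                tokens.add("ASSIGNOP :=");
            } else {
                i++;
                tokens.add(punctuationClass(c) + " " + c);
            }
        }
        return tokens;
    }

    /** Echoes each input line, then prints the class and value of each token on it. */
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
            for (String token : scanLine(line)) System.out.println("  " + token);
        }
    }
}
```

Because reserved words are looked up only after a complete identifier has been scanned, adding a keyword later means adding one table entry rather than a new transition diagram. Under these assumptions, scanLine("a := b3;") yields ID a, ASSIGNOP :=, ID b3, SEMICOLON ;.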
If (as I assume you will) you use JavaCC for this assignment, part (3) ought
to be quite straightforward. All you need to do is include the specification
of the tokens in a JavaCC file (called, say, Lexer.jj). The tokens are specified in the
rules beginning with the one for <id>, as well as implicitly in the keywords and
punctuation given in the earlier rules (e.g., begin and end). You will need a
very rudimentary grammar, in particular <program> → (<token expression>)*,
where <token expression> represents the union of the various tokens defined. For example,
if you have only defined <addop> and <assignop>, then <token expression> would be
<addop> | <assignop>. To have the parser perform actions, in particular
to print out tokens, embed statements in the rule as will be explained in lectures
and labs. All of this ought to be clarified by an example which you
can click here to look at and download.
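The shape of the rule <program> → (<token expression>)* can be sketched in plain Java, with java.util.regex standing in for the generated token manager. This is not JavaCC syntax and is not the example referred to above. It is only an analogue, assuming that just <addop> and <assignop> have been defined; the class name TokenUnionSketch and the token class names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Regex analogue of <program> -> (<token expression>)*; all names are hypothetical. */
final class TokenUnionSketch {

    // <token expression> = <addop> | <assignop>, one named group per token class.
    static final Pattern TOKEN_EXPRESSION = Pattern.compile("(?<ADDOP>[+-])|(?<ASSIGNOP>:=)");

    static final String[] CLASSES = {"ADDOP", "ASSIGNOP"};

    /** Matches the union repeatedly from the current position, like the ( ... )* loop. */
    public static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN_EXPRESSION.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            if (Character.isWhitespace(input.charAt(pos))) { pos++; continue; }
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("No token at position " + pos);
            }
            for (String cls : CLASSES) {
                if (m.group(cls) != null) {
                    // Plays the role of the statement embedded in the rule.
                    out.add(cls + " " + m.group(cls));
                }
            }
            pos = m.end();
        }
        return out;
    }
}
```

Adding another token class to this sketch means adding one alternative to the pattern and one entry to CLASSES, which mirrors adding one alternative to <token expression>.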
Run your lexical analyzer on the following program and on one of
your own design.
program TheSource ;
begin
a := b3;
xyz := a + b mod c + c - p/q;
a := xyz * (p + q);
p := a - xyz - p
end.
BE SURE TO:
1. Document all functions/methods and variables (where appropriate).
2. Use mnemonic names.
3. Use spacing and indenting for readability.
4. Design abstract data types where appropriate (e.g., as described above).
5. Repeat steps 1 through 4 for the rest of the assignments.