CS230

Compiler Design


Project Assignment #1

Lexical Analysis


The following is an extended BNF for a subset of Pascal:

<program> → program <id> ; <compound_statement> .

<compound_statement> → begin <optional_statements>? end

<optional_statements> → <statement_list>

<statement_list> → <statement> ( ; <statement> )*

<statement> → <variable> <assignop> <expression>

<variable> → <id>

<expression> → <simple_expression>

<simple_expression> → <term> ( <addop> <term> )*

<term> → <factor> ( <mulop> <factor> )*

<factor> → <id> | ( <expression> ) | <num>

<id> → <letter> ( <letter> | <digit> )*

<mulop> → * | / | mod | div

<num> → <digit> <digit>*

<digit> → [0-9]

<letter> → [a-zA-Z]

<assignop> → :=

<addop> → + | -
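
To see how the last few rules read as character-level checks, here is a minimal Java sketch of <id> and <num>. The class and method names (LexicalRuleSketch, isId, isNum) are illustrative assumptions, not part of the assignment. It tests the ASCII ranges given in <letter> and <digit> directly rather than calling Character.isLetter, which would also accept non-ASCII letters.

```java
// Hypothetical illustration: none of these names come from the assignment.
final class LexicalRuleSketch {
    // <letter> -> [a-zA-Z]
    static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // <digit> -> [0-9]
    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // <id> -> <letter> ( <letter> | <digit> )*
    static boolean isId(String s) {
        if (s.isEmpty() || !isLetter(s.charAt(0))) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!isLetter(c) && !isDigit(c)) {
                return false;
            }
        }
        return true;
    }

    // <num> -> <digit> <digit>*
    static boolean isNum(String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            if (!isDigit(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("b3 is an id: " + isId("b3"));
        System.out.println("3b is an id: " + isId("3b"));
        System.out.println("42 is a num: " + isNum("42"));
    }
}
```

Keywords such as mod and begin also satisfy isId, which is why a lexer needs a separate way to tell reserved words from identifiers.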
 
 


Problems to Solve

 
  1. Write down all tokens (i.e., classes of tokens) defined in the grammar and their possible values (i.e., lexemes, or a description of them if there are infinitely many).
  2. Draw a transition diagram for each token. Taken together, the diagrams constitute a nondeterministic finite automaton for recognizing the tokens of the language.
  3. Write a program that performs lexical analysis of programs in the language described. Keep in mind that you will gradually expand this grammar to cover a larger subset of Pascal. The program should echo each line it reads and, for each token, print a line giving the token's class and its value. Ignore the symbol table for now. You may, however, want to define a table of reserved words and the token values they can have; keep this table modular so that it is easy to alter.

If (as I assume you will) you use javacc for this assignment, part (3) ought to be quite straightforward. All you need to do is put the token specifications in a javacc file (called, say, Lexer.jj). The tokens are specified in the rules beginning with the one for <id>, as well as implicitly in the keywords and punctuation that appear in the earlier rules (e.g., begin and end). You will also need a very rudimentary grammar, namely <program> → (<token expression>)*, where <token expression> is the union of the various tokens defined. For example, if you have defined only <addop> and <assignop>, then <token expression> would be <addop> | <assignop>. To have the parser perform actions, in particular to print out tokens, embed statements in the rule, as will be explained in lectures and labs. All of this should be clarified by the accompanying example, which is available to view and download.
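
The idea behind <program> → (<token expression>)* can also be tried outside javacc. The plain Java sketch below is not javacc syntax and is not the example from lectures; it only mimics the two-token case from the paragraph above, with a regular expression alternation standing in for <token expression>. The class name TokenUnionSketch, the method scan, and the "CLASS lexeme" output format are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration: class, method, and group names are assumptions.
final class TokenUnionSketch {
    // <token expression> = <addop> | <assignop>, one named group per token class.
    private static final Pattern TOKEN =
        Pattern.compile("(?<ADDOP>[+-])|(?<ASSIGNOP>:=)");

    // Mirrors <program> -> (<token expression>)*: keep matching one token
    // at the current position until the input is exhausted.
    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            if (Character.isWhitespace(input.charAt(pos))) {
                pos++; // whitespace separates tokens and is otherwise skipped
                continue;
            }
            matcher.region(pos, input.length());
            if (!matcher.lookingAt()) {
                throw new IllegalArgumentException(
                    "unexpected character at offset " + pos);
            }
            // The "action" attached to the rule: record class and value.
            String tokenClass =
                matcher.group("ADDOP") != null ? "ADDOP" : "ASSIGNOP";
            tokens.add(tokenClass + " " + matcher.group());
            pos = matcher.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String line : scan("+ := -")) {
            System.out.println(line);
        }
    }
}
```

Run on the input "+ := -", the sketch reports ADDOP +, ASSIGNOP :=, and ADDOP -, one per line. Adding a token class means adding one alternative to the pattern, which is the same kind of change you would make to the token list in a javacc file.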

Run your lexical analyzer on the following program and on one of your own design.

program TheSource ;
 begin
    a := b3;
    xyz := a + b mod c + c - p/q;
    a := xyz * (p + q);
    p := a - xyz - p
 end.
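
The sample program uses the keywords program, begin, end, and mod, all of which also fit the <id> pattern. One way to keep the reserved-word table from problem 3 in a single, easily altered place is a map from lexeme to token class, consulted after a lexeme has matched <id>. The following is only a sketch: the class name ReservedWordSketch, the token class names (PROGRAM, BEGIN, END, MULOP, ID), and the case-sensitive lookup are all assumptions, since the assignment does not fix any of them.

```java
import java.util.Map;

// Hypothetical illustration: the names and token classes below are assumptions.
final class ReservedWordSketch {
    // The whole reserved-word table lives here; extending the language to a
    // larger Pascal subset means adding entries, not changing the scanner.
    private static final Map<String, String> RESERVED = Map.of(
        "program", "PROGRAM",
        "begin", "BEGIN",
        "end", "END",
        "mod", "MULOP",
        "div", "MULOP");

    // Given a lexeme that already matched the <id> pattern, decide whether it
    // is a reserved word or an ordinary identifier (exact, case-sensitive match).
    static String classify(String lexeme) {
        return RESERVED.getOrDefault(lexeme, "ID");
    }

    public static void main(String[] args) {
        for (String lexeme : new String[] {"program", "TheSource", "xyz", "mod"}) {
            System.out.println(classify(lexeme) + " " + lexeme);
        }
    }
}
```

Because mod and div share the token class of * and /, a parser added later can treat all four as <mulop> without knowing which ones were spelled as words.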

BE SURE TO:
 

  1. Document all functions/methods and variables (where appropriate).
  2. Use mnemonic names.
  3. Use spacing and indenting for readability.
  4. Design abstract data types where appropriate (e.g., as described above).
  5. Repeat steps 1 through 4 for the rest of the assignments.


