CS230

Compiler Design


Project Assignment #1

Lexical Analysis


The following is an extended BNF for a (tiny!) subset of C:

<program> → void main ( ) <block>

<block> → { <optional_statements>? }

<optional_statements>  → <statement_list>

<statement_list>  → <statement> ( <statement> )*

<statement>  →<variable> <assignop> <expression>;

<variable> →<id>

<expression> → <simple_expression>

<simple_expression>  → <term> ( <addop> <term> ) *

<term> →   <factor> (<mulop> <factor>) *

<factor>  → <id> |  (<expression>) |  <num>

<id> → <letter> ( <letter>  | <digit> ) *

<mulop> → *  |  /  | %

<num>  →   <digit>   <digit>*

<digit> →   [0 - 9]

<letter> →    [a - z | A - Z]

<assignop> → =

<addop> →   + | -
 
  Note in the above that terminal symbols such as { are in boldface and underlined, while meta-linguistic symbols, e.g., "(", "*", are not. (In cases where it might cause confusion, i.e. for assignment and addition operators, the underline is omitted, but boldface is retained.)


Problems to Solve

 
  1. Write down all tokens (i.e., classes of tokens) and their possible values (i.e., lexemes, or a description of them if there are an infinite number) defined in the grammar.
  2. Write down transition diagrams for each token. Taken together, the diagrams constitute a nondeterministic finite automaton for recognizing tokens in the language.
  3. Write a program to lexically analyze programs in the language described. Keep in mind that you will be gradually expanding this grammar to include a larger subset of C. For each token, the program should print a line giving the class of token and its value. The output should look like something along these lines, for the input given below. Ignore the symbol table for now.

If (as I assume you will) you use javacc for this assignment, part (3) ought to be quite straightforward. All you need to do is include the specification of the tokens in a javacc file (called, say, Lexer.jj). The tokens are specified in the rules beginning with the one for <id>, as well as implicitly in the keywords and punctuation given in the earlier rules (e.g., "{" and "}").

You will need a very rudimentary grammar, namely <program> → (<token expression>)*, where <token expression> represents the union of the various tokens defined. For example, if you have defined only <addop> and <assignop>, then <token expression> would be <addop> | <assignop>.

To have the parser perform actions, in particular to print out tokens, embed statements in the rule, as will be explained in lectures and labs. All of this ought to be clarified by an example, which you can click here to look at and download.
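
The idea that the lexer's grammar is nothing more than a repetition of the union of the token expressions can be tried out before writing any javacc. The sketch below is plain Java, not javacc syntax, and covers only the <addop> | <assignop> case mentioned above. The class name TokenUnionSketch, the method scan, and the "CLASS: lexeme" output format are assumptions made for illustration, not part of the assignment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical plain-Java stand-in for the rule (<token expression>)*; not javacc output. */
final class TokenUnionSketch {

    // One named group per token class. The alternation is the "union" of the
    // token expressions; only <addop> and <assignop> are defined here.
    private static final Pattern TOKEN_EXPRESSION =
            Pattern.compile("(?<ADDOP>[+-])|(?<ASSIGNOP>=)");

    private static final String[] TOKEN_CLASSES = {"ADDOP", "ASSIGNOP"};

    /** Returns one "CLASS: lexeme" line per token; throws if no token class matches. */
    static List<String> scan(String input) {
        List<String> lines = new ArrayList<>();
        Matcher matcher = TOKEN_EXPRESSION.matcher(input);
        int position = 0;
        while (position < input.length()) {
            if (Character.isWhitespace(input.charAt(position))) {
                position++; // white space separates tokens and is otherwise ignored
                continue;
            }
            matcher.region(position, input.length());
            if (!matcher.lookingAt()) {
                throw new IllegalArgumentException(
                        "Unexpected character '" + input.charAt(position)
                                + "' at offset " + position);
            }
            // Report whichever alternative of the union matched.
            for (String tokenClass : TOKEN_CLASSES) {
                String lexeme = matcher.group(tokenClass);
                if (lexeme != null) {
                    lines.add(tokenClass + ": " + lexeme);
                    break;
                }
            }
            position = matcher.end();
        }
        return lines;
    }

    public static void main(String[] args) {
        scan("+ = -").forEach(System.out::println);
    }
}
```

Adding a token class in this sketch means adding one more alternative to the pattern and one more name to TOKEN_CLASSES, which parallels adding one more alternative to <token expression> in the javacc grammar.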

 Run your lexical analyzer on the following program and one of your own design.

void main()
 {
    a = b3;
    xyz = a + b % c + c - p/q;
    a = xyz * (p + q);
    p = a - xyz - p;
 }
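
Lexemes such as b3 and xyz in the program above fall under the <id> rule. As a way of checking the transition diagrams from problem 2 by hand, the diagrams for <id> and <num> can also be written as a small state machine. The following is a minimal sketch in Java under stated assumptions: the names TransitionDiagramSketch, State, step, and classify are hypothetical, and because <id> and <num> begin with disjoint character sets, the two diagrams are merged here into a single deterministic machine.

```java
/** Hypothetical hand-coded version of the transition diagrams for <id> and <num>. */
final class TransitionDiagramSketch {

    enum State { START, IN_ID, IN_NUM, REJECT }

    /** One edge of the combined diagram: current state plus next character gives the next state. */
    static State step(State state, char c) {
        boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
        boolean digit = c >= '0' && c <= '9';
        switch (state) {
            case START:  return letter ? State.IN_ID : digit ? State.IN_NUM : State.REJECT;
            case IN_ID:  return (letter || digit) ? State.IN_ID : State.REJECT;
            case IN_NUM: return digit ? State.IN_NUM : State.REJECT;
            default:     return State.REJECT; // REJECT has no outgoing edges
        }
    }

    /** Classifies a whole lexeme as "id", "num", or "neither" by walking the diagram. */
    static String classify(String lexeme) {
        State state = State.START;
        for (int i = 0; i < lexeme.length(); i++) {
            state = step(state, lexeme.charAt(i));
        }
        if (state == State.IN_ID) return "id";   // accepting state for <id>
        if (state == State.IN_NUM) return "num"; // accepting state for <num>
        return "neither";
    }

    public static void main(String[] args) {
        for (String lexeme : new String[] {"b3", "xyz", "42", "3b"}) {
            System.out.println(lexeme + " -> " + classify(lexeme));
        }
    }
}
```

Here b3 and xyz end in the accepting state for <id>, 42 ends in the accepting state for <num>, and 3b is rejected because no edge leaves the <num> state on a letter.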

BE SURE TO:
 

  1. Document all functions/methods and variables (where appropriate).
  2. Use mnemonic names.
  3. Use spacing and indenting for readability.
  4. Design abstract data types where appropriate (e.g., as described above).
  5. Repeat steps 1 through 4 for the rest of the assignments.

Please submit your file(s), in a .zip, at this location.


