Implement lexical analysis.

Overview

For the first stage of the compiler implementation project, you will implement the scanner (a.k.a. lexer) for your Roost compiler. The Roost language specification describes the full syntax for Roost. For this stage, you must implement all lexical tokens for the core language and all of its standard extensions. In later stages, some standard extensions to the core language will be optional.

You will implement the lexical analyzer for your Roost compiler by writing a lexical specification for JFlex (skip the installation instructions), a lexical analyzer generator that translates your specification into a Java implementation of the DFAs that recognize the lexical tokens you specify. You will also apply your lexer implementation as the first stage in your compiler pipeline and use it to scan tokens from Roost source files.

Before starting this stage, please read the Compiler Requirements and Coding Guide for the project.

Setup

Complete the Getting Started steps on the main project page. Those steps will get you set up with Git, GitLab, IntelliJ, building your compiler, and running your compiler.

Once you have set up everything there, review the code structure of the project. All of the code you write should be under the package roost, which initially contains:

  • roost.error.CompilerError: code for representing and reporting errors
  • roost.lex package:
    • roost.flex: JFlex lexer specification
    • sym: token enumeration (Java)
    • Token: Token type definition
    • running JFlex will generate Lexer.java, the lexer implementation
  • roost.parse: code for the next stage (ignore for now)
  • roost.Compiler: the main entrypoint
  • roost.Util: utility code

Lexer Implementation

The core component of this project stage, the JFlex specification, consists of a regular expression paired with a corresponding value-generation rule for each kind of token in the source language.

Lexical Tokens: Token.scala

Before implementing the lexical specification itself, you will need to define the values used to represent each individual token in the compiler after lexical analysis. The lexer will return an object of type Token for each token it matches. The roost.lex.Token class must contain at least the following information (a sketch appears after the list):

  • id, a sym identifier for the kind of token matched (see the sym enumeration below);
  • line, the (1-indexed) line number where the token occurs in the input file;
  • col, the (1-indexed) column number where the token occurs within that line;
  • value, the specific value of the token (each kind of token may carry an arbitrary Object value, e.g. the character string or the numeric value).
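
As a concrete starting point, here is a minimal sketch of what Token.scala might look like. The field names follow the list above; the use of a case class, and a toString that happens to match the --show-tokens output format described later, are illustrative choices rather than requirements:

package roost.lex

// A sketch of Token.scala: the case class and toString format are one
// reasonable design, not the required one. value is null for tokens
// that carry no value.
case class Token(id: sym, line: Int, col: Int, value: Object) {
  override def toString: String =
    if (value == null) s"$line $col $id"
    else s"$line $col $id $value"
}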

The different kinds of tokens are distinguished by elements of a Java enumeration, sym, which must be placed in a file sym.java in the roost.lex package. The enumeration sym will have the following structure:

public enum sym {
  IDENTIFIER,
  LESS_THAN,
  INTEGER,
  ...
}

Scala and Java

Scala and Java are designed to be interoperable and used together. Most of the code you write will be in Scala, but some of the auto-generated lexer and parser code will be in Java. The JFlex Java output and Scala should interoperate without problems; that is, you should be able to define Token as a Scala class and then create Token objects in your JFlex and Java code as usual. I do, however, suggest you leave sym as a Java enumeration, since that file will be replaced by an auto-generated file when we write the parser in the next stage.

Lexer Specification: roost.flex

You will construct the JFlex lexer specification in roost.flex by writing a regular expression and rule for each keyword or other atomic lexical symbol in the Roost Language Specification. (See especially the discussion of Lexical Considerations and Syntax to enumerate the set of all such symbols or tokens.) Each regular expression for a token will be paired with a rule for what value to produce for matching tokens. Consult the JFlex documentation (skip the installation instructions) to understand the syntax of the .flex specification file itself. Compiling this specification with the JFlex lexical analyzer generator will produce the Lexer.java file that contains the lexical analyzer for Roost. Your lexer should produce Token objects in most rules. Edit only roost.flex; do not edit Lexer.java.
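
To make the shape of the specification concrete, here is a hedged sketch of a small fragment of roost.flex. The %class, %function, and %type directives match the interface assumed in this document (a generated Lexer whose next method returns Token objects); the specific token names and patterns are illustrative, not the required set:

/* roost.flex (fragment): a sketch, not the full specification */
package roost.lex;
%%
%class Lexer
%function next
%type Token
%line
%column
%%
"fn"                    { return new Token(sym.FN, yyline + 1, yycolumn + 1, null); }
"<"                     { return new Token(sym.LESS_THAN, yyline + 1, yycolumn + 1, null); }
[a-zA-Z_][a-zA-Z0-9_]*  { return new Token(sym.IDENTIFIER, yyline + 1, yycolumn + 1, yytext()); }
[0-9]+                  { return new Token(sym.INTEGER, yyline + 1, yycolumn + 1, Long.valueOf(yytext())); }
[ \t\r\n]+              { /* whitespace produces no token */ }

JFlex’s yyline and yycolumn (enabled by the %line and %column directives) are 0-indexed, hence the + 1 to produce the 1-indexed positions the Token class records.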

To generate a Lexer.java file from your roost.flex specification, run the Lexer/Parser Generators task you configured in the project setup.

Nested ML-style Comments

Save this for last. Recognizing nested ML-style comments (* like (* this *) one *) correctly requires more than regular expressions. Since balanced parentheses form a classic context-free language, it might be tempting to implement this later in the Roost parser, but doing so would introduce a painful amount of pollution into the grammar. Instead, we will recognize these comments in the lexer by taking advantage of the actions associated with patterns to keep and use state for nested-comment recognition. This is “cheating” in the sense that it requires strictly more power than regular expressions offer, so our lexer will not compile to a pure DFA, but it’s a cleaner solution. We will discuss.
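
As a sketch of one common approach (the state and counter names here are illustrative, and an exclusive lexical state is assumed so that ordinary token rules do not fire inside comments):

/* declarations section (fragment) */
%xstate COMMENT
%{
  private int commentDepth = 0;  // current nesting depth
%}
%%
/* rules section (fragment) */
<YYINITIAL> "(*"  { commentDepth = 1; yybegin(COMMENT); }
<COMMENT> {
  "(*"            { commentDepth++; }
  "*)"            { commentDepth--;
                    if (commentDepth == 0) yybegin(YYINITIAL); }
  [^]             { /* consume comment text, one character at a time */ }
}

A complete implementation should also reject an unterminated comment, for example with an <<EOF>> rule in the COMMENT state that reports a lexical error.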

Compiler Integration

The roost.Compiler will be the main entrypoint of your Roost compiler, orchestrating the pipeline of compilation stages. As we progress through stages of the project, it will acquire more functionality.

Using the Lexer to Scan Source Code

For this stage, you will implement logic in the main compiler pipeline to perform scanning of all tokens in a source file using the lexer generated from your JFlex specification. You may integrate the scanning logic with the main compiler pipeline as you wish, but try to keep your code modular and well-organized.

The compiler should take a path to a .roost source file as a required command-line argument. It should also accept the optional command-line flag --show-tokens.

The compiler should open the source file, break it into tokens by successive calls to the next method of the generated Lexer, and accept the source file if it contains no lexical errors, or reject it with an informative error message otherwise.

If and only if the --show-tokens option is given, the compiler should print a list of the tokens it scans to standard output as described below. When this option is not specified, the compiler should scan silently, reporting an error message only if it encounters a lexical error. (Update: Note that the provided starter code for main already accepts the command-line option --show-tokens. All command-line option values are summarized in the Config value passed to compile.)
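
As one possible shape for this logic, here is a hedged sketch in Scala. The ScanStage object and scanTokens method are hypothetical names, the sketch assumes JFlex’s default behavior of returning null from the scanning method at end of input, and error reporting via roost.error.CompilerError is elided:

package roost

import java.io.FileReader
import roost.lex.{Lexer, Token}

// Hypothetical helper for the scanning stage of the pipeline.
object ScanStage {
  def scanTokens(path: String, showTokens: Boolean): Unit = {
    val lexer = new Lexer(new FileReader(path))
    if (showTokens) println("# Start Tokens")
    var tok: Token = lexer.next()   // next() is the generated scanning method
    while (tok != null) {           // null signals end of input
      if (showTokens) println(tok)  // Token.toString: "<line> <col> <type> <value>"
      tok = lexer.next()
    }
    if (showTokens) println("# End Tokens")
  }
}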

For each token, in order, print one line containing the line number in the source file where the token was found, the column number within that line where the token started, the type of the token (your compiler’s internal name for this kind of token), and finally the value carried by the Token, if any. Note that lines and columns are traditionally 1-indexed. These four items should be separated by exactly one space character, and the line should be terminated by a newline character. Abstractly, this line format would be:

<line-number> <column-number> <token-type> <token-characters>

Furthermore, to make this output recognizable, the full sequence of token lines should be prefixed once by the line:

# Start Tokens

and suffixed once by the line:

# End Tokens

For example, given the source code:

fn csid() -> i64 {
  301
}

The compiler should print the following output (the names in the third column are up to you):

# Start Tokens
1 1 FN
1 4 ID csid
1 8 LPAREN
1 9 RPAREN
1 11 ARROW
1 14 I64
1 18 LBRACE
2 3 NUM 301
3 1 RBRACE
# End Tokens

Compiler Requirements and Coding Guide

This and all project stages are subject to the Compiler Requirements and Coding Guide documented on the main project page, rather than copied into each individual stage specification.

Some of these criteria may seem trivial or like overkill in this first small stage, but these guidelines apply to the entire project.