Lexical Analysis
- Assign: Tuesday, 12 February
- Checkpoint: Complete what you can by the end of Friday, 15 February
- Due: Tuesday, 19 February
Overview
For the first stage of the compiler implementation project, you will implement the scanner (a.k.a. lexer) for your Roost compiler. The Roost language specification describes the full syntax for Roost. For this stage, you must implement all lexical tokens for the core language and all of its standard extensions. In later stages, some standard extensions to the core language will be optional.
You will implement the lexical analyzer for your Roost compiler by writing a lexical specification for JFlex (skip the installation instructions), a lexical analyzer generator tool that translates your specification into a Java implementation of DFAs that recognize the lexical tokens you specify. You will also apply your lexer implementation as the first stage in your compiler pipeline and use it to scan tokens from Roost source files.
Before starting this stage, please read the Compiler Requirements and Coding Guide for the project.
Setup
Complete the Getting Started steps on the main project page. Those steps will get you set up with Git, GitLab, and IntelliJ, and walk you through building and running your compiler.
Once you have set up everything there, review the code structure of the project. All of the code you write should be under the package roost, which initially contains:
- roost.error.CompilerError: code for representing and reporting errors
- roost.lex package:
  - roost.flex: JFlex lexer specification
  - sym: token enumeration (Java)
  - Token: Token type definition
  - running JFlex will generate Lexer.java, the lexer implementation
- roost.parse: code for the next stage (ignore for now)
- roost.Compiler: the main entrypoint
- roost.Util: utility code
Lexer Implementation
The core component of this project stage, the JFlex specification, consists of a regular expression paired with a corresponding value-generation rule for each kind of token in the source language.
Lexical Tokens: Token.scala
Before implementing the lexical specification itself, you will need to define the values used to represent each individual token in the compiler after lexical analysis. The lexer will return a Token object for each token. The roost.lex.Token class must contain at least the following information (a small sketch follows the list):
- id, a sym identifier for the kind of token matched (see the sym enumeration below);
- line, the (1-indexed) line number where the token occurs in the input file;
- col, the (1-indexed) column number where the token occurs within that line;
- value, the specific value of the token (each kind of token may have an arbitrary kind of Object value, e.g. the character string or the numeric value).
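For concreteness, here is a minimal sketch of such a class in Scala. The constructor shape, the nullable Object-typed value, and the toString helper are illustrative assumptions, not requirements; organize the class however you like as long as it carries the fields above.

package roost.lex

// Sketch only: one object per scanned token.
// value is typed loosely (java.lang.Object) because different token kinds
// carry different payloads (identifier text, numeric value, or nothing at all).
class Token(val id: sym, val line: Int, val col: Int, val value: Object) {
  // Render in the "line col kind value" shape used by the --show-tokens output.
  override def toString: String =
    if (value == null) s"$line $col $id" else s"$line $col $id $value"
}

Defining toString this way makes the --show-tokens output (described under Compiler Integration below) trivial to produce, but any equivalent organization works.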
The different kinds of tokens are distinguished by elements of a Java enumeration, sym, which must be placed in a file sym.java in the roost.lex package. The enumeration sym will have the following structure:
public enum sym {
IDENTIFIER,
LESS_THAN,
INTEGER,
...
}
Scala and Java
Scala and Java are designed to be interoperable and used together. Most of the code you write will be in Scala, but some of the auto-generated lexer and parser code will be in Java. The JFlex Java output and Scala should interoperate without problem. That is, you should be able to define Token as a Scala class and then create Token objects in your JFlex and Java code as usual. I do, however, suggest you leave sym as a Java enumeration, since that file will be replaced by an auto-generated file when we write the parser in the next stage.
Lexer Specification: roost.flex
You will construct the JFlex lexer specification in roost.flex by writing a regular expression and rule for each keyword or other atomic lexical symbol in the Roost Language Specification. (See especially the discussion of Lexical Considerations and Syntax to enumerate the set of all such symbols or tokens.) Each regular expression for a token will be paired with a rule for what value to produce for matching tokens. Consult the JFlex documentation (skip the installation instructions) to understand the syntax of the .flex specification file itself. Compiling this specification with the JFlex lexical analyzer generator will produce the Lexer.java file that contains the lexical analyzer for Roost. Your lexer should produce Token objects in most rules. Edit only roost.flex; do not edit Lexer.java.
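To give a flavor of the rule syntax, here is a small illustrative fragment. The directives and action syntax are standard JFlex, but the particular options, token patterns, and Token constructor shape are assumptions based on the description above; the provided roost.flex starter file may already declare some of this boilerplate.

%class Lexer
%public
%unicode
%line
%column
%type Token
%function next

%%

"fn"                    { return new Token(sym.FN, yyline + 1, yycolumn + 1, null); }
"<"                     { return new Token(sym.LESS_THAN, yyline + 1, yycolumn + 1, null); }
[a-zA-Z_][a-zA-Z0-9_]*  { return new Token(sym.IDENTIFIER, yyline + 1, yycolumn + 1, yytext()); }
[0-9]+                  { return new Token(sym.INTEGER, yyline + 1, yycolumn + 1, Long.parseLong(yytext())); }
[ \t\r\n]+              { /* whitespace: consume silently, produce no token */ }

Note that JFlex's yyline and yycolumn are 0-indexed, hence the + 1 to produce the 1-indexed positions required for Token.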
Nested ML-style Comments
Save this for last. Recognizing nested ML-style comments (* like (* this *) one *) correctly requires more than regular expressions. Since the language of balanced parentheses is a classic context-free language, it might be tempting to implement this later in the Roost parser, but doing so would introduce a painful amount of pollution into the grammar. Instead, we will recognize these comments in the lexer by taking advantage of the actions associated with patterns to keep and use state for nested comment recognition. This is “cheating” in the sense that it requires strictly more power than regular expressions offer, so our lexer will not compile to a pure DFA, but it’s a cleaner solution. We will discuss.
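As a sketch of that approach (the state name, counter, and error handling are up to you), the specification can declare an exclusive lexical state and track nesting depth in ordinary Java action code:

%{
  // Nesting depth of the comment currently being skipped.
  private int commentDepth = 0;
%}

%xstate COMMENT

%%

<YYINITIAL> "(*"   { commentDepth = 1; yybegin(COMMENT); }

<COMMENT> {
  "(*"             { commentDepth += 1; }
  "*)"             { commentDepth -= 1; if (commentDepth == 0) yybegin(YYINITIAL); }
  [^]              { /* any other character inside a comment is ignored */ }
}

A complete solution also has to decide what happens if the input ends while commentDepth is still positive: an unterminated comment should be reported as a lexical error rather than silently accepted.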
Compiler Integration
The roost.Compiler will be the main entrypoint of your Roost compiler, orchestrating the pipeline of compilation stages. As we progress through stages of the project, it will acquire more functionality.
Using the Lexer to Scan Source Code
For this stage, you will implement logic in the main compiler pipeline to perform scanning of all tokens in a source file using the lexer generated from your JFlex specification. You may integrate the scanning logic with the main compiler pipeline as you wish, but try to keep your code modular and well-organized.
The compiler should take a path to a .roost source file as a required command-line argument. It should also accept the optional command-line flag --show-tokens.
The compiler should open the source file, break it into tokens by successive calls to the next method of the generated Lexer, and accept the source file if there are no lexical errors in the file or reject it with an informative error message otherwise.
If and only if the --show-tokens option is given, the compiler should print a list of the tokens it scans to standard output, as described below. When this option is not specified, the compiler should scan silently, reporting an error message only if it encounters a lexical error. (Update: Note that the provided starter code for main already accepts the command-line option --show-tokens. All command-line option values are summarized in the Config value passed to compile.)
For each token, in order, print one line containing the line number in the source file where the token was found, the column number within that line where the token started, the type of token (your compiler’s internal name for this kind of token), and finally the value carried by the Token, if any. Note that lines and columns are traditionally 1-indexed. Each of these four items should be separated by exactly one space character, and the line should be terminated by a newline character. Abstractly, this line format would be:
<line-number> <column-number> <token-type> <token-characters>
Furthermore, to make this output recognizable, the full sequence of token lines should be prefixed once by the line:
# Start Tokens
and suffixed once by the line:
# End Tokens
For example, given the source code:
fn csid() -> i64 {
301
}
The compiler should print the following output (the names in the third column are up to you):
# Start Tokens
1 1 FN
1 4 ID csid
1 8 LPAREN
1 9 RPAREN
1 11 ARROW
1 14 I64
1 18 LBRACE
2 3 NUM 301
3 1 RBRACE
# End Tokens
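How you structure this in roost.Compiler is up to you; the following Scala sketch shows one possible shape of the scanning loop. The object name, the assumption that the generated next() returns null at end of input, and the reliance on Token.toString for the output format are all illustrative, and error reporting (e.g. via roost.error.CompilerError) is omitted.

import java.io.FileReader
import roost.lex.{Lexer, Token}

object ScanSketch {
  // Scan every token in the file at path, printing the token list when showTokens is set.
  def scan(path: String, showTokens: Boolean): Unit = {
    val lexer = new Lexer(new FileReader(path))
    if (showTokens) println("# Start Tokens")
    var tok: Token = lexer.next()
    while (tok != null) {          // assumes the spec keeps the default null end-of-file value
      if (showTokens) println(tok) // Token.toString emits "line col kind value"
      tok = lexer.next()
    }
    if (showTokens) println("# End Tokens")
  }
}

The real pipeline should also catch lexical errors reported by the lexer and reject the file with an informative message, as described above.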
Compiler Requirements and Coding Guide
This and all project stages are subject to the Compiler Requirements and Coding Guide documented on the main project page, rather than copied into each individual stage specification.
That page includes:
- General requirements for compiler output
- Programming support (including provided code) that will be useful in this (and other) stages
- Testing support
- Documentation and style guidelines
- Submission and evaluation information
- and more!
Some of these criteria may seem trivial or like overkill for this first small stage, but these guidelines apply to the entire project.