
Acknowledgements

We wish to thank Eduardo Martínez Gracia, professor at the University of Murcia, for his help and his guidance in this project.

Introduction

This project is the practical part of the subject “Compiladores” (Compilers). The subject is usually taken in the second year of Computer Engineering; nonetheless, it is a third-year subject for students who enroll in the simultaneous studies program of the University of Murcia. Both of us are currently studying Mathematics and Computer Engineering.

Our initial task was writing a small compiler for a subset of the Pascal language. The essential features of the language were:

  • Differentiation of constants and variables

  • Flow control through the if, for and while statements

  • Code reuse through function definitions

But the language was greatly simplified in comparison to Pascal:

  • Restricted to a single type: integer values

  • No relational operators

We decided we wanted to focus on this project and implement a language with more features:

  • Support for multiple types

  • Type checking

  • Advanced name resolution

    • Nested name scopes

    • Chronological name scopes which enforce the declaration of a name before its use

    • Achronological name scopes which don’t

  • Operator and function overloading

  • LLVM output

We have spent most of the time learning and implementing a good system. Our focus has always been extensibility; we were always asking ourselves: “If we were to implement this feature from language X, would it be easy with the code as it is right now?”. This document serves the purpose of describing our work and our view on the code structure of a compiler.

Compiler Features Overview

Rationale of the Features Supported

As this is a project meant for learning, there is no deep reasoning behind supporting some features and leaving out others. Our major constraint has been time, which we have mostly devoted to the name system.

Language Grammar in BNF format

program:
    program id ( ) ; functions declarations compound_statement .

functions:
    functions function ;
  | λ

function:
    function function_id ( optional_args ) : type declarations compound_statement

function_id:
    id
  | + "operator"
  | - "operator"
  | * "operator"
  | / "operator"

optional_args:
    args
  | λ

args:
    single_arg
  | args , single_arg

single_arg:
    id : type
  | const id : type

type:
    int
  | str

declarations:
    declarations var identifiers : type ;
  | declarations const constants ;
  | λ

identifiers:
    id
  | identifiers , id

constants:
    id := expression
  | constants , id := expression

compound_statement:
    begin optional_statements end

optional_statements:
    statements
  | λ

statements:
    statement
  | statements ; statement

statement:
    id := expression
  | if expression then statement
  | if expression then statement else statement
  | while expression do statement
  | for id := expression to expression do statement
  | write ( expressions )
  | read ( identifiers )
  | compound_statement

optional_expressions:
    expressions
  | λ

expressions:
    expression
  | expressions , expression

expression:
    expression + expression
  | expression - expression
  | expression * expression
  | expression / expression
  | - expression
  | ( expression )
  | id
  | int_lit
  | str_lit
  | id ( optional_expressions )

Where id, int_lit and str_lit are identifiers, int literals and str literals as recognized by our lexical analyzer.

Semantics

Assignments

An assignment operation stores the value of an expression in a variable. The type of the expression must match the type of the variable.

Operators Available

Some operators are already defined for our built-in types int and str. They are:

  • Given two expressions a and b of type int:

    • Unary minus operator: -a returns the additive inverse of a.

    • Binary plus operator: a + b returns a plus b.

    • Binary minus operator: a - b returns a minus b.

    • Binary asterisk operator: a * b returns a times b.

    • Binary slash operator: a / b returns the signed integer quotient of a and b, rounded towards zero.

  • Given two expressions a and b of type str:

    • a + b returns the concatenation of a and b.

Function and Operator Overloading

You can use the same function name for more than one function definition, provided that the definitions differ in the arity or the types of their parameters.

Likewise, you can overload the available operators of our language (+, -, *, /). This is possible because operators are just functions that allow a different syntax. Like any other function, operators are not restricted in their return type, but they must specify a parameter list with exactly one element for unary operators and two elements for binary operators.

Example of operator overloading (example_program5.mp)
function * operator (const lhs : str, const rhs : int) : str

Name scopes

Identifiers are declared inside name scopes. You can define the same name more than once as long as you do it in different scopes.

The name scope in which functions are defined is achronological and global. This means that every reference to a function at any point of the code is valid provided that at some other point (even afterwards) the referenced function is well defined.

A function declaration introduces a new name scope as a child of the global name scope. The function arguments and declarations belong to the child scope. In contrast to the global name scope, function name scopes are chronological. This means that a name must be defined before it is used.

Example Translation

Input File
program FibonacciDrawing ();

function fib_drawing (const upto : int) : int
var i, last, new : int;
begin
    last := 0;
	for i := 0 to upto do
	begin
		new := fibonacci(i);
        write("_" * (new-last));
		write("F");
		last := new
	end;
	fib_drawing := 0
end;

function fibonacci (n : int) : int
var a,b, target_a, target_b, tmp1, tmp2 : int;
begin
	target_a := 0;
	target_b := 0;
	if (n) then
	begin
		n := n - 1;
		a := 0;
		b := 1;
		target_a := 1;
		while (n) do
		begin
			if (n-n/2*2) then
			begin
				tmp1 := target_a*a + target_b*b;
				tmp2 := target_a*b + target_b*(a+b);
				target_a := tmp1;
				target_b := tmp2
			end;
			tmp1 := a*a + b*b;
			tmp2 := a*b + b*(a+b);
			a := tmp1;
			b := tmp2;
			n := n / 2
		end
	end;
	fibonacci := target_a+target_b
end;

function * operator (const lhs : str, const rhs : int) : str
var i : int;
begin
	op := "";
	for i := 0 to rhs do
	begin
		op := op + lhs
	end
end;

function max(const lhs : int, const rhs : int) : int
var dif : int;
begin
	max := lhs;
	dif := rhs-lhs;
	while (dif) do
	begin
		if (dif-1) then
		begin
		end
		else
			max := rhs;
		if (dif+1) then
		begin
		end
		else
			max := lhs;
		dif := dif/2
	end
end;



var upTo : int;

begin
	read(upTo);
	upTo := -fib_drawing(upTo)
end.
Output File
; ModuleID = '<stdin>'
source_filename = "FibonacciDrawing"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"

@.io.int = private unnamed_addr constant [3 x i8] c"%d\00", align 1
@.io.str = private unnamed_addr constant [3 x i8] c"%s\00", align 1
@.strlit.2 = private unnamed_addr constant [2 x i8] c"_\00", align 1
@.strlit.3 = private unnamed_addr constant [2 x i8] c"F\00", align 1
@.strlit.4 = private unnamed_addr constant [1 x i8] c"\00", align 1


; Function
define i32 @fib_drawing(i32) {
	%upto = alloca i32, align 4
	%i = alloca i32, align 4
	%last = alloca i32, align 4
	%new = alloca i32, align 4
	%fib_drawing = alloca i32, align 4
	store i32 %0, i32* %upto, align 4
	store i32 0, i32* %last, align 4
	store i32 0, i32* %i, align 4
	br label %forcomp2
forcomp2:
	%2 = load i32, i32* %i, align 4
	%3 = load i32, i32* %upto, align 4
	%4 = icmp ne i32 %2, %3
	br i1 %4, label %forloop2, label %afterfor2
forloop2:
	%5 = load i32, i32* %i, align 4
	%6 = call i32 @fibonacci(i32 %5)
	store i32 %6, i32* %new, align 4
	%7 = load i32, i32* %new, align 4
	%8 = load i32, i32* %last, align 4
	%9 = sub nsw i32 %7, %8
	%10 = call i8* @.operatorASTERISK(i8* getelementptr inbounds ([2 x i8], [2 x i8]* @.strlit.2, i32 0, i32 0), i32 %9)
	%11 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([3 x i8], [3 x i8]* @.io.str, i32 0, i32 0), i8* %10)
	%12 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([3 x i8], [3 x i8]* @.io.str, i32 0, i32 0), i8* getelementptr inbounds ([2 x i8], [2 x i8]* @.strlit.3, i32 0, i32 0))
	%13 = load i32, i32* %new, align 4
	store i32 %13, i32* %last, align 4
	%14 = add nsw i32 %2, 1
	store i32 %14, i32* %i, align 4
	br label %forcomp2
afterfor2:
	store i32 0, i32* %fib_drawing, align 4
	%15 = load i32, i32* %fib_drawing, align 4
	ret i32 %15
}

; Function
define i32 @fibonacci(i32) {
	[...]
}

; Function
define i8* @.operatorASTERISK(i8*, i32) {
	%lhs = alloca i8*, align 8
	%rhs = alloca i32, align 4
	%i = alloca i32, align 4
	%op = alloca i8*, align 8
	store i8* %0, i8** %lhs, align 8
	store i32 %1, i32* %rhs, align 4
	store i8* getelementptr inbounds ([1 x i8], [1 x i8]* @.strlit.4, i32 0, i32 0), i8** %op, align 8
	store i32 0, i32* %i, align 4
	br label %forcomp3
forcomp3:
	%3 = load i32, i32* %i, align 4
	%4 = load i32, i32* %rhs, align 4
	%5 = icmp ne i32 %3, %4
	br i1 %5, label %forloop3, label %afterfor3
forloop3:
	%6 = load i8*, i8** %op, align 8
	%7 = load i8*, i8** %lhs, align 8
	%8 = call i64 @strlen(i8* %6)
	%9 = call i64 @strlen(i8* %7)
	%10 = add nsw i64 %8, %9
	%11 = add nsw i64 %10, 1
	%12 = alloca i8, i64 %11
	%13 = call i8* @strcpy(i8* %12, i8*%6)
	%14 = getelementptr i8, i8* %12, i64 %8
	%15 = call i8* @strcpy(i8* %14, i8*%7)
	store i8* %12, i8** %op, align 8
	%16 = add nsw i32 %3, 1
	store i32 %16, i32* %i, align 4
	br label %forcomp3
afterfor3:
	%17 = load i8*, i8** %op, align 8
	%18 = call i64 @strlen(i8* %17)
	%19 = add nsw i64 %18, 1
	%20 = call noalias i8* @malloc(i64 %19)
	%21 = call i8* @strcpy(i8* %20, i8* %17)
	ret i8* %21
}

; Function
define i32 @max(i32, i32) {
	[...]
}

; Function
define i32 @main() {
	%upTo = alloca i32, align 4
	%.main = alloca i32, align 4
	store i32 0, i32* %.main, align 4
	%1 = call i32 (i8*, ...) @__isoc99_scanf(i8* getelementptr inbounds ([3 x i8], [3 x i8]* @.io.int, i32 0, i32 0), i32* %upTo)
	%2 = load i32, i32* %upTo, align 4
	%3 = call i32 @fib_drawing(i32 %2)
	%4 = sub nsw i32 0, %3
	store i32 %4, i32* %upTo, align 4
	%5 = load i32, i32* %.main, align 4
	ret i32 %5
}

declare i32 @__isoc99_scanf(i8*, ...)
declare i32 @printf(i8*, ...)
declare noalias i8* @malloc(i64)
declare i64 @strlen(i8*)
declare i8* @strcpy(i8*, i8*)

Language and Tools

We were taught in class how to use Bison, an open-source parser generator usually used alongside Flex, an open-source lexical analyzer generator. Although originally written for the C language, both Bison and Flex allow you to work with C++. We chose C++ because it is a very powerful language and, this being a new project, we had no reason to stick to C.

In addition, we used some features of the C++17 standard, such as std::variant, which we describe later in this document.

Compiler Parts

Imperative languages share some similarities. Although they may differ in implementation, they usually share the core concepts.

Statement

Statements are the basic units of a program. In a typical language, assignment, flow control directives (loops, conditional statements and branching statements) and function invocations are all statements with their own syntax.

Expression

Typically, statements take parameters, and these parameters are usually expressions. For example, you can assign to a variable a literal value or the sum of two variables; both would be valid expressions (as the assigned value).

Function

A function is a set of statements that can be invoked in other parts of the program.

Variable

A variable is an abstract entity that holds a value which can be used (as an expression) in statements and can also be modified through some statements (like assignment).

Abstraction

We have found that the ability to abstract concepts is key in the design of a compiler. We can merge many concepts, leading to an easier understanding and simpler logic.

Merging Concepts

We can define a constant as a name that holds a value that cannot be changed. When translating to machine code, constants can be allocated in read-only segments or globally instead of on the stack. However, for programming purposes, a constant is a variable which cannot be modified. Whenever we see that a concept can be expressed in terms of another concept in the compiler, we will find languages which merge both concepts.

For instance, in C++, variables declared with the const attribute may not be modified, but they aren’t exactly constants. The reason is that a function can take a constant reference to a variable as a parameter: inside the function, the memory associated with that variable can be read through the parameter name, and the compiler ensures that it is not modified through that reference. Yet if the variable was not declared const in the calling function, the same memory can still be modified there.
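
A minimal C++ sketch of this situation (with hypothetical names):

#include <iostream>

// print() takes a constant reference: it may read n but not modify it.
void print(const int& n) {
    std::cout << n << '\n';
    // n = 0;  // error: assignment of read-only reference
}

int main() {
    int x = 1;   // x is not declared const in the caller
    print(x);    // inside print(), the same memory is read-only
    x = 2;       // ...yet it can still be modified here
    print(x);
}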

As another example, expressions can be thought of as statements, since in some languages they can modify the state of the computer, just as statements do. In general, a statement could be thought of as an expression which doesn’t return a result, or expressions could be statements that do return a result; in C, for instance, an assignment such as a = b = 5 is itself an expression with a value. In the Lisp family of languages, every statement returns a value that can be used in another statement, so there is no distinction between the two. We can also shorten the distance between expressions and statements by making the return type of some statements a special type whose values the programmer cannot handle.

Definition and Usage

Programming languages serve the purpose of creating programs that process data and do calculations. Some languages maintain a structure very similar to assembly. However, all of them introduce modular entities that the programmer can customize and use: variables can be declared, types can be created by grouping smaller types, and functions can be created by grouping statements.

It is common for an identifier (a name) to be used to refer to these entities. When this is the case, we usually need to conceptually separate the definition, i.e. the programmer specifies that there is a function with name foo that consists of these statements, from the usage, i.e. the programmer calls a function defined at some point in the code. C and C++ even differentiate between declarations and implementations, where the declaration only specifies how an object can be used (which parameters a function takes).
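
For instance, in C++ (with a hypothetical function foo):

int foo(int x);    // declaration: specifies only how foo can be used

int foo(int x) {   // definition: provides the actual implementation
    return x + 1;
}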

The difference must translate to the abstract syntax tree too: we must have different nodes for a function definition and a function call. And again, this can be generalized further. C++ considers the construction name(args) an operator and allows overloading it. Therefore, in C++ you can call a function, but you can also call a variable whose type has the operator overloaded. This is a usage abstraction and gives rise to the concept of a callable.
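
A minimal sketch of a callable in C++ (with hypothetical names):

#include <iostream>

struct Doubler {
    // Overloading operator() makes values of this type callable.
    int operator()(int n) const { return 2 * n; }
};

int twice(int n) { return 2 * n; }

int main() {
    Doubler d;
    // Both uses share the syntax name(args), so both are callables.
    std::cout << twice(21) << ' ' << d(21) << '\n';  // prints "42 42"
}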

Implementation Reasoning

Designing a language and designing the compiler are completely different tasks. Designing a language involves choosing its features (knowing in advance that they can be achieved) and how they interact. Designing the compiler is designing an application… using a programming language.

We believe a natural separation of a compiler is

  • The structure known as the abstract syntax tree (AST)

  • The algorithms that operate on that structure

However, this separation is rather obvious and provides little help to beginners. We believe the following is a better classification:

  • The lexer, which divides the input into tokens.

  • The parser, which builds the initial AST from the tokens.

  • The name resolution algorithms, which bind each identifier to a definition.

  • The type system structures and algorithms, in charge of type equivalence, conversion and other advanced features, such as inheritance.

  • The semantic correction algorithms, which check things such as that the expressions and the variable in a typical for statement have the same type.

  • The optimization algorithms, which modify the AST.

  • The translation algorithm, which produces the final result.

This could be a good modularization of a compiler project. Nevertheless, there are also dependencies between the systems. For example, a name resolution algorithm first applies to identify the possible functions that can be associated with a function call. After that, there must be a criterion for choosing which one applies. However, that algorithm needs to know which types are compatible. Hence, it can be difficult to separate the name resolution algorithms from the type system.

Variants or Inheritance. The Visitor Pattern.

As we have already seen, a lot of algorithms in the compiler are related to the AST. When programming a smaller compiler such as ours, without a rich type system and without optimization phases, it might sound reasonable to implement the AST using inheritance.

ASTs for languages like Java contain ∼50 node types, and compilers like the GNU Compiler Collection (GCC) have ∼200 phases.[1]
Crafting a Compiler

As programmers of a small compiler, we cannot recommend this. Even in a small compiler you would need to implement 3 to 5 virtual functions for each node of the AST. This results in code with the same purpose being dispersed across multiple files.

In addition, declaring interfaces for the different types of nodes, such as expressions, does not scale properly. As the complexity increases, a node can start implementing many interfaces.

Our implementation uses C++17’s std::variant to simulate the visitor design pattern. With this approach, an expression is one of many possibilities instead of a base class. The approach is similar to using a C union, but allows dynamic dispatch as a language feature thanks to the function std::visit, which automatically invokes the overload of a callable that best suits the variant’s current alternative.

Expression definition
enum UnaryOperators : char {
    kUnaMinus = '-',
};

enum BinaryOperators : char {
    kPlus     = '+',
    kBinMinus = '-',
    kAsterisk = '*',
    kSlash    = '/',
};

template<UnaryOperators op>  struct UnaOp;
template<BinaryOperators op> struct BinOp;
class Id;
struct IntLit;
struct StrLit;
struct FunCall;
struct NoExp;


// RVar is a named reference to a variable (see the Name Resolution section).
using Exp = std::variant<
    UnaOp<kUnaMinus>*,
    BinOp<kPlus>*,
    BinOp<kBinMinus>*,
    BinOp<kAsterisk>*,
    BinOp<kSlash>*,
    RVar,
    IntLit*,
    StrLit*,
    FunCall*,
    NoExp*
>;

The AST becomes a very simple data structure which the algorithms are free to modify.

A function call node
struct FunCall {
    RFun rfun;
    std::vector<Exp> args;

    FunCall(RFun rfun, std::vector<Exp>&& args) : rfun(rfun), args(std::move(args)) {  }
};

And we can include all the code related to a pass over the AST inside a single class which packs the methods and the data it needs to act. This also favors debugging of large systems, because this kind of design doesn’t rely on singletons: we can create as many instances of an optimizer as we want and run a suite of tests over plain ASTs defined directly by the programmer.
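
As a hedged sketch (the member names lhs, rhs and exp are illustrative, not our exact code), a pass that counts literals could look like this, with std::visit dispatching on the Exp variant defined above:

// Assumes the Exp variant and node declarations shown earlier.
class LiteralCounter {
public:
    int count = 0;

    void operator()(IntLit*) { ++count; }
    void operator()(StrLit*) { ++count; }
    template<BinaryOperators op>
    void operator()(BinOp<op>* node) {
        std::visit(*this, node->lhs);  // recurse into both operands
        std::visit(*this, node->rhs);
    }
    template<UnaryOperators op>
    void operator()(UnaOp<op>* node) { std::visit(*this, node->exp); }
    void operator()(FunCall* call) {
        for (Exp& arg : call->args) std::visit(*this, arg);
    }
    void operator()(RVar) {  }   // a variable use holds no literal
    void operator()(NoExp*) {  }
};

// Usage: one instance per pass, no singletons involved.
//   LiteralCounter counter;
//   std::visit(counter, some_expression);
//   // counter.count now holds the number of literals in the tree.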

Implementation and Project Structure

The Abstract Syntax Tree

The definition of the whole AST is divided into four files.

ast_defs.hpp

Contains the basic definitions of the AST. It contains the supported operators, the variant expression (Exp) and the variant statement (Stmt).

ast.hpp

Contains the AST classes which represent objects in the language and carry a detailed description of their implementation as declared by the programmer.

These classes are special because they can pack information that is needed for the final translation. We also considered it a good choice to inherit from these classes, because in this case class polymorphism was beneficial. For example, our builtin operators inherit from Fun, the class that represents a function.

expressions.hpp

Declares the expression nodes.

statements.hpp

Declares the statements structures.

Name Resolution

Name Scopes

Big programs consist of thousands of lines of code. Languages usually offer mechanisms to avoid name conflicts. Name scopes are an abstraction that groups names, allowing the same name to belong to different name scopes.

Name scopes usually receive a name themselves, which makes it possible to refer to the names inside one name scope from a different one, usually by prefixing the name with the name scope’s name.

We wanted to design a general system that would allow

  • Nesting of name scopes

  • Exporting an AST with unresolved names

  • Using identifiers before their declaration (for some use cases)

Regarding the last point, we thought this could be a very useful feature, allowing the use of constants and functions before their definition.

However, we believed this was a feature we would not want to apply to every single identifier. The reason is simple. Given the following code

def f() {    // name scope of function f
    if () {  // name scope created by a compound statement
        a(); // unresolved name (hasn't been declared at this point)
    }
    int a = 3;
}

def a() {

}

In a typical imperative language, the usage of the name a would not point to the variable. Nor would it point to the function, because the function is declared afterwards. But we wanted to keep the possibility of having name scopes in which names are not available until you define them.

Our solution is to create two types of name scopes.

Achronological Name Scopes

In achronological name scopes, definitions don’t follow any order. In an advanced system, this usually implies that the compiler does not guarantee any order of initialization. By definition, any definition or statement may use the rest of the names. Nested name scopes inherit all of the names declared in this name scope, independently of the moment at which they are defined. Other good names for this type of name scope could be declaration name scope or parallel name scope.

Chronological Name Scopes

In chronological name scopes there exists a total order between definitions. A definition may only use the definitions from the name scope that were defined before it. The compiler can guarantee the order of initialization and can easily resolve names during parsing by maintaining a stack of active identifiers for each name. Other good names for this type of name scope could be implementation name scope or ordered name scope.

Achronological name scopes can be useful for the global name scope, class name scopes and some user-defined name scopes. Chronological name scopes can be useful for the bodies of functions, loop statements and user-defined name scopes where the order of initialization is important.

The implementation inside the compiler is easy if we require that achronological name scopes may only be children of achronological name scopes. In that case, the stack of active name scopes at any point in the code always looks like a succession of achronological name scopes followed by a succession of chronological ones. When a name is used, the compiler can check the active identifiers to see whether the name references an object in a chronological name scope (which must already be defined). If the top active identifier with this name is not above the top achronological scope, an identifier is created in the top achronological scope instead. At the end of the program, an algorithm can easily redirect the identifiers in achronological scopes which weren’t defined to an identifier in a parent scope. This, precisely, is our implementation.
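
A minimal sketch of this lookup logic (with simplified, hypothetical structures; our real implementation lives in identifiers.hpp and id_resolution.hpp):

#include <string>
#include <unordered_map>
#include <vector>

// Simplified scope entry for the sketch; names map to object handles.
struct NameScope {
    bool chronological;
    std::unordered_map<std::string, int> names;
};

// 'active' holds the stack of active scopes: achronological scopes at the
// bottom, chronological ones on top, as the nesting restriction guarantees.
int resolve_or_defer(std::vector<NameScope*>& active, const std::string& name) {
    // Walk from the innermost scope outwards.
    for (auto it = active.rbegin(); it != active.rend(); ++it) {
        NameScope* scope = *it;
        if (!scope->chronological)
            break;  // reached the achronological zone: stop searching
        auto found = scope->names.find(name);
        if (found != scope->names.end())
            return found->second;  // defined before use: bind immediately
    }
    // Otherwise, create a placeholder identifier in the top achronological
    // scope, to be redirected by the final pass (creation omitted here).
    return -1;
}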

Identifiers

As pointed out in the previous section, our design of the language means that the nodes in the AST cannot point directly to the objects they refer to. The reason is that we only know the name of such an object, and different objects can have the same name.

Names can be resolved by doing a pass over the AST. To maintain type safety in our code, we followed this scheme:

  • A type (Type) is referenced by a type usage (RType) in the declaration of variables and functions.

  • A variable (Var) is referenced by a var usage (RVar), as an expression or as a memory location.

  • A function (Fun) is referenced by a function call (RFun, FunCall).

In every case, the named abstraction is designated by an identifier, uniquely determined by its name and its name scope.

We made use of the following definitions.

Named references
union RVar {
    identifiers::Id* id;  // before name resolution: the identifier used
    Var* var;             // after name resolution: the variable object

    RVar() {  }
    explicit RVar(identifiers::Id* id) : id(id) {  }
};

union RType {
    identifiers::Id* id;
    Type* ty;

    RType() {  }
    explicit RType(identifiers::Id* id) : id(id) {  }
};

union RFun {
    identifiers::Id* id;
    Fun* fun;

    RFun() {  }
    explicit RFun(identifiers::Id* id) : id(id) {  }
};

By using unions, we incur no extra cost in memory space. The AST is defined in such a way that an expression or statement which uses a variable has a member of type RVar instead of a pointer to a variable object (Var*). During the name resolution pass, we change the reference to point to the object, whose information was registered inside the identifiers::Id class at the moment of its definition. Passes that happen after this one use these references as if they pointed directly to the named abstraction.

This is only an implementation detail, but by using unions instead of generic pointers we benefit from type checking by the compiler and avoid writing static casts everywhere. In addition, it is clear from a programmer’s point of view that RVar is a reference to a variable object, whether this object is currently represented by its identifier or not.
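
For illustration, resolving a variable reference then amounts to overwriting the union in place. A minimal sketch, assuming a hypothetical accessor var_definition() on identifiers::Id:

// Hypothetical sketch; the real logic lives in id_resolution.hpp.
void resolve(RVar& rvar) {
    Var* definition = rvar.id->var_definition();  // object bound to the id
    rvar.var = definition;  // later passes read rvar.var directly
}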

The name system is implemented in three files:

ast_defs.hpp

Contains the definitions of RType, RVar and RFun.

identifiers.hpp

Contains the definitions of the identifiers namespace (a C++ namespace in the code, not a name scope in the compiled language). It defines the classes NameScope and Id and contains functions to add and change name scopes during parsing.

id_resolution.hpp

Contains the class in charge of performing the name resolution and updating the named references to point to the named objects. We have also used this class to perform semantic checks during this pass to ensure the program’s correctness.

Type Handling

We designed our compiler with the idea of being able to support user-defined types. However, we have not had time to do so.

Fortunately, we designed our compiler with two primitive types, which means type checking had to be considered in our design from the start.

Error Reporting

The compiler performs many verifications to ensure that the input text representation of the program fulfills the syntactic and semantic requirements. Through these verifications we can catch some errors in the input and report them to the user in the form of error messages. We can classify the analysis in three phases:

  • Lexical phase

  • Syntactic phase

  • Semantic phase

Lexical Phase

During the lexical analysis phase we can detect some typical errors:

  • Unclosed comments or strings

    This type of error is treated by placing the scanner in the corresponding start condition and matching the applicable rules until end-of-file or a line feed is encountered, respectively.

    Note
    Start Conditions. Flex provides a mechanism for conditionally activating rules. Any rule whose pattern is prefixed with "<sc>" will only be active when the scanner is in the start condition named "sc". More at http://dinosaur.compilertools.net/flex/flex_11.html
  • Exceeding length of identifiers or literals

    There is a limit to how long the identifiers and the literals can be.

    • String literals

      In this case, each time we scan a new character to be added to the string, the sum of the current string size and the size of the character sequence in yytext is checked so that it does not surpass the maximum string size (7kB). If this happens, we enter a special start condition in which all the characters until the end of the literal or the end of the line are skipped, and then the scanner returns to the initial start condition.

    • Integer literals

      Once we scan a digit sequence, we simply check that the number represented by the literal fits in a 32-bit integer (a sketch of this check follows this list).

    • Identifiers

      Identifiers cannot exceed 15 characters. Our implementation in Flex includes a specific rule ({letter}|_)({letter}|{digit}|_){16,16} for catching identifiers with at least 16 characters. The reason why we read only the first 16 is to avoid a buffer overflow while the identifier is being scanned; the compiler simply discards the rest of the identifier by using start conditions. We represent oversized identifiers with the name BigXXLName, which can appear in the output of semantic errors.

  • Appearance of illegal characters

    There are a number of characters that our scanner recognizes as illegal because they can’t be used in any token. Printing an error for each unrecognized character could produce a very large output, for example if the user accidentally tries to compile a binary file. The lexical scanner groups sequences of these characters until a valid character is found with the rule [^0-9a-zA-Z()".,:;=+\-*/\\ \t\r\n]+. This is called panic mode recovery.
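
As a sketch of the integer-literal check mentioned above (a hypothetical helper, not our exact scanner action):

#include <cerrno>
#include <cstdint>
#include <cstdlib>

// The scanned text contains digits only: the unary minus is a separate
// token, so only the positive range needs checking.
bool fits_in_int32(const char* digits) {
    errno = 0;
    long long value = std::strtoll(digits, nullptr, 10);
    return errno != ERANGE && value <= INT32_MAX;
}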

Syntactic Phase

We have applied two methods of dealing with errors supported by Bison.

  • Panic mode recovery. When an unexpected token is received while parsing the program, the parser discards all the incoming tokens until a token in a selected set of synchronizing tokens appears.

  • Error productions. Certain errors can be handled by augmenting the grammar with error productions that generate the erroneous constructs. This method allows us to produce appropriate error messages for typical mistakes.

Error Production

We begin by observing in which cases we were able to use the second method, since these cases need no further explanation. In all rules where a semicolon was needed (always as a separator), we added an additional rule that parses the symbol resulting from removing the semicolon token from the symbol parsed by the original rule. A similar procedure was followed for the absence of the token "then" in the if statement, the token "do" in the while statement, and the token "var" in a variable declaration.

This technique is useful not only to detect missing tokens, but also to detect their unnecessary presence, as in the case of a semicolon after the last statement of a compound statement. As an improvement to the compiler, we could add more of these rules to report errors for the absence of the "begin" token in the compound_statement symbol, mismatched brackets, etc.

Panic Mode Recovery

To recover from unexpected errors, the strategy we follow is based on looking for tokens that can lead us to a steady state, from which we can continue parsing without a high risk of propagating the "same" error. This is also why we normally use the statement yyerrok; in Bison’s rules right after the synchronization token is found.

Note
To prevent an outpouring of error messages, the parser will output no error message for another syntax error that happens shortly after the first; only after three consecutive input tokens have been successfully shifted will error messages resume. You can make error messages resume immediately by using the macro yyerrok in an action.

Now let’s take a closer look at the error rules in syntax.yy, from top to bottom. The first error rule to appear is related to an error in the program header, in the program symbol:

program:
    ...
    error ";" functions declarations compound_statement "."

It is clear that this rule is enough for this symbol (although it makes the parser discard all the valid tokens preceding the error), provided that there are error rules for the symbols functions, declarations and compound_statement. For the symbol function we have the rule

function:
    ...
    "function" error declarations compound_statement

that again relies on the error rules of the symbols declarations and compound_statement. In this case the synchronization tokens are "var" and "const", so we are forced not to have a rule starting with the error token in the symbol declarations. We decided this was fine because a variable declaration without the "var" token is already covered by the rule

declarations:
    ...
    declarations comma_sep_dcl[ids] ":" rtype ";"
    ...

and any other error without the "var" token is unusual; if it occurs, the error is recovered by the error rules of the symbols program or function. So, for the declarations symbol, we have the rules

declarations:
    ...
    declarations "var" error ";"
    |
    declarations "const" error ";"
    ...

Regarding the statements, we have these error rules:

statement:
    ...
    |
    semcolon_sep_stmts_ error
    |
    error ";" {yyerrok;} statement
    ...

The second one allows us to discard incorrect statements (and report all of them thanks to yyerrok) until the first valid statement is found. The first one parses invalid tokens until the lookahead token is the semicolon; then the parser shifts the semicolon in the rule

statement:
    semcolon_sep_stmts_ ";" {yyerrok;} statement
    ...

allowing us, once again, to report all the contiguous incorrect statements and continue with the correct parsing of the program. But what if all of the statements of a compound_statement are incorrect? In that case none of the presented rules is helpful, so we introduce a new rule for the symbol compound_statement

compound_statement
    ...
    "begin" error "end"

In fact, this is also a strategy for panic mode recovery: it consists in recovering to the matching closing delimiter of an opening delimiter that has already been parsed. We do this every time an opening bracket is parsed in a statement or in an expression.

The last error rule is the one in symbol rtype:

rtype:
    ...
    error {
        $$ = ast::RType(builtin::ErrorType()->id());
        yyclearin;
        yyerrok;
    }

As we can see, our grammar accepts any token besides the expected tokens "int" and "str" by inserting a reference to a type explicitly defined for error cases. When user-defined types are implemented, this rule will no longer be necessary; instead, the semantic phase will be responsible for checking that the referenced identifier belongs to a type.

Semantic Phase

While building the AST, we can detect the redefinition of an object by simply checking our symbol table name_table for already defined objects with that identifier in the same scope. However, we have to wait until the AST is complete to do other kinds of checking:

  • References to non-existing objects:

    As we said before, we have both chronological and achronological name scopes, so we can’t check whether an identifier is undeclared until the symbol table is complete. Once all the identifiers have been registered, we bind all the references that share a name in a first pass through the symbol table, and if, given an identifier, we cannot determine which object it refers to (i.e. there is no object defined with that id in this scope or in an enclosing one), we report it as an unresolved identifier.

  • Repeated functions:

    Due to function overloading, we cannot know if a function is being redefined from its name alone. Instead, we have to check the types of its arguments. This, together with the fact that references to types are not yet resolved, is why we need an independent second pass through the symbol table for this check.

  • Usage of objects and type checking:

    Obviously, we cannot check whether an object is being used correctly until all identifiers are resolved, because we can only know how the object is going to be used through the references that refer to it. Hence, once all the identifiers have been resolved, we can check whether the object they refer to is intended to be used the way each expression tries to use it. We do this through a visitor named name_resolver, which walks the whole AST checking some typical semantic information:

    • identifiers used as variables (types) effectively refer to variables (types).

    • the l-value and the r-value have the same type in assignments.

    • constant variables are not used as the target of assignments.

    • expressions in if, while and for statements are of int type.

    • the arguments of write and read are of primitive types (the only ones that are printable so far).

    • given a list of arguments for a function name, there exists a function with that name whose parameter types match the types of the given arguments in number and order. As we treat operators the same way as functions, we use the same method in both cases to perform overload resolution, i.e. to obtain the object given the parameters and the id/operator.
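
A hedged sketch of that overload lookup (with simplified, hypothetical types; the real method also handles operators):

#include <string>
#include <vector>

struct Type;
struct Fun {
    std::string name;
    std::vector<Type*> param_types;
};

// Returns the candidate whose parameter types match the argument types in
// number and order, or nullptr if there is none (a semantic error).
Fun* resolve_overload(const std::vector<Fun*>& candidates,
                      const std::vector<Type*>& arg_types) {
    for (Fun* fun : candidates)
        if (fun->param_types == arg_types)  // exact match: no conversions
            return fun;
    return nullptr;
}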

Code Generation

Once we have checked for all the possible errors in the input code, and provided that there is none, we generate the code taking advantage of the name resolution that was done while looking for errors. That is, all the nodes that represented references to objects are now actual references to those objects.

With the AST ready, we use another visitor, named translator, to generate the LLVM code of the program represented by the AST; each node corresponds to a little piece of code that is written to a given std::ostream. This visitor is defined in the file llvm.hpp.

Declaration Translation

The grammar of our language doesn’t allow declaring global variables; instead, each function is preceded by local declarations. Nevertheless, value literals behave like an implicit declaration, in the sense that memory must be reserved for them. LLVM allows integer values to be used directly inside the code, so only str literals are allocated globally.

Statement Translation

The assignment, if, while and for statements are translated as follows:

id := exp

code for exp
store type_llvm_name exp_value , type_llvm_name * % id , align type_def_alignment

if cond then
statement1
else
statement2

code for cond
temp_var = icmp ne i32 cond_value , 0
br i1 temp_var , label then_label , label else_label
then_label :
code for statement1
br label fi_label
else_label :
code for statement2
br label fi_label
fi_label :

while cond do
statement

br label comp_label
comp_label :
code for cond
temp_var = icmp ne i32 cond_value , 0
br i1 temp_var , label loop_label , label afterwhile_label
loop_label :
code for statement
br label comp_label
afterwhile_label :

for id := start_exp to end_exp do
statement

code for start_exp
store i32 start_exp_value , i32* % id , align 4
br label comp_label
comp_label :
code for end_exp
code for id
temp_var = icmp ne i32 end_exp_value , id_value
br i1 temp_var , label loop_label , label afterfor_label
loop_label :
code for statement
id increment
br label comp_label
afterfor_label :

Write and Read Statements

Write and read statements make use of C’s stdio library: write is translated into calls to printf and read into calls to scanf, as can be seen in the declarations at the end of the output file above.

Expression Translation

exp1 op exp2

code for exp1
code for exp2
code for operator

op exp

code for exp
code for operator

id

id_val = load type_llvm_name , type_llvm_name * % id , align type_def_alignment

id ( exp1 , exp2 … )

code for expressions
call_val = call return_type_llvm_name @ id ( exp1_type_llvm_name exp1_val , exp2_type_llvm_name … )

Function Inlining

The function class Fun is responsible for translating a function call. Our builtin operators override this behavior and inline the corresponding code instead.
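
A hedged sketch of this arrangement (with hypothetical method names; the real signatures live in ast.hpp and llvm.hpp):

#include <ostream>

// Fun emits a call instruction; builtin operators, which inherit from Fun
// as described earlier, emit the equivalent code inline instead.
struct Fun {
    virtual ~Fun() = default;
    virtual void translate_call(std::ostream& os) const {
        os << "\t%r = call i32 @some_function(i32 %arg)\n";  // regular call
    }
};

struct BuiltinIntPlus : Fun {
    void translate_call(std::ostream& os) const override {
        os << "\t%r = add nsw i32 %lhs, %rhs\n";  // inlined, no call emitted
    }
};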

Conclusions

While doing this project, we have obtained a fairly general view of the organization of translation programs. At the same time, we have learned the necessary techniques for solving problems that appear when designing a translator for a programming language, especially abstraction mechanisms such as the abstract syntax tree.

It has also allowed us to reflect on certain characteristics of languages and the value of the decisions made about them, in particular those related to the different forms of binding, visibility and scope of a variable, as well as the great importance of separators when carrying out a complete analysis of the code of a program. We are left with the desire to continue the project: even though we had already thought of several aspects to improve and features to add, we ran out of time, time that we had to share with other subjects and projects.

All in all, we finish this project with great enthusiasm and we would enjoy working on something similar.

How to continue the project

Now that we have finished the project, we would like our teachers to consider allowing future students to continue this project instead of starting from scratch. The design of the application could receive a few improvements, which we have marked with //IMPROVEMENT comments in our source code. The project can also be used as a reference for future students, or as a project skeleton after removing some parts of the code.

These are some of the improvements this compiler could receive:

Better Encapsulation

Although the main parts of the project are well modularized, we still maintain some global variables, like the collection of the program’s string literals. The AST’s root, the struct Prog, could be upgraded to a class which would maintain this kind of state, state which could also be queried and modified by passes over the AST.

Memory Management

Right now, the program relies on the AST being freed at the end of execution. We would encourage future contributors to understand and modify the code to represent memory ownership explicitly.
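
For instance, ownership could be made explicit in the variant itself (a sketch with illustrative node types, not a drop-in change):

#include <memory>
#include <string>
#include <variant>

struct IntLit { int lit; };
struct StrLit { std::string lit; };

// Each alternative owns its node; a parent holding an OwnedExp frees its
// whole subtree automatically when destroyed.
using OwnedExp = std::variant<std::unique_ptr<IntLit>,
                              std::unique_ptr<StrLit>>;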

Output Optimization

This would be a fresh topic, since we have not implemented any optimization pass.

LLVM Libraries

We have implemented a direct translation making use of the LLVM language specification. Instead, it would be useful to translate the program’s AST into LLVM’s own representation and make use of all the available libraries to optimize this code.

New Features

And of course, complementing the language with new features is important. In order of importance, we miss:

  • The ability to declare arrays

  • User declared types

  • A unique feature (as templates are for C++), such as some kind of pattern matching or even an inheritance system.

References

  • [1] Charles N. Fischer, Ron K. Cytron & Richard J. LeBlanc, Jr. Crafting a Compiler. Addison-Wesley. 2010.