*lambda语言编译器原理 English Version*

*Note*: this is a big refactor of a previous post. I was planning on making this into a series, when I realised that first I had to make some improvements to this initial post to have a better structure. Well, this is the result, and I hope you enjoy. Alternatively, the original version can be found here.

I had heard about lambda calculus, but it wasn’t until recently, when I started reading Types and Programming Languages, that I could really see the beauty of it. So simple, and yet so powerful.

In fact, it’s such a simple language that its full implementation fits in a single article, which makes it great to learn/teach the basics of interpretation.

But before we start, what is lambda calculus anyway? Here’s the Wikipedia description:

Lambda calculus (also written as λ-calculus) is a formal system in mathematical logic for expressing computation based on function abstraction and application using variable binding and substitution. It is a universal model of computation that can be used to simulate any single-taped Turing machine and was first introduced by mathematician Alonzo Church in the 1930s as part of an investigation into the foundations of mathematics.

And to illustrate, here’s what a very simple λ-calculus program might look like:

```
(λx. λy. x) (λy. y) (λx. x)
```

You only have two constructs in λ-calculus: Function abstractions (i.e. a lambda expression) and applications (i.e. function calls), and yet, you can perform any computation with it!

## 1. Structure

An interpreter takes the source code of a program and executes it. In order to transform the program string into an executable format the interpreter has to do a lot of work, and this work is usually organised into *phases*. Compilers and interpreters might have many phases, ours will have two:

- Lexical analysis: The lexical analyser (or lexer for short) breaks sequences of characters into meaningful tokens, e.g. the input
`x`

will result into a identifier token (`LCID`

in our case), with the semantic value`"x"`

associated to it. - Syntactic analysis: The syntax analyser (or parser for short) consumes the tokens generated by the lexer, verifies if the input is correct according to the source language’s syntax and builds an Abstract Syntax Tree (AST).

In addition, there are many kinds of interpreters, we’ll write an AST interpreter, as we can take the result of our lexical analysis and feed it directly into the interpreter. Other kinds of interpreters will require extra phase(s) to transform the AST into the interpreter’s input format.

## 2. Grammar

Before writing a parser, the first thing we need to define is the grammar for the language we’ll be parsing, here’s the BNF grammar for the λ-calculus:

```
term ::= application
| LAMBDA LCID DOT term
application ::= application atom
| atom
atom ::= LPAREN term RPAREN
| LCID
```

### 2.1. Structure of the grammar

The *grammar* consists of a set of *productions*.

A *production* is a substitution rule: it indicates that we can replace its *left-hand side* with its *right-hand side*.

The *left-hand side* of a production consists of a single *nonterminal*.

The *right-hand* side of a production consists of one or more *alternative* substitutions separated by pipes ( `|`

).

An *alternative* is any sequence of zero or more *terminals* and *nonterminals*.

*Nonterminals* appear in lowercase and are used to refer to the *right-hand side* of the production where it was defined.

*Terminals* appear in uppercase and represent some pattern in the actual text input.

### 2.2. Understanding the grammar

The grammar starts with a `term`

, as it’s the first rule of the grammar.

A `term`

might consist of either an `application`

or an `abstraction`

(defined inline).

An `application`

can be made of either another `application`

followed by an `atom`

or alternatively by a single `atom`

. This is a left-recursive rule (recursive because it refers to itself, left-recursive because the recurring term appears in the first position), which means that it’s left-associative (i.e. an application with the form `a b c`

, should be grouped as `(a b) c`

rather than `a (b c)`

).

`atoms`

can either be a lowercase identifier (`LCID`

) or a `term`

wrapped in parentheses (for disambiguation).

## 3. Tokens

The terminals in the grammar are exactly the tokens that our lexer has to produce (and that later our parse will consume). Here are the rules for creating each token:

```
LPAREN: '('
RPAREN: ')'
LAMBDA: 'λ' /* we'll also allow `\` for convenience */
DOT: '.'
// lowercase identifier: lowercase letter followed by any letters
LCID: /[a-z][a-zA-Z]*/
```

To represent the tokens we’ll create `Token`

class to hold the `type`

(one of the above) and an optional semantic value (e.g. the string matched by `LCID`

).

```
class Token {
constructor(type, value) {
this.type = type;
this.value = value;
}
};
```

## 4. Lexer

Now we can use the tokens defined above to write a `Lexer`

, providing a nice API for the parser to consume the tokens.

The token creation part of the Lexer is not very exciting: it’s one big switch statement that checks the next char in the source code:

```
_nextToken() {
switch (c) {
case 'λ':
case '\\':
this._token = new Token(Token.LAMBDA);
break;
case '.':
this._token = new Token(Token.DOT);
break;
case '(':
this._token = new Token(Token.LPAREN);
break;
/* ... */
}
}
```

The lexer will provide the following API for the parser:

`next(t)`

: returns a boolean indicating whether the next token is of type`t`

;`skip(t)`

: same as`next`

, but skips the token if it matches;`match(t)`

: assert that`next`

is true, and`skip`

;`token(t)`

: assert that`next`

is true, and return the token’s semantic value.

Okay, now to the `Parser`

!

## 5. Parser

Before we start writing the parser, there’s only one tricky bit about this grammar: handwritten parsers are usually recursive descent (the one we’re writing will be), and they can’t handle left recursion.

But luckily left recursions can be removed with one simple trick:

We take the original production…

```
application ::= application atom
| atom
```

…and break it into two:

```
application ::= atom application'
application' ::= atom application'
| ε /* empty */
```

Now we have a right-recursive rule which can only expand `application'`

while there’s another `atom`

next, otherwise we choose the alternative right-hand side (`ε`

) which matches nothing, i.e. we stop.

When implementing the parser we just have to be careful to keep the application rule left-associative, otherwise it will change the order in which the applications are executed and our interpreter will be incorrect.

### 5.1. AST

From our grammar we can also derive the nodes for our AST: the right-hand side alternatives that have more than one nonterminal or that have any terminals will *most of the time* require a new node. (The exception in our case is the `LPAREN term RPAREN`

alternative, which is simply a `term`

wrapped in parentheses, so we just need the `term`

.)

We then create three nodes for three alternatives:

`Abstraction`

for`LAMBDA LCID DOT term`

, holding`LCID`

which is the parameter of the abstraction and`term`

which is the body.`Application`

for`application atom`

, which holds both values as`lhs`

and`rhs`

.`Identifier`

for`LCID`

, which holds the parsed lowercase identifier.

Here’s a simple program with its AST:

```
(λx. x) (λy. y)
Application {
lhs: Abstraction {
param: 'x',
body: Identifier { name: 'x' }
},
rhs: Abstraction {
param: 'y',
body: Identifier { name: 'y' }
}
}
```

### 5.2. Parser implementation

Now that we have our AST nodes, we can use them to construct the actual tree.

We can also derive our parser implementation from the grammar using the following steps:

- Create one method for each production, named after it’s left-hand side nonterminal.
- The body of the method will choose the appropriate alternative based on the next token (i.e. one token of lookahead).
- For each
`x`

in the chosen alternative:- if
`x`

is a nonterminal, call method`x`

; - if
`x`

is a terminal, consume token`x`

.

- if

Here’s the actual code:

```
term() {
// term ::= LAMBDA LCID DOT term
// | application
if (this.lexer.skip(Token.LAMBDA)) {
const id = this.lexer.token(Token.LCID);
this.lexer.match(Token.DOT);
const term = this.term();
return new AST.Abstraction(id, term);
} else {
return this.application();
}
}
application() {
// application ::= atom application'
let lhs = this.atom();
while (true) {
// application' ::= atom application'
// | ε
const rhs = this.atom();
if (!rhs) {
return lhs;
} else {
lhs = new AST.Application(lhs, rhs);
}
}
}
atom() {
// atom ::= LPAREN term RPAREN
// | LCID
if (this.lexer.skip(Token.LPAREN)) {
const term = this.term(Token.RPAREN);
this.lexer.match(Token.RPAREN);
return term;
} else if (this.lexer.next(Token.LCID)) {
const id = new AST.Identifier(this.lexer.token(Token.LCID));
return id;
}
}
```

## 6. Evaluation

Now that we have our AST we can use it to evaluate the program.

In order to know how should our interpreter evaluate each construct we first need to look at the λ-calculus’ evaluation rules.

### 6.1. Evaluation rules

First we need to define what are our terms and which of these are values.

We can also derive our terms from the grammar:

```
t1 t2 # Application
λx. t1 # Abstraction
x # Identifier
```

Yes, these are exactly the same as the nodes from our AST! But which are values?

Values are terms that are in its final form, i.e. they can’t be evaluated any further. In this case the only terms that are values are abstractions. (You can’t evaluate a function unless it’s called.)

The actual evaluation rules are as following:

```
1) t1 → t1'
—————————————————
t1 t2 → t1' t2
2) t2 → t2'
—————————————————
v1 t2 → v1 t2'
3) (λx. t12) v2 → [x ↦ v2]t12
```

Here’s how we can read each rule:

- If
`t1`

is a term that evaluates to`t1'`

,`t1 t2`

will evaluate to`t1' t2`

. i.e. the left-hand side of an application is evaluated first. - If
`t2`

is a term that evaluates to`t2'`

,`v1 t2`

will evaluate to`v1 t2'`

. Notice that the left-hand side is`v1`

instead of`t1`

, that means that it’s a value and therefore can’t be evaluated any further. i.e. only when we’re done evaluating the left-hand side of the application we’ll start evaluating the right one. - The result of application
`(λx. t12) v2`

is the same as effectively replacing all occurrences of`x`

in`t12`

with`v2`

. Notice that both sides have to be values before evaluating an application.

### 6.2. Interpreter

The interpreter follows the evaluation rules to reduce a program to a value.

All we have to do now is translate the rules above into JavaScript:

```
const eval = ast => {
while (true) {
if (ast instanceof AST.Application) {
if (isValue(ast.lhs) && isValue(ast.rhs)) {
ast = substitute(ast.rhs, ast.lhs.body);
} else if (isValue(ast.lhs)) {
ast.rhs = eval(ast.rhs);
} else {
ast.lhs = eval(ast.lhs);
}
} else {
return ast;
}
}
};
```

*NOTE: the original version of this article presented a slightly different version of the eval function which stored the variable bindings in a context map. The function was a little bit bigger and less clear, but required less code around it, making the general project simpler. However, there was a bug in the implementation and I decided to switch to the version above, which more closely matches the evaluation rules (and also the version presented in TaPL)*

I hope the correspondence between the evaluation rules and the evaluation function is clear, but in any case, here is it in written form:

- First we check if it’s an application: if it is, we can evaluate it.
- If both sides of the abstraction are values, we can simple replace all the ocurrences of
`x`

with the value being applied to the abstraction; (rule 3) - Otherwise, if the left-hand side is value, we evaluate right-hand side of the application; (rule 2)
- If none of the above applies, we just evaluate the left-hand side of the application. (rule 1)

- If both sides of the abstraction are values, we can simple replace all the ocurrences of
- Lastly, if no rules applies to the AST, that means that it’s already a value, and we are done reducing it.

## 6.3 De Bruijn index

The implementation of the `substitute`

function uses the De Bruijn index, which assigns to every variable the distance between its occurrence and its definition.

For example, the following implementation of `true`

and `false`

(see Church Encoding for reference):

```
(λx. (λy. x)) (λf. (λg. g))
```

becomes:

```
(λx. (λy. 1)) (λf. (λg. 0))
```

Note that the parameter names are no longer necessary, but I’ll keep them in order to highlight the indices, keeping everything else the same.

We can generate the indices during parsing, by augmenting the parser with a context, which will work as a stack of variable names. In order to find the De Bruijn index of a variable, we get its offset in the context stack.

Here are the main rules affected (the other ones will simply pipe the context variable through):

```
term(ctx) {
if (this.lexer.skip(Token.LAMBDA)) {
const id = this.lexer.token(Token.LCID);
this.lexer.match(Token.DOT);
/* Push the parameter into the context stack when parsing the
* abstraction's body (without mutating the current context)
*/
const term = this.term([id].concat(ctx));
return new AST.Abstraction(id, term);
}
/* ... */
}
atom(ctx) {
/* ... */
if (this.lexer.next(Token.LCID)) {
const id = this.lexer.token(Token.LCID)
// Use the builtin indexOf to get the distance in the stack
return new AST.Identifier(ctx.indexOf(id), ctx.length);
}
/* ... */
}
```

## 6.4 Substitution

I personally find the substitution algorithm a little bit confusing, and that’s why I didn’t include it in the original version of the article, but I’ll try and do my best to explain it.

Substitute is actually in built on top of 2 operations: `shift`

and the actual `subst`

:

`shift(x, node)`

: shifts all free variables in`node`

by`x`

.`subst(value, node)`

: substitutes all the variables in`node`

that have a De Bruijn index of zero by`value`

.

### Let’s look at a step by step example:

- If we start with the following source:
`(λx. (λy. y x)) (λt. a)`

- Assuming that
`a`

is a free variable that is one level above, it will look like the following with De Bruijn indices:`(λx. (λy. 0 1)) (λt. 1)`

- Now we evaluate using rule 3: we replace every variable pointing to
`x`

with`(λt. 1)`

:`(λy. 0 (λt. 1)))`

- Can you spot the problem?
`(λt. 1)`

was originally referring to`a`

, but after the substitution it refers to`y`

.That happened because we moved`(λt. 1)`

one level deeper than it was before, but didn’t update the free variables. For that reason substitution is defined as follows:`↓([x ↦ ↑value] node)`

*This means: shift down the result of replacing all occurrences of*`x`

in`node`

with`value`

shifted up. - Breaking it down:5.1. First we shift the free variables in the value up by one:
`(λx. (λy. 0 1)) (λt. 2)`

5.2. Then we replace the occurrences of`x`

with the new value:`(λx. (λy. 0 (λt. 3)))`

*notice that we bump the index that refers to*5.3 Then we drop the enclosing abstraction:`a`

from`2`

to`3`

. That happens because every time we increase the depth of the node (by placing it inside a closure) we need to shift it by the depth difference, which is one in this case (it’s only inside one extra closure).`(λy. 0 (λt. 3))`

5.4. And shift all the free variables down by one:`(λy. 0 (λt. 2))`

- Finally we just need to lookup the parameters’ names at the right offsets, and we get a proper output:
`(λy. y (λt. a))`

### 6.5. Notes on substitution

The substitution logic is recursive, so it requires implementing functions that visit the AST recursively. The full implementation includes an utility function for writing AST visitors and two visitors: `shift`

and `subst`

. The code for that is long and not very interesting, so I skipped it here, but it will be available at the end with the full implementation.

## 7. Printing

We are almost done now: we can already reduce a program to a value, all we have left now is to implement the logic to print this value.

An easy way of doing that is by adding a `toString`

method to every AST node, the only thing we have to be careful with is that we have to map the De Bruijn indices back to the original variable names, as shown in item 6 above.

Here’s the code:

```
/* Abstraction */ toString(ctx=[]) {
return `(λ${this.param}. ${this.body.toString([this.param].concat(ctx))})`;
}
/* Application */ toString(ctx) {
return `${this.lhs.toString(ctx)} ${this.rhs.toString(ctx)}`;
}
/* Identifier */ toString(ctx) {
return ctx[this.value];
}
```

We know that the interpreter will reduce the program to a value, we also know that this value will be an AST node, and according to the evaluation rules the only value in λ-calculus is an abstraction. We can then conclude that the evaluation result will be an abstraction, and hence the default value for the context argument in the `Abstraction.toString`

method, so we can call it from outside without providing a context.

Now we can just call `toString`

in the resulting node, and it’ll print all of its children recursively (passing the context) in order to generate its string representation.

## 8. Putting it all together

We’ll need a runner script that will wire all this parts together, the code should be something like:

```
// assuming you have some source
const source = '(λx. λy. x) (λx. x) (λy. y)';
// wire all the pieces together
const lexer = new Lexer(source);
const parser = new Parser(lexer);
const ast = parser.parse();
const result = Interpreter.eval(ast);
// stringify the resulting node and print it
console.log(result.toString());
```

## Source code

The full implementation can be found on Github: github.com/tadeuzagallo/lc-js

#### That’s all!

Thanks for reading, and as usual, any feedback is more than welcome! 😊