A Hands-on Guide to Implementing DeBruijn Indicies (1)

Cite: The material in this blog post and subsequent related ones are very loosely based on (Abadi et al., 1991) and my own notes previously from taking 15-417 HOT Compilation. I have contributed no original ideas besides some code.

So I thought about just calling this article “Suffering with Substitutions”. That’s a equally suitable title at first glance since much of this code is very painful to write. Unsurprisingly I have yet to hear anyone express fondness with DeBruijn indicies, given the sheer amount of work it requires to prove a bajillion substitution theorems.

But given that I am getting old and I haven’t written any real serious code in almost 8 years, it turns out that the less you write code the worse you get (surprise!). So I thought it would be a fun (?) coding challenge to at least hone my skills a little bit, and if anything hopefully this would be educational for at least someone out there who is doing this for the first time.

So here I will develop the material from scratch, explaining what DeBruijn indicies are and how to actually code them up, in perhaps way too much detail. I assume very little in terms of prerequisites, besides some basic beginner knowledge in programming languages that is taught at an undergraduate level.

The Problem

As is tradition, we start with the simply typed lambda calculus. The reason we start here is that this is a particularly simple case (first order) to implement, at least relatively speaking. We will work with two sorts: types ($\tau$) and terms ($e$), and two types: the unit type and arrow type.

\begin{equation} \tau ::= \mathbb{1} \space | \space \tau \rightarrow \tau’ \end{equation} \begin{equation} e ::= () \space | \space x \space | \space \lambda (x : \tau). e \space | \space app(e, e’) \end{equation}

Note that the abstract syntax for lambda is $\lambda(\tau, x.e)$. For those who are not used to this presentation, it just means that $x$ is bound in $e$ and not in $\tau$.

With the following rules for the typing judgement $\bbox[5px, border: 2px solid blue]{\Gamma \vdash e : \tau}$:

var:

\[\begin{align} x : \tau \in \Gamma \nonumber \\ \hline \nonumber \\ \Gamma \vdash x : \tau \nonumber \\ \end{align}\]

$\mathbb{1}$-intros:

\[\begin{align} \nonumber \\ \hline \nonumber \\ \Gamma \vdash () : \mathbb{1} \nonumber \\ \end{align}\]

arrow-intros:

\[\begin{align} & \Gamma , x : \tau \vdash e : \tau' \nonumber \\ & \hline \nonumber \\ & \Gamma \vdash \lambda (x : \tau). e : \tau \rightarrow \tau' \nonumber \\ \end{align}\]

arrow-elim:

\[\begin{align} & \Gamma \vdash e_1 : \tau \rightarrow \tau' & \Gamma \vdash e_2 : \tau \nonumber \\ & \hline \nonumber \\ & \Gamma \vdash app(e_1, e_2) : \tau' \nonumber \\ \end{align}\]

Simple enough? Now let’s look at problems in terms of implementation.

I’m sure that most people reading this would be able to state and quickly prove some very simple metatheoretic properties, at least on paper. Now the question here is how do you code this up. The most common (and often, painful) issue you would encounter is the representation of variables, binding and substitution.

A particularly bad way of representing variables is a naive implementation of variables using names (like some unique integer, or even worse, strings). This is almost always a bad idea, because in particular, we respect equality on trees up to $\alpha$-equivalence, so:

\[\lambda (x : \tau).x =_\alpha \lambda (y : \tau).y\]

need to hold (we will just write $=$ instead of $=_\alpha$ in the future, since when doing metatheory, equality is always up to $\alpha$-equivalence with no exceptions).

This is highly inconvenient to say the least, since you will need to end up stating all your theorems such that they hold up to a custom implementation of equality itself.

In a compiler implementation, I personally think it is somewhat more acceptable if you represent variables as globally fresh names (usually by just integers), and do a little bit of tweaking in checking $\alpha$-equivalence, in a language such as ML where any form of definitional equality doesn’t exist anyway. However, relying on such a global state means that your typechecking implementation can’t be parallelized. And then there are some optimization passes you can do that potentially require comparing whether two terms are equal, and that is a huge hassle to implement correctly, though admittedly these use cases are niche.

So that aside, what else can you do? Well to my knowledge, you can alternatively:

  • Mechanize it in LF type theory (Harper et al., 1993).
  • Nominal techniques (Pitts, 2016). I admit I didn’t look very deep into this methodology. One day I will get around to it.
  • For certain very simple calculus like this one, intrinsic encodings would also work. For complex type theories where types have nontrivial equalities, it won’t, and hence we won’t discuss it here.

So Why Do You Care About DeBruijn Indicies?

Using HOAS encoding in LF Type Theory works for something simple like this, but as far as I understand it (correct me if I’m wrong), the $\Pi2$ restriction can be pretty limiting. In particular it is not very clear how you would do logical relations proofs in it. There are also a few proofs where you would need some tricks to do (Crary, 2009).

Then there is a much more practical issue: if you only care about mechanizing metatheory, that’s fine and all, but what if you need to write a compiler? Representing bindings as internal functions in the metalanguage doesn’t work very well in compilers since you will often need to peek into the function to inspect it. This would make functions practically impossible to compile. Hence you need to learn some alternative representation technique regardless.

Which brings us to DeBruijn indicies, the representation that makes everyone sad.

DeBruijn Indicies in a Nutshell

The idea for DeBruijn Indicies is simple: we represent variables as how many “hops” it takes to find its binder. For example, for the expression:

\[\lambda x. \lambda y. app(y, x)\]

There is a binder at $\lambda x.$, a binder at $\lambda y.$. Within the application, $y$ here refers to the closest binder and requires $0$ hops. $x$ here needs to jump across $\lambda y.$ to refer to the closest binder, so it requires $1$ hop. Hence, the representation looks like:

\[\lambda . \lambda . app(0, 1)\]

As you can tell, for sufficiently complicated expressions it can be quite hard to read due to needing to do these jumps yourself. Regardless, it should be clear that because we have wiped out all names, $\alpha$-equivalence is exactly just equality on trees.

So this is fine for closed expressions, but what about free variables? There are actually two ways to go about it:

Locally Nameless Forms

We can treat free variables completely differently from bound variables. For bound variables we use their DeBruijn index. For free variables, we can use something like a name that is globally unique. This is called Locally Namelss Forms.

This is a really good solution in general, because this means we can encode contexts as just unordered maps (due to structurality), so structurality theorems simply hold by very trivial arguments (exchange, weaken, etc). Substitution is also a lot easier to implement because for free variables it is merely a matter of comparing names. $\alpha$-equivalence still holds trivially, so that’s good.

The downside is that again you end up with a bunch of theorems to prove about fresh variables, etc, but at least those are easy, if not annoying. A typical implementation would use something like views to “open up” bindings to turn bound variables free, and that can be mildly annoying to implement and not possible to pattern match.

The bigger downside here is performance. If you look at the typechecking rule for lambda-intros, it is quite clear that a typical run of typechecking would almost certainly require “opening up” basically every single binder, and grant each of them fresh names. I suspect that the performance hit isn’t that bad, but it is there.

And then for systems where not all the structurality theorems (like exchange) hold, or at least not trivially, in my mind the benefits seem to be not as great since the representation of contexts require assumptions to be ordered anyway. This seems to be a noticable issue for implementation of dependent type theories.

In any case, this is the “easy” implementation, and not what we are going to do.

The Hard Way

The hard way is to treat free variables in a term as DeBruijn indicies relative to some context. To illustrate this, consider this judgement:

\[\Gamma , x : \tau , y : \tau \rightarrow \tau' \vdash app (y , x) : \tau'\]

If we consider the context itself as ordered and contains bindings, then within the application, $y$ needs to jump $0$ hops to get to the closest binding in the context, whereas $x$ needs to jump $1$ hop across $y : \tau$. Hence we have:

\[\Gamma , \tau , \tau \rightarrow \tau' \vdash app (0 , 1) : \tau'\]

Here, even though $0$ is free within the expression, it is “bound” in a sense relative to the judgement it is in.

The benefit here is that “opening up” a binder is free. For example in the lambda-intros rule, if we write it in DeBruijn indicies, it just looks like:

\[\begin{align} & \Gamma , \tau \vdash e : \tau' \nonumber \\ & \hline \nonumber \\ & \Gamma \vdash \lambda (\tau). e : \tau \rightarrow \tau' \nonumber \\ \end{align}\]

The indicies line up because the references to the variable $x$ within $e$ have the same DeBruijn index, and so does everything else. Hence this rule is very cheap to evaluate.

On the downside, it can be hard to keep track of and certainly very hard to read. Even from a mile away, you might be able to see that this can cause a bit of a headache (well, a lot) if we want to prove something like exchange.

We haven’t talked about actual substitution yet, but we can already write some starter code.

The Code

We shall code this up in Coq. We represent variables as natural numbers corresponding to their DeBruijn index.

Require Import List.Lists.

Inductive ty : Set :=
| Ty_one : ty
| Ty_arrow : ty -> ty -> ty
.

Inductive term : Set :=
| Term_var : nat -> term
| Term_unit : term
| Term_lam : ty -> term -> term
| Term_app : term -> term -> term
.

Note that because we don’t have variable names anymore, Term_lam here implicitly binds a variable in the subterm that it takes.

We also represent contexts as lists of types, but reversed.

Definition ctx := list ty.

Hopefully none of this so far is too surprising.

Now to encode the typing judgement.

Inductive checks : ctx -> term -> ty -> Set :=
| Checks_unit_intros :
  forall C, checks C Term_unit Ty_one
| Checks_lam_intros :
  forall C tau tau2 e,
    checks (tau :: C) e tau2 ->
    checks C (Term_lam tau e) (Ty_arrow tau tau2)
| Checks_lam_elim :
  forall C tau tau2 e e2,
    checks C e (Ty_arrow tau tau2) ->
    checks C e2 tau ->
    checks C (Term_app e e2) tau2
.

So far so good, again in the lambda-intros rule, indicies within $e$ have not been shifted, giving us a straightforward implementation.

Now all we are missing is the variable rule. This one is a bit tricky. Note that because we are using the list type, by the type itself it is technically possible to have DeBruijn indicies that overflow the list. We could consider using a length-indexed list, but this would also require us to make statements on, say, the largest DeBruijn index within the term to avoid the bounds check, which could potentially mean indexing the terms on some natural number representing the largest free DeBruijn index. And if we add products to the language, we’d also need to combine this index by taking the max of the two, which means we’d need to end up reasoning a lot about orders on natural numbers anyway.

So instead, we take a much more simplistic approach by just encoding the requirement within the judgement. A bit of a very mild performance hit for wrapping something into an Option, but that’s probably fine.

Inductive checks : ctx -> term -> ty -> Set :=
| Checks_unit_intros :
  forall C, checks C Term_unit Ty_one
| Checks_lam_intros :
  forall C tau tau2 e,
    checks (tau :: C) e tau2 ->
    checks C (Term_lam tau e) (Ty_arrow tau tau2)
| Checks_lam_elim :
  forall C tau tau2 e e2,
    checks C e (Ty_arrow tau tau2) ->
    checks C e2 tau ->
    checks C (Term_app e e2) tau2
| Checks_var :
  forall C n tau,
    List.nth_error C n = Some tau ->
    checks C (Term_var n) tau
.

Next Up

I just realised I’ve written quite a bit, so everything else will have to be next time. Next time I plan on proving some basic structurality theorems (painful), and discuss implementation of substitution.

References

  1. Explicit substitutions
    M. Abadi, L. Cardelli, P.-L. Curien, and 1 more author
    Journal of Functional Programming, 1991
  2. A framework for defining logics
    Robert Harper, Furio Honsell, and Gordon Plotkin
    J. ACM, Jan 1993
  3. Nominal techniques
    Andrew Pitts
    ACM SIGLOG News, Feb 2016
  4. Explicit Contexts in LF (Extended Abstract)
    Karl Crary
    Electron. Notes Theor. Comput. Sci., Jan 2009