Declaration & resolution

2022 JUNE 19 LANG, Z

Consider a fairly simple imperative language, which consists of declarations and statements. Declarations declare and define items, like constants, functions and types, and statements change the runtime program state in some way. Declarations may refer to each other and themselves in any arbitrary order, while statements are executed sequentially and may only refer to items created “before” themselves.

fun main
  for i in 0 upto 100 when odd?(i)
    do println(i)
  end
end

fun odd?(x: Int) =>
  x mod 2 = 1

Because of this setup, name resolution must probably be a two-pass thing, where the compiler first finds all the names of declared items and only then resolves the names.

For a batch compiler, that's straightforward enough, but for a query-based compiler, like you might find in a language server, things are a bit more complicated.

Ideally, we want a query-based pass to be a pure-ish, idempotent operation such that we can easily cache them and only rerun the parts that have changed. Additionally, we probably don't want to run the declaration pass over every single file simultaneously, in order to be able to cache at a more granular level.

This brings us to an important first clue. When we are just declaring the items in a given file, we do not need to know what is going on in other files .¹ The big win here is that name declaration can happen file-by-file. This seems to imply that (part of) the declaration pass should look like this:

decls_in_file: List(Ast) -> List(Name)

with the AST list containing all of the trees in a given file. This function can be a dead simple recursive pattern match, and is easily pure and idempotent.

Note also that this function also doesn't need to care about statements and the names declared within those. Because of their sequential and progressively scoped nature, names declared in a statement block are very naturally handled by the resolution pass.

This isn't the whole story, however. If our language was a simple statements-within-declarations kind of language, then the list of names is a perfectly fine thing for the decls_in_file function to return (since there's only the top-level declaration block). If we want to support inline modules or declarative sections within a statement block, however, then things will get a bit more hairy.

Consider:

fun main
  let x = hidden()
  do println(x)
where 
  fun hidden: Int
    let b, c = 2, 3
    then
      return d
    where
      let d = a + b + c
    end
  end

  let a = 4
end

Here, we have a declaration (of main), which contains some statements (the let and do), but also another declarative section (introduced by the where keyword). Inside that section, two more items are declared (hidden and a). The function itself contains some statements, one of which is a block statement that contains a declarative section, and that declarative section is allowed to reference items created by statements preceeding this one.

This example is needlessly convoluted, but the ability to declare function local types and items can be quite a useful method of encapsulation, as something slightly more lightweight than a private module, for example.

Regardless, the big takeaway is that even within a file, the declared items should form something closer to a tree than a list. For the example above, we may want to end up with something like this:

(file)
└─ main
   ├─ a
   ├─ hidden
   │  └─ (statements)
   │    ├─ b
   │    ├─ c
   │    └─ (then)
   │       ├─ d
   │       └─ (statements)
   └─ (statements)
      └─ x

Resolving

With that information, we should be able to resolve names. This pass may no longer be file local, since we might be importing items from other modules in other files. However, that is fine – we can just issue a query that runs the declaration pass on the appropirate file(s) whenever we hit an import. That way, we can ensure that the resolver pass doesn't need any more non-file-local information than absolutely necessary, which keeps the caching granular and nice.

But how exactly do we approach the resolution? When we are recursing through the syntax trees, how do we keep track of where in the declaration tree we are? This gets especially tricky with declaration blocks, like in the example, where the actual declarative section is without a name and within some statement.

We can employ another key observation here: we very rarely care about looking inside a scope, but we very often do care about looking outside one. Pretty much only in the case of modules do we even care about going inside of a scope to find more names, and it's pretty much always explicit when we're doing that (because of imports or Module.name syntax).

This seems to suggest that the important information is not which scopes a scope contains, but which scope is the parent of some other scope.

We still need some way of figuring out which scope we are in while traversing a syntax tree, but that can be achieved by assosciating each scope with a unique id and each scope-opening construct (block, declaration, etc) with the id of its scope.

file[0]:
  fun[1] main:
    decls:
      fun[2] hidden:
        stmts[3]:
          let[4] b = 2
          let[5] c = 3
          then[6]:
            decls:
              let[7] d = a + b + c
            stmts:
              return d
      let[8] a = 4
    stmts[9]:
      let[10] x = hidden()
      do println(x)

 0: (file)
    - main
 1: main         [in 0]
    - a
    - hidden
 2: hidden       [in 1]
 3: (statements) [in 2]
    - b
    - c
 4: b            [in 3]
 5: c            [in 3]
 6: (then)       [in 3]
    - d
 7: d            [in 6]
 8: a            [in 1]
 9: (statements) [in 1]
    - x
10: x            [in 9]

Note that each declaring construct (fun, let, etc.) has been assosciated with the id of the scope it introduces. Each scope also contains a list of the names declared in that scope in the form it appears in the source. Ultimately, we end up with an AST type that looks something like this:

enum Decl {
    Fun(ScopeId, Name, Vec<Decl>, Stmts),
    Let(ScopeId, Name, Expr),
    // ...
}

struct Stmts {
    scope: ScopeId,
    stmts: Vec<Stmt>,
}

enum Stmt {
    Do(Expr),
    Let(ScopeId, Name, Expr),
    Then(ScopeId, Vec<Decl>, Stmts),
    // ...
}

// ...

And a file scope type that looks like this:

enum ScopeName {
    File,
    StatementList,
    Block,
    Name(Name),
}

struct Scope {
    name: ScopeName,
    parent: Option<ScopeId>,
    children: Vec<(String, Name)>,
}

struct Scopes {
    scopes: HashMap<ScopeId, Scope>,
}

Resolving a name just involves keeping track of the nearest scope id, and consulting the appropriate Scopes.

impl Scopes {
    fn find_name(&self, at: ScopeId, name: &str) -> Option<Name> {
        let scope = self.scopes.get(id).unwrap();

        // search this scope
        for (other, res) in scope.children.iter() {
            if other == name {
                return Some(res);
            }
        }

        // if not found, search the parent scope
        scope.parent
            .map(|at| self.find_name(at, name)
    }
}

There's some extra care needed to handle statements and statement scopes properly, but that's the general idea.