Verona is a new language being developed by Microsoft Research Cambridge which explores the concept of concurrent ownership. Accessing shared memory in a thread-safe manner needs all sorts of atomic access controls. While modern CPUs implement cache locality and atomic instructions (in addition to previous generation’s inefficient catch-all barriers), such fine control is only meaningful for small objects (usually register sized).

If you want to share large areas of (concurrently) mutable memory, you generally introduce the idea of locks, and with that, comes dead-locks, live-locks and all sorts of memory corruption problems. Alternatives, like message passing, exist for a number of decades, but they’re usually efficient for (again) small objects. It’s quite inefficient to send MB or GB sized mutable blobs in messages.

Verona aims to fix that by passing the ownership of a mutable region of memory as a message instead. The message queues are completely lock-free (using Atomic-Swap lock-free data structures) and only pass the entry-point to a region (an isolated reference to the whole mutable memory blob) as the unique ownership. Meaning only the thread that has that isolated object can access any memory in the region contained within. So each thread has the strong guarantee that no one else is accessing the entire region and there are no chances of concurrent mutation and no need for locks.

MLIR is a multi-level intermediate representation developed within the TensorFlow project but later migrated under the LLVM umbrella. As a core part of the LLVM project, MLIR is being used in a lot more than just ML representations and is gaining traction as the high-level representation of some language front-ends (Fortran, Hardware description) and other experiments bringing Clang to use MLIR as its own intermediate language.

The main benefit of using MLIR for language lowering is that you can keep the language semantics as high level as you need, by constructing dialect operations that encode the logic you want, and then creating passes that use those operations to infer behaviour and types or to transform into a more optimal format, before lowering to the standard dialects and further down, LLVM IR.

This fixes the two big problems in front-ends: on the one hand, it’s a lot easier to work with flexible IR operations than AST (abstract syntax tree) nodes, and on the other hand, we only lower to LLVM IR (which is very low level) when we’re comfortable we can’t extract any more special behaviour from the language semantics. Other existing dialects add to the list of benefits, as they already have rich semantics and existing optimisation passes we can use for free.

Why Verona needs MLIR

Verona aims to be easy to program but powerful to express concurrent and rich type semantics without effort. However, Verona’s type system is far from simple. C-like languages usually have native types (integer, float, boolean) that can be directly represented in hardware, or are simple to be operated by a runtime library; and container types (lists, sets, queues, iterators), which offer different views and access on native types. Some languages also support generics, which is a parametrisation of some types on other types, for example, a container of any type, to be defined later.

Verona has all of that, plus:

  1. Type capabilities (mutable, immutable, isolated). Controlling the access to objects with regards to mutability (ex. immutable objects are stored in immutable memory outside of mutable regions) as well as region sentinels (isolated) that cannot be held by more than one reference from outside the region.
  2. Type unions (A | B). This feature allows users to create functionality that works with multiple types, allowing easy restriction on the types passed and matching specific types in the code (via keyword match) with the guarantee that the type will be one of those.
  3. Type intersections (A & B). This allows restricting types with capabilities, making it harder to have unexpected access, for example, returning immutable references or identifying isolated objects on creation. It can also help designing interfaces, creating requirements on objects (ex. to be both random-access and an ordered collection). But also as function arguments, controlling the access to received objects.
  4. Strong inferred types. The compiler will emit an error if types cannot be identified at compile time, but users don’t need to declare them everywhere, and sometimes they can’t even be known until the compiler runs its own type inference pass (ex. generics, unions, or lambdas).

Verona currently uses a PEG parser that produce the AST and is quite straight forward, but once it’s constructed, working with ASTs (more specifically creating new nodes, replacing or updating existing ones) is quite involved and error prone. MLIR has some nice properties (SSA form, dialect operations, regions) that make that work much easier to change. But more importantly, the IR has explicit control flow (CFG), which is important for tracking where variables come from, pass through all combinations of paths, and ultimately end up at. To infer types and check safety, this is fundamental to make sure the code can’t get to an unknown state through at least one of the possible paths.

So, the main reason why we chose MLIR to be our representation is so we can do our type inference more easily.

The second reason is that MLIR allows us to mix any number of dialects together. So we can lower the AST into a mix of Verona dialect and other standard dialects, and passes that can only see Verona operations will ignore the others and vice-versa. It also allows us to partially lower parts of the dialect into other dialects without having to convert the whole thing. This keep the code clean (short, to-the-point passes) and allows us to slowly build more information, without having to run a huge analysis pass followed by a huge transformation pass, only to lose information in the middle.

An unexpected benefit of MLIR was that it has native support for opaque operations, ie. function-call-like operations that don’t need defining anywhere. This allowed us to prototype the dialect even before it existed, and was the precursor of some of our current dialect operations. We’re still using opaque nodes where the dialect is not complete yet, allowing us to slowly build the dialect without having to rush through (and fail at) a long initial design phase.

Where are we at

Right now, we have a dialect, a simple and incomplete MLIR generator and a few examples. None of those examples can be lowered to LLVM IR yet, as we don’t have any partial conversion to other standard dialects. Once we have a bit more support for the core language, and we’re comfortable that the dialect is at the right level of abstraction, we’ll start working on the partial conversion.

But, like other examples in MLIR, we’ll have to hold on to strings and printing in our own dialect until it can be lowered directly to the LLVM dialect. This is because MLIR has no native string representation and there is no sane way of representing all types of strings in the existing types.

Other missing important functionality are:

  • when keyword, which controls access to the regions (requests ownership),
  • match keyword, which controls the types (ex. from a type union) in the following block,
  • lambda and literal objects, which will create their own capture context via anonymous structures and expose a function call.

We also need to expose some minimal runtime library written in Verona to operate on types (ex. integer/floating-point arithmetic and logic), and we need those classes compiled and exposed to user code as an MLIR module, so that we can look at the code as a whole and do more efficient optimisations (like inlining) as well as pattern-matching known operations, like addition, and lower them to native LLVM instructions (ex. add or fadd).

Here be dragons

While we’re always happy to accept issues and pull requests, the current status of the language is raw. We’re co-designing the language, the compiler and the runtime library, as well as its more advanced features such as its clang interface and process sandbox. All of which would need multiple additional blog posts to cover, and all in continuous discussions to define both syntax and semantics.

By the time we have some LLVM IR generated and hopefully some execution of a simple program, the parser, compiler and libraries will be a bit more stable to allow external people to not only play with it by contribute back.

What would be extremely helpful, though, are tough questions about the language’s behaviour, our own expectations and the API that is exposed to programmers by the runtime. There are a lot of unknowns and until we start writing some serious Verona code, we won’t know for sure what works better. If you’re feeling brave, and would like to create issues with examples of what you would like to see in the language, that’d be awesome.

Also issues (and even pull requests) on the existing implementation would be nice, with recommendations of better patterns on our usage of the external libraries, for example MLIR, which is in constant evolution on its own, and we can’t keep up with every new shiny feature.

So, patches welcome, but bring your dragon-scale armour, shield and fire resistance spells.