On using MLIR for Verona

Verona is a new language being developed by Microsoft Research Cambridge which explores the concept of concurrent ownership. Accessing shared memory in a thread-safe manner requires all sorts of atomic access controls. While modern CPUs implement cache locality and atomic instructions (in addition to previous generations’ inefficient catch-all barriers), such fine control is only meaningful for small objects (usually register sized).

If you want to share large areas of (concurrently) mutable memory, you generally introduce the idea of locks, and with that come dead-locks, live-locks and all sorts of memory corruption problems. Alternatives, like message passing, have existed for decades, but they’re usually only efficient for (again) small objects. It’s quite inefficient to send MB- or GB-sized mutable blobs in messages.

Verona aims to fix that by passing the ownership of a mutable region of memory as a message instead. The message queues are completely lock-free (using atomic-swap lock-free data structures) and only pass the entry point to a region (an isolated reference to the whole mutable memory blob), which carries its unique ownership. Only the thread that holds that isolated reference can access any memory contained within the region, so each thread has the strong guarantee that no one else is accessing the region: there is no chance of concurrent mutation and no need for locks.
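To make that concrete, here is a minimal C++ sketch of the idea (purely an analogy: the Region and Mailbox names are mine, and a mutex-protected queue stands in for Verona’s lock-free runtime queues). Ownership of a whole region travels as a single move-only handle, so whoever receives it can mutate the data without any locking.

    #include <memory>
    #include <mutex>
    #include <queue>
    #include <vector>

    // A "region": an arbitrarily large blob of mutable data.
    struct Region {
        std::vector<int> data;
    };

    // A mailbox that transfers unique ownership of regions between threads.
    // Verona's runtime uses lock-free queues; a mutex is used here only to
    // keep the sketch short.
    class Mailbox {
        std::mutex m;
        std::queue<std::unique_ptr<Region>> q;
    public:
        void send(std::unique_ptr<Region> r) {   // sender gives the region up
            std::lock_guard<std::mutex> lock(m);
            q.push(std::move(r));
        }
        std::unique_ptr<Region> receive() {      // receiver becomes the sole owner
            std::lock_guard<std::mutex> lock(m);
            if (q.empty()) return nullptr;
            auto r = std::move(q.front());
            q.pop();
            return r;                            // no locks needed to mutate *r
        }
    };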

MLIR is a multi-level intermediate representation developed within the TensorFlow project but later migrated under the LLVM umbrella. As a core part of the LLVM project, MLIR is being used for a lot more than just ML representations and is gaining traction as the high-level representation of some language front-ends (Fortran, hardware description languages), as well as in experiments to make Clang use MLIR as its own intermediate language.

The main benefit of using MLIR for language lowering is that you can keep the language semantics as high level as you need, by constructing dialect operations that encode the logic you want, and then creating passes that use those operations to infer behaviour and types or to transform into a more optimal format, before lowering to the standard dialects and further down, LLVM IR.

This fixes the two big problems in front-ends: on the one hand, it’s a lot easier to work with flexible IR operations than AST (abstract syntax tree) nodes, and on the other hand, we only lower to LLVM IR (which is very low level) when we’re comfortable we can’t extract any more special behaviour from the language semantics. Other existing dialects add to the list of benefits, as they already have rich semantics and existing optimisation passes we can use for free.

Why Verona needs MLIR

Verona aims to be easy to program but powerful enough to express concurrent and rich type semantics without effort. However, Verona’s type system is far from simple. C-like languages usually have native types (integer, float, boolean) that can be directly represented in hardware, or are simple for a runtime library to operate on; and container types (lists, sets, queues, iterators), which offer different views and access on native types. Some languages also support generics, which is a parametrisation of some types on other types, for example, a container of any type, to be defined later.

Verona has all of that, plus:

  1. Type capabilities (mutable, immutable, isolated). These control access to objects with regard to mutability (ex. immutable objects are stored in immutable memory outside of mutable regions), as well as region sentinels (isolated) that cannot be held by more than one reference from outside the region.
  2. Type unions (A | B). This feature allows users to create functionality that works with multiple types, allowing easy restriction on the types passed and matching specific types in the code (via keyword match) with the guarantee that the type will be one of those.
  3. Type intersections (A & B). This allows restricting types with capabilities, making it harder to have unexpected access, for example, returning immutable references or identifying isolated objects on creation. It can also help in designing interfaces, creating requirements on objects (ex. to be both a random-access and an ordered collection), and in function arguments, controlling the access to received objects.
  4. Strong inferred types. The compiler will emit an error if types cannot be identified at compile time, but users don’t need to declare them everywhere, and sometimes they can’t even be known until the compiler runs its own type inference pass (ex. generics, unions, or lambdas).

Verona currently uses a PEG parser that produces the AST and is quite straightforward, but once the AST is constructed, working with it (more specifically creating new nodes, or replacing or updating existing ones) is quite involved and error-prone. MLIR has some nice properties (SSA form, dialect operations, regions) that make that kind of work much easier. But more importantly, the IR has explicit control flow (CFG), which is important for tracking where variables come from, which combinations of paths they pass through, and where they ultimately end up. To infer types and check safety, this is fundamental to make sure the code can’t reach an unknown state through any of the possible paths.

So, the main reason why we chose MLIR to be our representation is so we can do our type inference more easily.

The second reason is that MLIR allows us to mix any number of dialects together. So we can lower the AST into a mix of the Verona dialect and other standard dialects, and passes that can only see Verona operations will ignore the others, and vice-versa. It also allows us to partially lower parts of the dialect into other dialects without having to convert the whole thing. This keeps the code clean (short, to-the-point passes) and allows us to slowly build more information, without having to run a huge analysis pass followed by a huge transformation pass, only to lose information in the middle.
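As a rough illustration of what such a mixed pipeline looks like with MLIR’s C++ API (a sketch only; details vary between MLIR versions, and the Verona-specific passes are just placeholders in the comments):

    #include "mlir/IR/BuiltinOps.h"
    #include "mlir/IR/MLIRContext.h"
    #include "mlir/Pass/PassManager.h"
    #include "mlir/Transforms/Passes.h"

    // Sketch: run a small pipeline over a module. Generic passes that don't
    // know about a given dialect simply leave its operations alone.
    mlir::LogicalResult runPipeline(mlir::MLIRContext &context,
                                    mlir::ModuleOp module) {
      mlir::PassManager pm(&context);
      pm.addPass(mlir::createCanonicalizerPass()); // generic, dialect-agnostic
      pm.addPass(mlir::createCSEPass());           // also works across dialects
      // Verona-specific passes (type inference, partial lowering) would be
      // added here once they exist.
      return pm.run(module);
    }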

An unexpected benefit of MLIR was that it has native support for opaque operations, ie. function-call-like operations that don’t need defining anywhere. This allowed us to prototype the dialect even before it existed, and was the precursor of some of our current dialect operations. We’re still using opaque nodes where the dialect is not complete yet, allowing us to slowly build the dialect without having to rush through (and fail at) a long initial design phase.

Where are we at

Right now, we have a dialect, a simple and incomplete MLIR generator and a few examples. None of those examples can be lowered to LLVM IR yet, as we don’t have any partial conversion to other standard dialects. Once we have a bit more support for the core language, and we’re comfortable that the dialect is at the right level of abstraction, we’ll start working on the partial conversion.

But, like other examples in MLIR, we’ll have to hold on to strings and printing in our own dialect until it can be lowered directly to the LLVM dialect. This is because MLIR has no native string representation and there is no sane way of representing all types of strings in the existing types.

Other important missing functionality includes:

  • when keyword, which controls access to the regions (requests ownership),
  • match keyword, which controls the types (ex. from a type union) in the following block,
  • lambda and literal objects, which will create their own capture context via anonymous structures and expose a function call.

We also need to expose some minimal runtime library written in Verona to operate on types (ex. integer/floating-point arithmetic and logic), and we need those classes compiled and exposed to user code as an MLIR module, so that we can look at the code as a whole and do more efficient optimisations (like inlining) as well as pattern-matching known operations, like addition, and lower them to native LLVM instructions (ex. add or fadd).

Here be dragons

While we’re always happy to accept issues and pull requests, the current status of the language is raw. We’re co-designing the language, the compiler and the runtime library, as well as their more advanced features, such as the clang interface and the process sandbox. All of these would need multiple additional blog posts to cover, and all are under continuous discussion to define both syntax and semantics.

By the time we have some LLVM IR generated and, hopefully, some execution of a simple program, the parser, compiler and libraries will be a bit more stable, allowing external people to not only play with it but also contribute back.

What would be extremely helpful, though, are tough questions about the language’s behaviour, our own expectations and the API that is exposed to programmers by the runtime. There are a lot of unknowns and until we start writing some serious Verona code, we won’t know for sure what works better. If you’re feeling brave, and would like to create issues with examples of what you would like to see in the language, that’d be awesome.

Issues (and even pull requests) on the existing implementation would also be nice, with recommendations of better patterns for our usage of external libraries, for example MLIR, which is in constant evolution of its own and whose every new shiny feature we can’t keep up with.

So, patches welcome, but bring your dragon-scale armour, shield and fire resistance spells.

LLVM Vectorizer

Now that I’m back working full-time with LLVM, it’s time to get some numbers about performance on ARM.

I’ve been digging into the new LLVM loop vectorizer and I have to say, I’m impressed. The code is well structured, extensible and, above all, sensible. There is lots of room for improvement, and the code is simple enough that you can do it without destroying the rest or having to re-design everything.

The main idea is that the loop vectorizer is a Loop Pass, which means that if you register this pass (automatically on -O3, or with -loop-vectorize option), the Pass Manager will run its runOnLoop(Loop*) function on every loop it finds.
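For reference, this is the shape of that interface in the (now legacy) pass manager; the skeleton below is a sketch, not the actual vectorizer code:

    #include "llvm/Analysis/LoopPass.h"
    using namespace llvm;

    namespace {
    // Sketch of the legacy pass-manager interface the text refers to:
    // the pass manager calls runOnLoop() once per loop it finds.
    struct MyLoopPass : public LoopPass {
      static char ID;
      MyLoopPass() : LoopPass(ID) {}

      bool runOnLoop(Loop *L, LPPassManager &LPM) override {
        // Inspect or transform *L here; return true if the IR was changed.
        return false;
      }
    };
    } // namespace
    char MyLoopPass::ID = 0;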

The three main components are:

  1. The Loop Vectorization Legality: Basically identifies if it’s legal (not just possible) to vectorize. This includes checking if we’re dealing with an inner loop, and if it’s big enough to be worth it, and making sure there aren’t any conditions that forbid vectorization, such as overlaps between reads and writes or instructions that don’t have a vector counterpart on a specific architecture. If nothing is found to be wrong, we proceed to the second phase:
  2. The Loop Vectorization Cost Model: This step will evaluate both versions of the code: scalar and vector. Since each architecture has its own vector model, it’s not possible to create a common model for all platforms, and in most cases it’s the special behaviour that makes vectorization profitable (like 256-bit operations in AVX), so we need a bunch of cost model tables that we consult given an instruction and the types involved. Also, this model doesn’t know how the compiler will lower the scalar or vectorized instructions, so it’s mostly guess-work. If the vector cost (normalized to the vector size) is less than the scalar cost, we proceed to:
  3. The Loop Vectorization: Which is the vectorization proper, ie. walking through the scalar basic blocks, changing the induction range and increment, creating the prologue and epilogue, promoting all types to vector types and changing all instructions to vector instructions, taking care to leave the interaction with the scalar registers intact. This last part is a dangerous one, since we can end up creating a lot of copies from scalar to vector registers, which is quite expensive and is not accounted for in the cost model (remember, the cost model is guess-work based).

All that happens on a new loop place-holder, and if all is well at the end, we replace the original basic blocks by the new vectorized ones.
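To picture the kind of loop all this machinery targets, take a simple saxpy kernel: assuming the legality phase has proven the arrays don’t overlap, four single-precision iterations can be packed into each 128-bit vector operation, with a scalar epilogue for the leftovers.

    // A loop of the kind the vectorizer targets: with 128-bit vectors and
    // 32-bit floats, four iterations execute per vector instruction, plus a
    // scalar epilogue for the remaining n % 4 elements.
    void saxpy(float *y, const float *x, float a, int n) {
      for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
    }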

So, the question is: how good is this? Well, depending on the problems we’re dealing with, vectorizers can considerably speed up execution, especially for iterative algorithms with lots of loops, like matrix manipulation, linear algebra, cryptography, compression, etc. In more practical terms, anything to do with encoding and decoding media (watching or recording videos, pictures, audio), Internet telephony (compression and encryption of audio and video), and all kinds of scientific computing.

One important benchmark for that kind of workload is Linpack. Not only does Linpack have many examples of loops waiting to be vectorized, but it’s also the benchmark that defines the Top500 list, which classifies the fastest computers in the world.

Benchmarks

So, both GCC and Clang now have their vectorizers turned on by default with -O3, so comparing them is as simple as compiling the programs and seeing them fly. But, since I’m also interested in seeing what the performance gain is with just the LLVM vectorizer, I also ran Clang with only -O3 and the vectorizer disabled.

On x86_64 Intel (Core i7-3632QM), I got these results:

Compiler    Opt    MFLOPS  Diff
   Clang    -O3      2413  0.0%
     GCC  -O3 vec    2421  0.3%
   Clang  -O3 vec    3346 38.6%

This is some statement! The GCC vectorizer has existed for a lot longer than LLVM’s and has been developed by many vectorization gurus, yet LLVM seems to easily beat GCC in this field. But, a word of warning: Linpack is by no means representative of all use cases and user-visible behaviour, and it’s very likely that GCC will beat LLVM in most other cases. Still, a reason to celebrate, I think.

This boost means that, in many cases, not only are the transformations legal and correct (or Linpack would have produced wrong results), but they also manage to generate faster code at no discernible cost. Of course, the theoretical limit is around a 4x boost (if you manage to replace every single scalar instruction with a vector one and the CPU behaves the same with respect to branch prediction, caches, etc.), so one could have expected a somewhat higher number, something on the order of 2x.

It depends on the computational density we’re talking about. Linpack specifically tests the inner loops of matrix manipulation, so I’d expect a much higher ratio of improvement, something around 3x or even closer to 4x. VoIP calls, watching films and listening to MP3s are also good examples of densely packed computation, but since we usually run those applications on a multi-tasking operating system, you’ll rarely see improvements higher than 2x. General applications rarely spend that much time in inner loops (they’re mostly waiting for user input and then doing a bunch of unrelated operations, hardly vectorizable).

Another important aspect of vectorization is that it saves a lot of battery juice. For MP3 decoding it doesn’t really matter whether you finish in 10 or 5 seconds, as long as the music doesn’t stop to buffer. But taking 5 seconds instead of 10 means that for the other 5 seconds the CPU can reduce its voltage and save battery. This is especially important on mobile devices.

What about ARM code?

Now that we know the vectorizer works well, and the cost model is reasonably accurate, how does it compare on ARM CPUs?

It seems that the grass is not so green on this side, at least not at the moment. I have reports that on ARM it also reached a boost of around 40%, similar to Intel’s, but what I saw was a different picture altogether.

On a Samsung Chromebook (Cortex-A15) I got:

Compiler   Opt    MFLOPS   Diff
  Clang    -O3       796   0.0%
    GCC  -O3 vec     736  -8.5%
  Clang  -O3 vec     773  -2.9%

The performance regression can be explained by the amount of scalar code intermixed with vector code inside the inner loops as a result of shuffles (movement of data within the vector registers and between scalar and vector registers) not being lowered correctly. This most likely happens because the LLVM back-end relies a lot on pattern-matching for instruction selection (a good thing), but the vectorizers might not be producing the shuffles in the right pattern, as expected by each back-end.

This can be fixed by tweaking the cost model to penalize shuffles, but it’d be good to see if those shuffles aren’t just mismatched against the patterns that the back-end is expecting. We will investigate and report back.
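As a generic illustration (not the actual Linpack kernel), a reduction like the one below is exactly the kind of code where this hurts: the vectorized loop keeps partial sums in a vector register, and the final horizontal sum is a shuffle-heavy step that the cost model may be under-pricing.

    // A dot-product reduction: the vectorized loop keeps partial sums in a
    // vector register, and the final horizontal add (a shuffle-heavy step)
    // collapses them back into a single scalar.
    float dot(const float *a, const float *b, int n) {
      float sum = 0.0f;
      for (int i = 0; i < n; ++i)
        sum += a[i] * b[i];
      return sum;
    }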

Update

Got results for single precision floating point, which show a greater improvement on both Intel and ARM.

On x86_64 Intel (Core i7-3632QM), I got these results:

Compiler   Opt   MFLOPS  Diff
   Clang   -O3     2530   0.0%
     GCC -O3 vec   3484  37.7%
   Clang -O3 vec   3996  57.9%

On a Samsung Chromebook (Cortex-A15) I got:

Compiler   Opt   MFLOPS  Diff
   Clang   -O3      867   0.0%
     GCC -O3 vec    788  -9.1%
   Clang -O3 vec   1324  52.7%

Which goes to show that the vectorizer is, indeed, working well for ARM, but the costs of using the VFP/NEON pipeline outweigh the benefits in the double-precision case. Remember that NEON vectors are only 128 bits wide and VFP only 64, and NEON has no double-precision floating-point operations, so the core can only do one double-precision floating-point operation per cycle and the theoretical maximum depends on the speed of the soft-fp libraries.

So, in the future, what we need to work on is the cost model, to make sure we don’t regress in performance, and to find better algorithms when lowering vector code (both by making sure we match the patterns that the back-end is expecting, and by simply finding better ways of vectorizing the same loops).

Conclusion

Without further benchmarks it’s hard to come to a final conclusion, but it’s looking good, that’s for sure. Since Linpack is part of the standard LLVM test-suite benchmarks, fixing this and running it regularly on ARM will at least avoid any further regressions… Now it’s time to get our hands dirty!

 

2011 European User Group Meeting

For all LLVM users in Europe that didn’t catch the announcement on the main list, be warned that we’re organising the European LLVM User Group Meeting on September 16th, in London. All the information necessary to register and submit your work is at the link.

It’s free of charge and we’ll also provide a happy hour after the meeting for some extended networking. If you’re just curious about LLVM, work with it day to day, or are thinking of starting your PhD with LLVM, please register and show up. You won’t regret it. 😉

More information will be available on the link above and through the main mailing list. Stay tuned!

Zero-cost exception handling in C++

C++ exception handling is a controversial subject. The intricate concepts are difficult to grasp, even for seasoned programmers, but in complexity nothing compares to its implementation. I can’t remember any real-world (commercial or otherwise) product that makes heavy use of exception handling. Even the STL itself only throws a couple of them, and only upon terminal failures.

Java exceptions, on the other hand, are much more common. Containers throw them at will (ex. OutOfBounds), creating your own exceptions is very easy, and developers are encouraged to use them. But still, some C++ projects’ coding standards discourage (like LLVM’s) or outright disallow (like Google’s) the use of exceptions. The Android NDK, for example, doesn’t even support exceptions (nor does it plan to).

Some of the cons commonly listed are the encouragement of bad style (throwing at will, like in Java), compilation problems with old code, and all the gotchas that exception handling can bring to your code, making it a nightmare to understand the real flow of your program and to figure out the critical DON’Ts of exception handling (like throwing from inside a destructor). For a deeper discussion on that subject, read Sutter’s book; for the implementation details of exception handling from a compiler’s point of view, keep reading.

Three distant worlds

Exception handling is not just a language construct, like if; it’s an intricate interaction between three distant worlds. All the complications that exception handling brings us have to be implemented by a three-layer infrastructure: the user code, the run-time library and the compiler. The communication between them has to be absolutely perfect: one bit off and the control flow goes south, you start executing random code and blow up your program for good.

Not only are these three very different places, but the code is written by three different kinds of people, at different times. The only formal document that binds them together is the ABI (ex. the Itanium C++ ABI), which is, to say the least, vague. Luckily, the people who write the run-time libraries are close enough to the people writing the compiler (and hopefully, but not necessarily, the same people writing the ABI).

In the end, if the compiler implementation is correct, the user only has to care about the rules imposed by the C++ standard. But that’s not too comforting: the C++ standard is complex and frequently driven by previous implementations rather than by fixed and clear rules of thumb. Most users don’t precisely understand why you can’t throw from inside a destructor, or why throwing a newly created object ends up calling terminate in their code; they just avoid it at all costs.

Dirty and expensive

The easy exception handling implementation is called SetJump/LongJump. It literally sets up jumps between your code and the exception handling code (created by the compiler). If an exception is thrown, the flow jumps directly to the exception handling code. This is very easy, but extremely expensive. Usually, if you follow good practices, you’d expect your program to spend most of its time doing real computation and only occasionally handling exceptions. But if for every action you create a bunch of exception handling data, you’re wasting a lot of resources checking for something that might not even happen during the whole run of your program.

When an exception happens, you need to walk back through the call stack and inspect whether any of those functions (including the one you’ve just thrown from) is catching that exception. This is called unwinding the stack, and in SetJump/LongJump exception handling it is done by inspecting whether any LongJump was called, via a series of SetJump checks. The problem is that those checks (and the set-up for them) are done even if an exception is never thrown at all.
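Conceptually, it behaves like the hand-written setjmp/longjmp pattern below (a simplified illustration, not the compiler-generated code): the recovery point is registered up front, on every run, whether or not anything ever throws.

    #include <csetjmp>
    #include <cstdio>

    static std::jmp_buf handler;     // "registered" recovery point

    void may_fail(bool fail) {
      if (fail)
        std::longjmp(handler, 1);    // "throw": jump straight to the handler
    }

    int main() {
      // The setjmp call runs on every execution, even when nothing fails:
      // that book-keeping is the cost zero-cost EH avoids.
      if (setjmp(handler) == 0) {
        may_fail(false);             // "try" body
        std::puts("no exception");
      } else {
        std::puts("caught");         // "catch" block
      }
    }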

Zero-cost

So, how is it possible to only execute exception handling code when an exception really happens? Simple, let the compiler/library do it for you. 😉

The whole idea of zero-cost exception handling is to create a set of functions (like __cxa_throw and __cxa_begin_catch) that deal with the exception handling flow by sending it down a different path, the EH path. In normal situations, your code flow is identical with or without exceptions. The normal flow can never reach the EH blocks, and it’s only because the compiler knows those blocks are on the EH path that they don’t get discarded as unreachable code, since there is no explicit flow between user code and EH code.

Only the run-time library knows how to get there, and that’s helped by a table, normally in a read-only data section, specific to exception handling (ex. the .eh_frame ELF section), created by the compiler when reading the user’s exception flow. This is the core interaction between user, compiler and library codes.

When an exception is thrown via __cxa_allocate_exception + __cxa_throw, the first function allocates space and prepares the object to be thrown, and the second starts reading the exception handling table. This table contains a list of exceptions that the particular piece of code is catching (often sorted by address, so it’s easy to do a binary search) and the code that will catch the exception.

The routine that decides what catches the exception is called the personality routine. Making it a library call enables you to use multiple zero-cost EH mechanisms in the same program. Normally, the table entry contains a pointer to the routine and some data it’ll use while unwinding, but some ABIs (like ARM’s EHABI) can add specially crafted tables for their purposes (like having the unwind code as 16-bit instructions embedded in the data section).

All in all, the general flow is the same: the personality routine will find the EH block (also called a landing pad, unreachable from normal code) and jump there, where the catching happens. That code will fetch the exception allocated before the throw and make it available to the user code (inside the catch block) to do what the user requested. If there is no entry in the EH table that catches that exception, the stack will be unwound again and the library will read the caller’s EH table, until it reaches a table that does catch it, or terminates if none does.
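None of this machinery is visible in user code; for a snippet like the one below, the compiler emits the runtime calls and the EH table entries behind the scenes (the comments describe the Itanium ABI flow in broad strokes):

    #include <stdexcept>

    int parse(int x) {
      if (x < 0)
        // Compiled into __cxa_allocate_exception + __cxa_throw, which hand
        // control to the unwinder and the personality routine.
        throw std::runtime_error("negative");
      return x * 2;
    }

    int safe_parse(int x) {
      try {
        return parse(x);           // normal path: no EH code executes at all
      } catch (const std::runtime_error &e) {
        // The personality routine finds this frame's table entry and jumps
        // to the landing pad, which calls __cxa_begin_catch/__cxa_end_catch.
        return -1;
      }
    }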

Clean-ups

So far so good, but what happens with the local variables and temporaries (stack-based data)? If they have non-trivial destructors, those must be called, which can add up to a lot of cleaning to do. Clean-up code lives partly in the (otherwise unreachable) EH blocks, for cleaning up during unwinding, and partly in normal code, for automatic destruction at the end of a lexical block. Although they do similar things, their purposes are completely different.

Normal flow clean-up blocks are always called and always clean up all objects. But EH clean-up code has to clean up only the objects that have been allocated so far, in the current lexical block or in any parent up to the root block of the function. If you can imagine deeply nested blocks, creating objects at will, with a throw in the increment part of a for loop, where the object being incremented has a non-trivial destructor, you get the picture.
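For instance, in the sketch below the EH path destroys only the objects constructed before the throw, while the normal path destroys all of them at the end of their blocks:

    #include <cstdio>

    struct Noisy {
      const char *name;
      explicit Noisy(const char *n) : name(n) { std::printf("ctor %s\n", name); }
      ~Noisy() { std::printf("dtor %s\n", name); }
    };

    void work(bool fail) {
      Noisy a("a");
      {
        Noisy b("b");
        if (fail)
          throw 1;       // EH clean-up: destroys b, then a, while unwinding
        Noisy c("c");
      }                  // normal clean-up: destroys c, then b
    }                    // normal clean-up: destroys a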

What happens if you throw an exception while another is being handled? Here is where complexity met reality and the boundaries were settled. If you are unwinding one exception, with the flow being dealt with by the run-time library, and another exception is thrown, you can’t distinguish which exception you are handling. Since there is room for only one exception to be unwound at a time, whenever a second exception is thrown during unwinding, you have to terminate.

Of course, you could argue for creating multiple lanes for exception handling, where each exception is handled separately, so you know that one exception was raised while throwing another. That may be easy to implement in Java (or other interpreted/bytecode languages), because the VM/interpreter has total control of the language, but as stated earlier, exception handling in C++ is just a contract between three parties: the compiler, the run-time library and the user, and often these three parties are completely different groups in completely different space-time realities.

Conclusion

This is just a very brief, shallow demonstration of how complex exception handling can be. Many other pitfalls appear, especially when dealing with specific user code running on a specific hardware platform with a specific binary configuration for the object file. It’s impossible to test every single combination (including room temperature, or the annual rainfall forecast) to call an implementation final, and weird bugs creep up every once in a while.

Enhancing the ABI to allow that is possible, but the complexities of doing so, together with the very scarce use of exception handling in C++, makes it a very unlikely candidate to be changed in the near future. It’d have to be done in so many levels (C++ standard, ABIs, compilers, run-time libraries) that it’s best left alone.

SLC 0.2.0

My pet compiler is now sufficiently stable for me to advertise it as a product. It should deal well with the most common cases if you follow the syntax, as there are some tests to assure minimum functionality.

The language is very simple, called “State Language”. It has only global variables and static states. The first state is executed first; all the others run only if you call goto state;. If you exit a state without branching, the program quits. This behaviour is consistent with the State Pattern, and that’s why it’s implemented this way. You can solve any computer problem using state machines, therefore this new language should be able to tackle them all.
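Since the syntax is still settling, here is the execution model expressed as a plain C++ state machine instead (the names are mine, not SLC syntax): the first state runs first, goto becomes returning the next state, and exiting without a goto quits.

    #include <cstdio>

    // Each state is modelled as an enum value; "goto state;" becomes
    // returning the next state, and returning Quit exits the program.
    enum class St { Start, Done, Quit };

    int counter = 0;             // "global variable"

    St runStart() { std::printf("start\n"); return St::Done; }  // goto Done;
    St runDone()  { std::printf("counter = %d\n", ++counter); return St::Quit; }

    int main() {
      St s = St::Start;          // the first state is executed first
      while (s != St::Quit) {
        switch (s) {
        case St::Start: s = runStart(); break;
        case St::Done:  s = runDone();  break;
        default:        s = St::Quit;   break;
        }
      }
    }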

The expressions are very simple, only binary operations, no precedence. Grouping is done with brackets and only the four basic arithmetic operations can be used. This is intentional, as I don’t want the expression evaluator to be so complex that the code will be difficult to understand.

As with every piece of code I write in my spare time, this one has an educational purpose. I learn by writing, and hopefully teach by providing the source, comments and posts, and by making it available on the internet so people can find it through search engines.

It should work on any platform you can compile it for (currently only Linux and Mac binaries are provided), but the binary is still huge (several megabytes) because it contains all the LLVM libraries statically linked into it.

I’m still working on it and will update the status here at every .0 release. I hope to have binary operations for if statements, string printing and all PHI nodes calculated for the next release.

The LLVM compilation infrastructure

I’ve been playing with LLVM (Low-Level Virtual Machine) lately and have produced a simple compiler for a simple language.

The LLVM compilation infrastructure (much more than a simple compiler or virtual machine) is a collection of libraries, methods and programs that allows one to create simple, robust and very powerful compilers, virtual machines and run-time optimizers.

Like GCC, it’s roughly separated into three layers: the front-end, which parses the files and produces the intermediate representation (IR); the optimization layer, which acts on the language-independent IR; and the back-end, which turns the IR into something executable.
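As a taste of what the front-end layer produces, this is roughly what emitting IR for a two-argument add function looks like with LLVM’s C++ API (a sketch against a recent API; header paths and details have changed across releases):

    #include "llvm/IR/IRBuilder.h"
    #include "llvm/IR/LLVMContext.h"
    #include "llvm/IR/Module.h"
    #include <memory>

    using namespace llvm;

    // Build a module containing: define i32 @add(i32 %a, i32 %b)
    std::unique_ptr<Module> buildModule(LLVMContext &ctx) {
      auto mod = std::make_unique<Module>("demo", ctx);
      IRBuilder<> b(ctx);

      auto *fnTy = FunctionType::get(b.getInt32Ty(),
                                     {b.getInt32Ty(), b.getInt32Ty()}, false);
      auto *fn = Function::Create(fnTy, Function::ExternalLinkage, "add", mod.get());
      b.SetInsertPoint(BasicBlock::Create(ctx, "entry", fn));

      auto *sum = b.CreateAdd(fn->getArg(0), fn->getArg(1), "sum");
      b.CreateRet(sum);
      return mod;
    }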

The main difference is that, unlike GCC, LLVM is extremely generic. While GCC is struggling to fit broader languages inside the strongly C-oriented IR, LLVM was created with a very extensible IR, with a lot of information on how to represent a plethora of languages (procedural, object-oriented, functional etcetera). This IR also carries information about possible optimizations, like GCC’s but to a deeper level.

Another very important difference is that, in the back-end, not only are code generators for many platforms available, but also Just-In-Time compilers (somewhat like JavaScript engines), so you can run, change, re-compile and run again, without even quitting your program.

The middle layer is where the generic optimizations are done on the IR, so it’s language-independent (as all languages will convert to IR). But that doesn’t mean optimizations are done only in that step. All first-class compilers have strong optimizations from the time they open the file until they finish writing the binary.

Parser optimizations normally include useless code removal, constant expression folding, among others, while the most important optimizations on the back-end involve instruction replacement, aggressive register allocation and abuse of hardware features (such as special registers and caches).

But LLVM goes beyond that: it optimizes at run-time, even after the program is installed on the user’s machine. LLVM keeps information (and the IR) together with the binary. When the program is executed, it is profiled automatically and, when the computer is idle, the code is optimized and re-compiled. This optimization is per-user and means that two copies of the same software can end up quite different from each other, depending on how each user uses them. Chris Lattner‘s paper about it is very enlightening.

There are quite a few very important people and projects already using LLVM, and although there is still a lot of work to do, the project is mature enough to be used in production environments or even completely replace other solutions.

If you are interested in compilers, I suggest you take a look at their website… it’s, at least, mind-opening.