On using MLIR for Verona

Verona is a new language being developed by Microsoft Research Cambridge which explores the concept of concurrent ownership. Accessing shared memory in a thread-safe manner requires all sorts of atomic access controls. While modern CPUs implement cache locality and atomic instructions (in addition to the previous generation's inefficient catch-all barriers), such fine control is only meaningful for small objects (usually register-sized).

If you want to share large areas of (concurrently) mutable memory, you generally introduce the idea of locks, and with that come deadlocks, livelocks and all sorts of memory corruption problems. Alternatives, like message passing, have existed for decades, but they're usually efficient only for (again) small objects. It's quite inefficient to send megabyte- or gigabyte-sized mutable blobs in messages.

Verona aims to fix that by passing the ownership of a mutable region of memory as a message instead. The message queues are completely lock-free (using atomic-swap lock-free data structures) and only pass the entry point to a region (an isolated reference to the whole mutable memory blob), which represents unique ownership. Only the thread that holds that isolated object can access any memory in the region contained within. So each thread has the strong guarantee that no one else is accessing the entire region: there is no chance of concurrent mutation and no need for locks.

MLIR is a multi-level intermediate representation developed within the TensorFlow project but later migrated under the LLVM umbrella. As a core part of the LLVM project, MLIR is being used for a lot more than just ML representations and is gaining traction as the high-level representation of some language front-ends (Fortran, hardware description languages), with other experiments bringing Clang to use MLIR as its own intermediate language.

The main benefit of using MLIR for language lowering is that you can keep the language semantics as high-level as you need, by constructing dialect operations that encode the logic you want, and then creating passes that use those operations to infer behaviour and types, or to transform the IR into a more optimal form, before lowering to the standard dialects and, further down, LLVM IR.

This fixes the two big problems in front-ends: on the one hand, it’s a lot easier to work with flexible IR operations than AST (abstract syntax tree) nodes, and on the other hand, we only lower to LLVM IR (which is very low level) when we’re comfortable we can’t extract any more special behaviour from the language semantics. Other existing dialects add to the list of benefits, as they already have rich semantics and existing optimisation passes we can use for free.

Why Verona needs MLIR

Verona aims to be easy to program but powerful enough to express concurrent and rich type semantics without effort. However, Verona's type system is far from simple. C-like languages usually have native types (integer, float, boolean) that can be directly represented in hardware, or are simple for a runtime library to operate on; and container types (lists, sets, queues, iterators), which offer different views and access on native types. Some languages also support generics, which is the parametrisation of some types on other types, for example, a container of any type, to be defined later.

Verona has all of that, plus:

  1. Type capabilities (mutable, immutable, isolated). Controlling the access to objects with regards to mutability (ex. immutable objects are stored in immutable memory outside of mutable regions) as well as region sentinels (isolated) that cannot be held by more than one reference from outside the region.
  2. Type unions (A | B). This feature allows users to create functionality that works with multiple types, allowing easy restriction on the types passed and matching specific types in the code (via keyword match) with the guarantee that the type will be one of those.
  3. Type intersections (A & B). This allows restricting types with capabilities, making it harder to have unexpected access, for example, returning immutable references or identifying isolated objects on creation. It can also help in designing interfaces, creating requirements on objects (ex. to be both random-access and an ordered collection), as well as in function arguments, controlling the access to received objects.
  4. Strong inferred types. The compiler will emit an error if types cannot be identified at compile time, but users don’t need to declare them everywhere, and sometimes they can’t even be known until the compiler runs its own type inference pass (ex. generics, unions, or lambdas).

Verona currently uses a PEG parser that produces the AST and is quite straightforward, but once the AST is constructed, working with it (more specifically creating new nodes, replacing or updating existing ones) is quite involved and error-prone. MLIR has some nice properties (SSA form, dialect operations, regions) that make such transformations much easier. But more importantly, the IR has explicit control flow (CFG), which is important for tracking where variables come from, which combinations of paths they pass through, and where they ultimately end up. This is fundamental for type inference and safety checks, making sure the code can't reach an unknown state through at least one of the possible paths.

So, the main reason why we chose MLIR to be our representation is so we can do our type inference more easily.

The second reason is that MLIR allows us to mix any number of dialects together. So we can lower the AST into a mix of Verona dialect and other standard dialects, and passes that can only see Verona operations will ignore the others and vice-versa. It also allows us to partially lower parts of the dialect into other dialects without having to convert the whole thing. This keeps the code clean (short, to-the-point passes) and allows us to slowly build more information, without having to run a huge analysis pass followed by a huge transformation pass, only to lose information in the middle.

An unexpected benefit of MLIR was that it has native support for opaque operations, i.e. function-call-like operations that don't need defining anywhere. This allowed us to prototype the dialect even before it existed, and it was the precursor of some of our current dialect operations. We're still using opaque nodes where the dialect is not complete yet, allowing us to slowly build the dialect without having to rush through (and fail at) a long initial design phase.

Where are we at

Right now, we have a dialect, a simple and incomplete MLIR generator and a few examples. None of those examples can be lowered to LLVM IR yet, as we don’t have any partial conversion to other standard dialects. Once we have a bit more support for the core language, and we’re comfortable that the dialect is at the right level of abstraction, we’ll start working on the partial conversion.

But, like other examples in MLIR, we’ll have to hold on to strings and printing in our own dialect until it can be lowered directly to the LLVM dialect. This is because MLIR has no native string representation and there is no sane way of representing all types of strings in the existing types.

Other important missing functionality includes:

  • when keyword, which controls access to the regions (requests ownership),
  • match keyword, which controls the types (ex. from a type union) in the following block,
  • lambda and literal objects, which will create their own capture context via anonymous structures and expose a function call.

We also need to expose some minimal runtime library written in Verona to operate on types (ex. integer/floating-point arithmetic and logic), and we need those classes compiled and exposed to user code as an MLIR module, so that we can look at the code as a whole and do more efficient optimisations (like inlining) as well as pattern-matching known operations, like addition, and lower them to native LLVM instructions (ex. add or fadd).

Here be dragons

While we’re always happy to accept issues and pull requests, the current status of the language is raw. We’re co-designing the language, the compiler and the runtime library, as well as its more advanced features such as its clang interface and process sandbox. All of these would need multiple additional blog posts to cover, and all are under continuous discussion to define both syntax and semantics.

By the time we have some LLVM IR generated and hopefully some execution of a simple program, the parser, compiler and libraries will be a bit more stable, allowing external people not only to play with it but also to contribute back.

What would be extremely helpful, though, are tough questions about the language’s behaviour, our own expectations and the API that is exposed to programmers by the runtime. There are a lot of unknowns and until we start writing some serious Verona code, we won’t know for sure what works better. If you’re feeling brave, and would like to create issues with examples of what you would like to see in the language, that’d be awesome.

Issues (and even pull requests) on the existing implementation would also be nice, with recommendations of better patterns for our usage of external libraries, for example MLIR, which is in constant evolution on its own; we can't keep up with every shiny new feature.

So, patches welcome, but bring your dragon-scale armour, shield and fire resistance spells.

Going ARM

Last week I attended the Arm Research Summit, and on Tuesday there was a panel session on using Arm hardware for performance, an effort known as GoingArm. You can watch the whole panel on YouTube, from 2:12 onward (~1h30). All videos are available on the website.

It was interesting to hear the opinions of people on the front lines of ARM HPC, as much as the types of questions that came up, which made me believe even more strongly that now is the time to re-think general computing. For a long time we have been coasting on the fact that cranking up the power and squeezing down the components were making the lives of programmers easier. Scaling up to multiple cores was probably the only big change most people had to deal with, and even so, not many people can actually take real advantage of it.


OpenHPC on AArch64

My current effort is to drive ARM hardware into the data centre, through HPC deployments, for the exascale challenge. Current Top500 clusters start from 0.5 petaflops, but the top 10 are between 10 and 90 petaflops, using close to 20MW of combined power. Scaling up 10~100 times in performance towards exascale would require a whole power plant (coal or nuclear) for each cluster. So, getting to ExaFLOPS involves not only getting server-grade ARM chips into a rack mount, but baking the whole ecosystem (PCB, PCIe, DRAM, firmware, drivers, kernel, compilers, libraries, simulations, deep learning, big data) so that it can run at least at 10x better performance than existing clusters at, hopefully, a fraction of the power budget.

The first step, then, is to find a system that can glue most of those things together, so we can focus on real global solutions, not individual pieces. It also means we need to fix all problems in the right places, rather than deferring that to external projects because it’s not our core business. At Linaro we tend to look at computing problems from a holistic point of view, so if things don’t work the way they should, we make it our job to go and fix the problem where it’s most meaningful, finding the appropriate upstream project and submitting pull requests there, then back-porting solutions internally and gluing them together into a proper solution.

And that’s why we came to OpenHPC, a project that aims to facilitate HPC cluster deployment. Essentially, it’s a package repository that glues functional groups using meta-packages, in addition to recipes and documentation on how to set up a cluster, and a lively community that deploys OpenHPC clusters across different architectures, operating systems and provisioning styles.

The most recent version of OpenHPC, 1.3.2, has just been released with some additions proposed and tested by Linaro, such as Plasma, Scotch and SLEPc, as well as LLVM with Clang and Flang. While the packages work well on x86_64 clusters, and they have passed all tests on AArch64, installing a new cluster on AArch64 with automatic provisioning still needs some modification. That’s why it’s still in Tech Preview.


For those who don’t know, warewulf is a cluster provisioning service. As such, it is installed on a master node, which then keeps a database of all the compute nodes, resource files, operating system images and everything that is necessary to get the compute nodes up and running. While you could install the nodes by hand, then install a dispatcher (like Slurm) on each node, warewulf makes that process simple: it creates an image of a node’s installation, produces an EFI image for PXE boot and lets the nodes discover themselves as they come up alive.

The OpenHPC documentation explains step by step how you can do this (by installing OpenHPC’s own meta-packages and running a few configuration tasks), and if all goes well, every compute node that boots in PXE mode will soon encounter the DHCP server, then the TFTP server, and will get its disk-less live installation painlessly. That is, of course, if the PXE process works.

While running this on a number of different ARM clusters, I realised that I was getting an error:

ERROR:  Could not locate Warewulf's internal pxelinux.0! Things might be broken!

While that doesn’t sound good at all, I learnt that people in the ARM community knew about that all along and were using a Grub hack (to create a simple Grub EFI script to jump start the TFTP download and installation). It’s good that it works in some way, but it’s the kind of thing that should just work, or people won’t even try anything further. Turns out the PXELinux folks haven’t bothered much with AArch64 (examples here and here), so what to do?


One of the great strengths of Linaro is that there are a bunch of maintainers of the core open source projects most people use, so I was bound to find someone that knew what to do, or at least who to ask. As it turns out, two EFI gurus (Leif and Ard) work on my extended team, and we began discussing alternatives when Eric (Arm Research, also an OpenHPC collaborator) tipped us off that warewulf would be completely replacing PXELinux with iPXE, in view of his own efforts to get that working.

After a few contributions and teething issues, I managed to get iPXE booting on ARM clusters as smoothly as I would have done for x86_64.

Tech Preview

Even though it works perfectly well, OpenHPC 1.3.2 was released without it.

That’s because it uses an older snapshot of warewulf, while I was using the bleeding edge development branch, which was necessary due to all the fixes we had to do while making it work on ARM.

This has prompted me to replicate their validation build strategy so that I could replace OpenHPC’s own versions of warewulf in-situ, ignoring dependencies, which is not a very user friendly process. And while I have validated OpenHPC (ran the test-suite, all green) with the development branch of warewulf (commit 74ad08d), it is not a documented or even recommended process.

So, despite working better than the official 1.3.2 packages, we’re still in Tech Preview state, and will be until we can pack the development branch of warewulf past that commit into OpenHPC. Given that it has been reported to work on both AArch64 and x86_64, I’m hoping it will be ready for 1.3.3 or 1.4, whatever comes next.

Distributed Compilation on a Pandaboard Cluster

This week I was experimenting with distcc and Ninja on a Pandaboard cluster and it behaves exactly as I expected, which is a good thing, but it might not be what I was looking for, which is not. 😉

Long story short, our LLVM buildbots were running very slow, taking from 3 to 4.5 hours to compile and test LLVM. If you consider that at peak time (PST hours) there are up to 10 commits in a single hour, the buildbot will end up testing 20-odd patches at the same time. If it breaks in unexpected ways, or if there is more than one patch in a given area, it might be hard to spot the culprit.

We ended up just avoiding the make clean step, which brought us to around 15 minutes for build+tests, with the odd build taking 1 or 2 hours tops, which is a great deal. But one of the alternatives I was investigating is a distributed build. More so because of the availability of cluster nodes with dozens of ARM cores inside: we could make use of such a cluster to speed up our native testing, even benchmarking in a distributed way. If we do it often enough, the sample might be big enough to account for the differences.

The cluster

So, I got three Pandaboards ES (dual Cortex-A9, 1GB RAM each) and put the stock Ubuntu 12.04 on them and installed the bare minimum (vim, build-essential, python-dev, etc), upgraded to the latest packages and they were all set. Then, I needed to find the right tools to get a distributed build going.

It took a bit of searching, but I ended up with the following tool-set:

  • distcc: The distributed build dispatcher, which knows about the other machines in the cluster and how to send them jobs and get the results back
  • CMake: A Makefile generator which LLVM can use, and it’s much better than autoconf, but can also generate Ninja files!
  • Ninja: The new intelligent builder which not only is faster to resolve dependencies, but also has a very easy way to change the rules to use distcc, and also has a magical new feature called pools, which allow me to scale job types independently (compilers, linkers, etc).

All three tools had to be compiled from source. Distcc’s binary distribution for ARM is too old, CMake’s version on that Ubuntu couldn’t generate Ninja files and Ninja doesn’t have binary distributions, full stop. However, it was very simple to get them interoperating nicely (follow the instructions).

You don’t have to use CMake; there are other tools that generate Ninja files, but since LLVM uses CMake, I didn’t have to do anything. What you don’t want is to generate the Ninja files yourself; it’s just not worth it. Unlike Make, Ninja doesn’t try to search for patterns and possibilities (this is why it’s fast), so you have to be very specific in the Ninja file about what you want to accomplish. This is very easy for a program to do (like CMake), but very hard and error-prone for a human (like me).


To use distcc is simple:

  1. Replace the compiler command by distcc compiler on your Ninja rules;
  2. Set the environment variable DISTCC_HOSTS to the list of IPs that will be the slaves (including localhost);
  3. Start the distcc daemon on all slaves (not on the master): distccd --daemon --allow <MasterIP>;
  4. Run ninja with the number of CPUs of all machines + 1 for each machine. Ex: ninja -j6 for 2 Pandaboards.
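The four steps above can be sketched as follows (IP addresses and the compiler command line are illustrative, not from my actual cluster):

```shell
# Step 1: in the Ninja rules file, prefix the compiler with distcc,
#   e.g.  command = distcc g++ -c $in -o $out

# Step 2: on the master, list every machine that takes jobs,
# including the master itself (illustrative IPs).
export DISTCC_HOSTS="localhost 192.168.0.11 192.168.0.12"

# Step 3: on each slave (not the master), start the daemon:
#   distccd --daemon --allow 192.168.0.10
# (commented out here, since it only makes sense on a slave)

# Step 4: the job count is (cores + 1) per machine; for three
# dual-core Pandaboards that gives -j9.
MACHINES=3
CORES_PER_MACHINE=2
JOBS=$(( MACHINES * (CORES_PER_MACHINE + 1) ))
echo "ninja -j${JOBS}"
```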

A local build, on a single Pandaboard of just LLVM (no Clang, no check-all) takes about 63 minutes. With distcc and 2 Pandas it took 62 minutes!

That’s better, but not as much as one would hope for, and the reason is a bit obvious, but no less damaging: The Linker! It took 20 minutes to compile all of the code, and 40 minutes to link them into executable. That happened because while we had 3 compilation jobs on each machine, we had 6 linking jobs on a single Panda!

See, distcc can spread the compilation jobs as long as it copies the objects back to the master, but because a linker needs all objects in memory to do the linking, it can’t do that over the network. What distcc could do, with Ninja’s help, is to know which objects will be linked together and keep copies of them on different machines, so that you can link on separate machines, but that is not a trivial task, and it relies on a level of interoperation between the tools that they were not designed for.

Ninja Pools

And that’s where Ninja proved to be worth its name: Ninja pools! In Ninja, a pool is a named resource with a specific depth, limiting how many jobs assigned to it can run concurrently. You can let compilers scale freely, but say that linkers can’t run more than a handful at a time. You simply need to create a pool called linker_pool (or anything you want), give it a depth of, say, 2, and annotate all linking rules with that pool. See the manual for more details.
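A minimal sketch of what that looks like in a Ninja build file (the rule names and command lines are illustrative; `pool` and `depth` are the actual Ninja keywords):

```ninja
# Declare a pool that allows at most 2 concurrent jobs.
pool linker_pool
  depth = 2

# Rules annotated with the pool are throttled by it,
# independently of the global -j value.
rule link
  command = g++ -o $out $in $LDFLAGS
  pool = linker_pool

# Compile rules use no pool, so they scale up to -jN freely.
rule compile
  command = distcc g++ -c $in -o $out
```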

With the pools enabled, a distcc build on 2 Pandaboards took exactly 40 minutes. That’s roughly a third off the build time with double the resources, not bad. But how does that scale if we add more Pandas?

How does it scale?

To get a third point (and be able to apply a curve fit), I’ve added another Panda and ran again, with 9 jobs and linker pool at 2, and it finished in 30 minutes. That’s less than half the time with three times more resources. As expected, it’s flattening out, but how much more can we add to be profitable?

I don’t have an infinite number of Pandas (nor do I want to spend all my time on this), so I just cheated and got a curve fitting program (xcrvfit, in case you’re wondering), cooked up an exponential that was close enough to the points and used the software’s ability to do a best fit. It came out with 86.806*exp(-0.58505*x) + 14.229, which, according to Lybniz, flattens out after 4 boards (at about 20 minutes).
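Plugging a few board counts into that fit shows the flattening, and gives a quick sanity check against the measured points (40 minutes on 2 boards, 30 on 3):

```latex
T(n) = 86.806\, e^{-0.58505\, n} + 14.229 \quad \text{minutes, for } n \text{ boards}

T(2) \approx 41.2,\quad T(3) \approx 29.2,\quad T(4) \approx 22.6,\quad T(5) \approx 18.9,\quad \lim_{n \to \infty} T(n) = 14.229
```

So the fourth board buys under 7 minutes and the fifth under 4: diminishing returns set in quickly.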

Pump Mode

Distcc has a special mode called pump mode, in which it pushes, along with the C file, all headers necessary to compile it solely on the remote node. Normally, distcc will pre-process on the master node and send the pre-processed result to the slaves, which convert it to object code. According to the manual, this could improve the performance 10-fold! Well, my results were a little less impressive: my 3-Panda cluster finished in just about 34 minutes, 4 minutes more than without pump mode, which is puzzling.

I could clearly see that the files were being compiled on the slaves (distccmon-text would tell me that, whereas before there were a lot of “preprocessing” jobs on the master), but Ninja doesn’t print times on each output line for me to guess what could have slowed it down. I don’t think there was any effect on the linker process, which was still enabled in this mode.


Simply put, both distcc and Ninja pools have shown themselves to be worthy tools. On slow hardware, such as the Pandas, distributed builds can be an option, as long as you have a good balance between compilation and linking. Ninja could be improved to help distcc link on remote nodes as well, but that’s a wish I would not press on the team.

However, scaling only to 4 boards removes a lot of the value for me, since I was expecting to use 16/32 cores. The main problem is again the linker jobs running solely on the master node, with LLVM having lots and lots of libraries and binaries. Ninja’s pools can also work well when compiling LLVM+Clang in debug mode, since the objects are many times bigger, and even on an above-average machine you can start swapping or even freeze your machine if using other GUI programs (browsers, editors, etc).

In a nutshell, the technology is great and works as advertised, but with LLVM it might not yet be the thing. It’s still more profitable to get faster hardware, like the Chromebooks, which are 3x faster than the Pandas and cost only marginally more.

It would also be good to know why pump mode has regressed in performance, but I have no more time to spend on this, so I leave it as an exercise to the reader. 😉

Emergent behaviour

There is a lot of attention to emergent behaviour nowadays (ex. here, here, here and here), but it’s still on the outskirts of science and computing.


For millennia, science has isolated each single behaviour of a system (or system of systems) to study it in detail, then joined them together to grasp the bigger picture. The problem is that this approximation can only be done with simple systems, such as the ones studied by Aristotle, Newton and Ampere. Every time scientists were approaching the edges of their theories (including those three), they just left the rest as an exercise to the reader.

Newton foresaw relativity and the possible lack of continuity in space and time, but he did nothing to address that. Fair enough, his work was much more important to science than venturing through the unknowns; that would have been almost mystical of him to try (although he was an alchemist). But more and more, scientific progress seems to be blocked by chaos theory, where you either unwind the knots or go back to alchemy.

Chaos theory has existed for more than a century, but it is only recently that it has been applied to anything outside differential equations. The hyper-sensitivity to initial conditions is clear in differential systems, but other systems have a less visible, though no less important, sensitivity. We just don’t see it well enough, since most of the other systems are not as well formulated as differential equations (thanks to Newton and Leibniz).

Neurology and the quest for artificial intelligence have raised strong interest in chaos theory and fractal systems. The development of neural networks has shown that groups and networks also have a fundamentally chaotic nature, but more importantly, that it’s only through the chaotic nature of those systems that you can get a good amount of information from them. Quantum mechanics had the same evolution, with Heisenberg and Schroedinger kicking the ball first on the oddities of the universe and how important the lack of knowledge of a system is to being able to extract information from it (think of Schroedinger’s cat).

A network with direct and fixed thresholds doesn’t learn. Particles with known positions and velocities don’t compute. N-body systems with definite trajectories don’t exist.

The genetic code has some similarities to these models. Living beings have far more junk than genes in their chromosomes (junk reaches 98% of the human genome), but changes in the junk parts can often lead to invalid creatures. If junk within genes (introns) gets modified, the actual code (exons) could be spliced differently, leading to a completely new, dysfunctional protein. Or, if you add start sequences (TATA boxes) to non-coding regions, some of them will be transcribed into whatever protein they can make, creating rubbish within cells, consuming resources or eventually killing the host.

But most of the non-coding DNA is also highly susceptible to changes, and that’s probably its most important function: it is adapted to the specific mutation rates of our planet and is our defence mechanism against such mutations. For billions of years, the living beings on Earth have adapted that code. Each of us has a super-computer that can choose, by design, the best ratios for a given scenario within a few generations, and create a whole new species or keep the current one adapted, depending on what’s more beneficial.

But not everyone is that patient…


Sadly, in my profession, chaos plays an important part, too.

As programs grow old and programmers move on, a good part of the code becomes stale, creating dependencies that are hard to find and harder to fix. In that sense, programs are pretty much like the genetic code: the amount of junk increases over time, and that gives the program resistance against changes. The main problem with computing, which is not clear in genetics, is that the code that stays behind is normally the code that no one wants to touch, thus the ugliest and most problematic.

DNA transcriptors don’t care where the genes are, they find a start sequence and go on with their lives. Programmers, we believe, have free will and that gives them the right to choose where to apply a change. They can either work around the problem, making the code even uglier, or they can go on and try to fix the problem in the first place.

Non-programmers would quickly state that only lazy programmers would do the former, but more experienced ones will admit to having done so on numerous occasions for different reasons. Good programmers would do that because fixing the real problem is so painful to so many other systems that it’s best left alone, to be replaced in the future (only it never will be). Bad programmers are not just lazy; some of them really believe that’s the right thing to do (I’ve met many like this), and that adds some more chaos into the game.

It’s not uncommon to try to fix a small problem, go more than half-way through and hit a small glitch on a separate system. A glitch that you quickly identify as being wrongly designed, so you, as any good programmer would do, re-design it and implement the new design, which is already much bigger than the fix itself. All tests pass, except the one, that shows you another glitch, raised by your different design. This can go on indefinitely.

Some changes are better done in packs, all together, to make sure all designs are consistent and the program behaves as it should, not necessarily as the tests say it would. But that’s not only too big for one person at one time, it’s practically impossible when other people are changing the program under your feet, releasing customer versions and changing the design themselves. There is a point where a refactoring is not only hard, but also a bad design choice.

And that’s when code becomes introns, and is seldom removed.


The power of networks is rising, slower than expected, though. For decades, people have known about synergy, chaos and emergent behaviour, but it is only recently, with the quality and amount of information on global social interaction, that those topics are rising back into the main picture.

Twitter, Facebook and the like have raised so many questions about human behaviour, and a lot of research has been done to address those questions and, to a certain extent, answer them. Psychologists and social scientists have known for centuries that social interaction is greater than the sum of its parts, but now we have the tools and the data to prove it once and for all.

Computing clusters have been applied to most of the hard scientific problems for half a century (weather prediction, earthquake simulation, exhaustion proofs in graph theory). They also took on a commercial side with MapReduce and similar mechanisms that have popularised distributed architectures, but that’s only the beginning.

On the distributed systems of today, emergent behaviour is treated as a bug that has to be eradicated. In the exact science of computing, locks and updates have to occur in the precise order they were programmed to, to yield the exact result one is expecting. No more, no less.

But to keep our systems free of emergent behaviours, we invariably introduce emergent behaviour into our code. Multiple checks on locks and variables, different design choices for components that have to work together, and the expectation of precise results or nothing at all make the number of lines of code grow exponentially. And, since all that has to run fast, even more lines and design choices are added to avoid one extra operation inside a very busy loop.

While all this is justifiable, it’s not sustainable. In the long run (think decades), the code will be replaced or the product will be discontinued, but there is a limit to how many lines a program can receive without losing some others. And the cost of refactoring increases with the lifetime of a product. This is why old products don’t get many updates: not because they’re good enough already, but because it’s impossible to add new features without breaking a lot of others.

Distant future

As much as I like emergent behaviour, I can’t begin to fathom how to harness that power. Stochastic computing is one way and has been done with a certain level of success here and here, but it’s far from easy to create a general logic behind it.

Unlike Turing machines, emergent behaviour comes from multiple sources, dressed in multiple disguises and producing far too much variety in results to be accounted for by one theory. It’s similar to string theory, where there are several variations of it, but only one M-theory, the one that joins them all together. The problem is, nobody knows what this M-theory looks like. Well, they barely know what the different versions of string theory look like, anyway.

In that sense, emergent theory is even further than string theory from being understood in its entirety. But I strongly believe this is one way out of the conundrum we live in today, where adding more features makes it harder to add more features (like mass at relativistic speeds).

With stochastic computing there is no need for locks, since all that matters is the probability of an outcome, and precise values do not make sense. There is also no need for NxM combinations of modules and special checks, since the power is not in the computations themselves, but in the meta-computation, done by the state of the network rather than its specific components.
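To make the idea concrete, here is a toy sketch (my own illustration, not from any of the projects mentioned) of stochastic computing: a number is encoded as the probability of seeing a 1 in a random bitstream, and a single AND gate multiplies two such numbers. No locks, no exact values; only the statistics of the streams matter.

```python
import random

random.seed(42)  # deterministic for the example

def bitstream(p, n):
    """Encode probability p as a stream of n random bits."""
    return [1 if random.random() < p else 0 for _ in range(n)]

def stochastic_multiply(p, q, n=100_000):
    """Multiply two probabilities with a bitwise AND over the streams."""
    a, b = bitstream(p, n), bitstream(q, n)
    return sum(x & y for x, y in zip(a, b)) / n

print(stochastic_multiply(0.5, 0.4))  # ≈ 0.2, within sampling noise
```

The answer is only ever approximate, which is exactly the trade the paragraph above describes: you give up precise values and get lock-free composition in return.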

But that, I’m afraid, I won’t see in my lifetime.


Minix seems to be inspiring more operating systems nowadays. Microsoft Research is investing in a micro-kernel (they call it a multi-kernel, as there are slight differences) called Barrelfish.

Despite being Microsoft, it's BSD licensed. The mailing list looks pretty empty, the last snapshot is half a year old and I couldn't find an svn repository, but that's still more than I would expect from Microsoft anyway.


The basic concept is actually very interesting. The idea is to be able to take multi-core hybrid machines to the extreme and still run a single OS on them. Pretty much the same way some cluster solutions do (OpenMPI, for instance), but on a single machine. The idea is far from revolutionary: it's a natural evolution of the multi-core trend combined with current cluster solutions (available for years) and a fancy OS design (micro-kernel) that everyone learns in CS degrees.

What's the difference, then? For one thing, the idea is to abstract everything away. CPUs will be just another piece of hardware, like network or graphics cards. The OS will have the freedom to ask the GPU to do MP floating-point calculations, for instance, if it feels it's going to benefit the total execution time. It'll also be able to accept different CPUs in the same machine, Intel and ARM for instance (like the Dell Latitude z600), or have different GPUs, nVidia and ATI, and still use all the hardware.

With Windows, Linux and Mac today, you either use the nVidia driver or the ATI one. You also normally don't have hybrid-core machines and absolutely can't recover if one of the cores fails. This is not the case with cluster solutions, and Barrelfish's idea is to simulate precisely that. In theory, you could do energy control (enabling and disabling cores), crash recovery when one of the cores fails but not the others, or plug and play graphics or network cards and even different CPUs.

Imagine you have an ARM netbook that is great for browsing, but you want to play a game on it. You get your nVidia and a coreOcta 10Ghz USB4 and plug them in. The OS recognizes the new hardware, loads the drivers and lets you play your game. Battery life goes down, so once you're back from the game, you just unplug the cards and continue browsing.


So, how is it possible for Barrelfish to be that malleable? The key is communication. Shared memory is great for single-process threaded code and acceptable for multi-process OSs with a small number of concurrent processes accessing the same region of memory. Most modern OSs can handle many concurrent processes, but they rarely access the same data at the same time.

Normally, processes are single-threaded or have a very small number of threads (dozens) running. More than that is so difficult to control that people usually fall back on other means, such as client/server, or they just go out and buy more hardware. In clusters, there is no way to use shared memory. For one, accessing memory on another computer via the network is just plain stupid, but even if you use shared memory in each node and client/server among different nodes, you're bound to have trouble. This is why MPI solutions are so popular.

In Barrelfish there's no shared memory at all. Every process communicates with the others via messages and duplicates content (rather than sharing it). There is an obvious associated cost (memory and bus), but the lock-free semantics are worth it. It also gives Barrelfish another freedom: to make the communication protocol generic enough that each piece of hardware is completely independent of all others, and plug'n'play becomes seamless.


It all seems fantastic, but there's a long road ahead. First, message passing scales much better than shared memory, but nowadays there aren't enough cores in most machines to make it worthwhile. Message passing also introduces other problems that are not easily solvable: bus traffic and storage requirements increase considerably, and messages are not that generic in nature.

Some companies are famous for not adhering to standards (Apple comes to mind), and a standard hardware IPC framework would be quite hard to achieve. Also, even with pure software IPC APIs, different companies will still use slightly modified APIs to suit their specific needs, and complexity will rise exponentially.

Another problem is where the hypervisor will live. Having a distributed control centre is cool and scales amazingly well, but its complexity also scales. In a hybrid-core machine, you have to run different instructions, in different orders, with different optimizations and communication. Choosing one core to deal with the scheduling and administration of the system is much easier, but leaves a single point of failure.

Finally, going the multi-hybrid-independent style is way too complex, even for a multi-year project with lots of funding (M$) and clever people working on it. After all, if the micro-kernel were really that useful, Tanenbaum would have won the argument with Linus. But the future holds what the future holds, and reality (as well as hardware and some selfish vendors) can change. Multi-kernel might be possible and even easier to implement in the future.

This seems to be what the Barrelfish team is betting on, and I'm with them on that bet. Even if it fails miserably (as Minix did), some concepts could still be used in real-world operating systems (like Minix), whatever that'll mean in 10 years. Being serious about parallelism is the only way forward; sticking with 40-year-old concepts is definitely not.

I'm still aspiring to non-deterministic computing, though, but that's an even longer shot…

Smart Grid Privacy

I have recently joined the IETF Smart Grid group to see what people were talking about and to put away my fears on security and privacy. What I saw was a bunch of experts discussing the plethora of standards that could be applied (very important), but few people seemed interested in the privacy issue.

If you look at the IEEE page on Smart Grids, besides the smart generation / distribution / reception (very important), there is a paragraph on the interaction between the grid and the customers, which is very careful not to mention invasive techniques that would allow the grid to control customers' appliances:

“Intelligent appliances capable of deciding when to consume power based on pre-set customer preferences.”

Here, the focus is on letting the appliances decide what will be done to save power, not the grid or the provider. Later on, in the same paragraph:

“Early tests with smart grids have shown that consumers can save up to 25% on their energy usage by simply providing them with information on that usage and the tools to manage it.”

Again, reinforcing that the providers will only "provide [the customer] with information". In other words, the grid is smart up to the smart meter (which is controlled by the provider); inside people's houses, it's the appliances that have to be smart. One pertinent comment from Hector Santos in the IETF group:

“Security (most privacy) issues, I believe, has been sedated over the years with the change in consumer mindset. Tomorrow (and to a large extent today) generation of consumers will not even give it a second thought. They will not even realize that it was once considered a social engineering taboo to conflict with user privacy issues.”

I hate to be a pessimist, but there is a very important truth in this. Not only are people allowing systems to store their data for completely different reasons, but they don't care whether the owner of the system will distribute their information or not. I myself, always paranoid, have signed contracts with providers knowing that they would use and sell my data to third parties. British Telecom is one good example. He continues:

“Just look how social networking and the drive to share more, not less has changed the consumer mindset. Tomorrow engineers will be part of all this new mindset.”

There is no social engineering any more, not like there used to be. Who needs to steal your information when it's already there, on your Facebook? People share willingly, and a lot of them know what problems it may cause, but the benefit, for them, is greater. Moreover, millions have bought music, games and films with DRM, allowing a company to control what you do, see or listen to. How many Kindles were bought? How many iPhones? People don't care what's going on if they have what they want.

That is the true meaning of sedated privacy concerns. It's a very distorted kind of selfishness, where you don't care about yourself, as long as you are happy. If it makes no sense to you, don't worry, it makes no sense to me either.

Recently, the Future of Privacy Forum published an excellent analysis (via Ars) of smart grid privacy. Several concepts whose dangers are easy to understand have become so commonplace that nobody thinks about them, or they're even considered a silly worry, given that no one cares anyway.

An evil use of a similar technology is "Selectable Output Control". Just like a Kindle, the media companies want to make sure you only watch what you pay for. It may seem fair, and even cheaper, as it allows "smart pricing", like some smart-grid technologies.

But we have all seen what Amazon did to Kindle users, or Apple did to its AppStore: taking down content without warning, removing things you paid for from your device, allowing or disallowing you to run applications or content on your device as if you hadn't paid enough money to own the device and its contents.

In the end, "smart pricing" is like a tax cut: they reduce tax A, but introduce taxes B, C and D, which double the amount of tax you pay. Of course, you only knew about tax A and went on happy with your life. All in all, nobody cares who or how much they pay, as long as they can get the newest fart app.

On Workflows, State Machines and Distributed Services

I've been working with workflow pipelines, directly and indirectly, for quite a while now, and one thing that is clearly visible is that, the way most people do it, it doesn't scale.

Workflows (aka. Pipelines)

If you have a sequence of, say, 10 processes in a row (or graph) of dependencies and you need to run your data through every single one in the correct order to get the expected result, the first thing that comes to mind is a workflow. Workflows focus on data flow rather than on decisions, so there is little you can do with the data in case it doesn't fit your model. The result of such an inconsistency generally falls into two categories: discard until further notice, or re-run the current procedure until the data is correct.

Workflows are normally created ad hoc, from a few steps. But things always grow out of proportion. If you're lucky, those few steps will be wrapped into scripts as things grow, and you end up with a workflow of workflows instead of a huge list of steps to take.

The benefit of the workflow approach is that you can run it at your own pace and ensure the data is still intact after each step. The downfall is that it's too easy to incorrectly blame the last step for a problem in your data. The state of the data could have been deteriorating over time, and the current step was just the one that picked it up. Also, checking your data after each step is a very boring and nasty job, and no one can guarantee you'll pick up every single bug anyway.

It becomes worse when the data volume increases to a level where you can't just look it up anymore. You'll have to write syntax checkers for the intermediate results, and you'll end up with thousands of logs and scripts just to keep the workflow running. This is the third stage: a meta-workflow for the workflow of workflows.

But this is not all; the worst problem is still to come… Your data is increasing by the minute, and you'll have to split it up one day, sooner rather than later. But as long as you have to manually feed your workflow (and hope the intermediate checks work fine), you'll have to split the data manually. If your case is simple and you can just run multiple copies of each step in parallel on each chunk of data, you're in heaven (or rather, hell for the future). But that's not always the case…

Now you've reached a point where you have to maintain a meta-workflow for your workflow of workflows and manually manage the parallelism, collisions and checks of your data, only to find out at the end that a particular piece of code was ignoring a horrendous bug in your data when it was already public.

If you want to add a new feature or change the order of two processes… well… 100mg of prozac and good luck!

Refine your Workflow

Step 1: Get rid of manual steps. The first rule is that there can be no manual step, ever. If the final result is wrong, you turn on the debugger, read the logs, find the problem, fix it and turn it off again. If you can't afford to have wrong data live, then write better checkers or reduce the complexity of your data.

One way to reduce the complexity is to split the workflow into smaller independent workflows, each of which generates only a fraction of your final data. If you have a mission-critical environment, you're better off with 1/10th of it broken than the whole thing. Nevertheless, try to reduce the complexity of your pipeline, data structures and dependencies. When you have no changes to make, re-think the whole workflow; I'm sure you'll find lots of problems on every iteration.

Step 2: Unitize each step. The important rule here is: each step must process one, and only one, piece of data. How does it help? Scalability.

Consider a multiple-data workflow (fig. 1 below), where you have to send the whole thing through every time. If one of the processes is much slower than the others, you'll have to split your data for that particular step and join it again for the others. Splitting your data once at the beginning and running multiple pipelines at the same time is a nightmare, as you'll have to deal with the scrambled error messages yourself, especially if you still have manual checks around.

Multiple data Workflow
Figure 1: Split/join is required at each parallel step

On the other hand, if only one unit passes through each step (fig. 2), there is no need to split or join them and you can run as many parallel processes as you want.

Single data Workflow
Figure 2: No need to split/join
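As a sketch of figure 2 (with hypothetical steps A, B and C; the step bodies are made up), each worker pulls one unit at a time from a queue, so scaling the slow step B is just a matter of starting a second copy of it, with no split/join anywhere:

```python
import queue
import threading

def worker(step, inbox, outbox):
    # Each iteration processes exactly one unit of data, then picks up the next.
    while True:
        outbox.put(step(inbox.get()))

q_in, q_ab, q_bc, q_out = (queue.Queue() for _ in range(4))

step_a = lambda x: x + 1   # hypothetical fast step A
step_b = lambda x: x * 2   # hypothetical slow step B: run two copies of it
step_c = lambda x: x - 1   # hypothetical fast step C

for step, inbox, outbox, copies in [(step_a, q_in, q_ab, 1),
                                    (step_b, q_ab, q_bc, 2),  # scale B alone
                                    (step_c, q_bc, q_out, 1)]:
    for _ in range(copies):
        threading.Thread(target=worker, args=(step, inbox, outbox),
                         daemon=True).start()

for n in range(5):
    q_in.put(n)                # feed units one by one: no split, no join
results = sorted(q_out.get() for _ in range(5))
print(results)                 # [1, 3, 5, 7, 9]
```

Note that the results arrive out of order (the two copies of B race each other), which is fine when each unit is independent; that independence is precisely what step 2 buys you.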

Step 3: Use simple and stupid automatic checks. If possible, don’t code anything at all.

If data must be identical on two sides, run a checksum (CRC sucks, MD5 is good). If the syntax needs to be correct, run a syntax checker, preferably an automatic schema/ontology-based check. If your file is too complex or specially crafted, so that you need a special syntax check, re-write your file to use standards (XML and RDF are good).
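The identity check is a few lines; this sketch (the file paths are whatever your pipeline produces) compares digests instead of eyeballing files, reading in chunks so even huge intermediates stay cheap to check:

```python
import hashlib

def digest(path):
    """Checksum a file in chunks so memory use stays constant."""
    h = hashlib.md5()   # the post's suggestion; today sha256 is the safer pick
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def same_content(a, b):
    """True when both sides of a transfer hold identical bytes."""
    return digest(a) == digest(b)
```

Drop a call to `same_content` between any two steps that are supposed to move data without changing it, and that whole class of silent corruption disappears.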

Another important point about automatic checking is that you shouldn't have to watch your inbox waiting for an error message. When the subject contains the error message it's already a pain, but when you have to grep for errors inside the body of the message? Oh god! I've lost a few lives already because of that…

Only mail when a problem occurs, and only send the message related to the specific problem. It's ok to send a weekly or monthly report just in case the automatic checks miss something. Go on and check the data for yourself once in a while and don't worry: if things really screw up, your users will let you know!


But what's the benefit of allowing your pipeline to automatically compute individual values and check for consistency if you still have to push the buttons? What you want now is a way of having more time to watch the pipeline flowing and fix architectural problems (and longer tea-time breaks), rather than putting out fires all the time. To calculate how many buttons you'll press, just multiply the number of data blocks you have by the number of steps… It's a loooong way…

If you still like pressing buttons, that's ok. Just skip step 2 above and all will be fine. Otherwise, keep reading…

To automate your workflow you have two choices: either you fire one complete workflow for each data block, or you treat it as a flow of data through different services.

Complete Workflow: State Machines

If your data is small or infrequent, or you just like the idea, you can use a state machine to build a complete workflow for each block of data. The concept is rather simple: you receive the data and fire it through the first state. The machine will carry on, sending the changed data through all the necessary states until, at the end, you have your final data in place, checked and correct.

UML is pretty good at defining state machines. For instance, you can use a state diagram to describe how your workflow is organised, class diagrams to show how each process is constructed and sequence diagrams to describe how processes talk to each other (preferably using a single technology). With UML, you can generate code and vice versa, making it very practical for live changes and documentation purposes.

The State design pattern allows you to have a very simple model (each state is of the same type) with only one point of decision about where to go next (when changing states): the state itself. This gives you the power to change the connections between states easily and with very (very) little work. It'll also save you a lot on prozac.
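A bare-bones sketch of the State pattern applied to a workflow (the step names here are made up): every state shares the same interface and decides the next state itself, so rewiring the workflow means touching one method, not the engine.

```python
class State:
    """Common interface: run one step, return the next state (or None)."""
    def run(self, data):
        raise NotImplementedError

class Validate(State):
    def run(self, data):
        data["valid"] = True
        return Transform()      # the state itself decides where to go next

class Transform(State):
    def run(self, data):
        data["value"] *= 2
        return Done()

class Done(State):
    def run(self, data):
        return None             # end of the workflow

def run_workflow(data):
    state = Validate()
    while state is not None:    # the engine never needs to know the wiring
        state = state.run(data)
    return data

print(run_workflow({"value": 21}))  # {'value': 42, 'valid': True}
```

To insert a step between Validate and Transform, you change one return statement; the driving loop stays untouched.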

If you got this far, you're really interested in workflows or state machines, so I assume you also have a workflow of your own. If you do, and it's a mess, I also believe that you absolutely don't want to re-code all your programs just to use UML diagrams, queues and state machines. But you don't need to.

Most programming languages allow you to create a shell and execute an arbitrary command. You can then manage the inter-process administration (creating/copying files, fifos, flags, etc.), execute the process and, at the end, check the data and choose the next step (based on the current state of your data).
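For instance, in Python, a step wrapper could look like this (the command is a placeholder for your real program); the exit code decides whether the data moves on to the next state:

```python
import subprocess

def run_step(cmd):
    """Execute one workflow step in a shell; True means it may proceed."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    if result.returncode != 0:
        print("step failed:", result.stderr.strip())  # hook your logging here
    return result.returncode == 0

print(run_step("echo processing block 1"))  # True
```

The existing programs don't change at all; the state machine only wraps them and inspects their results.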

This methodology is simple, powerful and straightforward, but it comes at a price. When you have too many data blocks flowing through, you end up with lots of copies of the same process being created and destroyed all the time. You can, instead, leave the machine running and only feed it data blocks, but this still doesn't scale as we wanted in step 2 above.

Layered Workflow: Distributed Services

Now comes the holy (but very complex) grail of workflows. If your data is huge, constantly flowing, CPU-demanding and with awkward steps in between, you need to program with parallelism in mind. The idea is not complex, but the implementation can be monstrous.

In figure 2 above, you have three processes, A, B and C, running in sequence, and process B has two copies running because it takes twice as long as A and C. It's that simple: the longer a step takes to finish, the more copies you run in parallel to keep the flow constant. It's like sewage pipes: rain water can flow in small pipes, but house waste needs much bigger ones; later on, when you filter out the rubbish, you can use small pipes again.

So, what's so hard about implementing this scenario? Well, first you have to take into account that those processes will be competing for resources. If they're on the same machine, CPU will be a problem. If you have a dual core, the four processes above will share CPU, not to mention memory, cache, bus, etc. If you use a cluster, they'll all compete for network bandwidth and space on shared filesystems.

So, the general guidelines for designing robust distributed automatic workflows are:

  • Use a layered state architecture. Design your machine in layers, separate the layers into machines or groups of machines and put a queue or a load-balancer between each layer (state). This will allow you to scale much more easily, as you can add more hardware to a specific layer without impacting the others. It also allows you to switch off defective machines or do any maintenance on them with zero down-time.
  • One process per core. Don't spawn more than one process per CPU, as this will hurt performance in more ways than you can probably imagine. It's just not worth it. Reduce the number of processes or steps, or just buy more machines.
  • Use generic interfaces. Use the same queue / load-balancer for all state changes and, if possible, make their interfaces (protocols) identical, so the previous state doesn't need to know what's in the next one and you can change from one to another at zero cost. Also, make the states implement the same interface, for the case where you don't need queues or load-balancers for a particular state.
  • Include monitors and health checks in your design. With such a complex architecture, it's quite easy to miss machines or processes failing. Separate reports into INFO, WARNING and ERROR, give them priorities or different colours on a web interface, and mail or SMS only the errors to you.

As you can see, by providing layered load-balancing, you're getting performance and high availability for free!

Every time data piles up in one layer, just increase the number of processes in it. If a machine breaks, it'll be taken out of the rotation (automatically if using queues; ping-driven or MAC-driven for load-balancers). Updating the operating system, your software or anything on the machine is just a matter of taking it out, updating, testing and deploying it back again. Your service will never be off-line.

Of course, to get this benefit for real you have to remove all single points of failure, which is rarely possible. But to a certain degree, you can get high performance, high availability, load-balancing and scalability at a reasonably low cost.

The initial cost is big, though. Designing such a complex network and its sub-systems, providing all the safety checks and organizing a big cluster is not an easy (nor cheap) task, and definitely not one for inexperienced software engineers. But once it's done, it's done.

More Info

Wikipedia is a wonderful source of information, especially in the computer science field. Search for workflows, inter-process communication, queues, load-balancers, commodity clusters, process-driven applications, message-passing interfaces (MPI, PVM) and some functional programming languages like Erlang and Scala.

It's also a good idea to look for UML tools that generate code and vice versa, RDF ontologies and SPARQL queries. XML and XSD are also very good for validating data formats. If you haven't yet, take a good look at design patterns, especially the State pattern.

Bioinformatics and internet companies have a particular affection for workflows (hence my experience), so you may find numerous examples in both fields.

RDBMS, to rewrite or not to rewrite… I got confused…

Mike Stonebraker (Ingres/Postgres) seems to be confused as well…

First he said Google's Map/Reduce was "missing most of the features that are routinely included in current DBMS", but earlier he had said to ditch RDBMS anyway because "modern use of computers renders many features of mainstream DBMS obsolete".

So, what's the catch? Should we still use RDBMS or not? Or should we keep developing technologies based on relational databases while Mike himself develops the technology of the new era? Maybe that was the message anyway…

My opinion:

MapReduce is not a step backwards; there are times when indexing is actually slower than brute force. And I'm not just saying that the indexes have to be updated at insert time and so on; I'm saying that in the actual search for information, if the index is too complex (or too big), it might take more time to search through the index, compute the location of the data (which might be anywhere in a range of thousands of machines), retrieve the data and, later on, sort or search on the remaining fields.

MapReduce can effectively do everything in one step, while the data is still on the machine, and return fewer values per search (as opposed to primary-key searches first), so less data will be sent over the network and less time will be taken.
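A toy, in-memory word count shows the shape of the model (my sketch, nothing like Google's actual implementation): map emits key/value pairs, the framework groups them by key, and reduce collapses each group, all in one brute-force pass with no index anywhere.

```python
from collections import defaultdict

def map_fn(doc):
    # Emit a (key, value) pair per word; no index is ever consulted.
    for word in doc.split():
        yield word, 1

def reduce_fn(word, counts):
    return word, sum(counts)

def mapreduce(docs):
    groups = defaultdict(list)
    for doc in docs:                      # the map phase (parallel in real life)
        for key, value in map_fn(doc):
            groups[key].append(value)
    # the reduce phase: each key's group is collapsed independently
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(mapreduce(["big data", "big cluster"]))  # {'big': 2, 'data': 1, 'cluster': 1}
```

Since each map call and each reduce call is independent, spreading them over 1800 machines is the framework's job, not the programmer's; that is the whole trick.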

Of course, MapReduce (like any other brute-force method) is hungry for resources. You need a very large cluster to make it really effective (1800 machines is enough :)), but it's a step towards something different from RDBMS. In the distributed world, RDBMS won't work at all; something has to be done, and Google just took the first step.

Did we wait for warp-speed to land on the moon?! No, we got a flying foil crap and landed on it anyway.

Next steps? Many… We can continue with brute force and do a MapReduce on the index, then use the index to retrieve data in an even larger cluster, or use automata to iteratively search and store smaller "views" somewhere else, or build statistical indexes (quantum indexes) and get the best result we can instead of all results… The possibilities are endless…

Let's wait and see how it goes, but yelling DO IT and then later DON'T is just useless…


This is not a rant against Stonebraker; I share his ideas about the relational model being far too outdated and the need for something new. What I don't agree with, though, is that MapReduce is a step backwards; maybe it's not even a step forward, but probably sideways.

The whole point is that the relational model is the thesis and there are lots of antitheses; we just haven't come up with the synthesis yet.

LSF, Make and NFS 2

Recently I posted this entry about how the NFS cache was playing tricks on me and how sleep 1 kinda solved the issue.

The problem got worse, of course. I raised it to 5 seconds, and in some cases it was still not enough; then I learnt from the admins that the NFS cache timeout was 60 seconds!! I couldn't sleep 60 on all of them, so I had to come up with a script:

slept=0; timeout=60
while [ ! -s "$file" ] && (( slept < timeout )); do sleep 5; slept=$((slept+5)); done

In a way, it's not as ugly as it may seem… First, the alternative is to change the configuration (either disable the cache or reduce the timeout) for the whole filesystem, and that would affect others. Second, because now I just wait for (almost) the correct amount of time, and only when I need to (the first -s test will pick up the file immediately if there is no problem).

At least, sleep 60 on everything would be much worse! 😉