Distributed Compilation on a Pandaboard Cluster

This week I was experimenting with distcc and Ninja on a Pandaboard cluster and it behaves exactly as I expected, which is a good thing, but it might not be what I was looking for, which is not. 😉

Long story short, our LLVM buildbots were running very slowly, taking from 3 to 4.5 hours to compile and test LLVM. If you consider that at peak time (PST hours) there are up to 10 commits in a single hour, the buildbot can end up testing 20-odd patches at the same time. If it breaks in unexpected ways, or if there is more than one patch touching a given area, it might be hard to spot the culprit.

We ended up just avoiding the make clean step, which brought us down to around 15 minutes for build+tests, with the odd build taking 1 or 2 hours tops, which is a great deal. But one of the alternatives I was investigating was a distributed build. With cluster nodes carrying dozens of ARM cores becoming available, we could use such a cluster to speed up our native testing, and even do benchmarking in a distributed way. If we do it often enough, the sample might be big enough to account for the differences.

The cluster

So, I got three Pandaboards ES (dual Cortex-A9, 1GB RAM each) and put the stock Ubuntu 12.04 on them and installed the bare minimum (vim, build-essential, python-dev, etc), upgraded to the latest packages and they were all set. Then, I needed to find the right tools to get a distributed build going.

It took a bit of searching, but I ended up with the following tool-set:

  • distcc: The distributed build dispatcher, which knows about the other machines in the cluster, how to send them jobs and how to get the results back;
  • CMake: A build-file generator which LLVM can use; not only is it much better than autoconf, it can also generate Ninja files!
  • Ninja: The new intelligent builder, which not only is faster at resolving dependencies, but also makes it very easy to change the rules to use distcc, and has a magical new feature called pools, which allows me to scale job types independently (compilers, linkers, etc.).

All three tools had to be compiled from source. Distcc’s binary distribution for ARM is too old, CMake’s version on that Ubuntu couldn’t generate Ninja files and Ninja doesn’t have binary distributions, full stop. However, it was very simple to get them interoperating nicely (follow the instructions).

You don’t have to use CMake; there are other tools that generate Ninja files, but since LLVM uses CMake, I didn’t have to do anything. What you don’t want is to write the Ninja files yourself, it’s just not worth it. Unlike Make, Ninja doesn’t try to search for patterns and possibilities (that’s why it’s fast), so you have to be very specific in the Ninja file about what you want to accomplish. This is very easy for a program to do (like CMake), but very hard and error-prone for a human (like me).

Distcc

Using distcc is simple:

  1. Replace the compiler command with distcc compiler in your Ninja rules (see the sketch just after this list);
  2. Set the environment variable DISTCC_HOSTS to the list of IPs that will be the slaves (including localhost);
  3. Start the distcc daemon on all slaves (not on the master): distccd --daemon --allow <MasterIP>;
  4. Run ninja with a job count of (number of CPUs + 1) for each machine. Ex: ninja -j6 for 2 Pandaboards.
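For step 1, the edit to the generated build.ninja looks roughly like this; the rule and variable names below are illustrative, CMake will generate its own:

    # compile rule in build.ninja, with the compiler prefixed by distcc
    rule CXX_COMPILER
      command = distcc g++ $DEFINES $INCLUDES $FLAGS -c $in -o $out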

A local build of just LLVM (no Clang, no check-all) on a single Pandaboard takes about 63 minutes. With distcc and 2 Pandas it took 62 minutes!

That’s better, but not as much as one would hope for, and the reason is a bit obvious, but no less damaging: the linker! It took 20 minutes to compile all of the code, and 40 minutes to link it into executables. That happened because while we had 3 compilation jobs on each machine, we had 6 linking jobs on a single Panda!

See, distcc can spread the compilation jobs around as long as it copies the objects back to the master, but because a linker needs all objects in memory to do the linking, it can’t do that over the network. What distcc could do, with Ninja’s help, is know which objects will be linked together and keep copies of them on different machines, so that you could link on separate machines, but that is not a trivial task, and it relies on a level of interoperation between the tools that they were not designed to accept.

Ninja Pools

And that’s where Ninja proved to be worth its name: Ninja pools! In Ninja, pools are named resources with a fixed depth, which lets you scale job types independently: you can let compilers scale freely, but never run more than a handful of linkers at once. You simply create a pool called linker_pool (or anything you want), give it a depth of, say, 2, and annotate all linking jobs with that pool. See the manual for more details.
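A minimal sketch of what that looks like in the Ninja file (the link rule and its command are just examples; the pool syntax itself is the real Ninja feature):

    # a pool allowing at most 2 link jobs to run at the same time
    pool linker_pool
      depth = 2

    # linking stays local (no distcc) and is throttled by the pool
    rule LINK
      command = g++ $LINK_FLAGS $in -o $out
      pool = linker_pool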

With the pools enabled, a distcc build on 2 Pandaboards took exactly 40 minutes. That’s roughly a third off the build time with double the resources, not bad. But how does that scale if we add more Pandas?

How does it scale?

To get a third point (and be able to apply a curve fit), I’ve added another Panda and ran again, with 9 jobs and linker pool at 2, and it finished in 30 minutes. That’s less than half the time with three times more resources. As expected, it’s flattening out, but how much more can we add to be profitable?

I don’t have an infinite number of Pandas (nor do I want to spend all my time on this), so I cheated and got a curve-fitting program (xcrvfit, in case you’re wondering), cooked up an exponential that was close enough to the points and used the software’s ability to do a best fit. It came out with 86.806*exp(-0.58505*x) + 14.229 which, according to Lybniz, flattens out after 4 boards (at about 20 minutes).
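Plugging the board counts back into that fit gives a quick sanity check: t(3) = 86.806·e^(−0.58505·3) + 14.229 ≈ 29 minutes (close to the measured 30), t(4) ≈ 23 minutes, and no amount of extra boards can ever push the build below the 14.2-minute asymptote.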

Pump Mode

Distcc has a special mode called pump mode, in which it pushes, along with the C file, all the headers necessary to compile it entirely on the slave node. Normally, distcc will pre-process the source on the master node and send the pre-processed result to the slaves, which turn it into object code. According to the manual, this could improve performance 10-fold! Well, my results were a little less impressive: my 3-Panda cluster finished in just about 34 minutes, 4 minutes more than without pump mode, which is puzzling.
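For reference, pump mode is driven from the shell roughly like this (the IPs are placeholders): each remote host entry gets the cpp,lzo options and the build command is wrapped in distcc’s pump script.

    export DISTCC_HOSTS='localhost 192.168.0.2,cpp,lzo 192.168.0.3,cpp,lzo'
    pump ninja -j9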

I could clearly see that the files were being compiled on the slaves (distccmon-text would tell me that, whereas before there were a lot of “preprocessing” jobs on the master), but Ninja doesn’t print times on each output line for me to guess what could have slowed it down. I don’t think there was any effect on the linker process, which was still pool-limited in this mode.

Conclusion

Simply put, both distcc and Ninja pools have shown themselves to be worthy tools. On slow hardware, such as the Pandas, distributed builds can be an option, as long as you have a good balance between compilation and linking. Ninja could be improved to help distcc link on remote nodes as well, but that’s a wish I would not press on the team.

However, scaling only to 4 boards takes away a lot of the value for me, since I was expecting to use 16/32 cores. The main problem is, again, the linker jobs running solely on the master node, with LLVM having lots and lots of libraries and binaries. Ninja’s pools can also work well when compiling LLVM+Clang in debug mode, since the objects are many times bigger, and even on an above-average machine you can start swapping, or even freeze the machine, if you’re also running GUI programs (browsers, editors, etc.).

In a nutshell, the technology is great and works as advertised, but with LLVM it might not yet be the right fit. It’s still more profitable to get faster hardware, like the Chromebooks, which are 3x faster than the Pandas and cost only marginally more.

It would also be good to know why pump mode has regressed in performance, but I have no more time to spend on this, so I leave it as an exercise to the reader. 😉

LLVM Vectorizer

Now that I’m back working full-time with LLVM, it’s time to get some numbers about performance on ARM.

I’ve been digging into the new LLVM loop vectorizer and I have to say, I’m impressed. The code is well structured, extensible and, above all, sensible. There is lots of room for improvement, and the code is simple enough that you can do it without destroying the rest or having to re-design everything.

The main idea is that the loop vectorizer is a Loop Pass, which means that if you register this pass (automatically at -O3, or with the -loop-vectorize option), the Pass Manager will run its runOnLoop(Loop*) function on every loop it finds.
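For those unfamiliar with loop passes, this is roughly what the boilerplate looks like under the legacy pass manager (the class name and flag below are made up; only the LoopPass plumbing is real):

    #include "llvm/Pass.h"
    #include "llvm/Analysis/LoopPass.h"
    using namespace llvm;

    namespace {
    // A do-nothing loop pass, just to show where runOnLoop() fits in.
    struct ExampleLoopPass : public LoopPass {
      static char ID;
      ExampleLoopPass() : LoopPass(ID) {}

      // The pass manager calls this once for every loop it finds.
      virtual bool runOnLoop(Loop *L, LPPassManager &LPM) {
        // Inspect or transform *L here; return true only if the IR changed.
        return false;
      }
    };
    }

    char ExampleLoopPass::ID = 0;
    static RegisterPass<ExampleLoopPass> X("example-loop-pass",
                                           "An example loop pass");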

The three main components are:

  1. The Loop Vectorization Legality: Basically identifies whether it’s legal (not just possible) to vectorize. This includes checking that we’re dealing with an inner loop, and that it’s big enough to be worth it, and making sure there aren’t any conditions that forbid vectorization, such as overlaps between reads and writes or instructions that don’t have a vector counterpart on a specific architecture. If nothing is found to be wrong, we proceed to the second phase:
  2. The Loop Vectorization Cost Model: This step evaluates both versions of the code: scalar and vector. Since each architecture has its own vector model, it’s not possible to create a common model for all platforms, and in most cases it’s the special behaviour that makes vectorization profitable (like 256-bit operations in AVX), so we need a bunch of cost model tables that we consult given an instruction and the types involved. Also, this model doesn’t know how the compiler will lower the scalar or vectorized instructions, so it’s mostly guess-work. If the vector cost (normalized to the vector size) is less than the scalar cost, we do:
  3. The Loop Vectorization: Which is the vectorization proper, i.e. walking through the scalar basic blocks, changing the induction range and increment, creating the prologue and epilogue, promoting all types to vector types and changing all instructions to vector instructions, taking care to leave the interaction with the scalar registers intact. This last part is a dangerous one, since we can end up creating a lot of copies from scalar to vector registers, which is quite expensive and was not accounted for in the cost model (remember, the cost model is guess-work based). A loop that makes it through all three phases is sketched right after this list.
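Here is the kind of inner loop that sails through all three phases; the comments map it to the steps above (a sketch of what the vectorizer reasons about, not its output):

    // Legality: a single inner loop with unit-stride accesses; if 'a' and 'b'
    // may alias, a runtime overlap check can be inserted before the vector body.
    // Cost model: on a 128-bit SSE/NEON target, four floats fit in one vector
    // register, so the vector body runs roughly a quarter of the iterations.
    // Vectorization: the induction variable is bumped by 4 per iteration and a
    // scalar epilogue handles the remaining (n % 4) elements.
    void saxpy(float *a, const float *b, float k, int n) {
      for (int i = 0; i < n; ++i)
        a[i] += k * b[i];
    }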

All that happens in a new loop place-holder, and if all is well at the end, we replace the original basic blocks with the new vectorized ones.

So, the question is: how good is this? Well, depending on the problems we’re dealing with, vectorizers can considerably speed up execution, especially for iterative algorithms with lots of loops, like matrix manipulation, linear algebra, cryptography, compression, etc. In more practical terms, anything to do with encoding and decoding media (watching or recording videos, pictures, audio), Internet telephony (compression and encryption of audio and video), and all kinds of scientific computing.

One important benchmark for that kind of workload is Linpack. Not only does Linpack have many examples of loops waiting to be vectorized, it’s also the benchmark that defines the Top500 list, which ranks the fastest computers in the world.

Benchmarks

Both GCC and Clang now have their vectorizers turned on by default with -O3, so comparing them is as simple as compiling the programs and watching them fly. But since I’m also interested in seeing what the performance gain is with just the LLVM vectorizer, I also ran Clang with plain -O3, vectorizer disabled.

On x86_64 Intel (Core i7-3632QM), I got these results:

Compiler    Opt    MFLOPS  Diff
   Clang    -O3      2413  0.0%
     GCC  -O3 vec    2421  0.3%
   Clang  -O3 vec    3346 38.6%

This is some statement! The GCC vectorizer has existed for a lot longer than LLVM’s and has been developed by many vectorization gurus, yet LLVM seems to easily beat GCC in this field. But, a word of warning: Linpack is by no means representative of all use cases and user-visible behaviour, and it’s very likely that GCC will beat LLVM in most other cases. Still, a reason to celebrate, I think.

This boost means that, in many cases, the transformations are not only legal and correct (or Linpack would have produced wrong results), but they also manage to generate faster code at no discernible cost. Of course, the theoretical limit is around a 4x boost (a 256-bit AVX register holds four doubles, and that assumes you manage to replace every single scalar instruction with a vector one and the CPU behaves the same with respect to branch prediction, caches, etc.), so one could have expected an even higher number, something on the order of a 2x improvement.

It depends on the computational density we’re talking about. Linpack specifically tests the inner loops of matrix manipulation, so I’d expect a much higher ratio of improvement, something around 3x or even closer to 4x. VoIP calls, watching films and listening to MP3s are also good examples of densely packed computation, but since we’re usually running those applications on a multi-tasking operating system, you’ll rarely see improvements higher than 2x. And general applications rarely spend that much time in inner loops (they’re mostly waiting for user input and then doing a bunch of unrelated operations, hardly vectorizable).

Another important aspect of vectorization is that it saves a lot of battery juice. For MP3 decoding it doesn’t really matter whether you finish in 10 seconds or 5, as long as the music doesn’t stop to buffer. But taking 5 seconds instead of 10 means that for the other 5 seconds the CPU can reduce its voltage and save battery. This is especially important in mobile devices.

What about ARM code?

Now that we know the vectorizer works well, and the cost model is reasonably accurate, how does it compare on ARM CPUs?

It seems the grass is not so green on this side, at least not at the moment. I’ve had reports that on ARM it also reached a 40% boost similar to Intel’s, but what I saw was a different picture altogether.

On a Samsung Chromebook (Cortex-A15) I got:

Compiler   Opt    MFLOPS   Diff
  Clang    -O3       796   0.0%
    GCC  -O3 vec     736  -8.5%
  Clang  -O3 vec     773  -2.9%

The performance regression can be explained by the amount of scalar code intermixed with vector code inside the inner loops as a result of shuffles (movement of data within the vector registers and between scalar and vector registers) not being lowered correctly. This most likely happens because the LLVM back-end relies a lot on pattern-matching for instruction selection (a good thing), but the vectorizers might not be producing the shuffles in the right pattern, as expected by each back-end.

This can be fixed by tweaking the cost model to penalize shuffles, but it’d be good to see if those shuffles aren’t just mismatched against the patterns that the back-end is expecting. We will investigate and report back.

Update

Got results for single precision floating point, which show a greater improvement on both Intel and ARM.

On x86_64 Intel (Core i7-3632QM), I got these results:

Compiler   Opt   MFLOPS  Diff
   Clang   -O3     2530   0.0%
     GCC -O3 vec   3484  37.7%
   Clang -O3 vec   3996  57.9%

On a Samsung Chromebook (Cortex-A15) I got:

Compiler   Opt   MFLOPS  Diff
   Clang   -O3      867   0.0%
     GCC -O3 vec    788  -9.1%
   Clang -O3 vec   1324  52.7%

This goes to show that the vectorizer is, indeed, working well for ARM, but that for double precision the costs of using the VFP/NEON pipeline outweigh the benefits. Remember that NEON vectors are only 128 bits wide and VFP registers only 64 bits wide, and NEON has no double-precision floating-point operations, so the core will only do one double-precision floating-point operation per cycle, and the theoretical maximum depends on the speed of the soft-fp libraries.

So, in the future, what we need to work on is the cost model, to make sure we don’t regress in performance, and on getting better algorithms when lowering vector code (both by making sure we match the patterns that the back-end is expecting, and by simply finding better ways of vectorizing the same loops).

Conclusion

Without further benchmarks it’s hard to come to a final conclusion, but it’s looking good, that’s for sure. Since Linpack is part of the standard LLVM test-suite benchmarks, fixing this and running it regularly on ARM will at least avoid any further regressions… Now it’s time to get our hands dirty!

 

Declaration of Internet Freedom

We stand for a free and open Internet.

We support transparent and participatory processes for making Internet policy and the establishment of five basic principles:

  • Expression: Don’t censor the Internet.
  • Access: Promote universal access to fast and affordable networks.
  • Openness: Keep the Internet an open network where everyone is free to connect, communicate, write, read, watch, speak, listen, learn, create and innovate.
  • Innovation: Protect the freedom to innovate and create without permission. Don’t block new technologies, and don’t punish innovators for their users’ actions.
  • Privacy: Protect privacy and defend everyone’s ability to control how their data and devices are used.

Don’t get it? You should get better informed about the power of the internet and what governments around the world have been doing to it.

Good starting places are: Avaaz, Ars Technica, Electronic Frontier Foundation, End Software Patents, Piratpartiet and the excellent Case for Copyright Reform.

Source: http://www.internetdeclaration.org/freedom

K-means clustering

Clustering algorithms can be used with many types of data, as long as you have a means to distribute them in a space where there is a concept of distance. Vectors are obvious choices, but not everything can be represented as N-dimensional points. Another way to plot data, one that is much closer to real data, is to allow for a large number of binary axes, like tags. You can then cluster by the number of tags the entries share, with the distance (meaningful only relative to other entries) being the proportion of shared tags against the non-shared ones.

An example of tag clustering can be viewed on Google News; an example of clustering in Euclidean spaces can be viewed in the image above (full code here). The clustering code is very small, and the result is very impressive for such simple code. But the devil is in the details…

Each group of red dots is generated randomly from a given central point (drawing N randomly distributed points inside a circle of radius R centred at C). Each centre is randomly placed, and sometimes the groups collide (as you can see in the image), but that’s part of the challenge. To find the groups, and their centres, I throw random points (with no knowledge of the groups’ centres) and iterate until I find all groups.

The iteration is very simple, and consists of two steps:

  1. Assignment Step: For each point, assign it to the nearest mean. This is why you need the concept of distance, and that’s a tricky part. With Cartesian coordinates, it’s simple.
  2. Update Step: Calculate the real mean of all points assigned to each mean point, and move the mean point there. This is basically moving the supposed (randomly guessed) mean to its rightful place. (A minimal sketch of both steps follows this list.)
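To make the two steps concrete, here’s a minimal 2-D sketch in C++ (not the code behind the image above, just an illustration; it runs a fixed number of iterations instead of checking how much the means moved):

    #include <vector>

    struct Point { double x, y; };

    // Squared Euclidean distance; good enough for nearest-mean comparisons.
    static double dist2(const Point &a, const Point &b) {
      double dx = a.x - b.x, dy = a.y - b.y;
      return dx * dx + dy * dy;
    }

    // points: the data; means: k random guesses, updated in place.
    void kmeans(const std::vector<Point> &points, std::vector<Point> &means,
                int iterations) {
      std::vector<int> assignment(points.size(), 0);
      for (int it = 0; it < iterations; ++it) {
        // Assignment step: attach each point to its nearest mean.
        for (int i = 0; i < (int)points.size(); ++i) {
          int best = 0;
          for (int m = 1; m < (int)means.size(); ++m)
            if (dist2(points[i], means[m]) < dist2(points[i], means[best]))
              best = m;
          assignment[i] = best;
        }
        // Update step: move each mean to the centroid of its assigned points.
        for (int m = 0; m < (int)means.size(); ++m) {
          double sx = 0, sy = 0;
          int count = 0;
          for (int i = 0; i < (int)points.size(); ++i)
            if (assignment[i] == m) { sx += points[i].x; sy += points[i].y; ++count; }
          if (count > 0) {
            means[m].x = sx / count;
            means[m].y = sy / count;
          }
        }
      }
    }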

On the second iteration, the means, which were randomly selected at first, are now closer to a set of points. Not necessarily points in the same cluster, but the cluster that has more points assigned to any given mean will slowly steal it from the others, since it’ll have more weight when the mean is updated in step 2.

If all goes right, the means will slowly move towards the centre of each group and you can stop when the means don’t move too much after the update step.

Many problems will arise in this simplified version, for sure. For instance, a mean can sit exactly between two groups, with both pulling it towards their centres with equally strong forces; the mean never moves, and the algorithm thinks it has found its group when, in fact, it has found two. Or a group can be so large that it ends up with two or more means belonging to it, splitting it into many groups.

To overcome these deficiencies, some advanced forms of k-means take into account the shape of the group during the update step, sometimes called soft k-means. Other heuristics can be added as further steps to make sure there aren’t two means too close to each other (relative to their groups’ sizes), or that there aren’t big gaps between points of the same group, but that kind of heuristic tends to be expensive to execute, since it examines every point of a group in relation to the other points of the same group.

All in all, it’s still an impressive performance for such a simple algorithm. Next in line, I’ll try clustering data distributed among many binary axes and see how k-means behaves.

Google knows what you searched last summer

Despite all the controversy, Google rolled out its new Privacy Policy last Thursday and, whether you like it or not, you are being watched.

Realistically, this is not far from what they were already doing: Google already tracked your searches, what you watch on YouTube and your emails.

But before March 1st, Google Plus, YouTube, Gmail and almost 60 other Google products were in different databases. With this change, the Google guys are giving themselves the right to put all those products in one big place, and to put one and one and one together to build a better and more complete picture of YOUR online behaviour. And to use it to chase YOU with their ads.

And you can’t opt out. If you want to use any Google product you are under their privacy policy.

It would be nonsense for me to tell you to stop using Google products. Almost everything you do on the internet today, from searches and emails to finding a street and comparing product prices, goes through a Google product or something related to it.

But you can at least reduce the amount of information that Google will be able to collect from you.

You can, for instance, delete your Google history by going to https://www.google.com/history/ and clicking the button “Remove all Web History”.

You can also configure your advertising settings here:  https://www.google.com/settings/u/0/ads/preferences/

You can edit your settings or even opt out.

 

Another way to “confuse” Google is to create a different account for each Google service (if you can keep up with all the usernames and passwords).

Or, when watching a video on Youtube or searching the Web, make sure you are not logged in to your Google account.

There is also the possibility of using browser plugins that help protect your data, or even anonymous proxies.

But the truth is, as soon as you type into your computer, click anything, visit a page, talk through Skype, or even talk on a telephone (mobile or fixed), those who want to can spy on you.

At least now Google is coming clean and telling you that they are spying on you. That makes more sense to me than living in a fool’s paradise, where you still believe you have control over your life.

Emergent behaviour

There is a lot of attention on emergent behaviour nowadays (e.g. here, here, here and here), but it’s still on the outskirts of science and computing.

Science

For millennia, science has isolated each single behaviour of a system (or system of systems) to study it in detail, then joined them together to grasp the bigger picture. The problem is that this approximation can only be done with simple systems, such as the ones studied by Aristotle, Newton and Ampère. Every time scientists approached the edges of their theories (including those three), they just left the rest as an exercise to the reader.

Newton foresaw relativity and the possible lack of continuity in space and time, but he did nothing to address it. Fair enough, his work was much more important to science than venturing into those unknowns; it would have been almost mystical of him to try (although he was an alchemist). But more and more, scientific progress seems to be blocked by chaos theory, where you either unwind the knots or go back to alchemy.

Chaos theory has existed for more than a century, but it was only recently that it started being applied to anything outside differential equations. The hyper-sensitivity to initial conditions is clear in differential systems, but other systems have a less visible, though no less important, sensitivity. We just don’t see it well enough, since most other systems are not as well formulated as differential equations (thanks to Newton and Leibniz).

Neurology and the quest for artificial intelligence have raised a strong interest in chaos theory and fractal systems. The development of neural networks has shown that groups and networks also have a fundamental chaotic nature, but more importantly, that it’s only through the chaotic nature of those systems that you can get a good amount of information out of them. Quantum mechanics had the same evolution, with Heisenberg and Schroedinger kicking the ball first on the oddities of the universe and how important the lack of knowledge about a system is for extracting information from it (think of Schroedinger’s cat).

A network with direct and fixed thresholds doesn’t learn. Particles with known positions and velocities don’t compute. N-body systems with definite trajectories don’t exist.

The genetic code has some similarities to these models. Living beings have far more junk than genes in their chromosomes (reaching 98% junk in the human genome), but changes in the junk parts can often lead to invalid creatures. If junk within genes (introns) gets modified, the actual code (exons) could be spliced differently, leading to a completely new, dysfunctional protein. Or, if you add start sequences (TATA boxes) to a non-coding region, some of it will be transcribed into whatever protein it can make, creating rubbish within cells, consuming resources or eventually killing the host.

But most of the non-coding DNA is also highly susceptible to changes, and that’s probably its most important function: it is adapted to the specific mutation rates of our planet and is our defence mechanism against such mutations. For billions of years, the living beings on Earth have adapted that code. Each of us carries a super-computer that can choose, by design, the best ratios for a given scenario within a few generations, and create a whole new species or keep the current one adapted, depending on what’s more beneficial.

But not everyone is that patient…

Programming

Sadly, in my profession, chaos plays an important part, too.

As programs grow old and programmers move on, a good part of the code becomes stale, creating dependencies that are hard to find and harder to fix. In that sense, programs are pretty much like the genetic code: the amount of junk increases over time, and that gives the program resistance against changes. The main problem with computing, which is not so clear in genetics, is that the code that stays behind is normally the code that no one wants to touch, and thus the ugliest and most problematic.

The DNA transcription machinery doesn’t care where the genes are; it finds a start sequence and goes on with its life. Programmers, we believe, have free will, and that gives them the right to choose where to apply a change. They can either work around the problem, making the code even uglier, or they can go on and try to fix the problem in the first place.

Non-programmers would quickly state that only lazy programmers would do the former, but more experienced ones will admit to having done so on numerous occasions and for different reasons. Good programmers do it because fixing the real problem is so painful to so many other systems that it’s best left alone, with that part to be replaced in the future (only it never will be). Bad programmers are not just lazy; some of them really believe it’s the right thing to do (I’ve met many like this), and that adds some more chaos into the game.

It’s not uncommon to try to fix a small problem, get more than half-way through and hit a small glitch in a separate system. A glitch that you quickly identify as being wrongly designed, so you, as any good programmer would, re-design it and implement the new design, which is already much bigger than the fix itself. All tests pass, except one, which shows you another glitch, raised by your new design. This can go on indefinitely.

Some changes are better done in packs, all together, to make sure all designs are consistent and the program behaves as it should, not necessarily as the tests say it would. But that’s not only too big for one person at one time, it’s practically impossible when other people are changing the program under your feet, releasing customer versions and changing the design themselves. There is a point where a refactoring is not only hard, but also a bad design choice.

And that’s when code becomes introns, and is seldom removed.

Networks

The power of networks is rising, though slower than expected. For decades people have known about synergy, chaos and emergent behaviour, but it was only recently, with the quality and amount of information on global social interaction, that those topics have risen again into the main picture.

Twitter, Facebook and the like have raised so many questions about human behaviour, and a lot of research has been done to address those questions and, to a certain extent, answer them. Psychologists and social scientists have known for centuries that social interaction is greater than the sum of its parts, but now we have the tools and the data to prove it once and for all.

Computing clusters have been applied to most of the hard scientific problems for half a century (weather prediction, earthquake simulation, exhaustion proofs in graph theory). They have also taken on a commercial side with MapReduce and similar mechanisms that popularised distributed architectures, but that’s only the beginning.

In the distributed systems of today, emergent behaviour is treated as a bug that has to be eradicated. In the exact science of computing, locks and updates have to occur in the precise order they were programmed to, to yield the exact result one is expecting. No more, no less.

But to keep our systems free of emergent behaviours, we invariably introduce emergent behaviour into our code. Multiple checks on locks and variables, different design choices for components that have to work together, and the expectation of precise results or nothing make the number of lines of code grow exponentially. And, since all that has to run fast, even more lines and design choices are added to avoid one extra operation inside a very busy loop.

While all this is justifiable, it’s not sustainable. In the long run (think decades), the code will be replaced or the product will be discontinued, but there is a limit to how many lines a program can receive without losing some others. And the cost of refactoring increases with the lifetime of a product. This is why old products don’t get many updates: not because they’re already good enough, but because it’s impossible to add new features without breaking a lot of others.

Distant future

As much as I like emergent behaviour, I can’t begin to fathom how to harness that power. Stochastic computing is one way, and it has been done with a certain level of success here and here, but it’s far from easy to create a general logic behind it.

Unlike Turing machines, emergent behaviour comes from multiple sources, dressed in multiple disguises and producing far too much variety in results to be accounted for in one theory. It’s similar to string theory, where there are several variations of it, but only one M-theory, the one that joins them all together. The problem is, nobody knows what this M-theory looks like. Well, they barely know what the different versions of string theory look like, anyway.

In that sense, emergent theory is even further than string theory from being understood in its entirety. But I strongly believe that this is one way out of the conundrum we live in today, where adding more features makes it harder to add more features (like mass at relativistic speeds).

With stochastic computing there is no need for locks, since all that matters is the probability of an outcome, and precise values do not make sense. There is also no need for NxM combinations of modules and special checks, since the power is not in the computations themselves, but in the meta-computation, done by the state of the network rather than by its specific components.

But that, I’m afraid, I won’t see in my lifetime.

FreeCell puzzles solver API

This is a little pet project I did a while ago: a FreeCell puzzle solver API.

The idea is to provide a basic validation engine and board management (pretty much like my old chess validation), so people can write FreeCell solvers on top of it. It has basic board setup (of multiple sizes), move legality checking, and a basic Solver class, which you must derive from to create your own solvers.
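In rough strokes, deriving your own solver looks something like this (the class and member names here are hypothetical, not the project’s actual API):

    #include <vector>

    struct Move { int from, to; };  // a single card movement (illustrative)
    class Board;                    // cascades, free cells and foundations

    // Base class: derive from it and implement your own search strategy.
    class Solver {
    public:
      explicit Solver(Board &b) : board(b) {}
      virtual ~Solver() {}
      // Return the moves that solve the board, or an empty list on failure.
      virtual std::vector<Move> solve() = 0;
    protected:
      Board &board;
    };

    // A brute-force strategy would try every legal move the board exposes,
    // recursing and backtracking on dead ends.
    class MyDepthFirstSolver : public Solver {
    public:
      explicit MyDepthFirstSolver(Board &b) : Solver(b) {}
      virtual std::vector<Move> solve();
    };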

There’s even a BruteForceSolver that can solve a few small boards, which gives you an idea of how to create your own solvers. However, the API is not yet clear enough for children to start playing with it, and that’s the real goal of this project: to get kids interested in solving complex optimisation problems in an easy way.

FreeCell is a perfect game for this. Most boards can be solved (only a handful of them have been proven – by exhaustion – unsolvable), some movements can be rolled back, and you can freely bring cards that have already been placed in the foundations (their final destination) back into the game.

It’s out of the scope of this project to produce a full-featured graphical interface for kids, but making the API easy enough that they understand the concepts without dragging themselves into the idiosyncrasies of C++ is important.

Compiler optimisations

The reason I did this was to make some of the optimisations compiler engineers have to do more appealing to non-compiler engineers, or to children with a taste for complex problems. But what does this have to do with compilers? The analogy is a bit far-fetched and somewhat reversed, but it’s interesting nevertheless and it was worth a shot.

Programming languages are transformed into graphs inside the compiler, which should represent the intentions of the original programmer. These graphs are often optimised multiple times until you end up with a stream of instructions representing the source code in the machine language.

Ignore for now the optimisations on those graphs, and focus on the final part: selecting machine instructions that can represent that final graph in the machine language (assembly). This selection can pick any assembly instruction at will, but it has to put them in a very specific order to represent the same semantics (not just the same syntax) as the original program. Since many instructions have side effects, pipeline interactions, or special flags set or cleared, it’s not trivial to produce correct code if you don’t check and re-check all conditions every time. This is a known complex optimisation problem and can be responsible for changes of orders of magnitude in speed or code size.

What does this have to do with the FreeCell puzzle? Well, in FreeCell, you have a number of cards and you have to put them in a specific order, just like assembly instructions. But in this case the analogy is reversed: the “sequence” is trivial, but the “instructions” are hard to get.

There are other similarities. For example, you have four free cells, and they can only hold one value at a time. They are similar to registers, and manipulating them gives you a good taste of how hard it is to do register scheduling when building the assembly result. But in this case it’s much harder to spill (move data back to memory, or in this case, cards back to the cascades), since there are strict rules on how to move cards around.

Reusing cards from the foundations is similar to expanding single instructions into a sequence of them in order to circumvent pipeline stalls. In a real compiler you could expand a multiply+add (very useful for digital signal processing) into two instructions, multiply and add, if that gives you some advantage in special cases on special chips (see the sketch below). In FreeCell, you can use a 9 on top of a 10 to move an 8 from another cascade and free up a card that you need to clean up your free cells (registers).
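For the compiler side of that analogy, here is roughly what that expansion looks like; the operands in the comments are symbolic, not a real register allocation:

    // The same source can be lowered as one fused instruction or as two
    // separate ones, depending on what helps the pipeline on a given chip.
    float madd(float a, float b, float c) {
      return a * b + c;
      // fused:     vmla.f32 c, a, b    ; c += a * b  (one multiply-accumulate)
      // expanded:  vmul.f32 t, a, b    ; t  = a * b
      //            vadd.f32 c, t, c    ; c  = t + c
    }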

I’m sure you can find many more similarities, even if you have to skew the rules a bit (or completely reverse them), but that’s not the point. The point is to interest people into complex optimisation techniques without the hassle of learning a whole new section of computer science, especially if that section puts fear in most people in the first place.

2011 European User Group Meeting

For all LLVM users in Europe who didn’t catch the announcement on the main list, be warned that we’re organising the European LLVM User Group Meeting on September 16th, in London. All the information necessary to register and submit your work is at the link.

It’s free of charge and we’ll also provide a happy hour after the meeting for some extended networking. If you’re just curious about LLVM, work with it day to day, or are thinking of starting your PhD with LLVM, please register and show up. You won’t regret it. 😉

More information will be available on the link above and through the main mailing list. Stay tuned!

Task Driven Computing

Ever since Moore’s idea became a law (by providence), and empires were built upon this law, little thought has been given to the need for such advancements. Raw power is considered by many to be the only real benchmark by which one machine can be compared to another. Cars, computers and toasters are all alike in those matters, and are only as good as their raw throughput (real or not).

With the carbon footprint disaster, some people began to realise (not for the correct reasons) that maybe we don’t actually need all that power to be happy. Electric cars, low-powered computers and smart appliances now appeal to the end consumer and, for good or bad, things are changing. The rocketing growth of the mobile market (smartphones, netbooks and tablets) in recent years is a good indicator that the easily seduced consumer mass is now being driven towards leaner, more efficient machines.

But how lean are we ready to go? How much raw power are we willing to give away? In other words, how far does the appeal the media push on us go towards making us relinquish those rights bestowed by Moore? Not very far, it seems, with all the chip companies fighting for a piece of the fat market (as well as the lean one).

What is the question, anyway?

Ever since that became a trend, the question has always been: “how lean can we make our machines without impacting usability?”. The focus so far has been on creating smarter hardware and, to a lesser extent (and only recently), on reducing the unneeded fat of operating systems and applications, but no one ever touches the fundamental question: “Do we really need all that?“.

The question is clearly cyclic. For example, you wouldn’t need a car if public transport were decent. You wouldn’t need health insurance if the public health system were perfect, and so on. With computing it’s the same. If you rely on a text editor or a spreadsheet, it has to be fast and powerful so you can finish your work on time (and not get fired). If you are a developer and have to re-compile your code every so often, you need a damn good computer (in CPU and memory) to make it as painless as possible. Having a slow computer can harm the creative process that surrounds all those tasks, and degrade the quality of your work by an unknown amount.

Or does it?

If you didn’t have to finish your work quicker, would you still work the same way? What if you didn’t have to save your work, or install additional software, just because the system you’re working on is only available on a particular type of computer (say, the one at your workplace)? If you could perform tasks as tasks, and not as a whole sequence of meaningless steps and bureaucracy, would you still take that amount of time to finish them?

Real world

Even though the real world is not that simple, one cannot take the whole of reality into account in every investigation. Science just doesn’t work that way. To be effective, you take out all but one variable and test it. One by one, until you have a simplified picture, a model of reality. If at every step you include the whole world, the real world, in your simulations, you won’t get far.

There is one paper that touched on some of these topics back in 2000, and little has changed since then. I dare say it has actually got worse: all these app stores competing for publicity and forcing incompatibility through invisible boundaries have only made matters worse. It seems clear enough to me that the computing world, for as far back as I can remember (the early 80s), was always like that, and it’s not showing signs of change so far.

The excuse for keeping on doing the wrong thing (i.e. not thinking clearly about what a decent system is) was always that “the real world is not that simple”, but in fact the only limiting factor has been the greed of investors, who cannot begin to understand that a decent system can bring more value (not necessarily money) than any quickly designed and delivered piece of software available today.

Back in the lab…

Because I don’t give a fig about what they think, I can go back to the lab and think clearly. Remove greed, profit and market from the table. Leave users, systems and what’s really necessary.

Computers were (long before Turing) meant to solve specific problems. Today, general-purpose computers create more problems than they solve, so let’s go back to what the problem is and try to solve it without any external context: tasks.

A general-purpose computer can perform a task in pretty much the same way as any other; after all, that’s why they’re called “general purpose”. So the system that runs on it is irrelevant: if it does not perform the task, it’s no good. A good example of this is web browsers. Virtually every browser can render a page, and they show surprisingly similar results. A bad example is text editors, most of which won’t even open one another’s documents, and if they do, the former will do all in its power to make the result look horrid in the latter.

Supposing tasks can be done seamlessly on any computer (let’s assume web pages for the moment), does the computer only compute that task, or is it doing other things as well?

All the computers I know of will keep running, even if broken, until they’re turned off. Some can increase and decrease their power consumption, but they’ll still be executing instructions until the world’s end. According to our least-work principle (execute tasks, nothing more), this constant running is not relevant to the tasks themselves, so we must take it out of our system.

Thus, such a computer can only execute when a task is requested; it must complete that task (and nothing more), and stop (really, zero watts consumed) right after that.

But this is madness!

A particular task can take longer to execute, yes. It’ll be more difficult to execute simultaneous tasks, yes. You’ll spend more cycles per task than usual, yes! So, if you’re still thinking like Moore, then this is utter madness and you can stop reading right now.

Task Driven Computing

For those who are still with me, let me try to convince you. Around 80% of my smartphone’s battery is consumed by the screen. The rest is generally spent on background tasks (system daemons) and only about 5% on real tasks. So, if you could remove 95% of your system’s consumption, you could spend 20x more power on your tasks and break even.

Note that I didn’t say “20x the time”, for that’s not necessarily true. The easiest way to run multiple tasks at the same time is to have multiple CPUs in a given system. Today that doesn’t scale too well, because the operating system has to control them all, and they all just keep running (even when idle), wasting a huge amount of power for nothing.

But if your system is designed not to control anything but to execute tasks, then even though you’ll spend more time per task, you’ll have more CPUs working on tasks and fewer on background maintenance. Also, once a task is done, the CPU can literally shut down (I mean, zero watts) and wait for the next task. There is no idle cost; there is no operational code being run to multi-task, protect memory or avoid race conditions.

Problems

Of course, it’s not as easy as it sounds. Turning CPUs on and off is not trivial, running tasks with no OS underneath (and expecting them to communicate) is not an easy task, and fitting multiple processors into a small chip is very expensive. But, as I said earlier, I’m not concerned with investors, markets or money; I’m concerned with technology and its real purpose.

Scaling is also a real problem. Connection Machines were built and thrown away, clusters have peak performance way above their average performance levels, and multi-core systems are hard to work with. Part of that is real, the interconnection and communication parts, but the rest was artificially created by operating systems solving new problems in an old way, just because it was cheaper, or quicker, or easier.

Back in the days…

I envy the time of the savants, when they had all the time and money in the world to solve the problems of nature. Today, the world is corrupted by money and even the most prominent minds in science are corrupted by it, trying to be the first to do such and such, shielding research from their peers just to claim a silly Nobel prize or to become world famous.

The laws of physics have led us into this: we live in the local minimum of the least energetic configuration possible, and that’s here, now. To get out of any local minimum we need a good kick, something that will knock us into a more energetic configuration, and with enough luck we’ll fall into another local minimum that is less energetic than this one. Or, if we’re really the masters of the universe, maybe we can even live harmoniously in a place of local maximum, who knows!?