Open Source and Profit

I have written extensively about free, open source software as a way of life, and now, reading back my own articles from the past 7 years, I realize that I was wrong about some of the ideas, or about the state of open source culture within and around companies.

I’ll make a bold statement to start, to get you interested in reading past the introduction, and I hope to give you enough arguments to prove I’m right. Feel free to disagree in the comments section.

In the years to come, business and profit can only thrive if surrounded by free thoughts.

By free thoughts I mean free/open source software, open hardware, open standards, free knowledge (both free as in beer and as in speech), etc.

Past Ideas

I began my quest to understand the open source business model back in 2006, when I wrote that open source was not just software, but also speech. Having open source (free) software is not enough when the reasons why the software is free are not clear. The reason is that synergy, where the whole is greater than the sum of the individual parts, can only be achieved if people have the rights (and incentives) to reach out on every possible level, not just the source code or the hardware. I made that clearer in 2009, when I exposed the problems of writing closed source software: there is no ecosystem on which to rely, so progress is limited and the end result is always less efficient, since the costs of making it as efficient are too great and would drive the price of the software too high to be profitable.

In 2008 I saw both sides of the story, for and against Richard Stallman, on the legitimacy of proprietary control, be it via copyright licences or proprietary software. I may have come a long way, but I was never against his idea of the perfect society, Richard Stallman’s utopia, or as some friends put it: The Star Trek Universe. The main difference between me and Stallman is that he believes we should fight to the last man to protect ourselves from the evil corporations and their abuse of software, while I still believe it’s impossible for them to sustain this empire for too long. His utopia will come, whether they like it or not.

Finally, in 2011 I wrote about how copying (and even stealing) is the only business model that makes sense (Microsoft, Apple, Oracle etc. are all thieves, in that sense), and the number of patent disputes and copyright infringement cases should serve to prove me right. Last year I think I finally had the epiphany, when I discussed all these ideas with a friend and came to the conclusion that I don’t want to live in a world where it’s not possible to copy, share, derive or distribute freely. Without the freedom to share, our hands will be tied when defending against oppression, and it might just be a coincidence, but in the last decade we’ve seen the biggest growth of both disproportionate property protection and disproportionate governmental oppression that the free world has ever seen.

Can it be different?

Stallman’s argument is that we should fiercely protect ourselves against oppression, and I agree, but after being around business and free software for nearly 20 years, I have so far failed to see a business model in which starting everything from scratch, in a secret lab, and releasing the product ready for consumption makes any sense. My view is that society partakes in an evolutionary process that is ubiquitous and compulsory, in which it strives to reduce the cost of the whole process, towards stability (even if local), as much as any other biological, chemical or physical system we know.

So, to prove my argument that an open society is not just desirable, but the only final solution, all I need to do is show that this is the least-energy state of the social system. Open source software, open hardware and all systems where sharing is at the core should then be the least costly business models, so as to force virtually all companies in the world to follow suit, and create Stallman’s utopia as a result of natural stability, not a forced state.

This is crucial, because every forced state is non-natural by definition, and every non-natural state has to be maintained by using resources that could otherwise be used to enhance the quality of the lives of the individuals in the system (be they human or not; let’s not narrow our point of view this early). To achieve balance in a social system we have to let things go awry for a while, so that the arguments against such a state are perfectly clear to everyone involved, and no argument remains that the current state is optimal. If there isn’t discomfort, there isn’t the need for change. Without death, there is no life.

Profit

Of all the bad ideas we humans have had on how to build a social system, capitalism is probably one of the worst, but it’s also one of the most stable, and that’s because it’s the closest to the rule of the jungle, survival of the fittest and all that. Regulations and governments never came to actually protect the people, but to protect capitalism from itself, and to keep increasing the profit of the profitable. Socialism and anarchy rely too much on forced states, in which individuals have to be devoid of selfishness, a state that doesn’t exist in the current form of human beings. So, while they’re the product of amazing analysis of the social structure, they still need heavy genetic changes in the constituents of the system to work properly, in a stable, least-energy state.

Having fewer angry people on the streets is more profitable for the government (lower security costs, more international trust in the local currency, more investment, etc.), so panem et circenses will always be more profitable than any real change. However, with more educated societies, a result of the increase in profits of the middle class, more real changes will have to be made by governments, even if wrapped in complete populist crap. One step at a time, the population will get more educated, and you’ll end up with more substance and less wrapping.

So, in the end, it’s all about profit. If not using open source software and hardware means things will cost more, the tendency will be to use them. And the more everyone uses them, the less valuable the products that don’t use them become, because the ecosystem in which applications and devices are immersed becomes the biggest selling point of any product. Would you buy a BlackBerry application or an Android application? Today, the answer is the latter for close to 80% of people, and that’s only because the rest don’t use the former at all.

It’s not just that building BlackBerry applications is more expensive because the system is less open and the tools less advanced; the profit margins are also smaller, and the return on investment will never justify the cost. This is why Nokia died with its own app store: Symbian was not free, and a better, free and open ecosystem was already in place. The battle had already been lost, even before it started.

But none of that was really due to moral standards, or Stallman’s bickering. It was only about profit. Microsoft dominated the desktop for a few years, long enough to make a stand and still be dominant after 15 years of irrelevance, but that was only because there was nothing better when they started, not by a long shot. However, when they tried to flood the server market, Linux was not only already relevant, but it was better, cheaper and freer. The LAMP stack was already good enough, and the ecosystem was so open that it was impossible for anyone with a closed development cycle to even begin to compete on the same level.

Linux became so powerful that, when Apple re-defined the concept of the smartphone with the iPhone (beating Nokia’s earlier attempts by light-years of quality), the Android system was created, evolved and dominated in less than a decade. The power to share made it possible for Google, a non-device, non-mobile company, to completely outperform a hardware manufacturer in a matter of years. If Google had invented a new OS, not based on anything existing, or if they had closed the source, as Apple did with FreeBSD, they wouldn’t have been able to compete, and Apple would still be dominant.

Do we need profit?

So, the question is: is this really necessary? Do we really depend on Google (specifically) to free us from the hands of tyrant companies? Not really. If it wasn’t Google, it’d be someone else. Apple, for a long time, was the odd guy in the room, and they created immense value for society: they gave us something to look for, they educated the world on what we should strive for in mobile devices. But once that’s done, the shareable ecosystem learns, evolves and dominates. That’s not because Google is less evil than Apple, but because Android is more profitable than iOS.

Profit here is not just the return on investment that you plan on having within a specific number of years; added to that is the potential of what the evolving ecosystem will allow people to do once you’ve long lost control over it. Shareable systems, including open hardware and software, allow people far down the planning, manufacturing and distribution chain to still make a profit, regardless of your original intentions. One such case is Maddog’s Project Cauã.

By using inexpensive Raspberry Pis, by fostering local development and production, and by enabling the local community to use all of that as a way of living, Maddog’s project is harnessing the power of open source work done by completely unrelated people to empower the people of a country that badly needs empowering. That new class of people, from this and other projects, is what is educating the population of the world, what is allowing people to fight for their rights, and it is the reason why so many civil uprisings are happening in Brazil, Turkey and Egypt.

Instability

All that creates instability, social unrest, whistle-blowing gone wrong (Assange, Snowden), and this is a good thing. We need more of it.

It’s only when people feel uncomfortable with how their governments treat them that they’ll get up from their chairs and demand change. It’s only when people are educated that they realise oppression is happening (since there is a force driving us away from the least-energy state, towards enriching the rich), and it’s only when these states are reached that real changes happen.

The more educated society is, the quicker people will rise to arms against oppression, and the closer we’ll be to Stallman’s utopia. So, whether governments and the billionaire minority like it or not, society will move towards stability, and that stability will settle into local minima. People will rest, and oppression will grow in an oscillatory manner until unrest happens again and throws us into yet another minimum state.

Since we don’t want to stay in a local minimum (we want to find the best solution, not just any solution), getting it close to perfect on the first attempt is not what matters; whether we get close the first time or not, the oscillatory nature of social unrest will not change, and nature will always find a way to get us closer to the global minimum.

Conclusion

Is it possible to stay in this unstable state for too long? I don’t think so. But it’s not going to be a quick transition, nor is it going to be easy, nor will we get it right on the first attempt.

But more importantly, reaching stability is not a matter of forcing us to move towards a better society, it’s a matter of how dynamic systems behave when there are clear energetic state functions. In physical and chemical systems, this is just energy, in biological systems this is the propagation ability, and in social systems, this is profit. As sad as it sounds…

Uno score keeper

With spring not coming any time soon, we had to improvise during the Easter break and play Uno every night. It’s a lot of fun, but it can take quite a while to find a piece of clean paper and a pen that works around the house, so I wondered if there was an app for that. It turns out, there wasn’t!

There were several apps to keep card game scores, but every one was specific to a game, they had ads, and they wanted access to the Internet, so I decided it was worth writing one myself. Plus, that would finally teach me to write Android apps, something I had been putting off for years.

The App

(Screenshots: adding new players; card game scores.)

The app is not just a Uno score keeper, it’s actually pretty generic. You just keep adding points until someone passes the threshold, at which point the poor soul will be declared a winner or a loser, depending on how you set up the game. Since we’re playing every night, even the 30 seconds I spent re-typing our names was adding up, so I made it save the last game in the Android tuple store, so you can retrieve it via the “Last Game” button.

It’s also surprisingly easy to use (I had no idea), but if you go back and forth inside the app, it clears the game and starts a new one with the same players, so you can play as many rounds as you want. I might add a button to restart (or leave the app) when there’s a winner, though.

I’m also thinking about printing the names in order at the end (from winner to loser), and some other small changes, but the way it is now is good enough to advertise and see what people think.

If you end up using it, please let me know!

Download and Source Code

The app is open source (GPL), so rest assured it has no tricks or money involved. Feel free to download it from here, and get the source code at GitHub.

Distributed Compilation on a Pandaboard Cluster

This week I was experimenting with distcc and Ninja on a Pandaboard cluster, and it behaves exactly as I expected, which is a good thing, but it might not be what I was looking for, which is not. 😉

Long story short, our LLVM buildbots were running very slowly, taking from 3 to 4.5 hours to compile and test LLVM. If you consider that at peak time (PST hours) there are up to 10 commits in a single hour, the buildbot will end up testing 20-odd patches at the same time. If it breaks in unexpected ways, or if there is more than one patch in a given area, it can be hard to spot the culprit.

We ended up just skipping the make clean step, which put us at around 15 minutes for build+tests, with the odd run taking 1 or 2 hours tops, which is a great deal. But one of the alternatives I was investigating was a distributed build. Especially with cluster nodes carrying dozens of ARM cores now becoming available, we could use such a cluster to speed up our native testing, and even do benchmarking in a distributed way. If we do it often enough, the sample might be big enough to account for the differences.

The cluster

So, I got three Pandaboards ES (dual Cortex-A9, 1GB RAM each) and put the stock Ubuntu 12.04 on them and installed the bare minimum (vim, build-essential, python-dev, etc), upgraded to the latest packages and they were all set. Then, I needed to find the right tools to get a distributed build going.

It took a bit of searching, but I ended up with the following tool-set:

  • distcc: The distributed build dispatcher, which knows about the other machines in the cluster and how to send them jobs and get the results back
  • CMake: A Makefile generator which LLVM can use, and it’s much better than autoconf, but can also generate Ninja files!
  • Ninja: The new intelligent builder which not only is faster to resolve dependencies, but also has a very easy way to change the rules to use distcc, and also has a magical new feature called pools, which allow me to scale job types independently (compilers, linkers, etc).

All three tools had to be compiled from source. Distcc’s binary distribution for ARM is too old, CMake’s version on that Ubuntu couldn’t generate Ninja files and Ninja doesn’t have binary distributions, full stop. However, it was very simple to get them interoperating nicely (follow the instructions).

You don’t have to use CMake; there are other tools that generate Ninja files, but since LLVM uses CMake, I didn’t have to do anything. What you don’t want is to generate the Ninja files yourself, it’s just not worth it. Unlike Make, Ninja doesn’t try to search for patterns and possibilities (this is why it’s fast), so you have to be very specific in the Ninja file about what you want to accomplish. This is very easy for a program to do (like CMake), but very hard and error-prone for a human (like me).

Distcc

To use distcc is simple:

  1. Prefix the compiler command with distcc (i.e. distcc compiler) in your Ninja rules;
  2. Set the environment variable DISTCC_HOSTS to the list of IPs that will be the slaves (including localhost);
  3. Start the distcc daemon on all slaves (not on the master): distccd --daemon --allow <MasterIP>;
  4. Run Ninja with a job count equal to the total number of CPUs across all machines, plus one per machine, e.g. ninja -j6 for 2 (dual-core) Pandaboards.

A local build of just LLVM (no Clang, no check-all) on a single Pandaboard takes about 63 minutes. With distcc and 2 Pandas it took 62 minutes!

That’s better, but not by as much as one would hope, and the reason is a bit obvious, but no less damaging: the linker! It took 20 minutes to compile all of the code, and 40 minutes to link it into executables. That happened because, while we had 3 compilation jobs on each machine, we had 6 linking jobs on a single Panda!

See, distcc can spread the compilation jobs as long as it copies the objects back to the master, but because a linker needs all the objects in memory to do the linking, it can’t do that over the network. What distcc could do, with Ninja’s help, is know which objects will be linked together and keep copies of them on different machines, so that you can link on separate machines, but that is not a trivial task, and it relies on a level of interoperation between the tools that they weren’t designed for.

Ninja Pools

And that’s where Ninja proved to be worth its name: Ninja pools! In Ninja, pools are named resources that group jobs together under a specific level of parallelism. You can say that compilers scale freely, but linkers can’t run more than a handful at a time. You simply create a pool called linker_pool (or anything you want), give it a depth of, say, 2, and annotate all linking jobs with that pool. See the manual for more details.

With the pools enabled, a distcc build on 2 Pandaboards took exactly 40 minutes. That’s a 33% gain with double the resources, not bad. But how does that scale if we add more Pandas?

How does it scale?

To get a third point (and be able to apply a curve fit), I added another Panda and ran again, with 9 jobs and the linker pool at 2, and it finished in 30 minutes. That’s less than half the original time with three times the resources. As expected, it’s flattening out, but how much more can we add and still profit?

I don’t have an infinite number of Pandas (nor do I want to spend all my time on this), so I just cheated: I got a curve-fitting program (xcrvfit, in case you’re wondering), cooked up an exponential that was close enough to the points and used the software’s best-fit ability. It came out as 86.806*exp(-0.58505*x) + 14.229 which, according to Lybniz, flattens out after 4 boards (at about 20 minutes per build).

Pump Mode

Distcc has a special mode called pump mode, in which it pushes, along with the C file, all the headers necessary to compile it solely on the node. Normally, distcc will pre-process on the master node and send the pre-processed result to the slaves, which convert it to object code. According to the manual, this could improve performance 10-fold! Well, my results were a little less impressive: my 3-Panda cluster finished in about 34 minutes, 4 minutes more than without pump mode, which is puzzling.

I could clearly see that the files were being compiled on the slaves (distccmon-text would tell me that, whereas before there were a lot of “preprocessing” jobs on the master), but Ninja doesn’t print times on each output line for me to guess what could have slowed it down. I don’t think there was any effect on the linker process, which was still enabled in this mode.

Conclusion

Simply put, both distcc and Ninja pools have shown themselves to be worthy tools. On slow hardware, such as the Pandas, distributed builds can be an option, as long as you have a good balance between compilation and linking. Ninja could be improved to help distcc link on remote nodes as well, but that’s a wish I would not press on the team.

However, scaling only to 4 boards takes away a lot of the value for me, since I was expecting to use 16/32 cores. The main problem is, again, the linker jobs running solely on the master node, and LLVM having lots and lots of libraries and binaries. Ninja’s pools can also work well when compiling LLVM+Clang in debug mode, since the objects are many times bigger, and even on an above-average machine you can start swapping or even freeze your machine if you’re using other GUI programs (browsers, editors, etc.).

In a nutshell, the technology is great and works as advertised, but for LLVM it might not yet be the right thing. It’s still more profitable to get faster hardware, like the Chromebooks, which are 3x faster than the Pandas and cost only marginally more.

It would also be good to know why pump mode has regressed in performance, but I have no more time to spend on this, so I leave it as an exercise to the reader. 😉

LLVM Vectorizer

Now that I’m back working full-time with LLVM, it’s time to get some numbers about performance on ARM.

I’ve been digging into the new LLVM loop vectorizer and I have to say, I’m impressed. The code is well structured, extensible and, above all, sensible. There’s lots of room for improvement, and the code is simple enough that you can do it without destroying the rest or having to re-design everything.

The main idea is that the loop vectorizer is a Loop Pass, which means that if you register this pass (automatically on -O3, or with -loop-vectorize option), the Pass Manager will run its runOnLoop(Loop*) function on every loop it finds.

The three main components are:

  1. The Loop Vectorization Legality: Basically identifies whether it’s legal (not just possible) to vectorize. This includes checking that we’re dealing with an inner loop, that it’s big enough to be worth it, and making sure there aren’t any conditions that forbid vectorization, such as overlaps between reads and writes or instructions that don’t have a vector counterpart on a specific architecture. If nothing is found to be wrong, we proceed to the second phase:
  2. The Loop Vectorization Cost Model: This step evaluates both versions of the code: scalar and vector. Since each architecture has its own vector model, it’s not possible to create a common model for all platforms, and in most cases it’s the special behaviour that makes vectorization profitable (like 256-bit operations in AVX), so we need a bunch of cost model tables that we consult given an instruction and the types involved. Also, this model doesn’t know how the compiler will lower the scalar or vectorized instructions, so it’s mostly guesswork. If the vector cost (normalized to the vector size) is less than the scalar cost, we do:
  3. The Loop Vectorization: Which is the vectorization proper, i.e. walking through the scalar basic blocks, changing the induction range and increment, creating the prologue and epilogue, promoting all types to vector types and changing all instructions to vector instructions, taking care to leave the interaction with the scalar registers intact. This last part is a dangerous one, since we can end up creating a lot of copies from scalar to vector registers, which is quite expensive and was not accounted for in the cost model (remember, the cost model is guesswork based).

All that happens on a new loop place-holder, and if all is well at the end, we replace the original basic blocks by the new vectorized ones.

So, the question is, how good is this? Well, depending on the problems we’re dealing with, vectorizers can considerably speed up execution. Especially iterative algorithms, with lots of loops, like matrix manipulation, linear algebra, cryptography, compression, etc. In more practical terms, anything to do with encoding and decoding media (watching or recording videos, pictures, audio), Internet telephony (compression and encryption of audio and video), and all kinds of scientific computing.
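As a concrete illustration, this is the kind of inner loop the vectorizer is after (a trivial sketch of my own; with no overlap between the arrays, the legality checks pass and the cost model gets to decide whether the vector version pays off):

// A trivial, vectorizable kernel: the same independent operation on every
// element, a simple induction variable and (assuming the arrays don't
// overlap) no dependences between iterations. Compiling at -O3 is enough
// for the loop vectorizer to consider it.
void scale_add(float a, const float *x, const float *y, float *out, int n) {
	for (int i = 0; i < n; ++i)
		out[i] = a * x[i] + y[i];
}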

One important benchmark for that kind of workload is Linpack. Not only does Linpack have many examples of loops waiting to be vectorized, but it’s also the benchmark that defines the Top500 list, which ranks the fastest computers in the world.

Benchmarks

So, both GCC and Clang now have their vectorizers turned on by default with -O3, so comparing them is as simple as compiling the programs and watching them fly. But, since I’m also interested in seeing what the performance gain is with just the LLVM vectorizer, I also ran Clang with plain -O3 and the vectorizer disabled.

On x86_64 Intel (Core i7-3632QM), I got these results:

Compiler    Opt    MFLOPS  Diff
   Clang    -O3      2413  0.0%
     GCC  -O3 vec    2421  0.3%
   Clang  -O3 vec    3346 38.6%

This is quite a statement! The GCC vectorizer has existed for a lot longer than LLVM’s and has been developed by many vectorization gurus, yet LLVM seems to easily beat GCC in this field. But, a word of warning: Linpack is by no means representative of all use cases and user-visible behaviour, and it’s very likely that GCC will beat LLVM in most other cases. Still, a reason to celebrate, I think.

This boost means that, in many cases, not only are the transformations legal and correct (or Linpack would have produced wrong results), but they also manage to generate faster code at no discernible cost. Of course, the theoretical limit is around a 4x boost (if you manage to replace every single scalar instruction with a vector one and the CPU behaves the same with regard to branch prediction, cache, etc.), so one might have expected an even higher number, something on the order of 2x.

It depends on the computation density we’re talking about. Linpack tests specifically the inner loops of matrix manipulation, so I’d expect a much higher ratio of improvement, something around 3x or even closer to 4x. VoIP calls, watching films and listening to MP3s are also good examples of densely packed computation, but since we’re usually running those applications on a multi-tasking operating system, you’ll rarely see improvements higher than 2x. General applications, though, rarely spend that much time in inner loops (they’re mostly waiting for user input and then doing a bunch of unrelated operations, hardly vectorizable).

Another important aspect of vectorization is that it saves a lot of battery juice. For MP3 decoding it doesn’t really matter whether you finish in 10 or 5 seconds, as long as the music doesn’t stop to buffer. But taking 5 seconds instead of 10 means that for the other 5 seconds the CPU can reduce its voltage and save battery. This is especially important on mobile devices.

What about ARM code?

Now that we know the vectorizer works well, and the cost model is reasonably accurate, how does it compare on ARM CPUs?

It seems that the grass is not so green on this side, at least not at the moment. I have reports that on ARM it also reached a 40% boost, similar to Intel, but what I saw was a different picture altogether.

On a Samsung Chromebook (Cortex-A15) I got:

Compiler   Opt    MFLOPS   Diff
  Clang    -O3       796   0.0%
    GCC  -O3 vec     736  -8.5%
  Clang  -O3 vec     773  -2.9%

The performance regression can be explained by the amount of scalar code intermixed with vector code inside the inner loops as a result of shuffles (movement of data within the vector registers and between scalar and vector registers) not being lowered correctly. This most likely happens because the LLVM back-end relies a lot on pattern-matching for instruction selection (a good thing), but the vectorizers might not be producing the shuffles in the right pattern, as expected by each back-end.

This can be fixed by tweaking the cost model to penalize shuffles, but it’d be good to see if those shuffles aren’t just mismatched against the patterns that the back-end is expecting. We will investigate and report back.

Update

Got results for single precision floating point, which show a greater improvement on both Intel and ARM.

On x86_64 Intel (Core i7-3632QM), I got these results:

Compiler   Opt   MFLOPS  Diff
   Clang   -O3     2530   0.0%
     GCC -O3 vec   3484  37.7%
   Clang -O3 vec   3996  57.9%

On a Samsung Chromebook (Cortex-A15) I got:

Compiler   Opt   MFLOPS  Diff
   Clang   -O3      867   0.0%
     GCC -O3 vec    788  -9.1%
   Clang -O3 vec   1324  52.7%

Which goes to show that the vectorizer is indeed working well for ARM, but that the costs of using the VFP/NEON pipeline outweigh the benefits. Remember that NEON vectors are only 128 bits wide and VFP only 64 bits wide, and NEON has no double-precision floating-point operations, so you’ll only get one double-precision floating-point operation per cycle, and the theoretical maximum depends on the speed of the soft-fp libraries.

So, in the future, what we need to work on is the cost model, to make sure we don’t regress in performance, and on getting better algorithms when lowering vector code (both by making sure we match the patterns the back-end is expecting, and by simply finding better ways of vectorizing the same loops).

Conclusion

Without further benchmarks it’s hard to come to a final conclusion, but it’s looking good, that’s for sure. Since Linpack is part of the standard LLVM test-suite benchmarks, fixing this and running it regularly on ARM will at least avoid any further regressions… Now it’s time to get our hands dirty!

 

K-means clustering

Clustering algorithms can be used with many types of data, as long as you have a means to distribute them in a space where there is a concept of distance. Vectors are the obvious choice, but not everything can be represented as N-dimensional points. Another way to plot data, one much closer to real data, is to allow for a large number of binary axes, like tags. You can then cluster by the number of tags the entries share, with the distance (only relative to the others) being the proportion of shared tags against non-shared ones.

An example of tag clustering can be seen on Google News; an example of clustering in Euclidean space can be seen in the image above (full code here). The clustering code is very small, and the result is very impressive for such simple code. But the devil is in the details…

Each group of red dots is generated randomly from a given central point (drawing N randomly distributed points inside a circle of radius R centred at C). Each centre is randomly placed, and sometimes the groups collide (as you can see in the image), but that’s part of the challenge. To find the groups, and their centres, I throw in random points (with no knowledge of the groups’ centres) and iterate until I’ve found all the groups.

The iteration is very simple, and consists of two steps:

  1. Assignment Step: For each point, assign it to the nearest mean. This is why you need the concept of distance, and that’s a tricky part. With Cartesian coordinates, it’s simple.
  2. Update Step: Calculate the real mean of all points assigned to each mean point, and move the mean to it. This is basically moving the supposed (randomly guessed) mean to its rightful place.

On the second iteration, the means, which were randomly selected at first, are now closer to a set of points. Not necessarily points in the same cluster, but the cluster that has more points assigned to a given mean will slowly steal it from the others, since it’ll have more weight when updating it in step 2.

If all goes right, the means will slowly move towards the centre of each group and you can stop when the means don’t move too much after the update step.
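Here is a minimal sketch of those two steps for 2-D Cartesian points (my own illustration, not the code linked above; the names and the structure are arbitrary):

#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double x, y; };

static double dist2(const Point &a, const Point &b) {
	double dx = a.x - b.x, dy = a.y - b.y;
	return dx * dx + dy * dy;	// squared distance is enough for comparisons
}

// One full iteration: assign every point to its nearest mean, then move each
// mean to the centroid of the points assigned to it. Returns how far the
// means moved, so the caller can stop once the movement is small enough.
double kmeans_step(const std::vector<Point> &points, std::vector<Point> &means) {
	std::vector<Point> sum(means.size(), Point{0, 0});
	std::vector<std::size_t> count(means.size(), 0);

	// Assignment step
	for (const Point &p : points) {
		std::size_t best = 0;
		for (std::size_t m = 1; m < means.size(); ++m)
			if (dist2(p, means[m]) < dist2(p, means[best]))
				best = m;
		sum[best].x += p.x;
		sum[best].y += p.y;
		++count[best];
	}

	// Update step
	double moved = 0;
	for (std::size_t m = 0; m < means.size(); ++m) {
		if (count[m] == 0)
			continue;	// a mean that got no points stays where it is
		Point next{sum[m].x / count[m], sum[m].y / count[m]};
		moved += std::sqrt(dist2(means[m], next));
		means[m] = next;
	}
	return moved;
}

You’d call kmeans_step() in a loop, starting from the randomly thrown means, and stop once the returned movement drops below some small threshold.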

Many problems will arise with this simplified version, for sure. For instance, the mean may sit exactly in between two groups, with both pulling it towards their centres with equally strong forces, so the mean never moves and the algorithm thinks it has already found its group when, in fact, it has found two. Or a group may be so large that it ends up with two or more means belonging to it, splitting it into many groups.

To overcome these deficiencies, some advanced forms of K-means take into account the shape of the group during the update step, sometimes called soft k-means. Other heuristics can be added as further steps to make sure there aren’t two means too close to each other (relative to their groups’ sizes), or that there aren’t big gaps between points of the same group, but that kind of heuristic tends to be expensive, since it examines every point of a group in relation to the other points of the same group.

All in all, still impressive performance for such a simple algorithm. Next in line, I’ll try clustering data distributed among many binary axes and see how k-means behaves.

FreeCell puzzles solver API

This is a little pet project I did a while ago. It’s a FreeCell puzzle solver API.

The idea is to provide a basic validation engine and board management (pretty much like my old chess validation), so people can write FreeCell solvers on top of it. It has basic board setup (of multiple sizes), movement legalisation, and a basic Solver class, which you must derive to create your own solvers.

There’s even a BruteForceSolver that can solve a few small boards, and that gives you an idea of how to create your own solvers. However, the API is not yet clear enough for children to start playing with it, and that’s the real goal of this project: to get kids interested in solving complex optimisation problems in an easy way.
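To give an idea of the shape of it, a derived solver could look something like this (a purely hypothetical sketch: Board, Solver and the method names here are illustrative stand-ins, not the project’s real API):

#include <vector>

struct Move {};	// stand-in for a legal move
struct Board {	// stand-in for the board management part of the API
	std::vector<Move> legalMoves() const { return {}; }
	void apply(const Move &) {}
	void undo(const Move &) {}
	bool solved() const { return false; }
};

class Solver {	// stand-in for the base class you're meant to derive
public:
	virtual ~Solver() {}
	virtual bool solve(Board &board) = 0;
};

class DepthFirstSolver : public Solver {
public:
	bool solve(Board &board) override {
		if (board.solved())
			return true;
		for (const Move &m : board.legalMoves()) {
			board.apply(m);
			if (solve(board))
				return true;	// a solution exists down this branch
			board.undo(m);	// roll the move back and try the next one
		}
		return false;
	}
};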

FreeCell is a perfect game for it. Most boards can be solved (only a handful have been proven, by exhaustion, to be unsolvable), some moves can be rolled back, and you can freely bring cards that have already been placed in the foundations (their final destination) back into the game.

It’s out of the scope of this project to produce a full-featured graphical interface for kids, but making the API easy enough for them to understand the concepts without getting dragged into the idiosyncrasies of C++ is important.

Compiler optimisations

The reason I did this was to make some of the optimisations compiler engineers have to do more appealing to non-compiler engineers, or to children with a taste for complex problems. But what does this have to do with compilers? The analogy is a bit far-fetched and somewhat reversed, but it’s interesting nevertheless and it was worth a shot.

Programming languages are transformed into graphs inside the compiler, which should represent the intentions of the original programmer. These graphs are often optimised multiple times until you end up with a stream of instructions representing the source code in the machine language.

Ignore for now the optimisations on those graphs, and focus on the final part: selecting machine instructions that can represent that final graph in the machine language (assembly). This selection can pick any assembly instruction at will, but it has to put them in a very specific order to represent the same semantics (not just the syntax) of the original program. Since many instructions have side effects, pipeline interactions, or special flags that get set or cleared, it’s not trivial to produce correct code if you don’t check and re-check all conditions every time. This is a known complex optimisation problem and can be responsible for changes in speed or code size of orders of magnitude.

What does this have to do with the FreeCell puzzle? Well, in FreeCell, you have a number of cards and you have to put them in a specific order, just like assembly instructions. But in this case, the analogy is reversed: the “sequence” is trivial, but the “instructions” are hard to get.

There are other similarities. For example, you have four free cells, and they can only hold one value at a time. They are similar to registers, and manipulating them gives you a good taste of how hard it is to do register scheduling when building the assembly result. But in this case, it’s much harder to spill (move the data back to memory, or in this case, move a card back to the cascades), since there are strict rules on how to move cards around.

Reusing cards from the foundations is similar to expanding single instructions into a sequence of them in order to circumvent pipeline stalls. In real compilers you could expand a multiply+add (very useful for digital signal processing) into two instructions: multiply and add, if that gives you some advantage on special cases on special chips. In Freecell, you can use a 9 on top of a 10, to move an 8 from another cascade and free up a card that you need to clean up your freecells (registers).

I’m sure you can find many more similarities, even if you have to skew the rules a bit (or completely reverse them), but that’s not the point. The point is to interest people into complex optimisation techniques without the hassle of learning a whole new section of computer science, especially if that section puts fear in most people in the first place.

2011 European User Group Meeting

For all LLVM users in Europe who didn’t catch the announcement on the main list, be warned that we’re organising the European LLVM User Group Meeting on September 16th, in London. All the information necessary to register and submit your work is at the link.

It’s free of charge and we’ll also provide a happy hour after the meeting for some extended networking. If you’re just curious about LLVM, work with it day to day, or are thinking of starting your PhD with LLVM, please register and show up. You won’t regret it. 😉

More information will be available on the link above and through the main mailing list. Stay tuned!

C++ class sizes #2

In my previous post about class sizes, we visited empty classes and how their derived classes change in size. Now we’re going to look at what the ABI has to say about tail padding in inherited classes.

Again, for normal (POD) structures, everything is as it looks, the same as in C. But for more complex (non-POD) classes, things change a bit. The Itanium C++ ABI has a rather complex rule for defining the offset and size of class members, and each target-specific ABI (like ARM’s) can have additional rules that make it very complex to define the size of a class member.

Plain-Old-Data

So, first, a brief explanation of POD. POD is a concept in C++ stating that, for the purposes of layout, a C++ structure (which can be more than a simple C structure) is just like plain old data. This normally means that the structure behaves like data, i.e. it can be created in zero-initialized memory, copied with memcpy without any side effects, etc.

But not being a POD doesn’t always mean you have a complex virtual class with multiple inheritance. Sometimes a simple change can transform it into a non-POD, like adding a simple empty constructor to the base class.

For example, in both C and C++, the structure below has 8 bytes:

struct Foo {
  int a;
  char b;
};

because the structure alignment is the same as the most aligned member, which is int. Assuming int is 32 bits, the second member has size one but aligns to four, yielding a size 8 for the final structure.

In C++, you can derive from that structure, and the derived class is normally implemented by having the base class as sort of a member inside it.

So, the C++ struct Bar below:

struct Foo {
	int a;
	char b;
};

struct Bar : Foo {
	char c;
};

is similar (for layout purposes) to a C construct:

struct Bar {
	struct {
		int a;
		char b;
	};
	char c;
};

So, both in C and in C++, the size of Bar is 12 bytes. Now, with a slight change to Foo, we’re going to break that:

struct Foo {
	int a;
	char b;
	Foo() {}
};

struct Bar : Foo {
	char c;
};

You may be surprised to know that the size of Bar is still 8 bytes.

non-POD structures

Adding the constructor is one way of killing the POD-ness of a structure, but anything that adds some C++ flavour to it is enough. For instance, Bar itself is a non-POD, because it was derived from another structure. Adding simple methods (non-constructors) is fine, but adding a virtual method makes it a non-POD (as you have to add the virtual table). Anything that leaves some room for implementation detail will be considered a violation of the POD contract and will transform your structure into a non-POD.

OK, so what does that mean? If it just means your structures will be smaller, well, I’ll add a constructor to all my structures! But no, the only change is in the tail padding. If you want to compress your structures, use packed structures; the tail-padding rules are too complex to rely on and still keep your code clean.

So, the original Foo structure has 8 bytes: four for the int and four for the char. But the char only uses one of those four, so you have a 3-byte tail. When inheriting from a non-POD structure, the C++ ABI allows you to use the tail padding of the base class, provided your new member fits in there. In our example, the new member is a char, which fits in the 3-byte tail of the base class, so the final layout uses two of the four bytes and still has a 2-byte tail.

Further inheritance of the same type (adding a single char member) will still pack at the end of the base structure.
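If you want to see this on your own machine, a small sketch like the one below (my own example; the exact numbers are an ABI/target detail, and the values in the comments are what a typical Itanium-ABI target with 4-byte int gives) shows the difference between the POD and non-POD cases:

#include <cstdio>

struct PodFoo { int a; char b; };	// POD base: tail padding cannot be reused
struct PodBar : PodFoo { char c; };	// c goes after the whole base

struct Foo { int a; char b; Foo() {} };	// non-POD base (empty constructor)
struct Bar : Foo { char c; };	// c is packed into Foo's 3-byte tail

int main() {
	// Typically prints: PodFoo=8 PodBar=12 Foo=8 Bar=8
	std::printf("PodFoo=%zu PodBar=%zu Foo=%zu Bar=%zu\n",
	            sizeof(PodFoo), sizeof(PodBar), sizeof(Foo), sizeof(Bar));
	return 0;
}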

Bit-fields

Every time you think you have understood the rules of C++ layout, go back to bit-fields and you’ll always be surprised.

Lets have a look at this example for a moment:

struct Foo {
	char a;
	int b : 8;
	Foo(){}
};

struct Bar : Foo {
	int c : 8;
};

The first member is a char, so one byte. The second is an int, so 4 bytes, and the final layout would be 5 bytes which, aligned to int, goes to 8 bytes, right? Well, no. Even without the constructor, the size of that structure is 4 bytes.

Compilers are smart enough (and the ABI allows them to be) to figure that the member b will have only 8 bits, so there is no point in wasting 6 more bytes just to follow the declared type.

You may not be surprised, now, to know that the size of Bar is still 4 bytes, since the member c in Bar gets tail-padded into the base Foo.

But what about this:

struct Bar  {
	struct {
		char a;
		char a1;
		char a2;
		int b : 4;
	};
	int c : 4;
};

The three chars push the final member to the last byte in the block, of which only 4 bits are really used, so you have 4 bits of tail-padding. The second member c of Bar gets tail-padded into those four bits, since they fit, and the final size of that structure is 4 bytes.

I hope to have convinced you never to try to be smart and convert your structures and classes to arrays or pointers. Even disregarding virtual inheritance (which will probably come in another post), the C++ ABI for structure layout is too complex to rely on in your own code. Worse, because this is not in the standard, concrete implementations are free to change the details while still following the generic ABI. You’ll find a few cases in the ARM ABI where it differs slightly from the generic ABI without actually being incompatible; sometimes you’re not that lucky.

The final lesson is: be smart, use the language features and rely on the compiler to optimise things for you. If you’re not happy with the optimisation, file a bug rather than hack your code.

C++ class sizes

In a recent struggle to work out class sizes in C++, I thought it would be good to share some of my findings, since there isn’t much about it on the net. The empty class size is covered all over the place, and neatly put in Stroustrup’s FAQ, but the other weird cases were less easy to find.

For obvious reasons, the C++ standard is pretty vague about class sizes and memory layout. This is more of an ABI issue than a language standard, but the problem is that even the Itanium C++ ABI is a bit vague on that point.

The C++ standard allows the memory layout to be defined by the ABI, positioning the class members wherever they fit most naturally on the target, but there is one specific clause: an object cannot have zero size. While most layout decisions are target specific, this is strictly a language decision. If objects were allowed to have zero size, an array of 1000 zero-sized objects would occupy no space at all and it would be impossible to tell one from the other.

Empty classes

First, a background on empty classes. An empty class is somewhat useless in most cases, but there is one case in which it plays an important role: type safety.

If you want to group objects that have nothing in common (to use in an algorithm), or to force a specific class of template parameters, or even to differentiate between exceptions that don’t carry any data (just to catch them differently), you can use an empty class and it does the job pretty well. Thus, using and deriving from empty classes is a fairly common enterprise in C++.

Empty class size

So, what happens when you declare an empty type?

class Empty {
};

What is the size that objects of this type will have? It cannot be zero, as the standard forbids it, so it has to be something. The best choice is one byte, since no architecture will benefit from having less (please, don’t think of bit-fields!) and, if alignment is a problem, most (all?) compilers will adjust the byte to a word-aligned block in memory.

Obviously, an empty class that derives from another empty class also has to follow the same rule:

class Empty {
};

class NowWhat : public Empty {
};

In this case, both Empty and NowWhat have size 1. But if NowWhat has data in it, will the extra byte still be there? In both the standard and the ABI, there is nothing saying it shouldn’t be, but also nothing saying it should. What’s important here is the reason you need the extra byte: the addresses of different objects must be different.

When the derived class already has data, its objects will already be at different locations when laid out in memory, so there is no need for the extra byte.

class Empty {
};

class NowWhat : public Empty {
  int a;
};

Now, Empty still has 1 byte, but NowWhat has 4, not 5. Some people consider this an optimization; I just think it’s following the rules of what the standard requires.

The reverse case is much simpler. If the derived class is the empty one (for whatever reason), the size is already non-zero, so the two classes below both have the same size (4):

class Full {
	int a;
};

class Empty : public Full {
};

Multiple Inheritance

Multiple inheritance brings a bit of colour to this problem. The C++ standard requires that two distinct objects always occupy different memory locations (so that their pointers compare differently, and so that there are no non-empty zero-sized arrays). So what is the size of the Derived class below?

class Empty {
};

class M1 : public Empty {
};

class M2 : public Empty {
};

class Derived : public M1, M2 {
};

Before, we could make the NowWhat class with only one byte because that would be sufficient to discriminate it in an array, but what about comparisons? If you cast a Derived object to Empty via M1, you get a different (logical) object than if you do it via M2, so they must compare differently. For that, you need to add another byte, and return the first address when converting via one class and the second when converting via the other.
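A small sketch of that, using the classes above (my own example; the printed size is an implementation detail, typically 2 on Itanium-ABI targets, while the two pointers are guaranteed to differ):

#include <cstdio>

class Empty {};
class M1 : public Empty {};
class M2 : public Empty {};
class Derived : public M1, public M2 {};

int main() {
	Derived d;
	Empty *viaM1 = static_cast<M1 *>(&d);	// the Empty sub-object inside M1
	Empty *viaM2 = static_cast<M2 *>(&d);	// the Empty sub-object inside M2
	// Two distinct Empty sub-objects must have distinct addresses, which is
	// why Derived needs more than one byte.
	std::printf("sizeof(Derived)=%zu distinct=%d\n", sizeof(d), viaM1 != viaM2);
	return 0;
}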

Empty members

With single and multiple inheritance nailed down, let’s think of a more complicated case.

class Empty {
};

class Another : public Empty {
};

class NotSoMuch : public Empty {
	Another a;
};

What’s happening here is that the first member of the derived class is, in fact, of an empty class type that also derives from the same base class as the enclosing type.

See, this last bit is important, because if you replace Another a; with another empty class type (say, Empty2), the size of the derived class will still be 1. But in the case above that is not true: the size of NotSoMuch is actually 2.

The field Another a; has size one, like any empty class object, so what’s the second byte for? The second byte is padding, at the beginning of the class, to avoid type conflicts.

Type conflicts

When deriving from empty classes, the ABI (2.4-II-3) states that you should always try to put the derived members at offset zero and, in case of a type conflict, move down (alignment-wise) until you don’t have any more type conflicts. The question is: what is a type conflict in this case? In the multiple inheritance example, the type conflict was clear (the diamond inheritance), but here, not so much.

Since both the class and its first member can be converted to Empty (as pointers, for instance), you can have two Empty objects that, if compared, must have different addresses. In the code below, n is (a pointer to) the derived object; it is converted to an Empty pointer and stored in en, while the address of its member a goes into ea.

	NotSoMuch *n = new NotSoMuch();
	Empty *en = (Empty*)n;
	Empty *ea = (Empty*)&n->a;

If the member Another a; were at offset zero of the class, the pointer ea would hold the same address as en, and a comparison for equality would return true.

The type conflict in this case is that, while both en and ea are pointers to an Empty type, the former gets there via NotSoMuch and the latter via Another. According to the C++ standard, they must be different and thus have different addresses.

Again, if the empty class member is not the first element, none of that happens. The example below has size 4 (instead of 5), because the two similar (but different) Empty pointers would now be separated by 4 bytes, thus not violating the standard.

class Empty {
};

class Another : public Empty {
};

class NotSoMuch : public Empty {
	int i;
	Another a;
};

Templates

Template code is also susceptible to this problem, of course, and the iostream library is full of such examples. The problem is not so much abuse of empty classes (that doesn’t happen that much in the STL); it’s that non-virtual classes that only have member functions and no data fields are considered empty classes. And since templates are a good replacement for most virtual inheritance (with its inherent run-time complexity and slowness), basic libraries have to make extensive use of templates to avoid slowing down every single C++ program.

It’s true that the iostream library is particularly heavy and cumbersome, but it could be much worse.

The additional problem with templates (for the compiler, at least) is figuring out whether there is a type conflict. While templates are already expanded by the time the basic C++ compiler kicks in, it’s complicated to carry the source-level information about the inheritance graph to every single template variation and specialization. Not to mention the new standard, where C++ is getting a variadic template mechanism.

I hope this is one more reason for you to avoid C-style pointer casts (especially void*) and only use C++ static_cast and dynamic_cast. Even reinterpret_cast is dangerous, but at least you can grep for them in your source files and identify them one by one.

Zero-cost exception handling in C++

C++ exception handling is a controversial subject. The intricate concepts are difficult to grasp, even for seasoned programmers, but in complexity nothing compares to its implementation. I can’t remember any real-world (commercial or otherwise) product that makes heavy use of exception handling. Even the STL itself only throws a couple of them, and only upon terminal failures.

Java exceptions, on the other hand, are much more common. Containers throw them at will (e.g. OutOfBounds), creating your own exceptions is very easy, and developers are encouraged to use them. But still, some C++ projects’ coding standards discourage (like LLVM’s) or specifically disallow (like Google’s) the use of exceptions. The Android NDK, for example, doesn’t even have (nor plans to add) support for exceptions.

Some of the cons commonly listed are the encouragement of bad style (throwing at will, like in Java), compilation problems with old code and all the gotchas that exception handling can bring to your code, making it a nightmare to understand the real flow of your program and to figure out the critical DON’Ts of exception handling (like throwing from inside a destructor). For a deeper discussion of that subject, read Sutter’s book; for the implementation details of exception handling from a compiler’s point of view, keep reading.

Three distant worlds

Exception handling is not just a language construct, like if; it’s an intricate interaction between three distant worlds. All the complications that exception handling brings to us have to be implemented by a three-layer infrastructure: the user code, the run-time library and the compiler. The communication between them has to be absolutely perfect: one bit off and the control flow goes south, you start executing random code and blow up your program for good.

Not only are these three very different places, but the code is written by three different kinds of people, at different times. The only formal document that binds them together is the ABI (e.g. the Itanium C++ ABI), which is, to say the least, vague. Luckily, the people who write the run-time libraries are close enough to the people writing the compiler (and hopefully, but not necessarily, the same people writing the ABI).

In the end, if the compiler implementation is correct, the user only has to care about the rules imposed by the C++ standard. But that’s not too comforting: the C++ standard is complex and frequently driven by previous implementations, rather than by fixed and clear rules of thumb. Most users don’t precisely understand why you can’t throw from inside a destructor or why throwing a newly created object ends up calling terminate in their code; they just avoid it at all costs.

Dirty and expensive

The easy exception handling implementation is called SetJump/LongJump. It literally sets jump points between your code and the exception handling code (created by the compiler). If an exception is thrown, the flow jumps directly to the exception handling code. This is very easy, but extremely expensive. Usually, if you follow good practices, you’d expect your program to spend most of its time doing real computation and only occasionally handling exceptions. But if for every action you create a bunch of exception handling data, you’re wasting a lot of resources checking for something that might not even happen during the whole run of your program.

When an exception happens, you need to go back down the call graph and inspect whether any of those functions (including the one that just threw) is catching that exception. This is called unwinding the stack, and in a SetJump/LongJump implementation this is done by checking, through a series of SetJump checks, whether a LongJump was called. The problem is that those checks are done even if an exception is never thrown at all.
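For illustration, this is roughly what the SetJump/LongJump style looks like when written by hand (a sketch of my own in plain C++ on top of the C library calls, not what any compiler actually emits):

#include <csetjmp>
#include <cstdio>

static std::jmp_buf handler;	// where the "catch" lives

static void do_work(bool fail) {
	// ... real computation ...
	if (fail)
		std::longjmp(handler, 1);	// the "throw": jump straight back to the setjmp
}

int main() {
	// setjmp returns 0 when setting the jump point and non-zero when jumped
	// back to. The check runs on every pass, whether or not anything ever
	// "throws", which is exactly the cost the zero-cost scheme avoids.
	if (setjmp(handler) == 0) {
		do_work(true);
		std::puts("no error");
	} else {
		std::puts("error handled");
	}
	return 0;
}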

Zero-cost

So, how is it possible to only execute exception handling code when an exception really happens? Simple: let the compiler and library do it for you. 😉

The whole idea of zero-cost exception handling is to create a set of functions (like __cxa_throw, __cxa_begin_catch) that deal with the exception handling flow by sending it down a different path, the EH path. In normal situations, your code flow is identical with or without exceptions. The flow can never reach the EH blocks on its own, and it’s only because the compiler knows those blocks are on the EH path that they don’t get discarded as unreachable code, since there is no explicit flow between user code and EH code.

Only the run-time library knows how to get there, and that’s helped by a table, normally in a read-only data section, specific to exception handling (ex. the .eh_frame ELF section), created by the compiler when reading the user’s exception flow. This is the core interaction between user, compiler and library codes.

When an exception is thrown via __cxa_allocate_exception + __cxa_throw, the first function allocates space and prepares the object to be thrown, and the second starts reading the exception handling table. This table contains a list of exceptions that the particular piece of code is catching (often sorted by address, so it’s easy to do a binary search) and the code that will catch the exception.

The code that catches exceptions is called the personality routine. The personality routine library call enables you to use multiple zero-cost EH mechanisms in the same program. Normally, the table entry contains a pointer to the routine and some data it’ll use while unwinding, but some ABIs (like ARM’s EHABI) can add specially crafted tables for their own purposes (like having the unwind code as 16-bit instructions embedded in the data section).

All in all, the general flow is the same: the personality routine finds the EH block (also called the landing pad, unreachable from normal code) and jumps there, where the catching happens. That code takes the exception allocated before the throw and makes it available to the user code (inside the catch block) to do what the user requested. If there is no entry in the EH table that catches that exception, the stack is unwound again and the library reads the caller’s EH table, until it reaches one that does catch it, or calls terminate if none does.

Clean-ups

So far so good, but what happens to local variables and temporaries (stack-based data)? If they have non-trivial destructors, those must be called, which can chain into a lot of cleaning up. Clean-up code lives partly in the unreachable blocks (for cleaning up EH stuff) and partly in normal code, for automatic destruction at the end of a lexical block. Although they do similar things, their purposes are completely different.

Normal flow clean-up blocks are always called and always clean up all objects. But EH clean-up code has to clean up only the objects that have been allocated so far, in the current lexical block or in any parent up to the root block of the function. If you can imagine deeply nested blocks, creating objects at will, with a throw inside the increment part of a for loop, where the object being incremented has a non-trivial destructor, you get the picture.

What happens if you throw an exception while another one is being handled? Here is where complexity met reality and the boundaries were settled. If you are unwinding one exception, with the flow being dealt with by the run-time library, and another exception is thrown, you can’t distinguish which exception you are handling. Since there is only room for one exception to be unwound at a time, whenever a second exception is thrown during unwinding, you have to terminate.
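A minimal example of hitting that boundary (my own sketch; the noexcept(false) is only there because, since C++11, destructors are noexcept by default):

#include <cstdio>

struct Noisy {
	~Noisy() noexcept(false) {
		std::puts("cleaning up, and throwing again...");
		throw 1;	// a second exception while the first is being unwound
	}
};

int main() {
	try {
		Noisy n;
		throw 0;	// unwinding starts, ~Noisy() runs and throws again
	} catch (...) {
		std::puts("never reached: the run-time calls std::terminate() instead");
	}
	return 0;
}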

Of course, you could argue for creating multiple lanes for exception handling, where every exception is handled separately, so you know that one exception was thrown while another was being handled. That may be easy to implement in Java (or other interpreted/bytecode languages), because the VM/interpreter has total control over the language, but as stated earlier, exception handling in C++ is just a contract between three parties: the compiler, the run-time library and the user, and often these three parties are completely different groups in completely different space-time realities.

Conclusion

This is just a very brief, shallow demonstration of how complex exception handling can be. Many other pitfalls appear, especially when dealing with specific user code running on a specific hardware platform with a specific binary configuration for the object file. It’s impossible to test every single combination (including room temperature, or the annual rainfall forecast) to declare an implementation final, and weird bugs creep up every once in a while.

Enhancing the ABI to allow that is possible, but the complexity of doing so, together with the very scarce use of exception handling in C++, makes it a very unlikely candidate for change in the near future. It’d have to be done at so many levels (C++ standard, ABIs, compilers, run-time libraries) that it’s best left alone.