Going ARM

Last week I attended the Arm Research Summit, and on Tuesday there was a panel session on using Arm hardware for performance, an effort known as GoingArm. You can watch the whole panel on youtube, from 2:12 onward (~1h30). All videos are available on the website.

It was interesting to hear the opinions of people in the front lines or ARM HPC, as much as the types of questions that came, which made me believe even more strongly that now is the time to re-think about general computing in general. For a long time we have been coasting on the fact that cranking up the power and squeezing down the components were making the lives of programmers easier. Scaling up to multiple cores was probably the only big change most people had to deal with, and even so, not many people can actually take real advantages of it.

Continue reading “Going ARM”

OpenHPC on AArch64

My current effort is to drive ARM hardware into the data centre, through HPC deployments, for the exascale challenge. Current Top500 clusters start from 0.5 petaflops, but the top 10 are between 10 and 90 petaflops, using as much as close to 20MW of combined power. Scaling up 10 ~ 100 times in performance towards exascale would require a whole power plant (coal or nuclear) for each cluster. So, getting to ExaFLOPS involves not only getting server-grade ARM chips into a rack mount, but baking the whole ecosystem (PCB, PCIe, DRAM, firmware, drivers, kernel, compilers, libraries, simulations, deep learning, big data) that can run at least at 10x better performance than existing clusters at, hopefully, a fraction of the power budget.

The first step then, is to find a system that can glue most of those things together, so we can focus on real global solutions, not individual pieces together. It also means we need to fix all problems in the right places, rather than deferring that to external projects because it’s not our core business. At Linaro we tend to look at computing problems from an holistic point of view, so if things don’t work the way they should, we make it our job to go and fix the problem where it’s most meaningful, finding the appropriate upstream project and submitting pull requests there, then back-porting solutions internally and gluing them together into a proper solution.

And that’s why we came to OpenHPC, a project that aims to facilitate HPC cluster deployment. Essentially, it’s a package repository that glues functional groups using meta-packages, in additional to recipes and documentation on how to setup a cluster, and a lively community that deploys OpenHPC clusters across different architectures, operating systems and provisioning styles.

The recent version of OpenHPC, 1.3.2, has just been released with some additions proposed and tested by Linaro, such as Plasma, Scotch and SLEPc as well as LLVM with Clang and Flang. But while that works well on x86_64 clusters for now, and they have passed all tests on AArch64, installing a new cluster on AArch64 with automatic provisioning still needs some modification. That’s why it’s still in Tech Preview.

Warewulf

For those who don’t know, warewulf is a cluster provisioning service. As such, it is installed in a master node, which will then keep a database of all the compute nodes, resource files, operating system images and everything that is necessary to get the compute nodes up and running. While you can install the nodes by hand, then install a dispatcher (like Slurm) on each node, warewulf makes that process simple: it creates an image of the installation of a node, produces an EFI image for PXE boot and let the nodes discover themselves as they come up alive.

The OpenHPC documentation explain step by step how you can do this (by installing OpenHPC’s own meta-packages and running a few configuration tasks), and if all goes well, every compute node that boots in PXE mode will soon encounter the DHCP server, then the TFTP and will get its disk-less live installation painlessly. That is, of course, if the PXE process work.

While running this on a number of different ARM clusters, I realised that I was getting an error:

ERROR:  Could not locate Warewulf's internal pxelinux.0! Things might be broken!

While that doesn’t sound good at all, I learnt that people in the ARM community knew about that all along and were using a Grub hack (to create a simple Grub EFI script to jump start the TFTP download and installation). It’s good that it works in some way, but it’s the kind of thing that should just work, or people won’t even try anything further. Turns out the PXELinux folks haven’t bothered much with AArch64 (examples here and here), so what to do?

Synergy

One of the great strengths of Linaro is that there are a bunch of maintainers of the core open source projects most people use, so I was bound to find someone that knew what to do, or at least who to ask. As it turns out, two EFI gurus (Leif and Ard) work on my extended team, and we begun discussing alternatives when Eric (Arm Research, also OpenHPC collaborator) tipped us that warewulf would be completely replacing PXELinux for iPXE, in view of his own efforts in getting that working.

After a few contributions and teething issues, I managed to get iPXE booting on ARM clusters as smoothly as I would have done for x86_64.

Tech Preview

Even though it works perfectly well, OpenHPC 1.3.2 was release without it.

That’s because it uses an older snapshot of warewulf, while I was using the bleeding edge development branch, which was necessary due to all the fixes we had to do while making it work on ARM.

This has prompted me to replicate their validation build strategy so that I could replace OpenHPC’s own versions of warewulf in-situ, ignoring dependencies, which is not a very user friendly process. And while I have validated OpenHPC (ran the test-suite, all green) with the development branch of warewulf (commit 74ad08d), it is not a documented or even recommended process.

So, despite working better than the official 1.3.2 packages, we’re still in Tech Preview state, and will be until we can pack the development branch of warewulf past that commit into OpenHPC. Given that it has been reported to work on both AArch64 and x86_64, I’m hoping it will be ready for 1.3.3 or 1.4, whatever comes next.

Trashing Chromebooks

At Linaro, we do lots of toolchain tests: GCC, LLVM, binutils, libraries and so on. Normally, you’d find a fast machine where you could build toolchains and run all the tests, integrated with some dispatch mechanism (like Jenkins). Normally, you’d have a vast choice of hardware to chose from, for each different form-factor (workstation, server, rack mount) and you’d pick the fastest CPUs and a fast SSD disk with space enough for the huge temporary files that toolchain testing produces.

tcwg-rack

The only problem is, there aren’t any ARM rack-servers or workstations. In the ARM world, you either have many cheap development boards, or one very expensive (100x more) professional development board. Servers, workstations and desktops are still non-existent. Some have tried (Calxeda, for ex.) but they have failed. Others are trying with ARMv8 (the new 32/64-bit architecture), but all of them are under heavy development, so not even Alpha quality.

Meanwhile, we need to test the toolchain, and we have been doing it for years, so waiting for a stable ARM server was not an option and still isn’t. A year ago I took the task of finding the most stable development board that is fast enough for toolchain testing and fill a rack with it. Easier said than done.

Choices

Amongst the choices I had, Panda, Beagle, Arndale and Odroid boards were the obvious candidates. After initial testing, it was clear that Beagles, with only 500MB or RAM, were not able to compile anything natively without some major refactoring of the build systems involved. So, while they’re fine for running remote tests (SSH execution), they have very little use for anything else related to toolchain testing.

panda

Pandas, on the other hand, have 1GB or RAM and can compile any toolchain product, but the timing is a bit on the wrong side. Taking 5+ hours to compile a full LLVM+Clang build, a full bootstrap with testing would take a whole day. For background testing on the architecture, it’s fine, but for regression tracking and investigative work, they’re useless.

With the Arndales, we haven’t had such luck. They’re either unstable or deprecated months after release, which makes it really hard to acquire them in any meaningful volumes for contingency and scalability plans. We were left then, with the Odroids.

arndale

HardKernel makes very decent boards, with fast quad-A9 and octa-A15 chips, 2GB of RAM and a big heat sink. Compilation times were in the right ball park (40~80 min) so they’re good for both regression catching and bootstrapping toolchains. But they had the same problem as every other board we tried: instability under heavy load.

Development boards are built for hobby projects and prototyping. They normally can get at very high frequencies (1~2 GHz) and are normally designed for low-power, stand-by usage most of the time. But toolchain testing involves building the whole compiler and running the full test-suite on every commit, and that puts it on 100% CPU usage, 24/7. Since the build times are around an hour or more, by the time that the build finishes, other commits have gone through and need to be tested, making it a non-stop job.

CPUs are designed to scale down the frequency when they get too hot, so throughout the normal testing, they stay stable at their operating temperatures (~60C), and adding a heat sink only makes it go further on frequency and keeping the same temperature, so it won’t solve the temperature problem.

The issue is that, after running for a while (a few hours, days, weeks), the compilation jobs start to fail randomly (the infamous “internal compiler error”) in different places of different files every time. This is clearly not a software problem, but if it were the CPU’s fault, it’d have happened a lot earlier, since it reaches the operating temperature seconds after the test starts, and only fails hours or days after they’re running full time. Also, that same argument rules out any trouble in the power supply, since it should have failed in the beginning, not days later.

The problem that the heat sink doesn’t solve, however, is the board’s overall temperature, which gets quite hot (40C~50C), and has negative effects on other components, like the SD reader and the card itself, or the USB port and the stick itself. Those boards can’t boot from USB, so we must use SD cards for the system, and even using a USB external hard drive with a powered USB hub, we still see the failures, which hints that the SD card is failing under high load and high temperatures.

According to SanDisk, their SD cards should be ok on that temperature range, but other parties might be at play, like the kernel drivers (which aren’t build for that kind of load). What pointed me to the SD card is the first place was that when running solely on the SD card (for system and build directories), the failures appear sooner and more often than when running the builds on a USB stick or drive.

Finally, with the best failure rate at 1/week, none of those boards are able to be build slaves.

Chromebook

That’s when I found the Samsung Chromebook. I had one for personal testing and it was really stable, so amidst all that trouble with the development boards, I decided to give it a go as a buildbot slave, and after weeks running smoothly, I had found what I was looking for.

The main difference between development boards and the Chromebook is that the latter is a product. It was tested not just for its CPU, or memory, but as a whole. Its design evolved with the results of the tests, and it became more stable as it progressed. Also, Linux drivers and the kernel were made to match, fine tuned and crash tested, so that it could be used by the worst kind of users. As a result, after one and a half years running Chromebooks as buildbots, I haven’t been able to make them fail yet.

But that doesn’t mean I have stopped looking for an alternative. Chromebooks are laptops, and as such, they’re build with a completely different mindset to a rack machine, and the number of modifications to make it fit the environment wasn’t short. Rack machines need to boot when powered up, give 100% of its power to the job and distribute heat efficiently under 100% load for very long periods of time. Precisely the opposite of a laptop design.

Even though they don’t fail the jobs, they did give me a lot of trouble, like having to boot manually, overheating the batteries and not having an easy way to set up a Linux image easily deployable via network boot. The steps to fix those issues are listed below.

WARNING: Anything below will void your warranty. You have been warned.

System settings

To get your Chromebook to boot anything other than ChromeOS, you need to enter developer mode. With that, you’ll be able not only to boot from SD or USB, but also change your partition and have sudo access on ChromeOS.

With that, you go to the console (CTRL+ALT+->), login with user chronos (no password) and set the boot process as described on the link above. You’ll also need to set sudo crossystem dev_boot_signed_only=0 to be able to boot anything you want.

The last step is to make your Linux image boot by default, so when you power up your machine it boots Linux, not ChromeOS. Otherwise, you’ll have to press CTRL+U every boot, and remote booting via PDUs will be pointless. You do that via cgpt.

You need to find the partition that boots on your ChromeOS by listing all of them and seeing which one booted successfully:

$ sudo cgpt show /dev/mmcblk0

The right partition will have the information below appended to the output:

Attr: priority=0 tries=5 successful=1

If it had tries, and was successful, this is probably your main partition. Move it back down the priority order (6-th place) by running:

$ sudo cgpt add -i [part] -P 6 -S 1 /dev/mmcblk0

And you can also set the SD card’s part to priority 0 by doing the same thing over mmcblk1

With this, installing a Linux on an SD card might get you booting Linux by default on next boot.

Linux installation

You can chose a few distributions to run on the Chromebooks, but I have tested both Ubuntu and Arch Linux, which work just fine.

Follow those steps and insert the SD card in the slot and boot. You should get the Developer Mode screen and waiting for long enough, it should beep and boot directly on Linux. If it doesn’t, means your cgpt meddling was unsuccessful (been there, done that) and will need a bit more fiddling. You can press CTRL+U for now to boot from the SD card.

After that, you should have complete control of the Chromebook, and I recommend adding your daemons and settings during the boot process (inid.d, systemd, etc). Turn on the network, start the SSD daemon and other services you require (like buildbots). It’s also a good idea to change the governor to performance, but only if you’re going to use it for full time heavy load, and especially if you’re going to run benchmarks. But for the latter, you can do that on demand, and don’t need to leave it on during boot time.

To change the governor:
$ echo [scale] | sudo tee /sys/bus/cpu/devices/cpu[N]/cpufreq/scaling_governor

scale above can be one of performance, conservative, ondemand (default), or any other governor that your kernel supports. If you’re doing before benchmarks, switch to performance and then back to ondemand. Use cpuN as the CPU number (starts on 0) and do it for all CPUs, not just one.

Other interesting scripts are to get the temperatures and frequencies of the CPUs:

$ cat thermal
#!/usr/bin/env bash
ROOT=/sys/devices/virtual/thermal
for dir in $ROOT/*/temp; do
temp=`cat $dir`
temp=`echo $temp/1000 | bc -l | sed 's/0\+$/0/'`
device=`dirname $dir`
device=`basename $device`
echo "$device: $temp C"
done

$ cat freq
#!/usr/bin/env bash
ROOT=/sys/bus/cpu/devices
for dir in $ROOT/*; do
if [ -e $dir/cpufreq/cpuinfo_cur_freq ]; then
freq=`sudo cat $dir/cpufreq/cpuinfo_cur_freq`
freq=`echo $freq/1000000 | bc -l | sed 's/0\+$/0/'`
echo "`basename $dir`: $freq GHz"
fi
done

Hardware changes

batteries

As expected, the hardware was also not ready to behave like a rack server, so some modifications are needed.

The most important thing you have to do is to remove the battery. First, because you won’t be able to boot it remotely with a PDU if you don’t, but more importantly, because the head from constant usage will destroy the battery. Not just as in make it stop working, which we don’t care, but it’ll slowly release gases and bloat the battery, which can be a fire hazard.

To remove the battery, follow the iFixit instructions here.

Another important change is to remove the lid magnet that tells the Chromebook to not boot on power. The iFixit post above doesn’t mention it, bit it’s as simple as prying the monitor bezel open with a sharp knife (no screws), locating the small magnet on the left side and removing it.

Stability

With all these changes, the Chromebook should be stable for years. It’ll be possible to power cycle it remotely (if you have such a unit), boot directly into Linux and start all your services with no human intervention.

The only think you won’t have is serial access to re-flash it remotely if all else fails, as you can with most (all?) rack servers.

Contrary to common sense, the Chromebooks are a lot better as build slaves are any development board I ever tested, and in my view, that’s mainly due to the amount of testing that it has gone through, given that it’s a consumer product. Now I need to test the new Samsung Chromebook 2, since it’s got the new Exynos Octa.

Conclusion

While I’d love to have more options, different CPUs and architectures to test, it seems that the Chromebooks will be the go to machine for the time being. And with all the glory going to ARMv8 servers, we may never see an ARMv7 board to run stably on a rack.

Open Source and Profit

I have written extensively about free, open source software as a way of life, and now reading back my own articles of the past 7 years, I realize that I was wrong on some of the ideas, or in the state of the open source culture within business and around companies.

I’ll make a bold statement to start, trying to get you interested in reading past the introduction, and I hope to give you enough arguments to prove I’m right. Feel free to disagree on the comments section.

The future of business and profit, in years to come, can only come if surrounded by free thoughts.

By free thoughts I mean free/open source software, open hardware, open standards, free knowledge (both free as in beer and as in speech), etc.

Past Ideas

I began my quest to understand the open source business model back in 2006, when I wrote that open source was not just software, but also speech. Having open source (free) software is not enough when the reasons why the software is free are not clear. The reason why this is so is that the synergy, that is greater than the sum of the individual parts, can only be achieved if people have the rights (and incentives) to reach out on every possible level, not just the source, or the hardware. I make that clear later on, in 2009, when I expose the problems of writing closed source software: there is no ecosystem in which to rely, so progress is limited and the end result is always less efficient, since the costs to make it as efficient are too great and would drive the prices of the software too high up to be profitable.

In 2008 I saw both sides of the story, pro and against Richard Stallman, on the views of the legitimacy of propriety control, being it via copyright licenses or proprietary software. I may have come a long way, but I was never against his idea of the perfect society, Richard Stallman’s utopia, or as some friends put it: The Star Trek Universe. The main difference between me and Stallman is that he believes we should fight to the last man to protect ourselves from the evil corporations towards software abuse, while I still believe that it’s impossible for them to sustain this empire for too long. His utopia will come, whether they like it or not.

Finally, in 2011 I wrote about how copying (and even stealing) is the only business model that makes sense (Microsoft, Apple, Oracle etc are all thieves, in that sense) and the number of patent disputes and copyright infringement should serve to prove me right. Last year I think I had finally hit the epiphany, when I discussed all these ideas with a friend and came to the conclusion that I don’t want to live in a world where it’s not possible to copy, share, derive or distribute freely. Without the freedom to share, our hands will be tied to defend against oppression, and it might just be a coincidence, but in the last decade we’ve seen the biggest growth of both disproportionate propriety protection and disproportional governmental oppression that the free world has ever seen.

Can it be different?

Stallman’s argument is that we should fiercely protect ourselves against oppression, and I agree, but after being around business and free software for nearly 20 years, I so far failed to see a business model in which starting everything from scratch, in a secret lab, and releasing the product ready for consumption makes any sense. My view is that society does partake in an evolutionary process that is ubiquitous and compulsory, in which it strives to reduce the cost of the whole process, towards stability (even if local), as much as any other biological, chemical or physical system we know.

So, to prove my argument that an open society is not just desirable, but the only final solution, all I need to do is to show that this is the least energy state of the social system. Open source software, open hardware and all systems where sharing is at the core should be, then, the least costly business models, so to force virtually all companies in the world to follow suit, and create the Stallman’s utopia as a result of the natural stability, not a forced state.

This is crucial, because every forced state is non-natural by definition, and every non-natural state has to be maintained by using resources that could be used otherwise, to enhance the quality of the lives of the individuals of the system (being them human or not, let’s not block our point of view this early). To achieve balance on a social system we have to let things go awry for a while, so that the arguments against such a state are perfectly clear to everyone involved, and there remains no argument that the current state is non-optimal. If there isn’t discomfort, there isn’t the need for change. Without death, there is no life.

Profit

Of all the bad ideas us humans had on how to build a social system, capitalism is probably one of the worst, but it’s also one of the most stable, and that’s because it’s the closest to the jungle rule, survival of the fittest and all that. Regulations and governments never came to actually protect the people, but as to protect capitalism from itself, and continue increasing the profit of the profitable. Socialism and anarchy rely too much on forced states, in which individuals have to be devoid of selfishness, a state that doesn’t exist on the current form of human beings. So, while they’re the product of amazing analysis of the social structure, they still need heavy genetic changes in the constituents of the system to work properly, on a stable, least-energy state.

Having less angry people on the streets is more profitable for the government (less costs with security, more international trust in the local currency, more investments, etc), so panis et circenses will always be more profitable than any real change. However, with more educated societies, result from the increase in profits of the middle class, more real changes will have to be made by governments, even if wrapped in complete populist crap. One step at a time, the population will get more educated, and you’ll end up with more substance and less wrapping.

So, in the end, it’s all about profit. If not using open source/hardware means things will cost more, the tendency will be to use it. And the more everyone uses it, the less valuable will be the products that are not using it, because the ecosystem in which applications and devices are immersed in, becomes the biggest selling point of any product. Would you buy a Blackberry Application, or an Android Application? Today, the answer is close to 80% on the latter, and that’s only because they don’t use the former at all.

It’s not just more expensive to build Blackberry applications, because the system is less open, the tools less advanced, but also the profit margins are smaller, and the return on investment will never justify. This is why Nokia died with their own App store, Symbian was not free, and there was a better, free and open ecosystem already in place. The battle had already been lost, even before it started.

But none of that was really due to moral standards, or Stallman’s bickering. It was only about profit. Microsoft dominated the desktop for a few years, long enough to make a stand and still be dominant after 15 years of irrelevance, but that was only because there was nothing better when they started, not by a long distance. However, when they tried to flood the server market, Linux was not only already relevant, but it was better, cheaper and freer. The LAMP stack was already good enough, and the ecosystem was so open, that it was impossible for anyone with a closed development cycle to even begin to compete on the same level.

Linux became so powerful that, when Apple re-defined the concept of smartphones with the iPhone (beating Nokia’s earlier attempts by light-years of quality), the Android system was created, evolved and dominated in less than a decade. The power to share made possible for Google, a non-device, non-mobile company, to completely outperform a hardware manufacturer in a matter of years. If Google had invented a new OS, not based on anything existent, or if they had closed the source, like Apple did with FreeBSD, they wouldn’t be able to compete, and Apple would still be dominant.

Do we need profit?

So, the question is: is this really necessary? Do we really depend on Google (specifically) to free us from the hands of tyrant companies? Not really. If it wasn’t Google, it’d be someone else. Apple, for a long time, was the odd guy in the room, and they have created an immense value for society: they gave us something to look for, they have educated the world on what we should strive for mobile devices. But once that’s done, the shareable ecosystem learns, evolves and dominate. That’s not because Google is less evil than Apple, but because Android is more profitable than iOS.

Profit here is not just the return on investment that you plan on having on a specific number of years, but adding to that, the potential that the evolving ecosystem will allow people to do when you’ve long lost the control over it. Shareable systems, including open hardware and software, allow people far down in the planing, manufacturing and distributing process to still have profit, regardless of what were your original intentions. One such case is Maddog’s Project Cauã.

By using inexpensive RaspberryPis, by fostering local development and production and by enabling the local community to use all that as a way of living, Maddog’s project is using the power of the open source initiative by completely unrelated people, to empower the people of a country that much needs empowering. That new class of people, from this and other projects, is what is educating the population of the world, and what is allowing the people to fight for their rights, and is the reason why so many civil uprisings are happening in Brazil, Turkey, Egypt.

Instability

All that creates instability, social unrest, whistle-blowing gone wrong (Assange, Snowden), and this is a good thing. We need more of it.

It’s only when people feel uncomfortable with how the governments treat them that they’ll get up their chairs and demand for a change. It’s only when people are educated that they realise that oppression is happening (since there is a force driving us away from the least-energy state, towards enriching the rich), and it’s only when these states are reached that real changes happen.

The more educated society is, the quicker people will rise to arms against oppression, and the closer we’ll be to Stallman’s utopia. So, whether governments and the billionaire minority likes or not, society will go towards stability, and that stability will migrate to local minima. People will rest, and oppression will grow in an oscillatory manner until unrest happens again, and will throw us into yet another minimum state.

Since we don’t want to stay in a local minima, we want to find the best solution not just a solution, having it close to perfect in the first attempt is not optimal, but whether we get it close in the first time or not, the oscillatory nature of social unrest will not change, and nature will always find a way to get us closer to the global minimum.

Conclusion

Is it possible to stay in this unstable state for too long? I don’t think so. But it’s not going to be a quick transition, nor is it going to be easy, nor we’ll get it on the first attempt.

But more importantly, reaching stability is not a matter of forcing us to move towards a better society, it’s a matter of how dynamic systems behave when there are clear energetic state functions. In physical and chemical systems, this is just energy, in biological systems this is the propagation ability, and in social systems, this is profit. As sad as it sounds…

Uno score keeper

With the spring not coming soon, we had to improvise during the Easter break and play Uno every night. It’s a lot of fun, but it can take quite a while to find a piece of clean paper and a pen that works around the house, so I wondered if there was an app for that. It turns out, there wasn’t!

There were several apps to keep card game scores, but every one was specific to the game, and they had ads, and wanted access to the Internet, so I decided it was worth it writing one myself. Plus, that would finally teach me to write Android apps, a thing I was delaying to get started for years.

The App

Adding new players
Card Game Scores

The app is not just a Uno score keeper, it’s actually pretty generic. You just keep adding points until someone passes the threshold, when the poor soul will be declared a winner or a loser, depending on how you set up the game. Since we’re playing every night, even the 30 seconds I spent re-writing our names was adding up, so I made it to save the last game in the Android tuple store, so you can retrieve it via the “Last Game” button.

It’s also surprisingly easy to use (I had no idea), but if you go back and forth inside the app, it cleans the game and start over a new one, with the same players, so you can go on as many rounds as you want. I might add a button to restart (or leave the app) when there’s a winner, though.

I’m also thinking about printing the names in order in the end (from victorious to loser), and some other small changes, but the way it is, is good enough to advertise and see what people think.

If you end up using, please let me know!

Download and Source Code

The app is open source (GPL), so rest assured it has no tricks or money involved. Feel free to download it from here, and get the source code at GitHub.

Distributed Compilation on a Pandaboard Cluster

This week I was experimenting with the distcc and Ninja on a Pandaboard cluster and it behaves exactly as I expected, which is a good thing, but it might not be what I was looking for, which is not. 😉

Long story short, our LLVM buildbots were running very slow, from 3 to 4.5 hours to compile and test LLVM. If you consider that at peak time (PST hours) there are up to 10 commits in a single hour, the buildbot will end up testing 20-odd patches at the same time. If it breaks in unexpected ways, of if there is more than one patch on a given area, it might be hard to spot the guilty.

We ended up just avoiding the make clean step, which put us around 15 minutes build+tests, with the odd chance of getting 1 or 2 hours tops, which is a great deal. But one of the alternatives I was investigating is to do a distributed build. More so because of the availability of cluster nodes with dozens of ARM cores inside, we could make use of such a cluster to speed up our native testing, even benchmarking on a distributed way. If we do it often enough, the sample might be big enough to account for the differences.

The cluster

So, I got three Pandaboards ES (dual Cortex-A9, 1GB RAM each) and put the stock Ubuntu 12.04 on them and installed the bare minimum (vim, build-essential, python-dev, etc), upgraded to the latest packages and they were all set. Then, I needed to find the right tools to get a distributed build going.

It took a bit of searching, but I ended up with the following tool-set:

  • distcc: The distributed build dispatcher, which knows about the other machines in the cluster and how to send them jobs and get the results back
  • CMake: A Makefile generator which LLVM can use, and it’s much better than autoconf, but can also generate Ninja files!
  • Ninja: The new intelligent builder which not only is faster to resolve dependencies, but also has a very easy way to change the rules to use distcc, and also has a magical new feature called pools, which allow me to scale job types independently (compilers, linkers, etc).

All three tools had to be compiled from source. Distcc’s binary distribution for ARM is too old, CMake’s version on that Ubuntu couldn’t generate Ninja files and Ninja doesn’t have binary distributions, full stop. However, it was very simple to get them interoperating nicely (follow the instructions).

You don’t have to use CMake, there are other tools that generate Ninja files, but since LLVM uses CMake, I didn’t have to do anything. What you don’t want is to generate the Ninja files yourself, it’s just not worth it. Different than Make, Ninja doesn’t try to search for patterns and possibilities (this is why it’s fast), so you have to be very specific on the Ninja file on what you want to accomplish. This is very easy for a program to do (like CMake), but very hard and error prone for a human (like me).

Distcc

To use distcc is simple:

  1. Replace the compiler command by distcc compiler on your Ninja rules;
  2. Set the environment variable DISTCC_HOSTS to the list of IPs that will be the slaves (including localhost);
  3. Start the distcc daemon on all slaves (not on the master): distccd --daemon --allow <MasterIP>;
  4. Run ninja with the number of CPUs of all machines + 1 for each machine. Ex: ninja -j6 for 2 Pandaboards.

A local build, on a single Pandaboard of just LLVM (no Clang, no check-all) takes about 63 minutes. With distcc and 2 Pandas it took 62 minutes!

That’s better, but not as much as one would hope for, and the reason is a bit obvious, but no less damaging: The Linker! It took 20 minutes to compile all of the code, and 40 minutes to link them into executable. That happened because while we had 3 compilation jobs on each machine, we had 6 linking jobs on a single Panda!

See, distcc can spread the compilation jobs as long as it copies the objects back to the master, but because a linker needs all objects in memory to do the linking, it can’t do that over the network. What distcc could do, with Ninja’s help, is to know which objects will be linked together, and keep copies of them on different machines, so that you can link on separate machines, but that is not a trivial task, and relies on an interoperation level between the tools that they’re not designed to accept.

Ninja Pools

And that’s where Ninja proved to be worth its name: Ninja pools! In Ninja, pools are named resources that bundle together with a specific level of scalability. You can say that compilers scale free, but linkers can’t run more than a handful. You simply need to create a pool called linker_pool (or anything you want), give it a depth of, say, 2, and annotate all linking jobs with that pool. See the manual for more details.

With the pools enabled, a distcc build on 2 Pandaboards took exactly 40 minutes. That’s 33% of gain with double the resources, not bad. But, how does that scale if we add more Pandas?

How does it scale?

To get a third point (and be able to apply a curve fit), I’ve added another Panda and ran again, with 9 jobs and linker pool at 2, and it finished in 30 minutes. That’s less than half the time with three times more resources. As expected, it’s flattening out, but how much more can we add to be profitable?

I don’t have an infinite number of Pandas (nor I want to spend all my time on it), so I just cheated and got a curve fitting program (xcrvfit, in case you’re wondering) and cooked up an exponential that was close enough to the points and use the software ability to do a best fit. It came out with 86.806*exp(-0.58505*x) + 14.229, which according to Lybniz, flattens out after 4 boards (about 20 minutes).

Pump Mode

Distcc has a special mode called pump mode, in which it pushes with the C file, all headers necessary to compile it solely on the node. Normally, distcc will pre-compile on the master node and send the pre-compiled result to the slaves, which convert to object code. According to the manual, this could improve the performance 10-fold! Well, my results were a little less impressive, actually, my 3-Panda cluster finished in just about 34 minutes, 4 minutes more than without push mode, which is puzzling.

I could clearly see that the files were being compiled in the slaves (distccmon-text would tell me that, while there was a lot of “preprocessing” jobs on the master before), but Ninja doesn’t print times on each output line for me to guess what could have slowed it down. I don’t think there was any effect on the linker process, which was still enabled in this mode.

Conclusion

Simply put, both distcc and Ninja pools have shown to be worthy tools. On slow hardware, such as the Pandas, distributed builds can be an option, as long as you have a good balance between compilation and linking. Ninja could be improved to help distcc to link on remote nodes as well, but that’s a wish I would not press on the team.

However, scaling only to 4 boards will reduce a lot of the value for me, since I was expecting to use 16/32 cores. The main problem is again the linker jobs working solely on the master node, and LLVM having lots and lots of libraries and binaries. Ninja’s pools can also work well when compiling LLVM+Clang on debug mode, since the objects are many times bigger, and even on above average machine you can start swapping or even freeze your machine if using other GUI programs (browsers, editors, etc).

In a nutshell, the technology is great and works as advertised, but with LLVM it might not be yet the thing. It’s still more profitable to get faster hardware, like the Chromebooks, that are 3x faster than the Pandas and cost only marginally more.

Would also be good to know why the pump mode has regressed in performance, but I have no more time to spend on this, so I leave as a exercise to the reader. 😉

LLVM Vectorizer

Now that I’m back working full-time with LLVM, it’s time to get some numbers about performance on ARM.

I’ve been digging the new LLVM loop vectorizer and I have to say, I’m impressed. The code is well structured, extensible and above all, sensible. There are lots of room for improvement, and the code is simple enough so you can do it without destroying the rest or having to re-design everything.

The main idea is that the loop vectorizer is a Loop Pass, which means that if you register this pass (automatically on -O3, or with -loop-vectorize option), the Pass Manager will run its runOnLoop(Loop*) function on every loop it finds.

The three main components are:

  1. The Loop Vectorization Legality: Basically identifies if it’s legal (not just possible) to vectorize. This includes checking if we’re dealing with an inner loop, and if it’s big enough to be worth, and making sure there aren’t any conditions that forbid vectorization, such as overlaps between reads and writes or instructions that don’t have a vector counter-part on a specific architecture. If nothing is found to be wrong, we proceed to the second phase:
  2. The Loop Vectorization Cost Model: This step will evaluate both versions of the code: scalar and vector. Since each architecture has its own vector model, it’s not possible to create a common model for all platforms, and in most cases, it’s the special behaviour that makes vectorization profitable (like 256-bits operations in AVX), so we need a bunch of cost model tables that we consult given an instruction and the types involved. Also, this model doesn’t know how the compiler will lower the scalar or vectorized instructions, so it’s mostly guess-work. If the vector cost (normalized to the vector size) is less than the scalar cost, we do:
  3. The Loop Vectorization: Which is the proper vectorization, ie. walking through the scalar basic blocks, changing the induction range and increment, creating the prologue and epilogue, promote all types to vector types and change all instructions to vector instructions, taking care to leave the interaction with the scalar registers intact. This last part is a dangerous one, since we can end up creating a lot of copies from scalar to vector registers, which is quite expensive and was not accounted for in the cost model (remember, the cost model is guess-work based).

All that happens on a new loop place-holder, and if all is well at the end, we replace the original basic blocks by the new vectorized ones.

So, the question is, how good is this? Well, depending on the problems we’re dealing with, vectorizers can considerably speed up execution. Especially iterative algorithms, with lots of loops, like matrix manipulation, linear algebra, cryptography, compression, etc. In more practical terms, anything to do with encoding and decoding media (watching or recording videos, pictures, audio), Internet telephones (compression and encryption of audio and video), and all kinds of scientific computing.

One important benchmark for that kind of workload is Linpack. Not only Linpack has many examples of loops waiting to be vectorized, but it’s also the benchmark that defines the Top500 list, which classifies the fastest computers in the world.

Benchmarks

So, both GCC and Clang now have the vectorizers turned on by default with -O3, so comparing them is as simple as compiling the programs and see them fly. But, since I’m also interested in seeing what is the performance gain with just the LLVM vectorizer, I also disabled it and ran a clang with only  -O3, no vectorizer.

On x86_64 Intel (Core i7-3632QM), I got these results:

Compiler    Opt    MFLOPS  Diff
   Clang    -O3      2413  0.0%
     GCC  -O3 vec    2421  0.3%
   Clang  -O3 vec    3346 38.6%

This is some statement! The GCC vectorizer exists for a lot longer than LLVM’s and has been developed by many vectorization gurus and LLVM seems to easily beat GCC in that field. But, a word of warning, Linpack is by no means representative of all use cases and user visible behaviour, and it’s very likely that GCC will beat LLVM on most other cases. Still, a reason to celebrate, I think.

This boost mean that, for many cases, not only the legality if the transformations are legal and correct (or Linpack would have gotten wrong results), but they also manage to generate faster code at no discernible cost. Of course, the theoretical limit is around 4x boost (if you manage to duplicate every single scalar instruction by a vector one and the CPU has the same behaviour about branch prediction and cache, etc), so one could expect a slightly higher number, something on the order of 2x better.

It depends on the computation density we’re talking about. Linpack tests specifically the inner loops of matrix manipulation, so I’d expect a much higher ratio of improvement, something around 3x or even closer to 4x. VoIP calls, watching films and listening to MP3 are also good examples of densely packet computation, but since we’re usually running those application on a multi-task operating system, you’ll rarely see improvements higher than 2x. But general applications rarely spend that much time on inner loops (mostly waiting for user input and then doing a bunch of unrelated operations, hardly vectorizeable).

Another important aspect of vectorization is that it saves a lot of battery juice. MP3 decoding doesn’t really matter if you finish in 10 or 5 seconds, as long as the music doesn’t stop to buffer. But taking 5 seconds instead of 10 means that on the other 5 seconds the CPU can reduce its voltage and save battery. This is especially important in mobile devices.

What about ARM code?

Now that we know the vectorizer works well, and the cost model is reasonably accurate, how does it compare on ARM CPUs?

It seems that the grass is not so green on this side, at least not at the moment. I have reports that on ARM it also reached the 40% boost similar to Intel, but what I saw was a different picture altogether.

On a Samsung Chromebook (Cortex-A15) I got:

Compiler   Opt    MFLOPS   Diff
  Clang    -O3       796   0.0%
    GCC  -O3 vec     736  -8.5%
  Clang  -O3 vec     773  -2.9%

The performance regression can be explained by the amount of scalar code intermixed with vector code inside the inner loops as a result of shuffles (movement of data within the vector registers and between scalar and vector registers) not being lowered correctly. This most likely happens because the LLVM back-end relies a lot on pattern-matching for instruction selection (a good thing), but the vectorizers might not be producing the shuffles in the right pattern, as expected by each back-end.

This can be fixed by tweaking the cost model to penalize shuffles, but it’d be good to see if those shuffles aren’t just mismatched against the patterns that the back-end is expecting. We will investigate and report back.

Update

Got results for single precision floating point, which show a greater improvement on both Intel and ARM.

On x86_64 Intel (Core i7-3632QM), I got these results:

Compiler   Opt   MFLOPS  Diff
   Clang   -O3     2530   0.0%
     GCC -O3 vec   3484  37.7%
   Clang -O3 vec   3996  57.9%

On a Samsung Chromebook (Cortex-A15) I got:

Compiler   Opt   MFLOPS  Diff
   Clang   -O3      867   0.0%
     GCC -O3 vec    788,  9.1%
   Clang -O3 vec   1324  52.7%

Which goes on to show that the vectorizer is, indeed, working well for ARM, but the costs of using the VFP/NEON pipeline outweigh the benefits. Remember than NEON vectors are only 128-bits wide and VFP only 64-bit wide, and NEON has no double precision floating point operations, so they’ll only do one double precision floating point operations per cycle, so the theoretical maximum depends on the speed of the soft-fp libraries.

So, in the future, what we need to be working is the cost model, to make sure we don’t regress in performance, and try to get better algorithms when lowering vector code (both by making sure we match the patterns that the back-end is expecting, and by just finding better ways of vectorizing the same loops).

Conclusion

Without further benchmarks it’s hard to come to a final conclusion, but it’s looking good, that’s for sure. Since Linpack is part of the standard LLVM test-suite benchmarks, fixing this and running it regularly on ARM will at least avoid any further regressions… Now it’s time to get our hands dirty!

 

Declaration of Internet Freedom

We stand for a free and open Internet.

We support transparent and participatory processes for making Internet policy and the establishment of five basic principles:

  • Expression: Don’t censor the Internet.
  • Access: Promote universal access to fast and affordable networks.
  • Openness: Keep the Internet an open network where everyone is free to connect, communicate, write, read, watch, speak, listen, learn, create and innovate.
  • Innovation: Protect the freedom to innovate and create without permission. Don’t block new technologies, and don’t punish innovators for their users’ actions.
  • Privacy: Protect privacy and defend everyone’s ability to control how their data and devices are used.

Don’t get it? You should be more informed on the power of the internet and what governments around the world have been doing to it.

Good starting places are: Avaaz, Ars Technica, Electronic Frontier Foundation, End Software Patents, Piratpartiet and the excellent Case for Copyright Reform.

Source: http://www.internetdeclaration.org/freedom

K-means clustering

Clustering algorithms can be used with many types of data, as long as you have means to distribute them in a space, where there is the concept of distance. Vectors are obvious choices, but not everything can be represented into N-dimensional points. Another way to plot data, that is much closer to real data, is to allow for a large number of binary axis, like tags. So, you can cluster by the amount of tags the entries share, with the distance being (only relative to others) the proportion of these against the non-sharing tags.

An example of tag clustering can be viewed on Google News, an example of clustering on Euclidean spaces can be viewed on the image above (full code here). The clustering code is very small, and the result is very impressive for such a simple code. But the devil is in the details…

Each red dots group is generated randomly from a given central point (draws N randomly distributed points inside a circle or radius R centred at C). Each centre is randomly placed, and sometimes their groups collide (as you can see on the image), but that’s part of the challenge. To find the groups, and their centres, I throw random points (with no knowledge of the groups’ centres) and iterate until I find all groups.

The iteration is very simple, and consists of two steps:

  1. Assignment Step: For each point, assign it to the nearest mean. This is why you need the concept of distance, and that’s a tricky part. With Cartesian coordinates, it’s simple.
  2. Update Step: Calculate the real mean of all points belonging to each mean point, and update the point to be at it. This is basically moving the supposed (randomly guessed) mean to it’s rightful place.

On the second iteration, the means, that were randomly selected at first, are now closer to a set of points. Not necessarily points in the same cluster, but the cluster that has more points assigned to any given mean will slowly steal it from the others, since it’ll have more weight when updating it on step 2.

If all goes right, the means will slowly move towards the centre of each group and you can stop when the means don’t move too much after the update step.

Many problems will arise in this simplified version, for sure. For instance, if the mean is exactly in between two groups, and both pull it to their centres with equally strong forces, thus never moving the mean, thus the algorithm thinks it has already found its group, when in fact, it found two. Or if the group is so large that it ends up with two or more means which it belongs, splitting it into many groups.

To overcome these deficiencies, some advanced forms of K-means take into account the shape of the group during the update step, sometimes called soft k-means. Other heuristics can be added as further steps to make sure there aren’t two means too close to each other (relative to their groups’ sizes), or if there are big gaps between points of the same group, but that kind of heuristics tend to be exponential in execution, since they examine every point of a group in relation to other points of the same group.

All in all, still an impressive performance for such a simple algorithm. Next in line, I’ll try clustering data distributed among many binary axis and see how k-means behave.

Emergent behaviour

There is a lot of attention to emergent behaviour nowadays (ex. here, here, here and here), but it’s still on the outskirts of science and computing.

Science

For millennia, science has isolated each single behaviour of a system (or system of systems) to study it in detail, than join them together to grasp the bigger picture. The problem is that, this approximation can only be done with simple systems, such as the ones studied by Aristotle, Newton and Ampere. Every time scientists were approaching the edges of their theories (including those three), they just left as an exercise to the reader.

Newton has foreseen relativity and the possible lack of continuity in space and time, but he has done nothing to address that. Fair enough, his work was much more important to science than venturing throughout the unknowns of science, that would be almost mystical of him to try (although, he was an alchemist). But more and more, scientific progress seems to be blocked by chaos theory, where you either unwind the knots or go back to alchemy.

Chaos theory exists for more than a century, but it was only recently that it has been applied to anything outside differential equations. The hyper-sensibility of the initial conditions is clear on differential systems, but other systems have a less visible, but no less important, sensibility. We just don’t see it well enough, since most of the other systems are not as well formulated as differential equations (thanks to Newton and Leibniz).

Neurology and the quest for artificial intelligence has risen a strong interest in chaos theory and fractal systems. The development in neural networks has shown that groups and networks also have a fundamental chaotic nature, but more importantly, that it’s only though the chaotic nature of those systems that you can get a good amount of information from it. Quantum mechanics had the same evolution, with Heisenberg and Schroedinger kicking the ball first on the oddities of the universe and how important is the lack of knowledge of a system to be able to extract information from it (think of Schroedinger’s cat).

A network with direct and fixed thresholds doesn’t learn. Particles with known positions and velocities don’t compute. N-body systems with definite trajectories don’t exist.

The genetic code has some similarities to these models. Living beings have far more junk than genes in their chromosomes (reaching 98% of junk on human genome), but changes in the junk parts can often lead to invalid creatures. If junk within genes (introns) gets modified, the actual code (exons) could be split differently, leading to a completely new, dysfunctional, protein. Or, if you add start sequences (TATA-boxes) to non-coding region, some of them will be transcribed into whatever protein they could make, creating rubbish within cells, consuming resources or eventually killing the host.

But most of the non-coding DNA is also highly susceptible to changes, and that’s probably its most important function, adapted to the specific mutation rates of our planet and our defence mechanism against such mutations. For billions of years, the living beings on Earth have adapted that code. Each of us has a super-computer that can choose, by design, the best ratios for a giving scenario within a few generations, and create a whole new species or keep the current one adapted, depending on what’s more beneficial.

But not everyone is that patient…

Programming

Sadly, in my profession, chaos plays an important part, too.

As programs grow old, and programmers move on, a good part of the code becomes stale, creating dependencies that are hard to find, harder to fix. In that sense, programs are pretty much like the genetic code, the amount of junk increases over time, and that gives the program resistance against changes. The main problem with computing, that is not clear in genetics, is that the code that stays behind, is normally the code that no one wants to touch, thus, the ugliest and most problematic.

DNA transcriptors don’t care where the genes are, they find a start sequence and go on with their lives. Programmers, we believe, have free will and that gives them the right to choose where to apply a change. They can either work around the problem, making the code even uglier, or they can go on and try to fix the problem in the first place.

Non-programmers would quickly state that only lazy programmers would do the former, but more experienced ones will admit have done so on numerous occasions for different reasons. Good programmers would do that because fixing the real problem is so painful to so many other systems that it’s best to be left alone, and replace that part in the future (only they never will). Bad programmers are not just lazy, some of them really believe that’s the right thing to do (I met many like this), and that adds some more chaos into the game.

It’s not uncommon to try to fix a small problem, go more than half-way through and hit a small glitch on a separate system. A glitch that you quickly identify as being wrongly designed, so you, as any good programmer would do, re-design it and implement the new design, which is already much bigger than the fix itself. All tests pass, except the one, that shows you another glitch, raised by your different design. This can go on indefinitely.

Some changes are better done in packs, all together, to make sure all designs are consistent and the program behaves as it should, not necessarily as the tests say it would. But that’s not only too big for one person at one time, it’s practically impossible when other people are changing the program under your feet, releasing customer versions and changing the design themselves. There is a point where a refactoring is not only hard, but also a bad design choice.

And that’s when code become introns, and are seldom removed.

Networks

The power of networks is rising, slower than expected, though. For decades, people know about synergy, chaos and emergent behaviour, but it was only recently, with the quality and amount of information on global social interaction, that those topics are rising again in the main picture.

Twitter, Facebook and the like have risen so many questions about human behaviour, and a lot of research has been done to address those questions and, to a certain extent, answer them. Psychologists and social scientists knew for centuries that social interaction is greater than the sum of all parts, but now we have the tools and the data to prove it once and for all.

Computing clusters have being applied to most of the hard scientific problems for half a century (weather prediction, earthquake simulation, exhaustion proofs in graph theory). They also took on a commercial side with MapReduce and similar mechanisms that have popularised the distributed architectures, but that’s only the beginning.

On distributed systems of today, emergent behaviour is treated as a bug, that has to be eradicated. In the exact science of computing, locks and updates have to occur in the precise order they were programmed to, to yield the exact result one is expecting. No more, no less.

But to keep our system out of emergent behaviours, we invariably introduce emergent behaviour in our code. Multiple checks on locks and variables, different design choices for different components that have to work together and the expectancy of precise results or nothing, makes the number of lines of code grow exponentially. And, since that has to run fast, even more lines and design choices are added to avoid one extra operation inside a very busy loop.

While all this is justifiable, it’s not sustainable. In the long run (think decades), the code will be replaced or the product will be discontinued, but there is a limit to which a program can receive additional lines without loosing some others. And the cost of refactoring increases with the lifetime of a product. This is why old products don’t get too many updates, not because they’re good enough already, but because it’s impossible to add new features without breaking a lot others.

Distant future

As much as I like emergent behaviour, I can’t begin to fathom how to harness that power. Stochastic computing is one way and has been done with certain level of success here and here, but it’s far from easy to create a general logic behind it.

Unlike Turing machines, emergent behaviour comes from multiple sources, dressed in multiple disguises and producing far too much variety in results that can be accounted in one theory. It’s similar to string theory, where there are several variations of it, but only one M theory, the one that joins them all together. The problem is, nobody knows how this M theory looks like. Well, they barely know how the different versions of string theory look like, anyway.

In that sense, emergent theory is even further than string theory to be understood in its entirety. But I strongly believe that this is one way out of the conundrum we live today, where adding more features makes harder to add more features (like mass on relativistic speeds).

With stochastic computing there is no need of locks, since all that matter is the probability of an outcome, and where precise values do not make sense. There is also no need for NxM combination of modules and special checks, since the power is not in the computation themselves, but in the meta-computation, done by the state of the network, rather than its specific components.

But that, I’m afraid, I won’t see in my lifetime.