Middle Earth: MPI

MPI stands for Message Passing Interface. It's a standard for running parallel programs across the nodes of a cluster, with a library that handles the message passing between processes. It's a very powerful library and is now the de facto standard for parallel programming.

Normally I'd choose LAM MPI, as I always have in the past, but this time I wanted to test MPICH, another very well-known MPI implementation.

But I found that the MPICH version packaged for Ubuntu is rather old: the online documentation describes a completely different version from the one I had, and I couldn't find any documentation at all in any Ubuntu package. (For instance, my config file was Apache-like while the current format is XML, so I couldn't even start the service.)

Well, I guess the best always wins: this is the third time I've chosen LAM over MPICH for exactly the same reasons, installation and documentation.

Installing LAM MPI was very simple. On the master node (gandalf) I installed:

$ sudo apt-get install lam-runtime lam4c2 lam4-dev

And on the execution nodes, just the runtime:

$ sudo apt-get install lam-runtime lam4c2


A while ago I developed a set of scripts to help run and sync a LAM MPI cluster when you don't yet have a shared disk to use within the cluster (still my case). It's especially suited to home clusters, or to the start of a more serious cluster when you haven't had time to set up shared storage yet. πŸ˜‰

So, installing MPEasy is easy: download the tarball, unpack it into some directory and set the environment variables in your startup script.

On .bashrc:

export MPEASY=~/mpeasy
export PATH=$PATH:$MPEASY/bin

On .cshrc:

setenv MPEASY ~/mpeasy
setenv PATH $PATH:$MPEASY/bin

And put the node list, one host per line, in $MPEASY/conf/lam_hosts. After that, just running:

$ bw_start

should start your MPI cluster. After that you can run some MPI tests. Go to the $MPEASY/devel/test directory and compile hello.c:

$ mpicc -o hello hello.c

Then you need to sync the current devel directory to all nodes:

$ bw_sync

And run:

$ bw_run 10 $MPEASY/devel/test/hello

You should be able to do the same with all the other code in there; just remember to sync before running, otherwise the nodes will have an outdated version and you'll run into problems. On a shared-disk environment this wouldn't be an issue, of course.

PI Monte Carlo – Distributed version

In April I published an entry explaining how to calculate PI using the most basic Monte Carlo algorithm, and now, using the Middle Earth cluster, I can do it parallelized.

Parallelizing Monte Carlo is a very simple task because of its random and independent nature, and this basic Monte Carlo is even easier: I can run exactly the same routine as before on all nodes, then sum everything at the end and divide by the number of nodes. To achieve that, I just changed the main.cc file to use MPI; quite simple indeed.

The old main.cc just called the function and returned the value:

    area_disk(pi, max_iter, delta);
    cout << "PI = " << pi << endl;

But the new version needs to know whether it's the main node or a computing node. Then all computing nodes calculate the area, and the main node gathers and sums the results.

    /* Nodes, compute and return */
    if (myrank) {
        area_disk(tmp, max_iter, delta);
        MPI_Send( &tmp, 1, MPI_DOUBLE, 0, 17, MPI_COMM_WORLD );

    /* Main, receive all areas and sum */
    } else {
        for (int i=1; i < size; i++) {
            MPI_Recv( &tmp, 1, MPI_DOUBLE, i, 17, MPI_COMM_WORLD, &status );
            pi += tmp;
        }
        pi /= (size-1);
        cout << "PI = " << pi << endl;
    }

In MPI, myrank tells you your node number and size the total number of nodes. In the most basic MPI program, if myrank is zero you're the main node, otherwise you're a computing node.

All computing nodes calculate the area and MPI_Send the result to the main node. The main node waits for all responses in its loop, adding each temporary result tmp to pi, and at the end divides by the number of computing nodes.


This Monte Carlo is extremely basic and very easy to parallelize. As the same copy runs over N computing nodes with no dependency between them, you should achieve a speed-up of nearly N times over the non-parallel version.

Unfortunately, this algorithm is so slow and inaccurate that even running on 9 computing nodes (i.e. roughly 9 times faster) it's still wrong at the third digit.
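That third digit is no surprise: the estimate is 4 times the fraction of points inside the quarter circle, so its standard deviation after n samples is roughly

```latex
\sigma_{\hat{\pi}} = \sqrt{\frac{\pi(4-\pi)}{n}} \approx \frac{1.64}{\sqrt{n}}
```

so even a million samples leaves an error around 0.002, right at the third digit, and halving the error always costs four times the work, no matter how many nodes share it.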

The slowness is due to the algorithm's stupidity, but the inaccuracy is due to the lack of really good standard random number generators: almost all machines yielded results far from the 5-digit M_PI macro of the C standard library. Besides, there are so many other, much faster ways of calculating PI that this would never be a good approach anyway!

The good thing is that it was just meant to show a distributed Monte Carlo algorithm working… πŸ˜‰

Services Market

A new service is available to Brazilian professionals. It's called Services Market and is a new way of connecting companies that need projects done with professionals that need projects to do.

Quite complete for a brand new service, it's a mix of a consulting company with eBay, covering not only IT but lots of professional areas such as architecture, CAD, legal, marketing, translations, etc. It'll heat up Brazil's services market for sure!

Twisting bits

A long time ago a friend told me about a very simple algorithm with beautiful results: a simple inversion of a number's bits. Despite being nothing amazing "as is", it can have some cool uses. In his math paper about it, he explains how good it is at scrambling numbers to give a feeling of randomness while still assuring that every number in the range shows up exactly once, no more, no less.

For instance, take a number with four bits: 15 [1111]. Iterating from 1 to 15 visits every number exactly once, right? So, if I take every number from 1 to 15 but, before showing it to the user, reverse its bits, the output won't be a sequence, yet I'm still iterating through one. The output looks random, but in fact it's just a weird iteration.

The sequence would be like:

Original     Output
01: 0001    08: 1000
02: 0010    04: 0100
03: 0011    12: 1100
04: 0100    02: 0010
...
15: 1111    15: 1111

There are some drawbacks, of course. One is that it's not even pseudo-random; another is that all even numbers come out before all odd ones (because the highest bit of the input becomes the lowest bit of the output). But I've seen two good uses for it:

– Render images in a non-linear sequence, giving you an idea of what the final picture looks like much sooner
– Create IDs for websites in a non-trivial sequence, making it harder for people to guess other users' IDs

But again, it won't block people from seeing anything anywhere; it's just a way of scrambling. These days sites use hashes for the same effect: better, but still not safe.

Another good thing is that if you change the width (how many bits you reverse, e.g. 4 bits as in 0001 versus 8 bits as in 00000001), the sequence will be completely different.

Anyway, if you’re curious check the code.

Middle Earth: Moving freely

So, talking about the patch hack reminded me to say a word about an important thing when building clusters: moving around. If you have hundreds of nodes and have to update one config file, would you like to type your admin password hundreds of times?

So, the simple way of doing it on a controlled environment is using passwordless SSH keys and passwordless sudo for certain tasks.

SSH Keys: When you SSH to another computer you normally have to type a password, but there's another way of authenticating: a trusted DSA/RSA key, created with the ssh-keygen tool:

$ ssh-keygen -t dsa -b 1024

It'll ask for a passphrase, and that's where you just hit ENTER. This creates two files in your ~/.ssh directory: id_dsa and id_dsa.pub. The public file should be copied to every node's ~/.ssh directory and renamed authorized_keys. That's it: SSH to the node and check that it no longer asks for a password.

$ ssh node mkdir .ssh  (type password)
$ scp .ssh/id_dsa.pub node:.ssh/  (type password)
$ ssh node mv ~/.ssh/id_dsa.pub ~/.ssh/authorized_keys  (type password)
$ ssh node   (won't ask for a password)

Sudo rules: Sudo lets you execute things as root without being root, but root must allow it, and the way to allow it is to add you to the /etc/sudoers file. Ubuntu already puts you in sudoers if you provided your username during installation. Either way, edit the file with the visudo application:

$ sudo visudo

On Ubuntu's sudoers the line should look something like this:

%admin  ALL=(ALL) ALL

And the quick solution is to change it to this:

%admin  ALL=(ALL) NOPASSWD: ALL

When you save and close the editor (:x in vi), visudo updates sudoers and you'll be able to run everything as root without typing a password. BEWARE! This approach is very, very insecure, so make sure all your machines are completely separated from your network, otherwise it'll compromise the entire network.

Disclaimer: I use this because it doesn't really open a new hole: if someone breaks into one of the machines via some other security hole, all the nodes are compromised anyway, since they share exactly the same configuration. So there's no point in trying to make one node more secure than the others.

So, with all that said, it’s very simple to shutdown the cluster:

for node in `cat /etc/cluster`; do
    ssh $node sudo halt
done

no passwords, no words, just a quick halt.

Middle Earth: hacks

Ok ok, I admit… I tried to stay away from hacks and non-packaged things, but I was just too lazy to find the best configuration manager when all I needed was to distribute my files around and execute the same command on all nodes of the cluster, so I cheated!

At least it was a very small cheat, and I still want to do it the right way, someday…. πŸ˜€

I needed multiple execution and multiple copy, so I created a file called /etc/cluster containing all my nodes' names, one per line:


Then I made an extremely simple script to read it and execute a command on every node:

for node in `cat /etc/cluster`; do
    ssh -t $node $CMD
done

I also did the same using scp, and then came the joy of the day: patch! Yes, I quite liked it: a script that takes the original path on the nodes and the patch file, copies the patch to your home directory on each node, and applies it to the original file, using both of the simple scripts made previously to copy and run.

Then it became easy to administer the cluster's configuration:

$ diff -e /etc/mon/mon.cf /etc/mon/mon.cf.original > ~/mon.patch
$ cpatch /etc/mon/mon.cf ~/mon.patch
$ cexe /etc/init.d/mon restart

Ludicrously simple, isn’t it?!?! πŸ˜€

Of course, to do that I needed to create two things: passwordless ssh and passwordless sudo on each node. But that’s another story.

Middle Earth: Monitoring

My first choice was Ganglia, as it's used by Oscar and Rocks, but the Ubuntu version is very old (2.5.7, while the current one is 3.0.3), the configuration file format is completely different, and the 2.5.7 man page is empty. So nagios and mon were my second choices; mon seemed much simpler, and having used nagios before, I didn't find it very straightforward to configure either.

So I ended up with mon. Mon is stable software, written in Perl and available from the kernel.org site. It's so simple to configure and customize that I spent more time installing the packages than setting it up.

So, on Ubuntu, what you need is:

$ sudo apt-get install mon fping

Edit the /etc/mon/mon.cf and put this:

hostgroup cluster [cluster IPs separated by space]
watch cluster
        service Ping
                interval 1m
                monitor fping.monitor -r 5 -t 2000
                period wd {Mon-Sun} hr {0am-24pm}
                        alertafter 1
                        alertevery 4h
                        alert mail.alert [your email]

And restart the service:

$ sudo /etc/init.d/mon restart

If you run "monshow --full" you'll see the status of your Ping check. If you want a better (but not that much better) interface, you can run monshow as a CGI, and for that you'll need Apache.

$ sudo apt-get install apache

And then, symlink the monshow as a CGI from the cgi-bin dir on Apache’s tree:

$ sudo ln -s /usr/bin/monshow /usr/lib/cgi-bin/monshow.cgi

Then just point your browser at "http://[machine-ip]/cgi-bin/monshow.cgi" and it should show you some HTML with the status of your health checks. I changed monshow to always show me everything: in the $CF variable, I set both "full" and "show-disabled" to 1 (true).

Extending and customizing

You can check all the monitors at /usr/lib/mon/mon.d/ and even create your own if you know a little Perl: copy an existing monitor and edit it to your needs. Your monitor can also take arguments, just as fping has its own; it's very simple.

Middle Earth: Ubuntu Installation

My machines were once computing nodes of a bigger cluster, so none of them had CD or floppy drives, nor a good way of adding them easily. They also had nothing on their hard drives except temporary files, and even those were barely used… they were almost diskless nodes booting over the network, so to speak.

I had two choices then: fit a CD drive and boot from it to install Ubuntu, or network boot with PXE. The second one was my choice.

First, you'll need a PXE boot server so your boxes can connect and receive the boot files. This is very simple, because the PXE protocol is, in a nutshell, a very simple DHCP plus an even simpler TFTP file server.

To setup the DHCP server you need to:

$ sudo apt-get install dhcp

and change the /etc/dhcpd.conf to contain just that:

subnet [network] netmask [netmask] {
      range [first IP] [last IP];
      option routers [router IP];
      filename "pxelinux.0";
}

and remember to put the IP range of your network and your router correctly. Restart the DHCP daemon:

$ sudo /etc/init.d/dhcp restart

Now you need to install the TFTP server. If you try to get the regular TFTP package you'll fail, because it doesn't implement PXE boot correctly (it's missing some basic features), so use TFTP HPA instead:

$ sudo apt-get install tftpd-hpa xinetd

and add a file called "tftp" to your /etc/xinetd.d/ with the following content:

service tftp
{
      socket_type             = dgram
      protocol                = udp
      wait                    = yes
      user                    = root
      server                  = /usr/sbin/in.tftpd
      server_args             = -s /srv/tftp
      disable                 = no
      per_source              = 11
      cps                     = 100 2
      flags                   = IPv4
}
and restart the service:

$ sudo /etc/init.d/xinetd restart

You'll also need to create the /srv/tftp directory to hold pxelinux.0 and all the other files needed to boot:

$ sudo mkdir /srv/tftp

Ok, now we have both services working but no files! Here's how to get them:

$ cd /srv/tftp
$ URL=http://archive.ubuntu.com/ubuntu/dists/dapper/main/installer-i386/current/images
$ sudo lftp -c "open $URL; mirror netboot/"
$ sudo mv netboot/* .
$ sudo rmdir netboot
$ sudo tar zxf netboot.tar.gz

Files in place, boot your node and you should have a nice ubuntu screen asking you to boot your installation of Ubuntu. Just type “server” and ENTER the realm of Ubuntu.

Some portions were taken from http://wiki.koeln.ccc.de/index.php/Ubuntu_PXE_Install, but I couldn't make it work just by following those steps. You may have to look here and there to make your system work properly.

Middle Earth Beowulf Project

I'm venturing into another cluster made out of old boxes, but this time it's different, or at least I believe it is.


With the 486 cluster we got crappy boxes, and it took us weeks to get one machine running using components from all the others just to boot: no good. The next one used other people's boxes at work, and it went well until people started to complain about someone else taking their precious cycles. At home the cluster worked pretty well, but with only two machines, one of them dual-booting Windows, it wasn't up most of the time.

This time it’ll be different. And it’ll be harder!

I've decided to install just the necessary, and for a Beowulf project that should be very simple: just choose one of several good off-the-shelf packages like Oscar or Rocks and you have it all. But I started differently: I installed Ubuntu 6.06 Server on all machines.

As all those packages are RPM-based, Debian-based distributions get no love on this run, and I won't resort to using alien to convert everything to .deb and install it here unless strictly necessary; from what I've seen around, it isn't.

So, I decided to take that project ahead and will not rest until I have at least four things:
– A node configuration, administration and monitoring tool
– A job scheduler
– A Parallel File System
– A proper MPI installation (easy with MPICH or LAM/MPI)

Well, it seems a long way to go; I'll post my steps on this blog as soon as I finish them.