Silly prototypes of the week

This time I don’t have a project, nor probably will have time to implement them (I’m on real holidays!), so I’ll just post the ideas and welcome any comments on the feasibility or usefulness of them. Feel free to implement them and let me know how it went.

Independent Bayesian Agents

An Artificial Neural Network is a collection of neurons connected in meaningful ways that, when the information, coming from an input layer, passes through them, a resulting signal is produced on the output layer. The learning is done via feedback (good or bad), by changing the internal functions of each neuron to accommodate them for the next “pulse”.

The problem with this is that, after the learning phase, the resulting network can only classify a strict class of problems. Networks created to recognise alphabetical letter in an image has a different format and components than those to recognise other patterns. There are some implementations where the network itself can change its connections in order to learn something new, but that’s not commonly seen, not even on many scientific works on the subject.

You can have the same network structure using Bayesian nodes. The class of problems to be solved are not the same but the overall logic is similar.

Bayesian Agents are different. They are independent, learning programs that might be able to communicate with other agents upon necessity and also update their internal states from feedback and previous probabilities. It also has the advantage to return not only true/false values but probabilities or even probability distributions to the user.

But this way, your agents will be much more complex than the neurons on ANN, and that’s a path that is rarely worth taking if you want to build emergent behaviour as it becomes more and more difficult to combine them to produce unexpected behaviours. So, the idea is to break all the agents into very small pieces that can interact with each other, with just a few rules.

This way, the agents themselves would be pretty much like objects in an OO design, following the implementation of a class (agent definition) and inheriting or using methods from other Agents (interaction) you can, at the same time, build a network with any format you want (using graphs or even network connections via name service or so for the interactions) and keep it simple, by following the agent definition and not trying to put any complex logic on each agent.

Same as OO design, you don’t want to create one single class that does everything. What you do is a relationship between classes that express the problem set you’re solving and, if well designed, you can reuse some of the components to solve other problems as well. The power of the OO design is the relationship between the classes and not what the classes are actually doing, right?

So, my hunch is that, if we do very simple agents and let the network itself be the relationship between them, we acquire a new power of extensibility.

Also, by changing the connections, the learning happens at the network level instead of the agent level. Further on, by inflicting an “explorer behaviour” on each agent (like extending from a common class Explorer for all agents), they can search for new answers once in a while, in background, to increase their confidence levels.

In the same line, by extending from an agent Feedback you can make all agents be able to process feedback from the users, from its derived classes such as BooleanFeedback returning good or bad, or MultipleChoiceFeedback, ContinuousFeedback and so on.

Genetic programming using template Policies

Learning by structure given above (network structure) is similar to genetic algorithms, but in a different level. Genetic programming allows you to learn by the structure of the program itself. A standard way to solve that is using a rulebase approach. Create a big set of simple rules and write programs to use them in a mixed fashion.

Given the problem you want to solve, run all programs against it and try to define the quality of each solution (if they’re at least able to get somewhere) and give those programs closer to the solution a higher chance of survival, but don’t kill the unfit, and that’s an important step.

If there are good, but dormant, rules (genes) on the unfit and you kill them, you’re removing from genetic pool (rulebase) some solutions that could be the best when joint with others that you didn’t have the chance to join yet. Anyway, crossing those programs (by exchanging rules and creating new combinations) and running the next breed on the same problem you can, iteratively, lead the evolution towards the best solution by chance.

So, what about template policies?

C++ templates are very powerful. One of the great powers (which comes with great responsibility) is to be able to inherit functionality from the template argument.

template <class EatPolicy, MovePolicy, ReproducePolicy>
class LifeForm : public EatPolicy, MovePolicy, ReproducePolicy {
};

With this, you can create a life form following one of the policies provided. They will inherit the methods and properties of the policies as if it was a normal OO inheritance. So, to create a bacterium, just implement some policies inheriting from EatPolicy, MovePolicy and ReproducePolicy and instantiate like this:

LifeForm<Fermentation, Osmosis, Mitosis> bacteria;

You can also create several policies inheriting from Fermentation (from different materials, etc) and create several types of bacteria, but the problem is how to cross them later. Because templates are evaluated during compile time (and types are created), you can’t create new types during run time and the crossing over should be done off-line (and recompile). A few pre-processor commands and a Perl script could do it, though, and isolate that nasty part from the rest of the code.

I’m not sure of the implementation details of the cross-over but the basic skeleton did work: Skeleton of genetic policies. Would give it a try later on.

Silly project of the week: molecule dynamics

This week’s project is a molecular dynamics simulation. Don’t get too excited, it’s not using any of the state-of-art algorithms nor is assembling 3-dimensional structures of complex proteins. I began with a simple carbon chain using only coulomb’s law in a spring-mass system.

The molecule I’m using is this:

Molecular Dynamics

The drawing program is quite simple and wont work for most molecules, but for the 2-dimensional simple molecules (max. of 3 connections per atom) it kinda works.

Later on, putting the program to run, each atom “pushes” all others electrically and the spring “pulls” them back. A good way to solve that is to say that q1 . q2 / x² = – k . x = m . d²x/dx² (where x is a vector) and integrate numerically using Runge-Kutta.

But that’s my first openGL program, so I decided to go easy on the model and actually see it pseudo-working with an iterative-based simulation following the same equations above. This picture is a frame after a few iterations.

Quoting its page: “As this simulation is not using any differential solution, the forces grow and grow until the atom becomes unstable and break apart. Some Runge-Kutta is required to push the realism further.

UPDATE:

The webpage of the fully-functional prototype is HERE.

Markov chain available for NumCalc

NumCalc is my personal numerical methods program where I’ve implemented some nice algorithms for numerical computation. The new in the list is Markov Chain.

The Wikipedia article (link above) is far too complex… I’ll try to give a simplified explanation:

A travelling salesman goes back and forth in a set of cities and, given the city he is currently in, you want to know what’s the next city he’ll travel. Of course, he won’t show you his travel itinerary.

The simplest way of doing it is to record all travels he does within time. For each city, you have a counter of how many times he went from each city to all other. If you think these numbers as a portion of all the travels from each city you have a probability of going to any other city in the list.

Example: When he was on Paris, he went 3 times to London, 2 times to Amsterdam and only 1 time to Milan. It means that, 3 out of 6 times (50%) he went to London, so the probability of going again is 50%.

For such small quantities it’s weird to assume that the behaviour will be always the same (he can go to new cities as well) but when the amount of statistics you have is big, the behaviour become very repetitive and thus, predictable.

Real Cases:

  • MegaHAL uses an advanced Markov model to create chat bots by replying what people said before based primarily on the sole probability of one word coming after the other.
  • HMMER is hidden Markov model (a Markov model to predict another Markov model to predict something else) that can do powerful searches within long and scrambled sequences of proteins and genes. The IntrePro group use it to find their protein matches against UniProt.

Of course my super-simplified model is far from being that efficient and useful, but it’s a good start to understand how simple and how powerful they are. You can download it from its webpage.

Numerical methods package in C++

I still code in my spare time and for a while I’ve been gathering some numerical methods I did at university in an easy-to-understand generic C++ package. Despite being easy to understand, I also tried to implement the best method I knew for each problem.

The root finding algorithm is based on Brent’s method which is, in turn, based on the secant method. If everything fails, it throws an exception and you can use the safe bisection method.

The integration is using the Romberg’s method, which is a further extrapolations of Simpson’s method which is already much better than the trapezoid basic method.

For Interpolation I’m using the Natural Cubic Spline but would like to implement other types of splines (like complete, periodical, clamped etc). The interpolation is working, but I couldn’t managed to get the coefficients right yet.

Other codes I have are Monte Carlo, Runge-Kutta and Markov Chain (this one using boost graph library for C++) and will be integrated soon. I’ll let you know when it’s done.

Silly projects of the week

Last week I did two silly but still quite funny projects: word search on protein sequences and chat bot using markov chains.

Word search

Searching for similar sequences among the known proteins to understand evolutionary paths and function similarities is a powerful algorithm called BLAST. Following the same lines I spent a few minutes to develop a similar (but not quite the same) algorithm to search for all dictionary words on all UniProt sequences.

The algorithm will look into a dictionary (say /usr/share/dict/words) and for every sequence it’ll find the words in it, allowing a few (no more than 2) amino-acids to be between the letters. Than it gives a weight to the match based on the size of the word and the “added length” due to additional amino-acids in between. You can also restrict the number of letters on the word list to avoid too short or long words.

The results for 7 to 9 characters words in English searched only on Swiss-Prot entries are:

STRANGE  1.4
CHARADE  1.75
SLEDGED  1.75
MEALIER  1.75
DEMEANS  2.33333
DETAINS  2.33333

Markov bot

I knew a markov chain could have true wisdom but I never tried to make it actually talk to anyone until a friend told me about MegaHAL (much better than my markov model, of course). I decided to give it a try and got the XMPP bot from the Python examples and plugged in my markov program on all phrases spoken to it.

In the beginning it was awful, of course, it knew nothing. But with time (and hard work from some friends to actually teach him about leisure, programming, physics etc) it could babble a few coherent words once in a while.

The results are complete rubbish, below is a sample of the complete non-sense conversation I had with it now:

rengolin: how are you doing? have you managed to find the answer to Life, The Universe and Everything?
oraculoids: with an amazing company makes the basic points of some operator
rengolin: operator or not, you must come to reality and sense the danger there is in talking like that after such a complex and meaningful question
oraculoids: dead as an interface to follow an animal
rengolin: whatever animal you prefer, I would say but it seems that you are now driving the conversation
oraculoids:
rengolin: aha! I know you didn’t have answers to everything!
oraculoids: how is in between and pray
rengolin: I don’t pray too much, really, I’m not that kind of a person anyway…
oraculoids: but at the conversation it’t fall into them!
rengolin: I would never say that, especially not with you my friend
oraculoids: A design pattern is a practical laywer
rengolin: lawyers have no purpose other than screwing up the poor
oraculoids: hi bot
rengolin: I am not a bot
oraculoids: field of some of life, love them!

Quite stupid, I might say, but still I had a lot of fun doing it and that’s what matters… 😉 In spite of that, though, megaHAL is quite an impressive program and BLAST is a very powerful search mechanism.

RDBMS, to rewrite or not to rewrite… I got confused…

Mike Stonebreaker (Ingres/Postgres) seems to be confused as well…

First he said Google’s Map/Reduce was “Missing most of the features that are routinely included in current DBMS”, but earlier he said to ditch RDBMS anyway because “modern use of computers renders many features of mainstream DBMS obsolete”.

So, what’s the catch? Should we still use RDBMS or not? Or should we still develop technologies based on relational databases while Mike develops himself the technology of the new era? Maybe that was the message anyway…

My opinion:

MapReduce is not a step backwards, there are sometimes when indexing is actually slower than brute-force. And I’m not saying that on insert time the indexes have to be updated and so on, I’m saying in the actual search for information, if the index is too complex (or too big) it might take more time to search through the index, compute the location of the data (which might be anywhere in a range of thousands of machines), retrieve the data and later on, sort, or search on the remaining fields.

MapReduce can effectively do everything in one step, while still in the machine and return less values per search (as opposed to primary key searches first) and therefore less data will be sent over the network and less time will be taken.

Of course, MapReduce (as any other brute-force methods) is hungry for resources. You need a very large cluster to make it really effective (1800 machines is enough :)) but that’s a step forward something different from RDBMS. In the distributed world, RDBMS won’t work at all, something have to be done and Google just gave the first step.

Did we wait for warp-speed to land on the moon?! No, we got a flying foil crap and landed on it anyway.

Next steps? Many… we can continue with brute-force and do a MapReduce on the index and use the index to retrieve in an even larger cluster, or use automata to iteratively search and store smaller “views” somewhere else, or do statistical indexes (quantum indexes) and get the best result we can get instead of all results… The possibilities are endless…

Lets wait and see how it goes, but yelling DO IT than later DON’T is just useless…

UPDATE:

This is not a rant against Stonebreaker, I share his ideas about the relational model being far too outdated and the need for something new. What I don’t agree, though, is that MapReduce is a step backwards, maybe not even a step forward, probably sideways.

The whole point is that the relational model is the thesis and there are lots of antithesis, we just didn’t come up with the synthesis yet.

True wisdom from randomness

You can live a whole life and remain stupid but a stupid program using a pseudo-random number generator and a clever algorithm (Markov’s chain) can excel us quite easily:

  • Input: The GNU GPL license
  • Output:

    “GNU General Public License along with you add to sue for details.”

  • Input: man perl
  • Output:

    “PERL (higher numbers usually being affected by wraparound).”

  • Input: My own wiki
  • Output:

    “Bioinformatics is a physicist (definitions of enforcing standards but it puts wrong things can build complex information systems and nothing is totally unacceptable for every new piece of giving generic answers.”

LSF, Make and NFS

I use LSF at work, a very good job scheduler. To parallelize my jobs I use Makefiles (with -j option) and inside every rule I run the command with the job scheduler. Some commands call other Makefiles, cascading even more the spawn of jobs. Sometimes I achieve 200+ jobs in parallel.

Our shared disk BlueArc is also very good, with access times quite often faster than my local disk but yet, for almost two years I’ve seen some odd behaviour when putting all of them together.

I’ve reported random failures on processes that worked until then and, without any modifications, worked ever after. But not a long time ago I figured out what the problem was… NFS refresh speed vs. LSF spawn speed using Makefiles.

When your Makefile looks like this:

bar.gz:
    $(my_program) foo > bar
    gzip bar

There isn’t any problem because as soon as bar is created gzip can run and create the gz file. Plain Makefile behaviour, nothing to worry about. But then, when I changed to:

bar.gz:
    $(lsf_submit) $(my_program) foo > bar
    $(lsf_submit) gzip bar

Things started to go crazy. Once every a few months in one of my hundreds of Makefiles it just finished saying:

bar: No such file or directory
make: *** [bar.gz] Error 1

And what’s even weirder, the file WAS there!

During the period when these magical problems were happening, which I was lucky to streamline the Makefiles every day so I could just restart the whole thing and it went well as planned, I had another problem, quite common when using NFS: NFS stale handle.

I have my CVS tree under the NFS filesystem and when testing some perl scripts between AMD Linux and Alpha OSF machines I used to get this errors (the NFS cache was being updated) and had to wait a bit or just try again on most of the cases.

It was then that I have figured out what the big random problem was: NFS stale handle! Because the Makefile was running on different computers, the NFS cache took a few milliseconds to update and the LSF spawner, berzerk for performance, started the new job way before NFS could reorganize itself. This is why the file was there after all, because it was on its way and the Makefile crashed before it arrived.

The solution? Quite stupid:

bar.gz:
    $(lsf_submit) "$(my_program) foo > bar" && sleep 1
    $(lsf_submit) gzip bar

I’ve put it on all rules that have more than one command being spawned by LSF and never had this problem again.

The smart reader will probably tell me that it’s not just ugly, it doesn’t cover all cases at all, and you’re right, it doesn’t. NFS stale handle can take more than one second to update, single-command rules can break on the next hop, etc but because there is some processing between them (rule calculations are quite costy, run make -d and you’ll know what I’m talking about) the probability is too low for our computers today… maybe in ten years I’ll have to put sleep 1 on all rules… 😉

Yet another supercomputer

SciCortex is to launch their cluster-in-a-(lunch)-box with promo video and everything. Seems pretty nice but some things worries me a bit …

Of course a highly interconnected backpane and some smart shortest-path routing algorithms (probably not as good as Feynman’s) is much faster (and reliable?) than gigabit ethernet (myrinet also?). Of course, all-in-one chip technology is much faster and safer and more economic than any HP or IBM 1U node money can buy.

There are also some eye-candy like a pretty nice external case, dynamic resource partitioning (like VMS), native parallel filesystem, MPI optimized interconnection and so on… but do you remember Cray-1? It had wonderful vector machines but in the end it was so complex and monolithic that everyone got stuck with it and never used it anymore.

Assembling a 1024-node Linux cluster with PC nodes, Gigabit, PVFS, MPI etc is hard? Of course it is, but the day Intel stops selling PCs you can use AMD (and vice-versa) and you won’t have to stop using the old machines until you have a whole bunch of new ones up and running transparently integrated with your old cluster. If you do it right you can have a single cluster beowulf cluster running alphas, Intel, AMD, Suns etc, just bother with the paths and the rest is done.

I’m not saying it’s easier, nor cheaper (costs with air conditioning, cabling and power can be huge) but being locked to a vendor is not my favourite state of mind… Maybe if they had smaller machines (say 128 nodes) that could be assembled in a cluster and still allow external hardware to be connected having intelligent algorithms to understand the cost of migrating process to external nodes (based on network bandwidth and latency) would be better. Maybe it could even make their entry easier to existent clusters…

PI Monte Carlo – Distributed version

On April I published an entry explaining how to calculate PI using the most basic Monte Carlo algorithm and now, using the Middle Earth cluster I can do it parallelized.

Parallelizing Monte Carlo is a very simple task because of it’s random and independent nature and this basic monte carlo is even easier. I can just run exactly the same routine as before on all nodes and at the end sum everything and divide by the number of nodes. To achieve that, I just changed the main.cc file to use MPI, quite simple indeed.

The old main.cc just called the function and returned the value:

    area_disk(pi, max_iter, delta);
    cout << "PI = " << pi << endl;

But now, the new version should know whether it’s the main node or a computing node. After, all computing nodes should calculate the area and the main node should gather and sum.

    /* Nodes, compute and return */
    if (myrank) {
        area_disk(tmp, max_iter, delta);
        MPI_Send( &tmp, 1, MPI_DOUBLE, 0, 17, MPI_COMM_WORLD );

    /* Main, receive all areas and sum */
    } else {
        for (int i=1; i < size; i++) {
            MPI_Recv( &tmp, 1, MPI_DOUBLE, i, 17, MPI_COMM_WORLD, &status );
            pi += tmp;
        }
        pi /= (size-1);
        cout << "PI = " << pi << endl;
    }

On MPI, myrank says your node number and size shows you the total number of nodes. On the most basic MPI program, if it’s zero you’re the main node, otherwise you’re a computing node.

All computing nodes calculate the area and MPI_Send the result to the main node. The main node waits for all responses on the main loop and sum the temporary result tmp to pi and at the end divide by the number of computing nodes.

Benefits:

This monte carlo is extremely basic and very easy to parallelize. As this copy is run over N computing nodes and there’s no dependency between them you should achieve an increase in speed of over N times the non-parallel one.

Unfortunately, this algorithm is so slow and inaccurate that even running on 9 computing nodes (ie. 9 times faster) it’s still wrong on the third digit.

The slowness is due to the algorithm’s stupidity but the inaccuracy is due to the lack of a really good standard random number generators. Almost all machines yielded results far from the 5-digit answer on M_PI macro of C standard library and the result was also far from it. Also, there are so many other ways of calculating PI that are so much faster that it wouldn’t be a good approach ever!

The good thing is that it was just to show a distributes monte carlo algorithm working… 😉