Silly projects of the week

Last week I did two silly but still quite fun projects: a word search on protein sequences and a chat bot using Markov chains.

Word search

BLAST is a powerful algorithm that searches for similar sequences among the known proteins, helping us understand evolutionary paths and functional similarities. Along the same lines, I spent a few minutes developing a similar (but not quite the same) algorithm to search for all dictionary words in all UniProt sequences.

The algorithm looks into a dictionary (say /usr/share/dict/words) and, for every sequence, finds the words in it, allowing a few (no more than 2) amino acids between consecutive letters. Then it gives the match a weight based on the size of the word and the “added length” due to the extra amino acids in between. You can also restrict the word length to avoid words that are too short or too long.
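In rough terms, it works something like the Python sketch below. This is just a minimal reconstruction, not the actual code: the MAX_GAP limit of 2, the sequences.txt input file and the weight formula are my own assumptions based on the description above.

MAX_GAP = 2  # my assumption: at most 2 extra amino acids between letters

def find_word(seq, word):
    """Return the smallest 'added length' for word inside seq, or None."""
    best = None
    for start in range(len(seq) - len(word) + 1):
        if seq[start] != word[0]:
            continue
        pos, added = start, 0
        ok = True
        for letter in word[1:]:
            # look for the next letter within MAX_GAP positions
            nxt = seq.find(letter, pos + 1, pos + 2 + MAX_GAP)
            if nxt == -1:
                ok = False
                break
            added += nxt - pos - 1
            pos = nxt
        if ok and (best is None or added < best):
            best = added
    return best

def weight(word, added):
    # My guess at the weighting: the more extra residues squeezed in,
    # the higher (worse) the weight gets.
    return float(len(word)) / max(len(word) - added, 1)

if __name__ == "__main__":
    words = [w.strip().upper() for w in open("/usr/share/dict/words")
             if w.strip().isalpha() and 7 <= len(w.strip()) <= 9]
    # sequences.txt stands in for the UniProt/Swiss-Prot sequences, one per line.
    for seq in (line.strip().upper() for line in open("sequences.txt")):
        for word in words:
            added = find_word(seq, word)
            if added is not None:
                print(word, weight(word, added))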

The results for 7- to 9-character English words, searched only on Swiss-Prot entries, are:

STRANGE  1.4
CHARADE  1.75
SLEDGED  1.75
MEALIER  1.75
DEMEANS  2.33333
DETAINS  2.33333

Markov bot

I knew a Markov chain could hold true wisdom, but I never tried to make it actually talk to anyone until a friend told me about MegaHAL (much better than my Markov model, of course). I decided to give it a try, took the XMPP bot from the Python examples and plugged in my Markov program, feeding it every phrase spoken to it.
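The Markov part itself is nothing fancy; it boils down to a table of word transitions, roughly like the toy Python sketch below (my own reconstruction, not the bot's actual code, with all the XMPP plumbing left out):

import random
from collections import defaultdict

class Markov:
    """Toy word-level Markov chain: learn phrases, then babble."""

    def __init__(self):
        self.table = defaultdict(list)   # previous word -> possible next words

    def learn(self, phrase):
        words = phrase.split()
        # None marks the start and the end of a phrase.
        for prev, cur in zip([None] + words, words + [None]):
            self.table[prev].append(cur)

    def babble(self, max_words=20):
        word, out = None, []
        for _ in range(max_words):
            choices = self.table.get(word)
            if not choices:
                break
            word = random.choice(choices)
            if word is None:
                break
            out.append(word)
        return " ".join(out)

bot = Markov()
bot.learn("a design pattern is a practical solution")
bot.learn("a design pattern is an interface to follow")
print(bot.babble())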

In the beginning it was awful, of course: it knew nothing. But with time (and hard work from some friends to actually teach it about leisure, programming, physics, etc.) it could babble a few coherent words once in a while.

The results are complete rubbish; below is a sample of the completely nonsensical conversation I just had with it:

rengolin: how are you doing? have you managed to find the answer to Life, The Universe and Everything?
oraculoids: with an amazing company makes the basic points of some operator
rengolin: operator or not, you must come to reality and sense the danger there is in talking like that after such a complex and meaningful question
oraculoids: dead as an interface to follow an animal
rengolin: whatever animal you prefer, I would say but it seems that you are now driving the conversation
oraculoids:
rengolin: aha! I know you didn’t have answers to everything!
oraculoids: how is in between and pray
rengolin: I don’t pray too much, really, I’m not that kind of a person anyway…
oraculoids: but at the conversation it’t fall into them!
rengolin: I would never say that, especially not with you my friend
oraculoids: A design pattern is a practical laywer
rengolin: lawyers have no purpose other than screwing up the poor
oraculoids: hi bot
rengolin: I am not a bot
oraculoids: field of some of life, love them!

Quite stupid, I must say, but I still had a lot of fun doing it, and that's what matters… 😉 That said, MegaHAL is quite an impressive program and BLAST is a very powerful search mechanism.

Why not the primary key?

Recently I came across an amusing situation with Oracle (again), where the primary key was not used even when explicitly requested…

The query was:

select name from table where table_id = 1;

Of course, table_id was the primary key. Astonishingly, Oracle performed a full table scan.

WTF?

When things like that happen you think there’s something quite wrong with the database, so I decided to ask the DBAs what was going on. The answer was something like:

“It may happen if Oracle decides to. Even though you have created an index, there is no guarantee it will be used for every query; the optimizer decides which path is best.”

Seriously, if Oracle decides NOT to use the primary key for that query, there is something really wrong with the whole thing. I couldn't think of a situation where that decision might be even close to valid! A friend who knows Oracle much better than I do pictured two extreme cases in which it could happen:

  1. There are very few records in the table, so the table data fits in a single data block. Reading the index root block (one data block) and then the one table data block is certainly more expensive than just reading that one table data block.
  2. The index is in a tablespace with a different block size, which resides on very slow disks, and the buffer cache for the non-default block size is hugely under-sized. So the cost of reading the index plus the table data might be higher than just reading the table data. It's a bit unrealistic, but I've witnessed stupid things like this.

Let's face it, the first scenario is just too extreme to be true. If your whole table fits in a single data block, you'd better use email rather than a database. As for the second scenario, why would anyone put indexes on a much slower disk? Besides, if the index is big, the data will be proportionally bigger, so there is no gain in doing a full table scan anyway.

Later on he found out what the problem was by hacking into the configuration parameters:

  • The production database (working fine) had:
    optimizer_index_caching = 95
    optimizer_index_cost_adj = 10
  • The development database (Oracle default values) had:
    optimizer_index_caching = 0
    optimizer_index_cost_adj = 100

I don't quite understand what they really mean to Oracle, but optimizer_index_caching = 0 seems a bit too radical to me as a default.

In the end (and after more iterations than I'd like) it was fixed, but what really pissed me off was getting the pre-formatted answer that “Oracle knows better which path to take” without anyone even looking at how stupid that decision was in the first place. This blind confidence in Oracle drives me nuts…

gzip madness

Another normal day here at EBI: I changed a variable called GZIP from local to global (via export in Bash) and got a very nice surprise: all my gzipped files had gzip itself as a header!!!

Let me explain… I have a makefile that, among other things, gzips some files. So I created a variable called GZIP set to "gzip --best --stdout", and in my rules I do:

%.bar : %.foo
        $(GZIP) < $< > $@

So far so good; it had always worked. But I had a few makefiles redefining the same command, so I thought: why not put all the shared variables in an external include file? I could use make's include directive, but I also needed some of those variables in shell scripts as well, so I decided to "export VARIABLE" all the make variables (otherwise the scripts can't see them) and called it a day. That's when everything started failing…

gzip environment

After a while digging into the problem (I was blaming poor LSF for it) I found that when I didn't have the GZIP variable defined all went well, but the moment I defined GZIP="/bin/gzip --best --stdout", even a plain call to gzip produced corrupted output (i.e. it had the gzip binary as a header).

A quick look at gzip's manual gave me the answer… GZIP is the environment variable in which gzip stores its default options. So, if you set GZIP="--best --stdout", every time you call gzip it'll use those parameters by default.

So, by putting "/bin/gzip" in that option list, I was effectively always running the following command:

$ /bin/gzip /bin/gzip --best --stdout < a.foo > a.bar

and putting a compressed copy of the gzip binary, together with a.foo, into a.bar.
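Just to make the mechanics concrete, here's a toy Python illustration (not anything from the makefile itself) of how gzip ends up with that argument list: it splits the GZIP variable like shell words and prepends the pieces to the real arguments.

import os
import shlex

# Toy illustration: gzip splits the GZIP environment variable like shell
# words and prepends the pieces to the real command-line arguments.
os.environ["GZIP"] = "/bin/gzip --best --stdout"   # my exported make variable

real_args = []                                      # a "plain" gzip call
effective = ["gzip"] + shlex.split(os.environ["GZIP"]) + real_args
print(" ".join(effective))
# -> gzip /bin/gzip --best --stdout
# "/bin/gzip" is now a file operand, so gzip compresses its own binary too.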

What a mess a simple environment variable can make…