Last week I did two silly but still quite fun projects: a word search on protein sequences and a chat bot using Markov chains.

Word search

BLAST is a powerful algorithm that searches for similar sequences among the known proteins, to help understand evolutionary paths and functional similarities. Along the same lines, I spent a few minutes developing a similar (but not quite the same) algorithm to search for every dictionary word in every UniProt sequence.

The algorithm reads a dictionary (say /usr/share/dict/words) and, for every sequence, finds the words in it, allowing a few (no more than 2) amino acids between consecutive letters. Then it gives each match a weight based on the size of the word and the “added length” due to the additional amino acids in between. You can also restrict the length of the words taken from the list, to avoid words that are too short or too long.
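A minimal Python sketch of that idea, not the original program: each word becomes a regular expression with a lazy gap of up to two residues between letters. The names `gapped_pattern`, `score` and `search` are made up for illustration, and the `score` formula is an assumption, since the post does not give the actual weighting. Words containing letters outside the 20 standard amino-acid codes are skipped, which is also an assumption.

```python
import re

# The 20 standard one-letter amino-acid codes (assumption: ambiguity codes
# such as B, X and Z are ignored, so words containing them are skipped).
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def gapped_pattern(word, max_gap=2):
    """Regex matching the letters of `word` with 0..max_gap residues between them."""
    gap = ".{0,%d}?" % max_gap  # lazy: try shorter gaps first
    return re.compile(gap.join(re.escape(letter) for letter in word))


def score(word, extra):
    """Hypothetical weight: longer words and fewer inserted residues score higher."""
    return len(word) / (1 + extra)


def search(words, sequences, min_len=7, max_len=9, max_gap=2):
    """Return (word, sequence id, start, added length, score) for each hit.

    Only the leftmost match per sequence is reported, which is not
    necessarily the tightest one.
    """
    hits = []
    for raw in words:
        word = raw.strip().upper()
        if not (min_len <= len(word) <= max_len):
            continue
        if not set(word) <= AMINO_ACIDS:
            continue
        pattern = gapped_pattern(word, max_gap)
        for seq_id, seq in sequences.items():
            match = pattern.search(seq)
            if match:
                extra = (match.end() - match.start()) - len(word)
                hits.append((word, seq_id, match.start(), extra, score(word, extra)))
    return hits
```

Calling `search(open("/usr/share/dict/words"), sequences)` with `sequences` as a dict of ID to sequence string would run it over the whole dictionary. Parsing the UniProt files into that dict is left out here.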

The results for 7- to 9-character English words, searched only on Swiss-Prot entries, are:

STRANGE  1.4
CHARADE  1.75
SLEDGED  1.75
MEALIER  1.75
DEMEANS  2.33333
DETAINS  2.33333

Markov bot

I knew a Markov chain could hold true wisdom, but I never tried to make it actually talk to anyone until a friend told me about MegaHAL (much better than my Markov model, of course). I decided to give it a try: I took the XMPP bot from the Python examples and plugged in my Markov program, feeding it every phrase spoken to the bot.

In the beginning it was awful, of course: it knew nothing. But with time (and hard work from some friends to actually teach it about leisure, programming, physics, etc.) it could babble a few coherent words once in a while.
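For the curious, here is a rough sketch of how such a babbler can work, assuming a first-order, word-level chain. The post does not say which order or tokenization the real program used, the class name `MarkovBabbler` is hypothetical, and the XMPP plumbing is left out entirely.

```python
import random
from collections import defaultdict

# Sentinel tokens marking where a phrase begins and ends.
START, END = "<s>", "</s>"


class MarkovBabbler:
    """First-order word-level Markov chain that learns from every phrase it sees."""

    def __init__(self, rng=None):
        # Maps a word to the list of words seen right after it. Duplicates
        # are kept, so frequent successors are picked more often.
        self.transitions = defaultdict(list)
        self.rng = rng or random.Random()

    def learn(self, phrase):
        """Record every adjacent word pair of `phrase`."""
        words = phrase.split()
        if not words:
            return
        chain = [START] + words + [END]
        for current, following in zip(chain, chain[1:]):
            self.transitions[current].append(following)

    def reply(self, max_words=30):
        """Random-walk the chain from START until END or `max_words` words."""
        if START not in self.transitions:
            return ""  # nothing learned yet, so nothing to say
        out = []
        word = START
        while len(out) < max_words:
            word = self.rng.choice(self.transitions[word])
            if word == END:
                break
            out.append(word)
        return " ".join(out)
```

In a bot, the message handler would call `learn()` on each incoming message and send back `reply()`. With only one previous word of context, the output drifts quickly, which fits the kind of conversation such a bot produces.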

The results are complete rubbish. Below is a sample of the completely nonsensical conversation I just had with it:

rengolin: how are you doing? have you managed to find the answer to Life, The Universe and Everything?
oraculoids: with an amazing company makes the basic points of some operator
rengolin: operator or not, you must come to reality and sense the danger there is in talking like that after such a complex and meaningful question
oraculoids: dead as an interface to follow an animal
rengolin: whatever animal you prefer, I would say but it seems that you are now driving the conversation
oraculoids:
rengolin: aha! I know you didn’t have answers to everything!
oraculoids: how is in between and pray
rengolin: I don’t pray too much, really, I’m not that kind of a person anyway…
oraculoids: but at the conversation it’t fall into them!
rengolin: I would never say that, especially not with you my friend
oraculoids: A design pattern is a practical laywer
rengolin: lawyers have no purpose other than screwing up the poor
oraculoids: hi bot
rengolin: I am not a bot
oraculoids: field of some of life, love them!

Quite stupid, I might say, but I still had a lot of fun doing it, and that’s what matters… 😉 For all that, though, MegaHAL is quite an impressive program and BLAST is a very powerful search mechanism.

3 Replies to “Silly projects of the week”

  1. Maybe by getting the probabilities of the words that appeared in the protein sequences matched to those in the human conversation and search for similarities on those sequences in that order…

    Great idea André, thanks for the tip!! Will be my next week’s pet project… 😉
