What is bioinformatics anyway?

After almost three years in the field I’m pretty sure I have no idea. A few months ago I thought I knew and wrote an essay about software quality in bioinformatics, but I have now figured out that, even though those things might make sense to the rest of us, for bioinformatics they don’t.

Wikipedia (which has a much higher quality than many papers I’ve read) defines software as: “a general term used to describe a collection of computer programs, procedures and documentation that perform some tasks on a computer system”. It also defines programming as: “the process of writing, testing, debugging/troubleshooting, and maintaining the source code of computer programs”.

So, everyone who writes programs (let’s forget about documentation, tests, maintenance etc. for now) is a programmer. But a computer programmer IS NOT a software engineer. Programmers can write as much code as they want, but without formal definitions, metrics, good design decisions and practices, tests, documentation and so on, they are as useless as ants without pheromones.

Quick tip: Whenever you see a job for a software engineer in a bioinformatics institute, beware: It generally means a developer to maintain random code and make random changes in random environments.

So what?

I might not have a clue about what bioinformatics is, but now I’m pretty sure what it ISN’T: Software Engineering. You will find a huge amount of code, scripts, programs and databases, but you will rarely find a fair piece of software. Therefore, my previous ideas could be valid for software quality, but not at all for bioinformatics.

Don’t get me wrong, I know some bioinformaticians (and programmers around) that understand the basic ideas about software and quality and why we should have them, but the whole structure, the scientific community, the people that give them money, have no idea whatsoever of what software really is or where it fits in the loop.

Still, bioinformaticians are getting half-programming and half-biology degrees, in two fields that each have more to know than the whole of humanity can hold in their brains added up. How is it possible (and fair) to put those poor guys to work in such sub-human conditions, without any guidance or quality control, without any clue, in fact, about what they really should be doing in the first place?

Some of them come out pretty well, so well that they abandon the field and go to work at better companies, with much better software strategies, proper engineering, scientific development in the right place (sandboxes) and production code done by real engineers with solid experience in mission-critical environments.

In the end, it leaves bioinformatics (to be fair, the informatics part only) in the hands of inexperienced people in all sorts of fields and levels, students writing production software, people that never saw a mission-critical environment coordinating databases, filesystems and development, with one bad decision after the other.

Is it just a rant, then?

No, not really. It’s a liberation. For a while I struggled to understand the motives behind those weird decisions. I knew that every industry has its own set of values, and that people sometimes make completely awkward decisions which turn out to be the right ones. I’ve seen it happen when moving between jobs, especially when I worked at Yahoo! (big company, big culture). But with time, the awkward decisions still sounded awkward, even after considering all the new information I had.

Other people got fed up with all this and left, one after the other. I talked to them, and the answer was always the same: random (generally bad) decisions, ego in astronomical proportions and zero technical knowledge on all sides. Now I’m leaving for good and you won’t need to ask me why, will you?

I generally need a very good reason to leave a workplace. I was feeling out of place but couldn’t leave without a very good reason, and now I’ve got a good bunch of them…

A liberation indeed!

Is there a way out?

Seriously, no. In 10 years definitely no. In 15 quite likely no. In 20, maybe… but things must start changing now!

Being optimistic, assuming they stop running around like headless chickens, they would still need strong guidance, which is virtually impossible because of the strong ego of scientists in general. Bioinformatics has existed for decades already; who is the software engineer that will tell them they’re doing it all wrong?

Besides, the people that grant them money (governments) have no clue about software engineering (nor should they) and they will keep sending money every year, as long as, in the reports, they pretend to be doing great things. In fact, most of it could’ve been done in a few weeks by two or three people prepared to compromise.

Who doesn’t want a job where they can do almost nothing at all, get paid every month without even the remote fear of losing their job and still pretend they’re doing great things? Whoever says no to this and starts working for real gets a really bad reputation… While this win-win situation keeps going, there is little or zero chance of doing real stuff in the field, and bioinformatics is doomed to constant failure and ineffectiveness.

Finally, it’s not a specific problem, where you can just change a couple of people and everything will be all right, as many believe. This is nobody’s fault, it’s just the way the two fields, biology and informatics, were joined together some decades ago and never straightened out. If there is a way out, I’d be very glad to see it and will congratulate those who manage to do it, but this is much more politics than software development and I am, very luckily, just a programmer…

Markov chain available for NumCalc

NumCalc is my personal numerical methods program where I’ve implemented some nice algorithms for numerical computation. The newest addition to the list is the Markov chain.

The Wikipedia article (link above) is far too complex… I’ll try to give a simplified explanation:

A travelling salesman goes back and forth between a set of cities and, given the city he is currently in, you want to know which city he’ll travel to next. Of course, he won’t show you his travel itinerary.

The simplest way of doing it is to record all the trips he makes over time. For each city, you keep a counter of how many times he went from that city to each of the others. If you think of these numbers as a fraction of all the trips made from each city, you have the probability of him going to any other city in the list.

Example: When he was in Paris, he went 3 times to London, 2 times to Amsterdam and only 1 time to Milan. It means that 3 out of 6 times (50%) he went to London, so the probability of him going there again is 50%.
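To make the counting idea concrete, here is a minimal Python sketch of the Paris example. It is not the NumCalc code, and the function names (`transition_probabilities`, `most_likely_next`) are made up for this illustration. It just counts trips per origin city and divides by the total.

```python
from collections import Counter, defaultdict


def transition_probabilities(trips):
    """Estimate P(next city | current city) from (origin, destination) pairs."""
    # One counter of destinations per origin city.
    counts = defaultdict(Counter)
    for origin, destination in trips:
        counts[origin][destination] += 1

    # Turn each row of counts into fractions of the trips leaving that city.
    probabilities = {}
    for origin, destinations in counts.items():
        total = sum(destinations.values())
        probabilities[origin] = {
            city: n / total for city, n in destinations.items()
        }
    return probabilities


def most_likely_next(probabilities, city):
    """Return the destination with the highest estimated probability."""
    row = probabilities[city]
    return max(row, key=row.get)


# The trips recorded in the example: 3 to London, 2 to Amsterdam, 1 to Milan.
trips = (
    [("Paris", "London")] * 3
    + [("Paris", "Amsterdam")] * 2
    + [("Paris", "Milan")]
)

probs = transition_probabilities(trips)
print(probs["Paris"])
print(most_likely_next(probs, "Paris"))
```

With these six trips, London gets 3/6, Amsterdam 2/6 and Milan 1/6, so London is the best guess for the next destination from Paris.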

For such small quantities it’s weird to assume that the behaviour will always be the same (he can go to new cities as well), but when the amount of statistics you have is big, the behaviour becomes very repetitive and thus predictable.
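A quick way to see this effect is to simulate it. The sketch below assumes the salesman has fixed "true" habits when leaving Paris (the numbers in `TRUE_PROBS` are invented for the illustration, reusing the fractions from the example) and shows how the estimate looks with few and with many recorded trips. The helper name `estimate_from_paris` is also hypothetical.

```python
import random
from collections import Counter

# Assumed "true" habits when leaving Paris; invented for this illustration.
TRUE_PROBS = {"London": 1 / 2, "Amsterdam": 1 / 3, "Milan": 1 / 6}


def estimate_from_paris(n_trips, rng):
    """Simulate n_trips departures from Paris and return the observed fractions."""
    cities = list(TRUE_PROBS)
    weights = list(TRUE_PROBS.values())
    # Draw destinations at random according to the assumed habits.
    observed = rng.choices(cities, weights=weights, k=n_trips)
    counts = Counter(observed)
    return {city: counts[city] / n_trips for city in cities}


rng = random.Random(42)  # fixed seed so the run is repeatable
for n in (6, 60, 6000):
    print(n, estimate_from_paris(n, rng))
```

With only 6 trips the fractions can be far from the real habits, while with thousands of trips they should settle close to 1/2, 1/3 and 1/6.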

Real Cases:

  • MegaHAL uses an advanced Markov model to create chat bots that reply with what people said before, based primarily on the probability of one word coming after another.
  • HMMER uses hidden Markov models (a Markov model to predict another Markov model to predict something else) to do powerful searches within long and scrambled sequences of proteins and genes. The InterPro group uses it to find their protein matches against UniProt.

Of course my super-simplified model is far from being that efficient and useful, but it’s a good start to understand how simple and how powerful they are. You can download it from its webpage.