I’ve been working with workflow pipelines, directly and indirectly, for quite a while now and one thing that is clearly visible is that, the way most people do it, it doesn’t scale.

Workflows (aka. Pipelines)

If you have a sequence of, say, 10 processes in a row (or graph) of dependencies and you need to run your data through each and every single one in the correct order to get the expected result, the first thing that comes to mind is a workflow. Workflows focus on data flow rather than on decisions, so there is little you can do with the data in case it doesn’t fit in your model. The result of such inconsistency generally falls in two categories: discard until further notice or re-run the current procedure until the data is correct.

Workflows are normally created ad-hoc, from a few steps. But things always grow out of proportions. If you’re lucky, these few steps would be wrapped into scripts as things grow, and you end up with a workflow of workflows, instead of a huge list of steps to take.

The benefit of a workflow approach is that you can run it at your own pace and assure that the data is still intact after each step. But the downfall is that it’s too easy to incorrectly blame the last step for some problem on your data. The status of the data could have been deteriorating over the time and the current state was the one that picked that up. Also, checking your data after each step is a very boring and nasty job and no one can guarantees that you’ll pick up every single bug anyway.

It becomes worse when the data volume increases to a level you can’t just look it up anymore. You’ll have to write syntax checkers for the intermediate results, you’ll end up with thousands of logs and scripts just to keep the workflow running. This is the third stage: meta-workflow for the workflow of workflows.

But this is not all, the worse problem is still to come… Your data is increasing by the minute, you’ll have to split it up one day and rather sooner than later. But, as long as you have to manually feed your workflow (and hope the intermediate checks work fine), you’ll have to split the data manually. If your case is simple and you can just run multiple copies of each step in parallel with each chunk of data, you’re in heaven (or rather, hell for the future). But that’s not always the case…

Now, you’ve reached a point where you have to maintain a meta-workflow for your workflow or workflows and manually manage the parallelism and collisions and checks of your data only to find out at the end that a particular piece of code was ignoring an horrendous bug in your data when it was already public.

If you want to add a new feature or change the order of two processes… well… 100mg of prozac and good luck!

Refine your Workflow

Step 1: Get rid of manual steps. The first rule is that there can be no manual step, ever. If the final result is wrong, you turn on the debugger, read the logs, find the problem, fix it and turn it off again. If you can’t afford to have wrong data live than write better checkers or rather reduce the complexity of your data.

One way to reduce the complexity is to split the workflow into smaller independent workflows, which each one generate only a fraction of your final data. If you have a mission critical environment, you better off with 1/10th of it broken than with the whole. Nevertheless, try to reduce the complexity on your pipeline, data structure and dependencies. When you have no change to do, re-think about the whole workflow, I’m sure you’ll find lots of problems every iteration.

Step 2: Unitize each step. The important rule here is: each step must process one, and only one, piece of data. How does it help? Scalability.

Consider a multiple data workflow (fig. 1 below), where you have to send the whole thing through, every time. If one of the process is much slower than the others, you’ll have to split your data for that particular step and join it again for the others. Splitting your data once at the beginning and running multiple pipelines at the same time is a nightmare as you’ll have to deal with the scrambled error messages yourself, especially if you still have manual checks around.

Multiple data Workflow
Figure 1: Split/Join is required each parallel step

On the other hand, if only one unit passes through each step (fig. 2), there is no need to split or join them and you can run as many parallel processes as you want.

Single data Workflow
Figure 2: No need to split/join

Step 3: Use simple and stupid automatic checks. If possible, don’t code anything at all.

If data must be identical on two sides, run a checksum (CRC sucks, MD5 is good). If the syntax needs to be correct, run a syntax checker, preferably a schema/ontology based automatic check. If your file is too complex or specially crafted so you need a special syntax check, re-write your file to use standards (XML and RDF are good).

Another important point on automatic checking is that you don’t have to check your emails waiting for an error message. When the subject contains the error message is already a pain, but when you have to grep for errors inside the body of the message? Oh god! I’ve lost a few lives already because of that…

Only mail when a problem occurs, only send the message related to the specific problem. It’s ok to send a weekly or monthly report just in case the automatic checks miss something. Go on and check the data for yourself once in a while and don’t worry, if things really screw up your users will let you know!


But, what’s the benefit of allowing your pipeline to automatically compute individual values and check for consistency if you still have to push the buttons? What you want now is a way of having more time to look at the pipeline flowing and fix architectural problems (and longer tea time breaks), rather than putting down the fire all the time. To calculate how many buttons you’ll press just multiply the number of data blocks you have by the number of steps… It’s a loooong way…

If still, you like pressing buttons, that’s ok. Just skip step 2 above and all will be fine. Otherwise, keep reading…

To automate your workflow you have two choices: either you fire one complete workflow for each data block or do it like a workflow of data through different services.

Complete Workflow: State Machines

If your data is small, seldom or you just like the idea, you can use a State machine to build a complete workflow for each block of data. The concept is rather simple, you receive the data and fire it through the first state. The machine will carry on, sending the changed data through all necessary states and, at the end, you’ll have your final data in-place, checked and correct.

UML is pretty good on defining state machines. For instance, you can use a state diagram to describe how your workflow is organised, class diagrams to show how each process is constructed and sequence diagrams to describe how processes talk to each other (preferably using a single technology). With UML, you can generate code and vice-versa, making it very practical for live changes and documentation purposes.

The State Design Pattern allows you to have a very simple model (each state is of the same type) with only one point of decision where to go next (when changing states): the state itself. This gives you the power to change the connections between the states easily and with very (very) little work. It’ll also save you a lot on prozac.

If you got this far you’re really interested on workflows or state machines, so I assume you also have a workflow of your own. If you do, and it’s a mess, I also believe that you absolutely don’t want to re-code all your programs just to use UML diagrams, queues and state machines. But you don’t need to.

Most programming languages allow a shell to be created and an arbitrary command to be executed. You can then manage the inter-process administration (creating/copying files, fifos, flags, etc), execute the process and, at the end, check the data and choose the next step (based on the current state of your data).

This methodology is simple, powerful and straight-forward, but it comes with a price. When you got too many data blocks flowing through, you end up with lots of copies of the same process being created and destroyed all the time. You can, however let the machine running and only provide data blocks, but still this doesn’t scale as we wanted on step 2 above.

Layered Workflow: Distributed Services

Now comes the holy (but very complex) grail of workflows. If your data is huge, constant flowing, CPU demanding and with awkward steps in between, you need to program thinking on parallelism. The idea is not complex, but the implementation can be monstrous.

On figure 2 above, you have three processes, A, B and C, running in sequence, and process B had two copies running because it took twice as long as A and C. It’s that simple: the more it takes to finish, more copies you run in parallel to make the flow constant. It’s like sewage pipes, rain water can flow in small pipes, but house waste will need much bigger ones. but later on, when you filter the rubbish, you can use small pipes back again.

So, what’s so hard on implementing this scenario? Well, first you have to take into account that those processes will be competing for resources. If they’re on the same machine, CPU will be a problem. If you have a dual core, the four processes above will share CPU, not to mention memory, cache, bus etc. If you use a cluster, they’ll all compete for network bandwidth and space on shared filesystems.

So, the general guidelines for designing robust distributed automatic workflows are:

  • Use layered state architecture. Design your machine in layers, separate the layers into machines or groups of machines and put a queue or a load-balancer in between each layer (state). This will allow you to scale much easier as you can add more hardware to a specific layer without impacting on others. It also allows you to switch off defective machines or do any maintenance on them with zero down-time.
  • One process per core. Don’t spawn more than one process per CPU as this will impact in performance in more ways than you can probably imagine. It’s just not worth it. Reduce the number of processes or steps or just buy more machines.
  • Use generic interfaces. Use the same queue / load-balancer for all state changes and, if possible, make their interfaces (protocols) identical so the previous state doesn’t need to know what’s on the next and you can change from one to another with zero cost. Also, make the states implement the same interface in the case you don’t need queues or load-balancers for a particular state.
  • Include monitors and health checks in your design. With such complex architecture it’s quite easy to ignore machines or processes failing. Separate reports into INFO, WARNING and ERROR and give them priorities or different colours on a web interface and mail or SMS only the errors to you.

As you can see, by providing a layered load-balancing, you’re getting performance and high-availability for free!

Every time data piles up in one layer, just increase the number of processes on it. If a machine breaks, it’ll be off the rotation (automatic if using queues, ping-driven or MAC-driven for load-balancers). Updating the operating system, your software or anything on the machine is just a matter of taking it out, updating, testing and deploying back again. Your service will never be off-line.

Of course, to get this benefit for real you have to remove all single-points-of-failure, which is rarely possible. But to a given degree, you can get high-performance, high-availability, load-balancing and scalability at a reasonable low cost.

The initial cost is big, though. Designing such complex network and its sub-systems, providing all safe-checks and organizing a big cluster is not an easy (nor cheap) task and definitely not done by inexperienced software engineers. But once it’s done, it’s done.

More Info

Wikipedia is a wonderful source of information, especially for the computer science field. Search for workflows, inter-process communication, queues, load-balancers, commodity clusters, process-driven applications, message passing interfaces (MPI, PVM) and some functional programming like Erlang and Scala.

It’s also a good idea to look for UML tools that generates code and vice-versa, RDF ontologies and SPARQL queries. XML and XSD is also very good for validating data formats. If you haven’t yet, take a good look on design patterns, especially the State Pattern.

Bioinformatics and internet companies have a particular affection to workflows (hence my experience) so you may find numerous examples on both fields.