I use LSF at work, a very good job scheduler. To parallelize my jobs I use Makefiles (with the -j option), and inside every rule I run the command through the job scheduler. Some commands call other Makefiles, cascading the spawning of jobs even further. Sometimes I reach 200+ jobs in parallel.

Our shared BlueArc disk is also very good, with access times often faster than my local disk. And yet, for almost two years I saw some odd behaviour when putting all of this together.

I reported random failures in processes that had worked until then and, without any modification, worked ever after. Only recently did I figure out what the problem was: NFS refresh speed vs. LSF spawn speed when using Makefiles.

When your Makefile looks like this:

bar.gz:
    $(my_program) foo > bar
    gzip bar

There isn't any problem: as soon as bar is created, gzip can run and create the gz file. Plain Makefile behaviour, nothing to worry about. But then, when I changed it to:

bar.gz:
    $(lsf_submit) $(my_program) foo > bar
    $(lsf_submit) gzip bar

Things started to go crazy. Once every few months, one of my hundreds of Makefiles would just finish saying:

bar: No such file or directory
make: *** [bar.gz] Error 1

And what’s even weirder, the file WAS there!

During the period when these magical problems were happening (luckily I was streamlining the Makefiles every day, so I could just restart the whole thing and it would run as planned), I had another problem that is quite common when using NFS: the stale NFS handle.

I keep my CVS tree on the NFS filesystem, and when testing some Perl scripts between AMD Linux and Alpha OSF machines I used to get these errors (the NFS cache was being updated) and had to wait a bit, or in most cases just try again.

It was then that I figured out what the big random problem was: the stale NFS handle! Because the Makefile was running on different computers, the NFS cache took a few milliseconds to update, and the LSF spawner, berserk for performance, started the new job well before NFS could reorganize itself. That is why the file was there after all: it was on its way, and the Makefile crashed before it arrived.
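As an aside (my own note, not something from the original setup): the lag described above comes from the NFS client's attribute cache, and on many clients the cache window is tunable at mount time. A hedged sketch of an /etc/fstab entry, where the server name and mount point are made up for illustration:

```shell
# /etc/fstab — shrink the NFS attribute-cache window to 1 second
# (server "fileserver" and mount point "/shared" are examples):
fileserver:/export   /shared   nfs   rw,actimeo=1   0 0

# or disable attribute caching entirely (consistent, but slower):
fileserver:/export   /shared   nfs   rw,noac        0 0
```

Whether this is acceptable depends on how hard the share is hammered; disabling the cache trades the race for a lot of extra attribute traffic.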

The solution? Quite stupid:

bar.gz:
    $(lsf_submit) "$(my_program) foo > bar" && sleep 1
    $(lsf_submit) gzip bar

I put it in all rules that have more than one command spawned by LSF and never saw the problem again.

The smart reader will probably tell me that it's not just ugly, it doesn't cover all the cases either, and you're right, it doesn't. A stale NFS handle can take more than one second to clear, single-command rules can break on the next hop, and so on. But because there is some processing between the commands (rule calculations are quite costly; run make -d and you'll see what I'm talking about), the probability is too low for today's computers… maybe in ten years I'll have to put sleep 1 in every rule… 😉
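A slightly less ugly variant of the same trick (my own sketch, not from the post; `wait_for_file` is a made-up helper name) is to poll for the file with a timeout instead of sleeping a fixed second:

```shell
#!/bin/sh
# wait_for_file FILE [TRIES]: poll until FILE is visible on this
# host, giving the NFS cache time to catch up. Hypothetical helper,
# not part of LSF or make.
wait_for_file() {
    file=$1
    tries=${2:-10}          # default: up to 10 one-second attempts
    while [ "$tries" -gt 0 ]; do
        [ -e "$file" ] && return 0
        sleep 1             # back off and let the NFS cache refresh
        tries=$((tries - 1))
    done
    echo "timed out waiting for $file" >&2
    return 1
}
```

A rule's second line could then read something like `wait_for_file bar && $(lsf_submit) gzip bar`, failing loudly instead of hoping one second was enough.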

2 Replies to “LSF, Make and NFS”

  1. I don't know LSF at all, but isn't it possible that you simply need a barrier there? Replace your script plus gzip with a single call to a script that does:

    #!/bin/bash
    ${my_program} foo > bar
    ${lsf_submit} gzip bar

    This way you know gzip would only be called after ${my_program} finished, not before, which I guess is what is happening here.

  2. It's not being called before at all; it's called after the first process finishes, but faster than NFS can propagate the correct references to the other machines (stale handle).

    I did write scripts for the most common behaviours, but there are so many rules, and they change so often, that it would be impossible to maintain that number of scripts, not to mention naming them… 😉

    Worse, it would be a nightmare to maintain the Makefile itself! Imagine that for every rule (I have hundreds) with more than one command I'd have to open a two-line script just to know what it does. No way!
