I use LSF at work, a very good job scheduler. To parallelize my jobs I use Makefiles (with the -j option), and inside every rule I run the command through the job scheduler. Some commands call other Makefiles, cascading the spawning of jobs even further. Sometimes I reach 200+ jobs in parallel.
Our shared BlueArc disk is also very good, with access times often faster than my local disk. And yet, for almost two years I saw some odd behaviour when putting it all together.
I reported random failures in processes that had worked until then and, without any modification, worked ever after. Not long ago I finally figured out what the problem was: NFS refresh speed vs. LSF spawn speed when using Makefiles.
When your Makefile looks like this:
bar.gz:
	$(my_program) foo > bar
	gzip bar
there isn’t any problem, because as soon as bar is created gzip can run and create the gz file. Plain Makefile behaviour, nothing to worry about. But then, when I changed it to:
bar.gz:
	$(lsf_submit) $(my_program) foo > bar
	$(lsf_submit) gzip bar
things started to go crazy. Once every few months, one of my hundreds of Makefiles would just finish saying:
bar: No such file or directory
make: *** [bar.gz] Error 1
And what’s even weirder, the file WAS there!
During the period when these magical problems were happening (luckily I was streamlining the Makefiles every day, so I could just restart the whole thing and it went as planned), I had another problem, quite common when using NFS: the stale NFS handle.
I have my CVS tree on the NFS filesystem, and when testing some Perl scripts between AMD Linux and Alpha OSF machines I used to get these errors (the NFS cache was still being updated) and had to wait a bit, or in most cases just try again.
It was then that I figured out what the big random problem was: the stale NFS handle! Because the Makefile was running on different computers, the NFS cache took a few milliseconds to update, and the LSF spawner, berserk for performance, started the new job well before NFS could reorganize itself. That is why the file was there after all: it was on its way, and the Makefile crashed before it arrived.
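The ordering problem can be reproduced locally without any NFS at all. Here is a toy sketch; the one-second delay merely stands in for NFS cache propagation, and the filename bar is just for illustration:

```shell
#!/bin/sh
# Toy reproduction of the race: the "producer" takes a moment to make
# the file visible, while the "consumer" checks for it immediately --
# just like the LSF-spawned gzip did on a lagging NFS client.
rm -f bar
( sleep 1; touch bar ) &      # stands in for NFS propagation delay

# The consumer runs right away, before the file becomes visible:
if [ -e bar ]; then
    echo "bar is visible"
else
    echo "bar: No such file or directory"   # the error make reported
fi

wait                          # by now the file has arrived -- too late
ls bar                        # ...and yet, the file IS there
```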
The solution? Quite stupid:
bar.gz:
	$(lsf_submit) "$(my_program) foo > bar" && sleep 1
	$(lsf_submit) gzip bar
I put it on every rule that has more than one command spawned by LSF and never had this problem again.
The smart reader will probably tell me that it’s not just ugly, it doesn’t cover all the cases at all, and you’re right: it doesn’t. A stale NFS handle can take more than one second to refresh, single-command rules can break on the next hop, and so on. But because there is some processing between the commands (rule calculations are quite costly; run make -d and you’ll see what I’m talking about), the probability is too low for our computers today… maybe in ten years I’ll have to put sleep 1 on all rules… 😉
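For what it’s worth, the fixed sleep could be replaced by a small polling loop that waits until the file is actually visible on the local client, with a timeout. This is only a sketch, not what my Makefiles actually do; the wait_for_file name and the timeout value are my own inventions:

```shell
#!/bin/sh
# Poll until a file is visible on this client, up to a timeout,
# instead of hoping a fixed "sleep 1" is long enough.
wait_for_file() {
    file=$1
    timeout=${2:-30}          # seconds to wait before giving up
    elapsed=0
    while [ ! -e "$file" ]; do
        elapsed=$((elapsed + 1))
        [ "$elapsed" -gt "$timeout" ] && return 1
        sleep 1
    done
    return 0
}

# Usage, as it would appear after an LSF-submitted command:
touch bar                     # stands in for the job that creates bar
wait_for_file bar 5 && echo "bar is visible"
```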