2007-03-13 15:05:00
Today I was working on a shell script that's supposed to process multiple text files in exactly the same manner. Usually you'd handle this with a FOR-loop, where the code inside the loop is run for each file in turn.
Since this would take a lot of time (going over 1e6 lines of text in multiple passes) I wondered whether it would be possible to run the contents of the FOR-loop in parallel instead. I reworked my script into the following form:
subroutine()
{
    # contents of the old FOR-loop go here, using "$FILE"
    :
}

for file in *.txt    # or however you build your list of files
do
    FILE="$file"
    subroutine &    # kick off in the background and continue the loop
done
This results in a new background process (a subshell) for each file in the list. Got seven files to process? You'll end up with seven additional processes vying for the CPU's attention.
On average the performance of my shell script improved by a factor of 2.5, going from roughly 40 lines processed every three seconds to roughly 100. I was processing seven files in this case.
The only downside is that you'll need some additional code to keep the rest of your shell script from running ahead while the subroutines are still busy in the background. What that code looks like depends entirely on what you're doing in the subroutine.
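For example, a bare-bones way of doing that (just a sketch; collecting the PIDs in a PIDS variable and polling with "kill -0" is one possible approach, not the only one) would be:

PIDS=""
for file in *.txt
do
    FILE="$file"
    subroutine &
    PIDS="$PIDS $!"    # $! holds the PID of the job we just started
done

# Don't run ahead: poll until every background job has exited.
for pid in $PIDS
do
    while kill -0 "$pid" 2>/dev/null
    do
        sleep 1    # still busy; check again in a second
    done
done

Note that "kill -0" doesn't actually send a signal; it only checks whether the process still exists.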
kilala.nl tags: unix, work, sysadmin
2007-03-26 18:48:00
Posted by Vaix
You can take two approaches, depending on your needs:
1) Put a "wait" statement after your "done" - it will ensure that all subroutine() invocations complete before the script progresses.
2) Modify subroutine() to return the PIDs of the processes it kicks off, and modify your initial loop (or create a second one) to control execution based on the number of outstanding PIDs. This allows you to run an arbitrary number of processes in parallel, avoiding the thundering herd problem (see the sketch below).
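A bare-bones sketch of that second approach in plain Bourne shell (the MAXJOBS limit and the trick of using the positional parameters as a PID queue are illustrative choices, not part of Vaix's suggestion):

MAXJOBS=4    # illustrative cap on simultaneous jobs
set --       # use the positional parameters as a PID queue

for file in *.txt
do
    FILE="$file"
    subroutine &
    set -- "$@" $!    # append the new job's PID to the queue
    if [ $# -ge "$MAXJOBS" ]
    then
        wait "$1"    # block on the oldest job before starting more
        shift        # and drop it from the queue
    fi
done
wait    # approach 1: one final wait catches the stragglers

This always blocks on the oldest job, which may not be the first one to finish, so the number of running jobs can dip below MAXJOBS - but it will never exceed it.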
Posted by Thomas
Vaix,
I hadn't heard of "wait" before, so thanks for the tips :) They're very useful.

All content, with exception of "borrowed" blogpost images, or unless otherwise indicated, is copyright of Tess Sluijter. The character Kilala the cat-demon is copyright of Rumiko Takahashi and used here without permission.