Unsorted Notes

Self-monitoring of batch jobs

Example scripts showing how to monitor the execution of a job from within the job itself, on the job's head node.

The aim is to have an external script monitor the execution of a program, for example a simulation. The external script can inform the user about the status of the running simulation and can perform additional processing tasks.

The first script is submitted as the batch script. It starts the monitoring script as well as the script that performs the intended work.

#!/bin/bash
#
# File: job_script.sh
#
# High-level script to execute the job work as well as the job monitoring.
#


MON_ROUNDS=2050             # Number of rounds of MON_INTERVAL length to wait
                            # for the job termination, here about one week
                            # (2050 * 300 s = 615000 s).
MON_INTERVAL=300            # Monitoring interval in seconds (5 minutes)
MON_TERM_MARKER="job-done"  # Marker file that signals job termination.


rm -f "$MON_TERM_MARKER"    # Remove a stale marker from a previous run

# Start the monitoring in the background
./job_mon.sh "$MON_INTERVAL" "$MON_ROUNDS" "$MON_TERM_MARKER" &


# Start the job work
WORK_DURATION=500 # Seconds to sleep in the simulated simulation
if ./job_work.sh arguments of the simulation "$WORK_DURATION" ; then
   # Job done, mark the successful termination for the monitoring script.
   echo "ok" > "$MON_TERM_MARKER"
else
   # Job failed, mark the failure for the monitoring script.
   echo "fail" > "$MON_TERM_MARKER"
fi

# The work is done. Now, relax and wait for the background monitoring
# script to notice the marker and to assess the situation carefully.
wait
#
#

The monitoring script checks every $MON_INTERVAL seconds whether a file named $MON_TERM_MARKER exists. It terminates as soon as the file exists or after a total waiting time of $MON_ROUNDS*$MON_INTERVAL seconds. Because the check happens only after each sleep, the monitoring script may notice the marker up to $MON_INTERVAL seconds after the work has finished.
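
The waiting limit described above can be checked with a little shell arithmetic. The following is only a sketch: max_wait_seconds is a hypothetical helper that does not appear in the scripts of this note, and the values are the ones set in job_script.sh.

```shell
#!/bin/bash
# Sketch: upper bound on the time the monitoring script keeps waiting.
# max_wait_seconds is a hypothetical helper, not part of the original scripts.
max_wait_seconds() {
   local interval=$1 rounds=$2
   echo $(( interval * rounds ))
}

MON_INTERVAL=300   # seconds per round, as in job_script.sh
MON_ROUNDS=2050    # number of rounds, as in job_script.sh

TOTAL=$(max_wait_seconds "$MON_INTERVAL" "$MON_ROUNDS")
# Integer division, so the number of days is rounded down.
echo "Monitoring gives up after $TOTAL s (about $(( TOTAL / 86400 )) days)"
```

With the values above this gives 615000 seconds, slightly more than seven days.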

The example monitoring script shows how post-processing can be steered depending on the termination status of the simulation.

#!/bin/bash
#
# File: job_mon.sh
#
# Monitoring of job termination. Do some action after termination.

MON_INTERVAL=$1        # Seconds to sleep between two checks
MON_ROUNDS=$2          # Maximum number of checks
MON_TERM_MARKER=$3     # Marker file written by job_script.sh
shift 3

I=0
while (( I < MON_ROUNDS )) ; do
   sleep "$MON_INTERVAL"
   I=$(( I + 1 ))
   echo "Monitoring round done $I"
   if [ -e "$MON_TERM_MARKER" ] ; then break; fi
done

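if [ -e "$MON_TERM_MARKER" ] ; then
   # job work terminated.
   echo "We progress to the next meaningful thing in our life."
   # Read the status from the marker file passed as argument,
   # not from a hard-coded file name.
   echo "Result of our work was '$(cat "$MON_TERM_MARKER")'"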
else
   # job work not terminated
   echo "Did you know that time is money?"
   # We should maybe kill the simulation process, shouldn't we?
fi
#
#
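
The monitoring script leaves open whether the simulation should be killed once the waiting time is exhausted. One possible way is sketched below, under the assumption that the work is started in the background so that its process ID is known; in job_script.sh as shown it runs in the foreground. The function wait_for_marker, the variable WORK_PID, the demo marker path, and the `sleep 30` stand-in for the simulation are all hypothetical.

```shell
#!/bin/bash
# Sketch only: wait_for_marker and the PID handling are illustrative
# assumptions, not part of the original scripts.

# Poll for the marker file. Return 0 once it exists, 1 after all rounds.
wait_for_marker() {
   local interval=$1 rounds=$2 marker=$3 i=0
   while (( i < rounds )); do
      sleep "$interval"
      i=$(( i + 1 ))
      [ -e "$marker" ] && return 0
   done
   return 1
}

# Demo with short intervals: the stand-in simulation runs far longer
# than the monitoring window and never writes the marker.
MARKER="${TMPDIR:-/tmp}/job-done-demo.$$"   # hypothetical marker path
rm -f "$MARKER"
sleep 30 &                                  # stand-in for the simulation
WORK_PID=$!

if wait_for_marker 1 2 "$MARKER"; then
   STATUS="finished"
else
   # Waiting time exhausted: terminate the stand-in simulation and reap it.
   kill "$WORK_PID" 2>/dev/null
   wait "$WORK_PID" 2>/dev/null
   STATUS="killed"
fi
echo "Simulation $STATUS"
```

In this demo the marker is never created, so after two rounds of one second the stand-in process is terminated and the script prints "Simulation killed". In a real job, sending a plain `kill` (SIGTERM) first gives the simulation a chance to clean up.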

Finally, the following script was used to simulate the simulation run in the job.

#!/bin/bash
#
# File: job_work.sh
#
# This is the job itself, we do the simulation here for example.

echo "My job arguments are: $*"

# Now get our simulation arguments. The duration is the fifth argument,
# following the four placeholder words passed by job_script.sh.
WORK_DURATION=$5
echo "I am going to calculate something meaningful, taking $WORK_DURATION sec"
sleep "$WORK_DURATION"

echo "Found something meaningful: $(( WORK_DURATION * (3*13+3) / WORK_DURATION ))"

exit 0 # Success of the simulation, != 0 if failure
#
#
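
Reading the duration from `$5` only works as long as exactly four arguments precede it. A slightly more robust variant, sketched here as an assumption and not as part of the original scripts, takes the last argument regardless of how many come before it. The helper name last_arg is hypothetical, and the `${@: -1}` expansion is specific to bash.

```shell
#!/bin/bash
# Sketch: take the duration from the last argument instead of a fixed position.
# last_arg is a hypothetical helper, not part of the original scripts.
last_arg() {
   # Without arguments there is no last one; signal that to the caller.
   (( $# > 0 )) || return 1
   # "${@: -1}" expands to the final positional parameter (bash-specific).
   echo "${@: -1}"
}

# Same argument list as in job_script.sh.
WORK_DURATION=$(last_arg arguments of the simulation 500)
echo "Work duration: $WORK_DURATION"
```

Inside job_work.sh the call would be `WORK_DURATION=$(last_arg "$@")`, which keeps working when simulation arguments are added or removed.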

2012-12-28 – Category: hpc – Tags: batch-processing shell-programming