Advanced Staging - OBSOLETE!

This page is obsolete.  You don't need to do any staging to use the cloud.

The simple introduction to staging showed how to stage a file to a worker node, generate a single output file and stage the output file back to the original directory.

Here we discuss how to get a little more adventurous with staging.

Staging multiple files

If your batch job generates more than one output file, you need to stage multiple files back to your host node. Similarly, a more complex program may need several input files staged in before it runs. You do this by simply repeating the stagein and stageout directives as many times as you need:

# Copy our script and data to the worker node
#PBS -W stagein=$TMPDIR/script@cxin01.cloud.coepp.org.au:$HOME/job_test/script
#PBS -W stagein=$TMPDIR/data1@cxin01.cloud.coepp.org.au:$HOME/job_test/data1
#PBS -W stagein=$TMPDIR/data2@cxin01.cloud.coepp.org.au:$HOME/job_test/data2
# Copy our result files back to the directory job_test
#PBS -W stageout=$TMPDIR/result1@cxin01.cloud.coepp.org.au:$HOME/job_test/result1
#PBS -W stageout=$TMPDIR/result2@cxin01.cloud.coepp.org.au:$HOME/job_test/result2
#PBS -W stageout=$TMPDIR/result3@cxin01.cloud.coepp.org.au:$HOME/job_test/result3
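
Torque's qsub man page also documents a comma-delimited file list form, which lets several pairs share one directive. A sketch (check man qsub on your system before relying on this):

# Stage in both data files with a single comma-delimited directive
#PBS -W stagein=$TMPDIR/data1@cxin01.cloud.coepp.org.au:$HOME/job_test/data1,$TMPDIR/data2@cxin01.cloud.coepp.org.au:$HOME/job_test/data2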

Staging directories

You can stage whole directories at once:

# Stage in the job_test directory to the worker node
#PBS -W stagein=$TMPDIR/job_test@cxin01.cloud.coepp.org.au:$HOME/job_test

The directory and its contents are recursively copied to the worker node. Similarly, if you need more than one directory staged to the worker node before executing, just repeat the stagein directive:

# Copy our directories to the worker node for running
#PBS -W stagein=$TMPDIR/job_test@cxin01.cloud.coepp.org.au:$HOME/job_test
#PBS -W stagein=$TMPDIR/another_test@cxin01.cloud.coepp.org.au:$HOME/another_test

Staging results of many runs

You may have noticed in the previous examples that we staged back files with fixed names, such as result1. This means that if we submit the same job script twice in quick succession with different parameters for the executable, the job that finishes last will overwrite the result1 file from the other. This is usually not what we want.

How can we submit multiple similar jobs and keep the results distinct? We can use the environment variables provided by the PBS system during execution.

To see the variables PBS provides, we can run a cloud job that contains only this code:

env | grep PBS_ | sort
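
Wrapped in a job script, the probe might look like the minimal sketch below; the resource requests and job name are assumptions borrowed from the later examples on this page:

#!/bin/bash

#PBS -S /bin/bash
#PBS -q batch
#PBS -j oe
#PBS -l nodes=1
#PBS -l walltime=00:05:00
#PBS -N my_test

# Dump every PBS-provided environment variable, sorted
env | grep PBS_ | sort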

The batch job ran as job number 645, and the returned output file contains:

PBS_ENVIRONMENT=PBS_BATCH
PBS_GPUFILE=/var/torque/aux//645.c3torque.cloud.coepp.org.augpu
PBS_JOBCOOKIE=17B43E56D3906C9BC9DB79E47A6DFC87
PBS_JOBID=645.c3torque.cloud.coepp.org.au
PBS_JOBNAME=my_test
PBS_MOMPORT=15003
PBS_NODEFILE=/var/torque/aux//645.c3torque.cloud.coepp.org.au
PBS_NODENUM=0
PBS_NP=1
PBS_NUM_NODES=1
PBS_NUM_PPN=1
PBS_O_HOME=/home/rwilson
PBS_O_HOST=vm-115-146-84-190.rc.melbourne.nectar.org.au
PBS_O_LANG=en_AU.UTF-8
PBS_O_LOGNAME=rwilson
PBS_O_MAIL=/var/spool/mail/rwilson
PBS_O_PATH=/usr/kerberos/bin:/bin:/opt/lcg/bin:/usr/local/bin:/usr/bin
PBS_O_QUEUE=batch
PBS_O_SHELL=/bin/bash
PBS_O_WORKDIR=/home/rwilson/job_test
PBS_QUEUE=short
PBS_SERVER=c3torque.cloud.coepp.org.au
PBS_TASKNUM=1
PBS_VERSION=TORQUE-2.5.7
PBS_VNODENUM=0

Most of the above is of interest only to the PBS batch system itself. The PBS_JOBID variable, however, is useful: it contains the number of the PBS job, which is unique. We can extract the job number and use it to return results in a way that never overwrites earlier ones: create a uniquely-named directory holding all the output files, then stage that directory back on job completion. I like to keep all these unique results directories inside a top-level results directory, but that's a personal choice.

So, if we simulate a batch job that creates two output files:

  cd $TMPDIR
  echo "Result file 1" > result1
  echo "Result file 2" > result2

how can we create a unique results directory?

If we say that we are going to keep results in a subdirectory called results, then we do this in the batch job:

  cd $TMPDIR
  OUT_DIR=results/$(echo $PBS_JOBID | cut -d. -f1)
  mkdir -p $OUT_DIR
  echo "Result file 1" > $OUT_DIR/result1
  echo "Result file 2" > $OUT_DIR/result2

The code fragment $(echo $PBS_JOBID | cut -d. -f1) extracts the first '.'-delimited field of the PBS_JOBID variable, that is, the job number. In this job run, OUT_DIR would be results/645.
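
If you prefer, bash parameter expansion does the same job without spawning extra processes:

  # Strip everything from the first '.' onward, leaving just the job number
  OUT_DIR=results/${PBS_JOBID%%.*}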

Now we must stage back the results directory into our job_test directory:

#PBS -W stageout=$TMPDIR/results@cxin01.cloud.coepp.org.au:$HOME/job_test
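
Putting the pieces together, the whole job script might look like the sketch below; the resource requests are borrowed from the later examples on this page, and the output files are simulated with echo as above:

#!/bin/bash

#PBS -S /bin/bash
#PBS -q batch
#PBS -j oe
#PBS -l nodes=1
#PBS -l walltime=00:05:00
#PBS -N my_test

# Stage the job_test directory to the worker node
#PBS -W stagein=$TMPDIR/job_test@cxin01.cloud.coepp.org.au:$HOME/job_test
# Stage the unique results back into job_test on completion
#PBS -W stageout=$TMPDIR/results@cxin01.cloud.coepp.org.au:$HOME/job_test

# Create a results directory named after the unique job number
cd $TMPDIR
OUT_DIR=results/$(echo $PBS_JOBID | cut -d. -f1)
mkdir -p $OUT_DIR

# Simulate the job's output, as in the example above
echo "Result file 1" > $OUT_DIR/result1
echo "Result file 2" > $OUT_DIR/result2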

Before we run any jobs, our job_test directory looks like:

bash-3.2$ ls -l 
total 16
-rwxr-xr-x 1 rwilson people 7311 Nov 12 01:22 fib
-rw-r--r-- 1 rwilson people  479 Nov 12 01:22 fib.c
-rw-r--r-- 1 rwilson people  800 Feb 18 01:12 run_job.sh

We submit a single job, which runs to completion. Now our directory contains:

bash-3.2$ ls -l 
total 16
-rwxr-xr-x 1 rwilson people 7311 Nov 12 01:22 fib
-rw-r--r-- 1 rwilson people  479 Nov 12 01:22 fib.c
-rw------- 1 rwilson people    0 Feb 18 01:14 my_test.o645
drwxr-xr-x 3 rwilson people   16 Feb 18 01:14 results
-rw-r--r-- 1 rwilson people  800 Feb 18 01:12 run_job.sh
bash-3.2$ ls -l results
total 0
drwxr-xr-x 2 rwilson people 34 Feb 18 01:14 645

The newly-created results directory contains a subdirectory 645, named after the job number.

Now if we submit five more jobs from the unchanged job script, directory results contains:

bash-3.2$ ls -l results
total 0
drwxr-xr-x 2 rwilson people 34 Feb 18 01:14 645
drwxr-xr-x 2 rwilson people 34 Feb 18 01:16 646
drwxr-xr-x 2 rwilson people 34 Feb 18 01:16 647
drwxr-xr-x 2 rwilson people 34 Feb 18 01:16 648
drwxr-xr-x 2 rwilson people 34 Feb 18 01:16 649
drwxr-xr-x 2 rwilson people 34 Feb 18 01:16 650

And each subdirectory contains the result files:

bash-3.2$ ls -l results/645
total 8
-rw-r--r-- 1 rwilson people 14 Feb 18 01:14 result1
-rw-r--r-- 1 rwilson people 14 Feb 18 01:14 result2
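
Once several runs have accumulated, a small loop makes it easy to inspect all the results together; a sketch:

  # Print each job's result1 file, labelled with its job number
  for d in results/*/; do
      echo "=== job $(basename $d) ==="
      cat $d/result1
  done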

Another way to stage input and output of many runs

The above method to stage the output of many similar jobs does work, but the 'unpicking' of the PBS_JOBID environment variable is a little fiddly, and that approach doesn't lend itself to staging input files. So what to do?

We can use the -t option to qsub. If we look at the options to qsub (man qsub), we see:

-t array_request
        Specifies the task ids of a job array.  Single task arrays are allowed.

        The array_request argument is an integer id or a range of integers. Multiple ids  or
        id  ranges  can  be  combined  in  a  comma delimited list. Examples: -t 1-100 or -t
        1,10,50-100

        An optional slot limit can be specified to limit the amount of  jobs  that  can  run
        concurrently  in  the job array. The default value is unlimited. The slot limit must
        be the last thing specified in the array_request and is delimited from the array  by
        a percent sign (%).

        qsub script.sh -t 0-299%5

        This sets the slot limit to 5. Only 5 jobs from this array can run at the same time.

        Note: You can use qalter to modify slot limits on an  array.  The  server  parameter
        max_slot_limit can be used to set a global slot limit policy.

This doesn't make a lot of sense on first reading. However, a simple usage example will help.

Let's make a directory test and move into it. In this directory we create a very simple job script test.sh:

#!/bin/bash
 
#PBS -S /bin/bash
#PBS -q batch
#PBS -j oe
#PBS -l nodes=1
#PBS -l mem=512MB,vmem=512MB
#PBS -l walltime=00:05:00
#PBS -N array_test
 
# Run the job
echo "Hello from job $PBS_ARRAYID"

Suppose we want to run this job five times. We can do:

qsub -t 1-5 test.sh

Running qstat a few times shows the job eventually running:

bash-3.2$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1542[].c3torque            array_test       rwilson                0 Q short          
bash-3.2$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1542[].c3torque            array_test       rwilson                0 R short          

Note that we don't see a set of five job numbers. Instead we see a single number with a [] suffix, showing that this is an array job.

Once qstat shows that the array job has finished, we see the output files:

bash-3.2$ ls -l 
total 24
-rw------- 1 rwilson people  17 Feb 21 01:00 array_test.o1542-1
-rw------- 1 rwilson people  17 Feb 21 01:00 array_test.o1542-2
-rw------- 1 rwilson people  17 Feb 21 01:00 array_test.o1542-3
-rw------- 1 rwilson people  17 Feb 21 01:00 array_test.o1542-4
-rw------- 1 rwilson people  17 Feb 21 01:00 array_test.o1542-5
-rw-r--r-- 1 rwilson people 195 Feb 21 00:58 test.sh

Looking in a few of the output files, we see:

bash-3.2$ more array_test.o1542-1
Hello from job 1
bash-3.2$ more array_test.o1542-2
Hello from job 2
bash-3.2$ more array_test.o1542-5
Hello from job 5

So the PBS_ARRAYID environment variable is the index into the job array for each job in the array.

We can use the PBS_ARRAYID environment variable to stage in and stage out a different set of files for each job in an array. Suppose we have a job that reads an input file and, after processing it, writes an output file. We must stage the input file in to the job, stage the output file back out, and keep the files for each array job distinct on the host node.

Here's a sample job script that reads an input file and outputs some data:

#!/bin/bash
 
#PBS -S /bin/bash
#PBS -q batch
#PBS -j oe
#PBS -l nodes=1
#PBS -l mem=512MB,vmem=512MB
#PBS -l walltime=00:05:00
#PBS -N array_test
#PBS -m ae
#PBS -M ross.wilson@adelaide.edu.au
 
#PBS -W stagein=$TMPDIR/test@cxin02.cloud.coepp.org.au:$HOME/test
#PBS -W stageout=$TMPDIR/test/results@cxin02.cloud.coepp.org.au:$HOME/test
 
# Run the job
cd $TMPDIR/test
 
INPUT=inputs/input$PBS_ARRAYID
OUTPUT=results/output$PBS_ARRAYID
 
echo "Job $PBS_ARRAYID" > $OUTPUT
cat $INPUT >> $OUTPUT

We also create five input files:

bash-3.2$ ls -l inputs
total 72
-rw-r--r-- 1 rwilson people   13 Feb 21 01:14 input1
-rw-r--r-- 1 rwilson people   13 Feb 21 01:15 input2
-rw-r--r-- 1 rwilson people   13 Feb 21 01:15 input3
-rw-r--r-- 1 rwilson people   13 Feb 21 01:15 input4
-rw-r--r-- 1 rwilson people   13 Feb 21 01:15 input5
bash-3.2$ more inputs/input2
Input file 2
bash-3.2$ more inputs/input5
Input file 5
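
These input files can be created by hand, or with a small loop like this:

  # Create the five input files used in this example
  for i in 1 2 3 4 5; do
      echo "Input file $i" > inputs/input$i
  done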

Let's submit the job as an array of five:

bash-3.2$ qsub -t 1-5 test.sh
1577[].c3torque.cloud.coepp.org.au
bash-3.2$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1577[].c3torque            array_test       rwilson                0 Q short          

When the jobs have all finished, we see the combined output and error files (joined because of the -j oe directive):

bash-3.2$ ls -l 
total 36
drwxr-xr-x 2 rwilson people 4096 Feb 21 04:57 inputs
drwxr-xr-x 2 rwilson people    6 Feb 21 05:09 results
-rw------- 1 rwilson people    0 Feb 22 00:41 array_test.o1577-1
-rw------- 1 rwilson people    0 Feb 22 00:41 array_test.o1577-2
-rw------- 1 rwilson people    0 Feb 22 00:41 array_test.o1577-3
-rw------- 1 rwilson people    0 Feb 22 00:41 array_test.o1577-4
-rw------- 1 rwilson people    0 Feb 22 00:41 array_test.o1577-5
-rw-r--r-- 1 rwilson people  557 Feb 22 00:40 test.sh

Looking at a few of the output files:

bash-3.2$ more results/output2
Job 2
Input file 2
bash-3.2$ more results/output5
Job 5
Input file 5