Cloud FAQ

Don't use /tmp

:?: I always use /tmp as a scratch area.

:!: You should not use /tmp for scratch storage because the disk space available there is small. If /tmp fills up it can affect your job and any jobs that follow it on the same node. If you need to write into a temporary area, use $TMPDIR.

If you use mktemp to create a temporary file or directory, the new file/directory will automatically be created under $TMPDIR.
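
For a single temporary file rather than a directory, a minimal sketch (mktemp honours $TMPDIR when it is set):

TMPFILE=$(mktemp)              # created under $TMPDIR, e.g. $TMPDIR/tmp.XXXXXXXXXX
    <write to $TMPFILE>
rm -f $TMPFILE                 # remove the file when finished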

Good practice suggests you should create a work directory unique to your job and delete that directory on job completion:

MYDIR=$TMPDIR/$PBS_JOBID
trap "rm -Rf $MYDIR" EXIT      # delete directory on unexpected job exit
mkdir -p $MYDIR                # create the temp directory
cd $MYDIR
    <create many files>
cd
rm -Rf $MYDIR

or using mktemp:

MYDIR=$(mktemp -d)             # create the temp directory
trap "rm -Rf $MYDIR" EXIT      # delete directory on unexpected job exit
cd $MYDIR
    <create many files>
cd
rm -Rf $MYDIR

The /data partition is slow?

:?: Input/output to the filesystem under the /data partition is slow, how can I get around this?

:!: The /data partition is an NFS share (at the moment), so it is relatively slow and easily overloaded. You can write data to the /scratch filesystem instead, which is faster. Check how much free space /scratch has (df -h) before writing lots of data. At the moment the cloud nodes have about 26GB of free space in /scratch, but this can change, especially as other users write to it.
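
For example, to check the free space before you start writing:

df -h /scratch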

Don't forget to copy any data you need back to your directory under the /data partition before your batch job finishes! The /scratch filesystem on a cloud node is only visible on that cloud node!!

Please clean up any files you create under /scratch to leave the maximum amount of free space for other users. One convenient approach is to create a directory with a name unique to your batch job, write all your data under that directory, and delete the entire directory when your batch job finishes:

# start of your code in your bash job script
WORKDIR=/scratch/$PBS_JOBID
trap "rm -Rf $WORKDIR" EXIT      # delete directory on unexpected job exit
mkdir -p $WORKDIR

# create lots of files under $WORKDIR

# copy back any file(s) you want to keep
cp $WORKDIR/REALLY_IMPORTANT_RESULTS /data/smith/results

# clean up when finished
rm -Rf $WORKDIR

Unique output files

:?: I run many simultaneous jobs. How do I stop these jobs corrupting created files?

:!: You need to use $PBS_JOBID.

If you submit the same batch file many times to the cloud you may find that the output files created by your batch jobs are corrupted, because all of the running jobs write to the same output file(s) at the same time.

How can we ensure that each job writes to its own private output file (or directory)?

A running batch job inherits many torque-sourced environment variables. Most are of little use to you, the user. One, however, is very useful as it contains the batch job number, which is unique to your batch job. This is the $PBS_JOBID environment variable. If you run a simple batch job which just does this:

echo "$PBS_JOBID"

then you will see something like this in the output file:

390505.c3torque.cloud.coepp.org.au

You can use the contents of $PBS_JOBID if you need a unique string to name directories or files. For instance, if you want to create many files, do it in a directory, as this keeps everything in one place and allows easy deletion when required:

MYDIR=$TMPDIR/$PBS_JOBID
mkdir -p $MYDIR
cd $MYDIR
    <create many files>
    <store results somewhere>
cd
rm -Rf $MYDIR
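
If you only need a uniquely named output file rather than a whole directory, $PBS_JOBID works there too. A sketch (the program name and results path are only examples):

my_simulation > /data/smith/results/output.$PBS_JOBID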

If the full value of $PBS_JOBID is too long for your purposes you can shorten it by extracting just the job number:

JOBNUM=$(echo $PBS_JOBID | cut -d. -f1)

Here JOBNUM would contain “390505”.
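
The same result can be had with bash parameter expansion, which avoids the extra echo and cut processes:

JOBNUM=${PBS_JOBID%%.*}        # remove everything from the first '.' onwards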

Unable to copy file?

:?: Why do I get emails containing “Unable to copy file”?

:!: Possibly because you aren't submitting your job from your /data directory.

You get emails containing:

An error has occurred processing your job, see below.
Post job file processing error; job 362527.c3torque.cloud.coepp.org.au on host vm-115-146-86-98.melbourne.rc.nectar.org.au/0

Unable to copy file /var/lib/torque/spool/362527.c3torque.cloud.coepp.org.au.OU to ....

You must submit your batch jobs from a directory under /data: the batch system records the directory from which a job is submitted and tries to copy the output files back to that directory when the job finishes. The /home partition is not available on the batch worker nodes, so if you submit from your home directory the copy fails.
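
A minimal example of submitting from under /data (the path and script name are only examples):

cd /data/smith/my_analysis
qsub my_job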

If the output that wasn't copied contains something extremely important and you can't rerun the batch job, forward the error email to RC support (rc@coepp.org.au) who might be able to retrieve the file (no promises).

My job keeps getting terminated because of exceeded time

:?: Why does my job keep being terminated due to exceeded walltime or CPU time?

:!: Possibly because you aren't setting a required walltime in your job script, which could mean your job runs in the wrong queue. The default walltime and CPU time limits for the different batch queues are:

queue   walltime                    CPU time
short   maximum 1 hour              maximum 1 hour
long    default maximum 96 hours    default maximum 5 hours

This means that if you submit a job without specifying a queue and without including any request for an increased wall or CPU time, your job will be submitted to the short queue and have a maximum wall and CPU time of 1 hour.
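
For example, a plain submission with no queue and no time requests (the script name is only an example):

qsub my_job                    # goes to the short queue: 1 hour wall/CPU time limit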

If you submit the same job to the long queue

qsub -q long my_job

then your job will have 96 hours of walltime and 5 hours of CPU time before termination.
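
The queue can also be selected inside the job script with a PBS directive instead of on the command line:

#PBS -q long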

You are free to include PBS directives in your job script that set your required wall and/or CPU time limits. These are explained in more detail in this simple introduction. The basic commands are:

#PBS -l walltime=1000:00:00
#PBS -l cput=100:00:00

The above commands ask for 1000 hours of walltime and 100 hours of CPU time.

Note that you don't have to specify a queue when you set your requirements for wall/CPU time as the batch system will automatically select the short or long queue as appropriate by looking at your wall/CPU time requirements.
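
The same limits can also be requested on the qsub command line rather than in the script:

qsub -l walltime=1000:00:00,cput=100:00:00 my_job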

Problems arise if you don't specify the queue and also don't set a required wall or CPU time. The batch system will place your job in the short queue with the associated 1 hour limit on wall and CPU time.

I don't get any output

:?: Why don't I get any standard output or generated files in my /data directory?

:!: There can be a delay between your job disappearing from the qstat listing and its output files appearing. If you haven't received an error email, fifteen minutes or so have passed, and you still can't see the expected files, get in touch with your support people (rc@coepp.org.au).

I want to run python in the cloud

:?: I don't want to submit a bash script, I want to run python!

:!: Welcome pythonistas! You can submit a python job to the cloud as long as it has a standard python shebang line:

#!/usr/bin/env python

Note that the 'standard' python on interactive and worker nodes is python 2.6, which is what the above shebang will get you. If you want another version of python, you'll have to make sure it's on the cloud nodes (rc@coepp.org.au can help here) and use the appropriate shebang.

A simple python job to get the job ID and print the python version:

#!/usr/bin/env python

#PBS -N python_job
#PBS -j oe
#PBS -l walltime=0:01:00

import os
import sys

print('PBS_JOBID=%s' % os.environ['PBS_JOBID'])
print('python version=%s' % sys.version)

As usual, the PBS directives should come before any actual code.
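
Save the script and submit it like any other job (the filename here is only an example):

qsub python_job.py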

Running this job in the cloud would produce output similar to this (as of August 2013):

PBS_JOBID=35907.c3torque.cloud.coepp.org.au
python version=2.6.6 (r266:84292, Feb 21 2013, 19:26:11)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-3)]

Note that while python is installed in the batch system, very few of the many useful modules outside the standard library are installed. If you need something like numpy, for instance, please ask for it to be installed (rc@coepp.org.au).
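
A quick way to check from an interactive node whether a module is already installed:

python -c 'import numpy'       # prints an ImportError traceback if the module is missing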
