This is a very simple introduction to using the cloud to run your jobs. If you know how to create a shell script to run your jobs, you know enough to use the cloud.
However, if you don't know anything about UNIX/Linux shell programming or editing, have a look at these links first:
You don't need to know much about shell and editors, just enough to manage directories and edit a file.
With that out of the way, let's get started!
To use a cloud batch system you log in to the “interactive nodes”. For the NeCTAR tier 3 cloud the nodes are cxin01, cxin02, cxin03 and cxin04. You log in this way:
ssh -Y <user_name>@cxin01.cloud.coepp.org.au
ssh -Y <user_name>@cxin02.cloud.coepp.org.au
ssh -Y <user_name>@cxin03.cloud.coepp.org.au
ssh -Y <user_name>@cxin04.cloud.coepp.org.au
The above nodes are at Melbourne.
The interactive nodes are used to submit jobs to the cloud and also for interactive use. They each have 16 cores and 64GB memory.
On the interactive nodes you have your own home directory /home/<user_name> to develop and prepare executables. For interactive work you will work in your home directory.
When your job runs in the cloud it does not have access to your home directory. A directory /data/<user_name> is available to you on both the interactive nodes and all the batch worker nodes; we now recommend using CephFS at /coepp/cephfs instead. You must place your executable files and any input data required under your /data/<user_name> or /coepp/cephfs directory before submitting a batch job. Similarly, any output files written by your batch job will be under your /data/<user_name> or /coepp/cephfs directory.
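For example, a minimal sketch of staging things before submission might look like this (it assumes the fib executable and the fib_minimal.job script shown below have already been created in your home directory, and that you substitute your own user name):

# copy the executable and the job script from your home directory into /data
cp ~/fib ~/fib_minimal.job /data/<user_name>/
cd /data/<user_name>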
With the files in place, here is the minimal batch job file we will submit; it runs a small program called fib:

$ cat fib_minimal.job
#!/bin/bash
# --- Start PBS Directives ---
# Inherit current user environment
#PBS -V
# Submit to the long queue
#PBS -q long
# Job Name
#PBS -N fib_minimal.job
# Email on abort and exit
#PBS -m ae
#PBS -M <your email>
# --- End PBS Directives ---
# Run the job from current working directory
cd $PBS_O_WORKDIR
# Run fib
./fib 30
exit 0
The executable and the job file are both in the /data/smith directory:

$ pwd
/data/smith
$ ls -l
-rwxr-xr-x 1 smith people 7298 Nov 7 01:10 fib
-rw-r--r-- 1 smith people 21 Nov 7 01:13 fib_minimal.job
Now submit the job:

$ qsub fib_minimal.job
365240.c3torque.cloud.coepp.org.au
When the job finishes, two new files appear: one holds the job's standard error output and the other its standard output:

$ ls -l
-rwxr-xr-x 1 smith people 7298 Nov 7 01:10 fib
-rw-r--r-- 1 smith people 21 Nov 7 01:13 fib_minimal.job
-rw------- 1 smith people 0 Nov 7 01:14 fib_minimal.job.e365240
-rw------- 1 smith people 21 Nov 7 01:14 fib_minimal.job.o365240
The fib_minimal.job.o365240 file contains the program's output:

fib(30)=832040
So far we have run a very simple minimal batch job. You will normally execute much longer running jobs so we need to talk about batch queues and time limits.
There are three batch queues in the current system: short, long and extralong. Each queue has limits on CPU time and wall clock time (walltime). The default walltime limits for the batch queues are:
| queue | walltime |
|---|---|
| short | maximum 1 hour |
| long | maximum 7 days |
| extralong | maximum 31 days |
If your batch jobs exceed the walltime limit they will be terminated.
You can specify which queue to run on when you submit your job:
qsub -q long fib_minimal.job
If you don't specify a queue you get the short queue by default. That didn't matter for our little fib example, as the required CPU time is very short, but for longer running jobs you must think about the time you need: with no queue specified you run on the short queue and get a maximum of one hour of walltime. If your job requires more than one hour of walltime you should submit to the long queue (shown above).
If you need more than five hours of walltime (for instance), there are ways to request extended limits using batch parameters. Suppose we had a very inefficient method to compute Fibonacci 30 that is expected to run for ten CPU hours. We would need a batch file like this:
#PBS -l cput=10:00:00
/data/smith/fib 30
Now when our job runs it will be terminated if the CPU time used exceeds ten hours. The cput parameter specifies the required time in an <hour>:<minute>:<second> format.
Note that the walltime limit is still 96 hours. If your job needed more walltime, you could do:
#PBS -l cput=10:00:00
#PBS -l walltime=1000:00:00
/data/smith/fib 30
and now your walltime limit is 1000 hours and CPU limit is 10 hours.
Note that if you do specify required times in your batch file, you don't need to request a particular queue. The batch system is smart enough to send you to the correct queue depending on your requirements.
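For example, a job file that specifies only its required walltime and no queue (the 48-hour value here is just a placeholder) will be routed by the batch system to a queue whose limits can accommodate it, in this case the long queue since 48 hours exceeds the short queue's one-hour limit:

#PBS -l walltime=48:00:00
cd /data/<user_name>
./fib 30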
If you run many different types of jobs it is useful to name them. You do that with the name parameter:
#PBS -N name_of_my_job
The name you specify this way appears against your job in a qstat listing.
If you run many jobs it may be inconvenient to receive the two output files (standard output and error). You can combine the two files into one with this parameter:
#PBS -j oe
We have disabled email from the batch system as it leads to our mail server being blocked for spam.
You can specify the queue you want your job to be submitted to in your script by using:
#PBS -q long
This has the same effect as if you had submitted the job with
qsub -q long myfile
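Pulling the directives discussed so far together, a complete job script might look like the sketch below (the job name, queue and time values are placeholders you would adapt to your own job):

#!/bin/bash
# name the job and pick the queue
#PBS -N my_long_job
#PBS -q long
# request CPU time and walltime
#PBS -l cput=10:00:00
#PBS -l walltime=24:00:00
# combine standard output and error into one file
#PBS -j oe
cd /data/<user_name>
./fib 30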
It is difficult to understand what's happening in your batch job when it is running on another machine, but you can add commands to your batch file that help you debug.
For instance, if you submitted a batch file that contained only the command:
./fib 30
you would get an output file containing:
/var/lib/torque/mom_priv/jobs/365356.c3torque.cloud.coepp.org.au.SC: line 1: ./fib: No such file or directory
which isn't much help beyond telling you something is wrong.
Adding some debug to your batch file will help:
pwd
./fib 30
The output file now tells us what the problem is:
/scratch/workdir/smith
/var/lib/torque/mom_priv/jobs/365357.c3torque.cloud.coepp.org.au.SC: line 2: ./fib: No such file or directory
It turns out that the batch job starts running in our batch home directory (/scratch/workdir/smith), which isn't what we expected.
There are ways to control which directory your batch job runs in, but it's better to explicitly set the directory in your batch file. So we could change the batch file to:
cd /data/smith
./fib 30
and this will run without error.
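Alternatively, the batch system sets the $PBS_O_WORKDIR environment variable to the directory you ran qsub from, so you can change to that directory instead of hard-coding a path, as the minimal example at the start of this page did:

# change to the directory the job was submitted from
cd $PBS_O_WORKDIR
./fib 30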
It is not good practice to put all your executables and batch files in your /data/<user_name> directory. You should have a sub-directory for every set of jobs you run in the cloud. That is, you would run the fib executable from within a /data/<user_name>/fib directory. Your batch file should move to that directory to run your executables, and any input data files should be there.
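A sketch of that layout for the fib example (the directory name and the input.dat file are purely illustrative):

# create a sub-directory for this set of jobs and stage everything into it
mkdir -p /data/<user_name>/fib
cp fib fib_minimal.job input.dat /data/<user_name>/fib/

with the batch file changing into that directory before running:

cd /data/<user_name>/fib
./fib 30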
Once you have submitted one or more jobs to the batch system, you can get the status of the jobs by doing:
$ qstat
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
365364.c3torque           METScale_UP.sh   lspiller        00:00:54 R long
365365.c3torque           Wsys_DOWN.sh     lspiller        00:01:30 R long
365366.c3torque           Wsys_UP.sh       lspiller        00:00:59 R long
365367.c3torque           fib_minimal.job  smith                  0 Q short
This shows that there are four jobs in the system, three of which are running (R status). Your batch job isn't running yet, it is queued (Q status).
Warning! There can be thousands of running and queued jobs in the system.
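To list only your own jobs rather than everything in the system, you can restrict qstat to your user name (substituting your own login):

qstat -u <user_name>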
It is possible to delete a batch job you have submitted. When you submit a job you get a unique ID number:
$ qsub fib_minimal.job
365240.c3torque.cloud.coepp.org.au
The 365240 number is your job number. To delete this job, whether it is queued or has started to run, just do:
$ qdel 365240
The batch file #PBS options we saw above, such as:
#PBS -N my_job
#PBS -q long
and all the others, are just standard options that qsub accepts on the command line, but with a prefix of “#PBS ”.
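For example, putting those two lines in the job script has the same effect as supplying the options on the qsub command line (my_job.sh here is a placeholder script name):

qsub -N my_job -q long my_job.sh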
When you submit a job with qsub any command line options specified override the same options in the job script file, if any. This means that even though you may have set a CPU limit of 10 hours in your job script, you can change the limit for a submitted job to 20 hours by doing:
qsub -l cput=20:00:00 my_job.sh
As always, if you have a question or want to report a problem, send email to:
rc@coepp.org.au