
A Walk in the Cloud

Introduction

This is a very simple introduction to using the cloud to run your jobs. If you know how to create a shell script to run your jobs, you know enough to use the cloud.

However, if you don't know anything about UNIX/Linux shell programming or editing, work through an introductory shell tutorial first.

You don't need to know much about the shell and editors, just enough to manage directories and edit a file.

With that out of the way, let's get started!

Interactive Nodes

To use a cloud batch system you log in to the “interactive nodes”. For the NeCTAR tier 3 cloud the nodes are cxin01, cxin02, cxin03 and cxin04. You log in this way:

ssh -Y <user_name>@cxin01.cloud.coepp.org.au
ssh -Y <user_name>@cxin02.cloud.coepp.org.au
ssh -Y <user_name>@cxin03.cloud.coepp.org.au
ssh -Y <user_name>@cxin04.cloud.coepp.org.au

The above nodes are at Melbourne.

The interactive nodes are used to submit jobs to the cloud and also for interactive use. They each have 16 cores and 64GB memory.

On the interactive nodes you have your own home directory /home/<user_name> to develop and prepare executables. For interactive work you will work in your home directory.

When your job runs in the cloud it does not have access to your home directory. A directory /data/<user_name> is available to you on both the interactive nodes and all the batch worker nodes. We now recommend using CephFS at /coepp/cephfs instead. You must place your executable files and any input data under your /data/<user_name> or /coepp/cephfs directory before submitting a batch job. Similarly, any output files written by your batch job will appear under your /data/<user_name> or /coepp/cephfs directory.
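For example, staging the files used in the examples below might look like this (a sketch, using the example username smith from the listings that follow):

$ cp ~/fib ~/fib_minimal.job /data/smith/
$ cd /data/smith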

Batch Jobs

  • Now let's get down to actually running a batch job.
  • A batch job that is submitted to the cloud is just a text file. Normally it's a shell script, but it is possible to run other types of files, such as Python scripts. Here we will only use shell scripts.
  • The whole point of a batch job is to run your executable, so we need an executable for this example. We use fib, a compiled C program that takes a number N as a parameter and prints the Nth Fibonacci number to standard output. Pretty boring, but it makes a good example. The source code is here.
  • We create the simplest batch file possible to run “fib 30” in the cloud. The file fib_minimal.job contains:
$ cat fib_minimal.job
#!/bin/bash

# --- Start PBS Directives ---
# Inherit current user environment
#PBS -V

# Submit to the long queue
#PBS -q long

# Job Name
#PBS -N fib_minimal.job

# Email on abort and exit
#PBS -m ae
#PBS -M <your email>
# --- End PBS Directives ---

# Run the job from current working directory
cd $PBS_O_WORKDIR

# Run fib
./fib 30

exit 0
  • For this to work we must ensure that the executable and the batch file are both in our /data directory:
$ pwd
/data/smith

$ ls -l
-rwxr-xr-x  1 smith people  7298 Nov  7 01:10 fib
-rw-r--r--  1 smith people    21 Nov  7 01:13 fib_minimal.job
  • Now we run the job:
$ qsub fib_minimal.job
365240.c3torque.cloud.coepp.org.au
  • Note that the batch system gave our job the number 365240. When the job has finished, we see that two files have been created in /data/smith:
$ ls -l
-rwxr-xr-x  1 smith people  7298 Nov  7 01:10 fib
-rw-r--r--  1 smith people    21 Nov  7 01:13 fib_minimal.job
-rw-------  1 smith people     0 Nov  7 01:14 fib_minimal.job.e365240
-rw-------  1 smith people    21 Nov  7 01:14 fib_minimal.job.o365240
  • Note that the names of these output files are generated from the name of the batch file we submitted and the job number the system gave to our job. The o and e parts indicate the standard output and standard error files respectively. The fib_minimal.job.o365240 file contains:
fib(30)=832040
  • The fib_minimal.job.e365240 error file is empty.

Job Parameters

So far we have run a very simple, minimal batch job. You will normally run much longer jobs, so we need to talk about batch queues and time limits.

There are three batch queues in the current system: short, long and extralong. Each queue has limits on CPU time and wall clock time (walltime). The default walltime limits for the batch queues are:

queue      walltime
short      maximum 1 hour
long       maximum 7 days
extralong  maximum 31 days

If your batch job exceeds the walltime limit for its queue it will be terminated.

You can specify which queue to run on when you submit your job:

qsub -q long fib_minimal.job

If you don't specify a queue you get the short queue by default. That didn't matter for our little fib example, as the required time is very short. But for longer-running jobs you must consider your required times: if you don't specify a queue, you run on the short queue and get a maximum of one hour of walltime. If your job requires more than one hour of walltime you should submit it to the long queue (shown above).

If your job's requirements don't fit the default limits, there are ways to request extended limits using batch parameters. Suppose we had a very inefficient method to compute Fibonacci 30 that is expected to run for ten CPU hours. We would need a batch file like this:

#PBS -l cput=10:00:00
/data/smith/fib 30

Now when our job runs it will be terminated if the CPU time used exceeds ten hours. The cput parameter specifies the required CPU time in <hours>:<minutes>:<seconds> format.

Note that the queue's default walltime limit still applies. If your job needed more walltime, you could do:

#PBS -l cput=10:00:00
#PBS -l walltime=1000:00:00
/data/smith/fib 30

and now your walltime limit is 1000 hours and CPU limit is 10 hours.

Note that if you do specify required times in your batch file, you don't need to request a particular queue. The batch system is smart enough to send your job to the correct queue depending on your requirements.
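For example, this batch file requests 30 hours of CPU time without naming a queue, and the batch system will route the job to a queue whose limits can accommodate the request (a sketch; the requested time is illustrative):

#PBS -l cput=30:00:00
/data/smith/fib 30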

Job Name

If you run many different types of jobs it is useful to name them. You do that with the name parameter:

#PBS -N name_of_my_job

The name you specify this way appears against your job in a qstat listing.

Combining Output Files

If you run many jobs it may be inconvenient to receive the two output files (standard output and error). You can combine the two files into one with this parameter:

#PBS -j oe
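With -j oe, standard error is merged into the standard output file, so the job produces a single .o file and no .e file. A hypothetical listing after resubmitting our fib job with this parameter added (the job number and file details are illustrative):

$ qsub fib_minimal.job
365241.c3torque.cloud.coepp.org.au

$ ls -l fib_minimal.job.*365241
-rw-------  1 smith people    15 Nov  7 01:20 fib_minimal.job.o365241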

Getting Email

We have disabled email from the batch system, as it leads to our mail server being blocked for spam. The -m and -M directives shown in the earlier example are therefore accepted but have no effect.

Selecting a Queue

You can specify the queue you want your job to be submitted to in your script by using

#PBS -q long

This has the same effect as if you had submitted the job with

qsub -q long myfile

Debugging

It is difficult to understand what's happening in your batch job when it's running on another machine. But you can debug by putting commands in your batch file that help you.

For instance, if you submitted a batch file that contained only the command:

./fib 30

you would get an output file containing:

/var/lib/torque/mom_priv/jobs/365356.c3torque.cloud.coepp.org.au.SC: line 1: ./fib: No such file or directory

which isn't much help beyond telling you something is wrong.

Adding some debug to your batch file will help:

pwd
./fib 30

The output file now tells us what the problem is:

/scratch/workdir/smith
/var/lib/torque/mom_priv/jobs/365357.c3torque.cloud.coepp.org.au.SC: line 2: ./fib: No such file or directory

It turns out that the batch job starts running in our scratch working directory (/scratch/workdir/smith), which isn't what we expected.

There are ways to control what directory your batch job runs in, but it's better to set the directory explicitly in your batch file. So we could change the batch file to:

cd /data/smith
./fib 30

and this will run without error.
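Alternatively, you can use the $PBS_O_WORKDIR variable from the earlier minimal example, which Torque sets to the directory you submitted the job from, and a couple of extra debug lines cost nothing (a sketch):

#!/bin/bash

# Print some context to help with debugging
echo "running on:     $(hostname)"
echo "started in:     $(pwd)"
echo "submitted from: $PBS_O_WORKDIR"

# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR
./fib 30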

Using Directories

It is not good practice to put all your executables and batch files in your /data/<user_name> directory. You should have a sub-directory for every set of jobs you run in the cloud. That is, you would run the fib executable from within a /data/<user_name>/fib directory. Your batch file should move to that directory to run your executables, and any input data files should be there.
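For example, setting up a separate directory for the fib jobs might look like this (a sketch; input.dat stands in for any hypothetical input file):

$ mkdir -p /data/smith/fib
$ cp fib fib_minimal.job input.dat /data/smith/fib

and the batch file would then start with:

cd /data/smith/fib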

qstat

Once you have submitted one or more jobs to the batch system, you can get the status of the jobs by doing:

$ qstat
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
365364.c3torque            METScale_UP.sh   lspiller        00:00:54 R long
365365.c3torque            Wsys_DOWN.sh     lspiller        00:01:30 R long
365366.c3torque            Wsys_UP.sh       lspiller        00:00:59 R long
365367.c3torque            fib_minimal.job  smith                  0 Q short

This shows that there are four jobs in the system, three of which are running (R status). Your batch job isn't running yet; it is queued (Q status).

Warning! There can be thousands of running and queued jobs in the system.
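Since qstat lists everyone's jobs, it is usually more convenient to restrict the listing to your own with the standard -u option (using the example username smith):

$ qstat -u smith

Piping the output through grep works too:

$ qstat | grep smith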

qdel

It is possible to delete a batch job you have submitted. When you submit a job you get a unique ID number:

$ qsub fib_minimal.job
365240.c3torque.cloud.coepp.org.au

The 365240 number is your job number. To delete this job, whether it is queued or has started to run, just do:

$ qdel 365240

Command line options and batch file options

The batch file #PBS options we saw above, such as:

#PBS -N my_job
#PBS -q long

and all the others, are just standard options that qsub accepts on the command line, but with a prefix of “#PBS ”.

When you submit a job with qsub, any command-line options you specify override the same options in the job script file, if any. This means that even though you may have set a CPU limit of 10 hours in your job script, you can change the limit to 20 hours at submission time by doing:

qsub -l cput=20:00:00 my_job.sh

Questions, Reporting problems, etc

As always, if you have a question or want to report a problem, send email to:

rc@coepp.org.au

See Also

You may find tmux useful to manage connections to the interactive nodes.
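For example, a named tmux session survives a dropped SSH connection (standard tmux commands; the session name cloud is illustrative):

$ tmux new -s cloud      # start a named session on the interactive node
$ tmux attach -t cloud   # reattach after reconnecting

(Detach from a running session with Ctrl-b d.)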

There is a FAQ.
