CoEPP RC
 

A Walk in the Cloud

Introduction

This is a very simple introduction to using the cloud to run your jobs. If you know how to create a shell script to run your jobs, you know enough to use the cloud.

However, if you don't know anything about UNIX/Linux shell programming or text editing, work through an introductory tutorial on the UNIX shell first.

With that out of the way, let's get started!

Overview

The figure below shows a high level overview of how users can interact with computing resources in the new Tier 3 cloud system.

Using the Servers

Log into cxin01 or cxin02 interactive nodes:

ssh -Y <user_name>@cxin01.cloud.coepp.org.au
ssh -Y <user_name>@cxin02.cloud.coepp.org.au

cxin01 and cxin02 are for interactive use with 16 cores and 64GB memory. Users can use these as they wish. If we find demand exceeds capacity for these two, they will be upgraded.

Batch Jobs

Every cloud job has 4 steps:

  • Identify the program you want to run and its input and output data, then create a batch job script
  • Prepare the input files in the /data partition
  • Run the program on a batch node
  • Retrieve the output files from the /data partition

A Simple Batch Job

Here we create a very simple batch job that requires one input file and creates an output file.

Create a directory job_test under your home directory on cxin01 or cxin02. We assume here that your username is smith. In the directory create a file run_job.sh containing:

#!/bin/bash

# Set the name of this batch job
#PBS -N my_test

# Join standard and error job outputs into one file
#PBS -j oe

# get mail on job end
#PBS -m ae
#PBS -M joan.smith@example.com

# Set the maximum resource usage we expect from the job.
# This usually helps the scheduler work more efficiently.
#PBS -l ncpus=1
#PBS -l mem=512MB
#PBS -l vmem=512MB
#PBS -l walltime=0:01:00
#PBS -l cput=0:01:00

cd /data/smith/job_test
cat job_test.data > job_test.output
echo "Done!"

This simple job just copies the contents of a data file job_test.data to job_test.output and then prints “Done!”.
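The payload itself can be tried outside the batch system. Here is a minimal sketch that reproduces the job's work in a throwaway scratch directory (the mktemp path stands in for the real /data/smith/job_test area):

```shell
# Minimal sketch of the job's payload, run in a scratch directory.
# The mktemp path stands in for the real /data/smith/job_test area.
workdir=$(mktemp -d)
cd "$workdir"

echo "A simple text file." > job_test.data   # the input file
cat job_test.data > job_test.output          # the job's only real work
echo "Done!"                                 # ends up in the job output file
```

Running this prints "Done!" and leaves job_test.output containing an exact copy of job_test.data.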

That looks a little forbidding, but it's not too bad, and most of it won't change when you run different jobs. We'll step through each line and explain what it is, and whether you need to change it if your job requirements change.

The first parameter sets the name of the job while it is running in the cloud:

# Set the name of this batch job
#PBS -N my_test

You should change the job name my_test to something meaningful, but it isn't required - you can have fifty jobs running in the cloud all named my_test if you want. The cloud won't get confused, but you might!

The second parameter:

# Join standard and error job outputs into one file
#PBS -j oe

just combines the two job output files you would normally receive into one, making your directories a little less cluttered.

You may want to get email on job completion. This can help when you are starting out, as you get an email containing any errors that occurred. Getting email is optional, and you may want to turn it off if you are running a large number of jobs:

# get mail on job end
#PBS -m ae
#PBS -M joan.smith@example.com

The next set of parameters are more important:

# Set the batch job parameters.
#PBS -l ncpus=1
#PBS -l mem=512MB
#PBS -l vmem=512MB
#PBS -l walltime=0:01:00
#PBS -l cput=0:01:00

Here you tell the cloud what sort of node you want your job to run on, how much memory it will require (real and virtual) and how long it should take to run in wall-clock hours, minutes and seconds (not CPU time).

The ncpus=1 bit says you need only one CPU. Jobs that need more than one CPU are beyond the scope of this introduction.

The mem=512MB parameter says your job requires no more than 512MB of real memory to run, and the following vmem=512MB parameter says your job's virtual memory requirements are below 512MB. You can use the GB suffix to denote gigabytes of memory.

The parameter walltime=0:01:00 sets the wall-clock time required for your job to run. The format is HH:MM:SS indicating hours, minutes and seconds. There are other formats using only one or two time fields instead of the three shown here, but when starting out it's best to stick to the three-field form.
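A quick way to sanity-check a walltime request is to convert it to seconds. A small sketch, assuming bash:

```shell
# Convert a three-field walltime (HH:MM:SS) to total seconds.
walltime="0:01:00"
IFS=: read -r h m s <<< "$walltime"
seconds=$(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
echo "$seconds"   # 60
```

The 10# prefix forces base-10 arithmetic, so fields like "08" aren't misread as octal.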

Finally, the cput parameter sets the maximum amount of CPU time your batch job is allowed to use. The value you specify has the same format as the walltime above. If you don't set this value you get the default maximum configured in the batch system, which could be just one hour. So if you need a lot of CPU time, it's best to specify the amount you think you will need.

The memory, cput and walltime parameters set upper limits to the memory, CPU time and runtime that your job may use. If your job exceeds these limits it will be terminated. If you aren't sure of the memory/time requirements of your job it is usual to overestimate the limits, but don't get carried away. If you're not sure of the requirements of your job you can try running smaller versions (if possible) to get an idea of resource usage and work up to the final requirements. Note that you can run these exploratory jobs in the cloud or, if you are careful, on cxin01 or cxin02.
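One way to gauge the walltime a trial run needs is bash's built-in time. A sketch, with sleep 1 standing in for your real program:

```shell
# Time a trial run to help choose a walltime limit.
TIMEFORMAT='%R'                       # bash: report only real (wall-clock) seconds
elapsed=$( { time sleep 1 ; } 2>&1 )  # 'sleep 1' stands in for your program
echo "$elapsed"
```

Run a few trials at increasing problem sizes and extrapolate, then add a safety margin before setting walltime.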

The final part of the file shows the execution of your job (remember your job!):

cd /data/smith/job_test
cat job_test.data > job_test.output
echo "Done!"

This just copies the text in file job_test.data into file job_test.output. Note that the first line above moves to the /data/smith/job_test directory. When your job runs on the cloud it can't see your home directory, so you must write your job as if it runs somewhere under the /data/smith directory. You must also copy any input data files into your /data/smith directory before submitting the job from that directory, as the job expects the job_test.data file to be there.

Now create the input file job_test.data, which can contain anything you like. Here's an example:

A simple text file.

Turning things off in a batch script

When you are experimenting with job scripts you may want to temporarily remove some of the PBS commands in a script. Maybe you want to see what removing a command does. The simplest way is to just place one or more bash comment characters at the start of the line:

# Set the batch job parameters.
#PBS -l ncpus=1
###PBS -l mem=512MB
#PBS -l vmem=512MB
#PBS -l walltime=0:01:00

Here we turn off the mem option. This batch job would run with the default upper limit for memory.

Prepare the /data partition

We have a job script that copies one file into another, but we can't run it yet since our home directory (where you created the job script and input data file) doesn't exist on the cloud worker nodes.

The only filesystem common to the cxin nodes and the batch nodes is that under /data. If you look there you will see directories with the names of CoEPP users:

-bash-4.1$ ls -l /data
total 0
drwxr-xr-x 1 abangert   people 0 Jun 26 04:22 alfred
drwxr-xr-x 1 adunn      people 0 Jun 26 04:22 smith
drwxr-xr-x 1 yicai      people 0 Jun 26 04:23 xerxes
-bash-4.1$

You must submit jobs from the /data directory, and you must also copy any input data files to the /data directory. It's easiest if we just copy the entire job directory to the /data area:

cp -r /home/smith/job_test /data/smith

This creates the directory /data/smith/job_test which contains the run_job.sh script and the input data file (job_test.data).
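The staging step can be rehearsed anywhere. This sketch uses scratch directories in place of the real /home/smith and /data/smith paths:

```shell
# Rehearse staging: copy a job directory from a stand-in home area
# to a stand-in /data area, then confirm the files arrived.
home_dir=$(mktemp -d)   # stands in for /home/smith
data_dir=$(mktemp -d)   # stands in for /data/smith

mkdir "$home_dir/job_test"
echo "A simple text file." > "$home_dir/job_test/job_test.data"
printf '#!/bin/bash\necho "Done!"\n' > "$home_dir/job_test/run_job.sh"

cp -r "$home_dir/job_test" "$data_dir"   # same shape as the real copy
ls "$data_dir/job_test"
```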

Submitting the job

Finally we get to run our job in the cloud. You can test the job on the cxin01 or cxin02 nodes before running it in the cloud by simply executing the job script, as it's still just a shell program: all the Torque job parameters look like comments to the shell interpreter. You might want to scale down a real job so that it doesn't use a lot of time or other resources when you do this. And don't forget to change it back before submitting to the cloud!
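To see that the #PBS lines really are inert outside the batch system, you can dry-run a script that contains them. This sketch builds a minimal stand-in script in a scratch directory:

```shell
# The #PBS directives are plain comments to bash, so a job script
# can be dry-run locally. This builds a minimal stand-in script.
workdir=$(mktemp -d)
cat > "$workdir/run_job.sh" <<'EOF'
#!/bin/bash
#PBS -N my_test
#PBS -j oe
echo "Done!"
EOF
chmod +x "$workdir/run_job.sh"
result=$("$workdir/run_job.sh")
echo "$result"
```

The script runs to completion and prints "Done!"; bash never even notices the directives.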

Here's our directory /data/smith/job_test before we submit the job:

bash-4.1$ ls -l /data/smith/job_test
total 24
-rw-r--r-- 1 smith people   20 Jun 27 04:59 job_test.data
-rw-r--r-- 1 smith people  309 Jun 27 04:41 run_job.sh
bash-4.1$

Now we can submit our job:

bash-4.1$ qsub /data/smith/job_test/run_job.sh
11342.c3torque.cloud.coepp.org.au
bash-4.1$

We see that the job was accepted and was given the job number 11342.

Checking the job

Once you have submitted your job, you can check its progress by:

bash-4.1$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
11342.c3torque             my_test          smith                  0 Q short
bash-4.1$

We see the job with id of 11342. That job is currently queued because the status is Q.

If we do another qstat we see that the job is now running (status R) and the used CPU time is 00:00:00, i.e. 0 seconds so far.

bash-4.1$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
11342.c3torque             my_test          smith           00:00:00 R short
bash-4.1$

The queue name is short even though we did not request a particular queue. Jobs are submitted by default to batch, which is a routing queue, and the cloud works out that your job should run on the short queue since its requested walltime is only 1 minute.

If you are quick another qstat might show:

bash-3.2$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
11342.c3torque             my_test          smith           00:00:01 E short

Here our job has completed (the E status) after consuming 1 second of CPU time.

Examining output

If you look in the directory job_test under your /data/smith directory now you will see:

bash-3.2$ ls -l
total 1
-rw-r--r-- 1 smith people  20 Jun 27 05:48 job_test.data
-rw-r--r-- 1 smith people  20 Jul 30 03:28 job_test.output
-rw-r--r-- 1 smith people 544 Jul 29 06:27 run_job.sh
-rw------- 1 smith people   6 Jul 30 03:28 my_test.o11342

The file job_test.output has been copied back into your directory and contains the same text as the job_test.data file:

A simple text file.

Note the my_test.o11342 file which is the combined standard error and output file from your 11342 job. This file contains:

Done!

We also got an email on job termination:

PBS Job Id: 11342.c3torque.cloud.coepp.org.au
Job Name:   my_test
Exec host:  vm-118-138-241-121.erc.monash.edu.au/0
Execution terminated
Exit_status=0
resources_used.cput=00:00:00
resources_used.mem=0kb
resources_used.vmem=0kb
resources_used.walltime=00:00:01

Congratulations! Your job has completed successfully.

(Note that it can be a short while before your output file appears in the result directory.)

Reporting problems

If you have a problem, you can report it or get help at:

rc@coepp.org.au

See Also

tmux for a method to recover from disconnections to remote machines.

Cloud Home and FAQ

cloud/a_walk_in_the_cloud_old.txt · Last modified: 2013/11/07 16:11 by rwilson
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International