====== A Walk in the Cloud ======
  
====== Introduction ======
  
This is a very simple introduction to using the cloud to run your jobs.  If you know how to create a shell script to run your jobs, you know enough to use the cloud.
  
However, if you don't know anything about UNIX/Linux shell programming or editing, have a look at these links first:
  * [[http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO.html|Shell programming]]
  * Using the [[http://www.library.yale.edu/wsg/docs/vi/|vi editor]], the [[http://www2.lib.uchicago.edu/keith/tcl-course/emacs-tutorial.html|emacs editor]] or the [[http://mintaka.sdsu.edu/reu/nano.html|nano editor]]
  
With that out of the way, let's get started!

===== Overview =====
  
The figure below shows a high level overview of how users can interact with computing resources in the new Tier 3 cloud system.
  
{{:cloud:tier3_cloud.png?direct&600|}}
  
===== Using the Servers =====

Log into the **cxin01** or **cxin02** interactive nodes:
  ssh -Y <user_name>@cxin01.cloud.coepp.org.au
  ssh -Y <user_name>@cxin02.cloud.coepp.org.au

cxin01 and cxin02 are for interactive use, with 16 cores and 64GB of memory. Users can use them as they wish. If demand exceeds the capacity of these two nodes, they will be upgraded.
====== Batch Jobs ======
  
Every cloud job has 4 steps:
  * Identify the program you want to run and the data that is input to and output from it, and create a batch job script
  * Prepare files in the /data partition
  * Run the program on a batch node
  * Retrieve files from the /data partition
  
===== A Simple Batch Job =====

Here we create a very simple batch job that requires one input file and creates an output file.

Create a directory **job_test** under your home directory on cxin01 or cxin02.  We assume here that your username is **smith**.  In that directory create a file **run_job.sh** containing:
<code>
#!/bin/bash

# Set the name of this batch job
#PBS -N my_test

# Join standard and error job outputs into one file
#PBS -j oe

# Get mail on job end
#PBS -m ae
#PBS -M joan.smith@example.com

# Set the maximum resource usage we expect from the job.
# This usually helps the scheduler work more efficiently.
#PBS -l ncpus=1
#PBS -l mem=512MB
#PBS -l vmem=512MB
#PBS -l walltime=0:01:00
#PBS -l cput=0:01:00

cd /data/smith/job_test
cat job_test.data > job_test.output
echo "Done!"
</code>
  
This simple job just copies the contents of data file **job_test.data** to **job_test.output** and then prints "Done!".
  
That looks a little forbidding, but it's not too bad, and most of it won't change when you run different jobs.  We'll step through each line and explain what it does, and whether you need to change it if your job requirements change.
  
The first parameter sets the name of the job while it is running in the cloud:
  # Set the name of this batch job
  #PBS -N my_test
You should make an attempt to change the job name **my_test** to something meaningful, but it isn't required - you can have fifty jobs running in the cloud all named **my_test** if you want.  The cloud won't get confused, but //you// might!
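For example, a more descriptive name (purely an illustration) might be:
  #PBS -N electron_fit_2013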
  
The second parameter:
<code>
# Join standard and error job outputs into one file
#PBS -j oe
</code>
just combines the **two** job output files you would normally receive into one, making your directories a little less cluttered.
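Without **-j oe** a job normally leaves two files behind, named after the job name and id; for a job numbered **11342** (the number used in the examples later on) they would be:
<code>
my_test.o11342
my_test.e11342
</code>
With **-j oe** both streams end up in the single **my_test.o11342** file.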
  
You may want to get email on job completion.  This can help when you are starting out, as the email contains any errors that occur.
Getting email is optional, and you may want to turn it off if you are running very many jobs:
<code>
# Get mail on job end
#PBS -m ae
#PBS -M joan.smith@example.com
</code>
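The **ae** after **-m** asks for mail on job **a**bort and job **e**nd.  If you only want mail when a job aborts, you can use:
<code>
# Get mail only if the job aborts
#PBS -m a
</code>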
  
The next set of parameters is more important:
<code>
# Set the batch job parameters.
#PBS -l ncpus=1
#PBS -l mem=512MB
#PBS -l vmem=512MB
#PBS -l walltime=0:01:00
#PBS -l cput=0:01:00
</code>
Here you tell the cloud what sort of cloud node you want your job to run on, how much memory it will require (real and virtual), and how long it should take to run in wall-clock hours, minutes and seconds (not CPU time).
  
The **ncpus=1** bit says you need only one CPU.  Jobs that need more than one CPU are beyond the scope of this introduction.
  
The **mem=512MB** parameter says your job requires no more than 512MB of real memory to run, and the following **vmem=512MB** parameter says your job's virtual memory requirements are below 512MB.  You can use the //GB// suffix to denote gigabytes of memory.
  
The **walltime=0:01:00** parameter sets the wall-clock time required for your job to run.  The format is //HH:MM:SS//, indicating hours, minutes and seconds.  There are other formats using only one or two time fields instead of the three shown here, but when starting out it's best to stick to the three-field form.
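As a rough sketch of the shorter forms (assuming the usual Torque time syntax, where a bare number is seconds and two fields are minutes and seconds):
<code>
# 90 seconds
#PBS -l walltime=90
# 30 minutes
#PBS -l walltime=30:00
# 2 hours
#PBS -l walltime=2:00:00
</code>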
  
Finally, the **cput** parameter sets the maximum amount of CPU time your batch job is allowed to use.  The value you specify has the same format as the **walltime** above.  If you don't set this value you get the default maximum configured in the batch system, which could be just one hour.  So if you need a lot of CPU time, it's best to specify the amount you think you will need.
  
The memory, cput and walltime parameters set upper limits on the memory, CPU time and runtime that your job may use.  If your job exceeds these limits it will be terminated.  If you aren't sure of your job's memory or time requirements it is usual to overestimate the limits, but don't get carried away.  You can also try running smaller versions of your job (if possible) to get an idea of resource usage and work up to the final requirements.  Note that you can run these exploratory jobs //in the cloud// or, if you are careful, on cxin01 or cxin02.
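For example, a longer job needing about 4GB of memory and ten hours of CPU time might request something like the following (the numbers are only an illustration, adjust them to your own job):
<code>
#PBS -l ncpus=1
#PBS -l mem=4GB
#PBS -l vmem=4GB
#PBS -l walltime=12:00:00
#PBS -l cput=10:00:00
</code>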
  
The final part of the file shows the execution of your job (remember your job!):
<code>
cd /data/smith/job_test
cat job_test.data > job_test.output
echo "Done!"
</code>
This just copies the text in the file **job_test.data** into the file **job_test.output**.  Note that the first line above moves to the **/data/smith/job_test** directory.  When your job runs on the cloud it can't see your home directory, so you must write your job as if it runs somewhere under the **/data/smith** directory.  You must copy any input data files to your **/data/smith** directory before submitting the job __from that directory__, as the job expects the **job_test.data** file to be there.
  
Now create the input file **job_test.data**, which can contain anything you like.  Here's an example:
<code>
A simple text file.
</code>
  
===== Turning things off in a batch script =====
  
When you are experimenting with job scripts you may want to temporarily remove some of the PBS commands in a script.  Maybe you want to see what removing a command does.  The simplest way is to place one or more extra bash comment characters at the start of the line:
<code>
# Set the batch job parameters.
#PBS -l ncpus=1
###PBS -l mem=512MB
#PBS -l vmem=512MB
#PBS -l walltime=0:01:00
</code>
Here we turn off the **mem** option.  This batch job would run with the default upper limit for memory.
  
===== Prepare the /data partition =====

We have a job script that copies one file into another, but we can't run it yet, since our home directory (where you created the job script and input data file) doesn't exist on the cloud worker nodes.
  
The only filesystem common to the cxin nodes and the batch nodes is the one under **/data**.  If you look there you will see directories with the names of CoEPP users:
<code>
bash-4.1$ ls -l /data
total 0
drwxr-xr-x 1 alfred people 0 Jun 26 04:22 alfred
drwxr-xr-x 1 smith  people 0 Jun 26 04:22 smith
drwxr-xr-x 1 xerxes people 0 Jun 26 04:23 xerxes
bash-4.1$
</code>
  
You must submit jobs that create data files from the **/data** directory, and you must also copy any input data files to the **/data** directory.
It's easier if we just copy the entire job directory to the **/data** area:
<code>
cp -r /home/smith/job_test /data/smith
</code>
  
This creates the directory **/data/smith/job_test**, which contains the **run_job.sh** script and the input data file (**job_test.data**).

===== Submitting the job =====
  
Finally we get to run our job in the cloud.  You could test the job on the cxin01 or cxin02 nodes before running it in the cloud by just running the job script file, as it's still just a shell program; all the torque job parameters look like comments to the shell interpreter.  For a real job you might want to change it so that it doesn't take a lot of time or other resources when you do this, and don't forget to change it back before submitting to the cloud!
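A quick sketch of such a test run on one of the interactive nodes (assuming the files are already under **/data/smith/job_test**):
<code>
bash-4.1$ cd /data/smith/job_test
bash-4.1$ bash run_job.sh
Done!
</code>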
  
Here's our directory **/data/smith/job_test** before we submit the job:
<code>
bash-4.1$ ls -l /data/smith/job_test
total 24
-rw-r--r-- 1 smith people   20 Jun 27 04:59 job_test.data
-rw-r--r-- 1 smith people  309 Jun 27 04:41 run_job.sh
bash-4.1$
</code>

Now we can submit our job:
<code>
bash-4.1$ qsub /data/smith/job_test/run_job.sh
11342.c3torque.cloud.coepp.org.au
bash-4.1$
</code>
  
We see that the job was accepted and was given the job number of **11342**.
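If you realise you have made a mistake, you can remove a queued or running job with ''qdel'' and the job number (shown here with our example job's number):
<code>
bash-4.1$ qdel 11342
</code>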
===== Checking the job =====
  
Once you have submitted your job, you can check its progress with ''qstat'':
<code>
bash-4.1$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
11342.c3torque             my_test          smith                  0 Q short
bash-4.1$
</code>
  
We see the job with id **11342**.  The job is currently queued, because the status is **Q**.
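If you want more detail about a particular job (the resources requested, which node it is on, and so on) you can ask for the full listing (a sketch using our example job number):
<code>
bash-4.1$ qstat -f 11342
</code>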
  
If we do another ''qstat'' we see that the job is now running (status **R**) and the used CPU time is **00:00:00**, i.e. 0 seconds so far.
<code>
bash-4.1$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
11342.c3torque             my_test          smith           00:00:00 R short
bash-4.1$
</code>
  
The queue name is **short**.  Submitted jobs go through the **batch** routing queue, and the cloud figures out that your job should run on the **short** queue since its required walltime is only 1 minute.
  
If you are quick, another ''qstat'' might show:
<code>
bash-4.1$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
11342.c3torque             my_test          smith           00:00:01 E short
</code>
  
Here our job has completed (the **E** status) after consuming 1 second of CPU time.

===== Examining output =====
  
If you look in the directory ''job_test'' under your **/data/smith** directory now you will see:
<code>
bash-4.1$ ls -l
total 1
-rw-r--r-- 1 smith people  20 Jun 27 05:48 job_test.data
-rw-r--r-- 1 smith people  20 Jul 30 03:28 job_test.output
-rw-r--r-- 1 smith people 544 Jul 29 06:27 run_job.sh
-rw------- 1 smith people   6 Jul 30 03:28 my_test.o11342
</code>
  
The file **job_test.output** has been created by the job and contains the same text as the **job_test.data** file:
<code>
A simple text file.
</code>
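You can confirm that the input and output files really are identical with ''diff'', which prints nothing when the two files match:
<code>
bash-4.1$ diff job_test.data job_test.output
bash-4.1$
</code>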
  
Note the **my_test.o11342** file, which is the combined standard output and error file from your **11342** job.  This file contains:
<code>
Done!
</code>
  
We also got an email on job termination:
<code>
PBS Job Id: 11342.c3torque.cloud.coepp.org.au
Job Name:   my_test
Exec host:  vm-118-138-241-121.erc.monash.edu.au/0
Execution terminated
Exit_status=0
resources_used.cput=00:00:00
resources_used.mem=0kb
resources_used.vmem=0kb
resources_used.walltime=00:00:01
</code>
  
Congratulations!  Your job has completed successfully.
  
(Note that it can be a short while before your output file appears in the result directory.)

====== Reporting problems ======
  
If you have a problem, you can report it or get help at:

<code>
rc@coepp.org.au
</code>
  
====== See Also ======
  
See [[:cloud:tmux|tmux]] for a method to recover from disconnections to remote machines.
  
[[:cloud|Cloud Home]] and [[cloud:faq|FAQ]]