Spartan GPU

Spartan is the HPC cluster at the University of Melbourne. CoEPP and Astro (Michele Trenti) bought 2 dedicated GPU nodes and have access to an additional 3.

On Spartan, these nodes are:

  • spartan-gpu[003,004] - our dedicated nodes
  • spartan-gpu[001,002,005] - the HPC GPU nodes

Getting an account

Sean and Lucien are able to invite people into the CoEPP/Astro project on Spartan. This project is called punim0011.

Whether you already have a Spartan account or need one, please let rc@coepp.org.au know.

Logging in

Once you have an account, you can log in to spartan.hpc.unimelb.edu.au via SSH:

ssh username@spartan.hpc.unimelb.edu.au

This is a login node and does not have access to the GPUs. You need to submit a job to our GPU nodes to get access to the GPUs.
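
If you log in regularly, you can optionally add an entry to your ~/.ssh/config so that a plain "ssh spartan" works. A minimal sketch (the host alias "spartan" and the username are placeholders):

Host spartan
    HostName spartan.hpc.unimelb.edu.au
    User username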

Resources available

Disk

Spartan has a variety of disk resources available. The following are available on every node:

/home/<username> - your home directory
/data/cephfs/punim0011 - our project directory
/scratch - scratch directory for jobs

On our GPU nodes, we also have a very high speed PCI-E NVMe card:

/var/local/tmp - PCI-E NVMe card. 1.5TB in size
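
If your job is I/O heavy, it can be worth staging data onto the NVMe card at the start of the job and copying results back at the end. A minimal sketch of this inside a job script (the input and output file names are placeholders):

# stage input from the project directory onto the node-local NVMe card
mkdir -p /var/local/tmp/$SLURM_JOB_ID
cp /data/cephfs/punim0011/myinput.dat /var/local/tmp/$SLURM_JOB_ID/

# ... run your program against /var/local/tmp/$SLURM_JOB_ID ...

# copy the results back to the project directory and clean up
cp /var/local/tmp/$SLURM_JOB_ID/output.dat /data/cephfs/punim0011/
rm -rf /var/local/tmp/$SLURM_JOB_ID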

CPU

Each of our GPU nodes has 12 cores (2 × 6-core CPUs).

Memory

Each of our GPU nodes has 256 GB of CPU RAM.

GPU

Each of our GPU nodes has 4 GPUs (2 NVIDIA Tesla K80 cards, each containing 2 GPUs).

Submitting jobs

Spartan uses the SLURM batch system (https://slurm.schedmd.com). Submitting jobs works slightly differently from Torque:

  • sbatch submits a job to the queue
  • sinteractive requests an interactive job slot

For a list of the options to sbatch, see this page - https://slurm.schedmd.com/sbatch.html

Partitions

The machines that make up Spartan are broken up into partitions:

[scrosby@spartan ~]$ sinfo
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
cloud*          up 30-00:00:0      4    mix spartan-rc[001,062-063,098]
cloud*          up 30-00:00:0     89  alloc spartan-rc[004-040,044-059,064-097,099-100]
cloud*          up 30-00:00:0      7   idle spartan-rc[002-003,041-043,060-061]
cloudtest       up 30-00:00:0      1    mix spartan-rc226
cloudtest       up 30-00:00:0     12  alloc spartan-rc[227-230,233-240]
cloudtest       up 30-00:00:0      2   idle spartan-rc[231-232]
physical        up 30-00:00:0      1 drain* spartan-bm021
physical        up 30-00:00:0      1 alloc* spartan-bm020
physical        up 30-00:00:0      2    mix spartan-bm[004,015]
physical        up 30-00:00:0     17  alloc spartan-bm[001-003,005-014,016,018,022-023]
physicaltest    up 30-00:00:0      2   idle spartan-bm[017,019]
punim0095       up 30-00:00:0      2    mix spartan-rc[104,260]
punim0095       up 30-00:00:0    143  alloc spartan-rc[101-103,105-225,241-259]
water           up 30-00:00:0      1    mix spartan-water01
water           up 30-00:00:0      2   idle spartan-water[02-03]
ashley          up 30-00:00:0     10   idle spartan-rc[261-270]
gpu             up 30-00:00:0      1 drain* spartan-gpu005
gpu             up 30-00:00:0      2  idle* spartan-gpu[001-002]
physics-gpu     up 30-00:00:0      1 drain* spartan-gpu005
physics-gpu     up 30-00:00:0      2  idle* spartan-gpu[001-002]
physics-gpu     up 30-00:00:0      2   idle spartan-gpu[003-004]
debug           up 30-00:00:0      2 drain* spartan-bm021,spartan-gpu005
debug           up 30-00:00:0      1 alloc* spartan-bm020
debug           up 30-00:00:0      2  idle* spartan-gpu[001-002]
debug           up 30-00:00:0      1  drain spartan-rc000
debug           up 30-00:00:0     10    mix spartan-bm[004,015],spartan-rc[001,062-063,098,104,226,260],spartan-water01
debug           up 30-00:00:0    261  alloc spartan-bm[001-003,005-014,016,018,022-023],spartan-rc[004-040,044-059,064-097,099-103,105-225,227-230,233-259]
debug           up 30-00:00:0     25   idle spartan-bm[017,019],spartan-gpu[003-004],spartan-rc[002-003,041-043,060-061,231-232,261-270],spartan-water[02-03]

The partition name is on the left, and the nodes which make up the partition are on the right.

We have exclusive access to the “physics-gpu” partition. You must specify this partition in your sbatch or sinteractive submission to be able to use our GPUs.
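
For example, to get an interactive shell on one of our GPU nodes (sinteractive generally accepts the same resource options as sbatch; the time and GPU count below are just an illustration):

sinteractive --partition=physics-gpu --gres=gpu:1 --cpus-per-task=3 --mem-per-cpu=20G --time=02:00:00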

Queues

There is no such thing as a queue in SLURM. You simply specify the requirements your job has, and SLURM routes it to the right node(s).

Common options

  • walltime
--time=hh:mm:ss
  • memory
--mem-per-cpu=NG
  • cpus
--cpus-per-task=N
  • generic resources (gres)
--gres=requirement
  • partition
--partition=partition
  • account - This is helpful if you are part of more than one project on Spartan. You can change the account (i.e. group/project) your job runs under
--account=punim0011

e.g.

sbatch --time=48:00:00 --mem-per-cpu=20G --cpus-per-task=3 --gres=gpu:1 --partition=physics-gpu test.sh

will ask for 48 hours of walltime, 3 CPUs with 20 GB of RAM each (60 GB in total), 1 GPU, and a node from the physics-gpu partition.

For 2 GPUs, use --gres=gpu:2 and --cpus-per-task=6. Do not alter --mem-per-cpu=20G.
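
If you prefer, the same request can go into the script itself as #SBATCH directives so the sbatch command line stays short. A sketch of what the top of test.sh could look like (the final command is a placeholder):

#!/bin/bash
#SBATCH --time=48:00:00
#SBATCH --mem-per-cpu=20G
#SBATCH --cpus-per-task=3
#SBATCH --gres=gpu:1
#SBATCH --partition=physics-gpu
#SBATCH --account=punim0011

# your actual workload goes here
python myprogram.py

It can then be submitted with a plain "sbatch test.sh". Options given on the sbatch command line override the #SBATCH lines in the script.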

Getting job status

The SLURM equivalent of qstat is squeue. It takes multiple options that allow you to limit or expand the output.

e.g.

To show the jobs queued in the physics-gpu partition:

[scrosby@spartan ~]$ squeue -p physics-gpu
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            471610 physics-g submitSl   ahawth  R      19:06      1 spartan-gpu003

To show your own jobs:

[scrosby@spartan ~]$ squeue -u scrosby
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
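
The output columns can also be customised with squeue's -o/--format option if the defaults are not enough; the format string below is just an illustration:

squeue -p physics-gpu -o "%.10i %.9P %.20j %.10u %.2t %.10M %R"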

To show a job's requirements, exit status, etc.:

[scrosby@spartan ~]$ scontrol show job 471621
JobId=471621 JobName=submitSlurm.sh
   UserId=ahawth(10278) GroupId=unimelb(10000) MCS_label=N/A
   Priority=56752 Nice=0 Account=punim0011 QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:03:57 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2017-01-17T14:24:57 EligibleTime=2017-01-17T14:24:57
   StartTime=2017-01-17T14:24:58 EndTime=2017-01-19T14:24:58 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=physics-gpu AllocNode:Sid=spartan:1319
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=spartan-gpu003
   BatchHost=spartan-gpu003
   NumNodes=1 NumCPUs=6 NumTasks=1 CPUs/Task=6 ReqB:S:C:T=0:0:*:*
   TRES=cpu=6,mem=120G,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=6 MinMemoryCPU=20G MinTmpDiskNode=0
   Features=(null) Gres=gpu:2 Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/ahawth/tensorFlowWork/rawAndKinematicCombined/submitSlurm.sh
   WorkDir=/home/ahawth/tensorFlowWork/rawAndKinematicCombined
   StdErr=/home/ahawth/tensorFlowWork/rawAndKinematicCombined/slurm-471621.out
   StdIn=/dev/null
   StdOut=/home/ahawth/tensorFlowWork/rawAndKinematicCombined/slurm-471621.out
   Power=
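
scontrol show job only works while a job is still known to the controller (pending, running, or recently finished). For older jobs, assuming accounting is enabled on Spartan, the accounting database can be queried with sacct, e.g.:

sacct -j 471621 --format=JobID,JobName,Partition,State,ExitCode,Elapsed,MaxRSS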

To kill a job with a given JOBID, use scancel. For example:

scancel 471610
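
To cancel all of your own jobs at once, you can pass your username instead of a job ID:

scancel -u username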

TensorFlow

  • To install TensorFlow:
module load CUDA/8.0.44
wget https://bootstrap.pypa.io/ez_setup.py -O ez_setup.py
python ez_setup.py --user
easy_install --user pip
export LD_LIBRARY_PATH=/data/projects/punim0011/cuda/lib64:$LD_LIBRARY_PATH
pip install --user tensorflow-gpu
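
A quick way to check that the install worked from the login node (the GPUs themselves are only visible from within a job on the GPU nodes):

module load CUDA/8.0.44
python -c "import tensorflow; print(tensorflow.__version__)"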

To run TensorFlow programs, first load the CUDA module:

module load CUDA/8.0.44

e.g.

[scrosby@spartan ~]$ module load CUDA/8.0.44
[scrosby@spartan ~]$ python
Python 2.7.5 (default, Aug  2 2016, 04:20:16)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally

You'll also need to add these lines to the bash script you submit to the queue.

For example in file submitSlurm.sh:

#!/bin/bash
module load CUDA/8.0.44

python deepLearningTrain.py

Here deepLearningTrain.py is your TensorFlow script. No changes are needed compared with CPU-only TensorFlow; the operations will be allocated to the GPUs automatically.

Run with:

sbatch -p physics-gpu --gres=gpu:2 --mem-per-cpu=20G --cpus-per-task=6 --time=48:00:00 submitSlurm.sh
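
If you want to confirm from the job output that the GPUs were picked up, one option is to add an nvidia-smi call to the submit script before the python line; its output, together with TensorFlow's device messages, ends up in the slurm-<jobid>.out file. For example:

#!/bin/bash
module load CUDA/8.0.44

# list the GPUs visible on the node before starting the training run
nvidia-smi

python deepLearningTrain.py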