Spartan is the HPC cluster at the University of Melbourne. CoEPP and Astro (Michele Trenti) bought 2 dedicated GPU nodes and have access to an additional 3.
On Spartan, these nodes make up the physics-gpu partition (see the sinfo output below).
Sean and Lucien are able to invite people into the CoEPP/Astro project on Spartan. This project is called punim0011.
Whether you already have a Spartan account or still need one, please let rc@coepp.org.au know.
Once you have an account, you can log in to spartan.hpc.unimelb.edu.au via SSH:
ssh username@spartan.hpc.unimelb.edu.au
This is a login node and does not have access to the GPUs. To use the GPUs, you need to submit a job to our GPU nodes.
Spartan has a variety of disk resources available. The following are available on every node:
/home/<username> - your home directory
/data/cephfs/punim0011 - our project directory
/scratch - scratch directory for jobs
On our GPU nodes, we also have a very high speed PCI-E NVMe card
/var/local/tmp - PCI-E NVMe card. 1.5TB in size
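If your job is I/O heavy, a common pattern is to stage input data onto the NVMe card at the start of the job and copy results back to the project directory at the end. A minimal sketch of the relevant part of a job script; the dataset name, script name and results directory are placeholders, not real paths:

#!/bin/bash
# Stage input data onto the fast local NVMe card ($SLURM_JOB_ID is set by SLURM inside the job)
JOBDIR=/var/local/tmp/$SLURM_JOB_ID
mkdir -p $JOBDIR
cp -r /data/cephfs/punim0011/my_dataset $JOBDIR/

# Run the workload against the local copy (train.py is a placeholder)
python train.py --data $JOBDIR/my_dataset --out $JOBDIR/results

# Copy results back to the project directory and clean up the NVMe card
cp -r $JOBDIR/results /data/cephfs/punim0011/
rm -rf $JOBDIR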
Each of our GPU nodes has 12 cores (2×6 core)
Each of our GPU nodes has 256GB CPU RAM
Each of our GPU nodes has 4 GPUs (2×2 NVidia Tesla K80)
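If you want to check what has actually been handed to your job, a quick sketch (assuming the standard NVIDIA tools are installed on the GPU nodes, which they normally are with the driver) is:

# Run these from inside a job on a physics-gpu node, not on the login node
nvidia-smi                     # lists all GPUs physically present in the node
echo $CUDA_VISIBLE_DEVICES     # SLURM normally sets this to the GPUs assigned to your job when you request --gres=gpu:N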
Spartan uses the SLURM batch system (https://slurm.schedmd.com). Jobs are submitted in a slightly different way than with Torque.
For a list of the options to sbatch, see this page - https://slurm.schedmd.com/sbatch.html
The machines that make up Spartan are broken up into partitions
[scrosby@spartan ~]$ sinfo
PARTITION     AVAIL  TIMELIMIT  NODES  STATE   NODELIST
cloud*        up     30-00:00:0     4  mix     spartan-rc[001,062-063,098]
cloud*        up     30-00:00:0    89  alloc   spartan-rc[004-040,044-059,064-097,099-100]
cloud*        up     30-00:00:0     7  idle    spartan-rc[002-003,041-043,060-061]
cloudtest     up     30-00:00:0     1  mix     spartan-rc226
cloudtest     up     30-00:00:0    12  alloc   spartan-rc[227-230,233-240]
cloudtest     up     30-00:00:0     2  idle    spartan-rc[231-232]
physical      up     30-00:00:0     1  drain*  spartan-bm021
physical      up     30-00:00:0     1  alloc*  spartan-bm020
physical      up     30-00:00:0     2  mix     spartan-bm[004,015]
physical      up     30-00:00:0    17  alloc   spartan-bm[001-003,005-014,016,018,022-023]
physicaltest  up     30-00:00:0     2  idle    spartan-bm[017,019]
punim0095     up     30-00:00:0     2  mix     spartan-rc[104,260]
punim0095     up     30-00:00:0   143  alloc   spartan-rc[101-103,105-225,241-259]
water         up     30-00:00:0     1  mix     spartan-water01
water         up     30-00:00:0     2  idle    spartan-water[02-03]
ashley        up     30-00:00:0    10  idle    spartan-rc[261-270]
gpu           up     30-00:00:0     1  drain*  spartan-gpu005
gpu           up     30-00:00:0     2  idle*   spartan-gpu[001-002]
physics-gpu   up     30-00:00:0     1  drain*  spartan-gpu005
physics-gpu   up     30-00:00:0     2  idle*   spartan-gpu[001-002]
physics-gpu   up     30-00:00:0     2  idle    spartan-gpu[003-004]
debug         up     30-00:00:0     2  drain*  spartan-bm021,spartan-gpu005
debug         up     30-00:00:0     1  alloc*  spartan-bm020
debug         up     30-00:00:0     2  idle*   spartan-gpu[001-002]
debug         up     30-00:00:0     1  drain   spartan-rc000
debug         up     30-00:00:0    10  mix     spartan-bm[004,015],spartan-rc[001,062-063,098,104,226,260],spartan-water01
debug         up     30-00:00:0   261  alloc   spartan-bm[001-003,005-014,016,018,022-023],spartan-rc[004-040,044-059,064-097,099-103,105-225,227-230,233-259]
debug         up     30-00:00:0    25  idle    spartan-bm[017,019],spartan-gpu[003-004],spartan-rc[002-003,041-043,060-061,231-232,261-270],spartan-water[02-03]
The partition name is on the left, and the nodes which make up the partition are on the right.
We have exclusive access to the partition “physics-gpu”. You must specify the partition in your sbatch or sinteractive submission to be able to use our GPUs.
There is no such thing as a queue in SLURM. You just specify the requirements your job has, and SLURM will route it to the right node(s):
--time=hh:mm:ss
--mem-per-cpu=NG
--cpus-per-task=N
--gres=requirement
--partition=partition
--account=punim0011
e.g.
sbatch --time=48:00:00 --mem-per-cpu=20G --cpus-per-task=3 --gres=gpu:1 --partition=physics-gpu test.sh
will ask for 48 hours of walltime, 3 CPUs (each with 20GB of RAM), 1 GPU, and a node from the physics-gpu partition.
For 2 GPUs, use --gres=gpu:2 and --cpus-per-task=6. Do not alter --mem-per-cpu=20G.
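The same options can also be embedded at the top of the submission script itself as #SBATCH directives, which keeps the sbatch command line short. A minimal sketch, using the same resource requests as the example above (the work section is just a placeholder):

#!/bin/bash
#SBATCH --time=48:00:00
#SBATCH --mem-per-cpu=20G
#SBATCH --cpus-per-task=3
#SBATCH --gres=gpu:1
#SBATCH --partition=physics-gpu
#SBATCH --account=punim0011

# The actual work goes here
module load CUDA/8.0.44
python deepLearningTrain.py

With the directives in place, the job can be submitted with just sbatch test.sh; any options given on the sbatch command line override the ones in the script.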
The SLURM equivalent of qstat is squeue. It takes multiple options that let you limit or expand the output.
e.g.
To show the jobs in the queue in the physics-gpu partition:
[scrosby@spartan ~]$ squeue -p physics-gpu
  JOBID PARTITION     NAME   USER ST   TIME NODES NODELIST(REASON)
 471610 physics-g submitSl ahawth  R  19:06     1 spartan-gpu003
To show your own jobs (the example here is for user scrosby):
[scrosby@spartan ~]$ squeue -u scrosby
  JOBID PARTITION     NAME   USER ST   TIME NODES NODELIST(REASON)
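The flags can be combined. For example, to keep an eye on just your own jobs in the physics-gpu partition:

squeue -u $USER -p physics-gpu

# Refresh the view every 30 seconds (Ctrl-C to stop)
watch -n 30 squeue -u $USER -p physics-gpu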
To show job requirements, exit status, etc.:
[scrosby@spartan ~]$ scontrol show job 471621
JobId=471621 JobName=submitSlurm.sh
   UserId=ahawth(10278) GroupId=unimelb(10000) MCS_label=N/A
   Priority=56752 Nice=0 Account=punim0011 QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:03:57 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2017-01-17T14:24:57 EligibleTime=2017-01-17T14:24:57
   StartTime=2017-01-17T14:24:58 EndTime=2017-01-19T14:24:58 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=physics-gpu AllocNode:Sid=spartan:1319
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=spartan-gpu003
   BatchHost=spartan-gpu003
   NumNodes=1 NumCPUs=6 NumTasks=1 CPUs/Task=6 ReqB:S:C:T=0:0:*:*
   TRES=cpu=6,mem=120G,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=6 MinMemoryCPU=20G MinTmpDiskNode=0
   Features=(null) Gres=gpu:2 Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/ahawth/tensorFlowWork/rawAndKinematicCombined/submitSlurm.sh
   WorkDir=/home/ahawth/tensorFlowWork/rawAndKinematicCombined
   StdErr=/home/ahawth/tensorFlowWork/rawAndKinematicCombined/slurm-471621.out
   StdIn=/dev/null
   StdOut=/home/ahawth/tensorFlowWork/rawAndKinematicCombined/slurm-471621.out
   Power=
To kill a job with a given JOBID, use scancel. For example:
scancel 471610
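scancel also accepts a user filter, which is handy if you need to clear out all of your own jobs at once:

# Cancel every job belonging to your user (use with care)
scancel -u $USER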
To install the GPU build of TensorFlow into your home directory:
module load CUDA/8.0.44
wget https://bootstrap.pypa.io/ez_setup.py -O ez_setup.py
python ez_setup.py --user
easy_install --user pip
export LD_LIBRARY_PATH=/data/projects/punim0011/cuda/lib64:$LD_LIBRARY_PATH
pip install --user tensorflow-gpu
To run TensorFlow programs, first load the CUDA module:
module load CUDA/8.0.44
e.g.
[scrosby@spartan ~]$ module load CUDA/8.0.44
[scrosby@spartan ~]$ python
Python 2.7.5 (default, Aug 2 2016, 04:20:16)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
You'll also have to add that module load line to the bash script you submit to the queue.
For example in file submitSlurm.sh:
#!/bin/bash
module load CUDA/8.0.44
python deepLearningTrain.py
Here deepLearningTrain.py is the TensorFlow script. No changes are needed compared to CPU-only TensorFlow; the operations will be automatically allocated to the GPUs.
Run with:
sbatch -p physics-gpu --gres=gpu:2 --mem-per-cpu=20G --cpus-per-task=6 --time=48:00:00 submitSlurm.sh
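If you want to confirm that operations really are landing on the GPUs, one quick check is to turn on TensorFlow's device placement logging. A sketch; run it from inside a job on a physics-gpu node (e.g. via sinteractive), since the login node has no GPUs:

module load CUDA/8.0.44
# Multiply two random matrices and log which device each op is placed on;
# with a GPU available, the MatMul should be reported on /gpu:0
python -c "import tensorflow as tf; a = tf.random_normal([1000, 1000]); b = tf.random_normal([1000, 1000]); sess = tf.Session(config=tf.ConfigProto(log_device_placement=True)); sess.run(tf.matmul(a, b))"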