How do I?

ssh user@linux.cs.uchicago.edu
man sbatch
user@csilcomputer:~$ srun --pty --mem 500 /bin/bash 
user@slurm1:~$ mkdir -p /scratch/$USER
user@slurm1:~$ cd /scratch/$USER/
user@slurm1:/scratch/user$ scp user@csilcomputer:~/foo .
foo                         100%  103KB 102.7KB/s   00:00    
user@slurm1:/scratch/user$ ls -l foo 
-rw------- 1 user user 105121 Dec 29  2015 foo
 mkdir -p $HOME/slurm/out
#!/bin/bash
#
#SBATCH --mail-user=cnetid@cs.uchicago.edu
#SBATCH --mail-type=ALL
#SBATCH --output=/home/cnetid/slurm/out/%j.%N.stdout
#SBATCH --error=/home/cnetid/slurm/out/%j.%N.stderr
#SBATCH --workdir=/home/cnetid/slurm
#SBATCH --partition=debug
#SBATCH --job-name=check_hostname_of_node
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=500
#SBATCH --time=15:00

hostname
man sbatch
sbatch hostname.job
user@host:~$ srun -n2 hostname
research2
research2
user@host:~$ srun -n1 sleep 400
user@host:~$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   29     debug    sleep     user  R       0:11      1 research2
scancel 29
user@host:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up       30:00      1   idle slurm1
general      up 14-00:00:00      6   idle slurm[2-6,8]
pascal       up  3-00:00:00      1   idle gpu2
tesla        up  3-00:00:00      1   idle gpu1
squeue -u cnetid
squeue -j 7894
 srun -p general --pty --cpus-per-task 1 --mem 500 -t 0-06:00 /bin/bash
sbatch --gres=gpu:titan:2 ....
  --gpu=gpu:N    # where 'N' is the number of GPUs requested.
                 # Please try to limit yourself to one GPU per person.
Depends on:
  ''%%pip3 install --user tensorflow-gpu%%''
  ''%%export PATH=$HOME/.local/bin:$PATH%%''
<code>
  #!/usr/bin/env python3
  from tensorflow.python.client import device_lib
  print(device_lib.list_local_devices())
</code>
  kauffman3@bulldozer:~$ srun -p titan --pty /bin/bash
  kauffman3@gpu3:~$ ./f 2>&1 | grep physical_device_desc
  kauffman3@gpu3:~$
  kauffman3@bulldozer:~$ srun -p titan --pty --gres=gpu:1 /bin/bash
  kauffman3@gpu3:~$ ./f 2>&1 | grep physical_device_desc
  physical_device_desc: "device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:19:00.0, compute capability: 6.1"
kauffman3@bulldozer:~$ srun -p titan --pty --gres=gpu:2 /bin/bash
kauffman3@gpu3:~$ ./f 2>&1 | grep physical_device_desc
  physical_device_desc: "device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:19:00.0, compute capability: 6.1"
  physical_device_desc: "device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:1a:00.0, compute capability: 6.1"
  kauffman3@bulldozer:~$ srun -p titan --pty --gres=gpu:5 /bin/bash
  srun: error: Unable to allocate resources: Requested node configuration is not available
$ sinfo -O partition,nodelist,gres,features,available
PARTITION           NODELIST            GRES                FEATURES            AVAIL               
debug*              slurm1              (null)              (null)              up                  
general             slurm[2-6,8]        (null)              (null)              up                  
pascal              gpu2                gpu:gtx1080:1       'pascal,gtx1080'    up                  
titan               gpu3                gpu:gtx1080ti:4     'pascal,gtx1080ti'  up 
export PATH=$PATH:/usr/local/cuda/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH=/usr/local/cuda/lib
#!/bin/bash
#
#SBATCH --mail-user=cnetid@cs.uchicago.edu
#SBATCH --mail-type=ALL
#SBATCH --output=/home/cnetid/slurm/slurm_out/%j.%N.stdout
#SBATCH --error=/home/cnetid/slurm/slurm_out/%j.%N.stderr
#SBATCH --workdir=/home/cnetid/slurm
#SBATCH --partition=gpu
#SBATCH --job-name=get_tesla_info

export PATH=$PATH:/usr/local/cuda/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH=/usr/local/cuda/lib

cat << EOF > /tmp/getinfo.cu
#include <stdio.h>

int main() {
  int nDevices;

  cudaGetDeviceCount(&nDevices);
  for (int i = 0; i < nDevices; i++) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    printf("Device Number: %d\n", i);
    printf("  Device name: %s\n", prop.name);
    printf("  Memory Clock Rate (KHz): %d\n",
           prop.memoryClockRate);
    printf("  Memory Bus Width (bits): %d\n",
           prop.memoryBusWidth);
    printf("  Peak Memory Bandwidth (GB/s): %f\n\n",
           2.0*prop.memoryClockRate*(prop.memoryBusWidth/8)/1.0e6);
  }
}
EOF

/usr/local/cuda/bin/nvcc /tmp/getinfo.cu -o /tmp/a.out
/tmp/a.out
rm /tmp/a.out
rm /tmp/getinfo.cu
cnetid@linux1:~$ cat $HOME/slurm/slurm_out/12567.gpu1.stdout 
Device Number: 0
  Device name: Tesla M2090
  Memory Clock Rate (KHz): 1848000
  Memory Bus Width (bits): 384
  Peak Memory Bandwidth (GB/s): 177.408000

Partition Name	Description
debug	The partition your job will be submitted to if none is specified. The purpose of this partition is to make sure your code is running as it should before submitting a long running job to the general queue.
general	All jobs that have been thoroughly tested can be submitted here. This partition will have access to more nodes and will process most of the jobs. If you need to use the `--exclusive` flag it should be done here.
gpu	Contains servers with graphics cards. As of May 2016 there is only one node containing a Tesla M2090. You will be forced to use this server exclusively for now. Please keep your time in interactive mode to a minimum.

	SLURM	Example
Submit a batch serial job	sbatch	sbatch runscript.sh
Run a script interatively	srun	srun –pty -p interact -t 10 –mem 1000 /bin/bash /bin/hostname
Kill a job	scancel	scancel 4585
View status of queues	squeue	squeue -u cnetid
Check current job by id	sacct	sacct -j 999999

Error	What does it mean?
JOB <jobid> CANCELLED AT <time> DUE TO TIME LIMIT	You did not specify enough time for your job to run. The `-t` flag will allow you to set the time limit.
Job <jobid> exceeded <mem> memory limit, being killed	Your job is attempting to use more memory that you have requested for it. Either increase the amount of memory you have requested or reduce the amount of memory usage your application is trying to use.
JOB <jobid> CANCELLED AT <time> DUE TO NODE FAILURE	There can be many reasons for this message, but most often it means that the node your job was set to run on can no longer be contacted by the the SLURM controller.
error: Unable to allocate resources: More processors requested than permitted	It usually has nothing to do with priviledges you may or may not have. Rather, it usually means that you have allocated more processors than one compute node actually has.

How do I?

Table of Contents

Notice

Peanut Job Submission Cluster

Where to begin

Mailing List

Documentation

Resources

Infrastructure

Hardware

Storage

Access

Example

Performance is slow

Utilization Dashboard

Partitions / Queues

Job Submission

Command Summary

Usage

Default Quotas

Exclusive access to a node

sbatch

Sample script

Submitting job script

srun

squeue

scancel

sinfo

Monitoring Jobs

Interactive Jobs

Job Scheduling

Common Issues

Using the GPU

GRES Multiple GPU's on one system

Ok, but I don't want to read the wall of text above

Cool, but how do I know where and what resources are available

Paths

Example

Output

More