The cluster is up and running now. Anyone with a CS account who wishes to test it out should do so.
Feedback is requested via the #ai-cluster Discord channel or by email to Phil Kauffman (kauffman@cs dot uchicago dot edu).
Prior knowledge of Slurm is preferred at this stage of testing.
The information from the older cluster mostly applies, and I suggest you read that documentation: https://howto.cs.uchicago.edu/techstaff:slurm
Summary of nodes installed on the cluster
kauffman3 is my CS test account.
$ ssh firstname.lastname@example.org
I've created a couple of scripts that run some of the Slurm commands but with more useful output. cs-sinfo and cs-squeue are the only two right now.
kauffman3@fe01:~$ cs-sinfo
NODELIST    NODES  PARTITION  STATE  CPUS  S:C:T   MEMORY  TMP_DISK  WEIGHT  AVAIL_FEATURES                  REASON  GRES
a[001-006]  6      geforce*   idle   64    2:16:2  190000  0         1       'turing,geforce,rtx2080ti,11g'  none    gpu:rtx2080ti:4
a[007-008]  2      quadro     idle   64    2:16:2  383000  0         1       'turing,quadro,rtx8000,48g'     none    gpu:rtx8000:4
kauffman3@fe01:~$ cs-squeue
JOBID PARTITION USER NAME NODELIST TRES_PER_NSTATE TIME
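To read the cs-sinfo table as data, here is a minimal sketch that totals GPUs per partition from sinfo-style rows. The sample rows are copied from the cs-sinfo output above. The function name count_gpus is hypothetical, and the sketch assumes the GRES column comes last and ends in the per-node GPU count.

```shell
#!/bin/bash
# Hypothetical helper: total GPUs per partition from sinfo-style rows on stdin.
# Assumes column 2 is NODES, column 3 is PARTITION, and the last column is
# GRES in the form gpu:<model>:<count per node>.
count_gpus() {
  awk 'NR > 1 {
         n = split($NF, gres, ":")   # gpu:rtx2080ti:4 -> count is the last piece
         part = $3
         sub(/\*$/, "", part)        # drop the default-partition marker
         total[part] += $2 * gres[n]
       }
       END { for (p in total) print p, total[p] }' | sort
}

# Sample rows copied from the cs-sinfo output shown above.
sample="NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FEATURES REASON GRES
a[001-006] 6 geforce* idle 64 2:16:2 190000 0 1 'turing,geforce,rtx2080ti,11g' none gpu:rtx2080ti:4
a[007-008] 2 quadro idle 64 2:16:2 383000 0 1 'turing,quadro,rtx8000,48g' none gpu:rtx8000:4"

printf '%s\n' "$sample" | count_gpus
```

With the sample rows this prints geforce 24 and quadro 8, i.e. 32 GPUs in total.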
# List the device number of the devices I've requested from Slurm.
# These numbers map to /dev/nvidia?
kauffman3@fe01:~$ cat ./show_cuda_devices.sh
#!/bin/bash
hostname
echo $CUDA_VISIBLE_DEVICES
Give me all four GPUs on systems 1-6
kauffman3@fe01:~$ srun -p geforce --gres=gpu:4 -w a[001-006] ./show_cuda_devices.sh
a001
0,1,2,3
a002
0,1,2,3
a006
0,1,2,3
a005
0,1,2,3
a004
0,1,2,3
a003
0,1,2,3
# give me all GPUs on systems 7-8
# these are the Quadro RTX 8000s
kauffman3@fe01:~$ srun -p quadro --gres=gpu:4 -w a[007-008] ./show_cuda_devices.sh
a008
0,1,2,3
a007
0,1,2,3
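As noted above, the numbers in CUDA_VISIBLE_DEVICES map to /dev/nvidia? device files. The following is a small sketch of that mapping. The function name cuda_device_paths is hypothetical, and it assumes the variable holds a comma-separated list of integers, as in the srun output above.

```shell
#!/bin/bash
# Hypothetical helper: expand a CUDA_VISIBLE_DEVICES-style list into device paths.
# Assumes the list is comma-separated integers, as in the srun output above.
cuda_device_paths() {
  local list=$1 dev
  local IFS=','
  for dev in $list; do
    printf '/dev/nvidia%s\n' "$dev"
  done
}

# Inside a job this uses the value Slurm set; outside a job it falls back to a sample.
cuda_device_paths "${CUDA_VISIBLE_DEVICES:-0,1,2,3}"
```

Calling this from a job script, next to the echo in show_cuda_devices.sh, would list one device path per GPU granted to the job.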
By default, all usage is tracked and charged to a user's default account. A fairshare value is computed and used to prioritize a job on submission.
Details are being worked out for anyone who donates to the cluster. This will be some sort of tiered system where you get to use a higher priority when you need it.
You will need to charge an account on job submission with
--account=<name> and most likely select the priority level you wish to use and are allowed to use:
Do we have a max job runtime?
Yes: 4 hours. The limit is set per partition. You are expected to write your code to accommodate it.
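One way to accommodate the limit is to track elapsed time and save state shortly before the 4 hours run out. The sketch below covers only the time-budget check. The function names and the 300-second margin are assumptions for illustration, not part of the cluster setup.

```shell
#!/bin/bash
# Hypothetical sketch: decide when to checkpoint before the 4-hour limit.
LIMIT_SECONDS=$((4 * 60 * 60))   # the 4-hour maximum job runtime
MARGIN_SECONDS=300               # assumed safety margin for writing a checkpoint

# Seconds remaining, given a start time and the current time (epoch seconds).
seconds_left() {
  local start=$1 now=$2
  echo $(( LIMIT_SECONDS - (now - start) ))
}

# Succeeds (exit 0) once the remaining time is within the margin.
should_checkpoint() {
  [ "$(seconds_left "$1" "$2")" -le "$MARGIN_SECONDS" ]
}

# Example: one hour into a job that started at time 0.
seconds_left 0 3600
```

In a real job, the work loop would record start=$(date +%s) and call should_checkpoint "$start" "$(date +%s)" between units of work, saving state and exiting once it succeeds. How state is saved and how the job is resubmitted depends on your application.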
PartitionName=geforce Nodes=a[001-006] Default=YES DefMemPerCPU=2900 MaxTime=04:00:00 State=UP Shared=YES
PartitionName=quadro Nodes=a[007-008] Default=NO DefMemPerCPU=5900 MaxTime=04:00:00 State=UP Shared=YES
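DefMemPerCPU is the memory, in megabytes, that Slurm assigns per allocated CPU when a job does not request memory explicitly. As a rough check, the sketch below multiplies it by the 64 CPUs per node shown in the cs-sinfo output. The function name default_mem_mb is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: default memory (MB) for a job that takes every CPU on a node.
default_mem_mb() {
  local cpus=$1 def_mem_per_cpu=$2
  echo $(( cpus * def_mem_per_cpu ))
}

default_mem_mb 64 2900   # geforce: 64 CPUs x 2900 MB
default_mem_mb 64 5900   # quadro:  64 CPUs x 5900 MB
```

This prints 185600 and 377600, both just under the 190000 and 383000 listed in the MEMORY column, so a job using all CPUs with default memory should fit on a node.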