The cluster is up and running now. Anyone with a CS account who wishes to test it out is welcome to do so. Please send in a ticket requesting to be added if it is your first time using the AI cluster.
Feedback is requested.
The information from the older cluster mostly applies and I suggest you read that documentation: https://howto.cs.uchicago.edu/slurm (use guest as both the username and password to log in).

Summary of nodes installed on the cluster:
Anyone with a CS account who has previously sent in a ticket requesting access is allowed to log in.

There is a set of front-end (FE) nodes that give you access to the Slurm cluster. You will connect through these nodes and must be on one of them to submit jobs to the cluster.
You will use the FE nodes to transfer your files onto the cluster storage infrastructure. The network connections on those nodes are 2x 10G each.
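For example, a minimal sketch of copying a dataset in from your local machine through an FE node (the paths are hypothetical, and the placeholder address below stands in for your real username and FE hostname):

$ rsync -avP ./mydata/ firstname.lastname@example.org:~/mydata/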
kauffman3 is my CS test account.
$ ssh firstname.lastname@example.org
I've created a couple of scripts that run some of the Slurm commands but with more useful output; cs-sinfo and cs-squeue are the only two right now.
kauffman3@fe01:~$ cs-sinfo
NODELIST    NODES  PARTITION  STATE  CPUS  S:C:T   MEMORY  TMP_DISK  WEIGHT  AVAIL_FEATURES                  REASON  GRES
a[001-006]  6      geforce*   idle   64    2:16:2  190000  0         1       'turing,geforce,rtx2080ti,11g'  none    gpu:rtx2080ti:4
a[007-008]  2      quadro     idle   64    2:16:2  383000  0         1       'turing,quadro,rtx8000,48g'     none    gpu:rtx8000:4
kauffman3@fe01:~$ cs-squeue
JOBID  PARTITION  USER  NAME  NODELIST  TRES_PER_NODE  STATE  TIME
# List the device numbers of the devices I've requested from Slurm.
# These numbers map to /dev/nvidia?
kauffman3@fe01:~$ cat ./show_cuda_devices.sh
#!/bin/bash
hostname
echo $CUDA_VISIBLE_DEVICES
# Give me all four GPUs on systems 1-6.
kauffman3@fe01:~$ srun -p geforce --gres=gpu:4 -w a[001-006] ./show_cuda_devices.sh
a001
0,1,2,3
a002
0,1,2,3
a006
0,1,2,3
a005
0,1,2,3
a004
0,1,2,3
a003
0,1,2,3
# Give me all GPUs on systems 7-8.
# These are the Quadro RTX 8000s.
kauffman3@fe01:~$ srun -p quadro --gres=gpu:4 -w a[007-008] ./show_cuda_devices.sh
a008
0,1,2,3
a007
0,1,2,3
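You do not have to request all four GPUs. A hypothetical run asking for a single GPU, in which case CUDA_VISIBLE_DEVICES will show only the device Slurm assigned (the node and device number below are illustrative; you get whatever is free):

kauffman3@fe01:~$ srun -p geforce --gres=gpu:1 ./show_cuda_devices.sh
a001
0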
Do we have a max job runtime?
Yes: 4 hours, set per partition. You are expected to write your code to accommodate this limit, e.g. by checkpointing and resubmitting; a sketch follows the partition settings below.
PartitionName=geforce Nodes=a[001-006] Default=YES DefMemPerCPU=2900 MaxTime=04:00:00 State=UP Shared=YES
PartitionName=quadro  Nodes=a[007-008] Default=NO  DefMemPerCPU=5900 MaxTime=04:00:00 State=UP Shared=YES
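A minimal sketch of a batch script that works within the 4-hour cap by checkpointing and resubmitting itself. Here train.py, its --resume flag, checkpoints/latest.pt, and done.flag are hypothetical stand-ins for your own code:

#!/bin/bash
#SBATCH --partition=geforce
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00

# The training script is assumed to checkpoint periodically, resume from the
# latest checkpoint, and create done.flag when fully finished.
python train.py --resume checkpoints/latest.pt

# If the run has not finished, submit another 4-hour job to continue.
if [ ! -f done.flag ]; then
    sbatch "$0"
fi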
The process for a batch job is very similar to the interactive Jupyter steps listed further below. An example submission script:
#!/bin/bash
unset XDG_RUNTIME_DIR
NODEIP=$(hostname -i)
NODEPORT=$(( $RANDOM + 1024 ))
echo "ssh command: ssh -N -L 8888:$NODEIP:$NODEPORT email@example.com"
. ~/myenv/bin/activate
jupyter-notebook --ip=$NODEIP --port=$NODEPORT --no-browser
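If you save the script above as, say, jupyter.sbatch (a hypothetical name), submit it with sbatch and pull the ssh command out of the job's output file (the job ID below is illustrative):

kauffman3@fe01:~$ sbatch -p geforce --gres=gpu:1 jupyter.sbatch
Submitted batch job 12345
kauffman3@fe01:~$ grep "ssh command" slurm-12345.out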
Check the output of your job to find the ssh command to use when accessing your notebook.
Make a new ssh connection to tunnel your traffic. The format will be something like:
ssh -N -L 8888:###.###.###.###:#### firstname.lastname@example.org
This command will appear to hang, since we are using the -N option, which tells ssh not to run any commands (including a shell) on the remote machine.
Open your local browser and visit http://localhost:8888 (the local port you gave to -L above).
To run the notebook interactively instead, the steps are as follows; a combined sketch of the whole session follows this list.

srun --pty bash
    Run an interactive job.

unset XDG_RUNTIME_DIR
    Jupyter tries to use the value of this environment variable to store some files; by default it is set to '' and that causes errors when trying to run jupyter notebook.

export NODEIP=$(hostname -i)
    Get the IP address of the node you are using.

export NODEPORT=$(( $RANDOM + 1024 ))
    Get a random port above 1024.

echo $NODEIP:$NODEPORT
    Echo the env var values to use later.

jupyter-notebook --ip=$NODEIP --port=$NODEPORT --no-browser
    Start the Jupyter notebook.

ssh -N -L 8888:$NODEIP:$NODEPORT email@example.com
    From your local machine, create the ssh tunnel, substituting the values, not the variables.

You then reach $NODEIP:$NODEPORT via the ssh tunnel by browsing to localhost:8888. The ssh command will appear to hang, since we are using the -N option, which tells ssh not to run any commands (including a shell) on the remote machine.
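Putting the steps together, a hypothetical end-to-end transcript (the node name, IP, port, and placeholder address are illustrative):

kauffman3@fe01:~$ srun -p geforce --gres=gpu:1 --pty bash
kauffman3@a001:~$ unset XDG_RUNTIME_DIR
kauffman3@a001:~$ export NODEIP=$(hostname -i)
kauffman3@a001:~$ export NODEPORT=$(( $RANDOM + 1024 ))
kauffman3@a001:~$ echo $NODEIP:$NODEPORT
10.0.0.1:14322
kauffman3@a001:~$ . ~/myenv/bin/activate
kauffman3@a001:~$ jupyter-notebook --ip=$NODEIP --port=$NODEPORT --no-browser

# Then, from your local machine:
$ ssh -N -L 8888:10.0.0.1:14322 firstname.lastname@example.org
# ...and browse to http://localhost:8888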
This section can be ignored by most people. If you contributed to the cluster, or are in a group that has, you can read more here.