Last updated: May 2025
If this is your first time using the AI cluster, please send in a ticket requesting to be added. You will need to be involved in research with a CS faculty member.
Feedback is requested. Find us in the Slack ai-cluster channel (channel ID: C02KW3M0BDK).
Summary of nodes installed on the cluster:
AI Cluster Specs
CPU Cores: 2960
System Mem: 34389 GB
GPU Memory: 8032 GB
GPUs:
  92x A40 48GB
  52x L40S 48GB
  14x H100 80GB
Storage: 483 TB
We like the alphabet, so we have compute node groups for just about every letter in it.
There is a set of front-end nodes (currently fe01 and fe02) that give you access to the Slurm cluster. You will connect through these nodes and need to be on one of them to submit jobs to the cluster.
If you ssh to just "fe", it will pick one for you.
ssh cnetid@fe.ai.cs.uchicago.edu
You will use the FE nodes to transfer your files onto the cluster storage infrastructure over SSH, with a tool like rsync or scp.
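For example, to copy a local project directory into your home directory on the cluster with rsync (the directory names here are just placeholders):

rsync -avz ./my_project/ cnetid@fe.ai.cs.uchicago.edu:~/my_project/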
kauffman3 is my CS test account.
$ ssh kauffman3@fe.ai.cs.uchicago.edu
I've created a couple of scripts that run some of the Slurm commands but with more useful output; cs-sinfo and cs-squeue are the only two right now.
kauffman3@fe01:~$ cs-sinfo
NODELIST    NODES  PARTITION  STATE  CPUS  S:C:T   MEMORY  TMP_DISK  WEIGHT  AVAIL_FEATURES                  REASON  GRES
a[001-006]  6      geforce*   idle   64    2:16:2  190000  0         1       'turing,geforce,rtx2080ti,11g'  none    gpu:rtx2080ti:4
a[007-008]  2      quadro     idle   64    2:16:2  383000  0         1       'turing,quadro,rtx8000,48g'     none    gpu:rtx8000:4

kauffman3@fe01:~$ cs-squeue
JOBID  PARTITION  USER  NAME  NODELIST  TRES_PER_N  STATE  TIME
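If you prefer the standard Slurm tools, you can get roughly the same information with format strings; something like the following (the columns shown here are just one reasonable choice):

sinfo -N -o "%N %P %t %c %m %G"
squeue -u $USER -o "%i %P %u %j %N %b %T %M"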
Example: run my job script on four GPUs on "f" series nodes 1-4, using the Slurm partition your research group has access to. (Ask your advisor whether such a partition is available to you.)
kauffman3@fe01:~$ sbatch -p my-advisors-partition --gres=gpu:4 -w f[001-004] job.sh
Refer to 'man sbatch' and 'man srun' for more.
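If you're not sure what job.sh should contain, a minimal sketch looks something like the following; the partition name, environment path, and script name are placeholders for your own, and options can go either on the sbatch command line (as above) or as #SBATCH directives in the script:

#!/bin/bash
#SBATCH --job-name=example
#SBATCH --partition=my-advisors-partition
#SBATCH --gres=gpu:4
#SBATCH --time=04:00:00

# Activate your environment and run your program (both paths are placeholders).
. ~/myenv/bin/activate
python train.py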
Do we have a max job runtime?
Yes, 4 hours. This is enforced per partition. You are expected to write your code to accommodate this limit.
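One common way to live with the limit is to have Slurm signal your job a few minutes before the time limit so it can checkpoint and requeue itself. A rough sketch (the training command and checkpoint logic are placeholders; your code has to save and restore its own state):

#!/bin/bash
#SBATCH --time=04:00:00
#SBATCH --signal=B:USR1@300
#SBATCH --requeue

# When SIGUSR1 arrives (5 minutes before the limit), put this job back in the
# queue; the current run is stopped and the next one resumes from your checkpoint.
trap 'echo "time limit approaching, requeueing"; scontrol requeue $SLURM_JOB_ID' USR1

# Run the real work in the background so the shell can receive the signal.
python train.py --resume-if-checkpoint-exists &
wait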
The process for a batch job is very similar to the interactive process described below.
jupyter-notebook.sbatch
#!/bin/bash
unset XDG_RUNTIME_DIR
NODEIP=$(hostname -i)
NODEPORT=$(( $RANDOM + 1024 ))
echo "ssh command: ssh -N -L 8888:$NODEIP:$NODEPORT `whoami`@fe01.ai.cs.uchicago.edu"
. ~/myenv/bin/activate
jupyter-notebook --ip=$NODEIP --port=$NODEPORT --no-browser
Check the output of your job to find the ssh command to use when accessing your notebook.
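For example, after submitting you can pull the ssh command out of the job's output file (Slurm writes it to slurm-<jobid>.out in the submission directory by default):

sbatch jupyter-notebook.sbatch
grep "ssh command" slurm-*.out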
Make a new ssh connection to tunnel your traffic. The format will be something like:
ssh -N -L 8888:###.###.###.###:#### user@fe01.ai.cs.uchicago.edu
This command will appear to hang since we are using the -N option which tells ssh not to run any commands including a shell on the remote machine.
Open your local browser and visit: http://localhost:8888
You can also do this from an interactive job.

Run an interactive job:

srun --pty bash

Unset XDG_RUNTIME_DIR. Jupyter tries to use the value of this environment variable to store some files; by default it is set to '' and that causes errors when trying to run jupyter notebook.

unset XDG_RUNTIME_DIR

Get the IP address of the node you are using:

export NODEIP=$(hostname -i)

Get a random port above 1024:

export NODEPORT=$(( $RANDOM + 1024 ))

Echo the env var values to use later:

echo $NODEIP:$NODEPORT

Start the jupyter notebook:

jupyter-notebook --ip=$NODEIP --port=$NODEPORT --no-browser

From your local machine, make a new ssh connection to tunnel your traffic, using the values, not the variables:

ssh -N -L 8888:$NODEIP:$NODEPORT user@fe01.ai.cs.uchicago.edu

This forwards localhost:8888 to $NODEIP:$NODEPORT via the ssh tunnel. The command will appear to hang since we are using the -N option, which tells ssh not to run any commands, including a shell, on the remote machine.

Finally, open your local browser and visit: http://localhost:8888
Copy the following code snippet to the interactive node directly:
unset XDG_RUNTIME_DIR
NODEIP=$(hostname -i)
NODEPORT=$(( $RANDOM + 1024 ))
echo "ssh command: ssh -N -L 8888:$NODEIP:$NODEPORT `whoami`@fe01.ai.cs.uchicago.edu"
jupyter-notebook --ip=$NODEIP --port=$NODEPORT --no-browser
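When you're done, stop the notebook with Ctrl-C and exit the interactive shell so the node is freed up, or cancel the job from a front-end node (replace <jobid> with your job's ID):

scancel <jobid>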
This section can be ignored by most people. If you contributed to the cluster, or are in a group that has, you can read more here.