====== AI Cluster - Slurm ======
[[slurm:ai|This page has moved]]

Cluster is up and running now. Anyone with a CS account who wishes to test it out should do so.

Feedback is requested: [[https://discord.gg/ZVjX8Gv|#ai-cluster Discord channel]] or email Phil Kauffman (kauffman@cs dot uchicago dot edu).

Prior knowledge of how to use Slurm is preferred at this stage of testing.

Most of the information from the older cluster still applies; I suggest you read that documentation: https://howto.cs.uchicago.edu/techstaff:slurm

====== Demo ======

kauffman3 is my CS test account.

<code>
$ ssh kauffman3@fe.ai.cs.uchicago.edu
</code>
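If you connect often, an SSH config entry saves typing. A minimal sketch, assuming your own CS username in place of kauffman3 (the ''ai'' alias is illustrative):

<code>
# ~/.ssh/config
Host ai
    HostName fe.ai.cs.uchicago.edu
    User kauffman3
</code>

After that, ''ssh ai'' is enough.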
I've created a couple of scripts that run some of the Slurm commands but with more useful output; cs-sinfo and cs-squeue are the only two right now.
<code>
kauffman3@fe01:~$ cs-sinfo
NODELIST    NODES  PARTITION  STATE  CPUS  S:C:T   MEMORY  TMP_DISK  WEIGHT  AVAIL_FEATURES                  REASON  GRES
a[001-006]  6      geforce*   idle   64    2:16:2  190000  0         1       'turing,geforce,rtx2080ti,11g'  none    gpu:rtx2080ti:4
a[007-008]  2      quadro     idle   64    2:16:2  383000  0         1       'turing,quadro,rtx8000,48g'     none    gpu:rtx8000:4
</code>
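The cs-sinfo script itself isn't shown here, but a wrapper with that output could be little more than an sinfo format string. A hypothetical sketch (field list inferred from the column headers above; the real script may differ):

<code>
#!/bin/bash
# Sketch of a cs-sinfo-like wrapper.
# %z = S:C:T (sockets:cores:threads), %f = available features, %G = GRES.
sinfo -o "%10N %.5D %10P %.6T %.5c %8z %.7m %.9d %.7w %32f %.7E %G"
</code>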
<code>
kauffman3@fe01:~$ cs-squeue
JOBID  PARTITION  USER  NAME  NODELIST  TRES_PER_NODE  STATE  TIME
</code>
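Similarly, a cs-squeue-like wrapper could be a single squeue format string. Another hypothetical sketch:

<code>
#!/bin/bash
# Sketch of a cs-squeue-like wrapper; %b = generic resources (GRES) requested per node.
squeue -o "%.8i %10P %12u %20j %10N %14b %8T %.10M"
</code>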

The script below lists the device numbers of the GPUs I've requested from Slurm; these numbers map to /dev/nvidia?.
<code>
kauffman3@fe01:~$ cat ./show_cuda_devices.sh
#!/bin/bash
hostname
echo $CUDA_VISIBLE_DEVICES
</code>

Give me all four GPUs on systems 1-6:
<code>
kauffman3@fe01:~$ srun -p geforce --gres=gpu:4 -w a[001-006] ./show_cuda_devices.sh
a001
0,1,2,3
a002
0,1,2,3
a006
0,1,2,3
a005
0,1,2,3
a004
0,1,2,3
a003
0,1,2,3
</code>
Give me all GPUs on systems 7-8 (these are the Quadro RTX 8000s):
<code>
kauffman3@fe01:~$ srun -p quadro --gres=gpu:4 -w a[007-008] ./show_cuda_devices.sh
a008
0,1,2,3
a007
0,1,2,3
</code>
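The same request works as a batch job. A minimal sketch, assuming the geforce partition and the script above (job name and output path are illustrative):

<code>
#!/bin/bash
#SBATCH --job-name=cuda-devices   # illustrative name
#SBATCH --partition=geforce
#SBATCH --gres=gpu:4
#SBATCH --output=devices-%j.out   # %j expands to the job ID

./show_cuda_devices.sh
</code>

Submit it with sbatch and watch it in cs-squeue.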

====== Storage ======

  /net/scratch:
    Create yourself a directory /net/scratch/$USER (a one-liner, sketched below the list). Use it for whatever you want.

  /net/projects: (Please ignore this for now)
    Lives on the home directory server.
    The idea would be to create a dataset with a quota for people to use.
    Access would be controlled by the normal LDAP groups that you are used to everywhere else,
    e.g. jonaslab, sandlab.
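Setting up the scratch directory mentioned above is a single command:

<code>
# One-time setup of your personal scratch space.
mkdir -p /net/scratch/$USER
</code>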

Currently there is no quota on home directories.

The homes and scratch shares are each connected via 2x 25G. Both are SSD-only, so the storage should be FAST.

Each compute node (the nodes with GPUs) has a ZFS mirror mounted at /local. I set compression to lz4 by default; this usually gives a performance gain, since less data is read from and written to disk, at a small cost in CPU usage.

As of right now there is no mechanism to clean up /local. At some point I'll probably put a find command in cron that deletes files older than 90 days or so; a sketch of that follows.
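A minimal sketch of what that cleanup could look like (the schedule is illustrative; the 90-day threshold is the one floated above):

<code>
# Hypothetical /etc/cron.d entry, run nightly on each compute node:
# delete files under /local untouched for more than 90 days.
30 3 * * * root find /local -xdev -type f -mtime +90 -delete
</code>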

====== Asked Questions ======

> Do we have a max job runtime?

Yes: 4 hours, set per partition. You are expected to write your code to accommodate this limit, e.g. by checkpointing and resubmitting; see the sketch after the partition definitions below.

<code>
PartitionName=geforce Nodes=a[001-006] Default=YES DefMemPerCPU=2900 MaxTime=04:00:00 State=UP Shared=YES
PartitionName=quadro  Nodes=a[007-008] Default=NO  DefMemPerCPU=5900 MaxTime=04:00:00 State=UP Shared=YES
</code>
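One way to live within the 4-hour cap is to have Slurm signal the job shortly before the limit and requeue it, resuming from your application's last checkpoint. A hypothetical sketch (train.sh and the checkpointing are your application's responsibility):

<code>
#!/bin/bash
#SBATCH --partition=geforce
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00
#SBATCH --signal=B:USR1@300   # send SIGUSR1 to the batch shell 5 minutes before the limit
#SBATCH --requeue

# On SIGUSR1, requeue this job; the next run resumes from the last checkpoint.
trap 'scontrol requeue $SLURM_JOB_ID; exit 0' USR1

./train.sh &   # hypothetical job that checkpoints itself periodically
wait
</code>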