techstaff:aicluster

Differences

This shows you the differences between two versions of the page.

techstaff:aicluster [2020/11/11 12:06]
kauffman
techstaff:aicluster [2021/01/06 16:11]
kauffman
Line 1: Line 1:
 ====== AI Cluster - Slurm ====== ====== AI Cluster - Slurm ======
-The cluster is now up and running. Anyone with a CS account who wishes to test it is welcome to do so. +[[slurm:ai|This page has moved]]
- +
- +
-Feedback is requested:​ +
- +
- [[https://discord.gg/ZVjX8Gv|#ai-cluster Discord channel]] or email Phil Kauffman (kauffman@cs dot uchicago dot edu). +
- +
-Prior knowledge of Slurm is preferred at this stage of testing. +
- +
- +
-Most of the information for the older cluster still applies, so I suggest reading that documentation: https://howto.cs.uchicago.edu/techstaff:slurm +
- +
- +
-====== Infrastructure ====== +
-Summary of nodes installed on the cluster +
- +
-===== Compute/GPU Nodes ===== +
-  * 6x nodes +
-    * 2x Xeon Gold 6130 CPU @ 2.10GHz (64 threads) +
-    * 192G RAM +
-    * 4x Nvidia GeForce RTX2080Ti +
- +
-  * 2x nodes +
-    * 2x Xeon Gold 6130 CPU @ 2.10GHz (64 threads) +
-    * 384G RAM +
-    * 4x Nvidia Quadro RTX 8000 +
- +
-  * all: +
-    * zfs mirror mounted at /local +
-      * compression set to lz4: this usually improves performance, since less data is read from and written to disk, at the cost of a small CPU overhead. +
-      * As of right now there is no mechanism to clean up /local. At some point I'll probably put a find command in cron that deletes files older than 90 days or so. +
- +
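The cleanup idea for /local mentioned above (a find command run from cron) could be sketched roughly as follows. This is only a sketch under assumptions, not the mechanism the cluster uses: the function name `list_stale_files` is hypothetical, the 90-day default simply mirrors the "90 days or so" figure, and the sketch only lists candidates rather than deleting them.

```shell
#!/bin/bash
# Hypothetical sketch: list regular files under a directory that have not
# been modified for more than a given number of days (default: 90).

list_stale_files() {
    local dir="$1"
    local days="${2:-90}"
    # -xdev     : do not descend into other filesystems
    # -type f   : regular files only
    # -mtime +N : last modified more than N whole days ago
    find "$dir" -xdev -type f -mtime +"$days" -print
}

# Dry run against /local (commented out so that sourcing has no side effects):
# list_stale_files /local 90
```

Listing first makes it possible to review what would be removed; swapping `-print` for `-delete` would turn the dry run into an actual cleanup.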
-===== Storage ===== +
- +
-  * ai-storage1:​ +
-    * 41T total storage +
-    * uplink to cluster network: 2x 25G +
-    * /home/<username> +
-      * We intend to set user quotas; however, there are none right now. +
-    * /net/projects: (Please ignore this for now) +
-      * Lives on the home directory server. +
-      * The idea is to create a dataset with a quota for people to use. +
-      * The usual LDAP groups, which are available everywhere else, would control access to these directories, e.g. jonaslab, sandlab. +
- +
-  * ai-storage2:​ +
-    * 41T total storage +
-    * uplink to cluster network: 2x 25G +
-    * /net/scratch: Create yourself a directory at /net/scratch/$USER and use it for whatever you want. +
-    * Eventually, data will be automatically deleted after a set period, perhaps 90 days or whatever we determine makes sense. +
- +
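Creating the personal scratch directory described above might look like the sketch below. The helper name `make_scratch_dir` is hypothetical, and restricting the directory to its owner with mode 700 is an assumption of this sketch, not a stated cluster policy.

```shell
#!/bin/bash
# Hypothetical helper: create a per-user directory under a scratch root
# and print its path. Defaults to /net/scratch/$USER.

make_scratch_dir() {
    local root="${1:-/net/scratch}"
    local user="${2:-$USER}"
    local dir="$root/$user"
    # mkdir -p succeeds even if the directory already exists.
    # chmod 700 (owner-only access) is an assumption, not cluster policy.
    mkdir -p "$dir" && chmod 700 "$dir" && printf '%s\n' "$dir"
}

# On the cluster this would create /net/scratch/$USER:
# make_scratch_dir
```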
-  * ai-storage3:​ +
-    * zfs mirror with previous snapshots of '​ai-storage1'​. +
-    * NOT a backup. +
-    * Not enabled yet. +
- +
- +
-====== Demo ====== +
- +
-kauffman3 is my CS test account. +
- +
-<​code>​ +
-$ ssh kauffman3@fe.ai.cs.uchicago.edu +
-</​code>​ +
-I've created a couple of scripts that run some of the Slurm commands with more useful output. cs-sinfo and cs-squeue are the only two right now. +
-<​code>​ +
-kauffman3@fe01:​~$ cs-sinfo +
-NODELIST    NODES  PARTITION  STATE  CPUS  S:C:T   MEMORY  TMP_DISK  WEIGHT  AVAIL_FEATURES                  REASON  GRES +
-a[001-006]  6      geforce*   idle   64    2:16:2  190000  0         1       'turing,geforce,rtx2080ti,11g'  none    gpu:rtx2080ti:4 +
-a[007-008]  2      quadro     idle   64    2:16:2  383000  0         1       'turing,quadro,rtx8000,48g'     none    gpu:rtx8000:4 +
-</​code>​ +
-<​code>​ +
-kauffman3@fe01:​~$ cs-squeue +
-JOBID   PARTITION   USER           NAME                     NODELIST  TRES_PER_N  STATE     TIME +
-</​code>​ +
- +
-This script lists the device numbers of the GPUs I've requested from Slurm. +
-These numbers map to /dev/nvidia? +
-<​code>​ +
-kauffman3@fe01:~$ cat ./show_cuda_devices.sh +
-#!/bin/bash +
-hostname +
-echo "$CUDA_VISIBLE_DEVICES" +
-</​code>​ +
- +
-Give me all four GPUs on systems 1-6 +
-<​code>​ +
-kauffman3@fe01:~$ srun -p geforce --gres=gpu:4 -w a[001-006] ./show_cuda_devices.sh +
-a001 +
-0,1,2,3 +
-a002 +
-0,1,2,3 +
-a006 +
-0,1,2,3 +
-a005 +
-0,1,2,3 +
-a004 +
-0,1,2,3 +
-a003 +
-0,1,2,3 +
-</​code>​ +
-Give me all GPUs on systems 7-8 (these are the Quadro RTX 8000s): +
-<​code>​ +
-kauffman3@fe01:~$ srun -p quadro --gres=gpu:4 -w a[007-008] ./show_cuda_devices.sh +
-a008 +
-0,1,2,3 +
-a007 +
-0,1,2,3 +
-</​code>​ +
- +
- +
-===== Fairshare/​QOS ===== +
-By default, all usage is tracked and charged to a user's default account. A fairshare value is computed and used to prioritize a job on submission. +
- +
-Details are being worked out for anyone who donates to the cluster. This will be some sort of tiered system in which you can use a higher priority when you need it. +
-You will need to charge an account at job submission with ''%%--account=<name>%%'' and, most likely, select a priority level that you are allowed to use with ''%%--qos=<level>%%''. +
- +
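To illustrate how the two submission options above could appear in a batch script, here is a minimal sketch that prints a job-script header. The function name `job_header` and the example account and QOS names (`mylab`, `high`) are hypothetical placeholders, since the actual accounts and priority levels have not been defined; the `geforce` partition and `--gres=gpu` syntax are taken from the demo on this page.

```shell
#!/bin/bash
# Hypothetical sketch: print the header of a Slurm batch script that charges
# a given account and requests a given QOS.

job_header() {
    local account="$1"   # placeholder account name, e.g. mylab
    local qos="$2"       # placeholder QOS level, e.g. high
    cat <<EOF
#!/bin/bash
#SBATCH --partition=geforce
#SBATCH --gres=gpu:1
#SBATCH --account=$account
#SBATCH --qos=$qos
EOF
}

# Example: write a header to a file, append the job's commands, then submit
# the file with sbatch.
# job_header mylab high > train.sbatch
```

The same options can also be passed directly on the `sbatch` or `srun` command line instead of as `#SBATCH` directives.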
- +
-====== Asked Questions ====== +
- +
-> Do we have a max job runtime? +
- +
-Yes: 4 hours. The limit is set per partition. You are expected to write your code to accommodate this. +
- +
-<​code>​ +
-PartitionName=geforce Nodes=a[001-006] Default=YES DefMemPerCPU=2900 MaxTime=04:00:00 State=UP Shared=YES +
-PartitionName=quadro  Nodes=a[007-008] Default=NO DefMemPerCPU=5900 MaxTime=04:00:00 State=UP Shared=YES +
-</​code>​ +
- +
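One possible way to accommodate the 4-hour limit above is to checkpoint and requeue. The sketch below is not a documented cluster recipe. It assumes the job can save its progress to a file, that job requeueing is permitted on this cluster, and it uses a hypothetical checkpoint file and a `sleep` as a stand-in for real work. It relies on the `sbatch` options `--signal` and `--requeue` and on `scontrol requeue`.

```shell
#!/bin/bash
#SBATCH --partition=geforce
#SBATCH --time=04:00:00
# Ask Slurm to send SIGUSR1 to the batch shell about 300 s before the limit.
#SBATCH --signal=B:USR1@300
#SBATCH --requeue

CHECKPOINT="${CHECKPOINT:-checkpoint.txt}"   # hypothetical checkpoint file
step=0

# Signal handler: save progress, then ask Slurm to put the job back in the queue.
save_and_requeue() {
    echo "$step" > "$CHECKPOINT"
    # Only requeue when actually running under Slurm.
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        scontrol requeue "$SLURM_JOB_ID"
    fi
    exit 0
}

# Run up to $1 units of work, resuming from the checkpoint if one exists.
run_steps() {
    local total="$1"
    if [ -f "$CHECKPOINT" ]; then
        step=$(cat "$CHECKPOINT")
    fi
    trap save_and_requeue USR1
    while [ "$step" -lt "$total" ]; do
        sleep 1 &    # placeholder for one unit of real work
        wait $!      # waiting on a background job lets the trap run promptly
        step=$((step + 1))
    done
    rm -f "$CHECKPOINT"
}

# Entry point (commented out so that sourcing has no side effects):
# run_steps 1000
```

When the requeued job starts again, the script runs from the top, finds the checkpoint file, and continues from the saved step.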
/var/lib/dokuwiki/data/pages/techstaff/aicluster.txt · Last modified: 2021/01/06 16:11 by kauffman