techstaff:aicluster

Differences

This shows you the differences between two versions of the page.

techstaff:aicluster [2020/11/11 16:21] kauffman
techstaff:aicluster [2021/01/06 16:11] (current) kauffman
Line 1: Line 1:
 ====== AI Cluster - Slurm ====== ====== AI Cluster - Slurm ======
-The cluster is up and running now. Anyone with a CS account who wishes to test it is welcome to do so. +[[slurm:ai|This page has moved]]
- +
- +
-Feedback is requested: +
- +
- [[https://discord.gg/ZVjX8Gv|#ai-cluster Discord channel]] or email Phil Kauffman (kauffman@cs dot uchicago dot edu). +
- +
-Prior knowledge of Slurm is preferred at this stage of testing. +
- +
- +
-The information from the older cluster mostly applies, and I suggest you read that documentation: https://howto.cs.uchicago.edu/techstaff:slurm +
- +
- +
-====== Infrastructure ====== +
-Summary of nodes installed on the cluster +
- +
-===== Compute/GPU Nodes ===== +
-  * 6x nodes +
-    * 2x Xeon Gold 6130 CPU @ 2.10GHz (64 threads) +
-    * 192G RAM +
-    * 4x Nvidia GeForce RTX2080Ti +
- +
-  * 2x nodes +
-    * 2x Xeon Gold 6130 CPU @ 2.10GHz (64 threads) +
-    * 384G RAM +
-    * 4x Nvidia Quadro RTX 8000 +
- +
-  * all: +
-    * zfs mirror mounted at /local +
-      * compression set to lz4: this usually improves performance, because less data is read from and written to disk, at the cost of a small CPU overhead. +
-      * As of right now there is no mechanism to clean up /local. At some point I'll probably put a find command in cron that deletes files older than 90 days or so. +
- +
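A minimal sketch of the kind of find-based cleanup mentioned above for /local, assuming GNU find. Nothing like this is installed yet; the function name ''clean_old_files'' and its arguments are hypothetical, and the 90-day default is only the rough figure quoted above.

<code bash>
#!/bin/bash
# Sketch only: delete regular files under a directory that were last
# modified more than N days ago. The name clean_old_files is hypothetical.
clean_old_files() {
    local dir="$1"          # directory to clean, e.g. /local
    local days="${2:-90}"   # age threshold in days (assumed default: 90)

    # -xdev keeps find on the same filesystem as "$dir".
    # -mtime +N matches files modified more than N days ago.
    find "$dir" -xdev -type f -mtime +"$days" -delete
}
</code>

Run from cron, this might be invoked as ''clean_old_files /local 90''. The schedule and the threshold are still undecided, so treat both as placeholders.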
-===== Storage ===== +
- +
-  * ai-storage1: +
-    * 41T total storage +
-    * uplink to cluster network: 2x 25G +
-    * /home/<username> +
-      * We intend to set user quotas; however, there are no quotas right now. +
-    * /net/projects: (Please ignore this for now) +
-      * Lives on the home directory server. +
-      * The idea is to create a dataset with a quota for people to use. +
-      * The normal LDAP groups that you are used to, and that are available everywhere else, would control access to these directories (e.g. jonaslab, sandlab). +
- +
-  * ai-storage2: +
-    * 41T total storage +
-    * uplink to cluster network: 2x 25G +
-    * /net/scratch: Create yourself a directory /net/scratch/$USER. Use it for whatever you want. +
-    * Eventually data will be auto-deleted after some amount of time, maybe 90 days or whatever we determine makes sense. +
- +
-  * ai-storage3: +
-    * zfs mirror with previous snapshots of 'ai-storage1' +
-    * NOT a backup. +
-    * Not enabled yet. +
- +
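Creating the scratch directory described above can be sketched as follows. The function name ''make_scratch_dir'' and the ''SCRATCH_ROOT'' override are hypothetical (the override exists only so the sketch can be tried outside the cluster), and restricting permissions to the owner is an assumption, not a site policy.

<code bash>
#!/bin/bash
# Sketch only: create a personal directory under the scratch filesystem.
# SCRATCH_ROOT is a hypothetical override; it defaults to /net/scratch.
make_scratch_dir() {
    local root="${SCRATCH_ROOT:-/net/scratch}"
    local user="${USER:-$(id -un)}"
    local dir="${root}/${user}"

    mkdir -p "$dir" || return 1
    # Assumption: keep the directory private to its owner.
    chmod 700 "$dir" || return 1
    printf '%s\n' "$dir"
}
</code>

Calling ''make_scratch_dir'' on a front-end node prints the path it created, which can then be used as the working directory for jobs.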
- +
-====== Login ====== +
-There is a set of front-end nodes that give you access to the Slurm cluster. You connect through these nodes and must be on one of them to submit jobs to the cluster. +
- +
-    fe.ai.cs.uchicago.edu +
- +
-  * Requires a CS account. +
-  * ssh cnetid@fe.ai.cs.uchicago.edu +
- +
- +
-====== Demo ====== +
- +
-kauffman3 is my CS test account. +
- +
-<code> +
-$ ssh kauffman3@fe.ai.cs.uchicago.edu +
-</code> +
-I've created a couple of scripts that run some of the Slurm commands but with more useful output. cs-sinfo and cs-squeue are the only two right now. +
-<code> +
-kauffman3@fe01:~$ cs-sinfo +
-NODELIST    NODES  PARTITION  STATE  CPUS  S:C:T   MEMORY  TMP_DISK WEIGHT  AVAIL_FEATURES                  REASON  GRES +
-a[001-006]  6      geforce*   idle   64    2:16: 190000  0           'turing,geforce,rtx2080ti,11g'  none    gpu:rtx2080ti: +
-a[007-008]  2      quadro     idle   64    2:16: 383000  0           'turing,quadro,rtx8000,48g'     none    gpu:rtx8000: +
-</code> +
-<code> +
-kauffman3@fe01:~$ cs-squeue +
-JOBID   PARTITION   USER           NAME                     NODELIST TRES_PER_NSTATE     TIME +
-</code> +
- +
-# List the device numbers of the devices I've requested from Slurm. +
-# These numbers map to /dev/nvidia? +
-<code> +
-kauffman3@fe01:~$ cat ./show_cuda_devices.sh +
-#!/bin/bash +
-hostname +
-echo $CUDA_VISIBLE_DEVICES +
-</code> +
- +
-Give me all four GPUs on systems 1-6 +
-<code> +
-kauffman3@fe01:~$ srun -p geforce --gres=gpu:4 -w a[001-006] ./show_cuda_devices.sh +
-a001 +
-0,1,2,3 +
-a002 +
-0,1,2,3 +
-a006 +
-0,1,2,3 +
-a005 +
-0,1,2,3 +
-a004 +
-0,1,2,3 +
-a003 +
-0,1,2,3 +
-</code> +
-# give me all GPUs on systems 7-8 +
-#   these are the Quadro RTX 8000s +
-<code> +
-kauffman3@fe01:~$ srun -p quadro --gres=gpu:4 -w a[007-008] ./show_cuda_devices.sh +
-a008 +
-0,1,2,3 +
-a007 +
-0,1,2,3 +
-</code> +
- +
- +
-===== Fairshare/QOS ===== +
-By default, all usage is tracked and charged to a user's default account. A fairshare value is computed and used to prioritize a job on submission. +
- +
-Details are being worked out for anyone that donates to the cluster. This will be some sort of tiered system where you get to use a higher priority when you need it. +
-You will need to charge an account at job submission with ''%%--account=<name>%%'' and, most likely, select a priority level that you are allowed to use with ''%%--qos=<level>%%''. +
- +
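As a rough illustration of how those options might appear in a batch script: the account name ''mylab'' and the QOS name ''normal'' below are made up, since the real names and tiers have not been decided. The partition, GPU request and time limit reuse values shown elsewhere on this page.

<code bash>
#!/bin/bash
#SBATCH --partition=geforce
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00
#SBATCH --account=mylab   # hypothetical account name
#SBATCH --qos=normal      # hypothetical QOS level

# Print the node name and the GPU devices Slurm assigned to this job.
report_gpus() {
    hostname
    echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
}

report_gpus
</code>

Saved as, say, ''job.sh'', this would be submitted with ''sbatch job.sh''. The same options can also be given on the ''srun'' or ''sbatch'' command line instead of as ''#SBATCH'' directives.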
- +
-====== Asked Questions ====== +
- +
-> Do we have a max job runtime? +
- +
-Yes: 4 hours. The limit is set per partition. You are expected to write your code to accommodate this. +
- +
-<code> +
-PartitionName=geforce Nodes=a[001-006] Default=YES DefMemPerCPU=2900 MaxTime=04:00:00 State=UP Shared=YES +
-PartitionName=quadro  Nodes=a[007-008] Default=NO DefMemPerCPU=5900 MaxTime=04:00:00 State=UP Shared=YES +
-</code> +
- +
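One way to accommodate the time limit is to checkpoint progress and resume in a follow-up job. The sketch below is only an illustration of that pattern: the function name ''run_with_checkpoint'', the one-number checkpoint file and the step counter are all hypothetical stand-ins for whatever state your own code needs to save.

<code bash>
#!/bin/bash
# Sketch only: run up to "total" steps, saving progress after each one,
# and stop cleanly once a time budget (in seconds) has been used up.
run_with_checkpoint() {
    local ckpt="$1"     # checkpoint file holding the last completed step
    local total="$2"    # total number of steps to run
    local budget="$3"   # time budget in seconds, below the partition limit
    local start=$SECONDS
    local step=0

    # Resume from a previous run if a non-empty checkpoint exists.
    if [ -s "$ckpt" ]; then
        step=$(<"$ckpt")
    fi

    while [ "$step" -lt "$total" ]; do
        if [ $((SECONDS - start)) -ge "$budget" ]; then
            echo "time budget reached at step $step"
            return 1
        fi
        # ... one unit of real work would go here ...
        step=$((step + 1))
        echo "$step" > "$ckpt"
    done
    echo "finished all $total steps"
}
</code>

With a 4 hour limit, a budget of roughly 13800 seconds (3 hours 50 minutes) leaves some margin for saving state. A non-zero return means the job should be resubmitted to continue from the checkpoint.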
/var/lib/dokuwiki/data/attic/techstaff/aicluster.1605133262.txt.gz · Last modified: 2020/11/11 16:21 by kauffman
