====== AI Cluster - Slurm ======
[[slurm:ai|This page has moved]]

The cluster is up and running. Anyone with a CS account who wishes to test it out should do so.

Feedback is requested: [[https://discord.gg/ZVjX8Gv|#ai-cluster Discord channel]] or email Phil Kauffman (kauffman@cs dot uchicago dot edu).

Prior knowledge of how to use Slurm is preferred at this stage of testing.

Most of the information from the older cluster still applies, so I suggest you read that documentation: https://howto.cs.uchicago.edu/techstaff:slurm
====== Infrastructure ======
Summary of nodes installed on the cluster.

===== Compute/GPU Nodes =====
  * 6x nodes
    * 2x Xeon Gold 6130 CPU @ 2.10GHz (64 threads)
    * 192G RAM
    * 4x Nvidia GeForce RTX 2080 Ti

  * 2x nodes
    * 2x Xeon Gold 6130 CPU @ 2.10GHz (64 threads)
    * 384G RAM
    * 4x Nvidia Quadro RTX 8000

  * all:
    * ZFS mirror mounted at /local
      * lz4 compression enabled: this is usually a net performance gain, since less data is read from and written to disk, at the cost of a small CPU overhead.
      * As of right now there is no mechanism to clean up /local. At some point I'll probably put a find command in cron that deletes files older than 90 days or so. A sketch of how a job might use /local follows this list.
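Until automatic cleanup exists, a batch job can stage its data onto /local and remove it when it finishes. A minimal sketch, assuming you create a per-user directory under /local; the dataset path, output path, and train.py are hypothetical, and the resource requests should be adjusted to your job:
<code>
#!/bin/bash
#SBATCH --job-name=local-scratch-demo
#SBATCH --partition=geforce
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00

# Stage input data from network storage to the node-local ZFS scratch space.
SCRATCH=/local/$USER/$SLURM_JOB_ID
mkdir -p "$SCRATCH"
cp -r /net/scratch/$USER/my_dataset "$SCRATCH/"      # hypothetical dataset path

# Run the workload against the local copy (hypothetical training script).
python3 train.py --data "$SCRATCH/my_dataset" --out "$SCRATCH/results"

# Copy results back to network storage and clean up after yourself,
# since nothing purges /local automatically yet.
cp -r "$SCRATCH/results" /net/scratch/$USER/results-$SLURM_JOB_ID
rm -rf "$SCRATCH"
</code>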

===== Storage =====

  * ai-storage1:
    * 41T total storage
    * uplink to cluster network: 2x 25G
    * /home/<username>
      * We intend to set user quotas, but there are no quotas right now.
    * /net/projects: (please ignore this for now)
      * Lives on the home directory server.
      * The idea is to create a dataset with a quota for people to use.
      * Access to these directories would be controlled by the normal LDAP groups you are used to and that are available everywhere else, e.g. jonaslab, sandlab.

  * ai-storage2:
    * 41T total storage
    * uplink to cluster network: 2x 25G
    * /net/scratch: create yourself a directory /net/scratch/$USER and use it for whatever you want (see the example after this list).
    * Eventually data here will be auto-deleted after a set amount of time, maybe 90 days or whatever we determine makes sense.

  * ai-storage3:
    * ZFS mirror with previous snapshots of 'ai-storage1'.
    * NOT a backup.
    * Not enabled yet.
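Creating your scratch directory is a one-time step from the front-end node; a minimal example (the chmod is optional and just keeps other users out of your files):
<code>
# One-time setup: create your personal scratch directory on ai-storage2.
mkdir -p /net/scratch/$USER

# Optional: restrict access to your scratch files.
chmod 700 /net/scratch/$USER
</code>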

====== Demo ======

kauffman3 is my CS test account.

<code>
$ ssh kauffman3@fe.ai.cs.uchicago.edu
</code>
I've created a couple of scripts that run some of the Slurm commands but with more useful output; cs-sinfo and cs-squeue are the only two right now.
<code>
kauffman3@fe01:~$ cs-sinfo
NODELIST    NODES  PARTITION  STATE  CPUS  S:C:T   MEMORY  TMP_DISK  WEIGHT  AVAIL_FEATURES                  REASON  GRES
a[001-006]  6      geforce*   idle   64    2:16:2  190000  0         1       'turing,geforce,rtx2080ti,11g'  none    gpu:rtx2080ti:4
a[007-008]  2      quadro     idle   64    2:16:2  383000  0         1       'turing,quadro,rtx8000,48g'     none    gpu:rtx8000:4
</code>
<code>
kauffman3@fe01:~$ cs-squeue
JOBID   PARTITION   USER           NAME                     NODELIST   TRES_PER_N   STATE     TIME
</code>

The following script lists the device numbers of the GPUs I've requested from Slurm; these numbers map to /dev/nvidia?.
<code>
kauffman3@fe01:~$ cat ./show_cuda_devices.sh
#!/bin/bash
hostname
echo $CUDA_VISIBLE_DEVICES
</code>

Give me all four GPUs on systems 1-6:
<code>
kauffman3@fe01:~$ srun -p geforce --gres=gpu:4 -w a[001-006] ./show_cuda_devices.sh
a001
0,1,2,3
a002
0,1,2,3
a006
0,1,2,3
a005
0,1,2,3
a004
0,1,2,3
a003
0,1,2,3
</code>

Give me all GPUs on systems 7-8 (these are the Quadro RTX 8000s):
<code>
kauffman3@fe01:~$ srun -p quadro --gres=gpu:4 -w a[007-008] ./show_cuda_devices.sh
a008
0,1,2,3
a007
0,1,2,3
</code>
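
The same requests work non-interactively with sbatch. A minimal sketch of a batch submission; the script name, job name, and output path are arbitrary, and the partition and gres values mirror the demo above:
<code>
kauffman3@fe01:~$ cat ./gpu_job.sbatch
#!/bin/bash
#SBATCH --job-name=cuda-devices          # arbitrary job name
#SBATCH --partition=geforce              # or "quadro" for the RTX 8000 nodes
#SBATCH --gres=gpu:2                     # request 2 of the 4 GPUs on one node
#SBATCH --output=cuda-devices-%j.out     # %j expands to the job ID

hostname
echo $CUDA_VISIBLE_DEVICES

kauffman3@fe01:~$ sbatch ./gpu_job.sbatch
</code>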

====== Asked Questions ======

> Do we have a max job runtime?

Yes, 4 hours. This is set per partition. You are expected to write your code to accommodate this limit.

<code>
PartitionName=geforce Nodes=a[001-006] Default=YES DefMemPerCPU=2900 MaxTime=04:00:00 State=UP Shared=YES
PartitionName=quadro  Nodes=a[007-008] Default=NO  DefMemPerCPU=5900 MaxTime=04:00:00 State=UP Shared=YES
</code>
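
One common way to work within the limit (a sketch, not cluster policy; train_chunk.sbatch is a hypothetical job script that checkpoints its own progress and resumes from the latest checkpoint) is to chain jobs with Slurm dependencies so each submission stays under 4 hours:
<code>
# Submit the first chunk of work.
kauffman3@fe01:~$ sbatch ./train_chunk.sbatch

# Submit a follow-up job that starts only after the first one finishes,
# resuming from the checkpoint the first chunk wrote.
# Replace 12345 with the job ID printed by the first sbatch.
kauffman3@fe01:~$ sbatch --dependency=afterany:12345 ./train_chunk.sbatch
</code>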