====== AI Cluster - Slurm ======

[[slurm:ai|This page has moved]]

The cluster is up and running. Anyone with a CS account who wishes to test it out should do so.

Feedback is requested: use the #ai-cluster Discord channel or email Phil Kauffman (kauffman@cs dot uchicago dot edu).

Prior knowledge of how to use Slurm is preferred at this stage of testing.

Most of the information from the older cluster still applies, and I suggest you read that documentation: https://howto.cs.uchicago.edu/techstaff:slurm

====== Demo ======

kauffman3 is my CS test account.

<code>
$ ssh kauffman3@fe.ai.cs.uchicago.edu
</code>

I've created a couple of scripts that run some of the Slurm commands but with more useful output; cs-sinfo and cs-squeue are the only two right now.
<code>
kauffman3@fe01:~$ cs-sinfo
NODELIST    NODES  PARTITION  STATE  CPUS  S:C:T   MEMORY  TMP_DISK  WEIGHT  AVAIL_FEATURES                  REASON  GRES
a[001-006]  6      geforce*   idle   64    2:16:2  190000  0         1       'turing,geforce,rtx2080ti,11g'  none    gpu:rtx2080ti:4
a[007-008]  2      quadro     idle   64    2:16:2  383000  0         1       'turing,quadro,rtx8000,48g'     none    gpu:rtx8000:4
</code>

<code>
kauffman3@fe01:~$ cs-squeue
JOBID   PARTITION   USER   NAME   NODELIST   TRES_PER_NODE   STATE   TIME
</code>
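
The cs-* wrappers aren't shown here, but plain sinfo/squeue format strings get close to the same output. The format strings below are my guess at an equivalent, not the actual scripts:
<code>
# Assumed plain-Slurm near-equivalents of cs-sinfo and cs-squeue:
sinfo -o "%N %D %P %T %c %z %m %d %w %f %E %G"
squeue -o "%i %P %u %j %N %b %T %M"
</code>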

# List the device numbers of the devices I've requested from Slurm.
# These numbers map to /dev/nvidia?
<code>
kauffman3@fe01:~$ cat ./show_cuda_devices.sh
#!/bin/bash
hostname
echo $CUDA_VISIBLE_DEVICES
</code>

Give me all four GPUs on systems 1-6:
<code>
kauffman3@fe01:~$ srun -p geforce --gres=gpu:4 -w a[001-006] ./show_cuda_devices.sh
a001
0,1,2,3
a002
0,1,2,3
a006
0,1,2,3
a005
0,1,2,3
a004
0,1,2,3
a003
0,1,2,3
</code>
# Give me all GPUs on systems 7-8.
#   These are the Quadro RTX 8000s.
<code>
kauffman3@fe01:~$ srun -p quadro --gres=gpu:4 -w a[007-008] ./show_cuda_devices.sh
a008
0,1,2,3
a007
0,1,2,3
</code>
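
The demos above are interactive (srun). For non-interactive work you would normally submit a batch script instead; here is a minimal sketch (gpu_job.sbatch is a hypothetical name, and the partition and GPU count are just examples):
<code>
#!/bin/bash
#SBATCH --partition=geforce
#SBATCH --gres=gpu:4
#SBATCH --time=04:00:00

./show_cuda_devices.sh
</code>
Submit it with ''sbatch ./gpu_job.sbatch''.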

====== Storage ======

  /net/scratch:
    Create yourself a directory /net/scratch/$USER (see the example below). Use it for whatever you want.

  /net/projects: (please ignore this for now)
    Lives on the home directory server.
    The idea is to create a dataset with a quota for people to use.
    The normal LDAP groups that you are used to, and that are available everywhere else, would control access to these directories.
    e.g. jonaslab, sandlab
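
One-time setup of a personal scratch directory (the chmod is optional, just to keep it private):
<code>
mkdir -p /net/scratch/$USER
chmod 700 /net/scratch/$USER
</code>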

Currently there is no quota on home directories.

homes and scratch are each connected via 2x 25G. Both are SSD-only, so the storage should be fast.
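
If you want to sanity-check throughput yourself, a rough sequential-write test will do (the path and size are arbitrary; this is not a rigorous benchmark):
<code>
dd if=/dev/zero of=/net/scratch/$USER/ddtest bs=1M count=4096 conv=fdatasync
rm /net/scratch/$USER/ddtest
</code>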

Each compute node (the nodes with GPUs) has a ZFS mirror mounted at /local. I set compression to lz4 by default; this usually yields a performance gain, since less data is read from and written to disk, at a small cost in CPU usage. As of right now there is no mechanism to clean up /local. At some point I'll probably put a find command in cron that deletes files older than 90 days or so.
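
Such a cron job might look like the sketch below. This is not in place yet; the path, the 90-day window, and the use of mtime are all illustrative:
<code>
# Hypothetical cleanup: delete files under /local not modified in 90+ days.
find /local -xdev -type f -mtime +90 -delete
</code>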

====== Asked Questions ======

> Do we have a max job runtime?

Yes: 4 hours, enforced per partition. You are expected to write your code to accommodate this (see the sketch below).

<code>
PartitionName=geforce Nodes=a[001-006] Default=YES DefMemPerCPU=2900 MaxTime=04:00:00 State=UP Shared=YES
PartitionName=quadro  Nodes=a[007-008] Default=NO  DefMemPerCPU=5900 MaxTime=04:00:00 State=UP Shared=YES
</code>
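
One common way to accommodate the limit is to checkpoint and let Slurm requeue the job. A minimal sketch, assuming your program (train.py here is a placeholder) saves and resumes from a checkpoint file on its own:
<code>
#!/bin/bash
#SBATCH --partition=geforce
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00
#SBATCH --signal=B:USR1@300
#SBATCH --requeue

# Requeue this job when Slurm sends the warning signal
# 5 minutes before the time limit.
trap 'scontrol requeue $SLURM_JOB_ID' USR1

# Run in the background and wait, so the trap can fire.
./train.py --resume-from=checkpoint.pt &
wait
</code>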