====== AI Cluster - Slurm ======
[[slurm:ai|This page has moved]]

The cluster is up and running. Anyone with a CS account who wishes to test it out should do so.

Feedback is requested: use the #ai-cluster Discord channel or email Phil Kauffman (kauffman@cs dot uchicago dot edu).

Prior knowledge of how to use Slurm is preferred at this stage of testing.

Most of the information from the older cluster still applies, and I suggest you read that documentation: https://howto.cs.uchicago.edu/techstaff:slurm

====== Demo ======

kauffman3 is my CS test account.

<code>
$ ssh kauffman3@fe.ai.cs.uchicago.edu
</code>
I've created a couple of scripts that run some of the Slurm commands but with more useful output; cs-sinfo and cs-squeue are the only two right now.
<code>
kauffman3@fe01:~$ cs-sinfo
NODELIST    NODES  PARTITION  STATE  CPUS  S:C:T   MEMORY  TMP_DISK  WEIGHT  AVAIL_FEATURES                  REASON  GRES
a[001-006]  6      geforce*   idle   64    2:16:2  190000  0         1       'turing,geforce,rtx2080ti,11g'  none    gpu:rtx2080ti:4
a[007-008]  2      quadro     idle   64    2:16:2  383000  0         1       'turing,quadro,rtx8000,48g'     none    gpu:rtx8000:4
</code>
<code>
kauffman3@fe01:~$ cs-squeue
JOBID   PARTITION   USER           NAME                     NODELIST  TRES_PER_NODE  STATE  TIME
</code>

# List the device numbers of the devices I've requested from Slurm.
# These numbers map to /dev/nvidia?
<code>
kauffman3@fe01:~$ cat ./show_cuda_devices.sh
#!/bin/bash
hostname
echo $CUDA_VISIBLE_DEVICES
</code>

Give me all four GPUs on systems 1-6:
<code>
kauffman3@fe01:~$ srun -p geforce --gres=gpu:4 -w a[001-006] ./show_cuda_devices.sh
a001
0,1,2,3
a002
0,1,2,3
a006
0,1,2,3
a005
0,1,2,3
a004
0,1,2,3
a003
0,1,2,3
</code>
# Give me all GPUs on systems 7-8
#   (these are the Quadro RTX 8000s)
<code>
kauffman3@fe01:~$ srun -p quadro --gres=gpu:4 -w a[007-008] ./show_cuda_devices.sh
a008
0,1,2,3
a007
0,1,2,3
</code>
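
The same kind of request can also be submitted as a batch job. Below is a minimal sketch of an sbatch script for the geforce partition; the job name, output pattern, and 30-minute time limit are arbitrary illustration values, not site requirements.
<code>
#!/bin/bash
#SBATCH --job-name=gpu-test        # arbitrary example name
#SBATCH --partition=geforce        # or quadro for the RTX 8000 nodes
#SBATCH --gres=gpu:4               # all four GPUs on the node
#SBATCH --time=00:30:00            # stay under the partition's 4-hour MaxTime
#SBATCH --output=gpu-test-%j.out   # %j expands to the job ID

hostname
echo $CUDA_VISIBLE_DEVICES
</code>
Submit it from the front end with sbatch, e.g. sbatch gpu_job.sbatch (the filename is up to you).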


====== Storage ======

  /net/scratch:
     Create yourself a directory /net/scratch/$USER (see the example after this list). Use it for whatever you want.

  /net/projects: (Please ignore this for now)
    Lives on the home directory server.
    The idea would be to create a dataset with a quota for people to use.
    The normal LDAP groups that you are used to, and that are available everywhere else, would control access to these directories.
    e.g. jonaslab, sandlab
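
For example, setting up your scratch space is just the following (the test file is only there to confirm it works):
<code>
mkdir -p /net/scratch/$USER               # one-time setup on the front end
echo hello > /net/scratch/$USER/test.txt
</code>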

Currently there is no quota on home directories.

homes and scratch are each connected via 2x 25G. Both are SSD only, so the storage should be FAST.

Each compute node (the nodes with GPUs) has a ZFS mirror mounted at /local.
I set compression to lz4 by default. This usually gives a performance gain, since less data is read from and written to disk, with a small overhead in CPU usage.
As of right now there is no mechanism to clean up /local. At some point I'll probably put a find command in cron that deletes files older than 90 days or so.
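
For reference, that cleanup would be a root cron entry along the lines of the sketch below. The 90-day cutoff matches the figure above; the nightly schedule is an assumption, and nothing like this is deployed yet.
<code>
# Hypothetical root crontab entry: every night at 03:00 delete files under
# /local that have not been modified in 90 days (not currently in place)
0 3 * * * find /local -xdev -type f -mtime +90 -delete
</code>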

====== Asked Questions ======

> Do we have a max job runtime?

Yes: 4 hours. This is set per partition. You are expected to write your code to accommodate this.

<code>
PartitionName=geforce Nodes=a[001-006] Default=YES DefMemPerCPU=2900 MaxTime=04:00:00 State=UP Shared=YES
PartitionName=quadro  Nodes=a[007-008] Default=NO  DefMemPerCPU=5900 MaxTime=04:00:00 State=UP Shared=YES
</code>
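
Accommodating the limit mostly means requesting a runtime at or below MaxTime and checkpointing your own state so a follow-up job can resume. A minimal sketch; train.py, its --resume-from flag, and the checkpoint path are hypothetical placeholders for your own code:
<code>
# Ask for 3.5 hours on the geforce partition; Slurm enforces the 4-hour
# MaxTime regardless of what you request.
srun -p geforce --gres=gpu:1 --time=03:30:00 \
     python train.py --resume-from /net/scratch/$USER/checkpoint.pt
</code>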
- +