slurm:ai
slurm:ai [2021/01/06 16:01] – created kauffman | slurm:ai [2022/04/04 10:58] (current) – fix typos and add code snippet for interactive jupyter notebook chaochunh | ||
- | ====== AI Cluster (Slurm) ====== | + | ====== AI Cluster |
+ | If this is your first time using the AI cluster, please send in a ticket requesting to be added. | ||
+ | |||
+ | |||
+ | Feedback is requested: | ||
+ | |||
+ | | ||
+ | |||
+ | |||
+ | |||
+ | Most of the information for the older cluster still applies, so I suggest you read that documentation: | ||
+ | |||
+ | |||
+ | ====== Infrastructure ====== | ||
+ | Summary of nodes installed on the cluster. | ||
+ | |||
+ | * [[ http:// | ||
+ | * [[ https:// | ||
+ | * Use '' | ||
+ | |||
+ | ===== Computer/ | ||
+ | * 6x nodes | ||
+ | * 2x Xeon Gold 6130 CPU @ 2.10GHz | ||
+ | * 192G RAM | ||
+ | * 4x Nvidia GeForce RTX2080Ti | ||
+ | |||
+ | * 2x nodes | ||
+ | * 2x Xeon Gold 6130 CPU @ 2.10GHz (64 threads) | ||
+ | * 384G RAM | ||
+ | * 4x Nvidia Quadro RTX 8000 | ||
+ | |||
+ | * 3x nodes | ||
+ | * 2x AMD EPYC 7302 16-Core Processor | ||
+ | * 512G RAM | ||
+ | * 4x Nvidia A40 | ||
+ | |||
+ | * all: | ||
+ | * zfs mirror mounted at /local | ||
+ | * lz4 compression is enabled: this usually improves performance, since less data is read from and written to disk, at the cost of a small CPU overhead. | ||
+ | * As of right now there is no mechanism to clean up /local. At some point I'll probably put a find command in cron that deletes files older than 90 days or so. | ||
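The cleanup idea mentioned above could look roughly like the following. This is only a sketch, assuming a 90-day cutoff based on modification time; the function name `cleanup_old_files` is hypothetical and nothing like it is installed on the cluster yet.

```shell
#!/bin/sh
# Hypothetical cleanup helper (not the cluster's actual mechanism):
# delete regular files under a directory that were last modified
# more than MAX_AGE_DAYS days ago, printing each path as it is removed.
cleanup_old_files() {
    target_dir="$1"
    max_age_days="${2:-90}"
    # -xdev keeps find on one filesystem; -mtime +N matches files
    # whose modification time is more than N days in the past.
    find "$target_dir" -xdev -type f -mtime +"$max_age_days" -print -delete
}
```

Run from cron as something like `cleanup_old_files /local 90`, this would only touch files; empty directories would be left behind.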
+ | |||
+ | ===== Storage ===== | ||
+ | |||
+ | * ai-storage1: | ||
+ | * 41T total storage | ||
+ | * uplink to cluster network: 2x 25G | ||
+ | * / | ||
+ | * 20G quota per user. | ||
+ | * / | ||
+ | * Lives on the home directory server. | ||
+ | * The idea is to create a dataset with a quota for people to use. | ||
+ | * The usual LDAP groups that are available everywhere else (e.g. jonaslab, sandlab) would control access to these directories. | ||
+ | |||
+ | * ai-storage2: | ||
+ | * 41T total storage | ||
+ | * uplink to cluster network: 2x 25G | ||
+ | * / | ||
+ | * Eventually data will be automatically deleted after a set period, possibly 90 days or whatever we determine makes sense. | ||
+ | |||
+ | * ai-storage3: | ||
+ | * zfs mirror with previous snapshots of ' | ||
+ | * NOT a backup. | ||
+ | |||
+ | |||
+ | |||
+ | ====== Login ====== | ||
+ | |||
+ | Anyone with a CS account who has sent in a ticket requesting access is allowed to log in. | ||
+ | |||
+ | There is a set of front-end (FE) nodes that give you access to the Slurm cluster. You connect through these nodes, and you must be on them to submit jobs to the cluster. | ||
+ | |||
+ | ssh cnetid@fe.ai.cs.uchicago.edu | ||
+ | ==== File Transfer ==== | ||
+ | You will use the FE nodes to transfer your files onto the cluster storage infrastructure. The network connections on those nodes are 2x 10G each. | ||
+ | |||
+ | === Quota === | ||
+ | * By default users are given a quota of 20G. | ||
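To see roughly how close a directory is to the 20G quota, something like the sketch below may help. It assumes `du` is a fair approximation of what the file server counts, which may not hold exactly; the function name `check_quota` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: compare a directory's disk usage against a quota.
# The 20 GiB default mirrors the quota stated above; du is only an
# approximation of what the storage server actually accounts for.
check_quota() {
    dir="${1:-$HOME}"
    quota_gib="${2:-20}"
    used_kib=$(du -sk "$dir" | cut -f1)
    quota_kib=$(( quota_gib * 1024 * 1024 ))
    echo "used ${used_kib} KiB of ${quota_kib} KiB"
    # Exit status is non-zero when usage meets or exceeds the quota.
    [ "$used_kib" -lt "$quota_kib" ]
}
```

Called with no arguments, it checks `$HOME` against 20 GiB.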
+ | |||
+ | ====== Demo ====== | ||
+ | |||
+ | kauffman3 is my CS test account. | ||
+ | |||
+ | < | ||
+ | $ ssh kauffman3@fe.ai.cs.uchicago.edu | ||
+ | </ | ||
+ | I've created a couple of scripts that run some of the Slurm commands with more useful output. cs-sinfo and cs-squeue are the only two right now. | ||
+ | < | ||
+ | kauffman3@fe01: | ||
+ | NODELIST | ||
+ | a[001-006] | ||
+ | a[007-008] | ||
+ | </ | ||
+ | < | ||
+ | kauffman3@fe01: | ||
+ | JOBID | ||
+ | </ | ||
+ | |||
+ | # List the device number of the devices I've requested from Slurm. | ||
+ | # These numbers map to / | ||
+ | < | ||
+ | kauffman3@fe01: | ||
+ | # | ||
+ | hostname | ||
+ | echo $CUDA_VISIBLE_DEVICES | ||
+ | </ | ||
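Inside a job it can be handy to turn the comma-separated value of `CUDA_VISIBLE_DEVICES` into a list or a count, for example to decide how many worker processes to start. A minimal sketch follows; the function names are hypothetical.

```shell
#!/bin/sh
# Print each index from a comma-separated device list on its own line.
# Falls back to $CUDA_VISIBLE_DEVICES when no argument is given.
list_gpu_indices() {
    devices="${1:-${CUDA_VISIBLE_DEVICES:-}}"
    [ -n "$devices" ] || return 0
    # Replace commas with newlines so each index is a separate line.
    printf '%s\n' "$devices" | tr ',' '\n'
}

# Count how many devices are in the list.
count_gpus() {
    list_gpu_indices "$@" | wc -l | tr -d ' '
}
```

With the value `0,1,2,3` shown in the demo output, `count_gpus` reports 4.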
+ | |||
+ | Give me all four GPUs on systems 1-6 | ||
+ | < | ||
+ | kauffman3@fe01: | ||
+ | a001 | ||
+ | 0,1,2,3 | ||
+ | a002 | ||
+ | 0,1,2,3 | ||
+ | a006 | ||
+ | 0,1,2,3 | ||
+ | a005 | ||
+ | 0,1,2,3 | ||
+ | a004 | ||
+ | 0,1,2,3 | ||
+ | a003 | ||
+ | 0,1,2,3 | ||
+ | </ | ||
+ | # give me all GPUs on systems 7-8 | ||
+ | # these are the Quadro RTX 8000s | ||
+ | < | ||
+ | kauffman3@fe01: | ||
+ | a008 | ||
+ | 0,1,2,3 | ||
+ | a007 | ||
+ | 0,1,2,3 | ||
+ | </ | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | ====== Asked Questions ====== | ||
+ | |||
+ | > Do we have a max job runtime? | ||
+ | |||
+ | Yes: 4 hours. The limit is set per partition. You are expected to write your code to accommodate this. | ||
+ | |||
+ | < | ||
+ | PartitionName=geforce Nodes=a[001-006] Default=YES DefMemPerCPU=2900 MaxTime=04: | ||
+ | =YES | ||
+ | PartitionName=quadro | ||
+ | YES | ||
+ | </ | ||
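One way to accommodate the 4-hour limit is to make jobs resumable: work in small steps, record progress in a checkpoint file, and stop cleanly before the time runs out so a resubmitted job picks up where the last one stopped. The sketch below illustrates that pattern only; the function name, the checkpoint format (a single step counter), and the placeholder work step are all assumptions.

```shell
#!/bin/sh
# Hypothetical resumable loop: the checkpoint file holds the number of the
# last completed step. The loop stops once the wall-clock budget is used up.
run_with_budget() {
    checkpoint="$1"      # file holding the last completed step
    total_steps="$2"     # number of steps in the whole task
    budget_seconds="$3"  # stop once this much wall-clock time has elapsed
    start=$(date +%s)
    step=0
    if [ -f "$checkpoint" ]; then
        step=$(cat "$checkpoint")
    fi
    while [ "$step" -lt "$total_steps" ]; do
        now=$(date +%s)
        if [ $(( now - start )) -ge "$budget_seconds" ]; then
            echo "budget exhausted at step $step"
            return 1
        fi
        # Placeholder for one unit of real work (e.g. one training epoch).
        step=$(( step + 1 ))
        echo "$step" > "$checkpoint"
    done
    echo "finished all $total_steps steps"
}
```

Four hours is 14400 seconds, so a budget somewhat below that (say 13800) leaves a margin for the last step to finish and for the checkpoint to be written.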
+ | |||
+ | |||
+ | ===== Jupyter Notebook Tips ===== | ||
+ | ==== Batch ==== | ||
+ | The process for a batch job is very similar. | ||
+ | |||
+ | jupyter-notebook.sbatch | ||
+ | < | ||
+ | # | ||
+ | unset XDG_RUNTIME_DIR | ||
+ | NODEIP=$(hostname -i) | ||
+ | NODEPORT=$(( $RANDOM + 1024)) | ||
+ | echo "ssh command: ssh -N -L 8888: | ||
+ | . ~/ | ||
+ | jupyter-notebook --ip=$NODEIP --port=$NODEPORT --no-browser | ||
+ | </ | ||
+ | |||
+ | Check the output of your job to find the ssh command to use when accessing your notebook. | ||
+ | |||
+ | Make a new ssh connection to tunnel your traffic. The format will be something like: | ||
+ | |||
+ | '' | ||
+ | |||
+ | This command will appear to hang because the -N option tells ssh not to run any commands, including a shell, on the remote machine. | ||
+ | |||
+ | Open your local browser and visit: '' | ||
+ | ==== Interactive ==== | ||
+ | - '' | ||
+ | - '' | ||
+ | - '' | ||
+ | - '' | ||
+ | - '' | ||
+ | - '' | ||
+ | - Make a new ssh connection with a tunnel to access your notebook | ||
+ | - '' | ||
+ | - This will make an ssh tunnel on your local machine that forwards traffic sent to '' | ||
+ | - Open your local browser and visit: '' | ||
+ | |||
+ | Copy the following code snippet directly to the interactive node: | ||
+ | < | ||
+ | unset XDG_RUNTIME_DIR | ||
+ | NODEIP=$(hostname -i) | ||
+ | NODEPORT=$(( $RANDOM + 1024)) | ||
+ | echo "ssh command: ssh -N -L 8888: | ||
+ | jupyter-notebook --ip=$NODEIP --port=$NODEPORT --no-browser | ||
+ | </ | ||
+ | |||
+ | ====== Contribution Policy | ||
+ | This section can be ignored by most people. [[techstaff: |
/var/lib/dokuwiki/data/attic/slurm/ai.1609970505.txt.gz · Last modified: 2021/01/06 16:01 by kauffman