slurm
Differences

This shows you the differences between two versions of the page.

Previous revision: slurm [2021/01/19 09:39] – [Clusters] kauffman
Current revision: slurm [2025/06/30 17:59] (current) – amcguire
==== Discord ====

There is a dedicated text channel ''

===== Clusters =====
==== Peanut Cluster ====

Think of these machines as a dumping ground for discrete computing tasks that might be rude or disruptive to execute on the main (shared) shell servers (i.e.,

Additionally,
===== Where to begin =====

Slurm is a set of command line utilities that can be accessed via the command line from **most** any computer science system you can log in to. Using our main shell servers (''linux.cs.uchicago.edu'') is expected to be our most common use case, so you should start there.

  ssh user@linux.cs.uchicago.edu

If you want to use the AI Cluster you will need to have previously requested access by sending in a ticket. Afterwards, you may log in to:

  ssh user@fe.ai.cs.uchicago.edu
=== Default Quotas ===

By default we set a job to be run on one CPU and allocate 100MB of RAM. If you require more than that you should specify what you need. Using the following options will do: ''
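The option names themselves are cut off above. As a sketch, two Slurm options commonly used to raise these limits are ''--cpus-per-task'' and ''--mem-per-cpu''; the job name, program, and values below are illustrative, not site policy:

```shell
#!/bin/bash
#SBATCH --job-name=bigger-job   # illustrative job name
#SBATCH --cpus-per-task=4       # request 4 CPUs instead of the default 1
#SBATCH --mem-per-cpu=500       # request 500MB of RAM per CPU (default allocation is 100MB)

./my-program                    # hypothetical program to run
```

The same flags can also be passed directly on an ''srun'' or ''sbatch'' command line instead of in the batch script header.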
| + | |||
| + | === MPI Usage === | ||
| + | The AI cluster supports the use of MPI. The following example illustrates its basic use. | ||
| + | |||
| + | < | ||
| + | amcguire@fe01: | ||
| + | #include < | ||
| + | #include < | ||
| + | #include < | ||
| + | |||
| + | int main(int argc, char **argv) { | ||
| + | // Initialize MPI | ||
| + | MPI_Init(& | ||
| + | |||
| + | // Get the number of processes in the global communicator | ||
| + | int count; | ||
| + | MPI_Comm_size(MPI_COMM_WORLD, | ||
| + | |||
| + | // Get the rank of the current process | ||
| + | int rank; | ||
| + | MPI_Comm_rank(MPI_COMM_WORLD, | ||
| + | |||
| + | // Get the current hostname | ||
| + | char hostname[1024]; | ||
| + | gethostname(hostname, | ||
| + | |||
| + | // Print a hello world message for this rank | ||
| + | printf(" | ||
| + | |||
| + | // Finalize the MPI environment before exiting | ||
| + | MPI_Finalize(); | ||
| + | } | ||
| + | amcguire@fe01: | ||
| + | #!/bin/bash | ||
| + | #SBATCH -J mpi-hello | ||
| + | #SBATCH -n 2 # Number of processes | ||
| + | #SBATCH -t 0: | ||
| + | #SBATCH -o hello-job.out | ||
| + | |||
| + | # Disable the Infiniband transport for OpenMPI (not present on all clusters) | ||
| + | #export OMPI_MCA_btl=" | ||
| + | |||
| + | # Run the job (assumes the batch script is submitted from the same directory) | ||
| + | mpirun -np 2 ./mpi-hello | ||
| + | |||
| + | amcguire@fe01: | ||
| + | amcguire@fe01: | ||
| + | -rwxrwx--- 1 amcguire amcguire 16992 Jun 30 10:49 mpi-hello | ||
| + | amcguire@fe01: | ||
| + | Submitted batch job 1196702 | ||
| + | amcguire@fe01: | ||
| + | Hello from process 0 of 2 on host p001 | ||
| + | Hello from process 1 of 2 on host p002 | ||
| + | </ | ||
=== Exclusive access to a node ===

Please make sure you specify $CUDA_HOME, and if you want to take advantage of the CUDNN libraries you will need to append /
  cuda_version=11.1
  export CUDA_HOME=/
  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:
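The export lines above are truncated. A minimal sketch of the idea, assuming CUDA is installed under /usr/local/cuda-<version> (verify the actual install prefix on the node you land on):

```shell
# Assumed install prefix; check what actually exists under /usr/local on your node
cuda_version=11.1
export CUDA_HOME=/usr/local/cuda-${cuda_version}

# Make the CUDA shared libraries (and CUDNN, if installed alongside) findable at runtime
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${CUDA_HOME}/lib64
```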
The variable name is actually misleading, since it does NOT mean the number of devices, but rather the physical device number assigned by the kernel (e.g. /

For example: if you requested multiple GPUs from Slurm (--gres=gpu:

The numbering is relative and specific to you. For example: two users, each with one job requiring two GPUs, could be assigned non-sequential GPU numbers. However, CUDA_VISIBLE_DEVICES will look like this for both users: 0,1
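To see what your own job was handed, you can inspect the variable from inside the job. A small sketch, where the example value stands in for what Slurm would set for a two-GPU job:

```shell
#!/bin/bash
# Inside a real job Slurm sets this for you; the value here is an example
CUDA_VISIBLE_DEVICES="0,1"

# Split the comma-separated list into an array and count the entries
IFS=',' read -ra devices <<< "$CUDA_VISIBLE_DEVICES"
echo "Assigned ${#devices[@]} GPU(s): ${devices[*]}"
# prints: Assigned 2 GPU(s): 0 1
```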
STDOUT will look something like this:

<code>
cnetid@focal0:~$ cat $HOME/
Device Number: 0
  Device name: Tesla M2090
</code>
/var/lib/dokuwiki/data/attic/slurm.1611070759.txt.gz · Last modified: 2021/01/19 09:39 by kauffman