| **debug** | The partition your job will be submitted to if none is specified. The purpose of this partition is to make sure your code is running as it should before submitting a long running job to the general queue. |
| **general** | All jobs that have been thoroughly tested can be submitted here. This partition will have access to more nodes and will process most of the jobs. If you need to use the ''%%--exclusive%%'' flag it should be done here. |
| **pascal** | 2018-05-04: 1x Nvidia GTX1080. You will be forced to use this server exclusively for now. Please keep your time in interactive mode to a minimum. |
| **titan** | 2018-05-04: 4x Nvidia GTX1080Ti. This partition is shared and you MUST use the ''%%--gres%%'' option to specify the resources you wish to use. You are also encouraged to specify CPU and memory requirements (see the example batch script below). |
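
The following is a minimal sketch of a batch script targeting the ''%%titan%%'' partition. The resource values (GPU count, CPUs, memory, time) are placeholders rather than recommendations; adjust them to your job and submit the script with ''%%sbatch%%'' as shown in the Job Submission table below.
<code>
#!/bin/bash
#SBATCH --partition=titan
#SBATCH --gres=gpu:1        # number of GPUs; required on this partition
#SBATCH --cpus-per-task=4   # placeholder CPU request
#SBATCH --mem=8G            # placeholder memory request
#SBATCH --time=01:00:00     # placeholder time limit

# Replace with your actual workload.
hostname
</code>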
  
====== Job Submission ======
| ^ SLURM ^ Example ^
^ Submit a batch serial job | sbatch | sbatch runscript.sh |
^ Run a script interactively | srun | srun --pty -p interact -t 10 --mem 1000 \\ /bin/bash \\ /bin/hostname |
^ Kill a job | scancel | scancel 4585 |
^ View status of queues | squeue | squeue -u cnetid |
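
As a quick illustration, a typical submit/check/cancel cycle looks like the following (the hostname, job ID, and output are illustrative, not real cluster output):
<code>
user@bulldozer:~$ sbatch runscript.sh
Submitted batch job 4585
user@bulldozer:~$ squeue -u cnetid
user@bulldozer:~$ scancel 4585
</code>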
  
====== Using the GPU ======
[[ techstaff:cuda | Environment Variables and more ]]
===== CUDA_VISIBLE_DEVICES =====
Do not set this variable. It will be set for you by SLURM.

The variable name is somewhat misleading: it does NOT contain the number of devices, but rather the physical device numbers assigned by the kernel (e.g. /dev/nvidia2).

For example, if you requested multiple GPUs from SLURM (''%%--gres=gpu:2%%''), the CUDA_VISIBLE_DEVICES variable should contain two numbers (0-3 in this case) separated by a comma (e.g. 1,3).
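
A quick way to see what SLURM assigned to your job is to echo the variable from inside an allocation (the output shown is illustrative; the numbers depend on which GPUs happened to be free):
<code>
user@bulldozer:~$ srun -p titan --gres=gpu:2 bash -c 'echo $CUDA_VISIBLE_DEVICES'
1,3
</code>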
  
===== GRES: Multiple GPUs on one system =====
Example when using TensorFlow:
  
Given the file ''%%f%%'':
<code>
#!/usr/bin/env python3
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
</code>
  
Here we can see that no GPU was allocated to us because we did not specify the ''%%--gres%%'' option:
<code>
user@bulldozer:~$ srun -p titan --pty /bin/bash
user@gpu3:~$ ./f 2>&1 | grep physical_device_desc
user@gpu3:~$
</code>
  
If we request only 1 GPU:
<code>
user@bulldozer:~$ srun -p titan --pty --gres=gpu:1 /bin/bash
user@gpu3:~$ ./f 2>&1 | grep physical_device_desc
physical_device_desc: "device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:19:00.0, compute capability: 6.1"
</code>
  
If we request 2 GPUs:
<code>
user@bulldozer:~$ srun -p titan --pty --gres=gpu:2 /bin/bash
user@gpu3:~$ ./f 2>&1 | grep physical_device_desc
physical_device_desc: "device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:19:00.0, compute capability: 6.1"
physical_device_desc: "device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:1a:00.0, compute capability: 6.1"
</code>
  
If we request more GPUs than are available:
<code>
user@bulldozer:~$ srun -p titan --pty --gres=gpu:5 /bin/bash
srun: error: Unable to allocate resources: Requested node configuration is not available
</code>
  
GRES: Don't depend on this being accurate; however, it will definitely give you a clue as to how many generic resources are in a partition.
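
For example, one way to glance at the per-partition GRES is ''%%sinfo%%'' with a custom output format (the output below is illustrative, based on the titan partition described above):
<code>
$ sinfo -p titan -o "%P %N %G"
PARTITION NODELIST GRES
titan gpu3 gpu:4
</code>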
  

==== Checking how many Generic RESources are being consumed ====

Simply use the ''%%-O%%'' option for ''%%squeue%%'' and you can see how many generic resources any particular job is consuming.
<code>
$ squeue -O username,nodelist,gres
USER                NODELIST            GRES
someusername        gpu3                gpu:1
otherusername       gpu3                gpu:3
...
</code>
  
  
===== Paths =====
You will need to add the following to your ''%%$PATH%%'' and ''%%$LD_LIBRARY_PATH%%''.
  
  export PATH=$PATH:/usr/local/cuda/bin
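
The matching library path line is typically similar to the one below; the exact CUDA library directory is an assumption, so verify it on the node (e.g. by listing ''%%/usr/local/cuda%%''):

  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64   # assumed path; adjust to the actual CUDA install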