techstaff:slurm

Differences

This shows you the differences between two versions of the page.

techstaff:slurm [2018/05/03 09:53] – [Using the GPU] kauffman
techstaff:slurm [2018/05/04 12:38] kauffman
Line 247: Line 247:
  
 ===== GRES Multiple GPU's on one system =====
 -Jobs will not be allocated any generic resources unless specifically requested at job submit time using the --gres option supported by the salloc, sbatch and srun commands. The option requires an argument specifying which generic resources are required and how many resources. The resource specification is of the form name[:type:count]. The name is the same name as specified by the GresTypes and Gres configuration parameters. type identifies a specific type of that generic resource (e.g. a specific model of GPU). count specifies how many resources are required and has a default value of 1. For example:
 -sbatch --gres=gpu:kepler:2 ....
 +GRES: Generic Resource. As of 2018-05-04 these only include GPUs.
 +
 +Jobs will not be allocated any generic resources unless specifically requested at job submit time using the ''%%--gres%%'' option supported by the ''%%salloc%%'', ''%%sbatch%%'' and ''%%srun%%'' commands. The option requires an argument specifying which generic resources are required and how many. The resource specification is of the form ''%%name[:type:count]%%''. ''%%name%%'' is the same name as specified by the GresTypes and Gres configuration parameters. ''%%type%%'' identifies a specific type of that generic resource (e.g. a specific model of GPU). ''%%count%%'' specifies how many resources are required and has a default value of 1. For example:
 +<code>sbatch --gres=gpu:titan:2 ....</code>
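To make the ''%%name[:type:count]%%'' form concrete, here is a minimal Python sketch that splits a request string into its parts. The ''%%parse_gres%%'' helper and the ''%%GresRequest%%'' type are made up for this illustration and are not part of Slurm. The sketch only handles plain integer counts, and it treats a two-part value such as ''%%gpu:2%%'' as name plus count, which matches the examples on this page.

```python
from typing import NamedTuple, Optional


class GresRequest(NamedTuple):
    """One parsed generic-resource request (hypothetical helper type)."""
    name: str
    type: Optional[str]
    count: int


def parse_gres(spec: str) -> GresRequest:
    """Split a 'name[:type:count]' string into its parts.

    Illustration only: this is not Slurm's own parser, and it accepts
    only plain integer counts.
    """
    parts = spec.split(":")
    if len(parts) > 3 or any(part == "" for part in parts):
        raise ValueError(f"malformed gres spec: {spec!r}")

    name = parts[0]
    if len(parts) == 1:
        # Only a name was given, so the count defaults to 1.
        return GresRequest(name, None, 1)
    if len(parts) == 2:
        # 'gpu:2' is name plus count; 'gpu:titan' is name plus type.
        if parts[1].isdigit():
            return GresRequest(name, None, int(parts[1]))
        return GresRequest(name, parts[1], 1)
    if not parts[2].isdigit():
        raise ValueError(f"count must be an integer: {spec!r}")
    return GresRequest(name, parts[1], int(parts[2]))
```

For example, ''%%parse_gres("gpu:titan:2")%%'' gives name ''%%gpu%%'', type ''%%titan%%'' and count 2, while ''%%parse_gres("gpu")%%'' falls back to the default count of 1.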
  
 Jobs will be allocated specific generic resources as needed to satisfy the request. If the job is suspended, those resources do not become available for use by other jobs.
  
 -Job steps can be allocated generic resources from those allocated to the job using the --gres option with the srun command as described above. By default, a job step will be allocated all of the generic resources allocated to the job. If desired, the job step may explicitly specify a different generic resource count than the job. This design choice was based upon a scenario where each job executes many job steps. If job steps were granted access to all generic resources by default, some job steps would need to explicitly specify zero generic resource counts, which we considered more confusing. The job step can be allocated specific generic resources and those resources will not be available to other job steps. A simple example is shown below.
 +Job steps can be allocated generic resources from those allocated to the job using the ''%%--gres%%'' option with the ''%%srun%%'' command as described above. By default, a job step will be allocated all of the generic resources allocated to the job. If desired, the job step may explicitly specify a different generic resource count than the job. This design choice was based upon a scenario where each job executes many job steps. If job steps were granted access to all generic resources by default, some job steps would need to explicitly specify zero generic resource counts, which we considered more confusing. The job step can be allocated specific generic resources and those resources will not be available to other job steps. A simple example is shown below.
 + 
 +==== Ok, but I don't want to read the wall of text above ==== 
 +Fine. 
 + 
 +The ''%%--gres%%'' option (see ''%%man srun%%'') is required if you want to use a GPU.
 + 
 +<code> 
 +  --gres=gpu:N   # where 'N' is the number of GPUs requested.
 +                 # Please try to limit yourself to one GPU per person.
 +</code> 
 + 
 +Example when using tensorflow: 
 + 
 +Given the file 'f':
 +<code> 
 +#!/usr/bin/env python3 
 +from tensorflow.python.client import device_lib 
 +print(device_lib.list_local_devices()) 
 +</code> 
 + 
 +Here we can see that no GPU was allocated to us because we did not specify the ''%%--gres%%'' option:
 +<code> 
 +  kauffman3@bulldozer:~$ srun -p titan --pty /bin/bash 
 +  kauffman3@gpu3:~$ ./f 2>&1 | grep physical_device_desc 
 +  kauffman3@gpu3:~$ 
 +</code> 
 + 
 +If we request only 1 GPU:
 +<code> 
 +  kauffman3@bulldozer:~$ srun -p titan --pty --gres=gpu:1 /bin/bash 
 +  kauffman3@gpu3:~$ ./f 2>&1 | grep physical_device_desc 
 +  physical_device_desc: "device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:19:00.0, compute capability: 6.1" 
 +</code> 
 + 
 +If we request 2 GPUs:
 +<code>
 +  kauffman3@bulldozer:~$ srun -p titan --pty --gres=gpu:2 /bin/bash
 +  kauffman3@gpu3:~$ ./f 2>&1 | grep physical_device_desc
 +  physical_device_desc: "device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:19:00.0, compute capability: 6.1" 
 +  physical_device_desc: "device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:1a:00.0, compute capability: 6.1" 
 +</code> 
 + 
 +If we request more GPUs than are available:
 +<code> 
 +  kauffman3@bulldozer:~$ srun -p titan --pty --gres=gpu:5 /bin/bash 
 +  srun: error: Unable to allocate resources: Requested node configuration is not available 
 +</code> 
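
A lighter-weight check than importing tensorflow is to look at the ''%%CUDA_VISIBLE_DEVICES%%'' environment variable, which Slurm's GPU GRES support normally sets for the job. The sketch below is only an illustration under that assumption: ''%%visible_gpus%%'' is a hypothetical helper, and whether the variable is set at all when no GPU is requested depends on how the cluster is configured, so treat an empty result as a hint rather than proof.

```python
import os
from typing import List, Mapping


def visible_gpus(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the device IDs listed in CUDA_VISIBLE_DEVICES.

    An unset or empty variable is reported as an empty list. That is a
    simplification chosen for this sketch, not a statement about what
    the CUDA runtime will actually expose.
    """
    value = env.get("CUDA_VISIBLE_DEVICES", "")
    return [token.strip() for token in value.split(",") if token.strip()]


if __name__ == "__main__":
    gpus = visible_gpus()
    print(f"{len(gpus)} GPU(s) listed in CUDA_VISIBLE_DEVICES: {gpus}")
```

Run inside an ''%%srun ... --gres=gpu:2%%'' session, this would be expected to list two device IDs if the variable is set as assumed.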
 + 
 +==== Cool, but how do I know where and what resources are available ==== 
 +Turns out the ''%%sinfo%%'' command is super useful. 
 +<code> 
 +$ sinfo -O partition,nodelist,gres,features,available 
 +PARTITION           NODELIST            GRES                FEATURES            AVAIL                
 +debug*              slurm1              (null)              (null)              up                   
 +general             slurm[2-6,8]        (null)              (null)              up                   
 +pascal              gpu2                gpu:gtx1080:      'pascal,gtx1080'    up                   
 +titan               gpu3                gpu:gtx1080ti:    'pascal,gtx1080ti'  up  
 +</code> 
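
If you only care about which partitions advertise a generic resource, the table above is easy to filter with a script. The following Python sketch assumes the whitespace-separated column layout shown above, with no spaces inside a field; ''%%gres_by_partition%%'' is a hypothetical helper, and the sample text is copied from this page as-is, including the GRES values exactly as they appear here.

```python
from typing import Dict

# Sample output copied from this page (assumed layout: five columns).
SAMPLE = """\
PARTITION           NODELIST            GRES                FEATURES            AVAIL
debug*              slurm1              (null)              (null)              up
general             slurm[2-6,8]        (null)              (null)              up
pascal              gpu2                gpu:gtx1080:        'pascal,gtx1080'    up
titan               gpu3                gpu:gtx1080ti:      'pascal,gtx1080ti'  up
"""


def gres_by_partition(sinfo_output: str) -> Dict[str, str]:
    """Map partition name to its GRES column, skipping '(null)' rows.

    If a partition appears on several rows, the last row wins; a real
    script would probably want to collect all of them.
    """
    result: Dict[str, str] = {}
    lines = sinfo_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue
        partition = fields[0].rstrip("*")  # '*' marks the default partition
        gres = fields[2]
        if gres != "(null)":
            result[partition] = gres
    return result


if __name__ == "__main__":
    for partition, gres in gres_by_partition(SAMPLE).items():
        print(partition, gres)
```

To use it on a live cluster, pass in the text printed by the ''%%sinfo%%'' command shown above instead of ''%%SAMPLE%%''.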
 + 
 +FEATURES: This is just an arbitrary string from the node's definition in the configuration file, though techstaff tries to make it informative.
 + 
 +GRES: Don't depend on this being accurate, but it will give you a clue as to how many generic resources are in a partition.
  
  
/var/lib/dokuwiki/data/pages/techstaff/slurm.txt · Last modified: 2021/01/06 16:13 by kauffman
