====== AI Cluster - Slurm ======

[[slurm:ai|This page has moved]]

Cluster is up and running now. People involved in the cluster creation should go check it out.

The nodes aren't locked down yet, so anyone with a CS account can log in; feel free to get some students to check it out. For now, you are responsible for communicating with anyone you bring on.

Any communication should be done via this email chain or in the #ai-cluster Discord channel.

===== TODO =====

Since I'm still working on it, I don't guarantee any uptime yet. Mainly I need to make sure TRES tracking is working like we want. This will involve restarting slurmd and slurmctld, which will kill running jobs.

  * groups (Slurm 'Account')
    * e.g. ericj_group: assign extra fairshare priority
    * All users in this group get to use the extra fairshare priority.
    * I think both systems can exist simultaneously. Testing is required.
  * grab fairshare data from somewhere (gsheet or some kind of DB)
  * research a Slurm plugin to force GRES selection on job submit. Might be able to use:
    * SallocDefaultCommand
    * Otherwise look for '...'
    * jobs that do not specify a specific GPU type (e.g. gpu:rtx8000 or gpu:...)
  * ganglia for Slurm: http://ai-mgmt1.ai.cs.uchicago.edu
    * figure out why summary view is no longer a thing
  * set up backups for home dirs

====== Demo ======

kauffman3 is my test CS account.

<code>
$ ssh kauffman3@fe.ai.cs.uchicago.edu
</code>

I've created a couple of scripts that run some of the Slurm commands but with more useful output, cs-sinfo and cs-squeue being the only two right now.

<code>
kauffman3@fe01:~$ cs-sinfo
NODELIST
a[001-006]
a[007-008]
</code>

<code>
kauffman3@fe01:~$ cs-squeue
JOBID
</code>

# List the device number of the devices I've requested from Slurm.
# These numbers map to /
<code>
#!/bin/bash
hostname
echo $CUDA_VISIBLE_DEVICES
</code>
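Inside a job script you can also count how many GPUs Slurm actually granted by splitting ''CUDA_VISIBLE_DEVICES'' on commas. A minimal sketch; the hard-coded value stands in for what Slurm would normally set:

```shell
# Count the GPUs Slurm handed to this job by splitting
# CUDA_VISIBLE_DEVICES on commas.
CUDA_VISIBLE_DEVICES="0,1,2,3"   # normally set by Slurm; hard-coded here for illustration
IFS=',' read -ra devices <<< "$CUDA_VISIBLE_DEVICES"
echo "${#devices[@]}"            # number of allocated GPUs
```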

Give me all four GPUs on systems 1-6:
<code>
kauffman3@fe01:
a001
0,1,2,3
a002
0,1,2,3
a006
0,1,2,3
a005
0,1,2,3
a004
0,1,2,3
a003
0,1,2,3
</code>

Give me all GPUs on systems 7-8 (these are the Quadro RTX 8000s):
<code>
kauffman3@fe01:
a008
0,1,2,3
a007
0,1,2,3
</code>
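The interactive examples above have a batch equivalent. A sketch of a job script, assuming the ''geforce'' partition and a generic ''gpu'' GRES name; the filename and job name are made up:

```shell
# Write a batch job script roughly equivalent to the interactive GPU requests above.
# The partition name "geforce" comes from this page; everything else is illustrative.
cat > gpu-job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=gpu-demo
#SBATCH --partition=geforce
#SBATCH --gres=gpu:4
#SBATCH --time=04:00:00
hostname
echo "$CUDA_VISIBLE_DEVICES"
EOF
# Submit with: sbatch gpu-job.sh
```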

===== Fairshare =====

# Check out the fairshare values
<code>
kauffman3@fe01:
-------------------- ---------- ---------- ----------- ----------- ----------- ------------- ---------- ---------- ------------------------------ ------------------------------
kauffman3
kauffman4
</code>

We are using Fair Tree as the fairshare algorithm.

As the system exists now, there is one Account per User:

<code>
User: kauffman
</code>

We will probably assign fairshare points to accounts, not users.
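To compare the factors across accounts, the fairshare listing can be sorted by its fairshare column. A sketch over a made-up two-line sample; the column layout (account, raw shares, fairshare factor) is an assumption, not the real output format:

```shell
# Sort accounts by fairshare factor, highest first.
# The sample data below is invented for illustration.
cat > sshare-sample.txt <<'EOF'
kauffman3 100 0.500000
kauffman4 100 0.250000
EOF
sort -k3,3 -rn sshare-sample.txt | awk '{print $1}'
```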

====== Storage ======

/
Lives on the home directory server.
Idea would be to create a dataset with a quota for people to use.
Normal LDAP groups that you are used to, and available everywhere else, would control access to these directories (e.g. jonaslab, sandlab).

Currently there is no quota on home directories. Quota is set per user, per dataset.

I was able to get homes and scratch each connected via 2x 25G. Both are SSD only, so the storage should be FAST.

Each compute node (nodes with GPUs) has a ZFS mirror mounted at /local.
I set compression to lz4 by default. Usually this gives a performance gain, as less data is read from and written to disk, with a small overhead in CPU usage.
As of right now there is no mechanism to clean up /local. At some point I'll probably put a find command in cron that deletes files older than 90 days or so.
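The cron cleanup described above could be as simple as a find over /local. A sketch exercised against a throwaway directory rather than /local itself; the 90-day cutoff comes from the text:

```shell
# Delete files not modified in the last 90 days, as proposed for /local.
# Run against a scratch directory here instead of /local.
dir=$(mktemp -d)
touch -d '100 days ago' "$dir/old.dat"   # stale file: should be removed
touch "$dir/fresh.dat"                   # recent file: should survive
find "$dir" -type f -mtime +90 -delete
ls "$dir"
```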

====== Asked Questions ======

> Do we have a max job runtime?

Yes: 4 hours. This is set per partition.
<code>
PartitionName=geforce Nodes=a[001-006] Default=YES DefMemPerCPU=2900 MaxTime=04:00:00
PartitionName=quadro  Nodes=a[007-008]
</code>
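Since jobs asking for more than MaxTime are rejected at submit time, it can help to sanity-check a requested HH:MM:SS limit locally first. A sketch; the helper function is illustrative and not part of Slurm:

```shell
# Convert an HH:MM:SS request to seconds and compare against the 4-hour MaxTime.
# This helper is illustrative, not a Slurm tool.
fits_maxtime() {
  local req="$1" max=$((4 * 3600))
  local h m s
  IFS=: read -r h m s <<< "$req"
  # 10# forces base-10 so fields like "08" and "09" aren't read as octal
  local secs=$((10#$h * 3600 + 10#$m * 60 + 10#$s))
  [ "$secs" -le "$max" ]
}
fits_maxtime 03:59:00 && echo "accepted"
fits_maxtime 05:00:00 || echo "would exceed MaxTime"
```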

You can take a look at all the values we set here:

fe0[1,2]$ cat /

The man page: https://
/var/lib/dokuwiki/data/attic/techstaff/aicluster.1603122604.txt.gz · Last modified: 2020/10/19 10:50 by kauffman