====== AI Cluster - Slurm ======
The cluster is up and running. Anyone with a CS account who wishes to test it out should do so.

Feedback is requested: [[https:// ]]

Knowledge of how to use Slurm is preferred at this stage of testing.

The information from the older cluster mostly applies, and I suggest you read that documentation.
====== Infrastructure ======
Summary of nodes installed on the cluster.

===== Compute/GPU Nodes =====
  * 6x nodes
    * 2x Xeon Gold 6130 CPU @ 2.10GHz (64 threads)
    * 192G RAM
    * 4x Nvidia GeForce RTX 2080 Ti
  * 2x nodes
    * 2x Xeon Gold 6130 CPU @ 2.10GHz (64 threads)
    * 384G RAM
    * 4x Nvidia Quadro RTX 8000
  * all nodes:
    * ZFS mirror mounted at /local
    * lz4 compression: usually a performance gain, since less data is read from and written to disk, at a small cost in CPU usage.
    * As of right now there is no mechanism to clean up /local. At some point a find command will probably run from cron to delete files older than 90 days or so.
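The cleanup job described above can be sketched as a small shell helper. The ''/local'' path and the 90-day threshold are the tentative values from the text, not a configured policy:

```shell
# clean_local DIR [DAYS]: list files under DIR last modified more than
# DAYS days ago (default 90). A sketch of the possible cron cleanup
# mentioned above; swap -print for -delete to actually remove files.
clean_local() {
    dir=$1
    days=${2:-90}
    # -xdev keeps find from crossing into other mounted filesystems.
    find "$dir" -xdev -type f -mtime "+$days" -print
}
```

A cron entry could then run something like ''clean_local /local 90'' nightly.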
===== Storage =====

  * ai-storage1:
    * 41T total storage
    * uplink to cluster network: 2x 25G
    * /
      * We intend to set user quotas; however, there are no quotas right now.
    * /
      * Lives on the home directory server.
      * The idea is to create a dataset with a quota for people to use.
      * The normal LDAP groups that you are used to, available everywhere else, would control access to these directories, e.g. jonaslab, sandlab.
  * ai-storage2:
    * 41T total storage
    * uplink to cluster network: 2x 25G
    * /
      * Data will eventually be deleted automatically after X amount of time, maybe 90 days or whatever we determine makes sense.
  * ai-storage3:
    * ZFS mirror with previous snapshots of '…'
    * NOT a backup.
    * Not enabled yet.
====== Demo ======

kauffman3 is my CS test account.

<code>
$ ssh kauffman3@fe.ai.cs.uchicago.edu
</code>

I've created a couple of scripts that run some of the Slurm commands but with more useful output; cs-sinfo and cs-squeue are the only two right now.

<code>
kauffman3@fe01:~$ cs-sinfo
NODELIST
a[001-006]
a[007-008]
</code>

<code>
kauffman3@fe01:~$ cs-squeue
JOBID
</code>

# List the device numbers of the devices I've requested from Slurm.
# These numbers map to /
<code>
kauffman3@fe01:~$
#!/bin/bash
hostname
echo $CUDA_VISIBLE_DEVICES
</code>
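Inside a job, the comma-separated list in ''$CUDA_VISIBLE_DEVICES'' can be inspected directly, for example to count how many GPUs Slurm granted. A hypothetical helper, not part of the cluster tooling:

```shell
# gpu_count: print how many GPUs Slurm exposed to this job by counting
# the entries in CUDA_VISIBLE_DEVICES (e.g. "0,1,2,3" gives 4).
gpu_count() {
    devices=${CUDA_VISIBLE_DEVICES:-}
    if [ -z "$devices" ]; then
        echo 0
        return
    fi
    # Commas become newlines; wc -l then counts one device per line.
    printf '%s\n' "$devices" | tr ',' '\n' | wc -l
}
```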
Give me all four GPUs on systems 1-6:
<code>
kauffman3@fe01:~$
a001
0,1,2,3
a002
0,1,2,3
a006
0,1,2,3
a005
0,1,2,3
a004
0,1,2,3
a003
0,1,2,3
</code>

# Give me all GPUs on systems 7-8.
# These are the Quadro RTX 8000s.
<code>
kauffman3@fe01:~$
a008
0,1,2,3
a007
0,1,2,3
</code>
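The interactive runs above can also be wrapped in a batch script. A minimal sketch, assuming this cluster's geforce/quadro partition names; the job name and time limit are placeholders:

```shell
#!/bin/bash
# Minimal batch-job sketch (job name and time limit are placeholders;
# the partition and GPU count match the Quadro nodes listed above).
#SBATCH --job-name=gpu-demo
#SBATCH --partition=quadro     # a[007-008], the Quadro RTX 8000 nodes
#SBATCH --gres=gpu:4           # request all four GPUs on one node
#SBATCH --time=01:00:00        # must fit under the partition's MaxTime

hostname
echo "$CUDA_VISIBLE_DEVICES"
```

Submit it with ''sbatch'' and watch it with ''squeue''; the #SBATCH lines are comments to the shell, so the same file also runs directly for testing.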
====== Asked Questions ======

> Do we have a max job runtime?

Yes, 4 hours:

<code>
PartitionName=geforce Nodes=a[001-006] Default=YES DefMemPerCPU=2900 MaxTime=04:00:00 ...=YES
PartitionName=quadro ...YES
</code>
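MaxTime above uses Slurm's HH:MM:SS time format. To sanity-check a requested ''--time'' against the 4-hour limit, the string can be converted to seconds in plain shell (the helper name is made up for illustration; Slurm enforces the limit server-side regardless):

```shell
# to_seconds HH:MM:SS -> total seconds, e.g. 04:00:00 is 14400.
to_seconds() {
    IFS=: read -r h m s <<EOF
$1
EOF
    # 10# forces base-10 so fields like "08" are not read as octal.
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}
```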
/var/lib/dokuwiki/data/attic/techstaff/aicluster.1605117515.txt.gz · Last modified: 2020/11/11 11:58 by kauffman