====== AI Cluster ======
The information from the older cluster mostly applies and I suggest you read that documentation:
+ | |||
+ | ====== Infrastructure ====== | ||
+ | Summary of nodes installed on the cluster | ||
+ | |||
+ | ===== Computer/ | ||
+ | * 6x nodes | ||
+ | * 2x Xeon Gold 6130 CPU @ 2.10GHz (64 threads) | ||
+ | * 192G RAM | ||
+ | * 4x Nvidia GeForce RTX2080Ti | ||
+ | |||
  * 2x nodes
    * 2x Xeon Gold 6130 CPU @ 2.10GHz (64 threads)
    * 384G RAM
    * 4x Nvidia Quadro RTX 8000

  * All nodes:
    * zfs mirror mounted at ''/local''
    * Compression set to lz4: usually a performance gain, since less data is read from and written to disk, at a small cost in CPU usage.
    * As of right now there is no mechanism to clean up ''/local''. At some point I'll probably put a find command in cron that deletes files older than 90 days or so.
===== Storage =====

  * ai-storage1:
    * 41T total storage
    * uplink to cluster network: 2x 25G
    * /
      * We intend to set user quotas; however, there are no quotas right now.
    * /
      * Lives on the home directory server.
      * Idea would be to create a dataset with a quota for people to use.
      * Normal LDAP groups that you are used to and available everywhere else would control access to these directories, e.g. jonaslab, sandlab.

  * ai-storage2:
    * 41T total storage
    * uplink to cluster network: 2x 25G
    * /
      * Eventually data will be auto-deleted after X amount of time. Maybe 90 days or whatever we determine makes sense.

  * ai-storage3:
    * zfs mirror with previous snapshots of
    * NOT a backup.
    * Not enabled yet.

====== Login ======
There are a set of front-end nodes that give you access to the Slurm cluster. You will connect through these nodes and need to be on them to submit jobs to the cluster.

  ssh cnetid@fe.ai.cs.uchicago.edu

  * Requires a CS account.
==== File Transfer ====
You will use the FE nodes to transfer your files onto the cluster storage infrastructure. The network connections on those nodes are 2x 10G each.
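For example, files can be copied through an FE node with ''scp'' or ''rsync''. The filenames and destination paths below are placeholders, not documented locations.

```shell
# Hypothetical examples; replace cnetid and the paths with your own.
# Copy a single file to your home directory on the cluster:
scp results.tar.gz cnetid@fe.ai.cs.uchicago.edu:~/
# Sync a whole directory; -a preserves attributes, -P shows progress
# and allows resuming interrupted transfers:
rsync -avP dataset/ cnetid@fe.ai.cs.uchicago.edu:~/dataset/
```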
+ | |||
+ | === Quota === | ||
+ | * By default users are given a quota of 20G. | ||
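To see how much of that 20G you are using, something like ''du'' works from any FE node (this assumes the quota applies to your home directory, which the storage notes above suggest but do not state):

```shell
# Report the total size of your home directory; compare against the 20G quota.
du -sh ~
```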
====== Demo ======
</code>
==== Notes on CUDA_VISIBLE_DEVICES ====
''CUDA_VISIBLE_DEVICES'' is set for your job by Slurm, and the GPU numbering it contains is relative to your allocation.
  * This variable should NOT be modified. Ever.
  * Relative means that if you requested one GPU it will show up as 0, even if all other GPUs on the server are being used by others.
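A quick way to see the relative numbering (a hypothetical invocation; any partition or resource options beyond the GPU request depend on this cluster's configuration):

```shell
# Request one GPU and print what the job sees. Regardless of which
# physical GPU Slurm assigns, it appears to the job as device 0.
srun --gres=gpu:1 bash -c 'echo $CUDA_VISIBLE_DEVICES'
# typically prints: 0
```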
===== Fairshare/Accounts =====
By default all usage is tracked and charged to a user's default account. A fairshare value is computed and used in prioritizing a job on submission.

Details are being worked out for anyone that donates to the cluster. This will be some sort of tiered system where you get to use a higher priority when you need it. You will need to charge an account other than your default to use that higher priority.
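Charging a non-default account would presumably use Slurm's standard mechanism; the flag is stock Slurm, but the account name below is a placeholder, not a real account on this cluster.

```shell
# Submit a batch job charged to a specific (non-default) Slurm account.
# "donorlab" and job.sbatch are placeholder names.
sbatch --account=donorlab job.sbatch
```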
</code>
+ | |||
+ | ===== Jupyter Notebook Tips ===== | ||
+ | ==== Batch ==== | ||
+ | ? | ||
+ | |||
+ | ==== Interactive ==== | ||
+ | - '' | ||
+ | - '' | ||
+ | - '' | ||
+ | - '' | ||
+ | - '' | ||
+ | - '' | ||
+ | - Make a new ssh connection with a tunnel to access your notebook | ||
+ | - '' | ||
+ | - This will make an ssh tunnel on your local machine that fowards traffic sent to '' | ||
+ | - Open your local browser and visit: '' | ||
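Several of the commands in the steps above did not survive intact, so here is a hedged reconstruction of the usual pattern. The port (8888), the ''COMPUTE_NODE'' placeholder, and the exact ''srun''/''jupyter'' options are assumptions, not this cluster's documented values.

```shell
# 1. Log in to a front-end node.
ssh cnetid@fe.ai.cs.uchicago.edu

# 2. Start an interactive session with a GPU; note which node you land on.
srun --pty --gres=gpu:1 bash
hostname    # remember this for the tunnel below

# 3. Start the notebook without a browser; note the port and token it prints.
jupyter notebook --no-browser --ip=0.0.0.0 --port=8888

# 4. From your LOCAL machine, open a tunnel through the front end
#    (replace COMPUTE_NODE with the hostname from step 2).
ssh -L 8888:COMPUTE_NODE:8888 cnetid@fe.ai.cs.uchicago.edu

# 5. Browse to http://localhost:8888 and paste the token when asked.
```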
/var/lib/dokuwiki/data/pages/techstaff/aicluster.txt · Last modified: 2021/01/06 16:11 by kauffman