techstaff:aicluster-admin
Differences
This shows you the differences between two versions of the page.
| techstaff:aicluster-admin [2020/12/02 10:43] – kauffman | techstaff:aicluster-admin [2021/02/23 19:58] (current) – kauffman | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| + | ====== AI Cluster Policy Description ====== | ||
| + | ===== TODO ===== | ||
| + | - There are multiple methods used to calculate priority reflected on the spreadsheet. | ||
| + | - Double check the math for correctness. | ||
| + | - Does the math reflect our intent? I think it does. | ||
| + | - Multiple methods in calculating priority are reflected (blue to purple cells: '' | ||
| + | - Choose one calculation to use. This decision doesn' | ||
| + | - If you have a suggestion: Please show us the work in a new column or by cloning the sheet. | ||
| + | |||
| + | ===== Contribution Tracking and Priority Calculation Tool ===== | ||
| + | |||
| + | [[https:// | ||
| + | |||
| + | AI Cluster committee members have access. | ||
| + | |||
| + | ==== Sheet usage ==== | ||
| + | * Red: Do not edit | ||
| + | * Green: user input (This will be Techstaff 95% of the time) | ||
| + | * `groups` sheet: | ||
| + | * contributions get assigned a POSIX group. Group must have a primary contact, who then gets to set members for that group. | ||
| + | * calculates contribution amount for use in `contrib-priority`. | ||
| + | * tracks group name and primary owner | ||
| + | * `log` sheet | ||
| + | * All contributions will get entered here. | ||
| + | * Hardware contributions get converted to USD by techstaff. A receipt of the purchase is a good starting point. | ||
| + | * The group ' | ||
| + | * `contrib-priority` calculation references contrib amounts calculated in `groups`. | ||
| + | |||
| + | |||
| + | |||
| + | |||
| + | ===== Understanding Slurm Fairshare and Priority/ | ||
| + | Slurm comes with built-in tools to calculate fair-share priorities without anyone needing to do anything special. The cluster uses a [[https:// | ||
| + | |||
| + | ==== How Slurm calculates Job priority ==== | ||
| + | Generally, Slurm uses this formula to determine a job's priority. | ||
| + | < | ||
| + | Job_priority = | ||
| + | site_factor + | ||
| + | (PriorityWeightAge) * (age_factor) + | ||
| + | (PriorityWeightAssoc) * (assoc_factor) + | ||
| + | (PriorityWeightFairshare) * (fair-share_factor) + | ||
| + | (PriorityWeightJobSize) * (job_size_factor) + | ||
| + | (PriorityWeightPartition) * (partition_factor) + | ||
| + | (PriorityWeightQOS) * (QOS_factor) + | ||
| + | SUM(TRES_weight_cpu * TRES_factor_cpu, | ||
| + | TRES_weight_< | ||
| + | ...) | ||
| + | - nice_factor | ||
| + | </ | ||
| + | |||
| + | The factors on the left that start with '' | ||
| + | |||
| + | < | ||
| + | fe01:~$ cat / | ||
| + | PriorityType=priority/ | ||
| + | PriorityDecayHalfLife=08: | ||
| + | PriorityMaxAge=5-0 | ||
| + | PriorityWeightFairshare=500000 | ||
| + | PriorityWeightPartition=100000 | ||
| + | PriorityWeightQOS=0 | ||
| + | PriorityWeightJobSize=0 | ||
| + | PriorityWeightAge=0 | ||
| + | PriorityFavorSmall=YES | ||
| + | </ | ||
| + | *Note that this example may not be up to date when you read this. | ||
| + | |||
| + | ===== How we modify job priority to favor contributors ===== | ||
| + | |||
| + | * We adjust those priorities with partitions for groups that have donated either money or hardware. Hardware donations get converted to a monetary value when logged on the spreadsheet. | ||
| + | * Every contribution gets assigned a POSIX group, hereon referred to as '' | ||
| + | |||
| + | |||
| + | Here is a version of the partition configuration as it stands now (2021-02-10). | ||
| + | < | ||
| + | PartitionName=general Nodes=a[001-008] | ||
| + | # | ||
| + | PartitionName=cdac-contrib Nodes=a[001-008] AllowGroups=cdac Priority=5 | ||
| + | </ | ||
| + | |||
| + | ^Partition^Description^Priority^ | ||
| + | |general| For all users| 0 | | ||
| + | |${group}-own | Machines $group has donated. Enabled when asked. | 100 | | ||
| + | |${group}-contrib | A method to give slightly higher job priority to groups who have donated but do not own machines.| Variable based on spreadsheet calculation. | | ||
| + | |||
| + | The key thing to notice before you continue reading is that nodes can be added to multiple partitions. '' | ||
| + | |||
| + | |||
| + | ==== Calculating -contrib partition usage ==== | ||
| + | |||
| + | We do the following calculation to determine the '' | ||
| + | |||
| + | < | ||
| + | partition usage total time in seconds for 30 days | ||
| + | ------------------------------------------------------ = percent used | ||
| + | all partition usage total time in seconds for 30 days | ||
| + | </ | ||
| + | |||
| + | The percent will end up as an integer. | ||
| + | |||
| + | There is a [[https:// | ||
| + | |||
| + | |||
| + | You'll see on the spreadsheet that we subtract '' | ||
| + | |||
| + | Total amount of money contributed and " | ||
| + | |||
| + | This calculation will be run once a month, and the relevant group's ${group}-contrib priority will be updated to reflect the past month's usage. | ||
| + | |||
| + | |||
| + | Note that the term " | ||
| + | |||
| + | |||
| + | |||
| + | |||
| ====== AI Cluster Admin ====== | ====== AI Cluster Admin ====== | ||
| Line 19: | Line 134: | ||
| * < | * < | ||
| * < | * < | ||
| - | * figure why summary view is no longer a thing. | + | * <del>figure why summary view is no longer a thing.</ |
| * < | * < | ||
| * < | * < | ||
| * < | * < | ||
| * home directory | * home directory | ||
| - | * setup backups for home dirs | + | * <del>setup backups for home dirs</ |
| * < | * < | ||
| - | * userland tool to check quota | + | * <del>userland tool to check quota</ |
| * < | * < | ||
| * monitoring | * monitoring | ||
| - | * basic node monitor | + | * <del>basic node monitor</ |
| * nfs or bandwidth monitoring | * nfs or bandwidth monitoring | ||
| + | * gpu | ||
| * sync script | * sync script | ||
| * < | * < | ||
| + | |||
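The job priority formula quoted in the policy text can be illustrated with a small sketch. This is not how Slurm is implemented; it only reproduces the arithmetic of the weighted sum. The weights are copied from the `slurm.conf` excerpt (which may be out of date), the factor values in the example are hypothetical, and the association and TRES terms are left out on the assumption that their weights are unset.

```python
# Weights copied from the slurm.conf excerpt in the policy text.
# The excerpt itself warns that it may be out of date.
WEIGHTS = {
    "age": 0,
    "fairshare": 500000,
    "job_size": 0,
    "partition": 100000,
    "qos": 0,
}


def job_priority(factors, site_factor=0, nice_factor=0):
    """Weighted sum of priority factors, minus the nice factor.

    `factors` maps a factor name to a value between 0.0 and 1.0;
    names that are missing count as 0.0. The association and TRES
    terms are omitted (assumption: no weights are set for them).
    """
    weighted = sum(WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS)
    return site_factor + weighted - nice_factor


# Hypothetical factor values, chosen only to show the arithmetic.
print(job_priority({"fairshare": 0.5, "partition": 1.0}))
```

With these weights, only the fair-share and partition factors can move a job's priority, which is why the partition priority is the lever used to favor contributing groups.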
/var/lib/dokuwiki/data/attic/techstaff/aicluster-admin.1606927380.txt.gz · Last modified: 2020/12/02 10:43 by kauffman
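The "percent used" ratio from the -contrib partition usage section can be sketched as follows. The function name and the rounding rule are assumptions: the policy text only says the percent ends up as an integer, so rounding to the nearest whole number is one plausible reading, not the confirmed method.

```python
def percent_used(partition_seconds, all_partitions_seconds):
    """Share of total usage consumed by one partition, as an integer percent.

    Both arguments are usage totals in seconds over the same 30-day window.
    Rounding to the nearest integer is an assumption; the source only
    states that the result is an integer.
    """
    if all_partitions_seconds <= 0:
        # No recorded usage at all: treat the share as zero.
        return 0
    return round(100 * partition_seconds / all_partitions_seconds)


# Hypothetical numbers: 648,000 s used out of 2,592,000 s in total.
print(percent_used(648000, 2592000))
```

The zero-usage guard avoids a division error in a month where nothing ran on the cluster.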