techstaff:aicluster-admin
Differences
This shows you the differences between two versions of the page.
techstaff:aicluster-admin [2020/12/02 10:43] – kauffman | techstaff:aicluster-admin [2021/02/23 19:58] (current) – kauffman | ||
---|---|---|---|
Line 1: | Line 1: | ||
+ | ====== AI Cluster Policy Description ====== | ||
+ | ===== TODO ===== | ||
+ | - There are multiple methods used to calculate priority reflected on the spreadsheet. | ||
+ | - Double check the math for correctness. | ||
+ | - Does the math reflect our intent? I think it does. | ||
+ | - Multiple methods in calculating priority are reflected (blue to purple cells: '' | ||
+ | - Choose one calculation to use. This decision doesn' | ||
+ | - If you have a suggestion: Please show us the work in a new column or by cloning the sheet. | ||
+ | |||
+ | ===== Contribution Tracking and Priority Calculation Tool ===== | ||
+ | |||
+ | [[https:// | ||
+ | |||
+ | AI Cluster committee members have access. | ||
+ | |||
+ | ==== Sheet usage ==== | ||
+ | * Red: Do not edit | ||
+ | * Green: user input (This will be Techstaff 95% of the time) | ||
+ | * `groups` sheet: | ||
+ | * Contributions get assigned a POSIX group. Each group must have a primary contact, who then sets the members of that group. | ||
+ | * calculates contribution amount for use in `contrib-priority`. | ||
+ | * tracks group name and primary owner | ||
+ | * `log` sheet | ||
+ | * All contributions will get entered here. | ||
+ | * Hardware contributions are converted to USD by techstaff. A purchase receipt is a good starting point. | ||
+ | * The group ' | ||
+ | * `contrib-priority` calculation references contrib amounts calculated in `groups`. | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | ===== Understanding Slurm Fairshare and Priority/ | ||
+ | Slurm comes with built-in tools to calculate fair-share priorities without anyone needing to do anything special. The cluster uses a [[https:// | ||
+ | |||
+ | ==== How Slurm calculates Job priority ==== | ||
+ | Generally, Slurm uses this formula to determine a job's priority. | ||
+ | < | ||
+ | Job_priority = | ||
+ | site_factor + | ||
+ | (PriorityWeightAge) * (age_factor) + | ||
+ | (PriorityWeightAssoc) * (assoc_factor) + | ||
+ | (PriorityWeightFairshare) * (fair-share_factor) + | ||
+ | (PriorityWeightJobSize) * (job_size_factor) + | ||
+ | (PriorityWeightPartition) * (partition_factor) + | ||
+ | (PriorityWeightQOS) * (QOS_factor) + | ||
+ | SUM(TRES_weight_cpu * TRES_factor_cpu, | ||
+ | TRES_weight_< | ||
+ | ...) | ||
+ | - nice_factor | ||
+ | </ | ||
+ | |||
+ | The factors on the left that start with '' | ||
+ | |||
+ | < | ||
+ | fe01:~$ cat / | ||
+ | PriorityType=priority/ | ||
+ | PriorityDecayHalfLife=08: | ||
+ | PriorityMaxAge=5-0 | ||
+ | PriorityWeightFairshare=500000 | ||
+ | PriorityWeightPartition=100000 | ||
+ | PriorityWeightQOS=0 | ||
+ | PriorityWeightJobSize=0 | ||
+ | PriorityWeightAge=0 | ||
+ | PriorityFavorSmall=YES | ||
+ | </ | ||
+ | *Note that this example may not be up to date when you read this. | ||
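As a rough illustration of how the weights above feed into the formula, here is a minimal Python sketch. It assumes each factor is a float between 0.0 and 1.0, leaves out the TRES terms, and uses made-up factor values; the function name `job_priority`, the `WEIGHTS` table, and the truncation to an integer are assumptions for illustration, not Slurm's actual code.

```python
# Weights copied from the slurm.conf excerpt above.
# PriorityWeightAssoc is not set there, so it is assumed to be 0.
WEIGHTS = {
    "fairshare": 500000,  # PriorityWeightFairshare
    "partition": 100000,  # PriorityWeightPartition
    "qos": 0,             # PriorityWeightQOS
    "job_size": 0,        # PriorityWeightJobSize
    "age": 0,             # PriorityWeightAge
    "assoc": 0,           # assumption: not configured
}


def job_priority(factors, site_factor=0, nice_factor=0):
    """Weighted sum of the factors; a factor that is not supplied counts as 0.0."""
    weighted = sum(WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS)
    # Truncating to an integer is an assumption made for this sketch.
    return int(site_factor + weighted - nice_factor)


# Hypothetical factor values: same fair-share standing, different partition factor.
general = job_priority({"fairshare": 0.5, "partition": 0.0})
contrib = job_priority({"fairshare": 0.5, "partition": 0.25})
print(general, contrib)
```

With these weights only the fair-share and partition factors move a job's priority, which is why partitions are a workable lever for favoring contributors.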
+ | |||
+ | ===== How we modify job priority to favor contributors ===== | ||
+ | |||
+ | * We adjust those priorities with partitions for those who have donated either monetarily or with hardware. Hardware donations get converted to a monetary value when logged on the spreadsheet. | ||
+ | * Every contribution gets assigned a POSIX group, hereon referred to as '' | ||
+ | |||
+ | |||
+ | Here is a version of the partition configuration as it stands now (2021-02-10). | ||
+ | < | ||
+ | PartitionName=general Nodes=a[001-008] | ||
+ | # | ||
+ | PartitionName=cdac-contrib Nodes=a[001-008] AllowGroups=cdac Priority=5 | ||
+ | </ | ||
+ | |||
+ | ^Partition^Description^Priority^ | ||
+ | |general| For all users| 0 | | ||
+ | |${group}-own | Machines $group has donated. Enabled when asked. | 100 | | ||
+ | |${group}-contrib | A method to give slightly higher job priority to groups who have donated but do not own machines.| Variable based on spreadsheet calculation. | | ||
+ | |||
+ | The key thing to notice before you continue reading is that nodes can be added to multiple partitions. '' | ||
+ | |||
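The naming convention can be sketched as a small Python helper that renders a `${group}-contrib` partition line in the same shape as the `cdac-contrib` line in the configuration excerpt. The helper name `contrib_partition_line` is hypothetical, and the real configuration may carry additional options that are not shown on this page.

```python
def contrib_partition_line(group, nodes, priority):
    """Render a ${group}-contrib partition line restricted to one POSIX group."""
    return (
        f"PartitionName={group}-contrib Nodes={nodes} "
        f"AllowGroups={group} Priority={priority}"
    )


# The same node range can back several partitions at once.
nodes = "a[001-008]"
line = contrib_partition_line("cdac", nodes, 5)
print(line)
```

Only the partition name, the allowed group, and the priority change from one contributing group to the next; the node list stays the same.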
+ | |||
+ | ==== Calculating -contrib partition usage ==== | ||
+ | |||
+ | We do the following calculation to determine the '' | ||
+ | |||
+ | < | ||
+ | partition usage total time in seconds for 30 days | ||
+ | ------------------------------------------------------ = percent used | ||
+ | all partition usage total time in seconds for 30 days | ||
+ | </ | ||
+ | |||
+ | The percent will end up as an integer. | ||
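The ratio above can be written as a short Python sketch. The usage numbers are invented, the function name `percent_used` is hypothetical, and rounding to the nearest integer is an assumption, since the page only says the result ends up as an integer.

```python
def percent_used(partition_seconds, all_partition_seconds):
    """Share of total cluster usage taken by one partition, as an integer percent."""
    if all_partition_seconds <= 0:
        # No recorded usage in the window: treat as 0 percent rather than divide by zero.
        return 0
    # Rounding to the nearest integer is an assumption for this sketch.
    return round(100 * partition_seconds / all_partition_seconds)


# Hypothetical 30-day totals, in seconds.
example = percent_used(648_000, 2_592_000)
print(example)
```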
+ | |||
+ | There is a [[https:// | ||
+ | |||
+ | |||
+ | You'll see on the spreadsheet that we subtract '' | ||
+ | |||
+ | Total amount of money contributed and " | ||
+ | |||
+ | This calculation will be run once a month, and each relevant group's ${group}-contrib priority will be updated to reflect the past month's usage. | ||
+ | |||
+ | |||
+ | Note that the term " | ||
+ | |||
+ | |||
+ | |||
+ | |||
====== AI Cluster Admin ====== | ====== AI Cluster Admin ====== | ||
Line 19: | Line 134: | ||
* < | * < | ||
* < | * < | ||
- | * figure why summary view is no longer a thing. | + | * <del>figure why summary view is no longer a thing.</ |
* < | * < | ||
* < | * < | ||
* < | * < | ||
* home directory | * home directory | ||
- | * setup backups for home dirs | + | * <del>setup backups for home dirs</ |
* < | * < | ||
- | * userland tool to check quota | + | * <del>userland tool to check quota</ |
* < | * < | ||
* monitoring | * monitoring | ||
- | * basic node monitor | + | * <del>basic node monitor</ |
* nfs or bandwidth monitoring | * nfs or bandwidth monitoring | ||
+ | * gpu | ||
* sync script | * sync script | ||
* < | * < | ||
+ | |||
/var/lib/dokuwiki/data/pages/techstaff/aicluster-admin.txt · Last modified: 2021/02/23 19:58 by kauffman