techstaff:aicluster-admin
Differences
This shows you the differences between two versions of the page.
techstaff:aicluster-admin [2021/02/10 12:03] – kauffman | techstaff:aicluster-admin [2021/02/10 18:52] – [AI Cluster Policy Description] kauffman | ||
---|---|---|---|
Line 1: | Line 1: | ||
====== AI Cluster Policy Description ====== | ====== AI Cluster Policy Description ====== | ||
+ | ===== TODO ===== | ||
+ | - There are multiple methods used to calculate priority reflected on the spreadsheet. | ||
+ | - Double check the math for correctness. | ||
+ | - Does the math reflect our intent? I think it does. | ||
+ | - Multiple methods in calculating priority are reflected (blue to purple cells: '' | ||
+ | - Choose one calculation to use. This decision doesn' | ||
+ | - If you have a suggestion: Please show us the work in a new column or by cloning the sheet. | ||
+ | ===== Contribution Tracking and Priority Calculation Tool ===== | ||
- | > 2. Also, soon, as we pass this feature testing, we soon need to talk | + | [[https:// |
- | > about the " | + | |
- | > $10K annual subscription converts to the SLURM weights). | + | |
- | Bringing this thread back. I apologize for the wall of text but it is | + | AI Cluster committee members have access. |
- | necessary for you to understand this before you can approve it. | + | |
- | After various iterations we (Bob, Har, and I) believe we have an | + | ==== Sheet usage ==== |
- | implementation | + | * Red: Do not edit |
- | previously. | + | * Green: user input (This will be Techstaff 95% of the time) |
+ | * `groups` sheet: | ||
+ | * contributions get assigned a POSIX group. Group must have a primary contact, who then gets to set members for that group. | ||
+ | * calculates contribution amount for use in `contrib-priority`. | ||
+ | * tracks group name and primary owner | ||
+ | * `log` sheet | ||
+ | * All contributions will get entered here. | ||
+ | * Hardware contribution gets converted to USD by techstaff. A receipt | ||
+ | * The group ' | ||
+ | * `contrib-priority` calculation references contrib amounts calculated in `groups`. | ||
- | The relevant spreadsheet: | ||
- | [[https:// | ||
+ | ===== Understanding Slurm Fairshare and Priority/ | ||
+ | Slurm comes with built-in tools to calculate fair share priorities without anyone needing to do anything special. The cluster uses a [[https:// |
- | I'll be adding all in this chain as an editor | + | ==== How Slurm calculates Job priority ==== |
- | this email. | + | Generally Slurm will use this formula |
+ | < | ||
+ | Job_priority = | ||
+ | site_factor + | ||
+ | (PriorityWeightAge) * (age_factor) + | ||
+ | (PriorityWeightAssoc) * (assoc_factor) + | ||
+ | (PriorityWeightFairshare) * (fair-share_factor) + | ||
+ | (PriorityWeightJobSize) * (job_size_factor) + | ||
+ | (PriorityWeightPartition) * (partition_factor) + | ||
+ | (PriorityWeightQOS) * (QOS_factor) + | ||
+ | SUM(TRES_weight_cpu * TRES_factor_cpu, | ||
+ | TRES_weight_< | ||
+ | | ||
+ | - nice_factor | ||
+ | </ | ||
+ | The factors on the left that start with '' | ||
- | Details: | + | < |
+ | fe01:~$ cat / | ||
+ | PriorityType=priority/ | ||
+ | PriorityDecayHalfLife=08: | ||
+ | PriorityMaxAge=5-0 | ||
+ | PriorityWeightFairshare=500000 | ||
+ | PriorityWeightPartition=100000 | ||
+ | PriorityWeightQOS=0 | ||
+ | PriorityWeightJobSize=0 | ||
+ | PriorityWeightAge=0 | ||
+ | PriorityFavorSmall=YES | ||
+ | </ | ||
+ | *Note that this example may not be up to date when you read this. | ||
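To make the weighted sum concrete, here is a minimal Python sketch that plugs the weights from the slurm.conf excerpt above into the priority formula. It is an illustration only, not how Slurm itself is implemented. Assumptions: each factor is a float between 0.0 and 1.0, the association and TRES terms are left out because the excerpt shows no weights for them, and the function name `job_priority` and the example factor values are hypothetical.

```python
# Weights copied from the slurm.conf excerpt; a weight of 0 removes that term.
WEIGHTS = {
    "fairshare": 500000,  # PriorityWeightFairshare
    "partition": 100000,  # PriorityWeightPartition
    "qos": 0,             # PriorityWeightQOS
    "job_size": 0,        # PriorityWeightJobSize
    "age": 0,             # PriorityWeightAge
}


def job_priority(factors, site_factor=0, nice_factor=0):
    """Weighted sum of priority factors (hypothetical helper).

    `factors` maps a factor name to a value assumed to lie in [0.0, 1.0].
    Missing factors count as 0.0. Association and TRES terms are omitted.
    """
    weighted = sum(WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS)
    return int(site_factor + weighted - nice_factor)


# Example with made-up factor values: a half fair-share factor on the
# highest-priority partition.
print(job_priority({"fairshare": 0.5, "partition": 1.0}))
```

With these weights only the fair-share and partition terms can change a job's priority, which is why the partition priorities described in the next section matter so much.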
- | By default the cluster uses a fair share algorithm to adjust | + | ===== How we modify |
- | More reading if you want, but not necessary | + | |
- | email. | + | |
- | | + | |
- | | + | |
- | We adjust those priorities with partitions for those who have donated | + | * We adjust those priorities with partitions for those who have donated either monetarily or with hardware. Hardware donations get converted to a monetary value when logged on the spreadsheet. |
- | either monetarily or with hardware. | + | * Every contribution gets assigned a POSIX group, hereon referred to as '' |
- | Here is a simplified version of the partition configuration as it stands | ||
- | now. | ||
+ | Here is a version of the partition configuration as it stands now (2021-02-10). | ||
< | < | ||
PartitionName=general Nodes=a[001-008] | PartitionName=general Nodes=a[001-008] | ||
Line 43: | Line 79: | ||
</ | </ | ||
+ | ^Partition^Description^Priority^ | ||
+ | |general| For all users| 0 | | ||
+ | |${group}-own | Machines $group has donated | 100 | | ||
+ | |${group}-contrib | A method to give slightly higher job priority to groups who have donated but do not own machines.| Variable based on spreadsheet calculation. | | ||
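The table can also be read as a small lookup rule. The Python sketch below is only an illustration of that rule: the function name `partition_priority`, the `contrib_priority` mapping, and the value 40 for the `cdac` group are hypothetical stand-ins for whatever the spreadsheet calculation produces.

```python
def partition_priority(partition, contrib_priority):
    """Return the priority for a partition name, following the table above.

    `contrib_priority` maps a group name to its spreadsheet-derived priority
    (hypothetical input; the real values come from the spreadsheet).
    """
    if partition == "general":
        return 0  # open to all users
    if partition.endswith("-own"):
        return 100  # machines the group has donated
    if partition.endswith("-contrib"):
        group = partition[: -len("-contrib")]
        return contrib_priority[group]  # variable, set from the spreadsheet
    raise ValueError(f"unknown partition type: {partition}")


# Made-up contrib priority for the 'cdac' group, for illustration only.
example_priorities = {"cdac": 40}
for name in ("general", "cdac-own", "cdac-contrib"):
    print(name, partition_priority(name, example_priorities))
```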
- | general: For all users | + | The key thing to notice before you continue reading is that nodes can be added to multiple partitions. '' |
- | ${group}-own: | + | |
- | ${group}-contrib: A method | + | |
- | groups who have donated | + | |
- | The key thing to notice before you continue reading is that nodes can be | + | ==== Calculating -contrib partition usage ==== |
- | added to multiple partitions. | + | |
- | ' | + | We do the following calculation |
- | priorities. | + | |
- | + | ||
- | Priority is normalized in the sheet to be 0-100. | + | |
- | + | ||
- | Understanding | + | |
- | 1. All users get access to partition | + | |
- | It has a default priority of 0. | + | |
- | 2. Group ' | + | |
- | priority on those machines (Priority=100). | + | |
- | This means that at most they would wait 4 hours for their job to be | + | |
- | submitted. | + | |
- | 3. 'cdac-contrib' | + | |
- | example, has donated to the cluster they should get a higher priority on | + | |
- | other machines as well. | + | |
- | + | ||
- | We do the following calculation to determine the contrib | + | |
- | (cdac-contrib) | + | |
- | cluster usage. | + | |
+ | < | ||
partition usage total time in seconds for 30 days | partition usage total time in seconds for 30 days | ||
------------------------------------------------------ = percent used | ------------------------------------------------------ = percent used | ||
all partition usage total time in seconds for 30 days | all partition usage total time in seconds for 30 days | ||
+ | </ | ||
The percent will end up as an integer. | The percent will end up as an integer. | ||
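As a rough illustration of the ratio above, the following Python sketch computes the percent used from two usage totals in seconds. It is not the script the cluster actually runs. Assumptions: the function name `percent_used` is hypothetical, the integer result is obtained by rounding (the page does not say whether it is rounded or truncated), and a zero total is treated as 0 percent.

```python
SECONDS_IN_30_DAYS = 30 * 24 * 60 * 60  # length of the 30-day window, in seconds


def percent_used(partition_seconds, all_partition_seconds):
    """Partition usage as an integer percent of all partition usage.

    Both arguments are total usage times in seconds over the same 30 days.
    Rounding to the nearest integer is an assumption of this sketch.
    """
    if all_partition_seconds <= 0:
        return 0  # no recorded usage at all; avoid dividing by zero
    return round(100 * partition_seconds / all_partition_seconds)


# Example with made-up numbers: a partition that accounts for a quarter
# of the recorded usage.
print(percent_used(SECONDS_IN_30_DAYS // 4, SECONDS_IN_30_DAYS))
```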
- | + | There is a [[https:// | |
- | You'll see on the spreadsheet we subtract | + | |
- | ensure it's positive | + | |
- | + | ||
- | Total amount of money contributed and " | + | |
- | determining the priority of a groups ' | + | |
- | + | ||
- | This calculation will be run once a month and the relevant groups | + | |
- | ${group}-contrib priority updated | + | |
- | + | ||
- | + | ||
- | Note that the term " | + | |
- | know of a way to actually calculate true idleness. I do believe that the | + | |
- | current calculation reflects the intent of the term. | + | |
- | + | ||
- | + | ||
- | TODO: | + | |
- | - There are multiple methods used to calculate priority reflected | + | |
- | the spreadsheet. | + | |
- | - Double check the math for correctness. | + | |
- | - Does the math reflect our intent? I think it does. | + | |
- | - Multiple methods in calculating priority are reflected | + | |
- | purple cells: har_priority, | + | |
- | weighted_normalized_priority | + | |
- | - Choose one calculation to use. This decision doesn' | + | |
- | from changing it later if we find another calculation works better. | + | |
- | - If you have a suggestion: Please show us the work in a new column | + | |
- | or by cloning the sheet. | + | |
- | Sheet usage: | + | You'll see on the spreadsheet |
- | - All donations will get logged into the spreadsheet | + | |
- | sheet. | + | |
- | - Hardware donation gets converted to USD by techstaff. A receipt of | + | |
- | the purchase is good starting point. | + | |
- | - donations get assigned a POSIX group. Group must have a primary | + | |
- | contact, who then gets to set members for that group. | + | |
- | - The group 'cs' | + | |
- | get any priority set. ' | + | |
- | of the total sum but is treated as a special case with 0 priority. | + | |
+ | Total amount of money contributed and " | ||
+ | This calculation will be run once a month, and each relevant group's ${group}-contrib priority will be updated to reflect the past month's usage. | |
+ | Note that the term " | ||
/var/lib/dokuwiki/data/pages/techstaff/aicluster-admin.txt · Last modified: 2021/02/23 19:58 by kauffman