techstaff:aicluster-admin
Differences
This shows you the differences between two versions of the page.
techstaff:aicluster-admin [2021/02/10 13:15] – kauffman | techstaff:aicluster-admin [2021/02/23 19:58] (current) – kauffman | ||
---|---|---|---|
Line 1: | Line 1: | ||
====== AI Cluster Policy Description ====== | ====== AI Cluster Policy Description ====== | ||
+ | ===== TODO ===== | ||
+ | - There are multiple methods used to calculate priority reflected on the spreadsheet. | ||
+ | - Double check the math for correctness. | ||
+ | - Does the math reflect our intent? I think it does. | ||
+ | - Multiple methods in calculating priority are reflected (blue to purple cells: '' | ||
+ | - Choose one calculation to use. This decision doesn' | ||
+ | - If you have a suggestion: Please show us the work in a new column or by cloning the sheet. | ||
- | After various iterations we (Bob, Har, and I) believe to have an | + | ===== Contribution Tracking |
- | implementation of the policy that meets the requirements discussed | + | |
- | previously. | + | |
[[https:// | [[https:// | ||
Line 9: | Line 14: | ||
AI Cluster committee members have access. | AI Cluster committee members have access. | ||
+ | ==== Sheet usage ==== | ||
+ | * Red: Do not edit | ||
+ | * Green: user input (This will be Techstaff 95% of the time) | ||
+ | * `groups` sheet: | ||
+ | * contributions get assigned a POSIX group. Group must have a primary contact, who then gets to set members for that group. | ||
+ | * calculates contribution amount for use in `contrib-priority`. | ||
+ | * tracks group name and primary owner | ||
+ | * `log` sheet | ||
+ | * All contributions will get entered here. | ||
+ | * Hardware contributions get converted to USD by techstaff. A receipt of the purchase is a good starting point. | ||
+ | * The group ' | ||
+ | * `contrib-priority` calculation references contrib amounts calculated in `groups`. | ||
- | Details: | ||
- | By default the cluster uses a [[https:// | ||
+ | ===== Understanding Slurm Fairshare and Priority/ | ||
+ | Slurm comes with built-in tools to calculate fair-share priorities without anyone needing to do anything special. The cluster uses a [[https:// | ||
+ | |||
+ | ==== How Slurm calculates Job priority ==== | ||
Generally Slurm will use this formula to determine a job's priority. | Generally Slurm will use this formula to determine a job's priority. | ||
< | < | ||
Line 31: | Line 50: | ||
</ | </ | ||
- | The factors on the left that start with '' | + | The factors on the left that start with '' |
< | < | ||
Line 45: | Line 64: | ||
PriorityFavorSmall=YES | PriorityFavorSmall=YES | ||
</ | </ | ||
- | **Note that this example may not be up to date when you read this. | + | *Note that this example may not be up to date when you read this. |
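The weighted-sum idea behind Slurm's multifactor priority can be sketched in a few lines of Python. This is a simplified illustration, not the cluster's configuration: the ''Weights'' class, the ''job_priority'' function, and every weight value are hypothetical, and terms such as the site, association, and TRES factors are left out.

```python
from dataclasses import dataclass


@dataclass
class Weights:
    """Hypothetical PriorityWeight* values; the real ones are set in slurm.conf."""
    age: int = 1000
    fairshare: int = 10000
    job_size: int = 1000
    partition: int = 1000
    qos: int = 0


def job_priority(w: Weights, age: float, fairshare: float, job_size: float,
                 partition: float, qos: float = 0.0, nice: int = 0) -> int:
    """Return a simplified job priority.

    Each factor is expected to be a float between 0.0 and 1.0. Every factor
    is multiplied by its weight, the products are summed, and the nice
    value is subtracted at the end.
    """
    total = (
        w.age * age
        + w.fairshare * fairshare
        + w.job_size * job_size
        + w.partition * partition
        + w.qos * qos
    )
    return int(total) - nice


# Same job, submitted first to a partition with factor 0.0 and then to one
# with factor 1.0: only the partition term changes.
base = job_priority(Weights(), age=0.5, fairshare=0.25, job_size=1.0, partition=0.0)
boosted = job_priority(Weights(), age=0.5, fairshare=0.25, job_size=1.0, partition=1.0)
print(base, boosted)
```

Because the partition factor is one term of the sum, raising a partition's priority lifts every job submitted to it without touching the age or fair-share terms.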
- | We adjust those priorities with partitions for those who have donated either monetarily or with hardware. Hardware donations get converted a monetary value when logged on the spreadsheet. | + | ===== How we modify job priority to favor contributors ===== |
- | Every contribution gets assigned a POSIX group, hereon | + | * We adjust those priorities with partitions for those who have donated either monetarily or with hardware. Hardware donations get converted to a monetary value when logged on the spreadsheet. | ||
+ | * Every contribution gets assigned a POSIX group, hereon | ||
- | + | Here is a version of the partition configuration as it stands now (2021-02-10). | |
- | Here is a simplified | + | |
< | < | ||
PartitionName=general Nodes=a[001-008] | PartitionName=general Nodes=a[001-008] | ||
- | PartitionName=cdac-own Nodes=a[005-008] AllowGroups=cdac Priority=100 | + | #PartitionName=cdac-own Nodes=a[005-008] AllowGroups=cdac Priority=100 |
PartitionName=cdac-contrib Nodes=a[001-008] AllowGroups=cdac Priority=5 | PartitionName=cdac-contrib Nodes=a[001-008] AllowGroups=cdac Priority=5 | ||
</ | </ | ||
- | ^Partition^ Description^ | + | ^Partition^Description^Priority^ |
- | |general| For all users| | + | |general| For all users| 0 | |
- | |${group}-own | Machines $group has donated | | + | |${group}-own | Machines $group has donated. Enabled when asked. | 100 | |
- | |${group}-contrib | A method to give slightly higher job priority to groups who have donated but do not own machines.| | + | |${group}-contrib | A method to give slightly higher job priority to groups who have donated but do not own machines.| Variable based on spreadsheet calculation. | |
+ | The key thing to notice before you continue reading is that nodes can be added to multiple partitions. '' | ||
- | The key thing to notice before you continue reading is that nodes can be | ||
- | added to multiple partitions. | ||
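Since the ''${group}-contrib'' priority changes over time, it can help to generate the partition line rather than edit it by hand. The following is a minimal sketch under stated assumptions: the function name ''contrib_partition_line'' is hypothetical, and the 0 to 100 range check assumes the priority is normalized as on the spreadsheet.

```python
def contrib_partition_line(group: str, nodes: str, priority: int) -> str:
    """Build a slurm.conf line for a group's contrib partition.

    The output follows the layout of the configuration example on this
    page. The range check is an assumption about how priorities are
    normalized, not a Slurm requirement.
    """
    if not 0 <= priority <= 100:
        raise ValueError(f"priority {priority} is outside 0-100")
    return (
        f"PartitionName={group}-contrib Nodes={nodes} "
        f"AllowGroups={group} Priority={priority}"
    )


# The same node range also appears in the general partition, which is how
# one node ends up in more than one partition.
print(contrib_partition_line("cdac", "a[001-008]", 5))
```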
- | ' | + | ==== Calculating |
- | priorities. | + | |
- | Priority is normalized in the sheet to be 0-100. | + | We do the following calculation |
- | + | ||
- | Understanding | + | |
- | 1. All users get access to partition | + | |
- | It has a default priority of 0. | + | |
- | 2. Group ' | + | |
- | priority on those machines (Priority=100). | + | |
- | This means that at most they would wait 4 hours for their job to be | + | |
- | submitted. | + | |
- | 3. 'cdac-contrib' | + | |
- | example, has donated to the cluster they should get a higher priority on | + | |
- | other machines as well. | + | |
- | + | ||
- | We do the following calculation to determine the contrib | + | |
- | (cdac-contrib) | + | |
- | cluster usage. | + | |
+ | < | ||
partition usage total time in seconds for 30 days | partition usage total time in seconds for 30 days | ||
------------------------------------------------------ = percent used | ------------------------------------------------------ = percent used | ||
all partition usage total time in seconds for 30 days | all partition usage total time in seconds for 30 days | ||
+ | </ | ||
The percent will end up as an integer. | The percent will end up as an integer. | ||
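The ratio above can be written as a short Python function. This is a sketch only: the name ''percent_used'' is hypothetical, and rounding to the nearest integer is an assumption, since the page does not say whether the spreadsheet rounds or truncates.

```python
def percent_used(partition_seconds: int, all_partitions_seconds: int) -> int:
    """Return a partition's share of total usage as an integer percent.

    Both arguments are usage totals in seconds over the same 30-day window.
    An empty window (no usage at all) is reported as 0 percent.
    """
    if all_partitions_seconds <= 0:
        return 0
    return round(100 * partition_seconds / all_partitions_seconds)


# Example with made-up numbers: 648000 seconds of partition usage out of
# 2592000 seconds across all partitions.
print(percent_used(648_000, 2_592_000))
```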
- | + | There is a [[https:// | |
- | You'll see on the spreadsheet we take subtract | + | |
- | ensure it's positive | + | |
- | + | ||
- | Total amount of money contributed and " | + | |
- | determining the priority of a groups ' | + | |
- | + | ||
- | This calculation will be run once a month and the relevant groups | + | |
- | ${group}-contrib priority updated | + | |
- | + | ||
- | + | ||
- | Note that the term " | + | |
- | know of way to actually calculate true idleness. I do believe that the | + | |
- | current calculation reflects the intent of the term. | + | |
- | + | ||
- | + | ||
- | TODO: | + | |
- | - There are multiple methods used to calculate priority reflected | + | |
- | the spreadsheet. | + | |
- | - Double check the math for correctness. | + | |
- | - Does the math reflect our intent? I think it does. | + | |
- | - Multiple methods in calculating priority are reflected | + | |
- | purple cells: har_priority, | + | |
- | weighted_normalized_priority | + | |
- | - Choose one calculation to use. This decision doesn' | + | |
- | from changing it later if we find another calculation works better. | + | |
- | - If you have a suggestion: Please show us the work in a new column | + | |
- | or by cloning the sheet. | + | |
- | Sheet usage: | + | You'll see on the spreadsheet |
- | - All donations will get logged into the spreadsheet | + | |
- | sheet. | + | |
- | - Hardware donation gets converted to USD by techstaff. A receipt of | + | |
- | the purchase is good starting point. | + | |
- | - donations get assigned a POSIX group. Group must have a primary | + | |
- | contact, who then gets to set members for that group. | + | |
- | - The group 'cs' | + | |
- | get any priority set. ' | + | |
- | of the total sum but is treated as a special case with 0 priority. | + | |
+ | Total amount of money contributed and " | ||
+ | This calculation will be run once a month, and the relevant group's ${group}-contrib priority will be updated to reflect the past month's usage. | ||
+ | Note that the term " | ||
/var/lib/dokuwiki/data/attic/techstaff/aicluster-admin.1612984516.txt.gz · Last modified: 2021/02/10 13:15 by kauffman