Differences

This shows you the differences between two versions of the page.

--- techstaff:aicluster-admin [2020/11/30 20:17] – kauffman
+++ techstaff:aicluster-admin [2021/02/23 19:58] (current) – kauffman
@@ Line 1: / Line 1: @@
-====== AI Cluster Admin ======
+====== AI Cluster Policy Description ======
 ===== TODO =====
+  - There are multiple methods used to calculate priority reflected on the spreadsheet.
+  - Double check the math for correctness.
+  - Does the math reflect our intent? I think it does.
+  - Multiple methods of calculating priority are reflected (blue to purple cells: ''%%har_priority%%'', ''%%normalized_priority%%'', ''%%weighted_normalized_priority%%'').
+  - Choose one calculation to use. This decision doesn't prevent us from changing it later if we find another calculation works better.
+  - If you have a suggestion: Please show us the work in a new column or by cloning the sheet.
-Since I'm still working on it, I don't guarantee any uptime yet. Mainly I need to make sure TRES tracking is working like we want. This will involve restarting slurmd and slurmctld which will kill running jobs.
+===== Contribution Tracking and Priority Calculation Tool =====
+[[https://docs.google.com/spreadsheets/d/15o3jZOVqU84hMevIKnLKj8DhFYaQUdPdOYE1_itucqk|AI Cluster Tracking and Priority calculation spreadsheet]]
-  * <del>generate report of storage usage</del>
+AI Cluster committee members have access.
-  * <del>groups (Slurm 'Accounts') created for PI's.</del>
-    * <del>e.g. ericj_group: ericj, user1, user1, etc</del>
-  * <del>grab QOS data from somewhere (gsheet or some kind of DB)</del>
-  * <del>Properly deploy sync script</del>
-    * <del>Systemd unit</del>
-    * <del>main loop</del>
-  * <del>research on slurm plugin to force GRES selection on job submit. Might be able to use: </del>
-    * <del>SallocDefaultCommand</del>
-    * <del>Otherwise look for 'AccountingStorageTRES' and 'JobSubmitPlugins' and  /etc/slurm-llnl/job_submit.lua <= used to force user to specify '--gres'.</del>
-    * <del>jobs that do not specify a specific gpu type (e.g. gpu:rtx8000 or gpu:rtx2080ti) could be counted against either one but not specifically the you actually used.</del>
-    * <del>From 'AccountingStorageTRES' in slurm.conf: "Given a configuration of "AccountingStorageTRES=gres/gpu:tesla,gres/gpu:volta" Then "gres/gpu:tesla" and "gres/gpu:volta" will track jobs that explicitly request those GPU types. If a job requests GPUs, but does not explicitly specify the GPU type, then its resource allocation will be accounted for as either "gres/gpu:tesla" or "gres/gpu:volta", although the accounting may not match the actual GPU type allocated to the job and the GPUs allocated to the job could be heterogeneous. In an environment containing various GPU types, use of a job_submit plugin may be desired in order to force jobs to explicitly specify some GPU type."</del>
-  * <del>ganglia for Slurm: http://ai-mgmt2.ai.cs.uchicago.edu</del>
-    * figure why summary view is no longer a thing.
-  * <del>update 'coolgpus'. Lose VTs when this is running.</del>
-    * <del>coolgpus: sets fan speeds of all gpus in system.</del>
-    * <del>Goal is to statically set fan speeds to 80%. The only way to do this is with fake Xservers... but that means you lose all the VTs. Is this a compromise I'm willing to make?</del> It is.
-  * home directory
-    * setup backups for home dirs
-    * default quota
-    * home directory usage report
-  * monitoring
-    * basic node monitor
-    * nfs or bandwidth monitoring
+==== Sheet usage ====
+  * Red: Do not edit
+  * Green: user input (This will be Techstaff 95% of the time)
+  * `groups` sheet:
+    * contributions get assigned a POSIX group. Group must have a primary contact, who then gets to set members for that group.
+    * calculates contribution amount for use in `contrib-priority`.
+    * tracks group name and primary owner
+  * `log` sheet
+    * All contributions will get entered here.
+    * Hardware contributions get converted to USD by techstaff. A receipt of the purchase is a good starting point.
+  * The group 'cs' is calculated on the spreadsheet but doesn't actually get any priority set. 'general' is for all CS users. It needs to be part of the total sum but is treated as a special case with 0 priority.
+  * `contrib-priority` calculation references contrib amounts calculated in `groups`.
-===== Fairshare =====
-# Check out the fairshare values
+===== Understanding Slurm Fairshare and Priority/Multifactor =====
+Slurm comes with built-in tools to calculate fair share priorities without anyone needing to do anything special. The cluster uses a [[https://slurm.schedmd.com/fair_tree.html|fair share algorithm]] with [[https://slurm.schedmd.com/priority_multifactor.html|multiple factors]] to adjust job priorities.
+==== How Slurm calculates Job priority ====
+Generally, Slurm uses this formula to determine a job's priority.
 <code>
-kauffman3@fe01:~$ sshare --long --accounts=kauffman3,kauffman4 --users=kauffman3,kauffman4
+Job_priority =
-             Account       User  RawShares  NormShares    RawUsage NormUsage  EffectvUsage  FairShare    LevelFS GrpTRESMins     TRESRunMins
+	site_factor +
--------------------- ---------- ---------- ----------- ----------- ----------- ------------- ---------- ---------- ------------------------------ ------------------------------
+	(PriorityWeightAge) * (age_factor) +
-kauffman3                                1    0.000094         428 1.000000      1.000000              0.000094 cpu=475,mem=2807810,energy=0,+
+	(PriorityWeightAssoc) * (assoc_factor) +
- kauffman3            kauffman3          1    1.000000         428 1.000000      1.000000   0.000094   1.000000 cpu=475,mem=2807810,energy=0,+
+	(PriorityWeightFairshare) * (fair-share_factor) +
-kauffman4                                1    0.000094           0 0.000000      0.000000                   inf cpu=0,mem=0,energy=0,node=0,b+
+	(PriorityWeightJobSize) * (job_size_factor) +
- kauffman4            kauffman4          1    1.000000           0 0.000000      0.000000   1.000000        inf cpu=0,mem=0,energy=0,node=0,b+
+	(PriorityWeightPartition) * (partition_factor) +
+	(PriorityWeightQOS) * (QOS_factor) +
+	SUM(TRES_weight_cpu * TRES_factor_cpu,
+	    TRES_weight_<type> * TRES_factor_<type>,
+	    ...)
+	- nice_factor
 </code>
+The weights on the left that start with ''%%PriorityWeight*%%'' can be, and are, set in the Slurm cluster configuration file (found on all nodes at /etc/slurm-llnl/slurm.conf). The factors on the right are calculated based on previous job submissions for that particular user.
-We are using the FairTree (fairshare algorithm). This is the default in Slurm these days and from what I can tell probably better suits our needs. It is no big deal to change to classic fairshare.
+<code>
+fe01:~$ cat /etc/slurm-llnl/slurm.conf |grep "^Priority"
+PriorityType=priority/multifactor
+PriorityDecayHalfLife=08:00:00
+PriorityMaxAge=5-0
+PriorityWeightFairshare=500000
+PriorityWeightPartition=100000
+PriorityWeightQOS=0
+PriorityWeightJobSize=0
+PriorityWeightAge=0
+PriorityFavorSmall=YES
+</code>
+*Note that this example may not be up to date when you read this.
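To make the formula concrete, here is a minimal sketch of the weighted sum in Python. The weights are copied from the ''slurm.conf'' excerpt above; the factor values (0.5 and 1.0) are made up for illustration, and the function name ''job_priority'' is hypothetical. The assoc and TRES terms are left out because no weights for them appear in the excerpt. This is not Slurm's implementation, only the arithmetic.

```python
# Weights copied from the slurm.conf excerpt above (may be out of date).
WEIGHTS = {
    "age": 0,             # PriorityWeightAge
    "fairshare": 500000,  # PriorityWeightFairshare
    "job_size": 0,        # PriorityWeightJobSize
    "partition": 100000,  # PriorityWeightPartition
    "qos": 0,             # PriorityWeightQOS
}


def job_priority(factors, weights=WEIGHTS, site_factor=0, nice_factor=0):
    """Weighted sum of the multifactor terms (assoc and TRES terms omitted).

    `factors` maps a factor name to a value between 0.0 and 1.0;
    missing factors are treated as 0.0.
    """
    weighted = sum(weights[name] * factors.get(name, 0.0) for name in weights)
    return site_factor + weighted - nice_factor


# Illustrative factor values only: a fair-share factor of 0.5 and a
# partition factor of 1.0.
print(job_priority({"fairshare": 0.5, "partition": 1.0}))
```

With the weights shown, only the fair-share and partition terms can change a job's priority; age, job size and QOS are weighted at 0.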
+===== How we modify job priority to favor contributors =====
+  * We adjust those priorities with partitions for those who have donated either monetarily or with hardware. Hardware donations get converted to a monetary value when logged on the spreadsheet.
+  * Every contribution gets assigned a POSIX group, hereafter referred to as ''%%$group%%''.
-As the system exists now. One Account per User.
+Here is a version of the partition configuration as it stands now (2021-02-10).
 <code>
- Account: kauffman
+PartitionName=general Nodes=a[001-008]
-   Member: kauffman
+#PartitionName=cdac-own Nodes=a[005-008] AllowGroups=cdac Priority=100
- User: kauffman
+PartitionName=cdac-contrib Nodes=a[001-008] AllowGroups=cdac Priority=5
 </code>
-We will probably assign fairshare points to accounts, not users.
-====== QOS ======
+^Partition^Description^Priority^
+|general| For all users| 0 |
+|${group}-own | Machines $group has donated. Enabled when asked. | 100 |
+|${group}-contrib | A method to give slightly higher job priority to groups who have donated but do not own machines.| Variable based on spreadsheet calculation. |
+The key thing to notice before you continue reading is that nodes can be added to multiple partitions. Jobs submitted to ''%%general%%'' and to ''%%cdac-contrib%%'' can run on all nodes, but with different priorities.
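The overlap can be checked with a small Python sketch that reads the partition lines shown above and lists the active partitions containing a given node. The helpers ''expand_nodes'' and ''partitions_for'' are hypothetical and only handle the single ''prefix[lo-hi]'' form used here; real Slurm host lists allow more forms than this.

```python
import re

# Partition lines copied from the configuration above (2021-02-10).
PARTITION_CONFIG = """\
PartitionName=general Nodes=a[001-008]
#PartitionName=cdac-own Nodes=a[005-008] AllowGroups=cdac Priority=100
PartitionName=cdac-contrib Nodes=a[001-008] AllowGroups=cdac Priority=5
"""


def expand_nodes(expr):
    """Expand one 'prefix[lo-hi]' range, e.g. a[001-008]; return anything else as-is."""
    match = re.fullmatch(r"(\w+)\[(\d+)-(\d+)\]", expr)
    if not match:
        return [expr]
    prefix, lo, hi = match.groups()
    width = len(lo)  # keep the zero padding of the original names
    return [f"{prefix}{n:0{width}d}" for n in range(int(lo), int(hi) + 1)]


def partitions_for(node, config=PARTITION_CONFIG):
    """Return the names of uncommented partitions whose node list contains `node`."""
    names = []
    for line in config.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and disabled partitions
        fields = dict(item.split("=", 1) for item in line.split())
        if node in expand_nodes(fields["Nodes"]):
            names.append(fields["PartitionName"])
    return names


print(partitions_for("a006"))
```

Because ''cdac-own'' is commented out, node ''a006'' is reported in ''general'' and ''cdac-contrib'' only.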
-When submitting jobs users will have to include '--account=<groupnmame>' to
-get the priority levels associated with that account.
-Priority levels:
+==== Calculating -contrib partition usage ====
-normal: [default] value=0
-low: value=100
-medium: value=500
-high: value=1000
-groupname is a Slurm 'account', with users of the cluster added.
+We do the following calculation to determine each ''%%*-contrib%%'' partition's usage over the past 30 days in comparison to total cluster usage.
-As an admin the following would be created:
+<code>
+partition usage total time in seconds for 30 days
+------------------------------------------------------ = percent used
+all partition usage total time in seconds for 30 days
+</code>
-# create group and set allowed QOS levels. Multiple levels can be specified.
+The percent will end up as an integer.
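As a minimal sketch of that ratio in Python (this is not the ''cluster_partition_usage.py'' script itself): the function name is hypothetical, the input numbers are made up, and rounding to the nearest integer is an assumption, since the text only says the result is an integer.

```python
def percent_used(partition_seconds, all_partitions_seconds):
    """Partition usage as an integer percent of total cluster usage.

    Both arguments are usage totals in seconds over the same 30-day window.
    Rounding to the nearest integer is an assumption; the real script may truncate.
    """
    if all_partitions_seconds <= 0:
        return 0  # no recorded usage at all, so nothing was used
    return round(100 * partition_seconds / all_partitions_seconds)


# Made-up numbers: 648000 s on the partition out of 2592000 s across the cluster.
print(percent_used(648000, 2592000))
```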
-# Meaning you can set 'low,medium,high' with a default QOS of low
-# sacctmgr create account jonaslab
-# sacctmgr -i modify account jonaslab set qos=low
-# sacctmgr -i modify account jonaslab set defaultqos=low
-# Now add 'kauffman3' to the group
+There is a [[https://vcs.cs.uchicago.edu/kauffman/slurm-tools/blob/master/cluster_partition_usage.py|Python script]] that does this and sends techstaff a report. The repo is currently not available for everyone to see, but I think that it should be eventually. In the meantime you can take a look at it on the front end nodes (/usr/local/slurm-tools/cluster_partition_usage.py).
-# sacctmgr create user kauffman3 account=jonaslab
-These values get used in the multifactor calculation to set the total
+You'll see on the spreadsheet that we subtract ''%%percent used%%'' from 100, ensure the result is positive, and call that "idleness".
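A minimal sketch of that step in Python. The function name is hypothetical, and reading "ensure it's positive" as clamping at zero is an assumption; the spreadsheet may do this differently.

```python
def idleness(percent_used):
    """Return 100 minus the percent used, clamped so it never goes below 0.

    Clamping at zero is an assumption about how the spreadsheet keeps
    the value positive.
    """
    return max(0, 100 - percent_used)


print(idleness(25))   # a partition that used 25 percent of the cluster
print(idleness(120))  # out-of-range input is clamped rather than going negative
```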
-priority on any given job.
-The math/algorithm is available on SchedMDs site if anyone wants to come up
+The total amount of money contributed and "idleness" are the key factors in determining the priority of a group's 'contrib' partition.
-with something optimal. I've guessed at values that seem reasonable and
-should do what we want.
-https://slurm.schedmd.com/priority_multifactor.html#general
-The values on the left side of the + signs are values we can set.
+This calculation will be run once a month and the relevant group's ${group}-contrib priority updated to reflect the past month's usage.
-It will be up to us to know when to remove any groups access to higher
-priorities. I imagine some sort of boolean in a spreadsheet or database.
-If you do not use the '--account=<groupname>' switch it will use the users
+Note that the term "idleness" should not be taken literally. I don't know of a way to actually calculate true idleness. I believe that the current calculation reflects the intended usage.
-default account which has the default priority (normal) set.
+====== AI Cluster Admin ======
+===== TODO =====
+Since I'm still working on it, I don't guarantee any uptime yet. Mainly I need to make sure TRES tracking is working like we want. This will involve restarting slurmd and slurmctld which will kill running jobs.
+  * <del>generate report of storage usage</del>
+  * <del>groups (Slurm 'Accounts') created for PI's.</del>
+    * <del>e.g. ericj_group: ericj, user1, user1, etc</del>
+  * <del>grab QOS data from somewhere (gsheet or some kind of DB)</del>
+  * <del>Properly deploy sync script</del>
+    * <del>Systemd unit</del>
+    * <del>main loop</del>
+  * <del>research on slurm plugin to force GRES selection on job submit. Might be able to use: </del>
+    * <del>SallocDefaultCommand</del>
+    * <del>Otherwise look for 'AccountingStorageTRES' and 'JobSubmitPlugins' and  /etc/slurm-llnl/job_submit.lua <= used to force user to specify '--gres'.</del>
+    * <del>jobs that do not specify a specific gpu type (e.g. gpu:rtx8000 or gpu:rtx2080ti) could be counted against either one but not specifically the one you actually used.</del>
+    * <del>From 'AccountingStorageTRES' in slurm.conf: "Given a configuration of "AccountingStorageTRES=gres/gpu:tesla,gres/gpu:volta" Then "gres/gpu:tesla" and "gres/gpu:volta" will track jobs that explicitly request those GPU types. If a job requests GPUs, but does not explicitly specify the GPU type, then its resource allocation will be accounted for as either "gres/gpu:tesla" or "gres/gpu:volta", although the accounting may not match the actual GPU type allocated to the job and the GPUs allocated to the job could be heterogeneous. In an environment containing various GPU types, use of a job_submit plugin may be desired in order to force jobs to explicitly specify some GPU type."</del>
+  * <del>ganglia for Slurm: http://ai-mgmt2.ai.cs.uchicago.edu</del>
+    * <del>figure out why summary view is no longer a thing.</del>
+  * <del>update 'coolgpus'. Lose VTs when this is running.</del>
+    * <del>coolgpus: sets fan speeds of all gpus in system.</del>
+    * <del>Goal is to statically set fan speeds to 80%. The only way to do this is with fake Xservers... but that means you lose all the VTs. Is this a compromise I'm willing to make?</del> It is.
+  * home directory
+    * <del>setup backups for home dirs</del>
+    * <del>default quota</del>
+    * <del>userland tool to check quota</del>
+    * <del>home directory usage report</del>
+  * monitoring
+    * <del>basic node monitor</del>
+    * nfs or bandwidth monitoring
+    * gpu
+  * sync script
+    * <del>fix bug to ensure accounts and users are created</del>
-Anyways... a more readable version of the policy would be helpful for me to
-try to match what we think we want to what we can do.