Skip to content

Conversation

simonLeary42
Copy link
Contributor

The definition for AccountInfo.queue is "the queue this account can use", but this is problematic:

  • "this account" is already muddy waters, because one AccountInfo is just one slurm association, so there are other "accounts" for the same slurm account with potentially other queues (or no queue)
  • as far as I know, the partition for a slurm association does not imply access, it just tells Slurm that the resource limits in that association should only be applied to jobs in that partition

I can't find any usage of this attribute. Please strike down this PR if I'm just not understanding or if this attribute is important.

@simonLeary42
Copy link
Contributor Author

If this change is liked, the sacctmgr command could also be changed so that it doesn't query for partition and the example output text files could be updated accordingly.

@johrstrom
Copy link
Contributor

I have to ask around to see what this actually does.

"this account" is already muddy waters, because one AccountInfo is just one slurm association, so there are other "accounts" for the same slurm account with potentially other queues (or no queue)

"this account" likely needs to represent many associations, because we're sort of a 'rosetta stone' for many types of schedulers. Or at least it should represent all the associations per cluster.

as far as I know, the partition for a slurm association does not imply access, it just tells Slurm that the resource limits in that association should only be applied to jobs in that partition

Could be right. Again, I need to ask around. As you may surmise, I edited that file by hand to provide partition_a and partition_b because we don't use partitions like that at OSC.

May I ask what sacctmgr -nP show users withassoc format=account,cluster,partition,qos where user=$(whoami) looks like on your system?

@johrstrom
Copy link
Contributor

It appears we don't use it in OnDemand, so yea maybe we can just get rid of it. And if we ever need it again, we'll just add it back.

@simonLeary42
Copy link
Contributor Author

May I ask what sacctmgr -nP show users withassoc format=account,cluster,partition,qos where user=$(whoami) looks like on your system?

$ sacctmgr -nP show users withassoc format=account,cluster,partition,qos where user=$(whoami)
pi_simonleary_umass_edu|unity||long,normal,short
pi_tbernard_umass_edu|unity||long,normal,short

@simonLeary42
Copy link
Contributor Author

simonLeary42 commented Jul 16, 2025

"this account" likely needs to represent many associations

I agree. I started work on another PR which combines the QOS and queue attributes of duplicate AccountInfo into a list, but then I thought this should come first.

@simonLeary42
Copy link
Contributor Author

simonLeary42 commented Jul 16, 2025

May I ask what sacctmgr -nP show users withassoc format=account,cluster,partition,qos where user=$(whoami) looks like on your system?

$ sacctmgr -nP show users withassoc format=account,cluster,partition,qos where user=$(whoami)
pi_simonleary_umass_edu|unity||long,normal,short
pi_tbernard_umass_edu|unity||long,normal,short

We do have associations with partitions, however, they're just special cases. For example, we have QOSes gpu-quota-{4,8,20} which we grant to certain accounts based on how many GPUs they paid for. Those associations are limited to the partition dedicated to the GPUs that people pooled together to buy.

@simonLeary42
Copy link
Contributor Author

I started work on another PR which combines the QOS and queue attributes of duplicate AccountInfo

#891

@simonLeary42
Copy link
Contributor Author

As you may surmise, I edited that file by hand to provide partition_a and partition_b because we don't use partitions like that at OSC.

Removed those edits since they would be no longer used

@simonLeary42
Copy link
Contributor Author

Removed those edits since they would be no longer used

here as well: OSC/ondemand@c38d407

@johrstrom
Copy link
Contributor

You seem to have a conflicting PR in #891. Which do you want?

@simonLeary42
Copy link
Contributor Author

You seem to have a conflicting PR in #891. Which do you want?

I would want this to be merged and then I'll resolve the conflict

@simonLeary42
Copy link
Contributor Author

You seem to have a conflicting PR in #891. Which do you want?

I would want this to be merged and then I'll resolve the conflict

They aren't mutually exclusive, after a merge conflict resolution that will still deduplicate AccountInfos and combine their .qos attributes into one list.

@johrstrom
Copy link
Contributor

Hmmmm, I'm not sure. This seems unnecessary given what's in #891. It seems to me that we should do #891 if we want to keep this information and just close this? Am I missing something?

@simonLeary42
Copy link
Contributor Author

simonLeary42 commented Jul 16, 2025

Hmmmm, I'm not sure. This seems unnecessary given what's in #891. It seems to me that we should do #891 if we want to keep this information and just close this? Am I missing something?

The partition in a Slurm association applies only to the QOS in that association. If we combine the QOS lists, then partition has no meaning because we don't know which QOS it applies to and which it doesn't.

@simonLeary42
Copy link
Contributor Author

Hmmmm, I'm not sure. This seems unnecessary given what's in #891. It seems to me that we should do #891 if we want to keep this information and just close this? Am I missing something?

The partition in a Slurm association applies only to the QOS in that association. If we combine the QOS lists, then partition has no meaning because we don't know which QOS it applies to and which it doesn't.

Sorry, that's backwards. If an association has a partition and QOSes, then those QOSes can only be used in that partition, and OOD could actually be reading that data so that auto_queues / auto_qos / auto_account can prevent you from submiting an un-runnable job. This would be a bunch of extra logic, and if it were implemented I would think that it should use some other attribute besides accounts, because I don't think it's likely that any other job scheduler is going to have a similar structure for what an account is.

@simonLeary42
Copy link
Contributor Author

If an association has a partition and QOSes, then those QOSes can only be used in that partition

I have not confirmed this behavior, I'm setting up a docker compose slurm cluster to test this out.

It could also be that partition in an association does nothing at all except provide context to resource limits such as PartitionTimeLimit.

@johrstrom
Copy link
Contributor

I meant more like the net result after all these changes.

I'm saying, isn't the net result of 890 AND 891 just basically 891 itself? I mean like why remove the queue attribute here and just replace it with queues in 891? Why not just forgo this change altogether if we're going to turn around and change it again?

@simonLeary42
Copy link
Contributor Author

My intention was to resolve the merge conflict by deleting queues from 891.

@simonLeary42
Copy link
Contributor Author

At the time I wasn't sure if deleting it was a good idea, so I thought we would talk about deleting it first and then I would know what to do with 891.

@johrstrom
Copy link
Contributor

At the time I wasn't sure if deleting it was a good idea, so I thought we would talk about deleting it first and then I would know what to do with 891.

It seems very two steps forward, one step back. Instead I'd prefer we just take one step forward and end up in the same place. It's easier for me to manage now, but then also easier to trace through git blame if changes are atomic.

@simonLeary42
Copy link
Contributor Author

ok, I merged 891 into this branch.

@johrstrom
Copy link
Contributor

Thanks! I'll need just a bit to double/triple check that we can get rid of this api.

@simonLeary42
Copy link
Contributor Author

OK I'm done messing with this, no more commits should trickle in.


with_modified_env({ OOD_UPCASE_ACCOUNTS: 'true'}) do
expect(subject.accounts.map(&:to_s).uniq).to eq(expected_accounts)
expect(subject.accounts.map(&:to_s).uniq.to_set).to eq(expected_accounts.to_set)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, now that this can return duplicates I think we shouldn't use to_set and in fact ensure that duplicates are returned as that's.

It's possible (probable/) that I should not have started with uniq, but with no second pair of eyes, it slipped through.

In any case, if returning duplicates is expected out of this API, then we should ensure that we test for that duplication. Maybe we need to test against tuples with 2 keys account & cluster pairs?

@simonLeary42 simonLeary42 force-pushed the remove-queue-AccountInfo branch from f5020db to 0a62e97 Compare August 22, 2025 15:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants