Merged · 35 commits (changes shown from 12 commits)
- `3bb795f` feat: initial commit of automated partition selection (cademirch, Jun 6, 2025)
- `6f475c3` fix: flatten expected yaml structure; remove partition description (cademirch, Jun 6, 2025)
- `f8d342e` fix: update cli arg name and help (cademirch, Jun 6, 2025)
- `f5511ff` docs: update docs and help (cademirch, Jun 6, 2025)
- `01a94c6` chore: linting; change automatic to dynamic partition selection (cademirch, Jun 7, 2025)
- `bb8c471` chore: format tests (cademirch, Jun 7, 2025)
- `daa202c` fix: merge conflicts (cmeesters, Oct 30, 2025)
- `0d95537` fix: merge conflicts (cmeesters, Oct 30, 2025)
- `25c043b` fix: syntax error after merge conflict (cmeesters, Oct 30, 2025)
- `7a6e582` fix: left over line from merge conflict (cmeesters, Oct 30, 2025)
- `4726ce6` fix: missing import (cmeesters, Oct 30, 2025)
- `3febf2c` fix: removed doubled import statement (cmeesters, Oct 30, 2025)
- `65396ac` fix: made an error when solving merge conflict - mock job for partiti… (cmeesters, Oct 30, 2025)
- `e652e69` fix: formatting (cmeesters, Oct 30, 2025)
- `982dfee` feat: allowing env variable to define the partition profile (cmeesters, Nov 14, 2025)
- `da1b009` feat: enabling environment variable for partition configuration file (cmeesters, Nov 20, 2025)
- `65f1007` feat: added 'threads' hardening selection score (cmeesters, Nov 20, 2025)
- `9d01b96` feat: support cluster selection in multi-cluster env (cmeesters, Nov 20, 2025)
- `38d7daf` fix: sound thread checking (cmeesters, Nov 20, 2025)
- `53ed396` fix: removed print statement (debugg leftover) (cmeesters, Nov 20, 2025)
- `2ee9f09` docs: added docs for this PR (cmeesters, Nov 20, 2025)
- `744b8b3` tests: added tests for this PR (cmeesters, Nov 20, 2025)
- `be655d1` fix: formatting (cmeesters, Nov 20, 2025)
- `dd40f24` fix: formatting (cmeesters, Nov 20, 2025)
- `fd2336e` fix: relative import (cmeesters, Nov 20, 2025)
- `adb276c` feat: first step to refactoring, trying Snakemake's dpath (cmeesters, Nov 20, 2025)
- `f96cdc5` fix: added missing whitespace (cmeesters, Nov 20, 2025)
- `64e99b9` fix: reordered (cmeesters, Nov 20, 2025)
- `5b53def` fix: no negative value for cpus_per_task (cmeesters, Nov 20, 2025)
- `0781a2b` fix: mock logger level (cmeesters, Nov 20, 2025)
- `f8538ed` fix: threads check abbreviated (cmeesters, Nov 20, 2025)
- `8aeb8c5` fix: import order (cmeesters, Nov 20, 2025)
- `db00ab1` fix: formatting, gnarf! (cmeesters, Nov 20, 2025)
- `eb9933d` fix: attempt to run modularized tests in one go without additional im… (cmeesters, Nov 20, 2025)
- `7c662bb` fix: mock warnings (cmeesters, Nov 20, 2025)
73 changes: 73 additions & 0 deletions docs/further.md
@@ -64,6 +64,79 @@ See the [snakemake documentation on profiles](https://snakemake.readthedocs.io/e
How and where you set configurations depends on factors like file size, or on increasing the runtime with every `attempt` of running a job (if [`--retries` is greater than `0`](https://snakemake.readthedocs.io/en/stable/executing/cli.html#snakemake.cli-get_argument_parser-behavior)).
[There are detailed examples for these in the snakemake documentation.](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#dynamic-resources)

#### Automatic Partition Selection

The SLURM executor plugin supports automatic partition selection based on job resource requirements, via the command line option `--slurm-partition-config`. This feature allows the plugin to choose the most appropriate partition for each job, without the need to manually specify partitions for different job types. This also enables variable partition selection as a job's resource requirements change based on [dynamic resources](#dynamic-resource-specification), ensuring that jobs are always scheduled to an appropriate partition.

*Jobs that explicitly specify a `slurm_partition` resource will bypass automatic selection and use the specified partition directly.*

##### Partition Limits Specification

To enable automatic partition selection, create a YAML configuration file that defines the available partitions and their resource limits. This file should be structured as follows:

```yaml
partitions:
  some_partition:
    max_runtime: 100
  another_partition:
    ...
```
Here, `some_partition` and `another_partition` are the names of partitions on your cluster, as reported by `sinfo`.

The following limits can be defined for each partition:

| Parameter | Type | Description | Default |
| ----------------------- | --------- | ---------------------------------- | --------- |
| `max_runtime` | int | Maximum walltime in minutes | unlimited |
| `max_mem_mb` | int | Maximum total memory in MB | unlimited |
| `max_mem_mb_per_cpu` | int | Maximum memory per CPU in MB | unlimited |
| `max_cpus_per_task` | int | Maximum CPUs per task | unlimited |
| `max_nodes` | int | Maximum number of nodes | unlimited |
| `max_tasks` | int | Maximum number of tasks | unlimited |
| `max_tasks_per_node` | int | Maximum tasks per node | unlimited |
| `max_gpu` | int | Maximum number of GPUs | 0 |
| `available_gpu_models` | list[str] | List of available GPU models | none |
| `max_cpus_per_gpu` | int | Maximum CPUs per GPU | unlimited |
| `supports_mpi` | bool | Whether MPI jobs are supported | true |
| `max_mpi_tasks` | int | Maximum MPI tasks | unlimited |
| `available_constraints` | list[str] | List of available node constraints | none |
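The defaults in the table can be sketched as a simple data model. This is an illustrative Python sketch with assumed names (not the plugin's actual classes): unset limits mean "unlimited", while `max_gpu` defaults to `0` and `supports_mpi` to `True`, as in the table above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PartitionLimits:
    """Hypothetical model of one partition entry from the YAML file."""
    max_runtime: Optional[int] = None          # minutes; None = unlimited
    max_mem_mb: Optional[int] = None
    max_mem_mb_per_cpu: Optional[int] = None
    max_cpus_per_task: Optional[int] = None
    max_nodes: Optional[int] = None
    max_tasks: Optional[int] = None
    max_tasks_per_node: Optional[int] = None
    max_gpu: int = 0                           # no GPUs unless declared
    available_gpu_models: List[str] = field(default_factory=list)
    max_cpus_per_gpu: Optional[int] = None
    supports_mpi: bool = True
    max_mpi_tasks: Optional[int] = None
    available_constraints: List[str] = field(default_factory=list)

# A partition that only declares a runtime cap still rejects GPU jobs,
# because max_gpu stays at its default of 0.
standard = PartitionLimits(max_runtime=720, max_cpus_per_task=24)
print(standard.max_gpu)  # → 0
```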

##### Example Partition Configuration

```yaml
partitions:
  standard:
    max_runtime: 720        # 12 hours
    max_mem_mb: 64000       # 64 GB
    max_cpus_per_task: 24
    max_nodes: 1

  highmem:
    max_runtime: 1440       # 24 hours
    max_mem_mb: 512000      # 512 GB
    max_mem_mb_per_cpu: 16000
    max_cpus_per_task: 48
    max_nodes: 1

  gpu:
    max_runtime: 2880       # 48 hours
    max_mem_mb: 128000      # 128 GB
    max_cpus_per_task: 32
    max_gpu: 8
    available_gpu_models: ["a100", "v100", "rtx3090"]
    max_cpus_per_gpu: 8
```
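Assuming the file above is saved as `partitions.yaml` (a hypothetical path), it can be passed on the command line via `--slurm-partition-config partitions.yaml` or, equivalently, via the corresponding key in a Snakemake profile (the profile key name mirroring the CLI option is an assumption here):

```yaml
# Hypothetical profile config.yaml
executor: slurm
slurm-partition-config: partitions.yaml
```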

##### How Partition Selection Works

When automatic partition selection is enabled, the plugin evaluates each job's resource requirements against the defined partition limits to ensure the job is placed on a partition that can accommodate all of its requirements. When multiple partitions are compatible, the plugin uses a scoring algorithm that favors partitions with limits closer to the job's needs, preventing jobs from being assigned to partitions with excessively high resource limits.

The scoring algorithm calculates a score by summing the ratios of requested resources to partition limits (e.g., if a job requests 8 CPUs and a partition allows 16, this contributes 0.5 to the score). Higher scores indicate better resource utilization, so a job requesting 8 CPUs would prefer a 16-CPU partition (score 0.5) over a 64-CPU partition (score 0.125).
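The selection logic described above can be sketched as follows. This is a simplified illustration, not the plugin's actual implementation: for brevity, job requests and partition limits share the same keys, and a missing or zero limit is treated as unlimited.

```python
from typing import Dict, Optional

def fits(requested: Dict[str, float], limits: Dict[str, float]) -> bool:
    """A partition is compatible if no request exceeds a defined limit."""
    return all(
        limits.get(res) is None or amount <= limits[res]
        for res, amount in requested.items()
    )

def score(requested: Dict[str, float], limits: Dict[str, float]) -> float:
    """Sum of requested/limit ratios; higher means a tighter fit."""
    total = 0.0
    for res, amount in requested.items():
        limit = limits.get(res)
        if limit:  # undefined (None/0) limits contribute nothing
            total += amount / limit
    return total

def best_partition(
    job: Dict[str, float], partitions: Dict[str, Dict[str, float]]
) -> Optional[str]:
    """Pick the compatible partition with the highest score, if any."""
    candidates = {
        name: score(job, limits)
        for name, limits in partitions.items()
        if fits(job, limits)
    }
    # None signals "no match": fall back to default SLURM behavior.
    return max(candidates, key=candidates.get) if candidates else None

partitions = {
    "small": {"max_cpus_per_task": 16, "max_mem_mb": 32000},
    "big": {"max_cpus_per_task": 64, "max_mem_mb": 256000},
}
job = {"max_cpus_per_task": 8, "max_mem_mb": 16000}
print(best_partition(job, partitions))  # → small (score 1.0 vs 0.1875)
```

A job requesting 8 CPUs and 16 GB fits both partitions, but `small` scores 0.5 + 0.5 = 1.0 versus 0.125 + 0.0625 = 0.1875 for `big`, so the tighter-fitting partition wins.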

##### Fallback Behavior

If no suitable partition is found based on the job's resource requirements, the plugin falls back to the default SLURM behavior, which typically uses the cluster's default partition or any partition specified explicitly in the job's resources.


#### Standard Resources

25 changes: 24 additions & 1 deletion snakemake_executor_plugin_slurm/__init__.py
@@ -3,6 +3,7 @@
__email__ = "[email protected]"
__license__ = "MIT"

import atexit
import csv
from io import StringIO
import os
@@ -35,6 +36,7 @@
)
from .efficiency_report import create_efficiency_report
from .submit_string import get_submit_command
from .partitions import read_partition_file, get_best_partition
from .validation import validate_slurm_extra


@@ -113,6 +115,15 @@ class ExecutorSettings(ExecutorSettingsBase):
"required": False,
},
)
partition_config: Optional[Path] = field(
default=None,
metadata={
"help": "Path to YAML file defining partition limits for dynamic "
"partition selection. When provided, jobs will be dynamically "
"assigned to the best-fitting partition based on their resource "
"requirements. See documentation for the complete list of "
"available limits.",
},
)
efficiency_report: bool = field(
default=False,
metadata={
@@ -201,6 +212,12 @@ def __post_init__(self, test_mode: bool = False):
if self.workflow.executor_settings.logdir
else Path(".snakemake/slurm_logs").resolve()
)
self._partitions = (
read_partition_file(self.workflow.executor_settings.partition_config)
if self.workflow.executor_settings.partition_config
else None
)
atexit.register(self.clean_old_logs)

def shutdown(self) -> None:
"""
@@ -305,6 +322,8 @@ def run_job(self, job: JobExecutorInterface):
if job.resources.get("slurm_extra"):
self.check_slurm_extra(job)

# NOTE: `partition` was removed from the dict below, so that partition
# selection can benefit from resource checking as the call is built up.
job_params = {
"run_uuid": self.run_uuid,
"slurm_logfile": slurm_logfile,
@@ -698,9 +717,13 @@ def get_partition_arg(self, job: JobExecutorInterface):
returns a default partition, if applicable,
else raises an error - implicitly.
"""
partition = None
if job.resources.get("slurm_partition"):
partition = job.resources.slurm_partition
else:
elif self._partitions:
partition = get_best_partition(self._partitions, job, self.logger)
# we didn't get a partition yet, so try the fallback.
if not partition:
if self._fallback_partition is None:
self._fallback_partition = self.get_default_partition(job)
partition = self._fallback_partition