
Conversation

sali1293

Add Azure ML compatibility to ParallelRunner via distributed env var support (RANK/WORLD_SIZE) and env:// init method

Description

This PR introduces compatibility for Azure Machine Learning Studio in the ParallelRunner by adding support for distributed environment variables (e.g., RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, MASTER_PORT) and using the 'env://' initialization method for torch.distributed when applicable.

What problem does this change solve?

It enables seamless parallel inference in cloud-based distributed environments like Azure ML, where Slurm (srun) is not used, by detecting and utilizing standard PyTorch distributed env vars instead of relying solely on Slurm or manual process spawning.
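
For illustration only, a minimal sketch of the behaviour described above (not the PR's actual code): the helper name _using_distributed_env matches the one listed in the notes below, but its body and the _init_from_env function here are assumptions.

import os

import torch
import torch.distributed as dist


def _using_distributed_env() -> bool:
    # True when a launcher such as Azure ML has already exported the
    # standard torch.distributed variables for this process.
    return "RANK" in os.environ and "WORLD_SIZE" in os.environ


def _init_from_env() -> None:
    # With init_method="env://", torch.distributed reads MASTER_ADDR,
    # MASTER_PORT, RANK and WORLD_SIZE directly from the environment,
    # so no srun bootstrap or manual process spawning is needed.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend, init_method="env://")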

What issue or task does this change relate to?

N/A (This is an enhancement based on modifications for Azure ML compatibility; no specific GitHub issue linked.)

Additional notes

  • Changes are minimal and isolated to the _bootstrap_processes and _init_parallel methods in ParallelRunnerMixin to avoid disrupting existing Slurm or manual spawning workflows.

  • Added helper methods _using_distributed_env and _is_mpi_env for cleaner logic.

  • No breaking changes; falls back gracefully to existing behaviors.

  • Tested in a multi-GPU Azure ML environment; no updates to dependencies required.

  • MPI detection is optional and only used for non-CUDA backends when available; see the sketch after this list.
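
A minimal sketch of what that optional MPI detection might look like; the environment variables checked here (OMPI_COMM_WORLD_SIZE for Open MPI, PMI_SIZE for MPICH-style launchers) are an assumption, and the PR's actual _is_mpi_env may differ.

import os


def _is_mpi_env() -> bool:
    # Heuristic: MPI launchers (mpirun/mpiexec) typically export one of
    # these variables for every rank they spawn.
    return any(v in os.environ for v in ("OMPI_COMM_WORLD_SIZE", "PMI_SIZE"))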

@sali1293 sali1293 changed the title Update parallel.py Add Azure ML compatibility to ParallelRunner Sep 23, 2025
@sali1293 sali1293 changed the title Add Azure ML compatibility to ParallelRunner feat: Add Azure ML compatibility to ParallelRunner Sep 23, 2025
Member

@gmertes gmertes left a comment


Thanks for this, this looks okay at first glance. A while ago the idea came up to separate this code out of the ParallelRunner and delegate it to something like a ClusterEnvironment class (taking inspiration from pytorch-lightning).

The setting of all these variables like global_rank, local_rank, etc. and the initialisation of the backend would then be done by a derived class for Slurm, MPI, Azure, etc.

Would you be interested in working on this kind of refactor? We can always split it into two PRs: first we merge this one with the ifs to get something that works for you, and then we refactor into delegated classes. I believe @cathalobrien will also have some suggestions on this.

@cathalobrien
Contributor

Nice work! I would be happy to work with you on this.

I think it would be good if we made a parallel runner base class with the following abstract methods:
_bootstrap_processes
_init_parallel

and then create local, SLURM, and AzureML subclasses which implement these methods (a rough sketch follows below).
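
Purely as an illustration of this suggestion (the class names below are hypothetical, not an agreed design), the hierarchy could look roughly like:

from abc import ABC, abstractmethod


class ParallelRunnerBase(ABC):
    @abstractmethod
    def _bootstrap_processes(self) -> None:
        """Discover or spawn the worker processes for this cluster type."""

    @abstractmethod
    def _init_parallel(self) -> None:
        """Initialise the torch.distributed process group."""


class SlurmParallelRunner(ParallelRunnerBase):
    def _bootstrap_processes(self) -> None:
        ...  # e.g. read SLURM_PROCID / SLURM_NTASKS set by srun

    def _init_parallel(self) -> None:
        ...


class AzureMLParallelRunner(ParallelRunnerBase):
    def _bootstrap_processes(self) -> None:
        ...  # e.g. read RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT

    def _init_parallel(self) -> None:
        ...  # init_process_group(init_method="env://")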

@gmertes
Member

gmertes commented Sep 24, 2025

Would delegation be easier to manage instead of inheritance? Then the cluster environment can simply be part of the constructor through a lookup table, something like:

ENVIRONMENTS = {
  'mpi': MpiEnv,
  'slurm': SlurmEnv
}

class ParallelRunner:
  def __init__(self, env = 'slurm'):
    self.env = ENVIRONMENTS[env](self)   # pass self so env has access to runner attributes if needed
    self.env.bootstrap_processes()
    self.env.init_parallel()

@sali1293
Author

@gmertes happy with the two-PR approach, merging this one first and then a second one with further changes/enhancements. Are you happy for me to publish the PR (it's in draft state currently)?

@gmertes
Member

gmertes commented Sep 25, 2025

Yes that sounds good to me!

@sali1293 sali1293 marked this pull request as ready for review September 25, 2025 11:11
@sali1293
Author

@cathalobrien can you please have a look, as @gmertes requested? Thanks

@cathalobrien
Contributor

I'm on leave, I'll have a look on Monday. Cheers

@sali1293
Author

sali1293 commented Oct 1, 2025

Hi @cathalobrien, wondering if you would have time this week to have a look and possibly merge this. Thanks

@cathalobrien
Contributor

Thanks for reminding me, I will have a look today.
