
Conversation

sali1293

Add Azure ML compatibility to ParallelRunner via distributed env var support (RANK/WORLD_SIZE) and env:// init method

Description

This PR introduces compatibility for Azure Machine Learning Studio in the ParallelRunner by adding support for distributed environment variables (e.g., RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, MASTER_PORT) and using the 'env://' initialization method for torch.distributed when applicable.

What problem does this change solve?

It enables seamless parallel inference in cloud-based distributed environments like Azure ML, where Slurm (srun) is not used, by detecting and utilizing standard PyTorch distributed env vars instead of relying solely on Slurm or manual process spawning.
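
For illustration only, a minimal sketch of the behaviour described above (not the PR's actual code): the helper name _using_distributed_env matches the one listed in the notes below, but its body and the _init_from_env function here are assumptions.

import os

import torch
import torch.distributed as dist


def _using_distributed_env() -> bool:
    # True when a launcher such as Azure ML has already exported the
    # standard torch.distributed variables for this process.
    return "RANK" in os.environ and "WORLD_SIZE" in os.environ


def _init_from_env() -> None:
    # With init_method="env://", torch.distributed reads MASTER_ADDR,
    # MASTER_PORT, RANK and WORLD_SIZE directly from the environment,
    # so no srun bootstrap or manual process spawning is needed.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend, init_method="env://")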

What issue or task does this change relate to?

N/A (This is an enhancement based on modifications for Azure ML compatibility; no specific GitHub issue linked.)

Additional notes

  • Changes are minimal and isolated to the _bootstrap_processes and _init_parallel methods in ParallelRunnerMixin to avoid disrupting existing Slurm or manual spawning workflows.

  • Added helper methods _using_distributed_env and _is_mpi_env for cleaner logic.

  • No breaking changes; falls back gracefully to existing behaviors.

  • Tested in a multi-GPU Azure ML environment; no updates to dependencies required.

  • MPI detection is optional and only used for non-CUDA backends when available; see the sketch after this list.
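
A minimal sketch of what that optional MPI detection might look like; the environment variables checked here (OMPI_COMM_WORLD_SIZE for Open MPI, PMI_SIZE for MPICH-style launchers) are an assumption, and the PR's actual _is_mpi_env may differ.

import os


def _is_mpi_env() -> bool:
    # Heuristic: MPI launchers (mpirun/mpiexec) typically export one of
    # these variables for every rank they spawn.
    return any(v in os.environ for v in ("OMPI_COMM_WORLD_SIZE", "PMI_SIZE"))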

@sali1293 sali1293 changed the title Update parallel.py Add Azure ML compatibility to ParallelRunner Sep 23, 2025
@sali1293 sali1293 changed the title Add Azure ML compatibility to ParallelRunner feat: Add Azure ML compatibility to ParallelRunner Sep 23, 2025
Member

@gmertes gmertes left a comment


Thanks for this, this looks okay at first glance. A while ago the idea came up to separate this code out of the ParallelRunner and delegate it to something like a ClusterEnvironment class (taking inspiration from pytorch-lightning).

The setting of all these variables like global_rank, local_rank, etc. and the initialisation of the backend would then be done by a derived class for Slurm, MPI, Azure, etc.

Would you be interested in working on this kind of refactor? We can always split it into two PRs: first we merge this one with the ifs to get something that works for you, and then we refactor into delegated classes. I believe @cathalobrien will also have some suggestions on this.

@cathalobrien
Contributor

Nice work! I would be happy to work with you on this.

I think it would be good if we made a parallel runner base class with the following abstract methods:
_bootstrap_processes
_init_parallel

and then create local, SLURM, and AzureML subclasses which implement these methods (a rough sketch follows below).
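
Purely as an illustration of this suggestion (the class names below are hypothetical, not an agreed design), the hierarchy could look roughly like:

from abc import ABC, abstractmethod


class ParallelRunnerBase(ABC):
    @abstractmethod
    def _bootstrap_processes(self) -> None:
        """Discover or spawn the worker processes for this cluster type."""

    @abstractmethod
    def _init_parallel(self) -> None:
        """Initialise the torch.distributed process group."""


class SlurmParallelRunner(ParallelRunnerBase):
    def _bootstrap_processes(self) -> None:
        ...  # e.g. read SLURM_PROCID / SLURM_NTASKS set by srun

    def _init_parallel(self) -> None:
        ...


class AzureMLParallelRunner(ParallelRunnerBase):
    def _bootstrap_processes(self) -> None:
        ...  # e.g. read RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT

    def _init_parallel(self) -> None:
        ...  # init_process_group(init_method="env://")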

@gmertes
Member

gmertes commented Sep 24, 2025

Would delegation be easier to manage instead of inheritance? Then the cluster environment can simply be part of the constructor through a lookup table, something like:

ENVIRONMENTS = {
  'mpi': MpiEnv,
  'slurm': SlurmEnv
}

class ParallelRunner:
  def __init__(self, env = 'slurm'):
    self.env = ENVIRONMENTS[env](self)   # pass self so env has access to runner attributes if needed
    self.env.bootstrap_processes()
    self.env.init_parallel()

@sali1293
Author

@gmertes happy with the two-PR approach, merging this one first and then a second one with further changes/enhancements. Are you happy for me to publish the PR (it's in draft state currently)?

@gmertes
Member

gmertes commented Sep 25, 2025

Yes that sounds good to me!

@sali1293 sali1293 marked this pull request as ready for review September 25, 2025 11:11
@sali1293
Author

@cathalobrien can you please have a look, as @gmertes requested? Thanks

@cathalobrien
Contributor

I'm on leave, I'll have a look on Monday. Cheers

@sali1293
Author

sali1293 commented Oct 1, 2025

Hi @cathalobrien, wondering if you would have time this week to have a look and possibly merge this. Thanks

@cathalobrien
Contributor

Thanks for reminding me, I will have a look today.
