feat: Add Azure ML compatibility to ParallelRunner #329
base: main
Conversation
Add Azure ML compatibility to ParallelRunner via distributed env var support (RANK/WORLD_SIZE) and env:// init method
for more information, see https://pre-commit.ci
Thanks for this, this looks okay at first glance. A while ago the idea did come up to separate this code out of the ParallelRunner and delegate it to something like a ClusterEnvironment class (taking inspiration from pytorch-lightning).
The setting of all these variables like global_rank, local_rank etc. and the initialisation of the backend would then be done by a derived class for Slurm, MPI, Azure, etc.
Would you be interested in working on this kind of refactor? We can always split it into two PRs: first we merge this one with the ifs to get something that works for you, and then we refactor into delegated classes. I believe @cathalobrien will also have some suggestions on this.
Nice work! I would be happy to work with you on this. I think it would be good if we made a parallel runner base class with a set of abstract methods, and then create local, SLURM and AzureML subclasses which implement these methods.
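As a very rough sketch of what I mean (method and class names here are only illustrative, borrowed from the existing runner, not a final interface):

```python
from abc import ABC, abstractmethod


class BaseParallelRunner(ABC):
    """Hypothetical base class; subclasses fill in the environment-specific parts."""

    @abstractmethod
    def bootstrap_processes(self):
        """Determine global_rank, local_rank, world_size and the master address/port."""

    @abstractmethod
    def init_parallel(self):
        """Initialise the torch.distributed process group for this environment."""


class SlurmParallelRunner(BaseParallelRunner):
    def bootstrap_processes(self): ...
    def init_parallel(self): ...


class AzureMLParallelRunner(BaseParallelRunner):
    def bootstrap_processes(self): ...
    def init_parallel(self): ...
```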
Would delegation be easier to manage instead of inheritance? Then the cluster environment can simply be part of the constructor through a lookup table, something like:

```python
ENVIRONMENTS = {
    'mpi': MpiEnv,
    'slurm': SlurmEnv,
}


class ParallelRunner:
    def __init__(self, env='slurm'):
        self.env = ENVIRONMENTS[env](self)  # pass self so env has access to runner attributes if needed
        self.env.bootstrap_processes()
        self.env.init_parallel()
```
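The individual environment classes could then stay very thin. Purely as an illustration (assuming Slurm's SLURM_* variables on one side and the standard torchrun/Azure ML RANK/WORLD_SIZE variables on the other; the attribute names are made up):

```python
import os


class SlurmEnv:
    def __init__(self, runner):
        self.runner = runner

    def bootstrap_processes(self):
        # Slurm publishes rank information through its own variables
        self.runner.global_rank = int(os.environ['SLURM_PROCID'])
        self.runner.local_rank = int(os.environ['SLURM_LOCALID'])
        self.runner.world_size = int(os.environ['SLURM_NTASKS'])

    def init_parallel(self):
        ...


class AzureMLEnv:
    def __init__(self, runner):
        self.runner = runner

    def bootstrap_processes(self):
        # Azure ML / torchrun-style launchers export the standard PyTorch variables
        self.runner.global_rank = int(os.environ['RANK'])
        self.runner.local_rank = int(os.environ.get('LOCAL_RANK', 0))
        self.runner.world_size = int(os.environ['WORLD_SIZE'])

    def init_parallel(self):
        ...
```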
@gmertes happy with the two-PR approach, merging this one first and then a second one with further changes / enhancements. Are you happy for me to publish the PR? (It's currently in draft.)
Yes that sounds good to me!
@cathalobrien can you please have a look as @gmertes requested. Thanks |
i'm on leave, i'll have a look on monday. cheers |
Hi @cathalobrien, wondering if you would have time this week to have a look and possibly merge this. Thanks |
thanks for reminding me, I will have a look today |
Description
This PR introduces compatibility for Azure Machine Learning Studio in the ParallelRunner by adding support for distributed environment variables (e.g., RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, MASTER_PORT) and using the 'env://' initialization method for torch.distributed when applicable.
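For reviewers unfamiliar with the 'env://' rendezvous: it amounts to letting torch.distributed read the rendezvous details from those environment variables instead of constructing them manually. Roughly (an illustration only, not the exact code in the diff):

```python
import os

import torch
import torch.distributed as dist

# RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are exported by the Azure ML
# launcher, so init_method='env://' picks them up without extra plumbing.
backend = 'nccl' if torch.cuda.is_available() else 'gloo'
dist.init_process_group(backend=backend, init_method='env://')

if torch.cuda.is_available():
    torch.cuda.set_device(int(os.environ.get('LOCAL_RANK', 0)))
```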
What problem does this change solve?
It enables seamless parallel inference in cloud-based distributed environments like Azure ML, where Slurm (srun) is not used, by detecting and utilizing standard PyTorch distributed env vars instead of relying solely on Slurm or manual process spawning.
What issue or task does this change relate to?
N/A (This is an enhancement based on modifications for Azure ML compatibility; no specific GitHub issue linked.)
Additional notes
Changes are minimal and isolated to the _bootstrap_processes and _init_parallel methods in ParallelRunnerMixin to avoid disrupting existing Slurm or manual spawning workflows.
Added helper methods _using_distributed_env and _is_mpi_env for cleaner logic (a rough sketch of how they fit together follows after these notes).
No breaking changes; falls back gracefully to existing behaviors.
Tested in a multi-GPU Azure ML environment; no updates to dependencies required.
MPI detection is optional and only used for non-CUDA backends when available.
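To make the intended control flow concrete, here is a sketch of how the pieces fit together. Only the two helper names and the ParallelRunnerMixin method names come from the actual change; the bodies, attribute names and environment-variable checks below are illustrative:

```python
import os


class ParallelRunnerMixin:
    def _using_distributed_env(self) -> bool:
        # True when an Azure ML / torchrun-style launcher has exported the
        # standard PyTorch distributed variables.
        return 'RANK' in os.environ and 'WORLD_SIZE' in os.environ

    def _is_mpi_env(self) -> bool:
        # Optional check, only consulted for non-CUDA backends.
        return 'OMPI_COMM_WORLD_RANK' in os.environ or 'PMI_RANK' in os.environ

    def _bootstrap_processes(self):
        if self._using_distributed_env():
            # Azure ML path: trust the launcher-provided variables.
            self.global_rank = int(os.environ['RANK'])
            self.local_rank = int(os.environ.get('LOCAL_RANK', 0))
            self.world_size = int(os.environ['WORLD_SIZE'])
        elif 'SLURM_PROCID' in os.environ:
            # Existing Slurm path, unchanged.
            self.global_rank = int(os.environ['SLURM_PROCID'])
            self.local_rank = int(os.environ['SLURM_LOCALID'])
            self.world_size = int(os.environ['SLURM_NTASKS'])
        else:
            # Existing fallback: manual process spawning (omitted from this sketch).
            ...
```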