Hi @iN1k1, this is an issue with Wandb itself. See the Wandb docs to learn how to use wandb on multiple GPUs. You can pass the group parameter when you initialize wandb to define a shared experiment and group the logged values together in the W&B App UI, like: vis_backends = [dict(type='WandbVisBackend', init_kwargs=dict(group='xxx'))]
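For context, a minimal sketch of how that grouping setting would sit in an MMEngine-style config (the backend class name comes from the thread; 'my_experiment' is a placeholder group name, and the exact visualizer wiring may differ between versions):

```python
# Group all per-process wandb runs under one shared experiment name so
# the W&B App UI shows them together instead of as unrelated runs.
# 'my_experiment' is a placeholder; replace it with your own group name.
vis_backends = [
    dict(
        type='WandbVisBackend',
        init_kwargs=dict(group='my_experiment'),
    )
]

# The visualizer picks up the backends; 'Visualizer' here is assumed to
# be the default MMEngine visualizer type for your setup.
visualizer = dict(type='Visualizer', vis_backends=vis_backends)
```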
Hi all,
I am running a training job with the dist_train.sh script and logging the training details to wandb with the help of the LoggerHook and WandbGenVisBackend. Everything is fine during training: all the params and losses are correctly logged in wandb under a single run.

However, when the MultiValLoop is executed, wandb is initialized multiple times (once for each process with rank > 0). Each of these extra processes logs nothing; the validation results are saved only under the wandb run for rank = 0. So the problem is that there are multiple wandb inits that are useless and just add noise to the wandb UI. Is there any way to avoid multiple wandb initializations during the validation loop?
Thanks
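The behavior described above boils down to guarding logger initialization by process rank. A hypothetical, self-contained sketch of that guard (init_logger_on_rank0 is an illustrative helper, not MMEngine or wandb API; init_fn stands in for wandb.init, and RANK is the environment variable that launchers like torchrun set per process):

```python
import os


def init_logger_on_rank0(init_fn, **init_kwargs):
    """Call a logger's init function only in the rank-0 process.

    Hypothetical helper for illustration: `init_fn` stands in for
    wandb.init, and the rank is read from the RANK environment variable
    that distributed launchers set for each spawned process.
    """
    rank = int(os.environ.get("RANK", 0))
    if rank == 0:
        # Only the main process creates a real run.
        return init_fn(**init_kwargs)
    # Non-zero ranks skip initialization entirely, so no empty runs
    # appear in the UI.
    return None
```

With a guard like this in the validation loop, the rank > 0 processes would never create the empty runs described above; only rank 0 would own the wandb run.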