Logging Metrics Manually in DDP mode with 2 GPUs #3088
The torchmetrics docs for PyTorch Lightning describe how to manually call `compute()` and `reset()` at the end of the train/val epochs when logging a metric manually instead of letting `self.log(...)` do the work. This works for me on a single GPU, but with 2 GPUs and the DDP strategy, the `metric.compute()` call hangs for a long time and then a timeout error occurs. Is there a different set of instructions for manual logging in DDP mode when using more than 1 GPU?
Replies: 1 comment
When using torchmetrics with PyTorch Lightning in DDP mode (multi-GPU), metrics need to be synchronized across all processes before calling `compute()`. Each GPU accumulates partial metric states locally, and `compute()` waits for all ranks to sync, which can cause hangs if not done properly.

A recommended approach: pass the `sync_on_compute=True` argument when initializing your metric. This ensures the metric automatically syncs across processes before computing:

```python
from torchmetrics import Accuracy

# Note: on torchmetrics >= 0.11 you also need the task arguments,
# e.g. Accuracy(task="multiclass", num_classes=..., sync_on_compute=True)
metric = Accuracy(sync_on_compute=True)
```

Then you can safely call:

```python
metric.update(preds, target)
result = metric.compute()
metric.reset()
```

Alternatively, you can manually call `metric.sync()` before `compute()`.

Also, make sure every rank calls `compute()` — it is a collective operation, so if only one process reaches it, the others block until the timeout you are seeing.

This should fix the hanging issue in DDP mode. Let me know if you want a minimal example to demonstrate this!