Commit d42316e

gautham-kollu, chtruong814, and ananthsub authored
Add Flops calculator and training flop achieved log (#388)
* Copy from Mcore Signed-off-by: gkollu <[email protected]>
* fix run scripts
* local Signed-off-by: gkollu <[email protected]>
* local tests fail
* pr
* more models Signed-off-by: gkollu <[email protected]>
* cg Signed-off-by: gkollu <[email protected]>
* rebase main Signed-off-by: gkollu <[email protected]>
* fix old util to new Signed-off-by: gkollu <[email protected]>
* call .finalize() after model config Signed-off-by: gkollu <[email protected]>
* qwen Signed-off-by: gkollu <[email protected]>
* update Signed-off-by: gkollu <[email protected]>
* Fix lint errors Signed-off-by: Charlie Truong <[email protected]>
* Add docstring for num_floating_point_operations Signed-off-by: Charlie Truong <[email protected]>
* Fix format error Signed-off-by: Charlie Truong <[email protected]>
* fix formatting Signed-off-by: gkollu <[email protected]>
* fix import Signed-off-by: gkollu <[email protected]>
* add cache Signed-off-by: gkollu <[email protected]>
* remove cache Signed-off-by: gkollu <[email protected]>
* Update src/megatron/bridge/training/utils/flop_utils.py Signed-off-by: Ananth Subramaniam <[email protected]>
* fix broken links Signed-off-by: gkollu <[email protected]>
* fix docs Signed-off-by: gkollu <[email protected]>

---------

Signed-off-by: gkollu <[email protected]>
Signed-off-by: gautham-kollu <[email protected]>
Signed-off-by: Charlie Truong <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Co-authored-by: Charlie Truong <[email protected]>
Co-authored-by: Ananth Subramaniam <[email protected]>
1 parent cac49dc commit d42316e
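
For context on what the new utility reports: a "training flop achieved" log compares an analytic estimate of the model's FLOPs per step against the measured step time. The sketch below only illustrates that idea with the common 6 × parameters × tokens approximation for a dense GPT-style model plus an attention term; it is not the implementation in src/megatron/bridge/training/utils/flop_utils.py, and the helper names are hypothetical.

```python
# Hypothetical sketch (not the repository's flop_utils.py): estimate training FLOPs
# per step for a dense GPT-style transformer and convert a measured step time into
# "achieved" TFLOPs/s per GPU.

def flops_per_step(num_layers: int, hidden_size: int, seq_length: int,
                   vocab_size: int, global_batch_size: int) -> float:
    """Approximate forward+backward FLOPs for one optimizer step."""
    tokens = global_batch_size * seq_length
    # ~12*h^2 weights per layer (attention + 4h FFN) plus the embedding/output matrix.
    # Ignores GQA, MoE, gated MLPs, etc., which a per-model calculator must account for.
    params = num_layers * 12 * hidden_size**2 + vocab_size * hidden_size
    dense_flops = 6 * params * tokens  # 2 FLOPs/param/token forward, 4 backward
    # Attention score and context matmuls scale with sequence length squared.
    attn_flops = 12 * num_layers * global_batch_size * seq_length**2 * hidden_size
    return dense_flops + attn_flops


def achieved_tflops_per_gpu(step_flops: float, step_time_s: float, world_size: int) -> float:
    """What a 'training flop achieved' log line would report for one iteration."""
    return step_flops / (step_time_s * world_size * 1e12)


if __name__ == "__main__":
    f = flops_per_step(num_layers=32, hidden_size=4096, seq_length=4096,
                       vocab_size=128256, global_batch_size=128)
    print(f"estimated {f:.3e} FLOPs/step, "
          f"{achieved_tflops_per_gpu(f, step_time_s=10.0, world_size=64):.1f} TFLOPs/s/GPU")
```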

File tree: 3 files changed (+336, -92 lines)


docs/parallelisms.md

Lines changed: 4 additions & 4 deletions
@@ -15,7 +15,7 @@ Distributed Data Parallelism (DDP) keeps the model copies consistent by synchron

 ### Distributed Optimizer

-[Distributed optimizer](https://docs.nvidia.com/megatron-core/developer-guide/latest/distrib_optimizer.html) is a memory-optimized data-parallel deployment method. It shards the optimizer states and the high-precision master parameters across data-parallel GPUs instead of replicating them. At the parameter optimizer step, each data-parallel GPU updates its shard of parameters. Since each GPU needs its own gradient shard, the distributed optimizer conducts reduce-scatter of the parameter gradients instead of all-reduce of them. Then, the updated parameter shards are all-gathered across data-parallel GPUs. This approach significantly reduces the memory need of large-scale LLM training.
+[Distributed optimizer](https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/dist_optimizer.html) is a memory-optimized data-parallel deployment method. It shards the optimizer states and the high-precision master parameters across data-parallel GPUs instead of replicating them. At the parameter optimizer step, each data-parallel GPU updates its shard of parameters. Since each GPU needs its own gradient shard, the distributed optimizer conducts reduce-scatter of the parameter gradients instead of all-reduce of them. Then, the updated parameter shards are all-gathered across data-parallel GPUs. This approach significantly reduces the memory need of large-scale LLM training.

 ### Enable Data Parallelism

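The relinked paragraph above also describes the distributed optimizer's data flow: reduce-scatter the gradients so each data-parallel rank owns one shard, update only that shard, then all-gather the updated parameters. Below is a minimal sketch of that pattern using plain torch.distributed collectives; it assumes an initialized process group and flat buffers whose length is divisible by the data-parallel size, and it is an illustration, not Megatron Core's DistributedOptimizer.

```python
# Minimal illustration of the reduce-scatter / shard-update / all-gather flow described
# above, using raw torch.distributed collectives. Not Megatron Core's DistributedOptimizer.
# Assumes dist.init_process_group() has been called and both flat buffers have a length
# divisible by the data-parallel world size.
import torch
import torch.distributed as dist


def sharded_optimizer_step(flat_params: torch.Tensor, flat_grads: torch.Tensor, lr: float) -> None:
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    shard_size = flat_params.numel() // world_size

    # 1) Reduce-scatter: each rank receives only its shard of the summed gradients,
    #    instead of all-reducing the full gradient buffer.
    grad_shard = torch.empty(shard_size, dtype=flat_grads.dtype, device=flat_grads.device)
    dist.reduce_scatter_tensor(grad_shard, flat_grads)
    grad_shard.div_(world_size)  # average across data-parallel ranks

    # 2) Local update: only this rank's parameter shard (and, in a real optimizer,
    #    its optimizer state) is updated here. Plain SGD stands in for Adam.
    param_shard = flat_params.narrow(0, rank * shard_size, shard_size)
    param_shard.add_(grad_shard, alpha=-lr)

    # 3) All-gather: reassemble the full, updated parameter buffer on every rank.
    dist.all_gather_into_tensor(flat_params, param_shard.clone())
```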
@@ -82,7 +82,7 @@ config = ConfigContainer(

 #### Implement Tensor Parallelism

-Megatron Bridge integrates TP through the implementation from Megatron Core. For detailed API usage and additional configurations, consult the [Megatron Core Developer Guide](https://docs.nvidia.com/Megatron-Core/developer-guide/latest/api-guide/tensor_parallel.html).
+Megatron Bridge integrates TP through the implementation from Megatron Core. For detailed API usage and additional configurations, consult the [Megatron Core Developer Guide](https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/tensor_parallel.html).

 ### Pipeline Parallelism

@@ -127,7 +127,7 @@ For more insights into this approach, see the detailed blog: [Scaling Language M

 #### Implement Pipeline Parallelism

-The Megatron Bridge implementation of PP leverages functionalities from Megatron Core. For more detailed API usage and configurations related to PP, visit the [Megatron Core Developer Guide](https://docs.nvidia.com/Megatron-Core/developer-guide/latest/api-guide/tensor_parallel.html).
+The Megatron Bridge implementation of PP leverages functionalities from Megatron Core. For more detailed API usage and configurations related to PP, visit the [Megatron Core Developer Guide](https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/tensor_parallel.html).

 ### Expert Parallelism and Mixture of Experts (MoE)

@@ -375,7 +375,7 @@ For example, with 32 GPUs total and the configuration above:

 ## Resources

-- [Megatron Core Developer Guide](https://docs.nvidia.com/Megatron-Core/developer-guide/latest/)
+- [Megatron Core Developer Guide](https://docs.nvidia.com/megatron-core/developer-guide/latest/)
 - [Scaling Language Model Training](https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron/)
 - [Megatron-LM Repository](https://github.com/NVIDIA/Megatron-LM)
 - [Transformer Engine](https://github.com/NVIDIA/TransformerEngine)
