docs/parallelisms.md
4 additions & 4 deletions
@@ -15,7 +15,7 @@ Distributed Data Parallelism (DDP) keeps the model copies consistent by synchron

### Distributed Optimizer

-[Distributed optimizer](https://docs.nvidia.com/megatron-core/developer-guide/latest/distrib_optimizer.html) is a memory-optimized data-parallel deployment method. It shards the optimizer states and the high-precision master parameters across data-parallel GPUs instead of replicating them. At the parameter optimizer step, each data-parallel GPU updates its shard of parameters. Since each GPU needs its own gradient shard, the distributed optimizer conducts reduce-scatter of the parameter gradients instead of all-reduce of them. Then, the updated parameter shards are all-gathered across data-parallel GPUs. This approach significantly reduces the memory need of large-scale LLM training.
+[Distributed optimizer](https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/dist_optimizer.html) is a memory-optimized data-parallel deployment method. It shards the optimizer states and the high-precision master parameters across data-parallel GPUs instead of replicating them. At the parameter optimizer step, each data-parallel GPU updates its shard of parameters. Since each GPU needs its own gradient shard, the distributed optimizer conducts reduce-scatter of the parameter gradients instead of all-reduce of them. Then, the updated parameter shards are all-gathered across data-parallel GPUs. This approach significantly reduces the memory need of large-scale LLM training.

### Enable Data Parallelism
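To make the data flow described in that paragraph concrete, below is a minimal, illustrative sketch of one sharded optimizer step written against plain `torch.distributed` collectives. It is not Megatron Core's distributed optimizer: the flat `params`/`grads` buffers, the even-divisibility assumption, and the plain SGD update are simplifying assumptions for illustration, and the real implementation also shards optimizer states and FP32 master parameters.

```python
# Illustrative sketch only -- NOT Megatron Core's distributed optimizer.
# Assumes flat, contiguous param/grad buffers whose length is divisible by the
# data-parallel world size, and uses plain SGD in place of the real optimizer.
import torch
import torch.distributed as dist


def sharded_optimizer_step(params: torch.Tensor, grads: torch.Tensor, lr: float = 1e-4) -> None:
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    shard_size = params.numel() // world_size

    # 1) Reduce-scatter gradients: each data-parallel rank receives only the
    #    (summed) gradient shard it owns, instead of all-reducing the full buffer.
    grad_shard = torch.empty(shard_size, dtype=grads.dtype, device=grads.device)
    dist.reduce_scatter_tensor(grad_shard, grads, op=dist.ReduceOp.SUM)
    grad_shard /= world_size  # average across data-parallel ranks

    # 2) Each rank updates only its own parameter shard; optimizer states for
    #    this shard would likewise live only on this rank.
    param_shard = params[rank * shard_size:(rank + 1) * shard_size]
    updated_shard = param_shard - lr * grad_shard

    # 3) All-gather the updated shards so every rank holds the full parameter
    #    buffer for the next forward/backward pass.
    dist.all_gather_into_tensor(params, updated_shard)
```

In practice these collectives are typically overlapped with backward and forward compute; see the Megatron Core documentation linked above for the actual implementation and configuration options.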
@@ -82,7 +82,7 @@ config = ConfigContainer(

#### Implement Tensor Parallelism

-Megatron Bridge integrates TP through the implementation from Megatron Core. For detailed API usage and additional configurations, consult the [Megatron Core Developer Guide](https://docs.nvidia.com/Megatron-Core/developer-guide/latest/api-guide/tensor_parallel.html).
+Megatron Bridge integrates TP through the implementation from Megatron Core. For detailed API usage and additional configurations, consult the [Megatron Core Developer Guide](https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/tensor_parallel.html).

### Pipeline Parallelism
@@ -127,7 +127,7 @@ For more insights into this approach, see the detailed blog: [Scaling Language M

#### Implement Pipeline Parallelism

-The Megatron Bridge implementation of PP leverages functionalities from Megatron Core. For more detailed API usage and configurations related to PP, visit the [Megatron Core Developer Guide](https://docs.nvidia.com/Megatron-Core/developer-guide/latest/api-guide/tensor_parallel.html).
+The Megatron Bridge implementation of PP leverages functionalities from Megatron Core. For more detailed API usage and configurations related to PP, visit the [Megatron Core Developer Guide](https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/tensor_parallel.html).

### Expert Parallelism and Mixture of Experts (MoE)
@@ -375,7 +375,7 @@ For example, with 32 GPUs total and the configuration above: