
Conversation

@ebsmothers
Contributor

This PR adds support for full bfloat16 training. In SFT it is pretty common to store everything in bfloat16 to save memory, with select tensors (logits, RoPE buffers, and activations) maintained in a higher precision to preserve numerical accuracy. Separately, I think having this supported more generally would be useful for faster iteration -- e.g. it allows me to run Llama3 70B on a single node of H100s, which is otherwise not possible with the default config.
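To make the setup concrete, a minimal sketch of this pattern (illustrative only; the model and shapes are stand-ins, not the actual PR code):

import torch
import torch.nn as nn

# Create parameters and buffers in bf16 by switching the default dtype
# during model construction, then restore the previous default.
prev_dtype = torch.get_default_dtype()
torch.set_default_dtype(torch.bfloat16)
try:
    model = nn.Linear(512, 1024)  # stand-in for the real model
finally:
    torch.set_default_dtype(prev_dtype)

x = torch.randn(8, 512, dtype=torch.bfloat16)
logits = model(x)                            # bf16 weights and activations
loss = logits.float().logsumexp(-1).mean()   # keep logit/loss math in fp32
loss.backward()                              # gradients come out in bf16, like the params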

Assuming this is generally useful, would like feedback on:

  1. Acceptable loss convergence: over the first 100 steps on Llama3 8B, full bf16 training loss goes from 12.25 -> 8, as opposed to 12.25 -> 7 with fp32 training. Is this a concern? (As mentioned, for SFT this is less of an issue; happy to validate that statement if that's helpful.)
  2. Interaction with mixed precision training -- where is the right place to validate that these are not both set at once?
  3. Where to put the set_default_dtype API


  h = self.norm(h) if self.norm else h
- output = self.output(h) if self.output else h
+ output = self.output(h).float() if self.output else h
Contributor

If we set the training dtype during training initialization, why not also do the output conversion in the trainer (train loop)?
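A rough sketch of that suggestion, i.e. upcasting only at the loss inside the train loop rather than in the model's forward (names here are illustrative, not the actual torchtitan trainer):

def train_step(model, loss_fn, inputs, labels):
    # Model forward stays in its native (bf16) dtype end to end...
    logits = model(inputs)
    # ...and the upcast to fp32 happens only where the loss is computed.
    loss = loss_fn(logits.float(), labels)
    loss.backward()
    return loss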


Contributor Author

Thanks, just removed

In the case of full bf16 training, RoPE calculations and logits will still be in fp32.
"""

mixed_precision_param: Literal["bfloat16", "float32"] = "bfloat16"
Contributor

What if mixed_precision_param is float32 but dtype is bfloat16? Should there be a check?

Contributor Author

Yeah agreed. Do we want to do this somewhere in train.py? Lmk if you think there's a better place

Contributor

mixed_precision_param comes from FSDP2. I think if FSDP2 can work with that combination, it's the user's responsibility to configure them properly.

Contributor

We also make it work with DDP/single device: #1303. I think at the very least a warning is required.

Contributor Author

Sounds good. In that case I will leave this as is

Contributor

@fegin
autocast is not well supported in torchtitan anyway. I'm not sure it is still maintained; see other issues like #1525.

But sure, having a warning sounds good.
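A possible shape for that warning, as a hedged sketch; the real config field names and where the check lives in torchtitan may differ:

import logging

logger = logging.getLogger(__name__)

def warn_on_dtype_mismatch(training_dtype: str, mixed_precision_param: str) -> None:
    # Hypothetical helper: flag the combination discussed above instead of erroring out.
    if training_dtype == "bfloat16" and mixed_precision_param == "float32":
        logger.warning(
            "Model is initialized in bfloat16 but mixed_precision_param is float32; "
            "double-check that this precision combination is intended."
        )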

@fegin
Contributor

fegin commented Sep 16, 2025

One last thing in my mind is that the set_default_dtype() definition should be moved to torchtitan/model.

@tianyu-l
Contributor

@fegin

> One last thing in my mind is that the set_default_dtype() definition should be moved to torchtitan/model

IIUC it is a general context manager, very similar to with torch.device(). Curious why you think it should be moved to model folder?

@tianyu-l left a comment
Contributor

I'm a bit surprised that PyTorch doesn't provide such a context manager natively.
https://discuss.pytorch.org/t/context-manager-for-dtype-and-device/73827/3

LGTM.
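For reference, a context manager of this kind can be written in a few lines on top of torch.get_default_dtype/torch.set_default_dtype; this is a generic sketch, not necessarily identical to the helper added in this PR:

import contextlib
import torch

@contextlib.contextmanager
def set_default_dtype(dtype: torch.dtype):
    # Temporarily change the default floating-point dtype, restoring it on exit,
    # analogous in spirit to `with torch.device(...)`.
    prev = torch.get_default_dtype()
    torch.set_default_dtype(dtype)
    try:
        yield
    finally:
        torch.set_default_dtype(prev)

# Parameters created inside the block default to bf16.
with set_default_dtype(torch.bfloat16):
    layer = torch.nn.Linear(16, 16)
assert layer.weight.dtype == torch.bfloat16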

@fegin
Contributor

fegin commented Sep 16, 2025

> IIUC it is a general context manager, very similar to with torch.device(). Curious why you think it should be moved to model folder?

Unlike torch.device(), which allocates the same model on different devices, this context changes the entire model -- the meaning of the model is technically different. That's my motivation. But I don't have a strong opinion on this; it is also reasonable to put it under the tool folder.

@tianyu-l
Contributor

Oh I see what you mean. I think the function itself can be used for more than model definition, so I'd still prefer it being in a util folder. Maybe let's merge it as is if you don't have a strong opinion.

@tianyu-l tianyu-l merged commit e99e16c into pytorch:main Sep 16, 2025
7 checks passed
@hann-wang
Contributor

hann-wang commented Sep 18, 2025

I don't think keeping optimizer states in BF16 is a good idea.

Generally speaking, keeping optimizer states in BF16 will degrade the final performance. Megatron-LM supports only FP16 and FP32 for optimizer states. (FP16 requires a separate scaling factor)

Here's a comparison between FP32/BF16 optimizer states on Megatron-LM:
[image: FP32 vs. BF16 optimizer-state comparison on Megatron-LM]

@fegin
Contributor

fegin commented Sep 18, 2025

@ebsmothers any thoughts on this?

@samsja
Contributor

samsja commented Sep 19, 2025

> I don't think keeping optimizer states in BF16 is a good idea.
>
> Generally speaking, keeping optimizer states in BF16 will degrade the final performance. Megatron-LM supports only FP16 and FP32 for optimizer states. (FP16 requires a separate scaling factor)
>
> Here's a comparison between FP32/BF16 optimizer states on Megatron-LM: [image: FP32 vs. BF16 optimizer-state comparison]

I also don't think that doing pure bf16 training makes sense, even for SFT. If the goal is to reduce the memory footprint of the optimizer, I think adam8bit is a better tradeoff at low GPU counts, and with many GPUs FSDP should make the optimizer state quite small on each GPU.
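For illustration, the kind of drop-in that comment has in mind (bitsandbytes is a separate package, not something torchtitan ships; treat this purely as a sketch of the tradeoff):

import torch
import bitsandbytes as bnb  # external dependency, shown only for illustration

model = torch.nn.Linear(4096, 4096, dtype=torch.bfloat16, device="cuda")

# 8-bit Adam keeps the optimizer moments in block-quantized 8-bit buffers and
# dequantizes them for the update, instead of storing full bf16/fp32 state.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4)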

@tianyu-l
Contributor

@joecummings sounds like we should revert this PR, as doing bf16 everywhere does not seem to be the right way to save memory. Wdyt?

@ebsmothers
Contributor Author

Sorry just getting caught up here. My two cents: pure bf16 should not preclude using optimizers like 8-bit Adam. In my mind it is still generally useful (and fairly standard: see e.g. Lightning’s true bf16 precision setting) to store model weights and gradients in bfloat16 without an extra higher-precision copy. Is 8-bit Adam currently supported by Titan? My impression was that it isn’t, but lmk if that’s mistaken.

tianyu-l pushed a commit that referenced this pull request Sep 30, 2025
Unfortunately I went out on leave after opening #1646, so I never actually finished it off by enabling bf16 training in the forge experiment, which is what we ultimately wanted (thanks to @joecummings and @tianyu-l for pushing it through).

I also see there was some discussion on the original PR, which I belatedly responded to. If there are still concerns there, let me know. Otherwise, if we are not gonna revert that PR, we should at least land that one so that forge can reap the benefits as intended.