
showmyth commented Oct 8, 2025

Description

The AdamW optimizer is a variant of Adam that improves the generalization of deep learning models by decoupling weight decay from the gradient update. Standard Adam folds L2 regularization into the gradient, which interacts poorly with its adaptive per-parameter scaling; AdamW instead applies weight decay directly to the parameters after the gradient-based update.
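
A compact way to see the difference, following the notation of Loshchilov & Hutter (2019), with parameters $\theta$, learning rate $\eta$, weight decay coefficient $\lambda$, and bias-corrected moment estimates $\hat{m}_t$, $\hat{v}_t$: Adam with L2 regularization folds the decay into the gradient,

$$g_t = \nabla f_t(\theta_{t-1}) + \lambda\,\theta_{t-1},$$

whereas AdamW leaves the gradient untouched and subtracts the decay term directly from the parameters after the adaptive step,

$$\theta_t = \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\,\theta_{t-1} \right).$$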

Necessity

Decoupled weight decay leads to more stable, better-regularized, and more effective training, especially for large models such as transformers. AdamW is also used far more widely in practice than the original Adam optimizer.

Features

  • AdamW optimizer class with configurable hyperparameters (learning_rate, b1, b2, weight_decay); a rough interface sketch follows this list
  • Decoupled weight decay application
  • Bias correction with time step tracking
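
Below is a minimal sketch of what such a class could look like. It assumes an `update(w, grad_wrt_w)`-style method similar to the library's existing optimizers; the names, defaults, and the `eps` parameter are illustrative and not taken from the actual PR code.

```python
import numpy as np


class AdamW():
    """Adam with decoupled weight decay (illustrative sketch, not the PR implementation)."""

    def __init__(self, learning_rate=0.001, b1=0.9, b2=0.999, weight_decay=0.01, eps=1e-8):
        self.learning_rate = learning_rate
        self.b1 = b1                  # decay rate for the first moment estimate
        self.b2 = b2                  # decay rate for the second moment estimate
        self.weight_decay = weight_decay
        self.eps = eps
        self.m = None                 # first moment (running mean of gradients)
        self.v = None                 # second moment (running mean of squared gradients)
        self.t = 0                    # time step, tracked for bias correction

    def update(self, w, grad_wrt_w):
        # Lazily create moment buffers with the same shape as the gradient
        if self.m is None:
            self.m = np.zeros(np.shape(grad_wrt_w))
            self.v = np.zeros(np.shape(grad_wrt_w))

        self.t += 1

        # Update biased first and second moment estimates
        self.m = self.b1 * self.m + (1 - self.b1) * grad_wrt_w
        self.v = self.b2 * self.v + (1 - self.b2) * np.power(grad_wrt_w, 2)

        # Bias correction using the tracked time step
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)

        # Gradient-based step, identical to Adam
        w_updt = self.learning_rate * m_hat / (np.sqrt(v_hat) + self.eps)

        # Decoupled weight decay: applied to the parameters, not mixed into the gradient
        return w - w_updt - self.learning_rate * self.weight_decay * w
```

Because the optimizer keeps per-parameter state (`m`, `v`, `t`), each parameter tensor would need its own instance (or a copy) of the optimizer when wired into a layer.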

References

Designed to be compatible with the mlfromscratch optimization API.
