27 changes: 24 additions & 3 deletions docs/neural-network/optimizers/adam.md
@@ -1,8 +1,29 @@
<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/Adam.php">[source]</a></span>
<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/Adam/Adam.php">[source]</a></span>

# Adam
Short for *Adaptive Moment Estimation*, the Adam Optimizer combines both Momentum and RMS properties. In addition to storing an exponentially decaying average of past squared gradients like [RMSprop](rms-prop.md), Adam also keeps an exponentially decaying average of past gradients, similar to [Momentum](momentum.md). Whereas Momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction.

## Mathematical formulation
Per step (element-wise), Adam maintains exponentially decaying moving averages of the gradient and its element-wise square and uses them to scale the update:

$$
\begin{aligned}
\mathbf{v}_t &= (1 - \beta_1)\,\mathbf{v}_{t-1} + \beta_1\,\mathbf{g}_t \\
\mathbf{n}_t &= (1 - \beta_2)\,\mathbf{n}_{t-1} + \beta_2\,\mathbf{g}_t^{2} \\
\Delta{\theta}_t &= \alpha\, \frac{\mathbf{v}_t}{\sqrt{\mathbf{n}_t} + \varepsilon}
\end{aligned}
$$

where:
- $t$ is the current step,
- $\alpha$ is the learning rate (`rate`),
- $\beta_1$ is the momentum decay (`momentumDecay`),
- $\beta_2$ is the norm decay (`normDecay`),
- $\mathbf{g}_t$ is the current gradient, and $\mathbf{g}_t^{2}$ denotes its element-wise square,
- $\varepsilon$ is a small constant for numerical stability (in the implementation, the denominator is clipped from below by `EPSILON`).

Note: This formulation follows the implementation in Rubix ML and does not include bias-correction terms.
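
For illustration, here is a minimal plain-PHP sketch of a single Adam step on a scalar parameter. It is a sketch only, not the library's NDArray-based implementation; the function name, the example values, and the `1e-8` epsilon are assumptions made for the example.

```php
/**
 * Illustrative scalar sketch of one Adam step (not the library implementation).
 */
function adamStep(
    float $gradient,
    float &$velocity,
    float &$norm,
    float $rate = 0.001,
    float $momentumDecay = 0.1,
    float $normDecay = 0.001
) : float {
    $epsilon = 1e-8; // stand-in for the library's EPSILON constant

    // Exponentially decaying averages of the gradient and its square
    $velocity = (1.0 - $momentumDecay) * $velocity + $momentumDecay * $gradient;
    $norm = (1.0 - $normDecay) * $norm + $normDecay * $gradient ** 2;

    // Scale the step, clipping the denominator from below by epsilon
    return $rate * $velocity / max(sqrt($norm), $epsilon);
}

$velocity = 0.0;
$norm = 0.0;

$step = adamStep(0.5, $velocity, $norm);
```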

## Parameters
| # | Name | Default | Type | Description |
|---|---|---|---|---|
@@ -12,10 +33,10 @@ Short for *Adaptive Moment Estimation*, the Adam Optimizer combines both Momentu

## Example
```php
use Rubix\ML\NeuralNet\Optimizers\Adam;
use Rubix\ML\NeuralNet\Optimizers\Adam\Adam;

$optimizer = new Adam(0.0001, 0.1, 0.001);
```

## References
[^1]: D. P. Kingma et al. (2014). Adam: A Method for Stochastic Optimization.
[^1]: D. P. Kingma et al. (2014). Adam: A Method for Stochastic Optimization.
26 changes: 23 additions & 3 deletions docs/neural-network/optimizers/cyclical.md
@@ -1,8 +1,28 @@
<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/Cyclical.php">[source]</a></span>
<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/Cyclical/Cyclical.php">[source]</a></span>

# Cyclical
The Cyclical optimizer uses a global learning rate that cycles between the lower and upper bound over a designated period while also decaying the upper bound by a factor at each step. Cyclical learning rates have been shown to help escape bad local minima and saddle points of the loss surface.

## Mathematical formulation
Per step (element-wise), the cyclical learning rate and update are computed as:

$$
\begin{aligned}
\text{cycle} &= \left\lfloor 1 + \frac{t}{2\,\text{steps}} \right\rfloor \\
x &= \left| \frac{t}{\text{steps}} - 2\,\text{cycle} + 1 \right| \\
\text{scale} &= \text{decay}^{\,t} \\
\eta_t &= \text{lower} + (\text{upper} - \text{lower})\,\max\bigl(0,\,1 - x\bigr)\,\text{scale} \\
\Delta\theta_t &= \eta_t\,g_t
\end{aligned}
$$

where:
- $t$ is the current step counter,
- $\text{steps}$ is the number of steps in every half cycle,
- $\text{lower}$ and $\text{upper}$ are the learning rate bounds,
- $\text{decay}$ is the multiplicative decay applied each step,
- $g_t$ is the current gradient.
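
For illustration, the schedule above can be sketched in plain PHP. This is a sketch only, not the library implementation; the function name and the parameter values, including the decay of 0.999, are assumptions for the example.

```php
/**
 * Illustrative sketch of the cyclical learning rate schedule (not the library implementation).
 */
function cyclicalRate(int $t, int $steps, float $lower, float $upper, float $decay) : float
{
    $cycle = floor(1 + $t / (2 * $steps));
    $x = abs($t / $steps - 2 * $cycle + 1);
    $scale = $decay ** $t;

    return $lower + ($upper - $lower) * max(0.0, 1.0 - $x) * $scale;
}

// The rate rises from lower to upper and falls back over 2 * steps steps,
// while the amplitude shrinks by the decay factor at every step.
foreach ([0, 500, 1000, 1500, 2000] as $t) {
    echo $t, ': ', cyclicalRate($t, 1000, 0.001, 0.005, 0.999), PHP_EOL;
}
```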

## Parameters
| # | Name | Default | Type | Description |
|---|---|---|---|---|
@@ -13,10 +33,10 @@ The Cyclical optimizer uses a global learning rate that cycles between the lower

## Example
```php
use Rubix\ML\NeuralNet\Optimizers\Cyclical;
use Rubix\ML\NeuralNet\Optimizers\Cyclical\Cyclical;

$optimizer = new Cyclical(0.001, 0.005, 1000);
```

## References
[^1]: L. N. Smith. (2017). Cyclical Learning Rates for Training Neural Networks.
[^1]: L. N. Smith. (2017). Cyclical Learning Rates for Training Neural Networks.
29 changes: 27 additions & 2 deletions docs/neural-network/optimizers/momentum.md
@@ -1,8 +1,33 @@
<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/Momentum.php">[source]</a></span>
<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/Momentum/Momentum.php">[source]</a></span>

# Momentum
Momentum accelerates each update step by accumulating velocity from past updates and adding a factor of the previous velocity to the current step. Momentum can help speed up training and escape bad local minima when compared with [Stochastic](stochastic.md) Gradient Descent.

## Mathematical formulation
Per step (element-wise), Momentum updates the velocity and applies it as the parameter step:

$$
\begin{aligned}
\beta &= 1 - \text{decay}, \quad \eta = \text{rate} \\
\text{Velocity update:}\quad v_t &= \beta\,v_{t-1} + \eta\,g_t \\
\text{Returned step:}\quad \Delta\theta_t &= v_t
\end{aligned}
$$

Nesterov lookahead (when `lookahead = true`) is approximated by applying the velocity update a second time:

$$
\begin{aligned}
v_t &\leftarrow \beta\,v_t + \eta\,g_t
\end{aligned}
$$

where:
- $g_t$ is the current gradient,
- $v_t$ is the velocity (accumulated update),
- $\beta$ is the momentum coefficient ($1 - \text{decay}$),
- $\eta$ is the learning rate ($\text{rate}$).
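
For illustration, here is a minimal scalar sketch of the update described above. It is not the library implementation; the function name and example values are assumptions, and the lookahead here is applied only to the returned step.

```php
/**
 * Illustrative scalar sketch of one Momentum step (not the library implementation).
 */
function momentumStep(float $gradient, float &$velocity, float $rate, float $decay, bool $lookahead = false) : float
{
    $beta = 1.0 - $decay;

    // Accumulate velocity from past updates
    $velocity = $beta * $velocity + $rate * $gradient;

    $step = $velocity;

    if ($lookahead) {
        // Nesterov-style lookahead: apply the velocity update a second time
        $step = $beta * $step + $rate * $gradient;
    }

    return $step;
}

$velocity = 0.0;

$step = momentumStep(0.5, $velocity, 0.01, 0.1, true);
```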

## Parameters
| # | Name | Default | Type | Description |
|---|---|---|---|---|
@@ -12,7 +37,7 @@ Momentum accelerates each update step by accumulating velocity from past updates

## Example
```php
use Rubix\ML\NeuralNet\Optimizers\Momentum;
use Rubix\ML\NeuralNet\Optimizers\Momentum\Momentum;

$optimizer = new Momentum(0.01, 0.1, true);
```
26 changes: 22 additions & 4 deletions docs/neural-network/optimizers/rms-prop.md
@@ -1,7 +1,25 @@
<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/RMSProp.php">[source]</a></span>
<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/RMSProp/RMSProp.php">[source]</a></span>

# RMS Prop
An adaptive gradient technique that divides the current gradient over a rolling window of the magnitudes of recent gradients. Unlike [AdaGrad](adagrad.md), RMS Prop does not suffer from an infinitely decaying step size.
An adaptive gradient technique that divides the current gradient over a rolling window of magnitudes of recent gradients. Unlike [AdaGrad](adagrad.md), RMS Prop does not suffer from an infinitely decaying step size.

## Mathematical formulation
Per step (element-wise), RMSProp maintains a running average of squared gradients and scales the step by the root-mean-square:

$$
\begin{aligned}
\rho &= 1 - \text{decay}, \quad \eta = \text{rate} \\
\text{Running average:}\quad v_t &= \rho\,v_{t-1} + (1 - \rho)\,g_t^{\,2} \\
\text{Returned step:}\quad \Delta\theta_t &= \frac{\eta\,g_t}{\max\bigl(\sqrt{v_t},\,\varepsilon\bigr)}
\end{aligned}
$$

where:
- $g_t$ is the current gradient,
- $v_t$ is the running average of squared gradients,
- $\rho$ is the averaging coefficient ($1 - \text{decay}$),
- $\eta$ is the learning rate ($\text{rate}$),
- $\varepsilon$ is a small constant to avoid division by zero (implemented by clipping $\sqrt{v_t}$ to $[\varepsilon, +\infty)$).
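
For illustration, a minimal scalar sketch of the update above follows. It is not the library's NDArray-based implementation; the function name, example values, and the `1e-8` epsilon are assumptions for the example.

```php
/**
 * Illustrative scalar sketch of one RMSProp step (not the library implementation).
 */
function rmsPropStep(float $gradient, float &$norm, float $rate, float $decay) : float
{
    $epsilon = 1e-8; // stand-in for the library's EPSILON constant

    $rho = 1.0 - $decay;

    // Running average of squared gradients
    $norm = $rho * $norm + (1.0 - $rho) * $gradient ** 2;

    // Scale the step by the root mean square, clipped away from zero
    return $rate * $gradient / max(sqrt($norm), $epsilon);
}

$norm = 0.0;

$step = rmsPropStep(0.5, $norm, 0.01, 0.1);
```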

## Parameters
| # | Name | Default | Type | Description |
@@ -11,10 +29,10 @@ An adaptive gradient technique that divides the current gradient over a rolling

## Example
```php
use Rubix\ML\NeuralNet\Optimizers\RMSProp;
use Rubix\ML\NeuralNet\Optimizers\RMSProp\RMSProp;

$optimizer = new RMSProp(0.01, 0.1);
```

## References
[^1]: T. Tieleman et al. (2012). Lecture 6e rmsprop: Divide the gradient by a running average of its recent magnitude.
[^1]: T. Tieleman et al. (2012). Lecture 6e rmsprop: Divide the gradient by a running average of its recent magnitude.
24 changes: 21 additions & 3 deletions docs/neural-network/optimizers/step-decay.md
@@ -1,8 +1,26 @@
<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/StepDecay.php">[source]</a></span>
<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/StepDecay/StepDecay.php">[source]</a></span>

# Step Decay
A learning rate decay optimizer that reduces the global learning rate by a factor whenever it reaches a new *floor*. The number of steps needed to reach a new floor is defined by the *steps* hyper-parameter.

## Mathematical formulation
Per step (element-wise), the Step Decay learning rate and update are:

$$
\begin{aligned}
\text{floor} &= \left\lfloor \frac{t}{k} \right\rfloor \\
\eta_t &= \frac{\eta_0}{1 + \text{floor}\cdot \lambda} \\
\Delta\theta_t &= \eta_t\,g_t
\end{aligned}
$$

where:
- $t$ is the current step number,
- $k$ is the number of steps per floor,
- $\eta_0$ is the initial learning rate ($\text{rate}$),
- $\lambda$ is the decay factor ($\text{decay}$),
- $g_t$ is the current gradient.
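
For illustration, the schedule above can be sketched in plain PHP. This is a sketch only, not the library implementation; the function name is an assumption, and the values mirror the example below.

```php
/**
 * Illustrative sketch of the Step Decay learning rate schedule (not the library implementation).
 */
function stepDecayRate(int $t, int $steps, float $rate, float $decay) : float
{
    $floor = floor($t / $steps);

    return $rate / (1.0 + $floor * $decay);
}

// The rate drops to rate / (1 + floor * decay) each time a new floor is reached.
foreach ([0, 49, 50, 500, 5000] as $t) {
    echo $t, ': ', stepDecayRate($t, 50, 0.1, 1e-3), PHP_EOL;
}
```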

## Parameters
| # | Name | Default | Type | Description |
|---|---|---|---|---|
@@ -12,7 +30,7 @@ A learning rate decay optimizer that reduces the global learning rate by a facto

## Example
```php
use Rubix\ML\NeuralNet\Optimizers\StepDecay;
use Rubix\ML\NeuralNet\Optimizers\StepDecay\StepDecay;

$optimizer = new StepDecay(0.1, 50, 1e-3);
```
```
14 changes: 14 additions & 0 deletions docs/neural-network/optimizers/stochastic.md
@@ -3,6 +3,20 @@
# Stochastic
A constant learning rate optimizer based on vanilla Stochastic Gradient Descent (SGD).

## Mathematical formulation
Per step (element-wise), the SGD update scales the gradient by a constant learning rate:

$$
\begin{aligned}
\eta &= \text{rate} \\
\Delta\theta_t &= \eta\,g_t
\end{aligned}
$$

where:
- $g_t$ is the current gradient,
- $\eta$ is the learning rate ($\text{rate}$).
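
For illustration, the update reduces to a single multiplication in plain PHP. This is a sketch only, not the library implementation; the function name and values are assumptions for the example.

```php
/**
 * Illustrative scalar sketch of one SGD step (not the library implementation).
 */
function sgdStep(float $gradient, float $rate) : float
{
    return $rate * $gradient;
}

$step = sgdStep(0.5, 0.01);
```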

## Parameters
| # | Name | Default | Type | Description |
|---|---|---|---|---|
90 changes: 90 additions & 0 deletions src/NeuralNet/Optimizers/AdaMax/AdaMax.php
@@ -0,0 +1,90 @@
<?php

namespace Rubix\ML\NeuralNet\Optimizers\AdaMax;

use NDArray;
use NumPower;
use Rubix\ML\NeuralNet\Optimizers\Adam\Adam;
use Rubix\ML\NeuralNet\Parameters\Parameter;

use const Rubix\ML\EPSILON;
use const PHP_FLOAT_MAX;

/**
* AdaMax
*
* A version of Adam that replaces the RMS property with the infinity norm of the gradients.
*
* References:
* [1] D. P. Kingma et al. (2014). Adam: A Method for Stochastic Optimization.
*
* @category Machine Learning
* @package Rubix/ML
* @author Andrew DalPino
* @author Samuel Akopyan <[email protected]>
*/
class AdaMax extends Adam
{
/**
* @param float $rate
* @param float $momentumDecay
* @param float $normDecay
*/
public function __construct(float $rate = 0.001, float $momentumDecay = 0.1, float $normDecay = 0.001)
{
parent::__construct($rate, $momentumDecay, $normDecay);
}

/**
* Take a step of gradient descent for a given parameter.
*
* AdaMax update (element-wise):
* v_t = v_{t-1} + momentumDecay · (g_t − v_{t-1})
* u_t = max((1 − normDecay) · u_{t-1}, |g_t|)
* Δθ_t = rate · v_t / max(u_t, ε)
*
* @internal
*
* @param Parameter $param
* @param NDArray $gradient
* @return NDArray
*/
public function step(Parameter $param, NDArray $gradient) : NDArray
{
[$velocity, $norm] = $this->cache[$param->id()];

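// Velocity update: v = v + momentumDecay * (g - v)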
$vHat = NumPower::multiply(
NumPower::subtract($gradient, $velocity),
$this->momentumDecay
);

$velocity = NumPower::add($velocity, $vHat);

// Infinity norm accumulator
$norm = NumPower::multiply($norm, 1.0 - $this->normDecay);
$absGrad = NumPower::abs($gradient);
$norm = NumPower::maximum($norm, $absGrad);

$this->cache[$param->id()] = [$velocity, $norm];

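// Clip the accumulated norm from below to avoid division by zero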
$norm = NumPower::clip($norm, EPSILON, PHP_FLOAT_MAX);

return NumPower::multiply(
NumPower::divide($velocity, $norm),
$this->rate
);
}

/**
* Return the string representation of the object.
*
* @internal
*
* @return string
*/
public function __toString() : string
{
return "AdaMax (rate: {$this->rate}, momentum decay: {$this->momentumDecay},"
. " norm decay: {$this->normDecay})";
}
}