27 changes: 24 additions & 3 deletions docs/neural-network/optimizers/adam.md
@@ -1,8 +1,29 @@
<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/Adam.php">[source]</a></span>
<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/Adam/Adam.php">[source]</a></span>

# Adam
Short for *Adaptive Moment Estimation*, the Adam Optimizer combines both Momentum and RMS properties. In addition to storing an exponentially decaying average of past squared gradients like [RMSprop](rms-prop.md), Adam also keeps an exponentially decaying average of past gradients, similar to [Momentum](momentum.md). Whereas Momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction.

## Mathematical formulation
Per step (element-wise), Adam maintains exponentially decaying moving averages of the gradient and its element-wise square and uses them to scale the update:

$$
\begin{aligned}
\mathbf{v}_t &= (1 - \beta_1)\,\mathbf{v}_{t-1} + \beta_1\,\mathbf{g}_t \\
\mathbf{n}_t &= (1 - \beta_2)\,\mathbf{n}_{t-1} + \beta_2\,\mathbf{g}_t^{2} \\
\Delta{\theta}_t &= \alpha\, \frac{\mathbf{v}_t}{\sqrt{\mathbf{n}_t} + \varepsilon}
\end{aligned}
$$

where:
- $t$ is the current step,
- $\alpha$ is the learning rate (`rate`),
- $\beta_1$ is the momentum decay (`momentumDecay`),
- $\beta_2$ is the norm decay (`normDecay`),
- $\mathbf{g}_t$ is the current gradient, and $\mathbf{g}_t^{2}$ denotes its element-wise square,
- $\varepsilon$ is a small constant for numerical stability (in the implementation, the denominator is clipped from below by `EPSILON`).

Note: This formulation follows the implementation in Rubix ML and does not include bias-correction terms.
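
For illustration, here is a minimal plain-PHP sketch of a single Adam step on a scalar parameter. It is a sketch only, not the library's NDArray-based implementation; the function name, the example values, and the `1e-8` epsilon are assumptions made for the example.

```php
/**
 * Illustrative scalar sketch of one Adam step (not the library implementation).
 */
function adamStep(
    float $gradient,
    float &$velocity,
    float &$norm,
    float $rate = 0.001,
    float $momentumDecay = 0.1,
    float $normDecay = 0.001
) : float {
    $epsilon = 1e-8; // stand-in for the library's EPSILON constant

    // Exponentially decaying averages of the gradient and its square
    $velocity = (1.0 - $momentumDecay) * $velocity + $momentumDecay * $gradient;
    $norm = (1.0 - $normDecay) * $norm + $normDecay * $gradient ** 2;

    // Scale the step, clipping the denominator from below by epsilon
    return $rate * $velocity / max(sqrt($norm), $epsilon);
}

$velocity = 0.0;
$norm = 0.0;

$step = adamStep(0.5, $velocity, $norm);
```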

## Parameters
| # | Name | Default | Type | Description |
|---|---|---|---|---|
@@ -12,10 +33,10 @@ Short for *Adaptive Moment Estimation*, the Adam Optimizer combines both Momentu

## Example
```php
use Rubix\ML\NeuralNet\Optimizers\Adam;
use Rubix\ML\NeuralNet\Optimizers\Adam\Adam;

$optimizer = new Adam(0.0001, 0.1, 0.001);
```

## References
[^1]: D. P. Kingma et al. (2014). Adam: A Method for Stochastic Optimization.
[^1]: D. P. Kingma et al. (2014). Adam: A Method for Stochastic Optimization.
26 changes: 23 additions & 3 deletions docs/neural-network/optimizers/cyclical.md
@@ -1,8 +1,28 @@
<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/Cyclical.php">[source]</a></span>
<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/Cyclical/Cyclical.php">[source]</a></span>

# Cyclical
The Cyclical optimizer uses a global learning rate that cycles between the lower and upper bound over a designated period while also decaying the upper bound by a factor at each step. Cyclical learning rates have been shown to help escape bad local minima and saddle points of the loss surface.

## Mathematical formulation
Per step (element-wise), the cyclical learning rate and update are computed as:

$$
\begin{aligned}
\text{cycle} &= \left\lfloor 1 + \frac{t}{2\,\text{steps}} \right\rfloor \\
x &= \left| \frac{t}{\text{steps}} - 2\,\text{cycle} + 1 \right| \\
\text{scale} &= \text{decay}^{\,t} \\
\eta_t &= \text{lower} + (\text{upper} - \text{lower})\,\max\bigl(0,\,1 - x\bigr)\,\text{scale} \\
\Delta\theta_t &= \eta_t\,g_t
\end{aligned}
$$

where:
- $t$ is the current step counter,
- $\text{steps}$ is the number of steps in every half cycle,
- $\text{lower}$ and $\text{upper}$ are the learning rate bounds,
- $\text{decay}$ is the multiplicative decay applied each step,
- $g_t$ is the current gradient.
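
For illustration, the schedule above can be sketched in plain PHP. This is a sketch only, not the library implementation; the function name and the parameter values, including the decay of 0.999, are assumptions for the example.

```php
/**
 * Illustrative sketch of the cyclical learning rate schedule (not the library implementation).
 */
function cyclicalRate(int $t, int $steps, float $lower, float $upper, float $decay) : float
{
    $cycle = floor(1 + $t / (2 * $steps));
    $x = abs($t / $steps - 2 * $cycle + 1);
    $scale = $decay ** $t;

    return $lower + ($upper - $lower) * max(0.0, 1.0 - $x) * $scale;
}

// The rate rises from lower to upper and falls back over 2 * steps steps,
// while the amplitude shrinks by the decay factor at every step.
foreach ([0, 500, 1000, 1500, 2000] as $t) {
    echo $t, ': ', cyclicalRate($t, 1000, 0.001, 0.005, 0.999), PHP_EOL;
}
```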

## Parameters
| # | Name | Default | Type | Description |
|---|---|---|---|---|
@@ -13,10 +33,10 @@ The Cyclical optimizer uses a global learning rate that cycles between the lower

## Example
```php
use Rubix\ML\NeuralNet\Optimizers\Cyclical;
use Rubix\ML\NeuralNet\Optimizers\Cyclical\Cyclical;

$optimizer = new Cyclical(0.001, 0.005, 1000);
```

## References
[^1]: L. N. Smith. (2017). Cyclical Learning Rates for Training Neural Networks.
[^1]: L. N. Smith. (2017). Cyclical Learning Rates for Training Neural Networks.
29 changes: 27 additions & 2 deletions docs/neural-network/optimizers/momentum.md
@@ -1,8 +1,33 @@
<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/Momentum.php">[source]</a></span>
<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/Momentum/Momentum.php">[source]</a></span>

# Momentum
Momentum accelerates each update step by accumulating velocity from past updates and adding a factor of the previous velocity to the current step. Momentum can help speed up training and escape bad local minima when compared with [Stochastic](stochastic.md) Gradient Descent.

## Mathematical formulation
Per step (element-wise), Momentum updates the velocity and applies it as the parameter step:

$$
\begin{aligned}
\beta &= 1 - \text{decay}, \quad \eta = \text{rate} \\
\text{Velocity update:}\quad v_t &= \beta\,v_{t-1} + \eta\,g_t \\
\text{Returned step:}\quad \Delta\theta_t &= v_t
\end{aligned}
$$

Nesterov lookahead (when `lookahead = true`) is approximated by applying the velocity update a second time:

$$
\begin{aligned}
v_t &\leftarrow \beta\,v_t + \eta\,g_t
\end{aligned}
$$

where:
- $g_t$ is the current gradient,
- $v_t$ is the velocity (accumulated update),
- $\beta$ is the momentum coefficient ($1 - \text{decay}$),
- $\eta$ is the learning rate ($\text{rate}$).
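
For illustration, here is a minimal scalar sketch of the update described above. It is not the library implementation; the function name and example values are assumptions, and the lookahead here is applied only to the returned step.

```php
/**
 * Illustrative scalar sketch of one Momentum step (not the library implementation).
 */
function momentumStep(float $gradient, float &$velocity, float $rate, float $decay, bool $lookahead = false) : float
{
    $beta = 1.0 - $decay;

    // Accumulate velocity from past updates
    $velocity = $beta * $velocity + $rate * $gradient;

    $step = $velocity;

    if ($lookahead) {
        // Nesterov-style lookahead: apply the velocity update a second time
        $step = $beta * $step + $rate * $gradient;
    }

    return $step;
}

$velocity = 0.0;

$step = momentumStep(0.5, $velocity, 0.01, 0.1, true);
```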

## Parameters
| # | Name | Default | Type | Description |
|---|---|---|---|---|
@@ -12,7 +37,7 @@ Momentum accelerates each update step by accumulating velocity from past updates

## Example
```php
use Rubix\ML\NeuralNet\Optimizers\Momentum;
use Rubix\ML\NeuralNet\Optimizers\Momentum\Momentum;

$optimizer = new Momentum(0.01, 0.1, true);
```
26 changes: 22 additions & 4 deletions docs/neural-network/optimizers/rms-prop.md
@@ -1,7 +1,25 @@
<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/RMSProp.php">[source]</a></span>
<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/RMSProp/RMSProp.php">[source]</a></span>

# RMS Prop
An adaptive gradient technique that divides the current gradient over a rolling window of the magnitudes of recent gradients. Unlike [AdaGrad](adagrad.md), RMS Prop does not suffer from an infinitely decaying step size.
An adaptive gradient technique that divides the current gradient over a rolling window of magnitudes of recent gradients. Unlike [AdaGrad](adagrad.md), RMS Prop does not suffer from an infinitely decaying step size.

## Mathematical formulation
Per step (element-wise), RMSProp maintains a running average of squared gradients and scales the step by the root-mean-square:

$$
\begin{aligned}
\rho &= 1 - \text{decay}, \quad \eta = \text{rate} \\
\text{Running average:}\quad v_t &= \rho\,v_{t-1} + (1 - \rho)\,g_t^{\,2} \\
\text{Returned step:}\quad \Delta\theta_t &= \frac{\eta\,g_t}{\max\bigl(\sqrt{v_t},\,\varepsilon\bigr)}
\end{aligned}
$$

where:
- $g_t$ is the current gradient,
- $v_t$ is the running average of squared gradients,
- $\rho$ is the averaging coefficient ($1 - \text{decay}$),
- $\eta$ is the learning rate ($\text{rate}$),
- $\varepsilon$ is a small constant to avoid division by zero (implemented by clipping $\sqrt{v_t}$ to $[\varepsilon, +\infty)$).
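
For illustration, a minimal scalar sketch of the update above follows. It is not the library's NDArray-based implementation; the function name, example values, and the `1e-8` epsilon are assumptions for the example.

```php
/**
 * Illustrative scalar sketch of one RMSProp step (not the library implementation).
 */
function rmsPropStep(float $gradient, float &$norm, float $rate, float $decay) : float
{
    $epsilon = 1e-8; // stand-in for the library's EPSILON constant

    $rho = 1.0 - $decay;

    // Running average of squared gradients
    $norm = $rho * $norm + (1.0 - $rho) * $gradient ** 2;

    // Scale the step by the root mean square, clipped away from zero
    return $rate * $gradient / max(sqrt($norm), $epsilon);
}

$norm = 0.0;

$step = rmsPropStep(0.5, $norm, 0.01, 0.1);
```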

## Parameters
| # | Name | Default | Type | Description |
@@ -11,10 +29,10 @@ An adaptive gradient technique that divides the current gradient over a rolling

## Example
```php
use Rubix\ML\NeuralNet\Optimizers\RMSProp;
use Rubix\ML\NeuralNet\Optimizers\RMSProp\RMSProp;

$optimizer = new RMSProp(0.01, 0.1);
```

## References
[^1]: T. Tieleman et al. (2012). Lecture 6e rmsprop: Divide the gradient by a running average of its recent magnitude.
[^1]: T. Tieleman et al. (2012). Lecture 6e rmsprop: Divide the gradient by a running average of its recent magnitude.
24 changes: 21 additions & 3 deletions docs/neural-network/optimizers/step-decay.md
@@ -1,8 +1,26 @@
<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/StepDecay.php">[source]</a></span>
<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/StepDecay/StepDecay.php">[source]</a></span>

# Step Decay
A learning rate decay optimizer that reduces the global learning rate by a factor whenever it reaches a new *floor*. The number of steps needed to reach a new floor is defined by the *steps* hyper-parameter.

## Mathematical formulation
Per step (element-wise), the Step Decay learning rate and update are:

$$
\begin{aligned}
\text{floor} &= \left\lfloor \frac{t}{k} \right\rfloor \\
\eta_t &= \frac{\eta_0}{1 + \text{floor}\cdot \lambda} \\
\Delta\theta_t &= \eta_t\,g_t
\end{aligned}
$$

where:
- $t$ is the current step number,
- $k$ is the number of steps per floor,
- $\eta_0$ is the initial learning rate ($\text{rate}$),
- $\lambda$ is the decay factor ($\text{decay}$),
- $g_t$ is the current gradient.
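
For illustration, the schedule above can be sketched in plain PHP. This is a sketch only, not the library implementation; the function name is an assumption, and the values mirror the example below.

```php
/**
 * Illustrative sketch of the Step Decay learning rate schedule (not the library implementation).
 */
function stepDecayRate(int $t, int $steps, float $rate, float $decay) : float
{
    $floor = floor($t / $steps);

    return $rate / (1.0 + $floor * $decay);
}

// The rate drops to rate / (1 + floor * decay) each time a new floor is reached.
foreach ([0, 49, 50, 500, 5000] as $t) {
    echo $t, ': ', stepDecayRate($t, 50, 0.1, 1e-3), PHP_EOL;
}
```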

## Parameters
| # | Name | Default | Type | Description |
|---|---|---|---|---|
@@ -12,7 +30,7 @@ A learning rate decay optimizer that reduces the global learning rate by a facto

## Example
```php
use Rubix\ML\NeuralNet\Optimizers\StepDecay;
use Rubix\ML\NeuralNet\Optimizers\StepDecay\StepDecay;

$optimizer = new StepDecay(0.1, 50, 1e-3);
```
```
14 changes: 14 additions & 0 deletions docs/neural-network/optimizers/stochastic.md
@@ -3,6 +3,20 @@
# Stochastic
A constant learning rate optimizer based on vanilla Stochastic Gradient Descent (SGD).

## Mathematical formulation
Per step (element-wise), the SGD update scales the gradient by a constant learning rate:

$$
\begin{aligned}
\eta &= \text{rate} \\
\Delta\theta_t &= \eta\,g_t
\end{aligned}
$$

where:
- $g_t$ is the current gradient,
- $\eta$ is the learning rate ($\text{rate}$).
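
For illustration, the update reduces to a single multiplication in plain PHP. This is a sketch only, not the library implementation; the function name and values are assumptions for the example.

```php
/**
 * Illustrative scalar sketch of one SGD step (not the library implementation).
 */
function sgdStep(float $gradient, float $rate) : float
{
    return $rate * $gradient;
}

$step = sgdStep(0.5, 0.01);
```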

## Parameters
| # | Name | Default | Type | Description |
|---|---|---|---|---|
90 changes: 90 additions & 0 deletions src/NeuralNet/Optimizers/AdaMax/AdaMax.php
@@ -0,0 +1,90 @@
<?php

namespace Rubix\ML\NeuralNet\Optimizers\AdaMax;

use NDArray;
use NumPower;
use Rubix\ML\NeuralNet\Optimizers\Adam\Adam;
use Rubix\ML\NeuralNet\Parameters\Parameter;

use const Rubix\ML\EPSILON;
use const PHP_FLOAT_MAX;

/**
* AdaMax
*
* A version of Adam that replaces the RMS property with the infinity norm of the gradients.
*
* References:
* [1] D. P. Kingma et al. (2014). Adam: A Method for Stochastic Optimization.
*
* @category Machine Learning
* @package Rubix/ML
* @author Andrew DalPino
* @author Samuel Akopyan <[email protected]>
*/
class AdaMax extends Adam
{
/**
* @param float $rate
* @param float $momentumDecay
* @param float $normDecay
*/
public function __construct(float $rate = 0.001, float $momentumDecay = 0.1, float $normDecay = 0.001)
{
parent::__construct($rate, $momentumDecay, $normDecay);
}

/**
* Take a step of gradient descent for a given parameter.
*
* AdaMax update (element-wise):
* v_t = v_{t-1} + momentumDecay · (g_t − v_{t-1})
* u_t = max((1 − normDecay) · u_{t-1}, |g_t|)
* Δθ_t = rate · v_t / max(u_t, ε)
*
* @internal
*
* @param Parameter $param
* @param NDArray $gradient
* @return NDArray
*/
public function step(Parameter $param, NDArray $gradient) : NDArray
{
[$velocity, $norm] = $this->cache[$param->id()];

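// Velocity update: v = v + momentumDecay * (g - v)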
$vHat = NumPower::multiply(
NumPower::subtract($gradient, $velocity),
$this->momentumDecay
);

$velocity = NumPower::add($velocity, $vHat);

// Infinity norm accumulator
$norm = NumPower::multiply($norm, 1.0 - $this->normDecay);
$absGrad = NumPower::abs($gradient);
$norm = NumPower::maximum($norm, $absGrad);

$this->cache[$param->id()] = [$velocity, $norm];

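// Clip the accumulated norm from below to avoid division by zero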
$norm = NumPower::clip($norm, EPSILON, PHP_FLOAT_MAX);

return NumPower::multiply(
NumPower::divide($velocity, $norm),
$this->rate
);
}

/**
* Return the string representation of the object.
*
* @internal
*
* @return string
*/
public function __toString() : string
{
return "AdaMax (rate: {$this->rate}, momentum decay: {$this->momentumDecay},"
. " norm decay: {$this->normDecay})";
}
}