Web Simulation

 

 

 

1-Variable Gradient Descent

This interactive tutorial helps you build intuition for how step size (learning rate η) affects gradient descent: convergence, oscillation, or divergence. The simulation uses a single variable x and updates it with x ← x − η·f'(x) at each step.

You can choose from six functions, each illustrating a different challenge: a smooth bowl (ideal convex), local minima (the trap), high-frequency ripples, a flat plateau (vanishing gradient), steep walls (exploding gradient), and a non-differentiable V-shape (endless bouncing). Adjust the learning rate (logarithmic scale 0.001–1.5), optionally enable Momentum or an Auto RL method (AdaGrad, RMSprop, Adam), then use Step Fwd, Step Bwd, or Run to run the descent. The main canvas shows the function curve, the tangent at the current x, the trajectory, and an update arrow (green if loss decreased, red if overshot). Click on the plot to set a new starting x and reset the path. The Theory and Parameters sections below spell out the update rules and all controls.

 

Theory

Plain gradient descent

We minimize f(x) by repeatedly moving opposite to the gradient. The learning rate η controls step size — too small → slow convergence; too large → oscillation or divergence.

x ← x − η·f'(x)
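
As a minimal sketch of this update rule (not the simulator's own code), the Python below applies x ← x − η·f'(x) to the Bowl, f(x) = x². For this function each step multiplies x by (1 − 2η), so the iterates shrink monotonically for η < 0.5, alternate in sign while shrinking for 0.5 < η < 1, and grow without bound for η > 1. The function names and the starting point x0 = −3 are illustrative assumptions.

```python
def gradient_descent(df, x0, eta, steps):
    """Apply x <- x - eta * f'(x) for a fixed number of steps; return all iterates."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - eta * df(xs[-1]))
    return xs


def bowl_grad(x):
    """Derivative of the Bowl, f(x) = x**2."""
    return 2.0 * x


# Three learning rates, one per regime: smooth, oscillating, diverging.
for eta in (0.1, 0.9, 1.1):
    xs = gradient_descent(bowl_grad, x0=-3.0, eta=eta, steps=20)
    print(f"eta={eta}: first iterates {[round(x, 3) for x in xs[:4]]}, final |x| = {abs(xs[-1]):.4g}")
```
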
Momentum (Polyak)

A velocity term smooths updates and can escape shallow local minima:

v ← μv − η·f'(x),   x ← x + v

The coefficient μ ∈ [0, 1] damps past velocity. μ = 0 reduces to plain GD; μ close to 1 carries more history and can help on "ripply" landscapes.
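
The momentum update can be sketched in a few lines of Python. This is an illustration under assumed names (`momentum_descent`, `bowl_grad`), with the velocity initialised to zero, which the text does not state explicitly.

```python
def momentum_descent(df, x0, eta, mu, steps):
    """Polyak momentum: v <- mu * v - eta * f'(x), then x <- x + v."""
    x, v = x0, 0.0  # assumption: velocity starts at zero
    xs = [x]
    for _ in range(steps):
        v = mu * v - eta * df(x)
        x = x + v
        xs.append(x)
    return xs


def bowl_grad(x):
    """Derivative of f(x) = x**2."""
    return 2.0 * x


plain = momentum_descent(bowl_grad, x0=-3.0, eta=0.1, mu=0.0, steps=5)
heavy = momentum_descent(bowl_grad, x0=-3.0, eta=0.1, mu=0.9, steps=5)
print("mu = 0.0:", [round(x, 3) for x in plain])
print("mu = 0.9:", [round(x, 3) for x in heavy])
```

With μ = 0 the velocity is just the plain gradient step, so the first list matches plain GD. With μ = 0.9 the second step is already larger than the first because the previous velocity is carried over.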

Adaptive learning rate (Auto RL)

These methods scale the effective step size using past gradient information. In many dimensions each parameter gets its own step size; here there is a single variable, so the scaling depends only on the history of f'. When an adaptive method is selected, its parameters appear as sliders; Momentum is disabled. All controls apply in real time (even during Run).

Method, update rule, and notes:

  • AdaGrad. Update rule: G ← G + (f')²,  x ← x − η·f' / (√G + ε). Notes: Large past gradients shrink the step. ε avoids division by zero (typ. 1e−8).
  • RMSprop. Update rule: v ← βv + (1−β)(f')²,  x ← x − η·f' / (√v + ε). Notes: β ∈ (0,1) (often 0.9) sets the memory. Reduces AdaGrad's aggressive shrinking over long runs.
  • Adam. Update rule: m ← β₁m + (1−β₁)f',  v ← β₂v + (1−β₂)(f')²,  x ← x − η·m̂ / (√v̂ + ε), where m̂ = m / (1 − β₁ᵗ) and v̂ = v / (1 − β₂ᵗ) are the bias-corrected moments at step t. Notes: Momentum-like first moment + RMSprop-like second moment, with bias correction. Good default.
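
The three adaptive rules can be sketched as single-step functions in Python. This is a hedged illustration, not the simulator's implementation: the function names, the `state` dictionary, and the default hyperparameters (β = 0.9, β₁ = 0.9, β₂ = 0.999, ε = 1e−8) are assumptions based on commonly used values, and accumulators are assumed to start at zero.

```python
import math


def adagrad_step(x, g, state, eta, eps=1e-8):
    """G <- G + g^2, then x <- x - eta * g / (sqrt(G) + eps)."""
    state["G"] = state.get("G", 0.0) + g * g
    return x - eta * g / (math.sqrt(state["G"]) + eps)


def rmsprop_step(x, g, state, eta, beta=0.9, eps=1e-8):
    """v <- beta * v + (1 - beta) * g^2, then x <- x - eta * g / (sqrt(v) + eps)."""
    state["v"] = beta * state.get("v", 0.0) + (1 - beta) * g * g
    return x - eta * g / (math.sqrt(state["v"]) + eps)


def adam_step(x, g, state, eta, beta1=0.9, beta2=0.999, eps=1e-8):
    """First and second moments with bias correction at step t."""
    t = state["t"] = state.get("t", 0) + 1
    m = state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * g
    v = state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    return x - eta * m_hat / (math.sqrt(v_hat) + eps)


def bowl_grad(x):
    """Derivative of f(x) = x**2."""
    return 2.0 * x


for name, step in (("AdaGrad", adagrad_step), ("RMSprop", rmsprop_step), ("Adam", adam_step)):
    x, state = -3.0, {}  # a fresh state dict plays the role of cleared accumulators
    for _ in range(50):
        x = step(x, bowl_grad(x), state, eta=0.1)
    print(f"{name}: x after 50 steps = {x:.4f}")
```

On the very first step AdaGrad and Adam both move by almost exactly η regardless of the gradient's magnitude, because the gradient is divided by its own absolute value; only its sign survives.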

Convergence & divergence: the run stops when |f'(x)| < 0.001 (converged). If x or f(x) becomes non-finite or |x| > 50, the simulation reports "Diverged" and halts. Reset clears the trajectory and any adaptive/momentum state.
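
The stopping rules can be sketched as a small run loop. The thresholds (|f'(x)| < 0.001, |x| > 50, non-finite values) come from the description above; the function name `run`, the returned tuple, and the `max_steps` safety cap are assumptions added so the sketch always terminates.

```python
import math


def run(f, df, x0, eta, max_steps=10_000, tol=1e-3, x_limit=50.0):
    """Plain GD with the convergence and divergence checks described above.

    Returns (status, x, steps_taken). max_steps is a hypothetical safety cap.
    """
    x = x0
    for step in range(max_steps):
        fx, g = f(x), df(x)
        # Divergence: non-finite x or f(x), or x outside the allowed band.
        if not (math.isfinite(x) and math.isfinite(fx)) or abs(x) > x_limit:
            return "diverged", x, step
        # Convergence: gradient magnitude below the tolerance.
        if abs(g) < tol:
            return "converged", x, step
        x = x - eta * g
    return "max_steps", x, max_steps


def bowl(x):
    return x * x


def bowl_grad(x):
    return 2.0 * x


print(run(bowl, bowl_grad, x0=-3.0, eta=0.1))
print(run(bowl, bowl_grad, x0=-3.0, eta=1.1))
```
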

 

Simulation

The interactive simulator is below. Use the controls to explore the concepts described above.

Controls

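Displayed defaults: learning rate η = 0.1000, initial x = −3.00, x min = −4.0, x max = 4.0, momentum μ = 0.90.
Live readouts: x, Loss, Gradient f'(x), and a hint on which way the slope suggests moving (GD goes opposite to the gradient).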

 

Parameters

  • Function: Selects the loss landscape. Each option includes a short description below the dropdown. The six functions are:
    • The Bowl (Ideal Convex): f(x) = x². Smooth convergence; steps shrink as the slope decreases.
    • The Trap (Local Minima): f(x) = x⁴ − 2x² + 0.5x. Try starting at x = −2 vs x = 2 with a small η: x = 2 settles in the shallower local minimum near x ≈ 0.93, while x = −2 reaches the global minimum near x ≈ −1.06.
    • The Ripples (Noise): f(x) = x² + sin(5x). Hard to converge; the agent gets stuck in local bumps.
    • The Plateau (Vanishing Gradient): Flattening edges; slopes near zero. Learning stalls unless η is large.
    • The Cliff (Exploding Gradient): Very steep walls. A standard η causes massive overshooting.
    • The Sharp V (Non-Differentiable): f(x) = |x|. Slope is ±1; the algorithm never settles and bounces around the minimum.
  • Learning rate (η): Step size for each update. Range 0.001–1.5 with a logarithmic slider for finer control. Large η can overshoot or diverge; small η converges slowly.
  • Initial x: Starting position on the x-axis. Use the slider or click on the main canvas to set it. Reset restores this value and clears the trajectory.
  • x min / x max: Horizontal range of the main plot (each adjustable from −8 to 8). Initial x and click-to-set are clamped to this range.
  • Auto Scale (plot): When ON, the main plot’s y-axis includes the trajectory and current point; when OFF (default), y-range is from the function curve over [x min, x max] only.
  • Momentum: When ON, updates use Polyak momentum: v ← μv − η·f'(x), then x ← x + v. μ (momentum coefficient, 0–1) is set by the slider; the slider is enabled only when Momentum is ON.
  • Auto RL: Adaptive learning-rate method (default None). Options: AdaGrad, RMSprop, Adam. When one is selected, its parameters appear as sliders and Momentum is disabled. All apply in real time (including during Run). Slider ranges: ε (1e−10–1e−4, log scale); β (RMSprop, 0.8–0.999); β₁, β₂ (Adam, β₁ 0.8–0.999, β₂ 0.99–0.9999). Changing the algorithm clears its internal accumulators; changing only parameter values does not.
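
The four functions whose formulas are stated in the Function list can be written with analytic derivatives as below. This is a sketch under assumptions: the dictionary keys and the `descend` helper are hypothetical names, the slope of |x| at exactly x = 0 is taken as 0, and the Plateau and the Cliff are omitted because their formulas are not given. The example runs the Trap from both suggested starting points with a small η of 0.01.

```python
import math

# name -> (f, f') for the functions with stated formulas
FUNCTIONS = {
    "bowl": (lambda x: x**2, lambda x: 2 * x),
    "trap": (lambda x: x**4 - 2 * x**2 + 0.5 * x, lambda x: 4 * x**3 - 4 * x + 0.5),
    "ripples": (lambda x: x**2 + math.sin(5 * x), lambda x: 2 * x + 5 * math.cos(5 * x)),
    # Slope of |x| is +1 or -1; using 0 at exactly x = 0 is an assumption.
    "sharp_v": (abs, lambda x: math.copysign(1.0, x) if x != 0 else 0.0),
}


def descend(name, x0, eta=0.01, steps=2000):
    """Plain GD on a named function; returns the final x."""
    _, df = FUNCTIONS[name]
    x = x0
    for _ in range(steps):
        x -= eta * df(x)
    return x


f_trap = FUNCTIONS["trap"][0]
for x0 in (-2.0, 2.0):
    x_end = descend("trap", x0)
    print(f"start {x0:+.1f} -> x = {x_end:.4f}, loss = {f_trap(x_end):.4f}")
```

The two runs end in different minima, and the one started at x = 2 finishes with the higher loss. With a larger η the first step from x = 2 can jump across the central hump, so the outcome depends on the learning rate as well as the starting point.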

Buttons

  • Step Fwd: Performs one gradient-descent update (forward). Stops Run if active.
  • Step Bwd: Undoes the last step, reverting to the previous x and loss. Clears momentum and Auto RL accumulators, so the next Step Fwd starts from that point with fresh optimizer state. Disabled at the initial point.
  • Run / Stop: Toggle button. Run starts automatic updates until convergence or you click Stop.
  • Reset: Stops Run, restores the initial condition (position from Initial x slider, trajectory cleared), clears momentum and Auto RL accumulators, and redraws.
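
One way to model the Step Fwd, Step Bwd, and Reset behaviour described above is a history stack of past positions. The `Stepper` class below is a hypothetical sketch, not the simulator's code; it uses the momentum update and clears the velocity on Step Bwd and Reset, as the button descriptions state.

```python
class Stepper:
    """Hypothetical model of the step buttons for momentum GD."""

    def __init__(self, df, x0, eta, mu=0.0):
        self.df, self.x0, self.eta, self.mu = df, x0, eta, mu
        self.reset()

    def reset(self):
        """Restore the initial x and clear the trajectory and velocity."""
        self.x = self.x0
        self.v = 0.0
        self.history = []  # positions before each forward step

    def step_fwd(self):
        """One update: v <- mu * v - eta * f'(x), then x <- x + v."""
        self.history.append(self.x)
        self.v = self.mu * self.v - self.eta * self.df(self.x)
        self.x += self.v

    def step_bwd(self):
        """Undo the last step; returns False at the initial point."""
        if not self.history:
            return False
        self.x = self.history.pop()
        self.v = 0.0  # accumulators are cleared, as described above
        return True


s = Stepper(lambda x: 2.0 * x, x0=-3.0, eta=0.1, mu=0.9)
s.step_fwd()
s.step_fwd()
s.step_bwd()
print(s.x, s.v)
```

Because the velocity is cleared, stepping backward and then forward again does not necessarily reproduce the step that was undone.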

Visualization

  • Main canvas: Function curve (blue), trajectory (purple; Trajectory dot = points per step, Trajectory line = dashed line; both checkboxes can be on), tangent line at current x (orange, dashed), update arrow on the x-axis (green = loss decreased, red = overshot), and a ball at the current (x, f(x)).
  • Loss plot: Loss vs iteration. Y-axis auto-scaled.
  • Metrics: Sidebar panel showing current x, loss f(x), and gradient f'(x).
  • Insight box: Explains which way the slope suggests moving and highlights when the last step overshot (loss increased).
  • Click-to-set: Click on the main canvas to set a new starting x and reset the path.

Limitations

  • One dimension. The descent acts on a single variable x, so the adaptive methods reduce to scaling one scalar gradient. In real training each method maintains a separate accumulator per weight across many dimensions — behaviour that a 1D view cannot fully show.
  • Deterministic, exact gradients. f'(x) is computed analytically and noise-free; there is no mini-batch / stochastic gradient noise, which is precisely what RMSprop and Adam were designed to tame.
  • Toy loss curves. The six functions are illustrative shapes (bowl, ripples, cliff, V), not surrogates for any real loss surface, and several have closed-form minima.
  • Fixed base learning rate. η does not decay or warm up over a run; learning-rate schedules and warm-restart strategies are out of scope.
  • Non-differentiable handling is approximate. For the Sharp V (|x|) the slope is taken as ±1; true subgradient methods and proximal operators are not implemented, so the agent simply bounces.
  • Hard divergence cutoff. The run is declared "Diverged" at |x| > 50 or non-finite values; this is a display safeguard, not a rigorous stability criterion.
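
The bouncing on the Sharp V can be reproduced with a short Python sketch. The names are illustrative, and taking the slope as −1 at exactly x = 0 is an assumption. Because the slope magnitude is always 1, every step has length η, so once the iterate is within η of the minimum it can only hop from one side to the other.

```python
def sharp_v_grad(x):
    """Slope of |x| taken as +1 or -1 (assumed -1 at exactly x = 0)."""
    return 1.0 if x > 0 else -1.0


def bounce(x0, eta, steps):
    """Plain GD on f(x) = |x| with a fixed learning rate."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - eta * sharp_v_grad(xs[-1]))
    return xs


xs = bounce(x0=0.25, eta=0.1, steps=12)
print([round(x, 3) for x in xs])
```

With these values the iterates walk down to about 0.05 and then alternate between roughly +0.05 and −0.05. A smaller fixed η narrows the band but does not remove the bouncing.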