Web Simulation

Gradient Descent Visualizer I

This note provides an interactive, visual simulation of the gradient descent optimization algorithm with momentum. It helps you build intuition about how gradient descent navigates different optimization landscapes to find the minima of functions.

The simulation visualizes gradient descent on six optimization functions: Sphere, Rosenbrock, Rastrigin, Saddle Point, Bi-modal Gaussian, and Tri-modal Gaussian. Each function presents a different challenge, from simple convex landscapes to complex non-convex surfaces with multiple local minima and saddle points. The Bi-modal and Tri-modal Gaussian functions are particularly useful for demonstrating how the starting position affects which minimum the algorithm converges to.

The visualization shows a 2D contour plot and a 3D surface plot side by side at the top, with controls below. When you run the descent, a red path traces the optimization trajectory, following the negative gradient direction. A yellow marker shows the initial starting point and updates in real time as you adjust the Start X and Start Y parameters. The trajectory is overlaid on both plots, letting you watch the algorithm navigate the landscape in real time.

You can adjust the learning rate (step size) and the momentum (default 0.9), which helps the algorithm escape flat regions and local minima; set the starting position; control the variable range used for visualization; and set the maximum number of iterations. This interactive exploration helps you understand how these parameters affect the convergence behavior of gradient descent. The plots automatically clear previous trajectories when you change any parameter, ensuring a clean visualization for each experiment.

NOTE: This visualization uses analytical gradients (exact derivatives) for accurate and efficient computation. The functions are implemented with their mathematical formulas and gradient expressions. The plots use a square aspect ratio (1:1) and display without tick marks, labels, or legends, keeping the visualization focused on the optimization landscape.
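As a sketch of what such analytical gradients look like, here are the four closed-form functions with their exact partial derivatives, written in plain Python (the function names are illustrative, not the visualizer's actual code; the two Gaussian-based functions are omitted because their exact widths and depths are not specified here):

```python
import math

# Each function returns (f(x, y), grad f(x, y)).
# The gradients are exact partial derivatives of the formulas.

def sphere(x, y):
    # f(x,y) = x^2 + y^2, grad = (2x, 2y)
    return x**2 + y**2, (2*x, 2*y)

def rosenbrock(x, y):
    # f(x,y) = (1-x)^2 + 100(y - x^2)^2
    f = (1 - x)**2 + 100 * (y - x**2)**2
    dfdx = -2 * (1 - x) - 400 * x * (y - x**2)
    dfdy = 200 * (y - x**2)
    return f, (dfdx, dfdy)

def rastrigin(x, y):
    # f(x,y) = 20 + (x^2 - 10 cos(2 pi x)) + (y^2 - 10 cos(2 pi y))
    f = (20 + (x**2 - 10 * math.cos(2 * math.pi * x))
            + (y**2 - 10 * math.cos(2 * math.pi * y)))
    dfdx = 2*x + 20 * math.pi * math.sin(2 * math.pi * x)
    dfdy = 2*y + 20 * math.pi * math.sin(2 * math.pi * y)
    return f, (dfdx, dfdy)

def saddle(x, y):
    # f(x,y) = x^2 - y^2, grad = (2x, -2y)
    return x**2 - y**2, (2*x, -2*y)
```

Using exact derivatives like these avoids the extra function evaluations and truncation error that finite-difference approximations would introduce.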

 

Parameters

The following are short descriptions of each parameter:
  • Function: Selects the optimization function to visualize. Six functions are available:
    • Sphere: f(x,y) = x² + y². A simple convex function with a single global minimum at (0,0).
    • Rosenbrock: f(x,y) = (1-x)² + 100(y-x²)². A classic test function with a narrow curved valley. The global minimum is at (1,1).
    • Rastrigin: f(x,y) = 20 + (x² - 10cos(2πx)) + (y² - 10cos(2πy)). A highly multimodal function with many local minima. The global minimum is at (0,0).
    • Saddle Point: f(x,y) = x² - y². A function with a saddle point at (0,0), demonstrating how gradient descent can get stuck at non-optimal points.
    • Bi-modal Gaussian: A function with two local minima created by two Gaussian valleys. Demonstrates how starting position affects which minimum the algorithm finds.
    • Tri-modal Gaussian: A function with three local minima (one global and two local) created by three Gaussian valleys. The deepest valley is at (-2, 0), with medium and shallow valleys at (2, 2) and (2, -2) respectively. Perfect for testing initialization strategies.
  • Learning Rate (α): Controls the step size for each gradient descent update. Larger values (closer to 1.0) take bigger steps, which can lead to faster convergence but may overshoot or oscillate. Smaller values (0.001-0.1) are more stable but slower. Default is 0.1.
  • Momentum: Adds momentum to the gradient descent updates, helping the algorithm escape flat regions and local minima. The momentum value (0 to 0.99) determines how much of the previous velocity is retained. Higher values (0.9-0.99) maintain more velocity, which is especially helpful for functions like Rosenbrock and helps the algorithm converge faster. Default is 0.9. The algorithm uses velocity-based updates: v = momentum * v_old - learningRate * gradient, then position += v.
  • Max Iterations: Maximum number of gradient descent steps to perform. The algorithm will stop after this many iterations even if it hasn't converged. Default is 100.
  • Start X, Start Y: Initial position (x₀, y₀) for the gradient descent algorithm. The starting point significantly affects the convergence path and whether the algorithm finds the global minimum. You can adjust these sliders to explore different starting positions. A yellow marker on both plots shows the current starting position in real-time as you adjust these values. The range automatically adjusts based on the Variable Range parameter.
  • Variable Range (vr): Controls the visualization range for both x and y axes. The plots display the function over the range [-vr, vr] for both variables. This parameter also controls the range of the Start X and Start Y sliders. Adjusting this parameter updates the plots and clamps the starting position to the new range. Default is 3.0.
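Putting those parameters together, the velocity-based update rule quoted above (v = momentum * v_old - learningRate * gradient, then position += v) can be sketched as a plain Python loop. The names here are illustrative, not the visualizer's actual code, and `grad_f` stands for any of the analytical gradient functions:

```python
def gradient_descent(grad_f, start_x, start_y,
                     learning_rate=0.1, momentum=0.9, max_iterations=100):
    """Momentum gradient descent; returns the trajectory of visited points."""
    x, y = start_x, start_y
    vx = vy = 0.0                      # velocity starts at rest
    path = [(x, y)]
    for _ in range(max_iterations):
        gx, gy = grad_f(x, y)
        # v = momentum * v_old - learningRate * gradient
        vx = momentum * vx - learning_rate * gx
        vy = momentum * vy - learning_rate * gy
        x, y = x + vx, y + vy          # position += v
        path.append((x, y))
    return path

# Example: descend the sphere f(x,y) = x^2 + y^2 from (2, -1.5).
sphere_grad = lambda x, y: (2*x, 2*y)
path = gradient_descent(sphere_grad, 2.0, -1.5)
```

With the defaults (learning rate 0.1, momentum 0.9), the iterate spirals toward the minimum in a damped oscillation rather than walking straight down the gradient, which is exactly the overshoot-and-settle behavior you can watch in the red trajectory.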

Buttons

The following are short descriptions of each button:
  • Run Descent: Starts the gradient descent algorithm from the current starting position. The red marker will move along the optimization path, updating in real-time. The button changes to "Stop" while running, allowing you to pause the algorithm at any time. The trajectory is overlaid on both the contour and 3D surface plots.
  • Reset: Clears the current trajectory and resets the visualization. The plots return to showing only the function surface without the optimization path. The iteration counter and current position display are also reset.

Visualization Features

  • Layout: The two plots (contour and 3D surface) are displayed side by side at the top in a square format (1:1 aspect ratio). The control panel with all parameters is located below the plots for easy access.
  • Contour Plot: A 2D top-down view showing contour lines (level sets) of the function. The color gradient represents function values. The red path shows the gradient descent trajectory, and a yellow marker indicates the initial starting point.
  • 3D Surface Plot: A three-dimensional visualization of the function surface. You can rotate, zoom, and pan the view to explore the landscape. The red path shows the optimization trajectory in 3D space, and a yellow marker shows the starting point on the surface.
  • Real-time Updates: During descent, the iteration count, current position (x, y), and current function value are displayed in real-time. The plots update periodically to show the progress. The initial starting point marker updates immediately as you adjust Start X and Start Y sliders.
  • Clean Visualization: The plots are designed for clarity - no tick marks, tick labels, color bars, legends, or titles. The focus is entirely on the function landscape and optimization trajectory.
  • Automatic Trajectory Clearing: When you change any parameter (function, learning rate, momentum, max iterations, start position, or variable range) or click any button, any previous trajectory path is automatically cleared, ensuring a clean visualization for each experiment.
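A static approximation of the contour view can be reproduced offline. The following matplotlib sketch (names and styling are illustrative, not the visualizer's code, and it uses plain gradient descent without momentum for brevity) draws the contour plot with a red trajectory, a yellow start marker, a 1:1 aspect ratio, and no ticks or labels:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Sphere function on a [-vr, vr] grid (Variable Range vr = 3.0).
vr = 3.0
xs = np.linspace(-vr, vr, 200)
X, Y = np.meshgrid(xs, xs)
Z = X**2 + Y**2

# Toy trajectory: plain gradient descent from (2.5, -2.0), step 0.1.
path = [(2.5, -2.0)]
for _ in range(50):
    x, y = path[-1]
    path.append((x - 0.1 * 2*x, y - 0.1 * 2*y))
px, py = zip(*path)

fig, ax = plt.subplots(figsize=(5, 5))
ax.contour(X, Y, Z, levels=20)                  # level sets of f
ax.plot(px, py, color="red")                    # optimization path
ax.plot(px[0], py[0], "o", color="yellow")      # starting point
ax.set_aspect("equal")                          # square 1:1 aspect ratio
ax.set_xticks([]); ax.set_yticks([])            # clean: no ticks or labels
fig.savefig("contour_descent.png")
```

The interactive version adds the side-by-side 3D surface, live parameter controls, and real-time trajectory updates on top of this basic layout.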