Stochastic Gradient Descent (SGD) Tutorial 

Stochastic Gradient Descent (SGD) is a variation of the gradient descent optimization algorithm where instead of computing the gradient using the entire dataset, we compute it using a random subset (mini-batch) of data points. This introduces noise into the gradient estimate but offers significant computational advantages.

The Key Insight: Trading Accuracy for Speed

Consider a dataset with millions of data points. In Batch Gradient Descent, you must compute the gradient contribution from every single point before taking ONE step. This is:

  • Computationally expensive: Each step requires processing ALL data
  • Memory intensive: Must hold entire dataset in memory
  • Slow to converge: You take fewer, more accurate steps

In SGD, you randomly sample a small batch (e.g., 5-32 points) and compute a "noisy" gradient estimate. This gradient points roughly toward the minimum, but with some random deviation.

The Mathematics

For linear regression y = mx + b, the Mean Squared Error (MSE) loss is:

MSE = (1/N) Σ_{i=1..N} (y_pred,i − y_true,i)²

Batch Gradient Descent computes gradients over ALL N points:

∂MSE/∂m = (1/N) Σ_{i=1..N} (error_i · x_i),  where error_i = (m·x_i + b) − y_i

Stochastic Gradient Descent computes gradients over a mini-batch of size B:

∂MSE/∂m ≈ (1/B) Σ_{i∈batch} (error_i · x_i)

The update rule remains the same:

m ← m − η · ∂MSE/∂m
b ← b − η · ∂MSE/∂b
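The mini-batch gradient and update rule above can be sketched in Python with NumPy (an illustrative reconstruction, not the simulation's actual code; the data, seed, and function names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data around the true line y = 2x + 1 (assumed for illustration)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=50)

def sgd_step(m, b, x, y, eta, batch_size, rng):
    # Sample a random mini-batch of B points without replacement
    idx = rng.choice(len(x), size=batch_size, replace=False)
    xb, yb = x[idx], y[idx]
    error = (m * xb + b) - yb        # error_i = pred_i - y_i
    grad_m = np.mean(error * xb)     # (1/B) Σ error_i · x_i
    grad_b = np.mean(error)          # (1/B) Σ error_i
    return m - eta * grad_m, b - eta * grad_b   # the update rule

m, b = 0.1, 0.0
for _ in range(2000):
    m, b = sgd_step(m, b, x, y, eta=0.01, batch_size=5, rng=rng)
```

After a few thousand noisy steps the parameters should land near the true (m, b) despite each individual gradient being computed from only five points.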

Why Does SGD Work?

The magic of SGD lies in the Law of Large Numbers: even though each individual gradient estimate is noisy, the average direction over many steps still points toward the minimum. The noise actually helps in some ways:

  • Escape local minima: Random fluctuations can help jump out of shallow local minima
  • Regularization effect: The noise prevents overfitting to the training data
  • Faster initial progress: Many noisy steps > few precise steps early in training
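The Law of Large Numbers claim is easy to check numerically: averaging many single-point gradient estimates recovers the full-batch gradient (a sketch with assumed synthetic data):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=200)

m, b = 0.5, 0.0
error = (m * x + b) - y

# Exact full-batch gradient of MSE with respect to m
full_grad_m = np.mean(error * x)

# Average of 10,000 size-1 (pure SGD) gradient estimates
samples = rng.choice(len(x), size=10000, replace=True)
avg_grad_m = np.mean(error[samples] * x[samples])
```

Each size-1 estimate is wildly noisy, yet their average sits right on top of the full-batch gradient, which is why the noisy steps still drift toward the minimum.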

The Batch Size Spectrum

Batch Size | Name           | Gradient Noise | Memory       | Steps/Epoch
1          | Pure SGD       | Maximum        | Minimal      | N steps
8-64       | Mini-batch SGD | Moderate       | Small        | N/B steps
N (full)   | Batch GD       | Zero           | Full dataset | 1 step
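The "Steps/Epoch" column follows from how an epoch is usually organized: shuffle the data once, then walk through it in batches of size B, giving N/B updates per pass (a sketch of the standard pattern, not the simulation's internals):

```python
import numpy as np

N, B = 50, 5
rng = np.random.default_rng(2)
order = rng.permutation(N)          # shuffle indices once per epoch

steps = 0
for start in range(0, N, B):
    batch_idx = order[start:start + B]   # one mini-batch of B points
    # ... compute the gradient on batch_idx and update m, b here ...
    steps += 1                           # one parameter update per batch
```

With N = 50 and B = 5 this yields 10 updates per epoch; with B = 1 it would be 50, and with B = N just one.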

Visualization: Two Views

This simulation shows SGD from two perspectives:

  • Data Space (Left Canvas): Shows the data points and current regression line. The orange highlighted points are the current mini-batch being used for gradient computation. Watch how different random batches cause the line to "wobble" as it converges.
  • Loss Landscape (Right Canvas): Shows the MSE as a heatmap where X-axis is slope (m) and Y-axis is intercept (b). Darker colors = lower loss. The red path traces the SGD trajectory — notice the characteristic zig-zag pattern caused by noisy gradients!
[Interactive simulation widget: "Data Space (y = mx + b)" and "Loss Landscape (MSE Heatmap)" canvases, with a legend (Mini-batch Points, Other Data Points, Current Fit, SGD Path, Optimal Point), live readouts (Step, m, b, MSE, Optimal m*, Optimal b*), and a Step Details panel. Click Step for single-step mode with details, or Start for continuous training.]

Interactive Controls

  • Start/Stop: Toggle continuous SGD training.
  • Step Fwd ▶: Execute a single SGD step with detailed breakdown shown in the Step Details panel.
  • ◀ Step Back: Rewind to the previous step to review the calculation again.
  • Reset: Reset parameters to initial values while keeping the same data.
  • New Data: Generate a fresh random dataset.
  • Learning Rate (η): Controls step size. Too high → overshooting. Too low → slow convergence.
  • Batch Size: The number of points used per gradient computation:
    • 1: Pure SGD — maximum noise, zig-zag path, but computationally cheap
    • 5-15: Mini-batch — good balance of noise and stability
    • 50 (all): Full Batch GD — smooth path, no noise, but slow
  • Speed: Animation speed (higher = faster).

Understanding the Step Details Panel

When using Step Fwd ▶ mode, the panel shows detailed calculations:

  • Mini-Batch Selection: Which random points were chosen (shown as orange highlights on the canvas).
  • Gradient Calculation: Per-point breakdown showing:
    • Prediction: pred = m·x + b
    • Error: error = pred − y
    • Gradient contributions: error×x for ∂m, error for ∂b
  • Parameter Update (m_new, b_new): Step-by-step calculation:
    • m_new = m − η × ∂MSE/∂m (with intermediate values)
    • b_new = b − η × ∂MSE/∂b (with intermediate values)
  • Regression Line Equation: The updated line y = m_new·x + b_new.
  • Convergence Status: Current MSE and distance to optimal (least squares) solution.
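The panel's arithmetic can be reproduced in a few lines (the numbers below are made-up illustrations, not the simulation's data):

```python
m, b, eta = 0.5, 1.0, 0.01
batch = [(2.0, 3.0), (4.0, 8.0)]   # (x, y) pairs in the mini-batch

grad_m = grad_b = 0.0
for x, y in batch:
    pred = m * x + b        # Prediction: pred = m·x + b
    error = pred - y        # Error: error = pred − y
    grad_m += error * x     # contribution to ∂MSE/∂m
    grad_b += error         # contribution to ∂MSE/∂b
grad_m /= len(batch)        # average over the batch
grad_b /= len(batch)

m_new = m - eta * grad_m    # m_new = m − η × ∂MSE/∂m
b_new = b - eta * grad_b    # b_new = b − η × ∂MSE/∂b
```

For these two points the errors are −1 and −5, so the averaged gradients are −11 and −3, and both parameters move up slightly: m_new = 0.61, b_new = 1.03.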

Observing the Stochastic Nature

The key visual insight is in the Loss Landscape (right canvas):

  • With Batch Size = 1: The path zig-zags wildly! Each random point pulls the parameters in a different direction.
  • With Batch Size = 50 (full batch): The path is smooth and direct — like classical gradient descent.
  • With Batch Size = 5-10: A good middle ground — some noise helps escape local minima, but progress is steady.
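The zig-zag-versus-smooth behavior can be quantified: the spread of the mini-batch gradient estimate shrinks as batch size grows, and vanishes at full batch (a sketch with assumed synthetic data):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=50)
m, b = 0.5, 0.0
error = (m * x + b) - y

def grad_std(batch_size, trials=2000):
    # Standard deviation of the mini-batch estimate of ∂MSE/∂m
    estimates = []
    for _ in range(trials):
        idx = rng.choice(50, size=batch_size, replace=False)
        estimates.append(np.mean(error[idx] * x[idx]))
    return np.std(estimates)

noise_1 = grad_std(1)    # pure SGD: maximum spread
noise_5 = grad_std(5)    # mini-batch: moderate spread
noise_50 = grad_std(50)  # full batch: every draw is identical
```

The spread falls roughly like 1/√B, which is exactly why the path on the loss landscape straightens out as the batch size slider increases.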

Why SGD is Used in Deep Learning

Modern neural networks train on millions or billions of data points. Computing the exact gradient would require:

  • Forward pass through ALL data
  • Backward pass through ALL data
  • Only THEN can you take ONE step!

With SGD (batch size 32-256), you take thousands of noisy but cheap steps instead. The noise is actually beneficial:

  • Generalization: Prevents memorizing training data
  • Exploration: Helps find flatter, more robust minima
  • Speed: More parameter updates per second

SGD Variants in Practice

Algorithm      | Key Idea                                        | Benefit
SGD + Momentum | Accumulate velocity from past gradients         | Smoother path, faster convergence
Adam           | Adaptive learning rates per parameter           | Less tuning needed
RMSprop        | Divide by running average of gradient magnitudes | Handles varying gradient scales
AdaGrad        | Adapt learning rate based on history            | Good for sparse gradients

Tips for Using This Simulation

  • Compare Batch Sizes: Run with batch=1, then reset and run with batch=50. Compare the paths!
  • Watch the Orange Points: See how different random samples cause different gradient directions.
  • Use Step Mode: Click "Step" repeatedly to see exactly which points are selected and how they affect the gradient.
  • Try Different Learning Rates: Too high (0.001) causes overshooting. Too low (0.00001) is very slow.
  • Generate New Data: Click "New Data" to see how the optimal solution and convergence path change.