Stochastic Gradient Descent (SGD) Tutorial 

Stochastic Gradient Descent (SGD) is a variation of the gradient descent optimization algorithm where instead of computing the gradient using the entire dataset, we compute it using a random subset (mini-batch) of data points. This introduces noise into the gradient estimate but offers significant computational advantages.

The Key Insight: Trading Accuracy for Speed

Consider a dataset with millions of data points. In Batch Gradient Descent, you must compute the gradient contribution from every single point before taking ONE step. This is:

  • Computationally expensive: Each step requires processing ALL data
  • Memory intensive: Must hold entire dataset in memory
  • Slow to converge: You take fewer, more accurate steps

In SGD, you randomly sample a small batch (e.g., 5-32 points) and compute a "noisy" gradient estimate. This gradient points roughly toward the minimum, but with some random deviation.

The Mathematics

For linear regression y = mx + b, the Mean Squared Error (MSE) loss is:

MSE = (1/N) Σ_{i=1..N} (y_pred,i − y_true,i)²

Batch Gradient Descent computes gradients over ALL N points:

∂MSE/∂m = (1/N) Σ_{i=1..N} (error_i · x_i),  where error_i = (m·x_i + b) − y_i

Stochastic Gradient Descent computes gradients over a mini-batch of size B:

∂MSE/∂m ≈ (1/B) Σ_{i∈batch} (error_i · x_i)

The update rule remains the same:

m ← m − η · ∂MSE/∂m
b ← b − η · ∂MSE/∂b
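The mini-batch gradient and update rule above can be sketched in Python with NumPy (an illustrative reconstruction, not the simulation's actual code; the data, seed, and function names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data around the true line y = 2x + 1 (assumed for illustration)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=50)

def sgd_step(m, b, x, y, eta, batch_size, rng):
    # Sample a random mini-batch of B points without replacement
    idx = rng.choice(len(x), size=batch_size, replace=False)
    xb, yb = x[idx], y[idx]
    error = (m * xb + b) - yb        # error_i = pred_i - y_i
    grad_m = np.mean(error * xb)     # (1/B) Σ error_i · x_i
    grad_b = np.mean(error)          # (1/B) Σ error_i
    return m - eta * grad_m, b - eta * grad_b   # the update rule

m, b = 0.1, 0.0
for _ in range(2000):
    m, b = sgd_step(m, b, x, y, eta=0.01, batch_size=5, rng=rng)
```

After a few thousand noisy steps the parameters should land near the true (m, b) despite each individual gradient being computed from only five points.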

Why Does SGD Work?

The magic of SGD lies in the Law of Large Numbers: even though each individual gradient estimate is noisy, the average direction over many steps still points toward the minimum. The noise actually helps in some ways:

  • Escape local minima: Random fluctuations can help jump out of shallow local minima
  • Regularization effect: The noise prevents overfitting to the training data
  • Faster initial progress: Many noisy steps > few precise steps early in training
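The Law of Large Numbers claim is easy to check numerically: averaging many single-point gradient estimates recovers the full-batch gradient (a sketch with assumed synthetic data):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=200)

m, b = 0.5, 0.0
error = (m * x + b) - y

# Exact full-batch gradient of MSE with respect to m
full_grad_m = np.mean(error * x)

# Average of 10,000 size-1 (pure SGD) gradient estimates
samples = rng.choice(len(x), size=10000, replace=True)
avg_grad_m = np.mean(error[samples] * x[samples])
```

Each size-1 estimate is wildly noisy, yet their average sits right on top of the full-batch gradient, which is why the noisy steps still drift toward the minimum.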

The Batch Size Spectrum

Batch Size | Name           | Gradient Noise | Memory       | Steps/Epoch
1          | Pure SGD       | Maximum        | Minimal      | N steps
8-64       | Mini-batch SGD | Moderate       | Small        | N/B steps
N (full)   | Batch GD       | Zero           | Full dataset | 1 step
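The "Steps/Epoch" column follows from how an epoch is usually organized: shuffle the data once, then walk through it in batches of size B, giving N/B updates per pass (a sketch of the standard pattern, not the simulation's internals):

```python
import numpy as np

N, B = 50, 5
rng = np.random.default_rng(2)
order = rng.permutation(N)          # shuffle indices once per epoch

steps = 0
for start in range(0, N, B):
    batch_idx = order[start:start + B]   # one mini-batch of B points
    # ... compute the gradient on batch_idx and update m, b here ...
    steps += 1                           # one parameter update per batch
```

With N = 50 and B = 5 this yields 10 updates per epoch; with B = 1 it would be 50, and with B = N just one.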

Visualization: Two Views

This simulation shows SGD from two perspectives:

  • Data Space (Left Canvas): Shows the data points and current regression line. The orange highlighted points are the current mini-batch being used for gradient computation. Watch how different random batches cause the line to "wobble" as it converges.
  • Loss Landscape (Right Canvas): Shows the MSE as a heatmap where X-axis is slope (m) and Y-axis is intercept (b). Darker colors = lower loss. The red path traces the SGD trajectory — notice the characteristic zig-zag pattern caused by noisy gradients!
[Interactive simulation widget: "Data Space (y = mx + b)" and "Loss Landscape (MSE Heatmap)" canvases, with a legend (Mini-batch Points, Other Data Points, Current Fit, SGD Path, Optimal Point), live readouts (Step, m, b, MSE, Optimal m*, Optimal b*), and a Step Details panel. Click Step for single-step mode with details, or Start for continuous training.]

Interactive Controls

  • Start/Stop: Toggle continuous SGD training.
  • Step Fwd ▶: Execute a single SGD step with detailed breakdown shown in the Step Details panel.
  • ◀ Step Back: Rewind to the previous step to review the calculation again.
  • Reset: Reset parameters to initial values while keeping the same data.
  • New Data: Generate a fresh random dataset.
  • Learning Rate (η): Controls step size. Too high → overshooting. Too low → slow convergence.
  • Batch Size: The number of points used per gradient computation:
    • 1: Pure SGD — maximum noise, zig-zag path, but computationally cheap
    • 5-15: Mini-batch — good balance of noise and stability
    • 50 (all): Full Batch GD — smooth path, no noise, but slow
  • Speed: Animation speed (higher = faster).

Understanding the Step Details Panel

When using Step Fwd ▶ mode, the panel shows detailed calculations:

  • Mini-Batch Selection: Which random points were chosen (shown as orange highlights on the canvas).
  • Gradient Calculation: Per-point breakdown showing:
    • Prediction: pred = m·x + b
    • Error: error = pred − y
    • Gradient contributions: error×x for ∂m, error for ∂b
  • Parameter Update (m_new, b_new): Step-by-step calculation:
    • m_new = m − η × ∂MSE/∂m (with intermediate values)
    • b_new = b − η × ∂MSE/∂b (with intermediate values)
  • Regression Line Equation: The updated line y = m_new·x + b_new.
  • Convergence Status: Current MSE and distance to optimal (least squares) solution.
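The panel's arithmetic can be reproduced in a few lines (the numbers below are made-up illustrations, not the simulation's data):

```python
m, b, eta = 0.5, 1.0, 0.01
batch = [(2.0, 3.0), (4.0, 8.0)]   # (x, y) pairs in the mini-batch

grad_m = grad_b = 0.0
for x, y in batch:
    pred = m * x + b        # Prediction: pred = m·x + b
    error = pred - y        # Error: error = pred − y
    grad_m += error * x     # contribution to ∂MSE/∂m
    grad_b += error         # contribution to ∂MSE/∂b
grad_m /= len(batch)        # average over the batch
grad_b /= len(batch)

m_new = m - eta * grad_m    # m_new = m − η × ∂MSE/∂m
b_new = b - eta * grad_b    # b_new = b − η × ∂MSE/∂b
```

For these two points the errors are −1 and −5, so the averaged gradients are −11 and −3, and both parameters move up slightly: m_new = 0.61, b_new = 1.03.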

Observing the Stochastic Nature

The key visual insight is in the Loss Landscape (right canvas):

  • With Batch Size = 1: The path zig-zags wildly! Each random point pulls the parameters in a different direction.
  • With Batch Size = 50 (full batch): The path is smooth and direct — like classical gradient descent.
  • With Batch Size = 5-10: A good middle ground — some noise helps escape local minima, but progress is steady.
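The zig-zag-versus-smooth behavior can be quantified: the spread of the mini-batch gradient estimate shrinks as batch size grows, and vanishes at full batch (a sketch with assumed synthetic data):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=50)
m, b = 0.5, 0.0
error = (m * x + b) - y

def grad_std(batch_size, trials=2000):
    # Standard deviation of the mini-batch estimate of ∂MSE/∂m
    estimates = []
    for _ in range(trials):
        idx = rng.choice(50, size=batch_size, replace=False)
        estimates.append(np.mean(error[idx] * x[idx]))
    return np.std(estimates)

noise_1 = grad_std(1)    # pure SGD: maximum spread
noise_5 = grad_std(5)    # mini-batch: moderate spread
noise_50 = grad_std(50)  # full batch: every draw is identical
```

The spread falls roughly like 1/√B, which is exactly why the path on the loss landscape straightens out as the batch size slider increases.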

Why SGD is Used in Deep Learning

Modern neural networks train on millions or billions of data points. Computing the exact gradient would require:

  • Forward pass through ALL data
  • Backward pass through ALL data
  • Only THEN can you take ONE step!

With SGD (batch size 32-256), you take thousands of noisy but cheap steps instead. The noise is actually beneficial:

  • Generalization: Prevents memorizing training data
  • Exploration: Helps find flatter, more robust minima
  • Speed: More parameter updates per second

SGD Variants in Practice

Algorithm      | Key Idea                                        | Benefit
SGD + Momentum | Accumulate velocity from past gradients         | Smoother path, faster convergence
Adam           | Adaptive learning rates per parameter           | Less tuning needed
RMSprop        | Divide by running average of gradient magnitudes | Handles varying gradient scales
AdaGrad        | Adapt learning rate based on history            | Good for sparse gradients

Tips for Using This Simulation

  • Compare Batch Sizes: Run with batch=1, then reset and run with batch=50. Compare the paths!
  • Watch the Orange Points: See how different random samples cause different gradient directions.
  • Use Step Mode: Click "Step" repeatedly to see exactly which points are selected and how they affect the gradient.
  • Try Different Learning Rates: Too high (0.001) causes overshooting. Too low (0.00001) is very slow.
  • Generate New Data: Click "New Data" to see how the optimal solution and convergence path change.