Stochastic Gradient Descent (SGD) is a variation of the gradient descent optimization algorithm where, instead of computing the gradient using the entire dataset, we compute it using a random subset (mini-batch) of data points. This introduces noise into the gradient estimate but offers significant computational advantages.

The Key Insight: Trading Accuracy for Speed

Consider a dataset with millions of data points. In Batch Gradient Descent, you must compute the gradient contribution from every single point before taking ONE step. This is slow and memory-intensive when the dataset is large.
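To make that cost concrete, here is a minimal NumPy sketch of a single batch-gradient step; the synthetic dataset, starting values, and variable names are illustrative, not part of the simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000
x = rng.uniform(0, 10, N)
y_true = 2.5 * x + 40 + rng.normal(0, 1, N)  # synthetic data: m* = 2.5, b* = 40

def batch_gradients(m, b, x, y):
    """One full-dataset gradient estimate: O(N) work for a single step."""
    error = (m * x + b) - y          # prediction error for ALL N points
    grad_m = (error * x).mean()      # (1/N) Σ error_i · x_i
    grad_b = error.mean()            # (1/N) Σ error_i
    return grad_m, grad_b

# Every parameter update must touch all one million points first.
grad_m, grad_b = batch_gradients(0.1, 40.0, x, y_true)
```

Even this simple linear model touches a million points per step; for a deep network, the same full pass would be repeated for every update.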
In SGD, you randomly sample a small batch (e.g., 5-32 points) and compute a "noisy" gradient estimate. This gradient points roughly toward the minimum, but with some random deviation.

The Mathematics

For linear regression y = mx + b, the Mean Squared Error (MSE) loss is:

MSE = (1/N) Σ_{i=1}^{N} (y_pred,i − y_true,i)²

Batch Gradient Descent computes gradients over ALL N points (the factor of 2 from differentiating the square is absorbed into the learning rate):

∂MSE/∂m = (1/N) Σ_{i=1}^{N} (error_i · x_i)

Stochastic Gradient Descent computes gradients over a mini-batch of size B:

∂MSE/∂m ≈ (1/B) Σ_{i∈batch} (error_i · x_i)

where error_i = y_pred,i − y_true,i. The update rule remains the same:
m ← m − η · ∂MSE/∂m

Why Does SGD Work?

The magic of SGD lies in the Law of Large Numbers: even though each individual gradient estimate is noisy, the average direction over many steps still points toward the minimum. The noise actually helps in some ways, for example by nudging the parameters out of shallow local minima and saddle points.
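Putting the gradient formulas and the update rule together, a minimal SGD training loop might look like the sketch below; the synthetic data, batch size B = 32, learning rate η = 0.01, and step count are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
x = rng.uniform(0, 10, N)
y = 2.5 * x + 40 + rng.normal(0, 1, N)   # synthetic data: m* = 2.5, b* = 40

m, b = 0.1, 40.0     # initial guesses
eta, B = 0.01, 32    # learning rate η and mini-batch size B

for step in range(5_000):
    idx = rng.choice(N, size=B, replace=False)  # draw a random mini-batch
    xb, yb = x[idx], y[idx]
    error = (m * xb + b) - yb                   # error_i = y_pred,i − y_true,i
    grad_m = (error * xb).mean()                # (1/B) Σ error_i · x_i
    grad_b = error.mean()                       # (1/B) Σ error_i
    m -= eta * grad_m                           # m ← m − η · ∂MSE/∂m
    b -= eta * grad_b                           # b ← b − η · ∂MSE/∂b

print(m, b)  # noisy estimates that should land near m* = 2.5, b* = 40
```

Each step uses only 32 of the 10,000 points, so an update is hundreds of times cheaper than a full-batch gradient, at the price of some zig-zagging.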
The Batch Size Spectrum
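One way to see the spectrum empirically is to estimate ∂MSE/∂m many times at each batch size and compare the spread of the estimates; the noise shrinks roughly like 1/√B. A sketch, with illustrative batch sizes and synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
x = rng.uniform(0, 10, N)
y = 2.5 * x + 40 + rng.normal(0, 1, N)
m, b = 0.1, 40.0  # a fixed point in parameter space

def grad_m_estimate(B):
    """One mini-batch estimate of ∂MSE/∂m (sampled with replacement)."""
    idx = rng.choice(N, size=B)
    error = (m * x[idx] + b) - y[idx]
    return (error * x[idx]).mean()

for B in (1, 8, 32, 256, N):
    est = [grad_m_estimate(B) for _ in range(200)]
    print(f"B={B:>6}: mean {np.mean(est):8.2f}, std {np.std(est):7.3f}")
```

The mean of the estimates is about the same at every batch size (the estimator is unbiased); only the spread changes, which is the accuracy-for-speed trade in miniature.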
Visualization: Two Views

This simulation shows SGD from two perspectives:
[Interactive simulation: two linked canvases. Left, Data Space (y = mx + b), showing the mini-batch points, the other data points, and the current fit line; right, Loss Landscape (MSE Heatmap), showing the SGD path and the optimal point (m*, b*). Readouts track the step count, the current m, b, and MSE, and the optimal m* and b*. A Step Details panel explains each update: click Step for single-step mode with details, or Start for continuous training.]
Interactive Controls
Understanding the Step Details Panel

When using Step Fwd ▶ mode, the panel shows the detailed calculations behind each update.
Observing the Stochastic Nature

The key visual insight is in the Loss Landscape (right canvas): the SGD path zig-zags rather than following a smooth curve, because each mini-batch gives a slightly different gradient direction.
Why SGD is Used in Deep Learning

Modern neural networks train on millions or billions of data points. Computing the exact gradient would require a full pass over the entire dataset for every single parameter update.
With SGD (batch size 32-256), you take thousands of noisy but cheap steps instead. The noise is actually beneficial, helping the optimizer escape sharp minima and saddle points and often improving generalization.
SGD Variants in Practice
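As one illustration of the variants this heading refers to, here is a sketch of SGD with momentum, a common extension in which a running "velocity" smooths the noisy per-batch gradients; the momentum coefficient β = 0.9 is an illustrative default, not something the simulation uses:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
x = rng.uniform(0, 10, N)
y = 2.5 * x + 40 + rng.normal(0, 1, N)   # synthetic data: m* = 2.5, b* = 40

m, b = 0.1, 40.0
eta, B, beta = 0.01, 32, 0.9   # β: momentum coefficient
v_m = v_b = 0.0                # velocity accumulators

for step in range(3_000):
    idx = rng.choice(N, size=B, replace=False)
    error = (m * x[idx] + b) - y[idx]
    v_m = beta * v_m + (error * x[idx]).mean()  # blend in gradient history
    v_b = beta * v_b + error.mean()
    m -= eta * v_m                              # step along the velocity
    b -= eta * v_b
```

Because successive noisy gradients are averaged into the velocity, the path through the loss landscape is smoother than plain SGD's, and consistent downhill directions accelerate.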
Tips for Using This Simulation