This tutorial visualizes the backpropagation algorithm in a "glass-box" 2-2-1 neural network (2 inputs → 2 hidden → 1 output). Data flows forward and gradients flow backward as animated packets, with the chain-rule arithmetic shown live for any connection you click. You can use this to build intuition for vanishing gradients, dying ReLU, and the special cancellation that makes cross-entropy + sigmoid work.

Mathematical Foundation

Each training step makes two passes through the network.

Forward pass: every neuron computes a weighted sum and applies an activation function: z = Σ (w · a) + b → a = σ(z)
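The forward-pass formula can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the parameter layout (`W1`, `b1`, `W2`, `b2`), and the example weights are hypothetical and are not taken from the simulator.

```python
import math

def sigmoid(z):
    """Logistic activation: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_forward(weights, inputs, bias):
    """One neuron: z = sum(w * a) + b, then a = sigmoid(z)."""
    z = sum(w * a for w, a in zip(weights, inputs)) + bias
    return z, sigmoid(z)

def forward_2_2_1(x, params):
    """Forward pass through a 2-2-1 network.

    params is a hypothetical layout: W1 holds one weight row per hidden
    neuron, b1 one bias per hidden neuron, W2 the output weights, b2 the
    output bias.
    """
    hidden = [neuron_forward(params["W1"][j], x, params["b1"][j]) for j in range(2)]
    a_hidden = [a for _, a in hidden]
    z_out, a_out = neuron_forward(params["W2"], a_hidden, params["b2"])
    return a_hidden, z_out, a_out

# Example weights chosen arbitrarily for illustration.
params = {
    "W1": [[0.5, -0.5], [0.3, 0.8]],
    "b1": [0.0, 0.1],
    "W2": [1.0, -1.0],
    "b2": 0.0,
}
a_hidden, z_out, a_out = forward_2_2_1([1.0, 0.0], params)
print(a_hidden, z_out, a_out)
```

Each hidden activation becomes an input `a` to the output neuron, which is why the same symbol appears on both sides of the formula.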
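Backward pass: gradients flow from the output back through every connection, using the chain rule. The gradient of the loss with respect to any weight is the product of two locally known quantities: ∂L / ∂w = δ · a
where δ is the error term of the neuron the weight feeds into and a is the activation of the neuron the weight comes from.

Delta Decomposition: Error × Slope

The error term δ is itself a product of two factors: δ = (raw error) × σ′(z)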
For the sigmoid activation, the slope can be written directly in terms of the output: σ′(z) = σ(z) · (1 − σ(z)) = a · (1 − a)
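One way to convince yourself of ∂L / ∂w = δ · a and of the slope identity is to compare the analytic value with a finite-difference estimate. The sketch below does this for a single sigmoid neuron. It assumes a squared-error loss L = ½ (a − y)², which matches the (a − y) raw error used here but is an assumption about the simulator, and the numeric values are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, b, y):
    """Assumed loss for one neuron: L = 0.5 * (a - y)^2."""
    a = sigmoid(w * x + b)
    return 0.5 * (a - y) ** 2

# Arbitrary illustrative values: weight, input activation, bias, target.
w, x, b, y = 0.4, 1.5, -0.2, 1.0

# Analytic gradient from the two local quantities.
a = sigmoid(w * x + b)
slope = a * (1.0 - a)      # sigma'(z) written in terms of the output
delta = (a - y) * slope    # raw error times slope
analytic = delta * x       # dL/dw = delta * (source activation)

# Central finite difference as an independent check.
eps = 1e-6
numeric = (loss(w + eps, x, b, y) - loss(w - eps, x, b, y)) / (2 * eps)

print(analytic, numeric)
```

The two printed numbers should agree to many decimal places, which is the chain rule at work: nothing beyond δ and the incoming activation is needed to update the weight.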
The slope is a property of the neuron's state, not of the error. A saturated neuron with output 0.99 and target 0.0 has a large raw error (a − y = 0.99), but its slope is only about 0.01, so the product δ is tiny (about 0.0098): the loud "you're wrong!" signal hits earplugs. The simulator marks these neurons with a 🔒 icon.

Output Layer vs Hidden Layer Delta

The two layer types use the same factor structure but get the "raw error" from very different places: the output layer compares its activation with the target (a − y), while a hidden layer sums the deltas of the next layer, weighted by the connecting weights (Σ w · δ).
So the full delta formulas are:
δ_output = (a − y) · σ′(z)
δ_hidden = (Σ w_next · δ_next) · σ′(z)
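The two delta formulas can be traced by hand for the 2-2-1 network. The following is a minimal sketch assuming sigmoid activations everywhere; the function name and argument layout are hypothetical. Because there is only one output neuron, the sum Σ w · δ in each hidden delta has a single term.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def deltas_2_2_1(x, y, W1, b1, W2, b2):
    """Return activations and deltas for one training sample."""
    # Forward pass (needed because every slope depends on an activation).
    a_h = [sigmoid(W1[j][0] * x[0] + W1[j][1] * x[1] + b1[j]) for j in range(2)]
    a_o = sigmoid(W2[0] * a_h[0] + W2[1] * a_h[1] + b2)

    # Output delta: raw error comes from the external target.
    delta_o = (a_o - y) * a_o * (1.0 - a_o)

    # Hidden deltas: raw error is the weighted delta from the next layer.
    delta_h = [W2[j] * delta_o * a_h[j] * (1.0 - a_h[j]) for j in range(2)]
    return a_h, a_o, delta_o, delta_h

# Zero hidden weights put every neuron at a = 0.5, which keeps the
# arithmetic easy to follow by hand.
a_h, a_o, delta_o, delta_h = deltas_2_2_1(
    x=[1.0, 1.0], y=1.0,
    W1=[[0.0, 0.0], [0.0, 0.0]], b1=[0.0, 0.0],
    W2=[2.0, -2.0], b2=0.0,
)
print(delta_o, delta_h)
```

With these values the output delta is (0.5 − 1) · 0.25 = −0.125, and the two hidden deltas have opposite signs because their outgoing weights do.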
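When you hover over a hidden-layer connection in the simulator, the math panel switches its label from "Raw Error (a−y)" to "Backprop Sum (Σwδ)" to reflect this difference. Hidden neurons learn by listening to weighted "complaints" from the neurons in front of them, not from any external target.

Vanishing Gradient and Saturation

Backpropagation multiplies one σ′(z) factor into the gradient for every layer it passes through. The sigmoid slope a · (1 − a) is at most 0.25, and much smaller when a neuron is saturated, so the gradient shrinks as it travels backward.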
ReLU mostly avoids this by having slope 1 for positive inputs (no shrinkage), but introduces its own failure mode: a neuron whose pre-activation is always negative outputs zero, has slope zero, and never recovers. This is the dying ReLU problem.

Cross-Entropy × Sigmoid Cancellation

When you switch the loss to Cross-Entropy and keep sigmoid on the output, a useful algebraic cancellation happens. The output delta simplifies dramatically: δ_output = (a − y)
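The σ′(z) factor cancels against the derivative of the cross-entropy loss, so even a saturated output neuron receives the full raw error.

See it in the simulator: set Loss = MSE with a saturated output and watch the loss curve flatten. Switch to Cross-Entropy and the same configuration starts learning again immediately, even though the network and weights are identical.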
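The difference between the two losses shows up clearly in numbers. The sketch below compares the output delta under both losses for a saturated neuron (a = 0.99, y = 0). It assumes the squared-error loss L = ½ (a − y)² and the binary cross-entropy L = −[y ln a + (1 − y) ln(1 − a)]; the function names are hypothetical.

```python
def output_delta_mse(a, y):
    """MSE + sigmoid: raw error times slope."""
    return (a - y) * a * (1.0 - a)

def output_delta_cross_entropy_long_form(a, y):
    """Cross-entropy + sigmoid, written out before cancelling.

    dL/da = (a - y) / (a * (1 - a)) for binary cross-entropy, and the
    sigmoid slope is a * (1 - a), so the product collapses to a - y.
    """
    dL_da = (a - y) / (a * (1.0 - a))
    slope = a * (1.0 - a)
    return dL_da * slope

def output_delta_cross_entropy(a, y):
    """Cross-entropy + sigmoid after the cancellation."""
    return a - y

a, y = 0.99, 0.0  # confidently wrong, saturated output
mse_delta = output_delta_mse(a, y)
ce_long = output_delta_cross_entropy_long_form(a, y)
ce_delta = output_delta_cross_entropy(a, y)

print(mse_delta)   # roughly 0.0098: the slope has muted the error
print(ce_long)     # matches a - y up to rounding
print(ce_delta)    # 0.99: the full raw error survives
```

Under these assumptions the cross-entropy delta is about a hundred times larger than the MSE delta for the same neuron state, which is consistent with the network resuming learning after the switch.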
Simulation

The interactive simulator is below. Pick an AND or XOR scenario, hit Play, and watch the loss curve descend. Click on any connection to inspect the chain-rule arithmetic for that specific weight, and use Step to walk one epoch at a time.

[Interactive simulator. Panels: Network View, Chain Rule, Loss Calculation (average loss across all training samples), Loss vs. Epoch, Activations (a), Deltas (δ), Weights (w), Parameters, Scenarios, Interactions.]
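For readers without access to the interactive page, here is a rough sketch of what one epoch on the AND scenario could look like in code. It is not the simulator's implementation: the per-sample update order, the learning rate of 0.5, the starting weights, the squared-error loss, and all names are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# AND truth table: (inputs, target).
AND_DATA = [([0.0, 0.0], 0.0), ([0.0, 1.0], 0.0),
            ([1.0, 0.0], 0.0), ([1.0, 1.0], 1.0)]

def train_epoch(net, data, lr):
    """Run one pass over the data, updating after each sample.

    Returns the average loss 0.5 * (a - y)^2 seen during the epoch.
    """
    total = 0.0
    for x, y in data:
        # Forward pass.
        a_h = [sigmoid(net["W1"][j][0] * x[0] + net["W1"][j][1] * x[1] + net["b1"][j])
               for j in range(2)]
        a_o = sigmoid(net["W2"][0] * a_h[0] + net["W2"][1] * a_h[1] + net["b2"])
        total += 0.5 * (a_o - y) ** 2

        # Backward pass: deltas use the weights from before the update.
        d_o = (a_o - y) * a_o * (1.0 - a_o)
        d_h = [net["W2"][j] * d_o * a_h[j] * (1.0 - a_h[j]) for j in range(2)]

        # Gradient step: each weight moves by lr * delta * source activation.
        for j in range(2):
            net["W2"][j] -= lr * d_o * a_h[j]
            for i in range(2):
                net["W1"][j][i] -= lr * d_h[j] * x[i]
            net["b1"][j] -= lr * d_h[j]
        net["b2"] -= lr * d_o
    return total / len(data)

# Arbitrary asymmetric starting weights so the hidden neurons differ.
net = {"W1": [[0.5, -0.4], [0.3, 0.2]], "b1": [0.0, 0.0],
       "W2": [0.6, -0.3], "b2": 0.0}

losses = [train_epoch(net, AND_DATA, lr=0.5) for _ in range(2000)]
print(losses[0], losses[-1])
```

Calling `train_epoch` once corresponds loosely to pressing Step, and the `losses` list is the data behind a loss-versus-epoch curve.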
Limitations

This is an educational visualizer, not a production training tool. Notable simplifications: