Web Simulator | ShareTechnote


Backpropagation Tutorial

This tutorial visualizes the backpropagation algorithm in a "glass-box" 2-2-1 neural network (2 inputs → 2 hidden → 1 output). Data flows forward and gradients flow backward as animated packets, with the chain-rule arithmetic shown live for any connection you click. You can use this to build intuition for vanishing gradients, dying ReLU, and the special cancellation that makes cross-entropy + sigmoid work.

Sections

Mathematical Foundation
Delta Decomposition: Error × Slope
Output Layer vs Hidden Layer Delta
Vanishing Gradient and Saturation
Cross-Entropy × Sigmoid Cancellation
Simulation
Parameters
Scenarios
Interactions
Limitations

Mathematical Foundation

Each training step does two passes through the network.

Forward pass: every neuron computes a weighted sum of the activations feeding into it, adds a bias, and applies an activation function:

z = Σ (w · a_prev) + b → a = σ(z)
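The forward pass can be sketched in a few lines of Python. This is only an illustration, not the simulator's own code: the dictionary layout, the function names, and the weight values are all assumptions chosen for readability.

```python
import math

def sigmoid(z):
    """Logistic activation: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """One neuron: weighted sum plus bias, then the activation."""
    z = sum(w * a for w, a in zip(weights, inputs)) + bias
    return z, sigmoid(z)

def forward(x, params):
    """Forward pass through a 2-2-1 network; returns hidden and output activations."""
    hidden = [neuron(x, w, b)[1] for w, b in zip(params["W1"], params["b1"])]
    _, out = neuron(hidden, params["W2"], params["b2"])
    return hidden, out

# Hypothetical weights, chosen only for illustration.
params = {
    "W1": [[0.5, -0.5], [0.3, 0.8]],  # one row of input weights per hidden neuron
    "b1": [0.0, 0.0],
    "W2": [1.0, -1.0],                # hidden -> output weights
    "b2": 0.0,
}
hidden, out = forward([1.0, 0.0], params)
print(hidden, out)
```

Each hidden activation becomes an input to the output neuron, which is the same "data flows forward" step the animated packets show.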

Backward pass: gradients flow from the output back through every connection, using the chain rule. The gradient of the loss with respect to any weight is just the product of two locally-known quantities:

∂L / ∂w = δ · a

where δ is the error term at the target neuron of that weight, and a is the activation at the source neuron. Click any connection in the simulator to see this exact arithmetic.
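As a sanity check on ∂L/∂w = δ · a, the sketch below compares the formula with a finite-difference estimate for a single sigmoid neuron. It assumes the loss is half the squared error, L = ½(a − y)², which is the convention under which the output delta takes the form (a − y) · σ′(z); the numeric values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, a_src, b, y):
    """Half squared error of a single sigmoid neuron (assumed loss convention)."""
    a = sigmoid(w * a_src + b)
    return 0.5 * (a - y) ** 2

# Hypothetical values for one connection.
w, a_src, b, y = 0.7, 0.6, 0.1, 1.0

a = sigmoid(w * a_src + b)
delta = (a - y) * a * (1.0 - a)   # error term at the target neuron
grad = delta * a_src              # dL/dw = delta * source activation

# Central finite difference as an independent check.
eps = 1e-6
numeric = (loss(w + eps, a_src, b, y) - loss(w - eps, a_src, b, y)) / (2 * eps)
print(grad, numeric)
```

The two printed values should agree to many decimal places, which is what "locally-known quantities" buys: no global computation is needed for any single weight.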

Delta Decomposition: Error × Slope

The error term δ at any neuron is the product of two ingredients:

δ = (raw error) × σ′(z)

Ingredient	Question it answers	Depends on
Raw error `(a − y)`	"How wrong are we?"	The target `y`
Slope `σ′(z)`	"Is the neuron listening?"	The neuron's current state `z` or `a`

For the sigmoid activation, the slope can be written directly in terms of the output:

σ′(z) = σ(z) · (1 − σ(z)) = a · (1 − a)

Neuron output `a`	Slope `a(1−a)`	State
0.50	0.25 (max)	"Listening" — learns fast
0.80	0.16	Reduced sensitivity
0.99	0.01	"Locked" — saturated, ignores gradient
0.01	0.01	"Locked" — saturated, ignores gradient

The slope is a property of the neuron's state, not the error. A saturated neuron with output 0.99 and target 0.0 has a huge raw error (a − y = +0.99), but its slope is about 0.01, so the product δ ≈ 0.01 is tiny: the loud "you're wrong!" signal hits earplugs. The simulator marks these neurons with a 🔒 icon.
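The rows of the slope table, and the saturated example, can be reproduced with a short sketch. The function name is an assumption, and the raw error is computed as a − y.

```python
def sigmoid_slope(a):
    """Sigmoid derivative written in terms of the neuron's output a."""
    return a * (1.0 - a)

# Reproduce the table: the slope peaks at a = 0.5 and collapses near 0 or 1.
for a in (0.50, 0.80, 0.99, 0.01):
    print(f"a={a:.2f}  slope={sigmoid_slope(a):.4f}")

# Saturated and wrong: output 0.99, target 0.0.
a, y = 0.99, 0.0
raw_error = a - y                      # large: the network is very wrong
delta = raw_error * sigmoid_slope(a)   # small: the neuron is barely listening
print(raw_error, delta)
```

Even with a raw error close to 1, the delta comes out near 0.01, which is why a saturated neuron updates so slowly.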

Output Layer vs Hidden Layer Delta

The two layer types use the same factor structure but get the "raw error" from very different places:

Layer	Raw error term	Where it comes from
Output	`(a − y)`	The loss function: direct comparison with the target.
Hidden	`Σ w_next · δ_next`	Weighted sum of the deltas from the layer ahead.

So the full delta formulas are:

δ_output = (a − y) · σ′(z)

δ_hidden = (Σ w_next · δ_next) · σ′(z)

When you hover over a hidden-layer connection in the simulator, the math panel switches its label from "Raw Error (a−y)" to "Backprop Sum (Σwδ)" to reflect this difference. Hidden neurons learn by listening to weighted "complaints" from the neurons in front of them, not from any external target.
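Both delta formulas can be put side by side in one backward pass. The sketch below assumes sigmoid activations everywhere and half squared error as the loss; the function signature and the weight values are hypothetical, not taken from the simulator.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backward(x, y, W1, b1, W2, b2):
    """One forward + backward pass for a 2-2-1 sigmoid network (half squared error)."""
    # Forward pass.
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    a = sigmoid(sum(w * hj for w, hj in zip(W2, h)) + b2)

    # Output delta: raw error from the target, times the local slope.
    delta_out = (a - y) * a * (1.0 - a)

    # Hidden deltas: the raw error is the weighted delta from the layer ahead.
    # With a single output neuron the sum has exactly one term.
    delta_hid = [(W2[j] * delta_out) * h[j] * (1.0 - h[j]) for j in range(2)]

    # Gradient for every weight: delta at the target times activation at the source.
    grad_W2 = [delta_out * h[j] for j in range(2)]
    grad_W1 = [[delta_hid[j] * x[i] for i in range(2)] for j in range(2)]
    return a, delta_out, delta_hid, grad_W1, grad_W2

# Hypothetical weights and one training row.
W1, b1 = [[0.5, -0.4], [0.3, 0.8]], [0.0, 0.0]
W2, b2 = [1.0, -1.0], 0.0
x, y = [1.0, 1.0], 0.0
a, delta_out, delta_hid, grad_W1, grad_W2 = backward(x, y, W1, b1, W2, b2)
print(delta_out, delta_hid)
```

Note that the hidden deltas inherit the sign of the outgoing weight: a hidden neuron connected through a negative weight receives the opposite "complaint" from one connected through a positive weight.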

Vanishing Gradient and Saturation

Backpropagation multiplies in one σ′(z) factor per layer when propagating gradients backward. For sigmoid, σ′ ≤ 0.25 always, so unless the weights are large enough to compensate, the gradient shrinks by at least a factor of four per layer and approaches zero after just a few layers: the vanishing gradient problem. Practical symptoms in the simulator:

Loss curve goes flat early at a high value.
One or more neurons display the 🔒 "locked" icon.
Weight updates are near zero even when the prediction is clearly wrong.

ReLU mostly avoids this by having slope 1 for positive inputs (no shrinkage), but introduces its own failure mode: a neuron whose pre-activation is always negative outputs zero, has slope zero, and never recovers — the dying ReLU problem.
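A rough numerical picture of both failure modes, as a sketch. The sigmoid part tracks only the slope factors and ignores the weights, so it is a best-case bound rather than an actual gradient; the ReLU neuron's weights and bias are hypothetical values picked to make it dead on every row of a two-input truth table.

```python
def relu_slope(z):
    """Derivative of ReLU: 1 for positive pre-activations, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

# Best case for sigmoid: every neuron sits at a = 0.5, where the slope peaks at 0.25.
# Weight factors are ignored, so this only tracks the slope contribution.
sigmoid_bound = {depth: 0.25 ** depth for depth in (1, 2, 5, 10)}
for depth, bound in sigmoid_bound.items():
    print(f"{depth:2d} layers: slope product <= {bound:.2e}")

# Dying ReLU: a neuron whose pre-activation is negative on all four input rows.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
w, b = (-1.0, -1.0), -0.5   # hypothetical weights and bias
pre_activations = [w[0] * x1 + w[1] * x2 + b for x1, x2 in rows]
slopes = [relu_slope(z) for z in pre_activations]
print(pre_activations, slopes)
```

With every slope at zero, every delta for that neuron is zero, so no gradient ever reaches its incoming weights and nothing pushes the pre-activation back above zero.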

Cross-Entropy × Sigmoid Cancellation

When you switch the loss to Cross-Entropy and keep sigmoid on the output, a useful algebraic cancellation happens. The output delta simplifies dramatically:

δ_output = (a − y)

The σ′(z) factor that would normally cause vanishing gradients at saturated outputs is cancelled by the matching term in the cross-entropy derivative. This is why classification networks almost always pair softmax/sigmoid with cross-entropy loss instead of MSE.

See it in the simulator: set Loss = MSE with a saturated output and watch the loss curve flatten. Switch to Cross-Entropy and the same configuration starts learning again immediately, even though the network and weights are identical.
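The cancellation can be checked numerically. The sketch below takes a saturated, wrong output (hypothetical pre-activation z = 4 with target 0) and compares the output delta under both losses against finite-difference derivatives with respect to z. It assumes half squared error for MSE and binary cross-entropy for the single sigmoid output.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def half_mse(z, y):
    """Half squared error on the sigmoid output."""
    return 0.5 * (sigmoid(z) - y) ** 2

def bce(z, y):
    """Binary cross-entropy on the sigmoid output."""
    a = sigmoid(z)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

z, y = 4.0, 0.0                     # saturated (a is about 0.98) and wrong
a = sigmoid(z)

delta_mse = (a - y) * a * (1 - a)   # the slope factor crushes the signal
delta_ce = a - y                    # the slope factor has cancelled

# Finite-difference derivatives dL/dz as an independent check.
eps = 1e-6
numeric_mse = (half_mse(z + eps, y) - half_mse(z - eps, y)) / (2 * eps)
numeric_ce = (bce(z + eps, y) - bce(z - eps, y)) / (2 * eps)
print(delta_mse, numeric_mse)
print(delta_ce, numeric_ce)
```

For this input the cross-entropy delta is more than fifty times larger than the MSE delta, which matches the simulator behavior described above: same weights, same activations, but a much stronger learning signal.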

Simulation

The interactive simulator is below. Pick an AND or XOR scenario, hit Play, and watch the loss curve descend. Click on any connection to inspect the chain-rule arithmetic for that specific weight, and use Step to walk one epoch at a time.

[Interactive simulator. Controls: Learning Rate (default 0.10), Animation Speed (default 0.40), Activation, Loss, Scenario. Stats: Epoch, Loss, LR. Panels: Network View, Chain Rule, Loss Calculation (average loss across all training samples), Loss vs. Epoch, Activations (a), Deltas (δ), Weights (w).]

Hover over a connection to see the chain rule. The Chain Rule panel shows three quantities for that connection:

Target Error (δ) = Raw Error (a − y) × Slope σ′
Gradient (∂L/∂w) = Error (δ) × Input (a)
New Weight = Current (w) − L. Rate (η) × Gradient

Parameters

Control	Range	Effect
Learning Rate `η`	0.01–1.0	Step size for weight updates: `w ← w − η · ∂L/∂w`. Higher = faster but may overshoot.
Animation Speed	0.01–1.0	Speed of the green gradient packets and curved arrows. Cosmetic only.
Activation	Sigmoid / ReLU	Sigmoid is smooth and bounded but saturates; ReLU is fast but can die.
Loss	MSE / Cross-Entropy	Cross-Entropy with sigmoid cancels the slope factor at the output, preventing vanishing gradient there.
Weight Init (±)	magnitude `v`	Random initial weights drawn uniformly from `[−v, +v]`.
Bias Init	scalar	Starting bias for all non-input neurons.
Lock Threshold	0–1	If a neuron's mean `|σ′|` falls below this, it's flagged with 🔒. Higher values flag more neurons.
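Three of the controls above map onto one-line operations, sketched here. The function names are hypothetical, and the lock check assumes the mean is taken over a list of per-sample slopes, which the page does not spell out.

```python
import random

def init_weights(n, v, rng):
    """Weight Init: n weights drawn uniformly from [-v, +v]."""
    return [rng.uniform(-v, v) for _ in range(n)]

def update_weight(w, grad, lr):
    """Learning Rate: one gradient-descent step, w <- w - lr * dL/dw."""
    return w - lr * grad

def is_locked(slopes, threshold):
    """Lock Threshold: flag a neuron whose mean |slope| falls below the threshold."""
    mean_abs = sum(abs(s) for s in slopes) / len(slopes)
    return mean_abs < threshold

rng = random.Random(0)                   # fixed seed so the sketch is repeatable
print(init_weights(6, 0.5, rng))
print(update_weight(0.5, 0.2, 0.1))      # a positive gradient lowers the weight
print(is_locked([0.0099, 0.0099], 0.05)) # saturated neuron
print(is_locked([0.25, 0.16], 0.05))     # responsive neuron
```

Raising the threshold makes `is_locked` return True for more neurons, which is purely a display decision and does not change the update step.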

Scenarios

Scenario	Separability	Typical convergence
AND gate	Linearly separable	~50 epochs — converges with or without the hidden layer.
XOR gate	Not linearly separable	Hundreds of epochs — demonstrates why the hidden layer is required at all.
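For readers who want to try the two scenarios outside the browser, here is a minimal full-batch training sketch. It assumes a 2-2-1 sigmoid network with half squared error, weights initialized uniformly in [−1, 1], zero initial biases, and a fixed seed; none of these defaults are taken from the simulator, and epoch counts will differ with other seeds or settings.

```python
import math
import random

DATA = {
    "AND": [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)],
    "XOR": [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)],
}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(rows, epochs=2000, lr=0.5, seed=0):
    """Full-batch gradient descent; returns the mean loss recorded at each epoch."""
    rng = random.Random(seed)
    W1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
    b1 = [0.0, 0.0]
    W2 = [rng.uniform(-1, 1) for _ in range(2)]
    b2 = 0.0
    n = len(rows)
    history = []
    for _ in range(epochs):
        gW1 = [[0.0, 0.0], [0.0, 0.0]]
        gb1 = [0.0, 0.0]
        gW2 = [0.0, 0.0]
        gb2 = 0.0
        total = 0.0
        for x, y in rows:
            # Forward pass.
            h = [sigmoid(W1[j][0] * x[0] + W1[j][1] * x[1] + b1[j]) for j in range(2)]
            a = sigmoid(W2[0] * h[0] + W2[1] * h[1] + b2)
            total += 0.5 * (a - y) ** 2
            # Backward pass: accumulate gradients over the whole truth table.
            d_out = (a - y) * a * (1 - a)
            gb2 += d_out
            for j in range(2):
                d_hid = W2[j] * d_out * h[j] * (1 - h[j])
                gW2[j] += d_out * h[j]
                gb1[j] += d_hid
                for i in range(2):
                    gW1[j][i] += d_hid * x[i]
        history.append(total / n)
        # One update per epoch, using the gradient averaged over all rows.
        b2 -= lr * gb2 / n
        for j in range(2):
            W2[j] -= lr * gW2[j] / n
            b1[j] -= lr * gb1[j] / n
            for i in range(2):
                W1[j][i] -= lr * gW1[j][i] / n
    return history

results = {name: train(rows) for name, rows in DATA.items()}
for name, history in results.items():
    print(f"{name}: first loss {history[0]:.4f}, last loss {history[-1]:.4f}")
```

Plotting each `history` list gives a rough counterpart to the Loss vs. Epoch panel; trying several seeds on XOR is a quick way to see how sensitive convergence is to the starting weights.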

Interactions

Click on connections: see the exact chain-rule calculation for that weight. The selection persists so you can track one weight through training.
Watch the green packets: they show gradients flowing backward during backpropagation.
Observe neuron states: each neuron shows its current activation a and delta δ. Saturated neurons get a 🔒 icon.
Adjust Lock Threshold: tune the sensitivity for what counts as "locked".

Limitations

This is an educational visualizer, not a production training tool. Notable simplifications:

The 2-2-1 architecture is the smallest standard layered network that can solve XOR. Real networks have far more layers and units, and common deep-learning techniques (skip connections, batch norm, momentum) are not modeled.
Only two activation functions (sigmoid, ReLU) and two losses (MSE, Cross-Entropy). No softmax, no regularization, no dropout.
Training is full-batch on a 4-row truth table; mini-batch and SGD noise are absent.
The "Lock Threshold" is a UI heuristic for visualizing saturation, not a training-time intervention — real training would still attempt to update those weights.