Backpropagation Tutorial 

This tutorial visualizes the backpropagation algorithm in a "glass-box" 2-2-1 neural network (2 inputs → 2 hidden → 1 output). Data flows forward and gradients flow backward as animated packets, with the chain-rule arithmetic shown live for any connection you click. You can use this to build intuition for vanishing gradients, dying ReLU, and the special cancellation that makes cross-entropy + sigmoid work.

Mathematical Foundation

Each training step does two passes through the network.

Forward pass: every neuron computes a weighted sum and applies an activation function:

z = Σ (w · a) + b     →     a = σ(z)
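The forward-pass formula above can be sketched in a few lines of Python. This is a minimal illustration, not the simulator's code: the function names and the weight values are hypothetical and were chosen only to make the arithmetic easy to follow.

```python
import math

def sigmoid(z):
    """Logistic function: sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_forward(weights, inputs, bias):
    """Return (z, a) for one neuron: z = sum(w * a_prev) + b, a = sigma(z)."""
    z = sum(w * a for w, a in zip(weights, inputs)) + bias
    return z, sigmoid(z)

def forward_2_2_1(x, W_hidden, b_hidden, w_out, b_out):
    """Forward pass through 2 inputs -> 2 hidden -> 1 output."""
    hidden = [neuron_forward(W_hidden[j], x, b_hidden[j]) for j in range(2)]
    a_hidden = [a for _, a in hidden]
    z_out, a_out = neuron_forward(w_out, a_hidden, b_out)
    return hidden, (z_out, a_out)

# Hypothetical weights, for illustration only.
W_hidden = [[0.5, -0.3], [0.8, 0.2]]
b_hidden = [0.0, 0.0]
w_out = [1.0, -1.0]
b_out = 0.0

hidden, (z_out, a_out) = forward_2_2_1([1.0, 0.0], W_hidden, b_hidden, w_out, b_out)
print("hidden (z, a):", hidden)
print("output (z, a):", (z_out, a_out))
```

Each neuron keeps both z and a because the backward pass needs the activation of the source neuron and the slope at the target neuron.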

Backward pass: gradients flow from the output back through every connection, using the chain rule. The gradient of the loss with respect to any weight is just the product of two locally-known quantities:

∂L / ∂w = δ · a

where δ is the error term at the target neuron of that weight, and a is the activation at the source neuron. Click any connection in the simulator to see this exact arithmetic.
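One way to convince yourself that ∂L/∂w = δ · a holds is to compare it against a finite-difference estimate. The sketch below does this for a single sigmoid neuron. It assumes the loss L = ½ (a − y)², so that δ = (a − y) · σ′(z); the numeric values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, a_src, y):
    """Assumed loss L = 0.5 * (a - y)^2 for one sigmoid neuron fed by a_src."""
    a = sigmoid(w * a_src + b)
    return 0.5 * (a - y) ** 2

# Hypothetical values, for illustration only.
w, b, a_src, y = 0.7, -0.2, 0.6, 1.0

a = sigmoid(w * a_src + b)
delta = (a - y) * a * (1.0 - a)   # error term at the target neuron
grad_analytic = delta * a_src     # dL/dw = delta * (source activation)

# Central finite difference as an independent check.
eps = 1e-6
grad_numeric = (loss(w + eps, b, a_src, y) - loss(w - eps, b, a_src, y)) / (2 * eps)

print("analytic:", grad_analytic)
print("numeric: ", grad_numeric)
```

The two numbers should agree to many decimal places; a mismatch usually points to a sign or indexing mistake in the backward pass.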

Delta Decomposition: Error × Slope

The error term δ at any neuron is the product of two ingredients:

δ = (raw error) × σ′(z)

| Ingredient | Question it answers | Depends on |
|---|---|---|
| Raw error (a − y) | "How wrong are we?" | The target y |
| Slope σ′(z) | "Is the neuron listening?" | The neuron's current state z or a |

For the sigmoid activation, the slope can be written directly in terms of the output:

σ′(z) = σ(z) · (1 − σ(z)) = a · (1 − a)

| Neuron output a | Slope a(1−a) | State |
|---|---|---|
| 0.50 | 0.25 (max) | "Listening": learns fast |
| 0.80 | 0.16 | Reduced sensitivity |
| 0.99 | 0.01 | "Locked": saturated, ignores gradient |
| 0.01 | 0.01 | "Locked": saturated, ignores gradient |

The slope is a property of the neuron's state, not of the error. A saturated neuron with output 0.99 and target 0.0 has a large raw error (a − y = 0.99), but its slope is only about 0.01, so the product δ ≈ 0.0098 is tiny: the loud "you're wrong!" signal hits earplugs. The simulator marks these neurons with a 🔒 icon.
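The slope values and the saturated-neuron example can be reproduced with a few lines of arithmetic. This sketch uses the raw-error convention (a − y) from the table above; the helper name `sigmoid_slope` is hypothetical.

```python
def sigmoid_slope(a):
    """Sigmoid slope written in terms of the output: sigma'(z) = a * (1 - a)."""
    return a * (1.0 - a)

# Slope at each output value listed in the table.
for a in (0.50, 0.80, 0.99, 0.01):
    print(f"a = {a:.2f}  slope = {sigmoid_slope(a):.4f}")

# Saturated neuron: output 0.99, target 0.0.
a, y = 0.99, 0.0
raw_error = a - y
delta = raw_error * sigmoid_slope(a)
print(f"raw error = {raw_error:.2f}, slope = {sigmoid_slope(a):.4f}, delta = {delta:.6f}")
```

Even though the raw error is close to its maximum, δ ends up about a hundred times smaller because the slope factor is so close to zero.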

Output Layer vs Hidden Layer Delta

The two layer types use the same factor structure but get the "raw error" from very different places:

| Layer | Raw error term | Where it comes from |
|---|---|---|
| Output | (a − y) | The loss function: direct comparison with the target. |
| Hidden | Σ w_next · δ_next | Weighted sum of the deltas from the layer ahead. |

So the full delta formulas are:

δ_output = (a − y) · σ′(z)
δ_hidden = (Σ w_next · δ_next) · σ′(z)

When you hover over a hidden-layer connection in the simulator, the math panel switches its label from "Raw Error (a−y)" to "Backprop Sum (Σwδ)" to reflect this difference. Hidden neurons learn by listening to weighted "complaints" from the neurons in front of them, not from any external target.
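The two delta formulas can be sketched for the 2-2-1 network as follows. This is an illustrative sketch under stated assumptions: sigmoid activations everywhere, loss L = ½ (a − y)², and hypothetical weights. With a single output neuron, the sum Σ w_next · δ_next for each hidden neuron has only one term.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W_h, b_h, w_o, b_o):
    """Return (hidden activations, output activation) of a 2-2-1 sigmoid network."""
    a_h = [sigmoid(sum(w * xi for w, xi in zip(W_h[j], x)) + b_h[j]) for j in range(2)]
    a_o = sigmoid(sum(w * a for w, a in zip(w_o, a_h)) + b_o)
    return a_h, a_o

def deltas(x, y, W_h, b_h, w_o, b_o):
    """Return (hidden deltas, output delta), assuming L = 0.5 * (a - y)^2."""
    a_h, a_o = forward(x, W_h, b_h, w_o, b_o)
    # Output layer: raw error comes straight from the target.
    d_o = (a_o - y) * a_o * (1.0 - a_o)
    # Hidden layer: raw error is the weighted delta from the layer ahead.
    d_h = [(w_o[j] * d_o) * a_h[j] * (1.0 - a_h[j]) for j in range(2)]
    return d_h, d_o

# Hypothetical weights and a single training row, for illustration only.
W_h = [[0.5, -0.3], [0.8, 0.2]]
b_h = [0.1, -0.1]
w_o = [1.0, -1.0]
b_o = 0.0
x, y = [1.0, 0.0], 1.0

d_h, d_o = deltas(x, y, W_h, b_h, w_o, b_o)
grad_w00 = d_h[0] * x[0]   # gradient for the weight from input 0 to hidden neuron 0
print("delta_output:", d_o)
print("delta_hidden:", d_h)
print("dL/dw (input 0 -> hidden 0):", grad_w00)
```

Note how the two hidden deltas have opposite signs here: the output weights are +1 and −1, so the same output "complaint" pushes the two hidden neurons in opposite directions.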

Vanishing Gradient and Saturation

Backpropagation multiplies in one σ′(z) factor per layer as gradients propagate backward. For sigmoid, σ′ ≤ 0.25 always, so unless large weights compensate, the gradient shrinks geometrically with depth and approaches zero after just a few layers. This is the vanishing gradient problem. Practical symptoms in the simulator:

  • Loss curve goes flat early at a high value.
  • One or more neurons display the 🔒 "locked" icon.
  • Weight updates are near zero even when the prediction is clearly wrong.
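To get a feel for how quickly the slope factors alone shrink a gradient, the sketch below computes the best-case product of n sigmoid slopes. It deliberately ignores the weight factors, which also multiply in and can make the product larger or smaller; the function name is hypothetical.

```python
MAX_SIGMOID_SLOPE = 0.25  # a * (1 - a) peaks at a = 0.5

def slope_bound(n_layers):
    """Upper bound on the product of n sigmoid slope factors."""
    return MAX_SIGMOID_SLOPE ** n_layers

for n in (1, 2, 5, 10):
    print(f"{n:2d} layers: slope product <= {slope_bound(n):.2e}")
```

This is the most favorable case. Any neuron away from a = 0.5 contributes a smaller factor, so real products are typically well below the bound.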

ReLU mostly avoids this by having slope 1 for positive inputs (no shrinkage), but introduces its own failure mode: a neuron whose pre-activation is always negative outputs zero, has slope zero, and never recovers — the dying ReLU problem.
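A dead ReLU neuron can be illustrated with a short sketch. The weights, bias, and incoming backprop sum below are hypothetical; they are chosen so that the pre-activation is negative on every row of a two-input truth table.

```python
def relu(z):
    return max(0.0, z)

def relu_slope(z):
    """Slope of ReLU: 1 for z > 0, otherwise 0 (the value at z = 0 is a convention)."""
    return 1.0 if z > 0 else 0.0

# Hypothetical neuron whose pre-activation is negative for every input row.
weights, bias = [-1.0, -1.0], -0.5
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]

backprop_sum = 0.8  # hypothetical weighted delta arriving from the next layer
grads = []
for x in rows:
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    delta = backprop_sum * relu_slope(z)      # zero whenever z <= 0
    grads.append([delta * xi for xi in x])    # dL/dw = delta * input
    print(f"x = {x}  z = {z:+.1f}  a = {relu(z):.1f}  delta = {delta:.1f}")

print("weight gradients per row:", grads)
```

Every gradient is zero regardless of how large the incoming error is, so gradient descent never changes this neuron's weights or bias and it stays inactive.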

Cross-Entropy × Sigmoid Cancellation

When you switch the loss to Cross-Entropy and keep sigmoid on the output, a useful algebraic cancellation happens. The output delta simplifies dramatically:

δ_output = (a − y)

The σ′(z) = a(1 − a) factor that would normally cause vanishing gradients at saturated outputs is cancelled by the cross-entropy derivative ∂L/∂a = (a − y) / (a(1 − a)), whose denominator is exactly that slope. This is why classification networks almost always pair softmax/sigmoid with cross-entropy loss instead of MSE.

See it in the simulator: set Loss = MSE with a saturated output and watch the loss curve flatten. Switch to Cross-Entropy and the same configuration starts learning again immediately, even though the network and weights are identical.
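The same comparison can be made numerically. The sketch below evaluates the output delta under both losses for a saturated output, and checks the cross-entropy result against a finite-difference derivative of the loss with respect to z. It assumes binary cross-entropy L = −[y ln a + (1 − y) ln(1 − a)] and, for MSE, L = ½ (a − y)²; the values of z and y are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(z, y):
    """Binary cross-entropy of a sigmoid output, written as a function of z."""
    a = sigmoid(z)
    return -(y * math.log(a) + (1.0 - y) * math.log(1.0 - a))

# Hypothetical saturated output: a is about 0.993 while the target is 0.
z, y = 5.0, 0.0
a = sigmoid(z)

delta_mse = (a - y) * a * (1.0 - a)   # slope factor still present
delta_ce = a - y                      # slope factor cancelled

# Independent check: dL/dz of cross-entropy by central finite difference.
eps = 1e-6
delta_ce_numeric = (bce(z + eps, y) - bce(z - eps, y)) / (2 * eps)

print(f"a = {a:.4f}")
print(f"delta (MSE)           = {delta_mse:.6f}")
print(f"delta (cross-entropy) = {delta_ce:.6f}")
print(f"finite-difference     = {delta_ce_numeric:.6f}")
```

With the same output and target, the cross-entropy delta is over a hundred times larger than the MSE delta, which is why learning resumes after switching losses.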

Simulation

The interactive simulator is below. Pick an AND or XOR scenario, hit Play, and watch the loss curve descend. Click on any connection to inspect the chain-rule arithmetic for that specific weight, and use Step to walk one epoch at a time.

[Interactive simulator. Panels: Controls (Parameters, Scenarios, Stats), Network View, Chain Rule, Loss Calculation, Loss vs. Epoch, Activations (a), Deltas (δ), Weights (w).]

Parameters

| Control | Range | Effect |
|---|---|---|
| Learning Rate η | 0.01–1.0 | Step size for weight updates: w ← w − η · ∂L/∂w. Higher = faster but may overshoot. |
| Animation Speed | 0.01–1.0 | Speed of the green gradient packets and curved arrows. Cosmetic only. |
| Activation | Sigmoid / ReLU | Sigmoid is smooth and bounded but saturates; ReLU is fast but can die. |
| Loss | MSE / Cross-Entropy | Cross-Entropy with sigmoid cancels the slope factor at the output, preventing vanishing gradient there. |
| Weight Init (±) | magnitude v | Random initial weights uniform in [−v, +v]. |
| Bias Init | scalar | Starting bias for all non-input neurons. |
| Lock Threshold | 0–1 | If a neuron's mean absolute slope (the magnitude of σ′) falls below this, it's flagged with 🔒. Higher values flag more neurons. |
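Three of these controls map onto very small pieces of logic, sketched below. This is not the simulator's code: the function names are hypothetical, and the sketch assumes the lock check averages the absolute slope over the training rows, which the description above does not spell out.

```python
import random

def init_weights(n, v, rng):
    """Draw n weights uniformly from [-v, +v] (the Weight Init control)."""
    return [rng.uniform(-v, v) for _ in range(n)]

def update(w, grad, lr):
    """One gradient-descent step: w <- w - lr * dL/dw (the Learning Rate control)."""
    return w - lr * grad

def is_locked(slopes, threshold):
    """Flag a neuron whose mean absolute slope is below the Lock Threshold.

    Assumption: the mean is taken over one slope value per training row.
    """
    mean_abs = sum(abs(s) for s in slopes) / len(slopes)
    return mean_abs < threshold

rng = random.Random(0)                 # fixed seed so the run is repeatable
weights = init_weights(6, 0.5, rng)    # hypothetical magnitude v = 0.5
print("initial weights:", weights)
print("after one step: ", update(1.0, 0.5, 0.1))
print("saturated neuron locked?", is_locked([0.0099] * 4, 0.05))
print("healthy neuron locked?  ", is_locked([0.25, 0.2, 0.2, 0.25], 0.05))
```

Raising the threshold in `is_locked` flags more neurons, since more mean slopes fall below it.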

Scenarios

| Scenario | Separability | Typical convergence |
|---|---|---|
| AND gate | Linearly separable | ~50 epochs; converges with or without the hidden layer. |
| XOR gate | Not linearly separable | Hundreds of epochs; demonstrates why the hidden layer is required at all. |
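For readers who want to experiment outside the simulator, here is a minimal full-batch training loop for both scenarios. It is a sketch under stated assumptions: sigmoid activations, loss L = ½ (a − y)², and fixed hypothetical starting weights. Epoch counts depend heavily on initialization, learning rate, and loss, so the numbers it prints will not necessarily match the typical figures in the table.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

DATASETS = {
    "AND": [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)],
    "XOR": [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)],
}

def train(data, epochs, lr):
    """Full-batch gradient descent on a 2-2-1 sigmoid network.

    Returns the mean loss recorded at the start of each epoch.
    """
    # Hypothetical fixed, asymmetric starting weights so runs are repeatable.
    W_h = [[0.5, -0.4], [-0.3, 0.6]]
    b_h = [0.0, 0.0]
    w_o = [0.7, -0.6]
    b_o = 0.0
    n = len(data)
    history = []
    for _ in range(epochs):
        gW_h = [[0.0, 0.0], [0.0, 0.0]]
        gb_h = [0.0, 0.0]
        gw_o = [0.0, 0.0]
        gb_o = 0.0
        total = 0.0
        for x, y in data:
            # Forward pass.
            a_h = [sigmoid(W_h[j][0] * x[0] + W_h[j][1] * x[1] + b_h[j]) for j in range(2)]
            a_o = sigmoid(w_o[0] * a_h[0] + w_o[1] * a_h[1] + b_o)
            total += 0.5 * (a_o - y) ** 2
            # Backward pass: output delta, then hidden deltas.
            d_o = (a_o - y) * a_o * (1.0 - a_o)
            d_h = [w_o[j] * d_o * a_h[j] * (1.0 - a_h[j]) for j in range(2)]
            # Accumulate gradients: delta at target times activation at source.
            gb_o += d_o
            for j in range(2):
                gw_o[j] += d_o * a_h[j]
                gb_h[j] += d_h[j]
                for i in range(2):
                    gW_h[j][i] += d_h[j] * x[i]
        history.append(total / n)
        # Update every parameter with the mean gradient over the batch.
        b_o -= lr * gb_o / n
        for j in range(2):
            w_o[j] -= lr * gw_o[j] / n
            b_h[j] -= lr * gb_h[j] / n
            for i in range(2):
                W_h[j][i] -= lr * gW_h[j][i] / n
    return history

for name, data in DATASETS.items():
    losses = train(data, epochs=2000, lr=0.5)
    print(f"{name}: loss {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Gradients are accumulated over all four rows before any weight changes, which is what makes this full-batch rather than stochastic training.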

Interactions

  • Click on connections: see the exact chain-rule calculation for that weight. The selection persists so you can track one weight through training.
  • Watch the green packets: they show gradients flowing backward during backpropagation.
  • Observe neuron states: each neuron shows its current activation a and delta δ. Saturated neurons get a 🔒 icon.
  • Adjust Lock Threshold: tune the sensitivity for what counts as "locked".

Limitations

This is an educational visualizer, not a production training tool. Notable simplifications:

  • The 2-2-1 architecture is the minimum that can solve XOR. Real networks have far more layers and units, and techniques common in deep-network training (skip connections, batch norm, momentum) are not modeled.
  • Only two activation functions (sigmoid, ReLU) and two losses (MSE, Cross-Entropy). No softmax, no regularization, no dropout.
  • Training is full-batch on a 4-row truth table; mini-batch and SGD noise are absent.
  • The "Lock Threshold" is a UI heuristic for visualizing saturation, not a training-time intervention — real training would still attempt to update those weights.