
Backpropagation Tutorial 

This interactive tutorial visualizes the backpropagation algorithm in a "glass-box" neural network. Watch data and gradients flow through the network as animated packets, see the chain rule calculations in real-time, and explore critical concepts like vanishing gradients and dying ReLU.

What You See

Network View: A 2-2-1 neural network (2 inputs, 2 hidden neurons, 1 output) with weighted connections. Edges change thickness based on weight magnitude and color (blue=positive, red=negative). Green packets show gradients flowing backward during backpropagation. Click on any connection to see the chain rule breakdown.

Loss Chart: Real-time plot of training loss vs. epoch with x-axis tick marks showing epoch numbers. Watch how different problems (AND gate vs. XOR gate) affect convergence. Lock icons (🔒) appear on the chart when neurons become locked during training.

Chain Rule Panel: When you click on a connection, this panel shows the exact gradient calculation: ∂L/∂w = δ × a, where δ is the error term at the target neuron and a is the activation at the source neuron. The selection persists so you can track a specific weight during training.

Key Concepts

Forward Pass: Inputs propagate through the network. Each neuron computes z = Σ(w·a) + b, then applies the activation function a = σ(z).
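As a concrete sketch with made-up weights and inputs (not values from the simulation), one forward step for a single sigmoid neuron looks like:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron_forward(weights, inputs, bias):
    # z = Σ(w·a) + b, then a = σ(z)
    z = sum(w * a for w, a in zip(weights, inputs)) + bias
    return z, sigmoid(z)

# Hypothetical values: two inputs feeding one hidden neuron
z, a = neuron_forward([0.5, -0.3], [1.0, 0.0], 0.1)  # z = 0.6
```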

Backward Pass: Gradients flow backward from the output. The output delta is computed from the loss derivative, then propagated to hidden layers using the chain rule: δ_hidden = (Σ δ_next × w) × σ'(z).
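Continuing the sketch with hypothetical numbers, the backward pass first computes the output delta from the loss derivative (MSE here), then pushes it one layer back through the chain rule:

```python
def sigmoid_prime_from_a(a):
    # For sigmoid, σ'(z) = a × (1 - a), computable from the output alone
    return a * (1.0 - a)

# Output layer: delta from the loss derivative (MSE)
a_out, y = 0.8, 1.0
delta_out = (a_out - y) * sigmoid_prime_from_a(a_out)   # (a - y) × σ'(z)

# Hidden layer: delta from the layer ahead, weighted, times the local slope
a_hidden, w_out = 0.6, 0.9
delta_hidden = (delta_out * w_out) * sigmoid_prime_from_a(a_hidden)
```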

Understanding the Delta Calculation

Important: The Slope (σ') is NOT a function of the error. The Slope is a function of the Neuron's Current Value (z or a). It doesn't care if the answer is "Right" or "Wrong" (the Error). It only cares if the neuron is "Active" or "Saturated."

1. The Raw Error (a-y):

  • What is it? This measures "How wrong are we?"
  • Depends on: The Target (y).
  • Intuition: If you predicted 0.8 and the target is 1.0, your error is -0.2. This is the "Message" we want to send back to the network.

2. The Slope (σ'):

  • What is it? This measures "Is the neuron listening?" or "Is the neuron sensitive to change?"
  • Depends on: The Input Sum (z) or Activation (a).
  • Math: For Sigmoid, the slope is calculated as σ'(z) = σ(z) × (1 - σ(z)) = a × (1 - a).
  • Intuition:
    • If the neuron outputs 0.5 (middle of curve), the slope is high (0.25). The neuron is "listening."
    • If the neuron outputs 0.99 (saturated), the slope is tiny (0.01). The neuron is "deaf" (locked).

Why this distinction matters (The "Aha!" Moment):

In the "Vanishing Gradient" demo, the Error might be huge (e.g., prediction 0.99 vs target 0.0), but if the Slope is zero, the gradient dies.

  • Error: "HEY! CHANGE!" (Loud yelling)
  • Slope: 0.0 (Earplugs in)
  • Result: No update.

So, the Slope is purely a physical property of the neuron's current state, independent of the correct answer.

Technical Note: Sigmoid has a special property: its derivative can be computed from the output (a) alone, without re-using the pre-activation sum (z): σ'(z) = a × (1 - a). This means:

  • a (Activation): "What is the current signal strength?" (e.g., 0.9)
  • σ' (Slope): "If I nudge the input sum slightly, how much will a change?"

If a is 0.9 (very high), the Slope is 0.09 (very low). This means even though the signal is strong (a), the neuron is "maxed out" and won't change much if you push it further. This is the saturation point.

Output Layer vs. Hidden Layer Delta Calculation

Why the Delta Calculation row behaves differently:

The reason the "Delta Calculation" row shows different information (or disappears) for hidden layers is that the math changes completely once you move away from the output layer.

1. Output Layer (Simple):

The error comes directly from the "Outside World" (the Loss Function).

Formula: δ_output = (a - y) × σ'(z)

This fits perfectly in the 3-box layout: Raw Error × Slope = Delta

2. Hidden Layer (Complex):

The error does not come from a target. It comes from backpropagating the error from the layer ahead.

Formula: δ_hidden = (Σ w × δ_next) × σ'(z)

Because the hidden delta depends on a Sum of Weighted Errors from the next layer (not just a simple subtraction like a-y), it requires a different visualization. When you hover over a hidden layer connection, you'll see "Backprop Sum (Σwδ)" instead of "Raw Error (a-y)", showing how hidden neurons learn by listening to weighted complaints from the layers ahead of them.
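A minimal sketch of the "Backprop Sum" with invented numbers. For illustration only, this assumes a hidden neuron feeding two downstream neurons (a larger layout than the tutorial's single output):

```python
def slope(a):
    # Sigmoid slope from the neuron's output
    return a * (1.0 - a)

# Hypothetical deltas in the next layer and the weights leading to them
deltas_next = [-0.05, 0.02]
weights_to_next = [0.7, -0.4]
a_hidden = 0.6

# Σ w × δ_next: the "weighted complaints" from the layer ahead
backprop_sum = sum(w * d for w, d in zip(weights_to_next, deltas_next))
delta_hidden = backprop_sum * slope(a_hidden)
```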

Lock Icons: When neurons become "locked" (saturated with very small activation derivatives), a 🔒 icon appears on the neuron in the network view and on the loss chart at the epoch where the lock occurred. You can adjust the sensitivity using the "Lock Threshold" slider.

Cross-Entropy Magic: When using cross-entropy loss with sigmoid, the derivative simplifies to (a - y), canceling out the sigmoid derivative. This prevents vanishing gradients even when neurons are saturated!
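The cancellation can be checked numerically. With a saturated, badly wrong sigmoid output (values invented for illustration):

```python
a, y = 0.99, 0.0   # prediction near 1, target 0: large error, saturated neuron

# MSE: the sigmoid slope multiplies in and crushes the gradient
mse_delta = (a - y) * a * (1.0 - a)   # tiny, despite the huge error

# Cross-entropy + sigmoid: dL/dz simplifies to (a - y); the slope cancels
ce_delta = a - y                      # full-strength learning signal
```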

Controls

The control panel contains Parameters, Scenarios, and live Stats (current epoch, loss, and learning rate). The main view shows the Network View, the Chain Rule panel (populated when you select a connection), the Loss Calculation (average loss across all training samples), the Loss vs. Epoch chart, and tables of Activations (a), Deltas (δ), and Weights (w).

Parameters

  • Learning Rate: Controls the step size for weight updates. Higher values learn faster but may overshoot. Lower values are more stable but slower.
  • Animation Speed: Controls the speed of backpropagation animations (green packets and curved arrows). Lower values show slower, more detailed animations.
  • Activation Function: Sigmoid (smooth, bounded) or ReLU (fast, unbounded). Sigmoid can saturate and cause vanishing gradients. ReLU can "die" if inputs are always negative.
  • Loss Function: MSE (Mean Squared Error) or Cross-Entropy. Cross-Entropy with sigmoid has a special property: the gradient simplifies to (a - y), preventing vanishing gradients.
  • Weight Init (±): Controls the range for random weight initialization. Weights will be between -value and +value.
  • Bias Init: Sets the initial bias value for all neurons (except input layer).
  • Lock Threshold: Sensitivity threshold for detecting locked neurons. If a neuron's average activation derivative is below this value, it's considered "locked" and displays a 🔒 icon.
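The Learning Rate parameter corresponds to the standard gradient-descent update; a one-line sketch with invented numbers:

```python
lr = 0.10                  # learning rate from the slider
w, grad = 0.5, -0.032      # a weight and its gradient ∂L/∂w = δ × a
w_new = w - lr * grad      # step against the gradient
```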

Scenarios

  • Reset (AND): Fast! The AND gate problem is linearly separable, so the network can solve it very quickly (often in fewer than 50 epochs). This demonstrates how easy problems converge rapidly compared to non-linear problems.
  • Reset (XOR): The XOR problem is non-linearly separable, requiring the hidden layer to learn a non-linear transformation. This takes longer to converge and demonstrates the power of multi-layer networks.
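The two scenarios presumably train on the standard truth tables. XOR's labels cannot be split by any single straight line through the four input points, which is why it needs the hidden layer:

```python
# Truth tables for the two scenarios (inputs -> target)
AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # linearly separable
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # not separable
```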

Interactions

  • Click on connections: See the exact chain rule calculation for that weight. The gradient is shown as ∂L/∂w = δ × a, where δ is the error term and a is the source activation. The selection persists so you can track a specific weight during training.
  • Watch the animations: Green packets and curved yellow arrows show gradients flowing backward during backpropagation. This visualizes how errors propagate from the output back through the network.
  • Observe neuron states: Each neuron shows its activation (a) and delta (δ). When a neuron is "locked" (saturated with very small activation derivatives), a 🔒 icon appears on the neuron and on the loss chart.
  • Adjust Lock Threshold: Use the "Lock Threshold" slider to control the sensitivity for detecting locked neurons. Lower values detect more locked neurons, higher values are more lenient.

Educational Value

This simulation makes abstract concepts concrete:

  • The Chain Rule: See exactly how gradients are computed and propagated backward. Click on any connection to see the detailed breakdown.
  • Linear vs. Non-Linear Problems: Compare how quickly the AND gate (linearly separable) converges versus the XOR gate (non-linearly separable).
  • Weight Initialization: Adjust the "Weight Init" and "Bias Init" sliders to see how initialization affects training. Use the "Re-Initialize Weights" button to try different random starting points.
  • Activation Functions: Compare sigmoid (smooth but can saturate) vs. ReLU (fast but can die). Switch between them to see how they affect learning.
  • Loss Functions: See how cross-entropy with sigmoid creates a "magic cancellation" that prevents vanishing gradients.
  • Loss Tracking: The Loss vs. Epoch chart with x-axis labels shows how training progresses over time. Lock icons mark epochs where neurons became locked.
  • Neuron Locking: Watch for 🔒 icons that appear when neurons become saturated. Adjust the "Lock Threshold" slider to control detection sensitivity.

NOTE: This tutorial uses a minimal 2-2-1 architecture to keep the visualization clear. Real networks are much larger, but the same principles apply. The concepts of forward pass, backward pass, chain rule, and gradient flow are universal.