This tutorial visualizes a Multi-Layer Perceptron with the minimum architecture that can solve a non-linearly separable problem: 2 inputs → 2 hidden neurons → 1 output (the 2-2-1 network). It demonstrates that a hidden layer is what unlocks problems like XOR, which a single perceptron provably cannot solve.
The network connections are color-coded (green = positive weight, red = negative) with thickness proportional to magnitude. You can drive a forward pass by setting the binary inputs, or click Train to run backpropagation with momentum and watch the weights converge in real time.
NOTE: Refer to this note for the underlying theory.
Mathematical Foundation
Each neuron computes a weighted sum of its inputs and passes the result through an activation function:
z = Σ wi · xi + b → a = σ(z)
For the 2-2-1 network this expands to three neuron computations, two in the hidden layer and one in the output layer, each with its own bias:
h1 = σ(w11·x1 + w21·x2 + b1)
h2 = σ(w12·x1 + w22·x2 + b2)
y = σ(wh1·h1 + wh2·h2 + b3)
That's 6 weights plus 3 biases. The network is trained by adjusting them to minimize the squared error between y and the target.
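The forward pass above can be evaluated directly. The following is a minimal sketch, not the simulator's code: the parameter values are hand-picked for illustration (they lie outside the simulator's [−2, +2] slider range), and the bias names b1, b2, b3 are assumptions that follow the general neuron formula z = Σ wi · xi + b.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical hand-picked parameters (an assumption, not taken from the simulator).
# h1 behaves like OR, h2 like NAND, and the output neuron like AND of the two.
P = {
    "w11": 20.0, "w21": 20.0, "b1": -10.0,
    "w12": -20.0, "w22": -20.0, "b2": 30.0,
    "wh1": 20.0, "wh2": 20.0, "b3": -30.0,
}

def forward(x1, x2, p=P):
    """Return (h1, h2, y) for one forward pass through the 2-2-1 network."""
    h1 = sigmoid(p["w11"] * x1 + p["w21"] * x2 + p["b1"])
    h2 = sigmoid(p["w12"] * x1 + p["w22"] * x2 + p["b2"])
    y = sigmoid(p["wh1"] * h1 + p["wh2"] * h2 + p["b3"])
    return h1, h2, y

if __name__ == "__main__":
    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        h1, h2, y = forward(x1, x2)
        print(x1, x2, round(h1, 3), round(h2, 3), round(y, 3))
```

Thresholding y at 0.5 reproduces the XOR column of the truth table for these particular values; a trained network will generally land on different weights.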
Why XOR Needs a Hidden Layer
A single-layer perceptron can only draw linear decision boundaries. XOR's truth table is:
x1 | x2 | AND | OR | XOR |
0 | 0 | 0 | 0 | 0 |
0 | 1 | 0 | 1 | 1 |
1 | 0 | 0 | 1 | 1 |
1 | 1 | 1 | 1 | 0 |
For AND and OR, a single straight line separates the 0s from the 1s, so a perceptron can solve them. For XOR no such line exists: the 1s sit at opposite corners, (0,1) and (1,0), so any line that puts both on one side also takes in at least one of the 0s at (0,0) or (1,1). The hidden layer transforms the inputs into a new representation in which a linear separator does work, and the output layer then draws that line.
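The linear-separability argument can be checked empirically. The sketch below uses the classic perceptron learning rule with a hard threshold at 0, which is a swapped-in technique for illustration and not the simulator's training method; the function name and epoch limit are hypothetical.

```python
INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
GATES = {
    "AND": [0, 0, 0, 1],
    "OR": [0, 1, 1, 1],
    "XOR": [0, 1, 1, 0],
}

def perceptron_learns(targets, max_epochs=100):
    """Train a single threshold neuron; return True if one full pass has no mistakes."""
    w1 = w2 = b = 0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), t in zip(INPUTS, targets):
            y = 1 if w1 * x1 + w2 * x2 + b > 0 else 0
            err = t - y  # -1, 0, or +1
            if err != 0:
                # Perceptron rule: nudge the line toward the misclassified point.
                w1 += err * x1
                w2 += err * x2
                b += err
                mistakes += 1
        if mistakes == 0:
            return True
    return False

if __name__ == "__main__":
    for name, targets in GATES.items():
        print(name, perceptron_learns(targets))
```

An error-free pass is reachable for AND and OR because a separating line exists. For XOR the function always returns False, however large max_epochs is made, since no setting of w1, w2, b classifies all four rows correctly.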
Training with Momentum
Standard gradient descent updates each weight by:
Δw = −η · ∂E/∂w
With momentum, the update also carries a fraction of the previous step's velocity:
v ← μ · v − η · ∂E/∂w → w ← w + v
The momentum coefficient μ ≈ 0.9 keeps weights moving through flat regions of the error surface where standard gradient descent would stall. This matters a great deal for the 2-2-1 XOR case: that architecture sits right at the minimum capacity needed, so the error surface is full of plateaus and saddle points.
Why momentum is essential here: without it, training on XOR routinely plateaus around error 0.4–0.5 and never converges. With μ = 0.9 and wider initialization ([−2.0, +2.0] instead of [−0.5, +0.5]), the network usually finds a solution within a few thousand iterations, though not always, since some starting weights still lead to a stall.
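The momentum update can be sketched in isolation on a single scalar weight. The helper names and the constant "plateau" gradient of 0.01 are illustrative assumptions, chosen only to show how velocity accumulates where the gradient is small.

```python
def momentum_step(w, v, grad, lr=0.5, mu=0.9):
    """One update: v <- mu*v - lr*grad, then w <- w + v. Returns (w, v)."""
    v = mu * v - lr * grad
    w = w + v
    return w, v

def distance_on_plateau(steps, grad=0.01, lr=0.5, mu=0.9):
    """Distance a weight travels under a small constant gradient (toy plateau)."""
    w, v = 0.0, 0.0
    for _ in range(steps):
        w, v = momentum_step(w, v, grad, lr, mu)
    return abs(w)

if __name__ == "__main__":
    print("plain gradient descent:", distance_on_plateau(100, mu=0.0))
    print("with momentum 0.9:    ", distance_on_plateau(100, mu=0.9))
```

Under a constant gradient the velocity settles at η · ∂E/∂w / (1 − μ), so μ = 0.9 gives a steady-state step ten times larger than plain gradient descent. That is the mechanism that carries the weights across flat regions.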
Simulation
The interactive simulator is below. Pick a gate (XOR by default), click Train, and watch the iteration count rise and the error drop. If training stalls (which sometimes happens with XOR), hit Randomize to try a different starting point.
Parameters
Parameter | Range / default | Effect |
Activation function | Sigmoid / Tanh / Step | Sigmoid & Tanh are trainable. Step cannot be trained (slope = 0) but is shown for comparison. |
Gate preset | XOR / AND / OR / NAND / NOR | Selects the target truth table. XOR is the only one that needs the hidden layer. |
Network weights w | 6 sliders, range [−2, +2] | Manually editable for hidden layer (4) and output layer (2). Wide range helps break symmetry. |
Inputs x1, x2 | 0 or 1 (checkboxes) | Drives a single forward pass. |
Learning rate η | default 0.5 | Step size for weight updates. Stable at 0.5 when paired with momentum. |
Momentum μ | 0–0.99, default 0.9 | Velocity carry-over coefficient. Essential for 2-2-1 XOR. |
Train update speed | seconds per step, default 0.1 | Delay between training iterations — visualization speed only. |
Button | Effect |
Reset | Stops training, randomizes weights and biases in [−2, +2], clears inputs to [0,0], wipes velocity history. |
Randomize | Stops training, randomizes weights and inputs, wipes velocity. Use this when training is stuck. |
Train | Runs backpropagation with momentum. Cycles through the 4 truth-table rows, updates weights after each. Stops automatically when all decisions are correct. |
Test | Cycles through all 4 inputs with the current weights. Reports "Test PASS" or "Test FAIL". |
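The Activation function row in the parameter table says that Step cannot be trained because its slope is 0. The sketch below compares the three derivatives that backpropagation would multiply by. The step threshold at 0, and treating the undefined derivative at the jump as 0, are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - math.tanh(z) ** 2

def step(z):
    # Assumed threshold at 0; the simulator's exact convention is not stated.
    return 1.0 if z > 0 else 0.0

def d_step(z):
    # Flat on both sides of the jump, so the gradient signal is always zero.
    return 0.0

if __name__ == "__main__":
    for z in (-2.0, 0.0, 2.0):
        print(z, round(d_sigmoid(z), 4), round(d_tanh(z), 4), d_step(z))
```

Because every weight gradient contains the activation derivative as a factor, a derivative of zero means backpropagation never changes the weights.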
Tips on Implementation
Several lessons learned from building a reliable XOR trainer at this minimal architecture:
- XOR is at the architecture limit. 2-2-1 is the theoretical minimum that can solve XOR. The error surface has flat plateaus, saddle points, and local minima that vanilla gradient descent often cannot escape.
- Use momentum. Without it, training plateaus at error ~0.4–0.5 within hundreds of iterations and never converges. With μ = 0.9 it usually converges within a few thousand iterations.
- Initialize wide. The range [−2, +2] starts the hidden neurons with clearly different responses, which helps break symmetry; [−0.5, +0.5] often leaves every output stuck near 0.5 from the start.
- Online (per-example) updates are used here for visual feedback. Batch updates would be smoother but less educational.
- Reset velocity on restart. Stale momentum from a previous training session can push the network off in a bad direction.
- Convergence indicators: error drops below 0.1 and all 4 decision outputs become correct. If you see neither after a few thousand iterations, hit Randomize and try again; the starting point matters.
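The tips above can be combined into a small trainer. This is a minimal sketch under stated assumptions, not the simulator's actual code: it uses sigmoid units, one bias per neuron, squared error E = ½(y − t)², online updates, a fresh velocity on every restart, and a hypothetical restart loop standing in for the Randomize button. All function names and the epoch and seed limits are made up for illustration.

```python
import math
import random

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def forward(p, x1, x2):
    w11, w21, w12, w22, b1, b2, wh1, wh2, b3 = p
    h1 = sigmoid(w11 * x1 + w21 * x2 + b1)
    h2 = sigmoid(w12 * x1 + w22 * x2 + b2)
    y = sigmoid(wh1 * h1 + wh2 * h2 + b3)
    return h1, h2, y

def gradients(p, x1, x2, t):
    """Backpropagated dE/dp for E = 0.5 * (y - t)^2, in the same order as p."""
    wh1, wh2 = p[6], p[7]
    h1, h2, y = forward(p, x1, x2)
    d_out = (y - t) * y * (1.0 - y)
    d_h1 = d_out * wh1 * h1 * (1.0 - h1)
    d_h2 = d_out * wh2 * h2 * (1.0 - h2)
    return [d_h1 * x1, d_h1 * x2, d_h2 * x1, d_h2 * x2,
            d_h1, d_h2, d_out * h1, d_out * h2, d_out]

def all_correct(p, data):
    return all((forward(p, x1, x2)[2] > 0.5) == bool(t) for (x1, x2), t in data)

def train(p, data, lr=0.5, mu=0.9, max_epochs=3000):
    """Online backprop with momentum. Returns the epoch count, or None if stuck."""
    v = [0.0] * len(p)  # fresh velocity: no stale momentum from earlier runs
    for epoch in range(1, max_epochs + 1):
        for (x1, x2), t in data:  # update after every truth-table row
            g = gradients(p, x1, x2, t)
            for i in range(len(p)):
                v[i] = mu * v[i] - lr * g[i]
                p[i] += v[i]
        if all_correct(p, data):
            return epoch
    return None

def train_with_restarts(data, seeds=range(10), spread=2.0):
    """Re-initialize in [-spread, +spread] until a run converges (like Randomize)."""
    for seed in seeds:
        rng = random.Random(seed)
        p = [rng.uniform(-spread, spread) for _ in range(9)]
        epochs = train(p, data)
        if epochs is not None:
            return p, seed, epochs
    return None, None, None

if __name__ == "__main__":
    p, seed, epochs = train_with_restarts(XOR)
    print("no seed converged" if p is None else f"seed {seed}: {epochs} epochs")
```

The stopping rule mirrors the Train button's description (stop when all four decisions are correct). The sketch makes no claim about how many restarts are needed, since convergence from any given seed is not guaranteed.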
Limitations
- Fixed 2-2-1 architecture; no way to add neurons or layers in this simulator. See MLP II and MLP III for wider/deeper variants.
- Truth-table problems only (4 input combinations). No continuous data, no real-world classification.
- Online stochastic updates only, no mini-batch.
- Pure backpropagation with momentum — no adaptive optimizers (Adam, RMSprop) or regularization.
- Convergence is not guaranteed. Even with momentum + wide init, XOR at this minimum architecture occasionally fails to converge from unlucky starting weights. Use Randomize if so.