MLP (Multi-Layer Perceptron) I

This tutorial visualizes a Multi-Layer Perceptron with the minimum architecture that can solve a non-linearly separable problem: 2 inputs → 2 hidden neurons → 1 output (the 2-2-1 network). It demonstrates that a hidden layer is what unlocks problems like XOR, which a single perceptron provably cannot solve.

The network connections are color-coded (green = positive weight, red = negative) with thickness proportional to magnitude. You can drive a forward pass by setting the binary inputs, or click Train to run backpropagation with momentum and watch the weights converge in real time.

NOTE: Refer to this note for the underlying theory.

Mathematical Foundation

Each neuron computes a weighted sum of its inputs and passes the result through an activation function:

z = Σᵢ wᵢ · xᵢ + b    →    a = σ(z)

For the 2-2-1 network this expands to three sequential layer computations:

h₁ = σ(w₁₁·x₁ + w₂₁·x₂)
h₂ = σ(w₁₂·x₁ + w₂₂·x₂)
y  = σ(wₕ₁·h₁ + wₕ₂·h₂)

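That is 6 weights in total; the per-neuron bias b from the general formula is not written out in this expansion. The network is trained by adjusting these parameters to minimize the squared error between y and the target.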
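The three layer computations can be sketched in a few lines of Python. This is a minimal illustration, not the simulator's own code: the function names, the tuple layout of the weights, the optional bias arguments (defaulting to 0 so the result matches the expansion as written), and the factor 0.5 in the error are all assumptions.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic activation: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, w_hidden, w_out, b_hidden=(0.0, 0.0), b_out=0.0):
    """Forward pass through a 2-2-1 network.

    w_hidden = (w11, w21, w12, w22)  # input -> hidden weights
    w_out    = (wh1, wh2)            # hidden -> output weights
    Biases are optional (assumed layout); with the defaults they vanish.
    """
    w11, w21, w12, w22 = w_hidden
    wh1, wh2 = w_out
    h1 = sigmoid(w11 * x1 + w21 * x2 + b_hidden[0])
    h2 = sigmoid(w12 * x1 + w22 * x2 + b_hidden[1])
    y = sigmoid(wh1 * h1 + wh2 * h2 + b_out)
    return h1, h2, y

def squared_error(y: float, target: float) -> float:
    """Squared error for one example (the 0.5 factor is a common convention, assumed here)."""
    return 0.5 * (target - y) ** 2

h1, h2, y = forward(0, 0, w_hidden=(1, 1, 1, 1), w_out=(1, 1))
print(h1, h2, y, squared_error(y, target=0))
```

With both inputs at 0 and no biases, each hidden neuron outputs exactly 0.5 whatever its weights are, which is one reason the bias terms matter in practice.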

Why XOR Needs a Hidden Layer

A single-layer perceptron can only draw linear decision boundaries. The truth tables for AND, OR, and XOR are:

x1 | x2 | AND | OR | XOR
---|----|-----|----|----
0  | 0  | 0   | 0  | 0
0  | 1  | 0   | 1  | 1
1  | 0  | 0   | 1  | 1
1  | 1  | 1   | 1  | 0

For AND and OR, a single straight line separates the 0s from the 1s, so a perceptron can solve them. For XOR no such line exists: the 1s sit at opposite corners, (0,1) and (1,0), and any line that puts both of them on one side also puts at least one of the 0s on that side. The hidden layer transforms the inputs into a new representation in which a linear separator does work, and the output layer then draws that line.
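The separability argument can be checked numerically. The sketch below searches a coarse grid of lines (w1, w2, b) for one that classifies all four points, first on the raw inputs and then on a hand-picked hidden representation. The hidden features used here (h1 = OR, h2 = AND, built from step units with biases) are an illustrative assumption, not the weights the simulator learns, and `separable` is a hypothetical helper.

```python
from itertools import product

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND = [0, 0, 0, 1]
OR = [0, 1, 1, 1]
XOR = [0, 1, 1, 0]

def step(z):
    """Threshold unit: fires when the weighted sum is positive."""
    return 1 if z > 0 else 0

def separable(points, labels):
    """Return some (w1, w2, b) on a coarse grid that classifies every point, else None."""
    grid = [k / 2 for k in range(-6, 7)]  # -3.0 ... 3.0 in steps of 0.5
    for w1, w2, b in product(grid, repeat=3):
        if all(step(w1 * p + w2 * q + b) == t
               for (p, q), t in zip(points, labels)):
            return (w1, w2, b)
    return None

# Hand-picked hidden layer (assumption): h1 computes OR, h2 computes AND.
hidden = [(step(x1 + x2 - 0.5), step(x1 + x2 - 1.5)) for x1, x2 in INPUTS]

print("AND on raw inputs:", separable(INPUTS, AND) is not None)
print("OR  on raw inputs:", separable(INPUTS, OR) is not None)
print("XOR on raw inputs:", separable(INPUTS, XOR) is not None)
print("XOR on hidden features:", separable(hidden, XOR) is not None)
```

The grid search is only a demonstration for the separable cases. For XOR on the raw inputs no line exists at all, on or off the grid: the four constraints b ≤ 0, w1 + b > 0, w2 + b > 0 and w1 + w2 + b ≤ 0 contradict each other.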

Training with Momentum

Standard gradient descent updates each weight by:

Δw = −η · ∂E/∂w

With momentum, the update also carries a fraction of the previous step's velocity:

v ← μ · v − η · ∂E/∂w    →    w ← w + v

The momentum coefficient μ ≈ 0.9 keeps the weights moving through flat regions of the error surface where plain gradient descent would stall. This matters a great deal for the 2-2-1 XOR case: that architecture sits right at the minimum capacity needed, so its error surface is full of plateaus and saddle points.
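A small numeric sketch shows the effect on a plateau. It applies both update rules to a single weight with a small constant gradient, which is a deliberately simplified stand-in for a flat region; the function names and the gradient value 0.01 are assumptions for illustration.

```python
def plain_step(w, grad, lr=0.5):
    """Standard gradient descent: w <- w - lr * grad."""
    return w - lr * grad

def momentum_step(w, v, grad, lr=0.5, mu=0.9):
    """Momentum update: v <- mu * v - lr * grad, then w <- w + v."""
    v = mu * v - lr * grad
    return w + v, v

grad = 0.01                      # small, constant gradient (assumed plateau)
w_plain, w_mom, v = 0.0, 0.0, 0.0
for _ in range(100):
    w_plain = plain_step(w_plain, grad)
    w_mom, v = momentum_step(w_mom, v, grad)

print("plain gradient descent moved:", w_plain)
print("momentum moved:", w_mom, "with final velocity", v)
```

Under a constant gradient the velocity approaches −η · grad / (1 − μ), so μ = 0.9 gives steps roughly ten times larger than the plain update once the velocity has built up.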

Why momentum is essential here: without it, training on XOR routinely plateaus at an error of about 0.4–0.5 and does not converge. With μ = 0.9 and a wider initialization range ([−2.0, +2.0] instead of [−0.5, +0.5]), the network usually finds a solution within a few thousand iterations, though not always: some starting weights still leave it stuck.

Simulation

The interactive simulator is below. Pick a gate (XOR by default), click Train, and watch the iteration count rise and the error drop. If training stalls, which sometimes happens with XOR, click Randomize to try a different starting point.

Parameters

Parameter | Range / default | Effect
----------|-----------------|-------
Activation function | Sigmoid / Tanh / Step | Sigmoid & Tanh are trainable. Step cannot be trained (slope = 0) but is shown for comparison.
Gate preset | XOR / AND / OR / NAND / NOR | Selects the target truth table. XOR is the only one that needs the hidden layer.
Network weights w | 6 sliders, range [−2, +2] | Manually editable for hidden layer (4) and output layer (2). Wide range helps break symmetry.
Inputs x1, x2 | 0 or 1 (checkboxes) | Drives a single forward pass.
Learning rate η | default 0.5 | Step size for weight updates. Stable at 0.5 when paired with momentum.
Momentum μ | 0–0.99, default 0.9 | Velocity carry-over coefficient. Essential for 2-2-1 XOR.
Train update speed | seconds per step, default 0.1 | Delay between training iterations; visualization speed only.
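The note about the Step activation can be made concrete by putting the three functions next to their derivatives, since backpropagation multiplies every error signal by that derivative. This is a sketch under assumptions: the dictionary layout and the step threshold convention (fires for z > 0) are illustrative choices, not taken from the simulator.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # largest at z = 0, where it equals 0.25

def d_tanh(z):
    return 1.0 - math.tanh(z) ** 2

def step(z):
    return 1.0 if z > 0 else 0.0  # threshold convention is an assumption

def d_step(z):
    return 0.0                    # flat on both sides of the jump: no gradient signal

# name -> (activation, derivative)
ACTIVATIONS = {
    "sigmoid": (sigmoid, d_sigmoid),
    "tanh": (math.tanh, d_tanh),
    "step": (step, d_step),
}

for name, (f, df) in ACTIVATIONS.items():
    print(name, [round(df(z), 4) for z in (-2.0, 0.0, 2.0)])
```

Because `d_step` returns 0 for every input, every backpropagated weight update is 0 as well, which is why the Step option can be displayed but not trained.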

Buttons

Button | Effect
-------|-------
Reset | Stops training, randomizes weights and biases in [−2, +2], clears inputs to [0,0], wipes velocity history.
Randomize | Stops training, randomizes weights and inputs, wipes velocity. Use this when training is stuck.
Train | Runs backpropagation with momentum. Cycles through the 4 truth-table rows, updates weights after each. Stops automatically when all decisions are correct.
Test | Cycles through all 4 inputs with the current weights. Reports "Test PASS" or "Test FAIL".
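The Test button's behavior can be approximated as below. This is a hypothetical sketch: `run_test`, the `GATES` table layout, and the 0.5 decision threshold are assumptions, and `predict` stands for any function that maps the two inputs to the network output.

```python
INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Target outputs for each gate preset, in the row order of INPUTS.
GATES = {
    "XOR":  [0, 1, 1, 0],
    "AND":  [0, 0, 0, 1],
    "OR":   [0, 1, 1, 1],
    "NAND": [1, 1, 1, 0],
    "NOR":  [1, 0, 0, 0],
}

def run_test(predict, gate, threshold=0.5):
    """Check all 4 rows; an output counts as 1 when it reaches the threshold (assumed 0.5)."""
    targets = GATES[gate]
    ok = all((predict(x1, x2) >= threshold) == bool(t)
             for (x1, x2), t in zip(INPUTS, targets))
    return "Test PASS" if ok else "Test FAIL"

# Stand-in predictors instead of a trained network:
print(run_test(lambda a, b: float(a != b), "XOR"))
print(run_test(lambda a, b: float(a and b), "XOR"))
```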

Tips on Implementation

Several lessons learned from building a reliable XOR trainer at this minimal architecture:

  • XOR is at the architecture limit. 2-2-1 is the theoretical minimum that can solve XOR. The error surface has flat plateaus, saddle points, and local minima that vanilla gradient descent often cannot escape.
  • Use momentum. Without it, training plateaus at error ~0.4–0.5 within hundreds of iterations and never converges. With μ = 0.9 it usually converges within thousands.
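  • Initialize wide. With [−0.5, +0.5] every weighted sum starts close to 0, so all sigmoid outputs sit near 0.5 and the two hidden neurons are almost indistinguishable; the range [−2, +2] spreads them out so they respond differently to the four inputs from the start.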
  • Online (per-example) updates are used here for visual feedback. Batch updates would be smoother but less educational.
  • Reset velocity on restart. Stale momentum from a previous training session can push the network off in a bad direction.
  • Convergence indicators: error drops below 0.1 and all 4 decision outputs become correct. If you see neither after a few thousand iterations, hit Randomize and try again, because the starting point matters.
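The tips above can be combined into one small trainer. The sketch below is an assumed reconstruction, not the simulator's source: it uses sigmoid units with one bias per neuron (9 parameters in a flat list), the error E = 0.5 · (y − t)², online updates in a fixed row order, a fresh velocity vector on every run, and a stop condition based on the 0.5 decision threshold. All names are hypothetical.

```python
import math
import random

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
XOR = [0, 1, 1, 0]

def sigmoid(z):
    # Numerically stable form: never exponentiates a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def forward(p, x1, x2):
    """p = [w11, w21, w12, w22, b1, b2, wh1, wh2, b_out] (assumed layout)."""
    w11, w21, w12, w22, b1, b2, wh1, wh2, bo = p
    h1 = sigmoid(w11 * x1 + w21 * x2 + b1)
    h2 = sigmoid(w12 * x1 + w22 * x2 + b2)
    y = sigmoid(wh1 * h1 + wh2 * h2 + bo)
    return h1, h2, y

def gradients(p, x1, x2, t):
    """Backpropagated gradient of E = 0.5 * (y - t)^2 for one example."""
    h1, h2, y = forward(p, x1, x2)
    d_out = (y - t) * y * (1 - y)            # dE/dz at the output neuron
    d_h1 = d_out * p[6] * h1 * (1 - h1)      # dE/dz at hidden neuron 1
    d_h2 = d_out * p[7] * h2 * (1 - h2)      # dE/dz at hidden neuron 2
    return [d_h1 * x1, d_h1 * x2, d_h2 * x1, d_h2 * x2, d_h1, d_h2,
            d_out * h1, d_out * h2, d_out]

def all_correct(p, targets):
    return all((forward(p, x1, x2)[2] >= 0.5) == bool(t)
               for (x1, x2), t in zip(INPUTS, targets))

def train(targets=XOR, lr=0.5, mu=0.9, scale=2.0, max_epochs=5000, seed=0):
    rng = random.Random(seed)
    p = [rng.uniform(-scale, scale) for _ in range(9)]   # wide initialization
    v = [0.0] * 9                                        # velocity reset on every run
    for epoch in range(1, max_epochs + 1):
        for (x1, x2), t in zip(INPUTS, targets):         # online: update after each row
            g = gradients(p, x1, x2, t)
            v = [mu * vi - lr * gi for vi, gi in zip(v, g)]
            p = [pi + vi for pi, vi in zip(p, v)]
        if all_correct(p, targets):                      # stop when all decisions are right
            return p, epoch, True
    return p, max_epochs, False

if __name__ == "__main__":
    params, epochs, converged = train()
    print("converged:", converged, "after", epochs, "epochs")
```

As in the simulator, a run may stop without converging; calling `train` again with a different `seed` plays the role of the Randomize button.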

Limitations

  • Fixed 2-2-1 architecture; no way to add neurons or layers in this simulator. See MLP II and MLP III for wider/deeper variants.
  • Truth-table problems only (4 input combinations). No continuous data, no real-world classification.
  • Online (per-example) updates only, cycling through the truth-table rows in a fixed order; no batch or mini-batch mode.
  • Pure backpropagation with momentum — no adaptive optimizers (Adam, RMSprop) or regularization.
  • Convergence is not guaranteed. Even with momentum + wide init, XOR at this minimum architecture occasionally fails to converge from unlucky starting weights. Use Randomize if so.