MLP (Multi-Layer Perceptron) II

This tutorial extends MLP I from the minimum 2-2-1 architecture to 2-3-1 — 2 inputs, 3 hidden neurons, 1 output. The extra hidden neuron adds capacity that makes XOR training dramatically more reliable.

The forward/backward pass and training loop are the same as in MLP I, but with 9 weights instead of 6. You should observe convergence in roughly 100–500 iterations, versus the thousands of iterations (or outright failure to converge) typical of the 2-2-1 case.

NOTE: Refer to this note for the theoretical background.

Mathematical Foundation

Each neuron computes the standard MLP unit:

z = Σ wi · xi + b   →   a = σ(z)

For the 2-3-1 network the layer computations expand to:

hj = σ(w1j x1 + w2j x2)  for j = 1, 2, 3
y  = σ(wh1 h1 + wh2 h2 + wh3 h3)

Total 9 trainable weights (6 in the hidden layer + 3 in the output layer). Training is backpropagation with momentum, identical in form to MLP I.
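As a minimal sketch of the two layer equations above, the forward pass can be written in a few lines of Python. The function name `forward`, the weight layout (three `(w1j, w2j)` pairs plus three output weights), and the example weight values are illustrative assumptions, not taken from the simulator.

```python
import math

def sigmoid(z):
    """Logistic activation sigma(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, w_hidden, w_out):
    """Forward pass of a bias-free 2-3-1 network.

    w_hidden: three (w1j, w2j) pairs, one per hidden neuron (assumed layout).
    w_out:    three hidden-to-output weights (wh1, wh2, wh3).
    Returns the hidden activations and the output y.
    """
    h = [sigmoid(w1 * x1 + w2 * x2) for (w1, w2) in w_hidden]
    y = sigmoid(sum(wh * hj for wh, hj in zip(w_out, h)))
    return h, y

# Arbitrary example weights, chosen only to exercise the function.
w_hidden = [(1.0, -1.0), (-1.0, 1.0), (0.5, 0.5)]
w_out = [1.0, 1.0, -1.0]

h, y = forward(0, 0, w_hidden, w_out)
print(h, y)  # for input (0, 0) every hidden activation is sigmoid(0) = 0.5
```

Note that with input (0, 0) and no bias terms, every hidden neuron outputs exactly 0.5 regardless of its weights, so the output for that input is decided by the output weights alone.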

Why the Extra Neuron Helps

XOR is not linearly separable, so the hidden layer must construct a useful intermediate representation before the output neuron can draw a decision boundary. With 2 hidden neurons there are very few weight configurations that achieve this, so training has to find a narrow target; with 3, the extra neuron adds redundancy, and the error surface has multiple equally good basins and far fewer flat plateaus.
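To make "not linearly separable" concrete, the sketch below searches a grid of weights for a single threshold unit (w1·x1 + w2·x2 + b > 0) that reproduces a given truth table. A grid search is only an illustration, not a proof, and the helper name `find_linear_unit` and the grid range are assumptions made for this example.

```python
import itertools

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
XOR = [0, 1, 1, 0]
AND = [0, 0, 0, 1]

def find_linear_unit(targets, grid):
    """Return the first (w1, w2, b) on the grid whose threshold unit
    matches all four targets, or None if no grid point does."""
    for w1, w2, b in itertools.product(grid, repeat=3):
        if all((w1 * x1 + w2 * x2 + b > 0) == bool(t)
               for (x1, x2), t in zip(INPUTS, targets)):
            return (w1, w2, b)
    return None

# Candidate values from -4.0 to 4.0 in steps of 0.5 (arbitrary choice).
grid = [i / 2 for i in range(-8, 9)]

print("AND:", find_linear_unit(AND, grid))  # some separating unit is found
print("XOR:", find_linear_unit(XOR, grid))  # None: no single unit works
```

AND is solved by a single unit, while no grid point solves XOR, which is why at least one hidden layer is required.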

| Architecture | Hidden weights | Total weights | Typical XOR convergence |
|---|---|---|---|
| 2-2-1 (MLP I) | 4 | 6 | 1000s of iterations or stall |
| 2-3-1 (MLP II) | 6 | 9 | 100–500 iterations, almost always succeeds |
| 2-4-1 (MLP III) | 8 | 12 | Faster still; over-parameterized for XOR |

Capacity vs trainability: in general, over-parameterizing slightly above the minimum needed to express the target function makes optimization far easier. Modern deep networks exploit this routinely — they're typically wildly over-parameterized relative to the function being learned.
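The weight counts quoted for the three architectures follow from simple arithmetic: inputs × hidden neurons for the hidden layer, plus hidden neurons × outputs for the output layer, with no bias terms counted. The helper `count_weights` below is a hypothetical name used only for this illustration.

```python
def count_weights(n_in, n_hidden, n_out):
    """Return (hidden-layer weights, total weights) for a fully connected
    n_in - n_hidden - n_out network, not counting biases."""
    hidden = n_in * n_hidden
    output = n_hidden * n_out
    return hidden, hidden + output

for n_hidden in (2, 3, 4):
    hidden, total = count_weights(2, n_hidden, 1)
    print(f"2-{n_hidden}-1: {hidden} hidden weights, {total} total")
# 2-2-1: 4 hidden weights, 6 total
# 2-3-1: 6 hidden weights, 9 total
# 2-4-1: 8 hidden weights, 12 total
```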

Simulation

The interactive simulator is below. Pick a gate (XOR is the interesting one), hit Train, and watch the iteration counter. You should rarely need Randomize at this architecture.

Parameters

| Parameter | Range / default | Effect |
|---|---|---|
| Activation function | Sigmoid / Tanh / Step | Sigmoid and Tanh are trainable. Step has a zero derivative wherever it is differentiable, so it cannot be trained by backprop. |
| Gate preset | XOR / AND / OR / NAND / NOR | Target truth table. XOR is the headline case. |
| Network weights | 9 sliders, range [−2, +2] | 6 input→hidden weights + 3 hidden→output weights, manually editable. |
| Inputs x1, x2 | 0 or 1 | Drives a single forward pass. |
| Learning rate η | default 0.5 | Weight-update step size. Stable at 0.5 with momentum. |
| Momentum μ | 0–0.99, default 0.9 | Velocity carry-over. Less critical than at 2-2-1 but still useful. |
| Train update speed | seconds, default 0.1 | Delay between training iterations; affects visualization speed only. |
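The difference between the three activation choices comes down to their derivatives, which backpropagation multiplies into every weight update. The sketch below is one plausible way to write them; the function names and the step threshold at z ≥ 0 are assumptions, not the simulator's actual code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z):
    """Derivative of the sigmoid: s * (1 - s), largest (0.25) at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    """Derivative of tanh: 1 - tanh(z)^2, largest (1.0) at z = 0."""
    return 1.0 - math.tanh(z) ** 2

def step(z):
    """Hard threshold; firing at z >= 0 is an assumed convention."""
    return 1.0 if z >= 0 else 0.0

def d_step(z):
    """The step function is flat on both sides of the threshold, so its
    derivative is 0 wherever it exists and carries no gradient."""
    return 0.0

for z in (-2.0, 0.0, 2.0):
    print(z, d_sigmoid(z), d_tanh(z), d_step(z))
```

Because `d_step` is always 0, every backpropagated weight update through a step unit is also 0, which is why only Sigmoid and Tanh can be trained.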

Buttons

| Button | Effect |
|---|---|
| Reset | Stops training, randomizes weights and biases, clears inputs and velocity. |
| Randomize | Stops training, randomizes weights and inputs. Rarely needed at 2-3-1. |
| Train | Runs backpropagation with momentum. Stops automatically when all decisions are correct. |
| Test | Cycles through all 4 input combinations with the current weights and reports PASS / FAIL. |
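The Test behaviour can be sketched as a loop over the four input combinations that compares a thresholded output with the gate's truth table. The names `GATES` and `run_test`, the 0.5 decision threshold, and the stand-in predictor are assumptions for illustration; the simulator's own decision rule may differ.

```python
INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Target outputs for each gate preset, in the order of INPUTS.
GATES = {
    "XOR":  [0, 1, 1, 0],
    "AND":  [0, 0, 0, 1],
    "OR":   [0, 1, 1, 1],
    "NAND": [1, 1, 1, 0],
    "NOR":  [1, 0, 0, 0],
}

def run_test(predict, gate, threshold=0.5):
    """Evaluate predict(x1, x2) on all 4 inputs of the chosen gate.

    Returns (passed, rows), where each row is
    ((x1, x2), target, decision, row_ok).
    """
    rows = []
    for (x1, x2), target in zip(INPUTS, GATES[gate]):
        decision = 1 if predict(x1, x2) >= threshold else 0
        rows.append(((x1, x2), target, decision, decision == target))
    return all(row[3] for row in rows), rows

# Stand-in predictor that computes XOR exactly; a trained network's
# forward pass would be passed in here instead.
passed, rows = run_test(lambda x1, x2: float(x1 != x2), "XOR")
print("PASS" if passed else "FAIL")  # PASS
```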

Tips on Implementation

  • Capacity helps more than tricks. Going from 2 to 3 hidden neurons gives a flatter, more forgiving error surface — far more impact than tuning learning rate or activation.
  • Momentum still earns its keep. Even at 2-3-1, μ = 0.9 converges noticeably faster than μ = 0 by pushing through any remaining flat regions.
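  • Wide init [−2, +2] breaks the symmetry between hidden neurons and starts them away from the flat region near zero weights, where XOR gradients almost cancel. With inputs of 0 or 1, each hidden pre-activation still stays within [−4, +4], so the sigmoid is not yet deeply saturated.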
  • Online updates (one example at a time) give better visual feedback than batch updates.
  • Convergence indicators: error below 0.1 and all 4 decisions correct. Failure is rare at 2-3-1; if it happens, hit Randomize.
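The tips above can be combined into a minimal training sketch: online updates, momentum, wide initialization, and the two convergence checks. It follows the bias-free layer equations (9 weights) and makes several assumptions that are not stated in the text: a squared-error loss, one "iteration" meaning one pass over the 4 patterns, error summed over the 4 patterns, and a 0.5 decision threshold. The function names and the seed are illustrative only, and the iteration count it reports will vary with the seed.

```python
import math
import random

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
XOR_TARGETS = [0, 1, 1, 0]

def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_h, w_o):
    h = [sigmoid(w[0] * x[0] + w[1] * x[1]) for w in w_h]
    y = sigmoid(sum(wo * hj for wo, hj in zip(w_o, h)))
    return h, y

def train(targets, eta=0.5, mu=0.9, max_iters=2000, seed=0):
    rng = random.Random(seed)
    # Wide init in [-2, +2] for all 9 weights; velocities start at zero.
    w_h = [[rng.uniform(-2, 2), rng.uniform(-2, 2)] for _ in range(3)]
    w_o = [rng.uniform(-2, 2) for _ in range(3)]
    v_h = [[0.0, 0.0] for _ in range(3)]
    v_o = [0.0, 0.0, 0.0]
    for it in range(1, max_iters + 1):
        for x, t in zip(INPUTS, targets):  # online: update after each example
            h, y = forward(x, w_h, w_o)
            d_out = (y - t) * y * (1.0 - y)  # dE/dz at the output, E = (y-t)^2 / 2
            # Hidden deltas are computed before the output weights change.
            d_hid = [d_out * w_o[j] * h[j] * (1.0 - h[j]) for j in range(3)]
            for j in range(3):
                v_o[j] = mu * v_o[j] - eta * d_out * h[j]
                w_o[j] += v_o[j]
                for i in range(2):
                    v_h[j][i] = mu * v_h[j][i] - eta * d_hid[j] * x[i]
                    w_h[j][i] += v_h[j][i]
        outputs = [forward(x, w_h, w_o)[1] for x in INPUTS]
        error = sum(0.5 * (y - t) ** 2 for y, t in zip(outputs, targets))
        correct = all((y >= 0.5) == bool(t) for y, t in zip(outputs, targets))
        if correct and error < 0.1:  # both convergence indicators satisfied
            return w_h, w_o, it, True
    return w_h, w_o, max_iters, False

w_h, w_o, iters, converged = train(XOR_TARGETS)
print("converged" if converged else "did not converge", "after", iters, "iterations")
```

If a run does not converge, calling `train` again with a different `seed` plays the role of the Randomize button.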

Limitations

  • Fixed 2-3-1 architecture; see MLP I (2-2-1) and MLP III (2-4-1) for comparisons.
  • Truth-table problems only (4 input combinations). No continuous data, no real-world classification.
  • Online stochastic updates only, no mini-batch.
  • Plain backpropagation with momentum — no Adam, RMSprop, or other adaptive optimizers.