Web Simulation

MLP (Multi-Layer Perceptron) I

This note provides an interactive, visual simulation of a Multi-Layer Perceptron (MLP) with a 2-2-1 architecture (2 inputs, 2 hidden neurons, 1 output). It demonstrates how a hidden layer enables the network to solve non-linearly separable problems like XOR, which a single perceptron cannot solve.

The simulation shows the forward pass through all layers, displaying the activation values at each hidden neuron and the final output. It visualizes the network connections with color-coded weights (green for positive, red for negative) and thickness proportional to weight magnitude.
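
As a concrete illustration of the forward pass, here is a minimal Python sketch of the 2-2-1 network using the same weight names as the simulation's sliders (w11, w21, w12, w22, wh1, wh2). The specific weight and bias values below are our own hand-picked example that happens to implement XOR (one OR-like and one NAND-like hidden neuron), not values taken from the simulation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, w, b):
    """Forward pass through a 2-2-1 MLP with sigmoid activations."""
    h1 = sigmoid(w['w11'] * x1 + w['w21'] * x2 + b['b1'])   # hidden neuron 1
    h2 = sigmoid(w['w12'] * x1 + w['w22'] * x2 + b['b2'])   # hidden neuron 2
    out = sigmoid(w['wh1'] * h1 + w['wh2'] * h2 + b['bo'])  # output neuron
    return h1, h2, out

# Hand-picked weights: h1 acts like OR, h2 like NAND, output like AND of both.
w = {'w11': 6.0, 'w21': 6.0, 'w12': -6.0, 'w22': -6.0, 'wh1': 10.0, 'wh2': 10.0}
b = {'b1': -3.0, 'b2': 9.0, 'bo': -15.0}

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    _, _, y = forward(x1, x2, w, b)
    print(x1, x2, '->', 1 if y > 0.5 else 0)   # prints the XOR truth table
```

Thresholding the output at 0.5 turns the continuous sigmoid value into the binary "Decision" shown in the simulation.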

The training uses backpropagation with momentum to update weights in both the hidden and output layers. Momentum helps the network escape local minima and flat regions, which is especially important for solving XOR with the minimal 2-2-1 architecture. Weights are initialized in a wider range [-2.0, 2.0] to break symmetry and provide better starting points. You can watch the network learn to solve XOR and other logic gates by observing the iteration count, error decrease, and weight updates in real-time.
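
The momentum update can be written as a velocity that accumulates past gradients. The sketch below (illustrative names, not the simulation's source) applies the rule to a toy one-dimensional problem to show how the velocity keeps the update moving even where the gradient is small:

```python
def momentum_step(w, v, grad, lr=0.5, mu=0.9):
    """One momentum-based update for a single weight."""
    v = mu * v - lr * grad       # mu = 0 reduces to plain gradient descent
    return w + v, v

# Toy example: minimizing f(w) = w**2 (gradient 2w). The velocity carries the
# update forward even as the gradient shrinks near the minimum.
w, v = 3.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, 2.0 * w, lr=0.1, mu=0.9)
# w is now close to the minimum at 0
```

In the simulation the same rule is applied independently to each of the network's weights and biases.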

NOTE : Refer to this note for the theoretical details.

 

Parameters

The following are short descriptions of each parameter:
  • Activation Function: Selects the transfer function used in all neurons (hidden and output layers). Sigmoid and Tanh work well for XOR training. Step function cannot be trained with backpropagation (derivative is zero).
  • Gate Preset: Selects a logic gate problem to solve. XOR is the default to showcase the MLP's ability to solve non-linearly separable problems. Other gates (AND, OR, NAND, NOR) are also available.
  • Network Weights: All 6 weights in the network can be manually adjusted using sliders:
    • Hidden neuron 1: w11 (input1→hidden1), w21 (input2→hidden1)
    • Hidden neuron 2: w12 (input1→hidden2), w22 (input2→hidden2)
    • Output Layer: wh1 (hidden1→output), wh2 (hidden2→output)
    Weights are initialized in the range [-2.0, 2.0] to help break symmetry and avoid flat regions during training.
  • Input x1, x2: Binary input values (0 or 1) controlled by checkboxes. These represent the two inputs to the network.
  • Learning Rate: Controls the step size for weight updates during backpropagation. Higher values learn faster but may overshoot or oscillate. Lower values are more stable but slower. Default is 0.5. Can be adjusted during training.
  • Momentum: Controls the momentum factor (0 to 0.99) used in momentum-based gradient descent. Higher values (closer to 0.99) maintain more velocity, helping the network escape local minima and flat regions. Lower values (closer to 0) behave more like standard gradient descent. Default is 0.9. Can be adjusted during training. Momentum adds "velocity" to weight updates, allowing the network to continue moving in a direction even when gradients are small. This is especially important for solving XOR with a minimal 2-2-1 architecture.
  • Train Update Speed (sec): Controls the delay between training steps. Smaller values update more frequently (faster visualization). Default is 0.1 seconds.
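
The activation choices above can be sketched together with their derivatives, which makes the note about the step function concrete: backpropagation multiplies the error signal by the activation's derivative, and the step function's derivative is zero everywhere it is defined, so no gradient ever flows. A minimal sketch (function names are ours):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)             # peaks at 0.25 when z = 0

def d_tanh(z):
    return 1.0 - math.tanh(z) ** 2   # peaks at 1.0 when z = 0

def step(z):
    return 1.0 if z >= 0 else 0.0

def d_step(z):
    return 0.0   # zero everywhere it is defined: no gradient to propagate
```

Because tanh's derivative peaks at 1.0 versus sigmoid's 0.25, tanh often delivers a stronger gradient signal early in training.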

Buttons

The following are short descriptions of each button:
  • Reset: Stops training (if running) and randomizes all weights and biases to new random values in the range [-2.0, 2.0]. Also resets inputs to [0, 0] and clears velocity (momentum) history.
  • Randomize: Stops training (if running) and assigns random values to all weights and biases (range [-2.0, 2.0]), randomizes inputs to random binary values, and resets velocity history. Useful for trying different starting points when training gets stuck.
  • Train: Starts backpropagation training with momentum using the selected gate's truth table. The network cycles through all input combinations one at a time, updating weights after each example using momentum-based gradient descent. The iteration counter shows the number of training steps completed. Training stops automatically when all decisions are correct (all 4 input combinations produce correct outputs), or can be stopped manually by clicking the button again.
  • Test: Stops training (if running) and tests all possible input combinations ([0,0], [0,1], [1,0], [1,1]) with the current weights. Visually cycles through each combination with a delay. Shows a "Test PASS" popup if all combinations result in correct decisions (Decision = 1), or "Test FAIL" otherwise.
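
The Train button's behavior might be sketched as follows. This is a minimal reconstruction under our own naming (the simulation's source is not shown here), combining online backpropagation, the momentum update, the all-correct stopping test, and a Randomize-style restart when a run stalls:

```python
import math
import random

def sigmoid(z):
    z = max(-60.0, min(60.0, z))         # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
LR, MU = 0.5, 0.9                        # default learning rate and momentum

def forward(w, x1, x2):
    h1 = sigmoid(w[0] * x1 + w[1] * x2 + w[6])   # hidden neuron 1
    h2 = sigmoid(w[2] * x1 + w[3] * x2 + w[7])   # hidden neuron 2
    o = sigmoid(w[4] * h1 + w[5] * h2 + w[8])    # output neuron
    return h1, h2, o

def all_correct(w):
    return all((forward(w, *x)[2] > 0.5) == bool(t) for x, t in XOR)

def train(max_iters=20000):
    w = [random.uniform(-2.0, 2.0) for _ in range(9)]  # 6 weights + 3 biases
    v = [0.0] * 9                                      # velocities reset to zero
    for it in range(max_iters):
        (x1, x2), t = XOR[it % 4]                      # one example at a time
        h1, h2, o = forward(w, x1, x2)
        do = (o - t) * o * (1 - o)                     # output delta
        dh1 = do * w[4] * h1 * (1 - h1)                # backpropagated deltas
        dh2 = do * w[5] * h2 * (1 - h2)
        grads = [dh1 * x1, dh1 * x2, dh2 * x1, dh2 * x2,
                 do * h1, do * h2, dh1, dh2, do]
        for i, g in enumerate(grads):
            v[i] = MU * v[i] - LR * g                  # momentum update
            w[i] += v[i]
        if it % 4 == 3 and all_correct(w):             # stop: all 4 correct
            return w
    return None                                        # stalled

random.seed(0)
w = None
for attempt in range(25):    # "Randomize"-style restart if a run stalls
    w = train()
    if w is not None:
        break
```

The restart loop mirrors the advice in the Tips section below: when a run plateaus, fresh random weights are usually a better use of time than more iterations from the same starting point.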

Tips on Implementation

In the initial trials, training for XOR often got stuck partway through without making any further progress. The network would reach an error of around 0.4-0.5 after hundreds or thousands of iterations and then plateau, unable to converge to a solution. The next step was to find out why this problem occurs and how to resolve it. The following are what we learned and how the problem was fixed.

  • XOR Problem Difficulty: The XOR problem is non-linearly separable and requires a hidden layer. A 2-2-1 architecture (2 hidden neurons) is the theoretical minimum to solve XOR, but it's notoriously difficult for standard gradient descent due to flat regions in the error surface where gradients become tiny.
  • Momentum is Essential: Without momentum, the network often gets stuck at error values around 0.4-0.5 after many iterations. Momentum (0.9) helps the network maintain velocity through flat regions and escape local minima, making convergence much more reliable.
  • Wider Initialization Range: Initializing weights in the range [-2.0, 2.0] instead of [-0.5, 0.5] helps break symmetry and places neurons in more active regions of the sigmoid curve. This prevents neurons from starting in saturated regions where gradients vanish.
  • Online vs Batch Learning: The current implementation uses online/stochastic learning (one example at a time). This can be less stable than batch learning (process all examples before updating), but it provides better visual feedback during training.
  • Learning Rate Sensitivity: The default learning rate of 0.5 works well with momentum. Without momentum, lower rates (0.1-0.2) are often needed, but training becomes slower. With momentum, higher rates are more stable.
  • Activation Function Choice: Sigmoid and Tanh work well for XOR. Step function cannot be used for training (derivative is zero), but the output is thresholded at 0.5 for binary classification anyway.
  • Velocity Reset: Velocities (momentum history) are reset when starting new training or randomizing weights. This prevents momentum from carrying over inappropriate velocity from previous training sessions.
  • Convergence Indicators: Watch for the error dropping below 0.1 and all decisions becoming correct (Decision = 1 for all 4 input combinations). If training stalls, try randomizing weights to start from a different point in the weight space.
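
The online-versus-batch distinction from the tips can be seen on a toy one-parameter model (all names here are illustrative): online learning updates the weight after every example, while batch learning accumulates the gradient over the whole training set before updating once.

```python
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # samples of y = 2 * x
lr = 0.05

# Online/stochastic: update after every example (what the simulation does)
w = 0.0
for _ in range(200):
    for x, y in data:
        w -= lr * (w * x - y) * x             # per-example gradient step

# Batch: accumulate the gradient over all examples, then update once
wb = 0.0
for _ in range(200):
    g = sum((wb * x - y) * x for x, y in data)
    wb -= lr * g

# Both recover the true slope of 2, but online updates are visible step
# by step, which is what makes the training visualization informative.
```

On this consistent toy data both variants converge to the same answer; the difference shows up in stability (batch averages out noisy per-example gradients) and in visual feedback (online shows progress after every example).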

NOTE : Even with momentum and wider initialization, training for XOR will not always succeed, owing to the challenging nature of the problem. The 2-2-1 architecture is the theoretical minimum for solving XOR, which means the error surface has many flat regions, saddle points, and local minima, and some initial weight configurations may still fail to converge, especially if the network starts in an unfavorable region of the weight space. These improvements nevertheless raise the success rate significantly compared to standard gradient descent without momentum. If training fails to converge after many iterations, click "Randomize" to restart with different initial weights, as different starting points can lead to different convergence outcomes. Due to the inherent difficulty of the minimal architecture, however, a 100% success rate cannot be guaranteed.