Web Simulation

CNN + MLP Pattern Recognizer - 8×8 Input with Convolution & Pooling 

This note provides an interactive, visual simulation of a Convolutional Neural Network (CNN) combined with a Multi-Layer Perceptron (MLP) for pattern recognition. The architecture processes 8×8 pixel images through a CNN preprocessing stage (Convolution + Max Pooling) that extracts 3×3 features, which are then fed into a 9-3-1 MLP (9 feature inputs, 3 hidden neurons with ReLU activation, 1 output with sigmoid for binary classification). This demonstrates how CNNs extract meaningful features from raw images before classification.

The task is to detect whether an 8×8 pixel pattern contains a vertical line or a horizontal line. The simulation includes a clickable 8×8 grid where you can toggle pixels on/off to create patterns. The raw 8×8 input is first processed through a 3×3 convolution kernel (with ReLU activation) to produce a 6×6 feature map, then through 2×2 max pooling to produce a 3×3 feature map. These 9 features are then fed into the MLP, which outputs a probability that the pattern is a horizontal line. The network diagram shows the complete pipeline: Raw 8×8 Input → Convolution Block → Pooling Block → MLP (9-3-1) → Output, with every connection visible.
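The preprocessing stage described above can be sketched in a few lines of NumPy. The kernel values below are illustrative (a vertical-edge filter), since the simulation's fixed kernel is not specified here:

```python
import numpy as np

def conv_relu(img, kernel):
    """3x3 'valid' convolution with ReLU: 8x8 input -> 6x6 feature map."""
    h, w = img.shape[0] - 2, img.shape[1] - 2
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(img[i:i+3, j:j+3] * kernel)
    return np.maximum(out, 0.0)  # ReLU clips negative responses to zero

def max_pool(fmap):
    """2x2 max pooling: 6x6 feature map -> 3x3 (the 9 MLP inputs F0-F8)."""
    return fmap.reshape(3, 2, 3, 2).max(axis=(1, 3))

# Illustrative vertical-edge kernel (an assumption, not the simulation's actual kernel)
kernel = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)

img = np.zeros((8, 8))
img[:, 3] = 1.0  # vertical line in column 3
features = max_pool(conv_relu(img, kernel)).ravel()
```

With this kernel, the vertical line produces a strong response in the left column of the pooled 3×3 map (features F0, F3, F6), while the remaining features stay at zero.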

The training uses backpropagation with momentum on a dataset of 8×8 vertical and horizontal line patterns (with noise). As the network trains, you can observe how the CNN preprocessing extracts structural features from the raw input, and how the MLP's hidden neurons learn to become feature detectors - for example, one neuron might learn to detect horizontal features, while another might learn to detect vertical features. This demonstrates the core concept of how Convolutional Neural Networks work: the convolution and pooling layers automatically extract meaningful features (edges, lines, patterns) from raw pixels, which are then classified by the MLP.
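The momentum update used during training can be sketched as follows; the defaults match the simulation's learning rate (0.1) and momentum (0.9), and the names are illustrative:

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.1, momentum=0.9):
    """One momentum-SGD update: the velocity accumulates past gradients,
    so the weights keep moving through flat regions of the loss surface."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# With a constant gradient, each step is larger than the last as velocity builds
w, v = np.zeros(3), np.zeros(3)
g = np.ones(3)
for _ in range(3):
    w, v = momentum_step(w, g, v)
# step sizes: 0.1, 0.19, 0.271 -> w[0] == -0.561
```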

NOTE : This simulation demonstrates CNN preprocessing in a "glass box" architecture where every weight and feature is visible. The CNN preprocessing (8×8 → 6×6 → 3×3) reduces the input dimensionality while preserving important structural information. The MLP (9-3-1), with 34 total parameters, is small enough to visualize every connection, yet complex enough to learn meaningful feature combinations. Refer to this note for theoretical details.

 

Usage Example

Follow these steps to explore how the network learns to detect vertical vs horizontal lines:

  1. Initial State: When you first load the simulation, the network has random weights. The input grid is empty (all pixels off). Click on pixels in the 8×8 grid to create a pattern - try drawing a vertical line (all pixels in one column) or a horizontal line (all pixels in one row).
  2. Test Before Training: Click pixels to create a pattern and observe the prediction. With random weights, the network will likely make incorrect predictions. The network diagram shows the complete pipeline: a mini 8×8 grid representation, Convolution and Pooling blocks, and the MLP with all connections. Notice how the line thickness and colors represent weight strength and sign. The MLP input nodes (labeled F0-F8) are colored based on the feature values from the pooling layer - darker nodes indicate higher feature activation.
  3. Start Training: Click the "Train" button. Watch as the input grid cycles through training patterns (vertical and horizontal lines). You'll see:
    • The 8×8 input grid updating to show each training example
    • The mini 8×8 representation in the network diagram updating
    • MLP input nodes (F0-F8) changing color based on extracted features
    • Hidden neurons lighting up based on their activation
    • The network diagram updating in real-time
    • Accuracy and loss improving over epochs
  4. Adjust Training Delay: Use the "Training Delay" slider to control how fast training progresses. Lower values (0.1x-0.5x) make training faster but harder to observe. Higher values (2x-10x) slow down training so you can carefully watch each pattern being processed. You can adjust this even while training is running.
  5. Observe Feature Learning: After a few epochs, stop training and examine the network diagram. You should notice:
    • How the CNN preprocessing (Convolution + Pooling) extracts structural features from the 8×8 input
    • The MLP input nodes (F0-F8) showing different activation levels for different patterns
    • One or more hidden neurons with thick blue connections to specific feature inputs (horizontal line detector)
    • Other neurons with connections to different feature inputs (vertical line detector)
    • The output neuron combining these features to make the final decision
  6. Test After Training: Clear the input grid and draw your own patterns. Try drawing:
    • A vertical line (all pixels in any column 0-7) - should predict "Vertical"
    • A horizontal line (all pixels in any row 0-7) - should predict "Horizontal"
    • Mixed patterns - see how the network handles ambiguous cases
    • Observe how the feature nodes (F0-F8) respond differently to different patterns
  7. Experiment with Parameters: Try adjusting the learning rate and momentum during training. Higher learning rates learn faster but may overshoot. Momentum helps the network escape flat regions. Observe how these parameters affect convergence speed and stability.
  8. Reset and Retry: If training doesn't converge well, click "Reset" to reinitialize weights and try again. Different random initializations can lead to different learned features, demonstrating the importance of weight initialization in neural networks.

Tip: The key insight to look for is how the CNN preprocessing (Convolution + Max Pooling) extracts structural features from the 8×8 input, and how the MLP's hidden neurons become specialized feature detectors. After training, you should be able to identify which hidden neuron is looking for horizontal patterns and which is looking for vertical patterns by examining the connection weights in the network diagram. Notice how the feature nodes (F0-F8) respond differently to different input patterns.

Parameters

The following are short descriptions of each parameter:
  • Input Grid (8×8): A clickable 8×8 grid where you can toggle pixels on/off by clicking. Green pixels are "on" (value 1), gray pixels are "off" (value 0). Click any pixel to toggle it. Create vertical lines (all pixels in one column) or horizontal lines (all pixels in one row) to test the network. The grid automatically updates the prediction as you draw.
  • CNN Preprocessing: The 8×8 input is processed through two stages before reaching the MLP:
    • Convolution (3×3 kernel): Applies a fixed 3×3 convolution kernel with ReLU activation to the 8×8 input, producing a 6×6 feature map. The kernel highlights structural differences and edge patterns.
    • Max Pooling (2×2): Applies 2×2 max pooling to the 6×6 feature map, producing a 3×3 feature map (9 features). This reduces dimensionality while preserving important structural information.
    The 9 features from pooling are then fed into the MLP.
  • MLP Architecture: The MLP receives 9 feature inputs (from the 3×3 pooled features), has 3 hidden neurons with ReLU activation, and 1 output neuron with sigmoid activation. Total MLP parameters: (9×3 + 3) + (3×1 + 1) = 34 weights and biases. This is small enough to visualize every connection clearly.
  • Network Visualization: The top panel shows the complete CNN+MLP pipeline: a mini 8×8 raw input representation, Convolution block, Pooling block, and the MLP with all connections. The MLP input nodes are labeled F0-F8 and are colored based on feature intensity (darker = higher activation). Line thickness represents weight magnitude, color represents sign (blue=positive, red=negative). Hidden neurons "light up" based on their activation level (darker blue = higher activation). The output node shows the probability of a horizontal line.
  • Learning Rate: Controls the step size for weight updates during backpropagation. Higher values learn faster but may overshoot or oscillate. Lower values are more stable but slower. Default is 0.1. Can be adjusted during training.
  • Momentum: Controls the momentum factor (0 to 0.99) used in momentum-based gradient descent. Higher values (closer to 0.99) maintain more velocity, helping the network escape local minima and flat regions. Default is 0.9.
  • Training Delay: Controls the speed of training visualization. Lower values (0.1x-0.5x) make training faster but harder to observe. Higher values (2x-10x) slow down training so you can carefully watch each pattern being processed. Default is 1.0x (realtime). Can be adjusted during training.
  • Training Data: The network is trained on a dataset of 8×8 vertical and horizontal line patterns. Each pattern type has 8 clean examples (one for each column 0-7 for vertical, one for each row 0-7 for horizontal) plus 2 noisy variations per clean example (with 2 random pixels flipped). Total: 16 clean + 32 noisy = 48 training samples.
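The dataset described in the last bullet can be sketched as follows (function names are hypothetical):

```python
import random

def line_pattern(index, horizontal):
    """8x8 binary grid with a full row (horizontal) or column (vertical)."""
    grid = [[0] * 8 for _ in range(8)]
    for k in range(8):
        if horizontal:
            grid[index][k] = 1
        else:
            grid[k][index] = 1
    return grid

def noisy_copy(grid, flips=2, rng=random):
    """Copy of the grid with `flips` randomly chosen pixels toggled."""
    g = [row[:] for row in grid]
    for _ in range(flips):
        r, c = rng.randrange(8), rng.randrange(8)
        g[r][c] ^= 1
    return g

dataset = []  # (grid, label) pairs: 0 = vertical, 1 = horizontal
for horizontal in (False, True):
    for idx in range(8):          # one clean line per row/column
        clean = line_pattern(idx, horizontal)
        dataset.append((clean, int(horizontal)))
        for _ in range(2):        # 2 noisy variants per clean example
            dataset.append((noisy_copy(clean), int(horizontal)))
# 16 clean + 32 noisy = 48 samples, balanced between the two classes
```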

Buttons and Controls

The following are short descriptions of each control:
  • Clear: Clears the 8×8 input grid, turning all pixels off. The prediction resets to "Draw..." and the network diagram updates.
  • Predict: (Auto-updates) The prediction automatically updates whenever you click a pixel in the input grid. Processes the current 8×8 pattern through the CNN preprocessing (Convolution → Max Pooling) and then through the MLP, displaying the prediction (Vertical or Horizontal line) along with confidence. The probability gauge shows the probability of a horizontal line (0% = vertical, 100% = horizontal). The network diagram updates in real-time to show: the mini 8×8 input representation, feature node activations (F0-F8), hidden neuron activations, and the output probability.
  • Train: Starts training the network on the dataset. Training uses batch gradient descent with momentum. The network processes batches of 8 samples, accumulates gradients, and updates weights. Training continues until you click "Stop". The epoch counter, accuracy, and loss are displayed in real-time. Watch the network diagram to see: the input grid cycling through training patterns, feature nodes (F0-F8) changing color based on extracted features, weights changing (line thickness and colors), and how neurons learn to detect features.
  • Stop: Stops the training process. The network retains the weights learned so far. The "Train" button reappears, allowing you to resume training.
  • Reset: Reinitializes all MLP network weights and biases with random values (He initialization for ReLU). Also resets momentum velocities and the epoch counter. The CNN preprocessing (convolution kernel) remains fixed. Useful for starting fresh training with the same CNN preprocessing.
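The reset behaviour for the MLP weights can be sketched as He initialization, drawing each weight from a Gaussian with standard deviation sqrt(2/fan_in); the helper names are illustrative:

```python
import math
import random

def he_init(fan_in, fan_out, rng=random):
    """He-normal initialization: std = sqrt(2 / fan_in), suited to ReLU layers."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

hidden_w = he_init(9, 3)   # 9 pooled features -> 3 hidden neurons
output_w = he_init(3, 1)   # 3 hidden neurons -> 1 output
hidden_b = [0.0] * 3       # biases commonly start at zero
```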

Key Concepts and Implementation

This simulation demonstrates CNN preprocessing combined with MLP feature learning in a "glass box" architecture where every connection is visible. Here are the key concepts:

  • CNN Architecture: 8×8 → Conv → Pool → MLP(9-3-1): The complete pipeline processes 8×8 raw input through CNN preprocessing (Convolution 3×3 → Max Pooling 2×2) to extract 9 features, which are then fed into an MLP with 9 inputs, 3 hidden neurons with ReLU activation, and 1 output with sigmoid. Total MLP parameters: 34, small enough to visualize every single weight, yet complex enough to learn meaningful feature combinations. The CNN preprocessing demonstrates how convolutional layers extract structural features before classification.
  • CNN Preprocessing: The fixed 3×3 convolution kernel highlights structural differences and edge patterns in the 8×8 input. After convolution with ReLU activation, the 6×6 feature map is downsampled through 2×2 max pooling to produce 9 features. This preprocessing reduces the input dimensionality from 64 pixels to 9 features while preserving important structural information. The convolution and pooling operations are the fundamental building blocks of CNNs.
  • Feature Extraction: The CNN preprocessing automatically extracts meaningful features from raw pixels. When you draw a vertical line, the convolution and pooling layers produce a specific pattern of feature activations (visible in nodes F0-F8). When you draw a horizontal line, a different pattern emerges. The MLP then learns to recognize these feature patterns rather than processing raw pixels directly.
  • MLP Feature Detection: As the MLP trains, hidden neurons learn to become feature detectors for the extracted features. For example, one neuron might develop strong positive weights to features that activate for horizontal patterns, making it a "horizontal feature detector." Another neuron might learn to detect vertical features. This demonstrates how CNNs decompose the problem into hierarchical features: CNN extracts low-level features (edges, lines), MLP learns high-level feature combinations.
  • Weight Visualization: Every MLP connection is drawn in the network diagram. Line thickness represents weight magnitude, color represents sign (blue=positive, red=negative). The feature nodes (F0-F8) are colored based on their activation intensity (darker = higher). As training progresses, you can literally see which features each hidden neuron is "looking at" by observing which connections become thick and blue. This makes the learning process transparent and understandable.
  • Neuron Activation: Hidden neurons "light up" based on their activation level. When you draw a horizontal line, you'll see the feature nodes (F0-F8) show a specific activation pattern, and neurons connected to those features activate. When you draw a vertical line, different features activate and different neurons respond. This visual feedback shows how the network decomposes the problem into features at multiple levels.
  • ReLU Activation: Both the convolution layer and the MLP hidden layer use ReLU (Rectified Linear Unit) activation, which is standard for image recognition. ReLU helps with gradient flow and allows neurons to learn sparse, specialized features. MLP weights are initialized using He initialization (sqrt(2/fan_in)) which is optimal for ReLU.
  • Sigmoid Output: The output layer uses sigmoid activation to produce a probability (0 to 1) that the pattern is a horizontal line. Values close to 0 indicate vertical lines, values close to 1 indicate horizontal lines.
  • Binary Cross-Entropy Loss: The network is trained using binary cross-entropy loss, which measures how well the predicted probability matches the true label (0 for vertical, 1 for horizontal).
  • What to Look For: After training, observe the network diagram. You should see: (1) How different 8×8 patterns produce different feature activations (F0-F8), (2) One or more hidden neurons with thick blue lines connecting to specific feature inputs (horizontal detector), (3) Other neurons connecting to different features (vertical detector), (4) The output neuron combining these features to make the final decision. This is the moment "math" becomes "machine vision" - the CNN extracts features automatically, and the MLP learns to recognize feature patterns.
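The forward pass and loss described in these bullets (ReLU hidden layer, sigmoid output, binary cross-entropy) can be sketched as follows; the weights are toy values chosen to mimic a trained "horizontal detector" neuron, not the simulation's learned weights:

```python
import math

def mlp_forward(features, w_h, b_h, w_o, b_o):
    """9-3-1 forward pass: ReLU hidden layer, sigmoid output."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, features)) + b)
              for row, b in zip(w_h, b_h)]
    z = sum(w * h for w, h in zip(w_o[0], hidden)) + b_o[0]
    return 1.0 / (1.0 + math.exp(-z))  # P(horizontal)

def bce_loss(p, label, eps=1e-12):
    """Binary cross-entropy: label 0 = vertical, 1 = horizontal."""
    p = min(max(p, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

# Toy weights: hidden neuron 0 sums all features, the others stay silent
w_h = [[1.0] * 9, [0.0] * 9, [0.0] * 9]
b_h = [0.0, 0.0, 0.0]
w_o = [[2.0, 0.0, 0.0]]
b_o = [-1.0]
p = mlp_forward([0.5] * 9, w_h, b_h, w_o, b_o)
```

Here hidden neuron 0 responds strongly, the sigmoid pushes the output near 1 ("horizontal"), so the loss is near zero for label 1 and large for label 0.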

NOTE : This simulation demonstrates CNN preprocessing combined with MLP feature learning in a completely transparent "glass box" architecture. The CNN preprocessing (8×8 → Conv → Pool → 3×3 features) shows how convolutional layers extract structural features from raw pixels, reducing dimensionality while preserving important information. The MLP (9-3-1), with only 34 parameters, is small enough to visualize every single connection, making it ideal for understanding how neural networks learn feature combinations. After training, you can literally see: (1) how the CNN preprocessing extracts different features (F0-F8) for different input patterns, and (2) which features each hidden neuron is "looking at" by observing the connection weights: one neuron might have thick blue lines to features that activate for horizontal patterns (becoming a "horizontal feature detector"), while another might connect strongly to features that activate for vertical patterns (becoming a "vertical feature detector"). This is the fundamental concept behind the Convolutional Neural Networks used in modern computer vision: the CNN layers automatically extract hierarchical features (edges, lines, patterns) from raw pixels, and the MLP learns to recognize combinations of these features. The network decomposes the problem into features at multiple levels without any explicit programming. In practice, real-world vision systems use much larger networks with millions of parameters and learnable convolution kernels, but the core principle remains the same: CNNs extract features automatically, and fully-connected layers learn to recognize feature patterns.