This interactive tutorial provides a step-by-step visualization of how Convolution and Max Pooling operations work in Convolutional Neural Networks (CNNs). Watch as an 8×8 input image is processed through a 3×3 convolution kernel to produce a 6×6 feature map, then through 2×2 max pooling to produce a 3×3 output. This is a standalone visualization tool focused entirely on the mechanics of Convolution and Pooling—no training logic, just pure matrix operations with animation steps.
The tutorial shows each operation in detail: you can see exactly which input pixels contribute to each convolution output, how the kernel slides across the input, how ReLU activation is applied, and how max pooling selects the maximum value from each 2×2 window. Use the step controls to move forward or backward through the process, or click "Run" to watch the complete animation. You can also draw your own patterns or select from predefined patterns (vertical line, horizontal line, cross, box, letters, random noise, and more).
NOTE: This is a pure visualization tool for understanding CNN operations. The convolution kernels are predefined (simple edge/feature detectors, not learned), and all operations are shown step by step with mathematical explanations.
Input (8×8) ➜ Convolution, Kernel (3×3) ➜ Conv Output (6×6) ➜ Max Pool (2×2, Stride 2) ➜ Pooled (3×3)
Usage Example
Follow these steps to explore how Convolution and Max Pooling operations work:
Initial State: When you first load the tutorial, a vertical line pattern is loaded by default. The 8×8 input grid shows the pattern, and the convolution and pooling outputs are empty. You can click on pixels in the 8×8 grid to create your own pattern, or select a predefined pattern from the dropdown menu.
Step Through Operations: Click "Step Fwd" to move forward one step at a time. You'll see:
Yellow highlights on the input grid showing which 3×3 region is being convolved
Green highlight on the convolution output cell being calculated
The mathematical formula in the bottom panel showing the convolution calculation
Values appearing in the 6×6 convolution output grid
Watch Convolution: As you step through the convolution phase (36 steps), observe:
How the 3×3 kernel slides across the 8×8 input (left to right, top to bottom)
The element-wise multiplication: each input pixel is multiplied by the corresponding kernel value
The sum of all products, which becomes the convolution output
ReLU activation: negative values become 0
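The per-cell arithmetic described above can be sketched in plain Python (function and variable names are illustrative, not taken from the tutorial's code):

```python
# One convolution step at output position (row, col), assuming `image` is an
# 8x8 list of lists and `kernel` is a 3x3 list of lists.
def conv_step(image, kernel, row, col):
    total = 0
    for i in range(3):
        for j in range(3):
            # Element-wise multiply each input pixel with the kernel weight
            total += image[row + i][col + j] * kernel[i][j]
    return max(0, total)  # ReLU: negative sums become 0

# A vertical line in column 3 with the default edge-detector kernel:
image = [[1 if c == 3 else 0 for c in range(8)] for r in range(8)]
kernel = [[0, 1, 0],
          [1, 2, 1],
          [0, 1, 0]]
print(conv_step(image, kernel, 0, 2))  # line under the kernel's centre column -> 4
print(conv_step(image, kernel, 0, 5))  # window misses the line entirely -> 0
```

This is exactly the calculation the bottom panel spells out at each step: nine products, one sum, one ReLU.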
Watch Max Pooling: After convolution completes, the pooling phase begins (9 steps). Observe:
Yellow highlights on the 6×6 convolution output showing which 2×2 window is being pooled
Green highlight on the 3×3 pooled output cell being calculated
The mathematical formula showing the max operation
How the maximum value from each 2×2 window is selected
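A single pooling step is even simpler; the sketch below assumes a list-of-lists grid standing in for the 6×6 convolution output:

```python
# One max-pooling step producing output cell (out_row, out_col): gather the
# 2x2 window starting at (out_row*stride, out_col*stride) and take its max.
def pool_step(conv_out, out_row, out_col, stride=2):
    r, c = out_row * stride, out_col * stride
    window = [conv_out[r][c],     conv_out[r][c + 1],
              conv_out[r + 1][c], conv_out[r + 1][c + 1]]
    return max(window)

grid = [[0, 1, 0, 3],
        [2, 4, 1, 0],
        [0, 0, 5, 2],
        [1, 0, 2, 6]]
print(pool_step(grid, 0, 0))  # max of [0, 1, 2, 4] -> 4
print(pool_step(grid, 1, 1))  # max of [5, 2, 2, 6] -> 6
```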
Run Animation: Click "Run" to watch the complete process automatically. Adjust the speed slider to control animation speed (50ms = fast, 1000ms = slow). Click "Pause" to stop at any point.
Try Different Patterns: Select different patterns from the dropdown:
Vertical Line: See how vertical structures are detected
Horizontal Line: See how horizontal structures are detected
Cross: See how both vertical and horizontal features combine
Square Box: Hollow box outline - see how edge detection works on enclosed regions
Letter 'A': Classic pixel art letter A - demonstrates complex shape recognition
Letter 'X': Diagonal cross pattern - shows diagonal edge detection
Letter 'O': Circular/oval outline - demonstrates curved edge detection
Each pattern produces a different feature map under the same kernel, showing how one detector responds to different structures; conversely, switching kernels on the same pattern demonstrates how different kernels extract different features.
Adjust Stride Values: Use the Conv Stride and Pool Stride dropdowns to change how the operations move:
Conv Stride: Controls how far the kernel moves between convolution operations (1, 2, or 3). Higher stride = smaller output grid.
Pool Stride: Controls how far the pooling window moves (1 or 2). Higher stride = smaller output grid.
Changing stride values automatically recalculates the output grid dimensions. Notice how stride affects the size of the final feature map.
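The dimension rule the tutorial applies when you change either stride can be checked with a few lines of Python (a sketch of the formula, not the tutorial's code):

```python
# Output-size rule used by both dropdowns: out = floor((in - window)/stride) + 1
def output_size(input_size, window_size, stride):
    return (input_size - window_size) // stride + 1

# Convolution output for each Conv Stride option (8x8 input, 3x3 kernel):
for s in (1, 2, 3):
    print(s, output_size(8, 3, s))  # 1 -> 6, 2 -> 3, 3 -> 2

# Pooling output then depends on the conv result (2x2 window):
print(output_size(6, 2, 2))  # stride-1 conv followed by stride-2 pool -> 3
```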
Observe Gradient Colors: The convolution and pooling output grids use gradient color coding:
Light blue = low values (weak activations)
Dark blue = high values (strong activations)
The gradient makes it easy to see which regions have the strongest feature responses
Step Backward: Use "Step Back" to go backward through the operations. This is useful for reviewing specific steps or understanding how a particular output value was calculated.
Reset: Click "Reset" to clear all outputs and return to the initial state. The input pattern remains, but all convolution and pooling calculations are cleared.
Tip: Pay attention to the mathematical formulas in the bottom panel. They show exactly how each output value is calculated. Notice how the convolution kernel emphasizes certain patterns (edges, corners) and how max pooling reduces dimensionality while preserving the strongest activations.
Parameters
The following are short descriptions of each parameter:
Input Grid (8×8): A clickable 8×8 grid where you can toggle pixels on/off by clicking. Dark pixels are "on" (value 1), white pixels are "off" (value 0). Click any pixel to toggle it. The grid is locked during animation (when currentStep ≥ 0). You can also select predefined patterns from the dropdown menu.
Convolution Kernel (3×3): A selectable 3×3 kernel displayed between the input and convolution output. You can choose from several predefined kernels:
Edge Detector (default):
[0 1 0] [1 2 1] [0 1 0]
- Detects edges and local patterns
Blur:
[1 1 1] [1 1 1] [1 1 1]
- Smoothing/averaging kernel
Sharpen:
[0 -1 0] [-1 5 -1] [0 -1 0]
- Enhances edges
Vertical Edge:
[-1 0 1] [-1 0 1] [-1 0 1]
- Detects vertical edges
Horizontal Edge:
[-1 -1 -1] [0 0 0] [1 1 1]
- Detects horizontal edges
The kernel slides across the input, performing element-wise multiplication and summing the results. Changing the kernel changes what features are detected.
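The five predefined kernels can be written out as plain matrices, and applying two of them to the same input shows why the choice matters (this is a sketch; the dictionary keys and helper are illustrative, not the tutorial's code):

```python
# The five predefined kernels from the dropdown, as plain 3x3 matrices.
KERNELS = {
    "edge":       [[0, 1, 0], [1, 2, 1], [0, 1, 0]],
    "blur":       [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
    "sharpen":    [[0, -1, 0], [-1, 5, -1], [0, -1, 0]],
    "vertical":   [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
    "horizontal": [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
}

def apply_at(image, kernel, r, c):
    # Element-wise multiply, sum, then ReLU
    s = sum(image[r + i][c + j] * kernel[i][j]
            for i in range(3) for j in range(3))
    return max(0, s)

# A vertical line excites the vertical-edge kernel on its flank,
# but cancels out entirely under the horizontal-edge kernel:
image = [[1 if c == 3 else 0 for c in range(8)] for r in range(8)]
print(apply_at(image, KERNELS["vertical"], 0, 1))    # -> 3
print(apply_at(image, KERNELS["horizontal"], 0, 1))  # -> 0
```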
Convolution Stride: Controls how far the kernel moves between convolution operations. Options: 1 (default), 2, or 3. With stride 1, the kernel moves one pixel at a time. With stride 2, it moves two pixels, producing a smaller output. The output dimension is calculated as: floor((8 - 3)/stride) + 1. Changing stride automatically updates the output grid size.
Convolution Output: The result of applying the 3×3 convolution kernel to the 8×8 input. The output size depends on the stride (e.g., 6×6 with stride 1, 3×3 with stride 2, 2×2 with stride 3). Each output cell is calculated by:
Placing the kernel over a 3×3 region of the input
Multiplying each input pixel by the corresponding kernel value
Summing all products
Applying ReLU activation: max(0, sum) - negative values become 0
The kernel slides from top-left to bottom-right according to the stride setting. Output cells are color-coded with a blue gradient: light blue for low values, dark blue for high values, making it easy to see activation strength.
Pooling Stride: Controls how far the pooling window moves between pooling operations. Options: 1 or 2 (default). With stride 2, windows don't overlap. With stride 1, windows overlap by one pixel. The output dimension is calculated as: floor((convDim - 2)/stride) + 1. Changing stride automatically updates the output grid size.
Max Pooling (2×2): The convolution output is downsampled using 2×2 max pooling. With the default stride of 2, this means:
Divide the 6×6 grid into non-overlapping 2×2 windows
For each window, select the maximum value
Place that maximum value in the corresponding position in the 3×3 output
This reduces the convolution output grid size while preserving the strongest activations. The exact reduction depends on the convolution output size and pooling stride. Output cells are color-coded with the same blue gradient as the convolution output, making it easy to compare activation strengths.
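The whole downsampling pass can be sketched in a few lines of Python (names are illustrative; the tutorial's own implementation may differ):

```python
# Max pooling over a square grid: slide a `window` x `window` box by `stride`
# and keep the maximum of each box.
def max_pool(grid, window=2, stride=2):
    out_dim = (len(grid) - window) // stride + 1
    return [[max(grid[r * stride + i][c * stride + j]
                 for i in range(window) for j in range(window))
             for c in range(out_dim)]
            for r in range(out_dim)]

# A 6x6 feature map with a strong vertical band pools down to 3x3,
# keeping the band's strongest activations:
conv_out = [[0, 0, 4, 4, 0, 0]] * 6
print(max_pool(conv_out))  # [[0, 4, 0], [0, 4, 0], [0, 4, 0]]
```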
Animation Speed: Controls how fast the step-by-step animation runs when you click "Run". Range: 50ms (very fast) to 1000ms (slow). Default: 200ms. You can adjust this even while the animation is running.
Mathematical Display: The bottom panel shows the mathematical formula for the current operation. During convolution, it shows the element-wise products and their sum. During pooling, it shows the values in the 2×2 window and the selected maximum.
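A formula string like the one in the bottom panel could be assembled as follows (the exact text format is an assumption, not the tutorial's markup):

```python
# Build a human-readable convolution formula from a 3x3 input window and
# kernel: all nine products, their sum, and the ReLU result.
def conv_formula(window, kernel):
    terms = [f"{window[i][j]}*{kernel[i][j]}"
             for i in range(3) for j in range(3)]
    total = sum(window[i][j] * kernel[i][j]
                for i in range(3) for j in range(3))
    return f"{' + '.join(terms)} = {total}, ReLU -> {max(0, total)}"

window = [[0, 0, 1], [0, 0, 1], [0, 0, 1]]   # right edge of a vertical line
kernel = [[0, 1, 0], [1, 2, 1], [0, 1, 0]]   # default edge detector
print(conv_formula(window, kernel))
```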
Buttons and Controls
The following are short descriptions of each control:
Load Pattern: Dropdown menu to select predefined patterns. Options include: Vertical Line, Horizontal Line, Cross, Square Box (hollow), Letter 'A', Letter 'X', Letter 'O', Triangle (hollow), Diamond (hollow), Plus Sign, Arrow, Checkmark, Random Noise, or Clear (blank grid). All patterns are designed using pixel art principles for clear recognition on the 8×8 grid. Selecting a pattern automatically resets the simulation and loads the new pattern.
Kernel: Dropdown menu to select the convolution kernel. Options: Edge Detector (default), Blur, Sharpen, Vertical Edge, or Horizontal Edge. Each kernel detects different types of features. Changing the kernel automatically updates the kernel display and resets the simulation.
Conv Stride: Dropdown to select convolution stride (1, 2, or 3). Controls how far the kernel moves between operations. Changing stride recalculates output dimensions and resets the simulation.
Pool Stride: Dropdown to select pooling stride (1 or 2). Controls how far the pooling window moves. Changing stride recalculates output dimensions and resets the simulation.
Step Back (❮): Moves backward one step in the animation. Useful for reviewing previous operations or understanding how a value was calculated. If at the beginning (step -1), this button has no effect.
Run (▶): Starts automatic animation of all operations. With the default strides, the animation steps through all 36 convolution steps, then all 9 pooling steps, and stops automatically at the end. Click "Pause" to stop manually.
Pause (❚❚): Stops the automatic animation. The current step is preserved, and you can continue with "Step Fwd" or "Step Back" manually, or click "Run" again to resume automatic animation.
Step Fwd (❯): Moves forward one step in the animation. Each step shows one operation: either one convolution calculation or one pooling calculation. The highlights and mathematical formulas update to show the current operation.
Reset: Clears all convolution and pooling outputs, resets the current step to -1 (idle state), and stops any running animation. The input pattern remains unchanged. Useful for starting over with the same input pattern.
Speed Slider: Controls the animation speed when using "Run". Range: 50ms (very fast) to 1000ms (slow). The value represents milliseconds between steps. You can adjust this even while animation is running.
Key Concepts and Implementation
This tutorial demonstrates the fundamental operations of Convolutional Neural Networks: Convolution and Max Pooling. Here are the key concepts:
Convolution Operation: Convolution is a mathematical operation that slides a small kernel (filter) across an input image, computing the dot product at each position. In this tutorial:
The 3×3 kernel slides across the 8×8 input
At each position, the kernel is element-wise multiplied with the overlapping 3×3 region
All products are summed to produce one output value
ReLU activation is applied: negative values become 0
With stride 1, this produces a 6×6 output (8-3+1 = 6 in each dimension)
Convolution extracts local features like edges, corners, and patterns. The kernel acts as a feature detector.
Kernel (Filter): The kernel is a small matrix of weights that defines what features to detect. In this tutorial, you can select from several predefined kernels:
Edge Detector: Emphasizes the center pixel and neighbors, detecting local patterns and edges
Blur: Averages neighboring pixels, creating a smoothing effect
Sharpen: Enhances edges by subtracting neighboring values from a weighted center
Vertical Edge: Specifically detects vertical edges by comparing left and right neighbors
Horizontal Edge: Specifically detects horizontal edges by comparing top and bottom neighbors
Each kernel produces different feature maps from the same input, demonstrating how different kernels extract different information. In real CNNs, kernels are learned during training, but here we use fixed kernels for demonstration.
Stride: Stride controls the step size of the sliding window. For convolution, stride determines how far the kernel moves between operations. For pooling, stride determines how far the pooling window moves. Higher stride values produce smaller output grids but reduce computation. The relationship is: output_size = floor((input_size - window_size)/stride) + 1. This tutorial allows you to experiment with different stride values to see how they affect the output dimensions.
Gradient Color Coding: The convolution and pooling output grids use a blue gradient to visualize activation strength:
Light blue (#f0f8ff) = low values (0-1) - weak activations
Medium blue (#50a8ff) = medium values (5-6) - moderate activations
Dark blue (#0044cc) = high values (10+) - strong activations
This visual encoding makes it immediately clear which regions of the feature map have the strongest responses to the input pattern.
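One plausible way to compute such a gradient is linear interpolation between the two hex anchors quoted above; the interpolation scheme and value cap are assumptions, not taken from the tutorial:

```python
# Map an activation value to a blue shade between #f0f8ff (weak) and
# #0044cc (strong), clamping values above v_max to the darkest shade.
def lerp(a, b, t):
    return round(a + (b - a) * t)

def value_to_color(v, v_max=10):
    t = max(0.0, min(1.0, v / v_max))          # clamp into [0, 1]
    light, dark = (0xF0, 0xF8, 0xFF), (0x00, 0x44, 0xCC)
    r, g, b = (lerp(lo, hi, t) for lo, hi in zip(light, dark))
    return f"#{r:02x}{g:02x}{b:02x}"

print(value_to_color(0))   # #f0f8ff (light blue, weak activation)
print(value_to_color(10))  # #0044cc (dark blue, strong activation)
```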
ReLU Activation: After convolution, ReLU (Rectified Linear Unit) is applied: ReLU(x) = max(0, x). This means:
Positive values pass through unchanged
Negative values become 0
ReLU introduces non-linearity, which is essential for neural networks to learn complex patterns. It also helps with gradient flow during training.
Max Pooling: Pooling is a downsampling operation that reduces the spatial dimensions of the feature map. Max pooling:
Divides the input into non-overlapping windows (2×2 in this case)
Selects the maximum value from each window
Places that maximum in the output
Reduces 6×6 to 3×3 (stride 2 means windows don't overlap)
Max pooling reduces computation, provides translation invariance (the feature can appear anywhere in the window), and preserves the strongest activations.
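Putting the two operations together reproduces the tutorial's default pipeline end to end: 8×8 input → 3×3 convolution with ReLU → 6×6 feature map → 2×2 max pooling with stride 2 → 3×3 output. This is a compact sketch, not the tutorial's own code:

```python
# Convolution with ReLU: slide a kxk kernel by `stride` over an nxn image.
def convolve(img, k, stride=1):
    n, m = len(img), len(k)
    d = (n - m) // stride + 1
    return [[max(0, sum(img[r*stride + i][c*stride + j] * k[i][j]
                        for i in range(m) for j in range(m)))
             for c in range(d)] for r in range(d)]

# Max pooling: slide a wxw window by `stride`, keep the max of each window.
def max_pool(grid, w=2, stride=2):
    d = (len(grid) - w) // stride + 1
    return [[max(grid[r*stride + i][c*stride + j]
                 for i in range(w) for j in range(w))
             for c in range(d)] for r in range(d)]

img = [[1 if c == 3 else 0 for c in range(8)] for r in range(8)]  # vertical line
edge = [[0, 1, 0], [1, 2, 1], [0, 1, 0]]                          # default kernel
fmap = convolve(img, edge)   # 6x6 feature map
pooled = max_pool(fmap)      # 3x3 pooled output
print(len(fmap), len(pooled))  # 6 3
print(pooled)  # the line survives pooling as a strong middle column
```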
Step-by-Step Visualization: The tutorial shows each operation individually:
Yellow highlights (semi-transparent) show the input region being processed (3×3 for convolution, 2×2 for pooling). The transparency allows you to see the underlying pattern.
Green highlights show the output cell being calculated
Mathematical formulas show the exact calculation being performed, including all 9 multiplication terms for convolution
Values appear in the output grids as they are calculated, with gradient color coding to show activation strength
Dynamic grid sizes adjust automatically when you change stride values
This step-by-step approach makes it easy to understand how each output value is computed.
Why These Operations Matter: Convolution and pooling are the building blocks of CNNs:
Convolution extracts local features (edges, textures, patterns) from raw pixels
Pooling reduces dimensionality, making the network more efficient and providing translation invariance
Together, they transform raw pixels into meaningful features that can be classified
In real CNNs, multiple layers of convolution and pooling create hierarchical features (simple edges → complex shapes → objects)
What to Observe: As you step through the operations, notice:
How the convolution kernel emphasizes certain patterns in the input
How different input patterns produce different convolution outputs
How max pooling preserves the strongest activations while reducing size
How the final 3×3 output captures essential features from the original 8×8 input
This demonstrates how CNNs automatically extract meaningful features from images without explicit programming.
NOTE: This tutorial focuses exclusively on the mechanics of Convolution and Max Pooling operations in CNNs. It is a pure visualization tool with no training logic—just step-by-step matrix operations with detailed mathematical explanations. The convolution kernels are predefined (not learned), and all operations are shown transparently. This makes it perfect for understanding the fundamental building blocks of CNNs before moving on to full networks with learnable parameters. In real CNNs, multiple convolution and pooling layers are stacked, and the kernels are learned during training to detect task-specific features. However, the core operations (convolution and pooling) remain the same. This tutorial demonstrates how these operations transform raw pixels into meaningful features that can be used for classification or other tasks.