This interactive tutorial demonstrates Clustering Algorithms, a fundamental family of unsupervised machine learning techniques used to partition data into meaningful groups. You'll explore and compare 7 different clustering algorithms side by side, observing how they behave differently on the same dataset.
The simulation visualizes K-Means, K-Medoids, Mini-batch K-Means, K-Modes, K-Prototypes, DBSCAN, and Mean Shift—each designed for different data types and use cases. Watch how centroids move, clusters form, and algorithms converge!
The Algorithms
Each algorithm uses a different approach to clustering:
- K-Means: Classic algorithm using Euclidean distance with mean-based centroids
- K-Medoids: Uses actual data points as cluster centers (more robust to outliers)
- Mini-batch K-Means: Faster variant using random samples per iteration
- K-Modes: Designed for categorical data using Hamming distance
- K-Prototypes: Handles mixed numerical and categorical data
- DBSCAN: Density-based algorithm that discovers clusters and detects outliers (no K required)
- Mean Shift: Non-parametric algorithm that finds clusters by seeking density peaks (no K required)
The Clustering Problem
Given a set of observations, partition them into K clusters such that points within the same cluster are more similar to each other than to points in other clusters:
Minimize: J = Σᵢ₌₁ᴷ Σₓ∈Cᵢ d(x, μᵢ)²
where:
- K is the number of clusters
- Cᵢ is the set of points in cluster i
- μᵢ is the centroid (center) of cluster i
- d(x, μᵢ) is the distance from point x to centroid μᵢ
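The objective J above is exactly the WCSS statistic shown in the simulation. As a minimal sketch (the `points`, `centroids`, and `clusterIndex` names are illustrative, not the simulation's actual API), it can be computed like this:

```javascript
// Squared Euclidean distance between two 2-D points
function squaredDist(a, b) {
  const dx = a.x - b.x, dy = a.y - b.y;
  return dx * dx + dy * dy;
}

// WCSS: sum of squared distances from each point to its assigned centroid
function wcss(points, centroids) {
  return points.reduce(
    (sum, p) => sum + squaredDist(p, centroids[p.clusterIndex]),
    0
  );
}
```

Lower WCSS means tighter clusters; each iteration of the K-Means family should decrease (or at worst preserve) this value.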
Usage Guide
Follow these steps to explore the clustering algorithms:
1. Select an Algorithm: Choose from the dropdown to see how different algorithms behave:
- K-Means: Watch centroids move to cluster means
- K-Medoids: Centroids snap to actual data points
- Mini-batch: Faster but jittery convergence
- K-Modes: Points snap to categorical grid
- K-Prototypes: Combines both position and shape/color
- DBSCAN: Watch ε-neighborhoods expand and discover clusters (Auto K)
- Mean Shift: See centroids shift toward density peaks and merge (Auto K)
2. Generate Data: Choose a data pattern and click "Generate":
- Gaussian Blobs: Well-separated clusters (best for K-Means)
- Moons: Two interleaved crescents (best for DBSCAN)
- Spiral: Intertwined spirals (best for DBSCAN)
- Varied Density: Clusters with different densities (test Mean Shift)
- Overlapping: Clusters that overlap (challenging case)
3. Algorithm-Specific Parameters:
- K (Clusters): For K-Means family - how many clusters to create
- ε (Radius): For DBSCAN - neighborhood radius (larger = more connected)
- MinPts: For DBSCAN - minimum neighbors to be a core point
- Bandwidth: For Mean Shift - kernel window size
4. Control Execution:
- Step Fwd: Execute one iteration forward
- Step Bwd: Go back to previous iteration
- Run: Auto-run until convergence
- Reset: Reinitialize centroids
5. Observe Convergence: Watch the WCSS (Within-Cluster Sum of Squares) decrease and the status change to "Converged" when the algorithm stabilizes.
Key Insight: Different algorithms excel in different scenarios. K-Means is fast but requires K. DBSCAN and Mean Shift automatically discover the number of clusters and can find non-spherical shapes. K-Modes handles categorical data that K-Means cannot.
Algorithm Details
K-Means (Lloyd's Algorithm)
The most popular clustering algorithm. It works by:
- Initialize: Place K centroids (using K-means++ for better initialization)
- Assign: Assign each point to the nearest centroid (Euclidean distance)
- Update: Move each centroid to the mean of its assigned points
- Repeat: Until assignments don't change (convergence)
μᵢ = (1/|Cᵢ|) Σₓ∈Cᵢ x
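The four steps above can be sketched as a single loop. This is a minimal teaching version, not the simulation's implementation: it seeds centroids from the first K points (plain initialization rather than K-means++, for brevity), and all names are illustrative.

```javascript
// Minimal Lloyd's algorithm over 2-D points {x, y}.
// Mutates each point's clusterIndex; returns the final centroids.
function kMeans(points, k, maxIter = 100) {
  let centroids = points.slice(0, k).map(p => ({ x: p.x, y: p.y }));
  const dist2 = (a, b) => (a.x - b.x) ** 2 + (a.y - b.y) ** 2;

  for (let iter = 0; iter < maxIter; iter++) {
    let changed = false;
    // Assign: each point goes to the nearest centroid
    for (const p of points) {
      let best = 0;
      for (let j = 1; j < k; j++) {
        if (dist2(p, centroids[j]) < dist2(p, centroids[best])) best = j;
      }
      if (p.clusterIndex !== best) { p.clusterIndex = best; changed = true; }
    }
    if (!changed && iter > 0) break; // converged: no reassignments
    // Update: move each centroid to the mean of its assigned points
    centroids = centroids.map((c, j) => {
      const members = points.filter(p => p.clusterIndex === j);
      if (members.length === 0) return c; // leave empty clusters in place
      return {
        x: members.reduce((s, p) => s + p.x, 0) / members.length,
        y: members.reduce((s, p) => s + p.y, 0) / members.length,
      };
    });
  }
  return centroids;
}
```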
Data Structure: Point-Centric Assignment
Rather than centroids maintaining lists of their assigned points, each point stores its own cluster index. This is more efficient because reassignment only requires updating one integer.
Points Array

// Each point knows which cluster it belongs to:
P0 { x: 12, y: 45, clusterIndex: 0 }  →  Cluster 0
P1 { x: 89, y: 23, clusterIndex: 2 }  →  Cluster 2
P2 { x: 34, y: 67, clusterIndex: 1 }  →  Cluster 1
P3 { x: 56, y: 12, clusterIndex: 0 }  →  Cluster 0
P4 { x: 78, y: 89, clusterIndex: 1 }  →  Cluster 1
...

To get all points for Centroid 0:
points.filter(p => p.clusterIndex === 0)  →  [P0, P3, ...]

To reassign P1 from Cluster 2 to Cluster 0:
P1.clusterIndex = 0; // That's it! No list manipulation
Why This Matters: During the Update phase, the algorithm filters points by their clusterIndex to compute the new centroid position as the mean of all assigned points.
Centroid Update: How Position Changes
After all points are assigned, each centroid moves to the center of mass of its assigned points:
Centroid Update Process
// K-Means: new position = mean of assigned points
c.x = clusterPoints.reduce((sum, p) => sum + p.x, 0) / clusterPoints.length;
c.y = clusterPoints.reduce((sum, p) => sum + p.y, 0) / clusterPoints.length;
Different algorithms update centroids differently:
| Algorithm | Update Rule | Result |
|---|---|---|
| K-Means | Mean of assigned points | Virtual point (may not be a data point) |
| K-Medoids | Point that minimizes total distance | Always an actual data point |
| Mini-batch | Incremental: μ ← μ(1-η) + x·η | Gradually moves toward samples |
| K-Modes | Mode (most frequent) per attribute | Most common category combination |
Convergence: The algorithm stops when no point changes its cluster assignment AND centroids stop moving.
K-Medoids (PAM - Partitioning Around Medoids)
Instead of using the mean, uses actual data points as cluster centers:
- Medoid: The most centrally located point in a cluster
- Advantage: More robust to outliers and noise
- Disadvantage: Computationally more expensive (O(n²))
mᵢ = argminₓ∈Cᵢ Σᵧ∈Cᵢ d(x, y)
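The argmin above can be sketched directly: for each candidate member, sum its distance to every other member and keep the cheapest. This is a minimal illustration of the medoid step only, not the full PAM swap procedure; names are illustrative.

```javascript
// Medoid of one cluster: the member point minimizing total distance
// to all other members. Returned value is always an actual data point.
function medoid(clusterPoints) {
  const dist = (a, b) => Math.hypot(a.x - b.x, a.y - b.y);
  let best = null, bestCost = Infinity;
  for (const candidate of clusterPoints) {
    const cost = clusterPoints.reduce((s, p) => s + dist(candidate, p), 0);
    if (cost < bestCost) { bestCost = cost; best = candidate; }
  }
  return best;
}
```

The double loop over members is where the O(n²) cost mentioned above comes from.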
Mini-Batch K-Means
A faster approximation of K-Means for large datasets:
- Uses small random samples (mini-batches) per iteration
- Updates centroids using online learning
- Trades accuracy for speed (useful for millions of points)
μᵢ ← μᵢ(1 - η) + x·η where η = 1/count
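The online update above is a one-liner per coordinate. A minimal sketch (the `counts` array, which tracks how many samples each centroid has absorbed so that η = 1/count decays over time, is an illustrative name):

```javascript
// One Mini-batch K-Means update: nudge centroid j toward a sampled point.
function miniBatchUpdate(centroid, sample, counts, j) {
  counts[j] += 1;
  const eta = 1 / counts[j];
  // μ ← μ(1 - η) + x·η — a convex step toward the sample
  centroid.x = centroid.x * (1 - eta) + sample.x * eta;
  centroid.y = centroid.y * (1 - eta) + sample.y * eta;
}
```

Because η shrinks as the count grows, early samples move the centroid a lot and later samples only fine-tune it, which is why convergence looks jittery but stabilizes.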
K-Modes (Categorical Data)
Designed specifically for categorical attributes:
- Distance: Hamming distance (count of attribute mismatches)
- Centroid: Mode (most frequent value) for each attribute
- Use case: Customer segmentation, survey data
d(x, y) = Σⱼ (xⱼ ≠ yⱼ) = number of mismatches
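Both K-Modes building blocks are short in code. A minimal sketch, with rows as arrays of categorical values (the names are illustrative):

```javascript
// Hamming distance: count of attribute positions where two rows disagree
function hamming(x, y) {
  return x.reduce((d, xi, j) => d + (xi !== y[j] ? 1 : 0), 0);
}

// K-Modes "centroid": the most frequent value in each column
function modePerAttribute(rows) {
  return rows[0].map((_, j) => {
    const counts = new Map();
    for (const row of rows) {
      counts.set(row[j], (counts.get(row[j]) || 0) + 1);
    }
    let best, bestCount = -1;
    for (const [value, count] of counts) {
      if (count > bestCount) { bestCount = count; best = value; }
    }
    return best;
  });
}
```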
K-Prototypes (Mixed Data)
Combines K-Means and K-Modes for mixed numerical/categorical data:
- Numerical: Uses Euclidean distance and mean
- Categorical: Uses Hamming distance and mode
- γ parameter: Controls the weight between numerical and categorical
d = (1-γ)·d_euclidean + γ·d_hamming
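The blended distance above translates directly to code. A minimal sketch assuming each point has numerical coordinates `x`, `y` plus a `cats` array of categorical values (illustrative names, not the simulation's actual shape):

```javascript
// K-Prototypes mixed distance: gamma weights the categorical part.
function mixedDistance(a, b, gamma) {
  const euclidean = Math.hypot(a.x - b.x, a.y - b.y);
  const hammingD = a.cats.reduce(
    (d, v, j) => d + (v !== b.cats[j] ? 1 : 0), 0
  );
  return (1 - gamma) * euclidean + gamma * hammingD;
}
```

With γ = 0 this degenerates to K-Means behavior; with γ = 1 it behaves like K-Modes.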
DBSCAN (Density-Based Spatial Clustering)
A density-based algorithm that doesn't require specifying K in advance:
- ε (Epsilon): Radius of the neighborhood around each point
- MinPts: Minimum points required to form a dense region
- Core Point: Has at least MinPts neighbors within ε
- Border Point: Within ε of a core point but has fewer than MinPts neighbors
- Noise Point: Not within ε of any core point (outlier)
Nε(p) = {q ∈ D | dist(p, q) ≤ ε}
The algorithm works by:
- Visit: Pick an unvisited point and find its ε-neighborhood
- Check: If |Nε(p)| ≥ MinPts, it's a core point → start new cluster
- Expand: Add all density-reachable points to the cluster
- Noise: Points not reachable from any core point are marked as noise
Advantage: DBSCAN can find arbitrarily shaped clusters and automatically identifies outliers!
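The visit/check/expand/noise steps above fit in a compact sketch. This is a teaching version with O(n²) neighbor queries, not the simulation's implementation; it returns a cluster index per point, with −1 marking noise.

```javascript
// Compact DBSCAN: labels[i] = cluster index, or -1 for noise.
function dbscan(points, eps, minPts) {
  const labels = new Array(points.length).fill(undefined);
  const dist = (a, b) => Math.hypot(a.x - b.x, a.y - b.y);
  const neighborsOf = (i) =>
    points.map((_, j) => j).filter(j => dist(points[i], points[j]) <= eps);

  let cluster = -1;
  for (let i = 0; i < points.length; i++) {
    if (labels[i] !== undefined) continue;             // already visited
    const seeds = neighborsOf(i);
    if (seeds.length < minPts) { labels[i] = -1; continue; } // noise, for now
    cluster += 1;                                      // i is a core point
    labels[i] = cluster;
    // Expand: grow the cluster through density-reachable points
    for (let s = 0; s < seeds.length; s++) {
      const q = seeds[s];
      if (labels[q] === -1) labels[q] = cluster;       // noise → border point
      if (labels[q] !== undefined) continue;
      labels[q] = cluster;
      const qNeighbors = neighborsOf(q);
      if (qNeighbors.length >= minPts) seeds.push(...qNeighbors); // q is core
    }
  }
  return labels;
}
```

Note how a point first labeled noise can later be promoted to a border point when a core point's neighborhood reaches it.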
DBSCAN Point Classification
- Core Point: ≥ MinPts neighbors
- Border Point: < MinPts neighbors, but near a core point
- Noise Point: isolated outlier
The ε-neighborhood (dashed circle) defines which points count as "neighbors".
Mean Shift (Mode-Seeking Algorithm)
A non-parametric algorithm that finds cluster centers by seeking density peaks:
- Bandwidth: Size of the kernel window (similar to ε in DBSCAN)
- Kernel: Gaussian kernel to weight nearby points
- Shift: Move centroid toward weighted mean of nearby points
- Merge: Combine centroids that converge to the same peak
m(x) = Σᵢ K(xᵢ - x) · xᵢ / Σᵢ K(xᵢ - x)
The algorithm works by:
- Initialize: Place seed centroids at data points
- Shift: Move each centroid toward weighted mean of points within bandwidth
- Iterate: Repeat until centroids stop moving (convergence)
- Merge: Combine centroids that are closer than bandwidth/2
- Assign: Assign each point to nearest final centroid
Advantage: Mean Shift automatically determines the number of clusters based on data density!
Mean Shift: Centroid Migration
Centroid shifts toward weighted mean of points within bandwidth
Points closer to centroid have higher weight (Gaussian kernel)
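One shift step, i.e. the weighted mean m(x) from the formula above, can be sketched as follows. This is a minimal illustration of a single centroid's move; the real algorithm iterates this for every seed and then merges nearby centroids.

```javascript
// One mean-shift step: move a centroid to the Gaussian-weighted mean
// of all points. Nearby points get weight ≈ 1; distant points ≈ 0.
function meanShiftStep(centroid, points, bandwidth) {
  let wx = 0, wy = 0, wSum = 0;
  for (const p of points) {
    const d2 = (p.x - centroid.x) ** 2 + (p.y - centroid.y) ** 2;
    const w = Math.exp(-d2 / (2 * bandwidth * bandwidth)); // Gaussian kernel
    wx += w * p.x;
    wy += w * p.y;
    wSum += w;
  }
  return { x: wx / wSum, y: wy / wSum };
}
```

Repeating this step moves the centroid uphill in the density estimate until it settles on a peak.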
Distance Metrics
| Metric | Formula | Data Type | Used By |
|---|---|---|---|
| Euclidean | √(Σ(xᵢ - yᵢ)²) | Numerical | K-Means, K-Medoids, Mini-batch, DBSCAN, Mean Shift |
| Hamming | Σ(xᵢ ≠ yᵢ) | Categorical | K-Modes |
| Mixed | (1-γ)·Euclidean + γ·Hamming | Mixed | K-Prototypes |
Visual Elements
Understanding the visualization:
- Point Shape: Represents categorical attribute 1 (circle, square, triangle, diamond)
- Point Border Color: Represents categorical attribute 2
- Point Fill Color: Shows cluster assignment
- Cross Markers: Centroids (cluster centers)
- Grid Lines: Category boundaries (shown in K-Modes)
Statistics Explained
- Iterations: Number of algorithm steps executed
- Status: "Running" during execution, "Converged" when stable
- WCSS: Within-Cluster Sum of Squares (lower is better)
- Cluster Sizes: Number of points in each cluster
Algorithm Comparison
| Algorithm | Speed | K Required? | Cluster Shape | Best For |
|---|---|---|---|---|
| K-Means | Fast | Yes | Spherical | General purpose |
| K-Medoids | Slow | Yes | Spherical | Noisy data with outliers |
| Mini-batch | Very Fast | Yes | Spherical | Large datasets |
| K-Modes | Fast | Yes | - | Categorical data |
| K-Prototypes | Medium | Yes | Mixed | Mixed numerical/categorical |
| DBSCAN | Medium | No | Arbitrary | Noisy data, outlier detection |
| Mean Shift | Slow | No | Arbitrary | Unknown cluster count |
K-Required vs Auto-K Algorithms
When should you use which type?
| Scenario | Recommended | Why |
|---|---|---|
| You know the number of clusters | K-Means | Fast, simple, deterministic cluster count |
| Clusters have arbitrary shapes | DBSCAN | Can find non-convex clusters (moons, spirals) |
| Data has outliers/noise | DBSCAN | Automatically identifies noise points |
| Unknown cluster count | Mean Shift | Discovers clusters based on density peaks |
| Clusters have varying density | Mean Shift | Handles density variations better than DBSCAN |
| Need interpretable centroids | K-Medoids | Centroids are actual data points |
| Categorical data | K-Modes | Uses Hamming distance, not Euclidean |
Limitations and Considerations
- K Selection: K-Means family requires choosing K beforehand. Use DBSCAN or Mean Shift to automatically discover the number of clusters.
- Initialization Sensitivity: K-Means results depend on initial centroid placement. K-means++ seeding reduces this sensitivity but does not eliminate it.
- Cluster Shape: K-Means assumes spherical clusters. Use DBSCAN or Mean Shift for arbitrary shapes.
- Scale Sensitivity: Features should be normalized for Euclidean distance to work properly.
- Parameter Tuning: DBSCAN requires good ε and MinPts values; Mean Shift needs appropriate bandwidth.
- Local Minima: Algorithms may converge to local optima, not global.
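The scale-sensitivity point above is easy to address before clustering. A minimal min-max normalization sketch (field names are illustrative):

```javascript
// Min-max normalization: map each feature into [0, 1] so that no single
// feature dominates the Euclidean distance.
function normalize(points) {
  const min = { x: Infinity, y: Infinity };
  const max = { x: -Infinity, y: -Infinity };
  for (const p of points) {
    min.x = Math.min(min.x, p.x); max.x = Math.max(max.x, p.x);
    min.y = Math.min(min.y, p.y); max.y = Math.max(max.y, p.y);
  }
  // Guard against constant features (zero range)
  return points.map(p => ({
    x: max.x === min.x ? 0 : (p.x - min.x) / (max.x - min.x),
    y: max.y === min.y ? 0 : (p.y - min.y) / (max.y - min.y),
  }));
}
```

Without a step like this, a feature measured in thousands (e.g. income) swamps one measured in single digits (e.g. household size) in every distance computation.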
Real-World Applications
- Customer Segmentation: Group customers by purchasing behavior
- Image Compression: Reduce colors by clustering pixels
- Anomaly Detection: Points far from any centroid may be outliers
- Document Clustering: Group similar documents/articles
- Gene Expression Analysis: Cluster genes with similar expression patterns
Clustering in the Deep Learning Era
Are these "classical" algorithms still relevant? Yes! Here's why:
| Use Case | Why Classical Clustering |
|---|---|
| Vector Quantization | K-Means used in VQ-VAE, codebook learning for image/audio |
| Embedding Preprocessing | Cluster embeddings before feeding to neural networks |
| Interpretability | Explainable results for medical, finance, legal applications |
| Edge/IoT Devices | Low compute requirements vs. neural networks |
| Quick Prototyping | Fast implementation, no training data needed |
Modern approaches often combine both:
- Deep Embedded Clustering (DEC): Autoencoders + K-Means jointly
- SCAN: Contrastive learning + clustering
- SwAV: Self-supervised learning using online K-Means
- Vector Databases (Pinecone, Milvus): K-Means/HNSW for similarity search
Bottom Line: K-Means is to clustering what linear regression is to prediction—fast, interpretable, and "good enough" for most real-world problems. Deep learning clustering is powerful but often overkill unless dealing with raw images/audio/text that need representation learning first.