
Clustering Algorithms Tutorial 

This interactive tutorial demonstrates Clustering Algorithms, a fundamental family of unsupervised machine learning techniques used to partition data into meaningful groups. You'll explore and compare 7 different clustering algorithms side by side, observing how they behave differently on the same dataset.

The simulation visualizes K-Means, K-Medoids, Mini-batch K-Means, K-Modes, K-Prototypes, DBSCAN, and Mean Shift—each designed for different data types and use cases. Watch how centroids move, clusters form, and algorithms converge!

The Algorithms

Each algorithm uses a different approach to clustering:

  • K-Means: Classic algorithm using Euclidean distance with mean-based centroids
  • K-Medoids: Uses actual data points as cluster centers (more robust to outliers)
  • Mini-batch K-Means: Faster variant using random samples per iteration
  • K-Modes: Designed for categorical data using Hamming distance
  • K-Prototypes: Handles mixed numerical and categorical data
  • DBSCAN: Density-based algorithm that discovers clusters and detects outliers (no K required)
  • Mean Shift: Non-parametric algorithm that finds clusters by seeking density peaks (no K required)

The Clustering Problem

Given a set of observations, partition them into K clusters such that points within the same cluster are more similar to each other than to points in other clusters:

Minimize: J = Σᵢ₌₁ᴷ Σₓ∈Cᵢ d(x, μᵢ)²

where:

  • K is the number of clusters
  • Cᵢ is the set of points in cluster i
  • μᵢ is the centroid (center) of cluster i
  • d(x, μᵢ) is the distance from point x to centroid μᵢ
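As a concrete check of the objective, here is a minimal JavaScript sketch that computes J for a given assignment, assuming points are stored as {x, y, clusterIndex} objects (the same shape used later in this tutorial):

```javascript
// Within-Cluster Sum of Squares: J = sum over all points of the
// squared Euclidean distance to their assigned cluster's centroid.
function wcss(points, centroids) {
  return points.reduce((total, p) => {
    const c = centroids[p.clusterIndex];
    const dx = p.x - c.x, dy = p.y - c.y;
    return total + dx * dx + dy * dy;
  }, 0);
}
```

Lower J means tighter clusters; the K-Means family tries to drive this number down on every iteration.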

 

[Interactive simulation: selecting an algorithm shows its description and objective above the clustering visualization (e.g., K-Means: "Partitions data into K clusters by minimizing within-cluster variance", J = Σᵢ Σₓ∈Cᵢ ||x - μᵢ||²). A Clustering Statistics panel tracks Iterations, Status, WCSS, and Cluster Sizes, and a Current Step panel prompts you to click Step Fwd or Run to start clustering.]

Usage Guide

Follow these steps to explore the clustering algorithms:

  1. Select an Algorithm: Choose from the dropdown to see how different algorithms behave:
    • K-Means: Watch centroids move to cluster means
    • K-Medoids: Centroids snap to actual data points
    • Mini-batch: Faster but jittery convergence
    • K-Modes: Points snap to categorical grid
    • K-Prototypes: Combines both position and shape/color
    • DBSCAN: Watch ε-neighborhoods expand and discover clusters (Auto K)
    • Mean Shift: See centroids shift toward density peaks and merge (Auto K)
  2. Generate Data: Choose a data pattern and click "Generate":
    • Gaussian Blobs: Well-separated clusters (best for K-Means)
    • Moons: Two interleaved crescents (best for DBSCAN)
    • Spiral: Intertwined spirals (best for DBSCAN)
    • Varied Density: Clusters with different densities (test Mean Shift)
    • Overlapping: Clusters that overlap (challenging case)
  3. Algorithm-Specific Parameters:
    • K (Clusters): For K-Means family - how many clusters to create
    • ε (Radius): For DBSCAN - neighborhood radius (larger = more connected)
    • MinPts: For DBSCAN - minimum neighbors to be a core point
    • Bandwidth: For Mean Shift - kernel window size
  4. Control Execution:
    • Step Fwd: Execute one iteration forward
    • Step Bwd: Go back to previous iteration
    • Run: Auto-run until convergence
    • Reset: Reinitialize centroids
  5. Observe Convergence: Watch the WCSS (Within-Cluster Sum of Squares) decrease and the status change to "Converged" when the algorithm stabilizes.

Key Insight: Different algorithms excel in different scenarios. K-Means is fast but requires K. DBSCAN and Mean Shift automatically discover the number of clusters and can find non-spherical shapes. K-Modes handles categorical data that K-Means cannot.

Algorithm Details

K-Means (Lloyd's Algorithm)

The most popular clustering algorithm. It works by:

  1. Initialize: Place K centroids (using K-means++ for better initialization)
  2. Assign: Assign each point to the nearest centroid (Euclidean distance)
  3. Update: Move each centroid to the mean of its assigned points
  4. Repeat: Until assignments don't change (convergence)
μᵢ = (1/|Cᵢ|) Σₓ∈Cᵢ x
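The four steps above can be condensed into a single Lloyd iteration. This is a simplified sketch for 2-D points in the {x, y, clusterIndex} shape used by this tutorial; the simulation's own implementation may differ:

```javascript
// One Lloyd iteration: assign each point to its nearest centroid,
// then move each centroid to the mean of its assigned points.
// Returns true if any assignment changed (i.e., not yet converged).
function kmeansStep(points, centroids) {
  let changed = false;
  // Assign phase: nearest centroid by squared Euclidean distance
  for (const p of points) {
    let best = 0, bestDist = Infinity;
    centroids.forEach((c, i) => {
      const d = (p.x - c.x) ** 2 + (p.y - c.y) ** 2;
      if (d < bestDist) { bestDist = d; best = i; }
    });
    if (p.clusterIndex !== best) { p.clusterIndex = best; changed = true; }
  }
  // Update phase: mean of assigned points (skip empty clusters)
  centroids.forEach((c, i) => {
    const members = points.filter(p => p.clusterIndex === i);
    if (members.length > 0) {
      c.x = members.reduce((s, p) => s + p.x, 0) / members.length;
      c.y = members.reduce((s, p) => s + p.y, 0) / members.length;
    }
  });
  return changed;
}
```

Calling kmeansStep in a loop until it returns false reproduces the Repeat-until-convergence behavior of step 4.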
Data Structure: Point-Centric Assignment

Rather than centroids maintaining lists of their assigned points, each point stores its own cluster index. This is more efficient because reassignment only requires updating one integer.

Points Array

// Each point knows which cluster it belongs to:
P0 { x: 12, y: 45, clusterIndex: 0 }  // → Cluster 0
P1 { x: 89, y: 23, clusterIndex: 2 }  // → Cluster 2
P2 { x: 34, y: 67, clusterIndex: 1 }  // → Cluster 1
P3 { x: 56, y: 12, clusterIndex: 0 }  // → Cluster 0
P4 { x: 78, y: 89, clusterIndex: 1 }  // → Cluster 1
...

To get all points for Centroid 0:
points.filter(p => p.clusterIndex === 0)  // → [P0, P3, ...]

To reassign P1 from Cluster 2 to Cluster 0:
P1.clusterIndex = 0; // That's it! No list manipulation

Why This Matters: During the Update phase, the algorithm filters points by their clusterIndex to compute the new centroid position as the mean of all assigned points.

Centroid Update: How Position Changes

After all points are assigned, each centroid moves to the center of mass of its assigned points:

[Figure: centroid update process, showing the centroid's old position before the update and its new position at the mean of its assigned points.]
// K-Means: new position = mean of assigned points
c.x = clusterPoints.reduce((sum, p) => sum + p.x, 0) / clusterPoints.length;
c.y = clusterPoints.reduce((sum, p) => sum + p.y, 0) / clusterPoints.length;

Different algorithms update centroids differently:

| Algorithm | Update Rule | Result |
| --- | --- | --- |
| K-Means | Mean of assigned points | Virtual point (may not be a data point) |
| K-Medoids | Point that minimizes total distance | Always an actual data point |
| Mini-batch | Incremental: μ ← μ(1-η) + x·η | Gradually moves toward samples |
| K-Modes | Mode (most frequent) per attribute | Most common category combination |

Convergence: The algorithm stops when no point changes its cluster assignment AND centroids stop moving.

K-Medoids (PAM - Partitioning Around Medoids)

Instead of using the mean, uses actual data points as cluster centers:

  • Medoid: The most centrally located point in a cluster
  • Advantage: More robust to outliers and noise
  • Disadvantage: Computationally more expensive (O(n²))
mᵢ = argminₓ∈Cᵢ Σᵧ∈Cᵢ d(x, y)
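The medoid formula above can be sketched directly. The nested loop over cluster members is exactly where the O(n²) cost comes from:

```javascript
// Medoid update: pick the member point that minimizes the total
// distance to every other point in the same cluster.
function findMedoid(clusterPoints) {
  let medoid = clusterPoints[0], bestCost = Infinity;
  for (const candidate of clusterPoints) {
    const cost = clusterPoints.reduce((sum, p) =>
      sum + Math.hypot(candidate.x - p.x, candidate.y - p.y), 0);
    if (cost < bestCost) { bestCost = cost; medoid = candidate; }
  }
  return medoid;
}
```

Because the result is always a real data point, an extreme outlier cannot drag the center far away the way it drags a mean.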
Mini-Batch K-Means

A faster approximation of K-Means for large datasets:

  • Uses small random samples (mini-batches) per iteration
  • Updates centroids using online learning
  • Trades accuracy for speed (useful for millions of points)
μᵢ ← μᵢ(1 - η) + x·η where η = 1/count
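The online update rule can be sketched as follows (a simplified version; `count`, the number of samples the centroid has absorbed so far, is an assumed bookkeeping field):

```javascript
// Online centroid update: each sampled point pulls its nearest
// centroid toward it with a learning rate eta that decays as 1/count.
function minibatchUpdate(centroid, samplePoint) {
  centroid.count = (centroid.count || 0) + 1;
  const eta = 1 / centroid.count;          // η = 1/count
  centroid.x = centroid.x * (1 - eta) + samplePoint.x * eta;
  centroid.y = centroid.y * (1 - eta) + samplePoint.y * eta;
}
```

With η = 1/count, the centroid is exactly the running mean of all samples it has seen, so repeated small batches approximate the full K-Means mean.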
K-Modes (Categorical Data)

Designed specifically for categorical attributes:

  • Distance: Hamming distance (count of attribute mismatches)
  • Centroid: Mode (most frequent value) for each attribute
  • Use case: Customer segmentation, survey data
d(x, y) = Σⱼ 1(xⱼ ≠ yⱼ) = number of mismatches
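Both building blocks, the Hamming distance and the mode centroid, are short enough to sketch (assuming each record is an array of categorical values):

```javascript
// Hamming distance: count of attribute positions where two
// categorical records disagree.
function hamming(a, b) {
  return a.reduce((d, v, j) => d + (v !== b[j] ? 1 : 0), 0);
}

// Mode centroid: the most frequent value in each attribute position
// across all records in the cluster.
function modeCentroid(records) {
  return records[0].map((_, j) => {
    const counts = {};
    for (const r of records) counts[r[j]] = (counts[r[j]] || 0) + 1;
    return Object.keys(counts).reduce((a, b) =>
      counts[a] >= counts[b] ? a : b);
  });
}
```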
K-Prototypes (Mixed Data)

Combines K-Means and K-Modes for mixed numerical/categorical data:

  • Numerical: Uses Euclidean distance and mean
  • Categorical: Uses Hamming distance and mode
  • γ parameter: Controls the weight between numerical and categorical
d = (1-γ)·d_euclidean + γ·d_hamming
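A sketch of the mixed distance, assuming each record carries numeric x, y fields plus a cats array of categorical values (this record shape is an assumption for illustration):

```javascript
// Mixed distance: weighted blend of Euclidean (numeric part) and
// Hamming (categorical part), per the formula above; gamma in [0, 1].
function mixedDistance(a, b, gamma) {
  const dEuclid = Math.hypot(a.x - b.x, a.y - b.y);
  const dHamming = a.cats.reduce(
    (d, v, j) => d + (v !== b.cats[j] ? 1 : 0), 0);
  return (1 - gamma) * dEuclid + gamma * dHamming;
}
```

At γ = 0 this degenerates to pure K-Means behavior, at γ = 1 to pure K-Modes.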
DBSCAN (Density-Based Spatial Clustering)

A density-based algorithm that doesn't require specifying K in advance:

  • ε (Epsilon): Radius of the neighborhood around each point
  • MinPts: Minimum points required to form a dense region
  • Core Point: Has at least MinPts neighbors within ε
  • Border Point: Within ε of a core point but has fewer than MinPts neighbors
  • Noise Point: Not within ε of any core point (outlier)
Nε(p) = {q ∈ D | dist(p, q) ≤ ε}

The algorithm works by:

  1. Visit: Pick an unvisited point and find its ε-neighborhood
  2. Check: If |Nε(p)| ≥ MinPts, it's a core point → start new cluster
  3. Expand: Add all density-reachable points to the cluster
  4. Noise: Points not reachable from any core point are marked as noise

Advantage: DBSCAN can find arbitrarily shaped clusters and automatically identifies outliers!
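The four steps can be condensed into a minimal DBSCAN sketch (brute-force O(n²) neighbor search for clarity; a label of -1 marks noise, and a point counts as its own neighbor):

```javascript
// Minimal DBSCAN. Returns a label per point: -1 = noise, 0.. = cluster id.
function dbscan(points, eps, minPts) {
  const labels = new Array(points.length).fill(undefined);
  const neighbors = i => points
    .map((_, j) => j)
    .filter(j => Math.hypot(points[i].x - points[j].x,
                            points[i].y - points[j].y) <= eps);
  let cluster = -1;
  for (let i = 0; i < points.length; i++) {
    if (labels[i] !== undefined) continue;          // already visited
    const seeds = neighbors(i);
    if (seeds.length < minPts) { labels[i] = -1; continue; } // noise (for now)
    cluster++;                                      // core point: new cluster
    labels[i] = cluster;
    const queue = [...seeds];
    while (queue.length > 0) {                      // expand density-reachable set
      const q = queue.shift();
      if (labels[q] === -1) labels[q] = cluster;    // noise becomes a border point
      if (labels[q] !== undefined) continue;
      labels[q] = cluster;
      const qNeighbors = neighbors(q);
      if (qNeighbors.length >= minPts) queue.push(...qNeighbors); // core: keep expanding
    }
  }
  return labels;
}
```

Note that K never appears: the number of clusters falls out of how many dense regions the expansion discovers.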

DBSCAN Point Classification (the ε-neighborhood is drawn as a dashed circle in the visualization; it defines which points count as "neighbors"):

  • Core Point: has ≥ MinPts neighbors within ε
  • Border Point: has < MinPts neighbors, but lies near a core point
  • Noise Point: an isolated outlier
Mean Shift (Mode-Seeking Algorithm)

A non-parametric algorithm that finds cluster centers by seeking density peaks:

  • Bandwidth: Size of the kernel window (similar to ε in DBSCAN)
  • Kernel: Gaussian kernel to weight nearby points
  • Shift: Move centroid toward weighted mean of nearby points
  • Merge: Combine centroids that converge to the same peak
m(x) = Σᵢ K(xᵢ - x) · xᵢ / Σᵢ K(xᵢ - x)

The algorithm works by:

  1. Initialize: Place seed centroids at data points
  2. Shift: Move each centroid toward weighted mean of points within bandwidth
  3. Iterate: Repeat until centroids stop moving (convergence)
  4. Merge: Combine centroids that are closer than bandwidth/2
  5. Assign: Assign each point to nearest final centroid

Advantage: Mean Shift automatically determines the number of clusters based on data density!
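Step 2, the shift itself, can be sketched for one centroid (using the Gaussian kernel described above; iterating this until the centroid stops moving implements step 3):

```javascript
// One mean-shift step: move a centroid to the Gaussian-weighted
// mean of the points, with weights falling off over the bandwidth.
function shiftStep(centroid, points, bandwidth) {
  let wx = 0, wy = 0, wSum = 0;
  for (const p of points) {
    const d2 = (p.x - centroid.x) ** 2 + (p.y - centroid.y) ** 2;
    const w = Math.exp(-d2 / (2 * bandwidth * bandwidth)); // Gaussian kernel
    wx += w * p.x; wy += w * p.y; wSum += w;
  }
  return { x: wx / wSum, y: wy / wSum };
}
```

Because nearby points dominate the weighted mean, repeated shifts climb the local density surface toward a peak.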

[Figure: Mean Shift centroid migration, before vs. after a shift. The centroid moves toward the weighted mean of the points within its bandwidth; points closer to the centroid carry higher weight (Gaussian kernel).]

Distance Metrics

| Metric | Formula | Data Type | Used By |
| --- | --- | --- | --- |
| Euclidean | √(Σⱼ(xⱼ - yⱼ)²) | Numerical | K-Means, K-Medoids, Mini-batch, DBSCAN, Mean Shift |
| Hamming | Σⱼ 1(xⱼ ≠ yⱼ) | Categorical | K-Modes |
| Mixed | (1-γ)·Euclidean + γ·Hamming | Mixed | K-Prototypes |

Visual Elements

Understanding the visualization:

  • Point Shape: Represents categorical attribute 1 (circle, square, triangle, diamond)
  • Point Border Color: Represents categorical attribute 2
  • Point Fill Color: Shows cluster assignment
  • Cross Markers: Centroids (cluster centers)
  • Grid Lines: Category boundaries (shown in K-Modes)

Statistics Explained

  • Iterations: Number of algorithm steps executed
  • Status: "Running" during execution, "Converged" when stable
  • WCSS: Within-Cluster Sum of Squares (lower is better)
  • Cluster Sizes: Number of points in each cluster

Algorithm Comparison

| Algorithm | Speed | K Required? | Cluster Shape | Best For |
| --- | --- | --- | --- | --- |
| K-Means | Fast | Yes | Spherical | General purpose |
| K-Medoids | Slow | Yes | Spherical | Noisy data with outliers |
| Mini-batch | Very fast | Yes | Spherical | Large datasets |
| K-Modes | Fast | Yes | - | Categorical data |
| K-Prototypes | Medium | Yes | Mixed | Mixed numerical/categorical |
| DBSCAN | Medium | No | Arbitrary | Noisy data, outlier detection |
| Mean Shift | Slow | No | Arbitrary | Unknown cluster count |

K-Required vs Auto-K Algorithms

When should you use which type?

| Scenario | Recommended | Why |
| --- | --- | --- |
| You know the number of clusters | K-Means | Fast, simple, deterministic cluster count |
| Clusters have arbitrary shapes | DBSCAN | Can find non-convex clusters (moons, spirals) |
| Data has outliers/noise | DBSCAN | Automatically identifies noise points |
| Unknown cluster count | Mean Shift | Discovers clusters based on density peaks |
| Clusters have varying density | Mean Shift | Handles density variations better than DBSCAN |
| Need interpretable centroids | K-Medoids | Centroids are actual data points |
| Categorical data | K-Modes | Uses Hamming distance, not Euclidean |

Limitations and Considerations

  • K Selection: K-Means family requires choosing K beforehand. Use DBSCAN or Mean Shift to automatically discover the number of clusters.
  • Initialization Sensitivity: K-Means results depend on initial centroid placement; K-means++ initialization reduces (but does not eliminate) this sensitivity.
  • Cluster Shape: K-Means assumes spherical clusters. Use DBSCAN or Mean Shift for arbitrary shapes.
  • Scale Sensitivity: Features should be normalized for Euclidean distance to work properly.
  • Parameter Tuning: DBSCAN requires good ε and MinPts values; Mean Shift needs appropriate bandwidth.
  • Local Minima: Algorithms may converge to local optima, not global.

Real-World Applications

  • Customer Segmentation: Group customers by purchasing behavior
  • Image Compression: Reduce colors by clustering pixels
  • Anomaly Detection: Points far from any centroid may be outliers
  • Document Clustering: Group similar documents/articles
  • Gene Expression Analysis: Cluster genes with similar expression patterns

Clustering in the Deep Learning Era

Are these "classical" algorithms still relevant? Yes! Here's why:

| Use Case | Why Classical Clustering |
| --- | --- |
| Vector Quantization | K-Means used in VQ-VAE, codebook learning for image/audio |
| Embedding Preprocessing | Cluster embeddings before feeding to neural networks |
| Interpretability | Explainable results for medical, finance, legal applications |
| Edge/IoT Devices | Low compute requirements vs. neural networks |
| Quick Prototyping | Fast implementation, no training data needed |

Modern approaches often combine both:

  • Deep Embedded Clustering (DEC): Autoencoders + K-Means jointly
  • SCAN: Contrastive learning + clustering
  • SwAV: Self-supervised learning using online K-Means
  • Vector Databases (Pinecone, Milvus): K-Means/HNSW for similarity search

Bottom Line: K-Means is to clustering what linear regression is to prediction—fast, interpretable, and "good enough" for most real-world problems. Deep learning clustering is powerful but often overkill unless dealing with raw images/audio/text that need representation learning first.