When you mix features on different scales in one dataset (for example Age in years and Income in dollars), a naive Euclidean distance between two points is dominated by the axis with the larger range. Algorithms like K-Nearest Neighbors or K-Means rely on that distance, so without scaling, one feature can effectively “drown out” the other. This tool shows what that distortion looks like and how two common scaling methods fix it.

Why scale?

Suppose Feature A (Age) ranges from 0 to 100 and Feature B (Income) from 20,000 to 500,000. A difference of 1 year and a difference of 1 dollar are treated the same in the formula d = √(ΔA² + ΔB²), so a small change in income can outweigh a large change in age. Visually, the plot is stretched along the income axis and distances look skewed. After scaling, both axes contribute on an equal footing.

Min-Max normalization

Map each feature into [0, 1] using the observed min and max:

x̅i = (xi − min) / (max − min)

All points lie in a unit square; the aspect ratio is equal and distances are balanced. The drawback: a single extreme outlier (e.g. one very high income) makes max huge, so every other point gets squashed toward 0 on that axis and the rest of the variation becomes hard to see.

Standardization (Z-score)

Center each feature at 0 and scale by its standard deviation (sample formula with n − 1):

zi = (xi − μ) / σ,  σ = √(∑j (xj − μ)² / (n − 1))

Values are expressed in “number of standard deviations from the mean.” On the plot, the origin is the mean; concentric circles at 1σ, 2σ, 3σ show how spread out the data is. Outliers sit far from the center, but the bulk of the points keep a readable, centered layout, so standardization is often more robust than min-max when outliers are present.

Euclidean distance

For two points p = (p1, p2) and q = (q1, q2), the Euclidean distance is

d = √((p1 − q1)² + (p2 − q2)²)

In Original mode you use raw Age and Income, so the number is in mixed units.
In Min-Max or Standardization mode you use the transformed coordinates, so the distance is dimensionless and comparable. Click two points in the simulator to see how the same pair’s distance value changes with the chosen mode.
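The two transforms and the distance comparison can be sketched in a few lines of Python (a minimal illustration with made-up numbers, not the simulator’s actual code):

```python
import math

def min_max(values):
    """Map values into [0, 1] using the observed min and max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center at the mean, divide by the sample standard deviation (n - 1)."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / (n - 1))
    return [(v - mu) / sigma for v in values]

# Made-up Age/Income samples on very different scales.
ages = [25, 30, 45, 60, 70]
incomes = [30_000, 120_000, 60_000, 250_000, 90_000]

# Euclidean distance between the first two points in each coordinate system.
dists = {}
for name, xs, ys in [
    ("original", ages, incomes),
    ("min-max", min_max(ages), min_max(incomes)),
    ("z-score", z_score(ages), z_score(incomes)),
]:
    dists[name] = math.dist((xs[0], ys[0]), (xs[1], ys[1]))
    print(f"{name:>8}: d = {dists[name]:.3f}")
```

In the original coordinates the distance is dominated by the 90,000-dollar income gap (the 5-year age gap barely registers); after either transform, both features contribute on the same order of magnitude.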
Usage

Data preset: At the top of the control panel, choose a dataset, from Extreme (Age vs Income) through Strong, Moderate, and Mild, to Similar scales (0–100). Each preset uses different Age and Income ranges, so you can see how normalization behaves when the axis ranges differ a lot (Extreme) versus when they are comparable (Mild/Similar).

Transformation mode: Switch between Original (Unscaled), Min-Max [0, 1], and Standardization (Z-score). The points animate to their new positions. In Original mode the axes use the raw ranges (so the aspect ratio can be distorted); use X–Y same scale (top-left of the first canvas) to show equal pixels per unit on both axes. In Min-Max mode both features lie in [0, 1]; in Z-score mode the origin is the mean and the grid shows 1, 2, and 3 standard deviations.

Distance: Click one point, then another, to see the Euclidean distance in the current coordinate system (shown on the connecting line and in the sidebar). In Original mode the distance is in mixed units; in Min-Max or Z-score mode it is dimensionless.

Add extreme outlier: Inserts a “Millionaire Toddler” (Age 2, Income $5,000,000), drawn in red. In Min-Max mode the new maximum income squashes the other points; in Standardization the outlier sits far from the center but the rest keep a readable spread.

Reset data: Regenerates 20 random points using the current data preset and clears the outlier and the selection.

Algorithm perspective: The second canvas shows distance-based behavior in four modes; each needs one selected point.
- 3-Nearest Neighbors: shows the selected point’s neighbors (green lines).
- Distance heatmap: shows the distance field around the selected point (options: distance contrast, true scale).
- Learning path (gradient descent): shows the path toward the minimum (learning-rate slider).
- Attention weights (Softmax): shows softmax attention from the selected point as the query; the Temperature slider sharpens (low) or softens (high) the distribution. Subplots at the top-left of each canvas compare Original (saturated) vs Normalized (balanced) attention.
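Two of the second canvas’s modes, 3-Nearest Neighbors and softmax attention, can be sketched as follows (hypothetical helper names and toy data; the tool’s internals may differ):

```python
import math

def k_nearest(query, points, k=3):
    """Indices of the k points closest to the query (Euclidean distance)."""
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    return order[:k]

def softmax_attention(query, points, temperature=1.0):
    """Softmax over negative distances: closer points get more weight.

    A low temperature sharpens the distribution toward the nearest
    point; a high temperature flattens it toward uniform.
    """
    scores = [-math.dist(query, p) / temperature for p in points]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

pts = [(0.10, 0.20), (0.15, 0.25), (0.50, 0.50), (0.90, 0.80)]
q = (0.12, 0.22)

neighbors = k_nearest(q, pts)                       # the three closest points
sharp = softmax_attention(q, pts, temperature=0.1)  # weight piles on the nearest
soft = softmax_attention(q, pts, temperature=10.0)  # weight spreads out
```

Because both helpers are driven purely by distance, running them on unscaled data reproduces the same distortion the first canvas shows: the income axis decides everything.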
Key insight

In Min-Max mode, a single extreme value “steals” the scale, so the rest of the data loses resolution. In Standardization, the scale is set by the standard deviation of the group, so the majority of points stay well separated and the outlier is simply far from the center.
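The contrast can be reproduced numerically (illustrative numbers only, echoing the “extreme outlier” button):

```python
# Illustrative incomes: a tight group plus one extreme outlier.
incomes = [30_000, 35_000, 40_000, 45_000, 50_000, 5_000_000]

# Min-max: the outlier owns the scale and the group is squashed near 0.
lo, hi = min(incomes), max(incomes)
minmax = [(x - lo) / (hi - lo) for x in incomes]
group_span = max(minmax[:5]) - min(minmax[:5])  # ~0.004, i.e. 0.4% of [0, 1]

# Z-score: the outlier lands about 2 standard deviations above the mean,
# while the group sits together a fraction of a sigma below it.
n = len(incomes)
mu = sum(incomes) / n
sigma = (sum((x - mu) ** 2 for x in incomes) / (n - 1)) ** 0.5
zscores = [(x - mu) / sigma for x in incomes]
```

Both transforms are linear; the difference is where the scale comes from — the full observed range, outlier included, for min-max, versus the standard deviation for z-score.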