Web Simulator | ShareTechnote

Transformer Scaled Dot-Product Attention (USE Embeddings)

This interactive simulator visualizes Scaled Dot-Product Attention using real 512-dimensional word embeddings from the Universal Sentence Encoder (USE). Words are shown in a circle; you click a word to set it as the Query (Q). The table then shows Q·K/√d_k, Softmax %, and the magnitude of each word’s weighted Value contribution (‖w·V‖).

Formula:

Attention(Q, K, V) = softmax((Q·K^T / √d_k) / τ) V

W_q and W_k are identity-style projections (the first 64 USE dimensions are preserved), so Q·K reflects USE semantic similarity; for example, “cat” attends more to “sat” and “mat”. W_v is a fixed random matrix. τ is the temperature: low values (slider left) sharpen the peaks; high values (slider right) flatten the weights toward uniform. d_k = 64.
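
The formula above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the simulator's actual code: the function names `softmax` and `attention` and the toy 4-dimensional vectors are hypothetical stand-ins for the 64-dimensional Q, K, V used on this page.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, keys, values, tau=1.0):
    """softmax((q·k / sqrt(d_k)) / tau) applied to the value vectors, for one query."""
    d_k = len(q)
    scores = [
        sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) / tau
        for k in keys
    ]
    weights = softmax(scores)
    dim_v = len(values[0])
    # Weighted sum of value vectors: one output vector for this query.
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return weights, output

# Toy data (hypothetical): one query, three keys, three 2-dim values.
q = [1.0, 0.0, 1.0, 0.0]
keys = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

weights, output = attention(q, keys, values, tau=1.0)
print([round(w, 3) for w in weights], [round(o, 3) for o in output])
```

The key most aligned with the query (here the first one) receives the largest weight, which is the behaviour the beams and table visualize.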

Sections

Simulation
How to use
What you’re seeing
Technical note
Limitations

Simulation

The interactive simulator is below. Use the controls to explore the concepts described above.

Input words (space or comma separated):

Head type:

Temperature (τ): 5.00

Click a word to set it as Query word

Note: open this page over http:// (e.g. from a local server), not file://.

Scaled dot-product: softmax(Q·K^T / √d_k) — Query (Q) in gold, Keys (K) in cyan

Mask self-attention

Word	Role	Formula	Q·K / √d_k	Softmax %	Formula (‖w·V‖)	‖w·V‖

Embedding (E) — 512 dims, 1D heatmap per word

E + PE (Embedding + Positional Encoding) — 512 dims, 1D heatmap per word

Weight matrices W_Q, W_K, W_V (512×64) — Q = X·W_Q, K = X·W_K, V = X·W_V, X = E+PE

W_Q

W_K

W_V

Query vectors (Q) — one row per word, 64 dims (Q = X·W_Q, X = E+PE)

Key vectors (K) — one row per word, 64 dims (K = X·W_K, X = E+PE)

Value vectors (V) — one row per word, 64 dims (V = X·W_V, X = E+PE)

Weighted value (w·V) — one row per word, 64 dims (w·V = softmax weight × V for current Query)

Z_head = w · V, 64 dims

Transformer pipeline: head aggregation → final Z

In a full Transformer, 8 heads each produce a 64D vector; they are concatenated (8×64 = 512) then projected by W_O to get the final 512D Z. This sim uses 8 heads. Per-head: Z_head = w · V (attention weights × Values); concat → Z_concat → × W_O → Z_final; residual: X + Z_final.

Head 1

Head 2

Head 3

Head 4

Head 5

Head 6

Head 7

Head 8

Z_concat

W_O (Output projection) 512×512

Z (contextualized output) = Z_concat × W_O = Z_final

Show residual (X+Z)

Query word input (E+PE) = X

Residual output (X+Z_final)
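
The head-aggregation pipeline described in this section (Z_head = w · V per head, concat, × W_O, then the residual X + Z_final) can be sketched as follows. This is a minimal illustration under assumptions: the attention weights are uniform placeholders, and the Value vectors, W_O, and X are random stand-ins rather than the simulator's real data. The helper names `head_output` and `aggregate_heads` are hypothetical.

```python
import random

N_HEADS, D_HEAD = 8, 64
D_MODEL = N_HEADS * D_HEAD  # 8 x 64 = 512

def head_output(weights, values):
    """Z_head = sum_i w_i * V_i (attention weights times Value vectors)."""
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

def aggregate_heads(head_outputs, W_O, x):
    """Concatenate heads, project with W_O, then add the residual input x."""
    z_concat = [val for z in head_outputs for val in z]
    z_final = [
        sum(z_concat[i] * W_O[i][j] for i in range(len(z_concat)))
        for j in range(len(W_O[0]))
    ]
    residual = [xi + zi for xi, zi in zip(x, z_final)]
    return z_concat, z_final, residual

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_words = 6

heads = []
for _ in range(N_HEADS):
    weights = [1.0 / n_words] * n_words  # placeholder: uniform attention
    values = [[rng.gauss(0.0, 1.0) for _ in range(D_HEAD)] for _ in range(n_words)]
    heads.append(head_output(weights, values))

# Random stand-ins for the 512x512 output projection and the query input X.
W_O = [[rng.gauss(0.0, D_MODEL ** -0.5) for _ in range(D_MODEL)] for _ in range(D_MODEL)]
x = [rng.gauss(0.0, 1.0) for _ in range(D_MODEL)]

z_concat, z_final, residual = aggregate_heads(heads, W_O, x)
print(len(z_concat), len(z_final), len(residual))
```

Each head contributes one 64-dimensional slice of Z_concat, so the first 64 entries of `z_concat` are exactly the output of Head 1.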

How to use

Input words: Type or edit words (e.g. "The cat sat on the mat"). Separate by spaces or commas.
Reset: Re-embed the current input and redraw the circle. Use after changing the text.
Click a word: Set that word as the Query (Q). The canvas shows attention beams (Q→K) and the table fills with Q·K/√d_k, Softmax %, and ‖w·V‖ (magnitude of each word’s weighted Value contribution).
Temperature (τ): Lower values (e.g. 0.2–0.5) sharpen attention so one word dominates; higher values (e.g. 2–5) make weights more uniform. Formula: softmax((Q·K/√d_k) / τ).
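
To see what the temperature slider does numerically, the sketch below applies softmax((Q·K/√d_k) / τ) to one fixed set of scores at three temperatures. The scores are made-up values chosen for illustration, and `softmax_with_temperature` is a hypothetical helper name, not a function from the simulator.

```python
import math

def softmax_with_temperature(scores, tau):
    """Softmax of scores / tau; smaller tau sharpens, larger tau flattens."""
    scaled = [s / tau for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical Q·K/√d_k values for four words.
scores = [1.2, 0.4, 0.1, -0.3]

for tau in (0.2, 1.0, 5.0):
    weights = softmax_with_temperature(scores, tau)
    print(f"tau={tau}: " + ", ".join(f"{w:.1%}" for w in weights))
```

At τ = 0.2 nearly all the weight lands on the highest-scoring word, while at τ = 5 the four weights differ by only a few percentage points.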

What you’re seeing

Circle: Words from your input, laid out in a circle. Gold = current Query (Q); cyan = Keys (K).
Beams: Attention from the Query to each word; thickness and opacity follow the softmax weight.
Table: For the chosen Query, one row per word: Role (Q or K), raw score Q·K/√d_k, Softmax %, and ‖w·V‖ (weight × L2 norm of that word’s Value vector). The bar column reflects Softmax %.
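
The table columns can be reproduced with a short computation. The sketch below is an assumption-laden illustration: the words, the toy Q/K/V vectors, and the helper name `table_rows` are hypothetical, and the Role column is assumed to be "Q" for the selected word and "K" for every other word.

```python
import math

def table_rows(words, query_index, Q, K, V, tau=1.0):
    """Build one row per word: role, raw score, softmax %, and ‖w·V‖."""
    d_k = len(Q[0])
    q = Q[query_index]
    # Raw score column: Q·K / sqrt(d_k), before temperature.
    raw = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
    scaled = [s / tau for s in raw]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    rows = []
    for i, word in enumerate(words):
        w = exps[i] / total
        v_norm = math.sqrt(sum(x * x for x in V[i]))
        rows.append({
            "word": word,
            "role": "Q" if i == query_index else "K",  # assumed labelling
            "score": raw[i],
            "softmax_pct": 100.0 * w,
            "wv_norm": w * v_norm,  # weight × L2 norm of the Value vector
        })
    return rows

# Toy vectors (hypothetical); the simulator uses 64-dim Q, K, V.
words = ["the", "cat", "sat"]
Q = K = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
V = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]

rows = table_rows(words, query_index=1, Q=Q, K=K, V=V)
for r in rows:
    print(f'{r["word"]:>4} {r["role"]} {r["score"]:.3f} '
          f'{r["softmax_pct"]:.1f}% {r["wv_norm"]:.3f}')
```

Because ‖w·V‖ multiplies the softmax weight by the Value norm, a word with a small weight but a large Value vector can still show a sizeable contribution.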

Technical note

Embeddings X come from USE (512-d). Wq and Wk are the same projection onto the first 64 dimensions, so Q and K reflect semantic similarity in USE space. Wv is a fixed random matrix. This simulator does not use trained transformer weights; it illustrates the attention equation with real embeddings and synthetic Wq, Wk, Wv.
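
One way to build projections like those described above is sketched here. It is not the simulator's code: the Gaussian distribution, the 1/√512 scale, and the seed for W_v are assumptions (the page only says W_v is "fixed random"), the input vector is a random stand-in for a USE embedding, and the helper names are hypothetical.

```python
import random

D_MODEL, D_K = 512, 64

def identity_style_projection(d_model=D_MODEL, d_k=D_K):
    """d_model x d_k matrix with ones on the diagonal: keeps the first d_k dims."""
    return [[1.0 if row == col else 0.0 for col in range(d_k)]
            for row in range(d_model)]

def fixed_random_projection(d_model=D_MODEL, d_k=D_K, seed=0):
    """Seeded random matrix; distribution and scale are assumptions."""
    rng = random.Random(seed)
    scale = 1.0 / d_model ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(d_k)] for _ in range(d_model)]

def project(x, W):
    """Row vector times matrix: x (d_model) · W (d_model x d_k) -> d_k."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

W_q = identity_style_projection()
W_k = W_q                      # same projection for queries and keys
W_v = fixed_random_projection(seed=0)

rng = random.Random(1)
x = [rng.uniform(-1.0, 1.0) for _ in range(D_MODEL)]  # stand-in for a USE embedding

q = project(x, W_q)  # equals the first 64 entries of x
v = project(x, W_v)
print(len(q), len(v))
```

With this construction Q·K for two words is simply the dot product of the first 64 dimensions of their input vectors, which is why the scores track similarity in the embedding space.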

Limitations

Synthetic projections, not trained weights. W_q and W_k are the same identity-style projection onto the first 64 USE dimensions, and W_v is a fixed random matrix. A real attention layer learns all three, so the attention pattern here reflects raw USE similarity, not what a trained model would attend to.
Single head, single query. One head is shown and you inspect one Query word at a time; the full token×token attention matrix and multi-head behaviour are not displayed.
USE sentence embeddings as token vectors. Words are embedded with the Universal Sentence Encoder rather than a Transformer's own token embeddings, so the input space differs from an actual model's.
No masking or positional encoding. All words attend to all words; causal masking and position information are not modeled (repeated words like the two “the”s are indistinguishable by position).
Temperature is a teaching aid. The τ slider illustrates softmax sharpness; standard scaled dot-product attention uses τ = 1.
Teaching tool. Built to make scaled dot-product attention tangible with real embeddings, not to reproduce a trained attention layer.