Web Simulation

Transformer Scaled Dot-Product Attention (USE Embeddings) 

This interactive simulator visualizes Scaled Dot-Product Attention using real 512-dimensional word embeddings from the Universal Sentence Encoder (USE). Words are shown in a circle; you click a word to set it as the Query (Q). The table then shows Q·K/√dk, Softmax %, and the magnitude of each word’s weighted Value contribution (‖w·V‖).

Formula: Attention(Q, K, V) = softmax((Q·Kᵀ / √dk) / τ) V. Wq and Wk are identity-style (first 64 dims preserved), so Q·K reflects USE semantic similarity—e.g. “cat” attends more to “sat” and “mat”; Wv is a fixed random matrix. τ = temperature: low (slider left) sharpens peaks; high (slider right) makes weights uniform. dk = 64.
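As a sketch, the attention computation above can be written in a few lines of NumPy. This is an illustrative reimplementation, not the simulator's own code:

```python
import numpy as np

def attention(Q, K, V, tau=1.0):
    """Scaled dot-product attention with a temperature knob.
    Q, K, V: (n_words, d_k) arrays; tau: the temperature slider value."""
    d_k = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d_k) / tau       # (Q·Kᵀ / √dk) / τ
    # numerically stable softmax over the key axis
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, w                               # weighted values, attention weights
```

Each row of `w` sums to 1; the table's ‖w·V‖ entry for word i under a given Query q corresponds to `w[q, i] * np.linalg.norm(V[i])`.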

Input words (space or comma separated):
Head type:
Temperature (τ): 5.00
Click a word to set it as the Query word
Use http:// (e.g. local server), not file://.

Scaled dot-product: softmax(Q·KT / √dk) — Query (Q) in gold, Keys (K) in cyan

Word | Role | Formula | Q·K / √dk | Softmax % | Formula (‖w·V‖) | ‖w·V‖

Embedding (E) — 512 dims, 1D heatmap per word

E + PE (Embedding + Positional Encoding) — 512 dims, 1D heatmap per word

Weight matrices W_Q, W_K, W_V (512×64) — Q = X·W_Q, K = X·W_K, V = X·W_V, X = E+PE

W_Q
W_K
W_V

Query vectors (Q) — one row per word, 64 dims (Q = X·W_Q, X = E+PE)

Key vectors (K) — one row per word, 64 dims (K = X·W_K, X = E+PE)

Value vectors (V) — one row per word, 64 dims (V = X·W_V, X = E+PE)

Weighted value (w·V) — one row per word, 64 dims (w·V = softmax weight × V for current Query)

Z_head = w · V for the current Query, 64 dims

Transformer pipeline: head aggregation → final Z

In a full Transformer, 8 heads each produce a 64D vector; they are concatenated (8×64 = 512) then projected by W_O to get the final 512D Z. This sim uses 8 heads. Per-head: Z_head = w · V (attention weights × Values); concat → Z_concat → × W_O → Z_final; residual: X + Z_final.
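The aggregation step can be sketched as follows (shapes only: the per-head w·V outputs and W_O here are random stand-ins, not trained weights):

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d_model, d_head, n_heads = 6, 512, 64, 8

X = rng.normal(size=(n_words, d_model))            # E + PE
W_O = rng.normal(size=(d_model, d_model)) * 0.01   # output projection (random stand-in)

# each head contributes one 64-d vector per word (random stand-ins for w·V)
Z_heads = [rng.normal(size=(n_words, d_head)) for _ in range(n_heads)]

Z_concat = np.concatenate(Z_heads, axis=-1)        # (n_words, 8×64 = 512)
Z_final = Z_concat @ W_O                           # project back to 512 dims
residual = X + Z_final                             # residual connection
```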

Head 1
Head 2
Head 3
Head 4
Head 5
Head 6
Head 7
Head 8
Z_concat

W_O (Output projection) 512×512

Z (contextualized output) = Z_concat × W_O = Z_final
Query word input (E+PE) = X
Residual output (X+Z_final)

 

How to use

  • Input words: Type or edit words (e.g. "The cat sat on the mat"). Separate by spaces or commas.
  • Reset: Re-embed the current input and redraw the circle. Use after changing the text.
  • Click a word: Set that word as the Query (Q). The canvas shows attention beams (Q→K) and the table fills with Q·K/√dk, Softmax %, and ‖w·V‖ (magnitude of each word’s weighted Value contribution).
  • Temperature (τ): Lower values (e.g. 0.2–0.5) sharpen attention so one word dominates; higher values (e.g. 2–5) make weights more uniform. Formula: softmax((Q·K/√dk) / τ).
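The temperature effect in the last bullet can be checked numerically. The raw scores below are made-up Q·K/√dk values, not outputs of USE:

```python
import numpy as np

def softmax_tau(scores, tau):
    """Softmax of scores / tau, numerically stable."""
    z = scores / tau
    e = np.exp(z - z.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5, 0.1])   # hypothetical raw attention scores
sharp = softmax_tau(scores, 0.2)          # slider left: one word dominates
flat = softmax_tau(scores, 5.0)           # slider right: near-uniform weights
```

With τ = 0.2 the top score soaks up nearly all the probability mass, while with τ = 5.0 the four weights end up close to 1/4 each.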

What you’re seeing

  • Circle: Words from your input, laid out in a circle. Gold = current Query (Q); cyan = Keys (K).
  • Beams: Attention from the Query to each word; thickness and opacity follow the softmax weight.
  • Table: For the chosen Query, one row per word: Role (Q or K), raw score Q·K/√dk, Softmax %, and ‖w·V‖ (weight × L2 norm of that word’s Value vector). The bar column reflects Softmax %.

Technical note

Embeddings X come from USE (512-d). Wq and Wk are the same projection onto the first 64 dimensions, so Q and K reflect semantic similarity in USE space. Wv is a fixed random matrix. This simulator does not use trained transformer weights; it illustrates the attention equation with real embeddings and synthetic Wq, Wk, Wv.
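A minimal sketch of these synthetic projections, assuming a seeded random Wv (the simulator's exact random initialization and scaling may differ):

```python
import numpy as np

d_in, d_k = 512, 64
rng = np.random.default_rng(42)           # seed is an arbitrary choice

# identity-style projection: keep the first 64 of the 512 dims
W_qk = np.zeros((d_in, d_k))
W_qk[:d_k, :] = np.eye(d_k)

# fixed random Value projection (scale choice is illustrative)
W_v = rng.normal(size=(d_in, d_k)) / np.sqrt(d_in)

X = rng.normal(size=(3, d_in))            # stand-in for USE embeddings + PE
Q, K, V = X @ W_qk, X @ W_qk, X @ W_v
# Q·Kᵀ reduces to the dot product of the first 64 embedding dims,
# so the scores track similarity in USE space
```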