Transformer Scaled Dot-Product Attention (USE Embeddings) 

This interactive simulator visualizes Scaled Dot-Product Attention using real 512-dimensional word embeddings from the Universal Sentence Encoder (USE). Words are shown in a circle; you click a word to set it as the Query (Q). The table then shows Q·K/√dk, Softmax %, and the magnitude of each word’s weighted Value contribution (‖w·V‖).

Formula:

Attention(Q, K, V) = softmax((Q·Kᵀ / √dk) / τ) · V

Wq and Wk are identity-style projections that keep the first 64 USE dimensions, so Q·K reflects USE semantic similarity; for example, “cat” tends to attend more to “sat” and “mat”. Wv is a fixed random matrix. τ is the temperature: low values (slider left) sharpen the peaks; high values (slider right) push the weights toward uniform. dk = 64.
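The formula above can be sketched for a single Query in plain Python. This is a minimal illustration, not the simulator's code: the function names and the toy 4-dimensional vectors are assumptions chosen for readability (the simulator works with dk = 64).

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, keys, values, tau=1.0):
    """Single-query attention: softmax((q·k / sqrt(d_k)) / tau) applied to values."""
    d_k = len(q)
    # Raw scores: one scaled dot product per key.
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    # Temperature divides the scores before the softmax.
    weights = softmax([s / tau for s in scores])
    # Output: weighted sum of the value vectors.
    dim_v = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return scores, weights, output

# Toy vectors (hypothetical, 4-d keys and 2-d values).
q = [1.0, 0.0, 1.0, 0.0]
keys = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

scores, weights, output = attention(q, keys, values, tau=1.0)
print(scores)
print(weights)
print(output)
```

With τ = 1 this reduces to standard scaled dot-product attention; the key most aligned with the Query receives the largest weight.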

Simulation

The interactive simulator is below. Use the controls to explore the concepts described above.

Controls: input words (space or comma separated), head type, and temperature (τ, shown as 5.00 on load). Click a word to set it as the Query word.
Note: serve the page over http:// (e.g. a local server), not file://.

Scaled dot-product: softmax(Q·Kᵀ / √dk). Query (Q) in gold, Keys (K) in cyan.

Table columns: Word | Role | Formula | Q·K / √dk | Softmax % | Formula (‖w·V‖) | ‖w·V‖

Embedding (E) — 512 dims, 1D heatmap per word

E + PE (Embedding + Positional Encoding) — 512 dims, 1D heatmap per word
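The page does not say which positional encoding the simulator adds. As a sketch only, assuming the sinusoidal encoding from the original Transformer, X = E + PE could be computed as follows; `sinusoidal_pe` and `add_positional_encoding` are hypothetical names.

```python
import math

def sinusoidal_pe(position, d_model=512):
    """Sinusoidal positional encoding for one position (assumed form, not confirmed by the page)."""
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):
        # Even index i uses sin, the following odd index uses cos, at the same frequency.
        angle = position / (10000 ** (i / d_model))
        pe[i] = math.sin(angle)
        if i + 1 < d_model:
            pe[i + 1] = math.cos(angle)
    return pe

def add_positional_encoding(embeddings):
    """Return X = E + PE, one row per word, using the word's index as its position."""
    return [
        [e + p for e, p in zip(vec, sinusoidal_pe(pos, len(vec)))]
        for pos, vec in enumerate(embeddings)
    ]

# Two identical embeddings (e.g. a repeated word) become distinguishable after adding PE.
E = [[0.1] * 512, [0.1] * 512]
X = add_positional_encoding(E)
print(X[0][:4])
print(X[1][:4])
```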

Weight matrices W_Q, W_K, W_V (512×64) — Q = X·W_Q, K = X·W_K, V = X·W_V, X = E+PE

W_Q
W_K
W_V

Query vectors (Q) — one row per word, 64 dims (Q = X·W_Q, X = E+PE)

Key vectors (K) — one row per word, 64 dims (K = X·W_K, X = E+PE)

Value vectors (V) — one row per word, 64 dims (V = X·W_V, X = E+PE)

Weighted value (w·V) — one row per word, 64 dims (w·V = softmax weight × V for current Query)

Z_head = w · V_??? , 64 dims

Transfer pipeline: head aggregation → final Z

In the original base Transformer (model width 512), 8 heads each produce a 64-dimensional vector; the 8 vectors are concatenated (8×64 = 512) and projected by W_O to give the final 512-dimensional Z. This simulator also uses 8 heads. Per head, Z_head = w · V (attention weights × Values). The head outputs are concatenated into Z_concat, then Z_final = Z_concat × W_O, and the residual output is X + Z_final.
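The aggregation steps can be sketched with stand-in numbers. Everything random here is a placeholder: the per-head outputs, W_O, and X are drawn from a seeded generator purely to show the shapes, and `row_times_matrix` is a hypothetical helper, not the simulator's code.

```python
import random

N_HEADS, D_HEAD = 8, 64
D_MODEL = N_HEADS * D_HEAD  # 8 x 64 = 512

def row_times_matrix(x, W):
    """Row vector (length n) times matrix (n x m) gives a row vector of length m."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

rng = random.Random(0)

# Stand-ins for the eight per-head outputs Z_head (each 64-d).
head_outputs = [[rng.gauss(0.0, 1.0) for _ in range(D_HEAD)] for _ in range(N_HEADS)]

# Concatenate head outputs in order: Z_concat is 512-d.
z_concat = [value for head in head_outputs for value in head]

# Placeholder output projection W_O (512 x 512) and query-word input X (512-d).
W_O = [[rng.gauss(0.0, 0.02) for _ in range(D_MODEL)] for _ in range(D_MODEL)]
x = [rng.gauss(0.0, 1.0) for _ in range(D_MODEL)]

z_final = row_times_matrix(z_concat, W_O)        # Z_final = Z_concat x W_O
residual = [a + b for a, b in zip(x, z_final)]   # residual output X + Z_final

print(len(z_concat), len(z_final), len(residual))
```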

Panels: Head 1 through Head 8, then Z_concat.

W_O (Output projection) 512×512

Z (contextualized output) = Z_concat × W_O = Z_final
Query word input (E+PE) = X
Residual output (X+Z_final)

 

How to use

  • Input words: Type or edit words (e.g. "The cat sat on the mat"). Separate by spaces or commas.
  • Reset: Re-embed the current input and redraw the circle. Use after changing the text.
  • Click a word: Set that word as the Query (Q). The canvas shows attention beams (Q→K) and the table fills with Q·K/√dk, Softmax %, and ‖w·V‖ (magnitude of each word’s weighted Value contribution).
  • Temperature (τ): Lower values (e.g. 0.2–0.5) sharpen attention so one word dominates; higher values (e.g. 2–5) make weights more uniform. Formula: softmax((Q·K/√dk) / τ).
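To see the effect of the temperature slider in numbers, the sketch below applies softmax((Q·K/√dk) / τ) to one fixed set of scores at three temperatures. The scores are hypothetical values, not output from the simulator.

```python
import math

def softmax_with_temperature(scores, tau):
    """Softmax of scores / tau; smaller tau sharpens, larger tau flattens."""
    scaled = [s / tau for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, 0.5, 0.0]  # hypothetical Q·K/√dk values for three words

results = {}
for tau in (0.2, 1.0, 5.0):
    results[tau] = softmax_with_temperature(scores, tau)
    print(tau, [round(w, 3) for w in results[tau]])
```

At τ = 0.2 the top-scoring word takes most of the weight, while at τ = 5 the three weights are close to one third each.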

What you’re seeing

  • Circle: Words from your input, laid out in a circle. Gold = current Query (Q); cyan = Keys (K).
  • Beams: Attention from the Query to each word; thickness and opacity follow the softmax weight.
  • Table: For the chosen Query, one row per word: Role (Q or K), raw score Q·K/√dk, Softmax %, and ‖w·V‖ (weight × L2 norm of that word’s Value vector). The bar column reflects Softmax %.
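The table columns can be reproduced from the scores, softmax weights, and Value vectors. This is a sketch under assumptions: `table_rows` and its dictionary keys are hypothetical names, and the inputs are toy numbers. Because softmax weights are non-negative, ‖w·V‖ equals w × ‖V‖.

```python
import math

def l2_norm(v):
    """Euclidean (L2) norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def table_rows(words, query_index, scores, weights, values):
    """Build one row per word: role, raw score, softmax percentage, and ||w·V||."""
    rows = []
    for i, word in enumerate(words):
        rows.append({
            "word": word,
            "role": "Q" if i == query_index else "K",
            "score": scores[i],                           # Q·K / sqrt(d_k)
            "softmax_pct": 100.0 * weights[i],            # softmax weight as a percentage
            "wv_norm": weights[i] * l2_norm(values[i]),   # weight x L2 norm of V
        })
    return rows

# Toy inputs (hypothetical numbers, 2-d Value vectors).
words = ["cat", "sat", "mat"]
scores = [1.0, 0.5, 0.0]
weights = [0.5, 0.3, 0.2]
values = [[3.0, 4.0], [0.0, 1.0], [1.0, 0.0]]

rows = table_rows(words, 0, scores, weights, values)
for row in rows:
    print(row)
```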

Technical note

Embeddings X come from USE (512-d). Wq and Wk are the same projection onto the first 64 dimensions, so Q and K reflect semantic similarity in USE space. Wv is a fixed random matrix. This simulator does not use trained transformer weights; it illustrates the attention equation with real embeddings and synthetic Wq, Wk, Wv.
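One way to build synthetic matrices of this kind is sketched below. The identity-style projection follows the description above; the distribution, scale, and seed used for Wv are assumptions, since the page only says it is a fixed random matrix. All function names are hypothetical.

```python
import math
import random

D_MODEL, D_K = 512, 64

def first_dims_projection(d_model=D_MODEL, d_k=D_K):
    """d_model x d_k matrix that copies the first d_k input dimensions."""
    return [[1.0 if row == col else 0.0 for col in range(d_k)] for row in range(d_model)]

def fixed_random_matrix(d_model=D_MODEL, d_k=D_K, seed=0):
    """Seeded Gaussian matrix; same seed gives the same matrix (scale is an assumption)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d_model)
    return [[rng.gauss(0.0, scale) for _ in range(d_k)] for _ in range(d_model)]

def project(x, W):
    """Row vector x (length d_model) times W (d_model x d_k)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

W_q = first_dims_projection()
W_k = W_q                      # Wq and Wk are the same projection
W_v = fixed_random_matrix(seed=0)

x = [float(i) for i in range(D_MODEL)]  # stand-in for one 512-d input vector
q = project(x, W_q)
v = project(x, W_v)
print(q[:4], len(v))
```

With this Wq, Q is simply the first 64 components of the input, which is why Q·K tracks similarity in the embedding space.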

Limitations

  • Synthetic projections, not trained weights. Wq and Wk are an identity-style projection onto the first 64 USE dimensions and Wv is a fixed random matrix. A real attention layer learns all three, so the attention pattern here reflects raw USE similarity, not what a trained model would attend to.
  • One query at a time. The table and beams show attention for a single Query word; the full token×token attention matrix is not displayed.
  • USE sentence embeddings as token vectors. Words are embedded with the Universal Sentence Encoder rather than a Transformer's own token embeddings, so the input space differs from an actual model's.
  • No masking. All words attend to all words; causal masking is not modeled. Position enters only through the positional encoding added to the embeddings (X = E + PE).
  • Temperature is a teaching aid. The τ slider illustrates softmax sharpness; standard scaled dot-product attention uses τ = 1.
  • Teaching tool. Built to make scaled dot-product attention tangible with real embeddings, not to reproduce a trained attention layer.