Transformer Encoder / Decoder Tutorial 

This tutorial zooms out from individual Transformer blocks and shows how complete Transformer architectures are assembled. The key question is: do we need an Encoder, a Decoder, or both?

Mathematical Foundation

1. Encoder-only architecture

The encoder receives the full input sequence and uses bidirectional self-attention:

H = Encoder(X)

For example, with source tokens:

X = [Je, suis, etudiant]

each token can attend to all source tokens:

Je -> [Je, suis, etudiant]

suis -> [Je, suis, etudiant]

etudiant -> [Je, suis, etudiant]

This suits tasks that require understanding the whole input, such as classification, search, and embedding generation.
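The attendance pattern above can be written as a small table. The following is a minimal sketch in plain Python; the helper names `bidirectional_mask` and `visible_tokens` are made up for illustration and do not come from the simulator.

```python
# Attendance table for bidirectional (encoder) self-attention.
# True means "the row token may attend to the column token".
source_tokens = ["Je", "suis", "etudiant"]

def bidirectional_mask(n):
    """Return an n x n table in which every entry is True."""
    return [[True] * n for _ in range(n)]

def visible_tokens(tokens, mask):
    """Map each query token to the tokens it may attend to.

    Assumes the tokens are unique, which holds for this example.
    """
    return {
        query: [key for key, allowed in zip(tokens, row) if allowed]
        for query, row in zip(tokens, mask)
    }

encoder_mask = bidirectional_mask(len(source_tokens))
encoder_view = visible_tokens(source_tokens, encoder_mask)

for query, keys in encoder_view.items():
    print(f"{query} -> {keys}")
```

Every row lists all three source tokens, which is what "bidirectional" means here: no position is hidden from any other.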

2. Decoder-only architecture

The decoder uses masked self-attention for autoregressive generation:

D_t = Decoder(y_0, ..., y_t)

For target tokens:

Y = [I, am, a, student]

the causal mask allows:

I -> [I]

am -> [I, am]

a -> [I, am, a]

student -> [I, am, a, student]

This is the GPT-style architecture for next-token generation.
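The causal mask can be sketched as a lower-triangular table. This is a plain-Python illustration, not the simulator's code; `causal_mask` and `masked_softmax` are hypothetical helper names, and the equal scores are made up so that only the effect of the mask shows.

```python
import math

target_tokens = ["I", "am", "a", "student"]

def causal_mask(n):
    """Lower-triangular table: position i may attend to positions 0..i."""
    return [[col <= row for col in range(n)] for row in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over one row, giving masked positions zero weight."""
    exps = [math.exp(s) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

decoder_mask = causal_mask(len(target_tokens))

# Which tokens each target position is allowed to see.
decoder_view = [
    [tok for tok, allowed in zip(target_tokens, row) if allowed]
    for row in decoder_mask
]
for query, keys in zip(target_tokens, decoder_view):
    print(f"{query} -> {keys}")

# With equal raw scores, "am" splits its attention over "I" and "am" only.
am_weights = masked_softmax([0.0, 0.0, 0.0, 0.0], decoder_mask[1])
print(am_weights)  # [0.5, 0.5, 0.0, 0.0]
```

Practical implementations usually apply the mask by setting blocked scores to a very large negative value before the softmax; zeroing the exponentials, as done here, gives the same weights for this toy case.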

3. Encoder-decoder architecture

Encoder-decoder models use both stacks:

H_enc = Encoder(X_src)
D_self = MaskedSelfAttention(Y_tgt)
C = CrossAttention(Q_dec, K_enc, V_enc)

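The encoder writes source context into memory: the Keys and Values (K_enc, V_enc) derived from H_enc. The decoder reads that memory through cross-attention, using Queries (Q_dec) from its own masked self-attention output, while generating the target sequence.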
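The write-once, read-per-step flow can be sketched with stand-in functions. Everything below is hypothetical: `encode`, `decode_step`, `translate`, and the canned lookup table are illustrative stubs, not learned networks and not the simulator's code. The point is only the control flow: the encoder runs once per source sentence, and the decoder runs once per generated token.

```python
# Canned next-token table keyed by the target prefix (illustrative only).
CANNED_NEXT = {
    (): "I",
    ("I",): "am",
    ("I", "am"): "a",
    ("I", "am", "a"): "student",
    ("I", "am", "a", "student"): ".",
}

def encode(source_tokens):
    """'Write' step: run once and return the memory the decoder will read."""
    return {"keys": list(source_tokens), "values": list(source_tokens)}

def decode_step(memory, prefix):
    """'Read' step: stub that stands in for a decoder pass over the prefix.

    A real decoder would attend over memory["keys"] and memory["values"];
    this stub only checks that the memory exists, then uses the canned table.
    """
    if not memory["keys"]:
        raise ValueError("decoder needs encoder memory to read from")
    return CANNED_NEXT[tuple(prefix)]

def translate(source_tokens, max_steps=5):
    memory = encode(source_tokens)   # encoder runs once
    generated = []
    for _ in range(max_steps):       # decoder runs once per target token
        token = decode_step(memory, generated)
        generated.append(token)
        if token == ".":             # stop at the end of the sentence
            break
    return generated

output = translate(["Je", "suis", "etudiant"])
print(output)  # ['I', 'am', 'a', 'student', '.']
```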

4. Cross-attention is the bridge

In cross-attention, the Query comes from the decoder, but Key and Value come from the encoder:

CrossAttn = Attention(Q_decoder, K_encoder, V_encoder)

For translation, this means:

student query reads encoder memory [Je, suis, etudiant]

That is how the decoder can generate English while still looking back at the French source sentence.
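To make the read step concrete, here is a minimal sketch of scaled dot-product cross-attention in plain Python. The 2-dimensional vectors are made-up numbers chosen for illustration, not learned values, and `cross_attention` is a hypothetical helper name. Note that no mask is applied: a decoder Query may read every encoder position.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: one output row per decoder query."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the encoder values, column by column.
        outputs.append([
            sum(wj * v[c] for wj, v in zip(w, values))
            for c in range(len(values[0]))
        ])
    return outputs, weights

# Assumed toy vectors, one row per source token: Je, suis, etudiant.
encoder_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoder_values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
# A single decoder query, standing in for the "student" position.
decoder_queries = [[2.0, 2.0]]

context, attn = cross_attention(decoder_queries, encoder_keys, encoder_values)
print(attn[0])     # three weights that sum to 1, largest on "etudiant"
print(context[0])  # blend of the encoder values
```

With these numbers the query aligns best with the third key, so most of the weight lands on "etudiant". In a trained model that alignment would come from learned projections rather than hand-picked vectors.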

5. Prediction head

The decoder output is mapped to vocabulary probabilities:

p(next token) = softmax(Linear(D_final))

In the simulator example, the highest probability goes to the "." token, so the predicted next token is a period.
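The prediction head can be sketched as a linear projection followed by a softmax. The four-token vocabulary, the 2-dimensional decoder state, and the weight matrix below are all made-up values chosen so that "." wins; they are assumptions for illustration, not values from the simulator.

```python
import math

vocab = [".", "student", "am", "I"]  # hypothetical toy vocabulary

# Made-up numbers: a 2-d final decoder state and a 4x2 projection.
d_final = [1.0, 2.0]
W = [
    [1.0, 1.0],   # row for "."
    [0.5, 0.0],   # row for "student"
    [0.0, 0.5],   # row for "am"
    [-1.0, 0.0],  # row for "I"
]
b = [0.0, 0.0, 0.0, 0.0]

def linear(x, weight, bias):
    """One logit per vocabulary entry: weight row dot x, plus bias."""
    return [
        sum(w_i * x_i for w_i, x_i in zip(row, x)) + b_j
        for row, b_j in zip(weight, bias)
    ]

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = linear(d_final, W, b)   # [3.0, 0.5, 1.0, -1.0]
probs = softmax(logits)
predicted = vocab[probs.index(max(probs))]

for token, p in zip(vocab, probs):
    print(f"{token!r}: {p:.3f}")
print("predicted:", predicted)   # predicted: .
```

Taking the argmax is the simplest choice (greedy decoding); sampling from `probs` is a common alternative.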

Simulation

The interactive simulator is below. Use the controls to explore the concepts described above.

Architecture view

  • Encoder Stack: Encoder Block 1 through Encoder Block N (repeated blocks), labeled "bidirectional self-attention" and "Self-Attn + FFN + AddNorm".
  • Context Memory: K_encoder and V_encoder, written by the encoder and read by the decoder.
  • Decoder Stack: Decoder Block 1 through Decoder Block N (repeated blocks), labeled "causal masked self-attention" and "Masked Attn + Cross-Attn + FFN", topped by Linear / Softmax to predict the next token.

The remaining panels are interactive:

  • Encoder bidirectional mask: table over the source tokens Je, suis, etudiant.
  • Decoder causal mask: table over the target tokens I, am, a, student.
  • Cross-attention read pattern: table over the source tokens Je, suis, etudiant.
  • Architecture formulas
  • Next-token probabilities

Usage Instructions

  1. Architecture: Switch between Encoder-Decoder, Decoder-Only, and Encoder-Only.
  2. Step controls: Move through architecture selection, source encoding, masked decoding, cross-attention, and next-token prediction.
  3. Encoder table: Shows that encoder self-attention is bidirectional: all source tokens can see all source tokens.
  4. Decoder table: Shows the causal mask: each target token can only see itself and previous tokens.
  5. Cross-attention table: Shows decoder target queries reading encoder source memory. This only matters in Encoder-Decoder mode.

What To Notice

  • Encoder-only: Understands input but does not generate a target sequence.
  • Decoder-only: Generates next tokens but has no separate source memory.
  • Encoder-decoder: Uses encoder memory and decoder generation together.
  • Cross-attention: The decoder Query reads encoder Keys and Values.
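The four points above can be summarized as a small lookup table. This is a hedged sketch: the dictionary layout and the helper names `generates_tokens` and `has_source_memory` are invented for illustration and are not part of the simulator.

```python
# Which attention patterns each Transformer family uses.
ARCHITECTURES = {
    "encoder-only": {
        "bidirectional_self_attention": True,
        "causal_self_attention": False,
        "cross_attention": False,
    },
    "decoder-only": {
        "bidirectional_self_attention": False,
        "causal_self_attention": True,
        "cross_attention": False,
    },
    "encoder-decoder": {
        "bidirectional_self_attention": True,
        "causal_self_attention": True,
        "cross_attention": True,
    },
}

def generates_tokens(name):
    """Autoregressive generation needs the causal decoder stack."""
    return ARCHITECTURES[name]["causal_self_attention"]

def has_source_memory(name):
    """A separate source memory exists only when cross-attention reads it."""
    return ARCHITECTURES[name]["cross_attention"]

for name in ARCHITECTURES:
    print(name, generates_tokens(name), has_source_memory(name))
```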

Parameters

  • Architecture: Controls which Transformer family is shown.
  • Animation: Steps through the macro flow.

Limitations

  • Architecture diagram, not a running model. The tool shows which tokens each stack can attend to (bidirectional encoder, causal decoder, cross-attention bridge) using attendance tables; it does not run real attention, embeddings, or generate text.
  • Single short example. A fixed French→English sentence pair illustrates the data flow; there is no tokenizer, vocabulary, or beam search, and the “prediction” is a canned example.
  • One layer, conceptual. Each stack is drawn as a single conceptual block. Real models stack many layers, each with its own attention + FFN + Add&Norm, plus positional encodings.
  • No weights or training. There are no learned matrices and no gradients; the demo is purely structural, so it cannot show how the model learns alignment or translation.
  • Three canonical families only. Encoder-only, decoder-only, and encoder-decoder are shown; hybrids, mixture-of-experts, and retrieval-augmented variants are out of scope.
  • Teaching tool. Built to make clear the differences between BERT-style, GPT-style, and seq2seq Transformers, and the role of cross-attention. It is not a functional translator or language model.