Web Simulator | ShareTechnote


Transformer Encoder / Decoder Tutorial

This tutorial zooms out from individual Transformer blocks and shows how complete Transformer architectures are assembled. The key question is: do we need an Encoder, a Decoder, or both?

Sections

Mathematical Foundation
Simulation
Usage Instructions
What To Notice
Parameters
Limitations

Mathematical Foundation

1. Encoder-only architecture

The encoder receives the full input sequence and uses bidirectional self-attention:

H = Encoder(X)

For example, with source tokens:

X = [Je, suis, etudiant]

each token can attend to all source tokens:

Je -> [Je, suis, etudiant]

suis -> [Je, suis, etudiant]

etudiant -> [Je, suis, etudiant]

This BERT-style architecture suits tasks that require understanding the input, such as classification, search, and embedding generation.
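The bidirectional pattern above can be written as an all-ones mask. The following is a minimal sketch in plain Python; the helper names `bidirectional_mask` and `visible_tokens` are illustrative and are not part of the simulator.

```python
# Source tokens from the tutorial's example sentence.
source = ["Je", "suis", "etudiant"]

def bidirectional_mask(n):
    """Return an n x n mask where 1 means 'query row may attend to key column'."""
    return [[1] * n for _ in range(n)]

def visible_tokens(tokens, mask):
    """Map each query token to the key tokens its mask row allows."""
    return {
        query: [key for key, allowed in zip(tokens, row) if allowed]
        for query, row in zip(tokens, mask)
    }

encoder_mask = bidirectional_mask(len(source))
encoder_view = visible_tokens(source, encoder_mask)

for query, keys in encoder_view.items():
    print(f"{query} -> {keys}")
```

Every row of the mask is identical, which is exactly what "bidirectional" means here: no position is hidden from any other position.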

2. Decoder-only architecture

The decoder uses masked self-attention for autoregressive generation:

D_t = Decoder(y_0, ..., y_t)

For target tokens:

Y = [I, am, a, student]

the causal mask allows:

I -> [I]

am -> [I, am]

a -> [I, am, a]

student -> [I, am, a, student]

This is the GPT-style architecture for next-token generation.
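The causal pattern above corresponds to a lower-triangular mask. The sketch below builds that mask and shows one common way to apply it, by giving masked positions zero weight in the softmax. The function names and the uniform scores are assumptions made for illustration only.

```python
import math

# Target tokens from the tutorial's example sentence.
target = ["I", "am", "a", "student"]

def causal_mask(n):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of scores; disallowed positions get weight 0.

    A causal row always allows at least its own position, so the sum is nonzero.
    """
    exps = [math.exp(s) if allowed else 0.0 for s, allowed in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

decoder_mask = causal_mask(len(target))
decoder_view = [
    (query, [key for key, allowed in zip(target, row) if allowed])
    for query, row in zip(target, decoder_mask)
]

for query, keys in decoder_view:
    print(f"{query} -> {keys}")

# With equal (made-up) scores, "am" splits its attention evenly over "I" and "am".
am_weights = masked_softmax([0.0, 0.0, 0.0, 0.0], decoder_mask[1])
print(am_weights)
```

Real implementations usually add a large negative number to masked scores before the softmax instead of zeroing the exponentials; the resulting weights are the same.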

3. Encoder-decoder architecture

Encoder-decoder models use both stacks:

H_enc = Encoder(X_src)
D_self = MaskedSelfAttention(Y_tgt)
C = CrossAttention(Q_dec, K_enc, V_enc)

The encoder writes source context into memory: K_enc and V_enc are computed from H_enc. The decoder reads that memory through cross-attention, using queries Q_dec computed from D_self, while generating the target sequence.

4. Cross-attention is the bridge

In cross-attention, the Query comes from the decoder, but Key and Value come from the encoder:

CrossAttn = Attention(Q_decoder, K_encoder, V_encoder)

For translation, this means:

student query reads encoder memory [Je, suis, etudiant]

That is how the decoder can generate English while still looking back at the French source sentence.
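To make the Query/Key/Value split concrete, here is a minimal sketch of scaled dot-product cross-attention in plain Python. The 2-dimensional vectors are invented for illustration and do not come from the simulator; a real model would produce them with learned projections.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: queries from the decoder, keys/values from the encoder."""
    d_k = len(keys[0])
    weights, outputs = [], []
    for q in queries:
        # One score per encoder position; no causal mask is applied here.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        row = softmax(scores)
        weights.append(row)
        # Weighted sum of encoder values, one output component at a time.
        outputs.append([
            sum(w * v[c] for w, v in zip(row, values))
            for c in range(len(values[0]))
        ])
    return outputs, weights

source = ["Je", "suis", "etudiant"]

# Toy encoder memory (assumed values): one key and one value vector per source token.
K_encoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V_encoder = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

# Toy decoder query for the target token "student" (assumed values).
Q_decoder = [[2.0, 2.0]]

context, weights = cross_attention(Q_decoder, K_encoder, V_encoder)
print("student reads:", dict(zip(source, [round(w, 3) for w in weights[0]])))
print("context vector:", [round(c, 3) for c in context[0]])
```

The weight matrix has one row per decoder query and one column per encoder position, so its shape is (target length) x (source length) and not square in general. With these toy vectors the query aligns best with the key for etudiant, so that position receives the largest weight.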

5. Prediction head

The decoder output is mapped to vocabulary probabilities:

p(next token) = softmax(Linear(D_final))

In the simulator example, the largest probability is assigned to the "." token, so the predicted next token is a period.
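The prediction head can be sketched as a linear projection followed by a softmax. The vocabulary, the final decoder state `d_final`, and the weights `W` and `b` below are hypothetical values, chosen so that "." receives the largest logit as in the simulator's canned example.

```python
import math

# Hypothetical five-token vocabulary for the example sentence.
vocab = ["I", "am", "a", "student", "."]

def linear(x, weight, bias):
    """y = W x + b, with one weight row per vocabulary entry."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b_i
            for row, b_i in zip(weight, bias)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed final decoder state and projection parameters (not learned, not from the simulator).
d_final = [1.0, -1.0]
W = [
    [0.1, 0.2],   # I
    [0.0, 0.3],   # am
    [0.2, 0.1],   # a
    [0.5, 0.4],   # student
    [1.5, -1.0],  # .
]
b = [0.0, 0.0, 0.0, 0.0, 0.0]

logits = linear(d_final, W, b)
probs = softmax(logits)

# Greedy choice: pick the token with the highest probability.
next_token = vocab[probs.index(max(probs))]
print(next_token)
```

Picking the maximum is the simplest decoding rule; sampling or beam search would use the same probabilities differently.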

Simulation

The interactive simulator is below. Use the controls to explore the concepts described above.

[Interactive simulator. Controls: Architecture, Animation.]

Architecture view (each stack is listed from top to bottom):

Encoder Stack
Encoder Block N: Self-Attn + FFN + AddNorm
(repeated blocks)
Encoder Block 1: bidirectional self-attention

Context Memory
K_encoder, V_encoder (written by the Encoder, read by the Decoder)

Decoder Stack
Linear / Softmax: predict next token
Decoder Block N: Masked Attn + Cross-Attn + FFN
(repeated blocks)
Decoder Block 1: causal masked self-attention

Panels
Encoder bidirectional mask (columns: Je, suis, etudiant)
Decoder causal mask (columns: I, am, a, student)
Cross-attention read pattern (columns: Je, suis, etudiant)
Architecture formulas
Next-token probabilities

Usage Instructions

Architecture: Switch between Encoder-Decoder, Decoder-Only, and Encoder-Only.
Step controls: Move through architecture selection, source encoding, masked decoding, cross-attention, and next-token prediction.
Encoder table: Shows that encoder self-attention is bidirectional: all source tokens can see all source tokens.
Decoder table: Shows the causal mask: each target token can only see itself and previous tokens.
Cross-attention table: Shows decoder target queries reading encoder source memory. This only matters in Encoder-Decoder mode.

What To Notice

Encoder-only: Understands input but does not generate a target sequence.
Decoder-only: Generates next tokens but has no separate source memory.
Encoder-decoder: Uses encoder memory and decoder generation together.
Cross-attention: The decoder Query reads encoder Keys and Values.
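The four observations above can be summarized as a small lookup table. This is only an illustrative sketch; the `Architecture` class and the `FAMILIES` dictionary are hypothetical names, not part of the simulator.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Architecture:
    name: str
    has_encoder: bool
    has_decoder: bool

    @property
    def uses_cross_attention(self) -> bool:
        # Cross-attention needs a decoder to ask and an encoder memory to read.
        return self.has_encoder and self.has_decoder

    @property
    def generates_tokens(self) -> bool:
        # Only a decoder stack produces next-token predictions.
        return self.has_decoder

FAMILIES = {
    arch.name: arch
    for arch in (
        Architecture("Encoder-Only", has_encoder=True, has_decoder=False),
        Architecture("Decoder-Only", has_encoder=False, has_decoder=True),
        Architecture("Encoder-Decoder", has_encoder=True, has_decoder=True),
    )
}

for arch in FAMILIES.values():
    print(f"{arch.name}: generates={arch.generates_tokens}, "
          f"cross_attention={arch.uses_cross_attention}")
```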

Parameters

Architecture: Controls which Transformer family is shown.
Animation: Steps through the macro flow.

Limitations

Architecture diagram, not a running model. The tool uses attention-mask tables to show which tokens each stack can attend to (bidirectional encoder, causal decoder, cross-attention bridge); it does not compute real attention or embeddings, and it does not generate text.
Single short example. A fixed French→English sentence pair illustrates the data flow; there is no tokenizer, vocabulary, or beam search, and the “prediction” is a canned example.
One layer, conceptual. Each stack is drawn as a single conceptual block. Real models stack many layers, each with its own attention + FFN + Add&Norm, plus positional encodings.
No weights or training. There are no learned matrices and no gradients; the demo is purely structural, so it cannot show how the model learns alignment or translation.
Three canonical families only. Encoder-only, decoder-only, and encoder-decoder are shown; hybrids, mixture-of-experts, and retrieval-augmented variants are out of scope.
Teaching tool. Built to clarify the differences between BERT-style, GPT-style, and seq2seq Transformers and the role of cross-attention; it is not a functional translator or language model.