Web Simulation

microGPT: Minimal GPT Training & Inference 

This simulator trains and runs inference for a minimal GPT in pure JavaScript, ported from the dependency-free Python version by @karpathy. It demonstrates tokenization, scalar autograd, a tiny transformer (token and position embeddings, plus one layer of multi-head attention and an MLP), and Adam optimization, all used to learn to generate short names.

The excellent original code and documentation are on Andrej Karpathy's blog. This page is a visualized, web-based version of Karpathy's Python code.

Mathematical foundation

1. Tokenizer

The dataset is a list of documents (e.g. names). Unique characters in the dataset form the character vocabulary; each character maps to a token id 0 .. n−1. A special BOS (Beginning of Sequence) token has id n. Documents are tokenized as BOS + character ids + BOS. Vocab size = n + 1.
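
The tokenizer described above can be sketched in a few lines. This is a minimal illustration, not the simulator's actual source: the names buildTokenizer, encode, and decode are hypothetical, and sorting the characters before assigning ids is an assumption made so the ids are deterministic.

```javascript
// Sketch of a character-level tokenizer with a single BOS marker.
// buildTokenizer / encode / decode are hypothetical names for illustration.
function buildTokenizer(docs) {
  // Unique characters, sorted (an assumption), get ids 0 .. n-1.
  const chars = [...new Set(docs.join(""))].sort();
  const charToId = new Map(chars.map((ch, i) => [ch, i]));
  const bos = chars.length;            // BOS takes id n
  const vocabSize = chars.length + 1;  // n characters + BOS

  // A document becomes BOS + character ids + BOS.
  const encode = (doc) => [bos, ...[...doc].map((ch) => charToId.get(ch)), bos];
  // Decoding drops BOS markers and maps ids back to characters.
  const decode = (ids) =>
    ids.filter((id) => id !== bos).map((id) => chars[id]).join("");

  return { chars, bos, vocabSize, encode, decode };
}

const tok = buildTokenizer(["emma", "ava"]);
console.log(tok.vocabSize);      // 5: characters a, e, m, v plus BOS
console.log(tok.encode("ava"));  // [4, 0, 3, 0, 4]
```

Because BOS marks both the start and the end of a document, the model can learn to emit it as a stop signal during generation.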

2. Autograd

Each scalar in the computation is a Value with data (forward value) and grad (gradient of the loss with respect to that value). Operations (+, ×, log, exp, ReLU, etc.) build a directed graph. Backward visits the graph in reverse topological order, starting from the loss, and applies the chain rule:

child.grad += local_derivative × parent.grad
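
To make the backward rule concrete, here is a minimal sketch of a scalar Value node supporting only add, mul, and relu. The field names children and localGrads are assumptions chosen for this example; the simulator's own class may be organized differently.

```javascript
// Minimal scalar autograd sketch (add, mul, relu only).
// Field names `children` and `localGrads` are illustrative assumptions.
class Value {
  constructor(data, children = [], localGrads = []) {
    this.data = data;             // forward value
    this.grad = 0;                // d(loss)/d(this), filled in by backward()
    this.children = children;     // inputs of the op that produced this node
    this.localGrads = localGrads; // d(this)/d(child) for each child
  }
  add(other) {
    return new Value(this.data + other.data, [this, other], [1, 1]);
  }
  mul(other) {
    return new Value(this.data * other.data, [this, other], [other.data, this.data]);
  }
  relu() {
    return new Value(Math.max(0, this.data), [this], [this.data > 0 ? 1 : 0]);
  }
  backward() {
    // Build a topological order: every child appears before its parent.
    const topo = [];
    const visited = new Set();
    const build = (v) => {
      if (visited.has(v)) return;
      visited.add(v);
      for (const c of v.children) build(c);
      topo.push(v);
    };
    build(this);
    this.grad = 1; // d(loss)/d(loss)
    // Walk from the loss back to the leaves, applying the chain rule.
    for (const v of topo.reverse()) {
      v.children.forEach((c, i) => {
        c.grad += v.localGrads[i] * v.grad;
      });
    }
  }
}

const a = new Value(2);
const b = new Value(3);
const loss = a.mul(b).add(a); // loss = a*b + a
loss.backward();
console.log(a.grad, b.grad);  // 4 2  (d/da = b + 1, d/db = a)
```

The += matters: a is used twice in the expression, so its gradient is the sum of the contributions from both uses.
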

3. Model (GPT-style)

For each position: token embedding (wte) + position embedding (wpe) → RMSNorm. Then for each layer: (1) Multi-head attention: RMSNorm → Q, K, V projections; per-head scaled dot-product attention over the current and all previous positions (causal); output projection and residual; (2) MLP: RMSNorm → linear(4×dim) → ReLU → linear(dim) → residual. The final hidden state is multiplied by lm_head to get logits over the vocabulary. Softmax gives next-token probabilities; the loss is the average, over positions, of the negative log probability of the target token.
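
The building blocks named above can be sketched on plain numbers, without autograd, to show the arithmetic. This is an illustrative sketch under assumptions: the function names are hypothetical, RMSNorm is shown without a learned gain, and attendOneHead handles a single head for the newest position given cached key and value vectors.

```javascript
// RMSNorm (no learned gain assumed): rescale x to unit root-mean-square.
function rmsnorm(x, eps = 1e-5) {
  const meanSquare = x.reduce((s, xi) => s + xi * xi, 0) / x.length;
  const scale = 1 / Math.sqrt(meanSquare + eps);
  return x.map((xi) => xi * scale);
}

// Numerically stable softmax: subtract the max before exponentiating.
function softmax(logits) {
  const max = Math.max(...logits);
  const exps = logits.map((l) => Math.exp(l - max));
  const sum = exps.reduce((s, e) => s + e, 0);
  return exps.map((e) => e / sum);
}

// One attention head for the newest position.
// q: query vector; keys/values: one vector per position 0..t (current included).
function attendOneHead(q, keys, values) {
  const headDim = q.length;
  // Scaled dot product of the query with every cached key.
  const scores = keys.map(
    (k) => k.reduce((s, kj, j) => s + kj * q[j], 0) / Math.sqrt(headDim)
  );
  const weights = softmax(scores);
  // Weighted sum of the cached value vectors.
  return values[0].map((_, j) =>
    weights.reduce((s, w, t) => s + w * values[t][j], 0)
  );
}

// Average negative log probability of the target token at each position.
function crossEntropy(logitsPerPosition, targets) {
  const losses = logitsPerPosition.map(
    (logits, i) => -Math.log(softmax(logits)[targets[i]])
  );
  return losses.reduce((s, l) => s + l, 0) / losses.length;
}

// Uniform logits over 4 tokens give probability 1/4, so the loss is log(4).
console.log(crossEntropy([[0, 0, 0, 0]], [2])); // ≈ 1.386
```

Causality comes for free in this formulation: the key/value lists only ever contain positions up to the current one, so no explicit mask is needed.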

4. Optimizer (Adam)

First moment m and second moment v are updated with decay rates β1 and β2. The bias-corrected estimates m̂ and v̂ then update the parameters (the learning rate is linearly decayed over steps):

p ← p − lr × m̂ / (√v̂ + ε)
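
A single Adam step with linear learning-rate decay might look like the sketch below. The helper name adamStep, the flat-array parameter layout, and the default hyperparameters (lr0, beta1, beta2, eps) are illustrative assumptions, not values taken from the simulator.

```javascript
// Sketch of one Adam update over flat arrays of numbers.
// adamStep and its default hyperparameters are illustrative assumptions.
function adamStep(params, grads, state, step, numSteps, opts = {}) {
  const { lr0 = 0.01, beta1 = 0.9, beta2 = 0.95, eps = 1e-8 } = opts;
  const t = step + 1;                       // 1-based count for bias correction
  const lr = lr0 * (1 - step / numSteps);   // linear decay toward zero
  for (let i = 0; i < params.length; i++) {
    // Exponential moving averages of the gradient and its square.
    state.m[i] = beta1 * state.m[i] + (1 - beta1) * grads[i];
    state.v[i] = beta2 * state.v[i] + (1 - beta2) * grads[i] ** 2;
    // Bias correction compensates for m and v starting at zero.
    const mHat = state.m[i] / (1 - beta1 ** t);
    const vHat = state.v[i] / (1 - beta2 ** t);
    params[i] -= (lr * mHat) / (Math.sqrt(vHat) + eps);
  }
}

// Toy usage: minimize f(p) = p^2, whose gradient is 2p.
const params = [1.0];
const state = { m: [0], v: [0] };
const numSteps = 200;
for (let step = 0; step < numSteps; step++) {
  const grads = [2 * params[0]];
  adamStep(params, grads, state, step, numSteps);
}
console.log(params[0]); // has moved from 1.0 toward the minimum at 0
```

On the very first step the bias-corrected ratio m̂ / √v̂ is ±1, so each parameter moves by roughly lr regardless of the gradient's magnitude.
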

5. Inference

Start with BOS; at each position run the forward pass, apply temperature to logits (divide by T), softmax, then sample the next token. Stop when BOS is sampled or context length is reached. Lower temperature → more deterministic; higher → more diverse (and often noisier) outputs.
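
The sampling loop can be sketched as follows. Here nextLogits is a hypothetical stand-in for the model's forward pass (it takes the tokens so far and returns logits), maxLen plays the role of the context length, and the injectable rand argument is an assumption added so the example can be made deterministic.

```javascript
// Sample one token id from logits after temperature scaling.
// `rand` defaults to Math.random; passing a custom function is an
// assumption made here for reproducible examples.
function sampleWithTemperature(logits, temperature, rand = Math.random) {
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled);
  const exps = scaled.map((l) => Math.exp(l - max)); // unnormalized softmax
  const total = exps.reduce((s, e) => s + e, 0);
  let r = rand() * total;
  for (let i = 0; i < exps.length; i++) {
    r -= exps[i];
    if (r < 0) return i;
  }
  return exps.length - 1; // guard against floating-point rounding
}

// Generate token ids until BOS is sampled or maxLen tokens are produced.
// `nextLogits(tokens)` is a hypothetical stand-in for the forward pass.
function generate(nextLogits, bos, maxLen, temperature, rand = Math.random) {
  const tokens = [bos]; // the context always starts with BOS
  const out = [];
  while (out.length < maxLen) {
    const next = sampleWithTemperature(nextLogits(tokens), temperature, rand);
    if (next === bos) break; // BOS doubles as the end-of-sequence signal
    tokens.push(next);
    out.push(next);
  }
  return out;
}

// Toy "model": strongly prefers token 0 twice, then BOS (id 2).
const toyLogits = (tokens) => (tokens.length < 3 ? [10, -10, -10] : [-10, -10, 10]);
console.log(generate(toyLogits, 2, 16, 0.1, () => 0.5)); // [0, 0]
```

Dividing by a small temperature widens the gaps between logits before the softmax, which is why low temperatures concentrate probability on the top token.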

Simulation

The interactive simulator is below. Use the controls to explore the concepts described above.

[Interactive simulator widget. Panels: Model stats (num docs, vocab size, num params, step counter); Training & inference controls; Network architecture; Weights (live), with columns Number, Matrix, Row, Col, Value; Training log; Generated samples (inference).]

[Network architecture diagram: token_id, pos_id → Wte, Wpe → Sum (Add) → RMSNorm → Layer 0 (Attention: RMS → Q,K,V → 4h-Attn → out; Feed forward (MLP): RMS → FC 4d → ReLU → FC d) → lm_head → Logits. In training mode an overlay shows posId, Input (context) with token_id (in), and Target with token_id (out).]

Usage

Use the controls to train a minimal GPT and generate name-like strings:

  1. Num steps: Total training steps (each step uses one document). Try 500–1000 for quick results.
  2. Temperature: Sampling temperature for inference (0.1–1.5). Lower = more conservative; higher = more random.
  3. Reset: Re-initialize the model and optimizer. Clears the log and step counter.
  4. Step Fwd: Advance one autoregressive step through the training data. The overlay shows posId, Input (context), token_id, and Target.
  5. Gen Step: Generate one token at a time. The overlay switches to "Generation" mode and shows the input context and sampled output for each step.
  6. Run / Stop: Run training until the step count reaches Num steps, or click Stop to pause.
  7. Generate: Sample 20 sequences from the current model using the current temperature and show them in the samples area.

After training (e.g. 500–1000 steps), click Generate to see short "hallucinated" names. The model learns character-level statistics from the small embedded name list.

Limitations

  • Educational micro-scale. A single transformer layer with a tiny embedding dimension and a handful of heads, trained on a short embedded name list. It captures the mechanism of a GPT, not the capacity: real models have dozens of layers and billions of parameters.
  • Character-level, tiny vocabulary. Tokens are individual characters plus a BOS marker; there is no sub-word/BPE tokenizer, so it cannot model words, syntax, or long-range meaning.
  • Scalar autograd, no tensors/GPU. Every value is a scalar node in a hand-built graph for transparency; this is orders of magnitude slower than batched tensor math, so only short training runs are practical.
  • Overfits a small dataset. With so little data the model memorizes character statistics rather than generalizing; "names" are plausible-looking but not novel in a meaningful sense.
  • Fixed, short context. A small maximum context length limits how much prior text attention can use; documents longer than the context are truncated.
  • Port of a teaching model. This is a visualized JS port of Karpathy's dependency-free micro-GPT; it deliberately omits dropout, weight tying nuances, mixed precision, and the engineering that production training requires.