microGPT: Minimal GPT Training & Inference 

This simulator implements the most atomic way to train and run inference for a GPT in pure JavaScript (ported from the dependency-free Python version by @karpathy). It demonstrates tokenization, autograd, a tiny transformer (embedding + position, one layer of multi-head attention and MLP), and Adam optimization to learn to generate short names.

The excellent original code and documentation are on Andrej Karpathy's blog. This is a visualized, web-based version of Karpathy's Python code.

Mathematical foundation

1. Tokenizer

The dataset is a list of documents (e.g. names). Unique characters in the dataset form the character vocabulary; each character maps to a token id 0 .. n−1. A special BOS (Beginning of Sequence) token has id n. Documents are tokenized as BOS + character ids + BOS. Vocab size = n + 1.
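The tokenizer above can be sketched in plain JavaScript. This is illustrative, not the simulator's actual code; the function and variable names are assumptions.

```javascript
// Sketch of the character tokenizer: unique chars get ids 0..n-1,
// BOS gets id n, and a document becomes BOS + char ids + BOS.
function buildTokenizer(docs) {
  const chars = [...new Set(docs.join(""))].sort(); // unique characters
  const stoi = Object.fromEntries(chars.map((c, i) => [c, i]));
  const BOS = chars.length;            // special token id n
  const vocabSize = chars.length + 1;  // n + 1
  const encode = (doc) => [BOS, ...[...doc].map((c) => stoi[c]), BOS];
  return { encode, BOS, vocabSize };
}

const tok = buildTokenizer(["ana", "bob"]);
// sorted chars: a, b, n, o → ids 0..3; BOS = 4
console.log(tok.encode("ana")); // [4, 0, 2, 0, 4]
```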

2. Autograd

Each scalar in the computation is a Value with data (forward value) and grad (gradient from the loss). Operations (+, ×, log, exp, ReLU, etc.) build a directed graph. Backward traverses the graph in topological order and applies the chain rule: child.grad += local_derivative × parent.grad.
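A minimal sketch of such a scalar autograd node, assuming only add and multiply (the simulator also implements log, exp, ReLU, etc.):

```javascript
// Each node stores its forward value, its gradient, its input nodes,
// and the local derivatives d(this)/d(child) for the chain rule.
class Value {
  constructor(data, children = [], localGrads = []) {
    this.data = data;
    this.grad = 0;
    this.children = children;
    this.localGrads = localGrads;
  }
  add(other) { return new Value(this.data + other.data, [this, other], [1, 1]); }
  mul(other) { return new Value(this.data * other.data, [this, other], [other.data, this.data]); }
  backward() {
    // Topological sort, then chain rule: child.grad += local × parent.grad
    const topo = [], seen = new Set();
    const visit = (v) => {
      if (seen.has(v)) return;
      seen.add(v);
      v.children.forEach(visit);
      topo.push(v);
    };
    visit(this);
    this.grad = 1;
    for (const v of topo.reverse()) {
      v.children.forEach((c, i) => { c.grad += v.localGrads[i] * v.grad; });
    }
  }
}

const a = new Value(2), b = new Value(3);
const y = a.mul(b).add(a); // y = a*b + a = 8
y.backward();
console.log(a.grad, b.grad); // 4 2  (dy/da = b + 1, dy/db = a)
```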

3. Model (GPT-style)

For each position: token embedding (wte) + position embedding (wpe) → RMSNorm. Then for each layer: (1) Multi-head attention: RMSNorm → Q, K, V projections; per-head scaled dot-product attention over previous positions; output projection and residual; (2) MLP: RMSNorm → linear(4×dim) → ReLU → linear(dim) → residual. Final hidden state is multiplied by lm_head to get logits over the vocabulary. Softmax gives next-token probabilities; the loss is the average negative log probability of the target token at each position.
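Two building blocks named above, RMSNorm and the softmax that turns logits into probabilities, can be sketched on plain arrays (no autograd; function names are assumptions):

```javascript
// RMSNorm: scale the vector by 1/√(mean of squares + ε).
function rmsnorm(x, eps = 1e-5) {
  const ms = x.reduce((s, v) => s + v * v, 0) / x.length;
  const scale = 1 / Math.sqrt(ms + eps);
  return x.map((v) => v * scale);
}

// Softmax over logits (subtract the max for numerical stability).
function softmax(logits) {
  const m = Math.max(...logits);
  const exps = logits.map((v) => Math.exp(v - m));
  const Z = exps.reduce((s, v) => s + v, 0);
  return exps.map((v) => v / Z);
}

// Loss at one position: negative log probability of the target token.
function nll(logits, target) {
  return -Math.log(softmax(logits)[target]);
}
```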

4. Optimizer (Adam)

First moment m and second moment v are updated with decay rates β1, β2. Bias-corrected estimates m̂ and v̂ are used to update parameters: p ← p − lr × m̂ / (√v̂ + ε). The learning rate is linearly decayed over steps.
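The Adam update for a single scalar parameter follows directly from the formulas above (a sketch; the default β values and names here are assumptions, not necessarily the simulator's settings):

```javascript
// One Adam step for parameter p = {data, grad}, state = {m, v}, at step t ≥ 1.
function adamStep(p, state, t, lr, beta1 = 0.9, beta2 = 0.95, eps = 1e-8) {
  state.m = beta1 * state.m + (1 - beta1) * p.grad;      // first moment
  state.v = beta2 * state.v + (1 - beta2) * p.grad ** 2; // second moment
  const mHat = state.m / (1 - beta1 ** t);               // bias correction
  const vHat = state.v / (1 - beta2 ** t);
  p.data -= lr * mHat / (Math.sqrt(vHat) + eps);         // p ← p − lr·m̂/(√v̂+ε)
}
```

At t = 1 the bias correction exactly cancels the (1 − β) factors, so the first update moves by lr × sign-like step regardless of the gradient's scale.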

5. Inference

Start with BOS; at each position run the forward pass, apply temperature to logits (divide by T), softmax, then sample the next token. Stop when BOS is sampled or context length is reached. Lower temperature → more deterministic; higher → more diverse (and often noisier) outputs.
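Temperature sampling at one position can be sketched as follows (the logits would come from the model's forward pass; `rand` is injectable only to make the sketch testable):

```javascript
// Divide logits by T, softmax, then sample an index from the distribution.
function sampleToken(logits, temperature, rand = Math.random) {
  const scaled = logits.map((v) => v / temperature);
  const m = Math.max(...scaled);              // subtract max for stability
  const exps = scaled.map((v) => Math.exp(v - m));
  const Z = exps.reduce((s, v) => s + v, 0);
  let r = rand() * Z;                         // inverse-CDF sampling
  for (let i = 0; i < exps.length; i++) {
    r -= exps[i];
    if (r <= 0) return i;
  }
  return exps.length - 1;
}
```

Lowering the temperature stretches the gaps between logits, so the largest logit dominates the softmax and sampling becomes nearly greedy.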

[Interactive panels: Model stats (num docs, vocab size, num params, step counter), Training & inference controls, a Network architecture diagram (token_id / pos_id → Wte + Wpe → RMSNorm → attention → MLP → lm_head → logits), a live Weights table, the Training log, and Generated samples (inference).]

Usage

Use the controls to train a minimal GPT and generate name-like strings:

  1. Num steps: Total training steps. Each step trains on one position (one autoregressive prediction per doc). Try 500–1000 for quick results.
  2. Temperature: Sampling temperature for inference (0.1–1.5). Lower = more conservative; higher = more random.
  3. Reset: Re-initialize the model and optimizer. Clears the log and step counter.
  4. Step Fwd: Train one position and update the overlay. Same logic as Run, but a single step. The overlay shows posId, Input (context), token_id (in), Target, and token_id (out).
  5. Run / Stop: Repeatedly invokes Step Fwd until Num steps is reached. The overlay updates each step to show the current (doc, pos) with Input growing by one token per step within each document.
  6. Gen Step: Generate one token at a time. The overlay switches to "Generation" mode and shows the input context and sampled output for each step.
  7. Generate: Repeatedly invokes Gen Step to sample 20 sequences from the current model using the current temperature.

After training (e.g. 500–1000 steps), click Generate to see short “hallucinated” names. The model learns character-level statistics from the small embedded name list.