This simulator implements the most atomic way to train and run inference for a GPT in pure JavaScript (ported from the dependency-free Python version by @karpathy). It demonstrates tokenization, autograd, a tiny transformer (embedding + position, one layer of multi-head attention and MLP), and Adam optimization, learning to generate short names. The excellent original code and documentation are at Andrej Karpathy's blog; this is a visualized, web-based version of Karpathy's Python code.

Mathematical foundation

1. Tokenizer
The dataset is a list of documents (e.g. names). The unique characters in the dataset form the character vocabulary; each character maps to a token id 0 .. n−1. A special BOS (Beginning of Sequence) token has id n, so the vocabulary size is n + 1. Documents are tokenized as BOS + character ids + BOS.

2. Autograd
Each scalar in the computation is a Value holding data (the forward value) and grad (the gradient of the loss with respect to it). Operations (+, ×, log, exp, ReLU, etc.) build a directed graph. Backward traverses the graph in reverse topological order and applies the chain rule: child.grad += local_derivative × parent.grad.

3. Model (GPT-style)
For each position: token embedding (wte) + position embedding (wpe) → RMSNorm. Then, for each layer: (1) multi-head attention: RMSNorm → Q, K, V projections; per-head scaled dot-product attention over previous positions; output projection and residual connection; (2) MLP: RMSNorm → linear (4×dim) → ReLU → linear (dim) → residual connection. The final hidden state is multiplied by lm_head to get logits over the vocabulary. Softmax turns the logits into next-token probabilities; the loss is the average negative log probability of the target token at each position.

4. Optimizer (Adam)
The first moment m and second moment v are updated with decay rates β1 and β2. Bias-corrected estimates m̂ and v̂ update the parameters: p ← p − lr × m̂ / (√v̂ + ε). The learning rate is linearly decayed over the training steps.

5. Inference
Start with BOS; at each position run the forward pass, apply temperature to the logits (divide by T), softmax, then sample the next token. Stop when BOS is sampled or the context length is reached. Lower temperature gives more deterministic output; higher temperature gives more diverse (and often noisier) output.
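The tokenizer described in section 1 can be sketched in a few lines of plain JavaScript. The function names here are illustrative, not the simulator's actual API:

```javascript
// Build a character-level vocabulary from a list of documents (names).
// The BOS token gets the highest id, so vocab size = n unique chars + 1.
function buildVocab(docs) {
  const chars = [...new Set(docs.join(""))].sort();
  const stoi = Object.fromEntries(chars.map((c, i) => [c, i]));
  const bos = chars.length;             // special BOS token id
  return { stoi, itos: chars, bos, vocabSize: chars.length + 1 };
}

// A document is encoded as BOS + character ids + BOS.
function tokenize(doc, vocab) {
  return [vocab.bos, ...[...doc].map((c) => vocab.stoi[c]), vocab.bos];
}

const vocab = buildVocab(["anna", "ben"]);  // chars: a, b, e, n → BOS id 4
const ids = tokenize("ben", vocab);         // → [4, 1, 2, 3, 4]
```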
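The autograd step can be illustrated with a minimal Value class: a sketch in the spirit of the simulator, not its exact code, supporting only + and × for brevity:

```javascript
// Minimal scalar autograd: each Value stores data, grad, its children in
// the graph, and a closure that propagates its local derivative backward.
class Value {
  constructor(data, children = []) {
    this.data = data;
    this.grad = 0;
    this._children = children;
    this._backward = () => {};
  }
  add(other) {
    const out = new Value(this.data + other.data, [this, other]);
    out._backward = () => { this.grad += out.grad; other.grad += out.grad; };
    return out;
  }
  mul(other) {
    const out = new Value(this.data * other.data, [this, other]);
    out._backward = () => {
      this.grad += other.data * out.grad;   // d(xy)/dx = y
      other.grad += this.data * out.grad;   // d(xy)/dy = x
    };
    return out;
  }
  backward() {
    // Topological sort, then apply the chain rule from the output back.
    const topo = [], seen = new Set();
    const build = (v) => {
      if (seen.has(v)) return;
      seen.add(v);
      v._children.forEach(build);
      topo.push(v);
    };
    build(this);
    this.grad = 1;
    topo.reverse().forEach((v) => v._backward());
  }
}

const x = new Value(2), y = new Value(3);
const z = x.mul(y).add(x);  // z = x*y + x → dz/dx = y + 1 = 4, dz/dy = x = 2
z.backward();
```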
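The forward pass of a single attention head, attending only to previous positions, can be sketched with plain number arrays; RMSNorm is shown alongside, and the Q/K/V projections are assumed to have been applied already. All names here are illustrative:

```javascript
// RMSNorm: scale a vector by the reciprocal of its root mean square.
function rmsnorm(x, eps = 1e-5) {
  const ms = x.reduce((s, v) => s + v * v, 0) / x.length;
  return x.map((v) => v / Math.sqrt(ms + eps));
}

function softmax(logits) {
  const m = Math.max(...logits);                  // subtract max for stability
  const e = logits.map((v) => Math.exp(v - m));
  const s = e.reduce((a, b) => a + b, 0);
  return e.map((v) => v / s);
}

// Causal attention for one head: position t attends to positions 0..t.
// q, k, v are arrays of headDim-sized vectors, one per position.
function causalAttention(q, k, v) {
  const d = q[0].length;
  return q.map((qt, t) => {
    const scores = k.slice(0, t + 1)
      .map((kj) => qt.reduce((s, qi, i) => s + qi * kj[i], 0) / Math.sqrt(d));
    const w = softmax(scores);
    // Weighted sum of the value vectors seen so far.
    return v[0].map((_, i) => w.reduce((s, wj, j) => s + wj * v[j][i], 0));
  });
}

const out = causalAttention([[1, 0], [0, 1]], [[1, 0], [0, 1]], [[1, 0], [0, 1]]);
```

Position 0 can only attend to itself, so its output is exactly v[0]; later positions mix all value vectors seen so far.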
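The Adam update rule with linear learning-rate decay looks roughly like this; the function name and hyperparameter defaults are illustrative, not taken from the simulator:

```javascript
// One Adam step over a flat parameter array; grads has the same shape.
// state holds the first and second moments m, v; t is the 1-based step.
function adamStep(params, grads, state, t, opts = {}) {
  const { lr0 = 1e-2, beta1 = 0.9, beta2 = 0.95, eps = 1e-8, steps = 1000 } = opts;
  const lr = lr0 * (1 - t / steps);               // linear decay to zero
  for (let i = 0; i < params.length; i++) {
    state.m[i] = beta1 * state.m[i] + (1 - beta1) * grads[i];
    state.v[i] = beta2 * state.v[i] + (1 - beta2) * grads[i] ** 2;
    const mHat = state.m[i] / (1 - beta1 ** t);   // bias correction
    const vHat = state.v[i] / (1 - beta2 ** t);
    params[i] -= lr * mHat / (Math.sqrt(vHat) + eps);
  }
}

const params = [1], state = { m: [0], v: [0] };
adamStep(params, [1], state, 1, { lr0: 0.1, steps: 10 });
```

On the first step the bias correction makes m̂ = v̂ = grad-derived quantities of full magnitude, so the update is approximately lr × sign-scaled gradient.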
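Temperature sampling during inference can be sketched as follows. The rng parameter is injected so the result is reproducible, which is an illustrative choice, not the simulator's API:

```javascript
function softmax(logits) {
  const m = Math.max(...logits);                  // subtract max for stability
  const e = logits.map((v) => Math.exp(v - m));
  const s = e.reduce((a, b) => a + b, 0);
  return e.map((v) => v / s);
}

// Divide logits by T, softmax, then sample from the resulting distribution.
// rng() returns a uniform number in [0, 1).
function sampleToken(logits, temperature, rng) {
  const probs = softmax(logits.map((v) => v / temperature));
  const u = rng();
  let acc = 0;
  for (let i = 0; i < probs.length; i++) {
    acc += probs[i];
    if (u < acc) return i;       // inverse-CDF sampling
  }
  return probs.length - 1;
}
```

With a very low temperature the distribution collapses onto the argmax token; with a very high temperature it approaches uniform, which is exactly the deterministic-vs-diverse trade-off described above.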
Interface panels: Model stats (num docs, vocab size, num params, training step counter), Training & inference controls, Network architecture, Weights (live), Training log, and Generated samples (inference).
Usage
Use the controls to train a minimal GPT and generate name-like strings. After training (e.g. 500–1000 steps), click Generate to see short "hallucinated" names. The model learns character-level statistics from the small embedded name list.