Web Simulator | ShareTechnote

microGPT: Minimal GPT Training & Inference

This simulator implements the most atomic way to train and run inference for a GPT in pure JavaScript (ported from the dependency-free Python version by @karpathy). It demonstrates tokenization, autograd, a tiny transformer (embedding + position, one layer of multi-head attention and MLP), and Adam optimization to learn to generate short names.

The excellent original code and documentation are on Andrej Karpathy's blog. This page is just a visualized, web-based version of Karpathy's Python code.

Sections

Mathematical foundation
Simulation
Usage
Limitations

Mathematical foundation

1. Tokenizer

The dataset is a list of documents (e.g. names). Unique characters in the dataset form the character vocabulary; each character maps to a token id 0 .. n−1. A special BOS (Beginning of Sequence) token has id n. Documents are tokenized as BOS + character ids + BOS. Vocab size = n + 1.
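The tokenizer described above can be sketched in a few lines of Python. This is an illustrative sketch, not the simulator's code: the three-name dataset, the helper names (stoi, itos, encode, decode), and the choice to sort the characters before assigning ids are all assumptions made for the example.

```python
# Hypothetical toy dataset; the simulator ships its own embedded name list.
docs = ["emma", "olivia", "ava"]

# Unique characters form the vocabulary and get ids 0 .. n-1.
# Sorting is an assumption that makes the id assignment deterministic.
chars = sorted(set("".join(docs)))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

BOS = len(chars)             # the special token takes id n
vocab_size = len(chars) + 1  # n characters + BOS


def encode(doc):
    """Tokenize one document as BOS + character ids + BOS."""
    return [BOS] + [stoi[ch] for ch in doc] + [BOS]


def decode(tokens):
    """Map ids back to characters, dropping BOS markers."""
    return "".join(itos[t] for t in tokens if t != BOS)


tokens = encode("ava")
```

With this toy dataset the vocabulary is the 7 characters a, e, i, l, m, o, v, so BOS has id 7 and "ava" becomes [7, 0, 6, 0, 7]. Using the same BOS id at both ends lets one token act as both the start signal and the stop signal.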

2. Autograd

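Each scalar in the computation is a Value with data (the forward value) and grad (the gradient of the loss with respect to that value). Operations (+, ×, log, exp, ReLU, etc.) build a directed acyclic graph. The backward pass visits the graph in reverse topological order, starting from the loss, and applies the chain rule, where parent is the node computed from child: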

child.grad += local_derivative × parent.grad
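A minimal scalar autograd node that follows this rule might look like the sketch below. The private field names (_children, _local_grads) and the small set of operations are assumptions chosen for brevity; the simulator's Value class may be organized differently.

```python
import math


class Value:
    """A scalar node that remembers its inputs and their local derivatives."""

    def __init__(self, data, children=(), local_grads=()):
        self.data = data                 # forward value
        self.grad = 0.0                  # d(loss)/d(this node), filled by backward()
        self._children = children        # input nodes of the op that produced this node
        self._local_grads = local_grads  # d(this node)/d(child), one per child

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def log(self):
        return Value(math.log(self.data), (self,), (1.0 / self.data,))

    def relu(self):
        return Value(max(0.0, self.data), (self,), (1.0 if self.data > 0 else 0.0,))

    def backward(self):
        # Build a topological order (children before parents) by depth-first search.
        topo, visited = [], set()

        def build(node):
            if node not in visited:
                visited.add(node)
                for child in node._children:
                    build(child)
                topo.append(node)

        build(self)
        self.grad = 1.0
        # Walk from the output back to the inputs, applying the chain rule.
        for node in reversed(topo):
            for child, local in zip(node._children, node._local_grads):
                child.grad += local * node.grad


a, b = Value(2.0), Value(3.0)
out = (a * b + a).relu()  # out = relu(a*b + a) = 8
out.backward()            # d(out)/da = b + 1 = 4, d(out)/db = a = 2
```

Note the += in backward(): a appears twice in the expression, so its gradient is the sum of the contributions from both paths.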

3. Model (GPT-style)

For each position: token embedding (wte) + position embedding (wpe) → RMSNorm. Then for each layer: (1) Multi-head attention: RMSNorm → Q, K, V projections; per-head scaled dot-product attention over the current and all previous positions (causal); output projection and residual; (2) MLP: RMSNorm → linear(4×dim) → ReLU → linear(dim) → residual. The final hidden state is multiplied by lm_head to get logits over the vocabulary. Softmax turns the logits into next-token probabilities; the loss is the negative log probability of the target token, averaged over positions.
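The building blocks named above (RMSNorm, one attention head, and the per-position loss) can be sketched on plain Python floats. This is a simplified illustration under stated assumptions: it omits the learned projection matrices, the MLP, and the autograd wrapper, and the function names and the eps value are hypothetical rather than taken from the simulator.

```python
import math


def rmsnorm(x, eps=1e-5):
    """Scale a vector so its root-mean-square is approximately 1."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in x]


def softmax(logits):
    """Numerically stable softmax (subtract the max before exponentiating)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """One attention head for the current position.

    keys and values hold one vector per position 0..t, so passing only
    those positions is what makes the attention causal.
    """
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]


def nll(logits, target):
    """Negative log probability of the target token at one position."""
    return -math.log(softmax(logits)[target])


x = rmsnorm([3.0, 4.0])
head_out = attend([1.0, 0.0], keys=[[1.0, 0.0]], values=[[0.5, -0.5]])
loss = nll([0.0, 0.0, 0.0, 0.0], target=2)  # uniform logits over 4 tokens
```

At the first position there is only one key, so the attention weight is 1 and the head simply returns that position's value vector. With uniform logits over 4 tokens the loss is log 4, which is the value an untrained model with a 4-token vocabulary would start near.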

4. Optimizer (Adam)

First moment m and second moment v are updated with decay rates β₁, β₂. Bias-corrected m̂, v̂ update the parameters (learning rate is linearly decayed over steps):

p ← p − lr × m̂ / (√v̂ + ε)
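One update step, including bias correction and the linear learning-rate decay, might be sketched as follows. The hyperparameter defaults (lr0, beta1, beta2, eps) and the flat-list parameter layout are assumptions for illustration only; the source does not state the values the simulator uses.

```python
import math


def adam_step(params, grads, m, v, step, num_steps,
              lr0=1e-2, beta1=0.9, beta2=0.95, eps=1e-8):
    """Apply one in-place Adam update; step counts from 0.

    params, grads, m and v are equally long lists of floats.
    All default hyperparameters are hypothetical placeholders.
    """
    lr = lr0 * (1 - step / num_steps)  # linear decay toward 0
    t = step + 1                       # bias correction uses a 1-based step
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g      # first moment
        v[i] = beta2 * v[i] + (1 - beta2) * g * g  # second moment
        m_hat = m[i] / (1 - beta1 ** t)            # bias-corrected moments
        v_hat = v[i] / (1 - beta2 ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return lr


params, grads = [1.0], [2.0]
m, v = [0.0], [0.0]
lr_used = adam_step(params, grads, m, v, step=0, num_steps=10)
```

On the very first step the bias correction gives m̂ = g and v̂ = g², so the parameter moves by almost exactly lr in the direction opposite the gradient's sign, regardless of the gradient's magnitude.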

5. Inference

Start with BOS; at each position run the forward pass, apply temperature to logits (divide by T), softmax, then sample the next token. Stop when BOS is sampled or context length is reached. Lower temperature → more deterministic; higher → more diverse (and often noisier) outputs.
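The sampling loop can be sketched like this. The forward argument is a hypothetical stand-in for the model: it is assumed to take the tokens generated so far and return logits for the next token. The names sample_next, generate, and block_size are likewise assumptions made for the example.

```python
import math
import random


def sample_next(logits, temperature, rng):
    """Divide logits by the temperature, apply softmax, and sample one token id."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate(forward, bos, block_size, temperature=0.5, seed=0):
    """Sample tokens until BOS is drawn or block_size tokens were produced.

    forward(tokens) -> logits is a hypothetical interface to the model.
    """
    rng = random.Random(seed)
    tokens = [bos]  # the context always starts with BOS
    out = []
    for _ in range(block_size):
        next_id = sample_next(forward(tokens), temperature, rng)
        if next_id == bos:  # BOS doubles as the end-of-sequence marker
            break
        out.append(next_id)
        tokens.append(next_id)
    return out
```

Dividing by a temperature below 1 widens the gaps between logits, so the softmax concentrates on the most likely tokens; a temperature above 1 narrows the gaps and flattens the distribution.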

Simulation

The interactive simulator is below. Use the controls to explore the concepts described above.

[Interactive simulator panels: Model stats (num docs, vocab size, num params, step counter); Training & inference controls (Num steps, Temperature); Network architecture; Training table (posId, Input (context), token_id (in), Target, token_id (out)); Weights (live) table (Number, Matrix, Row, Col, Value); Training log; Generated samples (inference).]

Usage

Use the controls to train a minimal GPT and generate name-like strings:

Num steps: Total training steps (each step uses one document). Try 500–1000 for quick results.
Temperature: Sampling temperature for inference (0.1–1.5). Lower = more conservative; higher = more random.
Reset: Re-initialize the model and optimizer. Clears the log and step counter.
Step Fwd: Advance one autoregressive step (training data). The overlay shows posId, Input (context), token_id, and Target.
Gen Step: Generate one token at a time. The overlay switches to "Generation" mode and shows the input context and sampled output for each step.
Run / Stop: Run training until the step count reaches Num steps, or click Stop to pause.
Generate: Sample 20 sequences from the current model using the current temperature and show them in the samples area.

After training (e.g. 500–1000 steps), click Generate to see short "hallucinated" names. The model learns character-level statistics from the small embedded name list.

Limitations

Educational micro-scale. A single transformer layer with a tiny embedding dimension and a handful of heads, trained on a short embedded name list. It captures the mechanism of a GPT, not the capacity — real models have dozens of layers and billions of parameters.
Character-level, tiny vocabulary. Tokens are individual characters plus a BOS marker; there is no sub-word/BPE tokenizer, so it cannot model words, syntax, or long-range meaning.
Scalar autograd, no tensors/GPU. Every value is a scalar node in a hand-built graph for transparency; this is orders of magnitude slower than batched tensor math, so only short training runs are practical.
Overfits a small dataset. With so little data the model memorizes character statistics rather than generalizing; "names" are plausible-looking but not novel in a meaningful sense.
Fixed, short context. A small maximum context length limits how much prior text attention can use; documents longer than the context are truncated.
Port of a teaching model. This is a visualized JS port of Karpathy's dependency-free micro-GPT; it deliberately omits dropout, weight tying nuances, mixed precision, and the engineering that production training requires.