This interactive simulation demonstrates Q-Learning using Tic-Tac-Toe. You can see the agent's Q-table (its memory) directly: empty cells show how desirable the agent thinks each move is. Watch it learn which moves lead to wins (+1), losses (-1), or draws (0.5) through trial and error.

Why Tic-Tac-Toe for RL?

Reinforcement Learning is often a "black box." Here, the state is the board (a 9-character string, one character per cell), small enough that the entire Q-table can be inspected directly.
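As a sketch of this state representation (the "." placeholder and the helper name are assumptions, not the sim's actual code): the board reads left to right, top to bottom, and the agent's available actions are the indices of the empty cells.

```python
def empty_cells(state: str) -> list[int]:
    """Return indices 0-8 of cells the agent may still play.
    The state is a 9-character string; '.' marks an empty cell here."""
    return [i for i, c in enumerate(state) if c == "."]

# Example: X in the top-left corner, O in the center.
state = "X...O...."
moves = empty_cells(state)  # all indices except 0 and 4
```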
The Q-Learning Update (Bellman Equation)

After each move, the agent updates the Q-value for the state-action pair it just used:
Q-Update (Bellman):
Q(s, a) ← Q(s, a) + α · [ R + γ · max_a' Q(s', a') − Q(s, a) ]
α (alpha) = learning rate; γ (gamma) = discount factor; R = reward; s' = next state. For terminal states (win/loss/draw), there is no next state, so max_a' Q(s', a') is 0.

Parameters and ε-Greedy
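A minimal sketch of this update rule, assuming a dict-based Q-table keyed by (state, action) pairs; all names and parameter values here are illustrative, not the sim's actual code:

```python
from collections import defaultdict

Q = defaultdict(float)        # unseen (s, a) pairs default to 0
ALPHA, GAMMA = 0.1, 0.9       # example learning rate and discount factor

def q_update(s, a, reward, s_next, next_actions):
    """One tabular Q-Learning step for the (s, a) pair just used."""
    # For terminal states next_actions is empty, so the max term falls to 0.
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    td_target = reward + GAMMA * best_next          # R + γ·max_a' Q(s', a')
    Q[(s, a)] += ALPHA * (td_target - Q[(s, a)])    # move estimate toward target
```

For example, winning immediately from a fresh state moves Q(s, a) from 0 to α·(+1) = 0.1, since the terminal max term is 0.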
ε-Greedy: With probability ε the agent picks a random empty cell; otherwise it picks the cell with the highest Q-value. Setting ε to 1 in the sim shows the exploration-exploitation trade-off clearly.

What You See in the Sim

- Board: You play X (first); the agent plays O.
- Q-value overlay: Empty cells show the agent's Q(s, a) for that move.
- Heatmap: Green = higher (good) Q-value; red = lower (bad).
- Neural Console: Stats, win-rate graph, log, and Detailed Info with the last Q-update formula (plugged in) and the action table.
- Q-Table frame: Full Q-table below; when a Q-update is shown in Detailed Info, the corresponding state row and action cell are highlighted and scrolled into view.
- Step: Advance one half-move (X or O).
- Fast Train: Run 1,000 games (X moves first, O learns).
- Hover an empty cell to see the Bellman breakdown (TD target and error).
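The ε-greedy rule described above fits in a few lines; this is a minimal sketch, and the function and variable names are assumptions rather than the sim's actual code:

```python
import random

def choose_action(Q, state, empty_cells, epsilon):
    """ε-greedy: with probability ε explore a random empty cell,
    otherwise exploit the cell with the highest known Q-value."""
    if random.random() < epsilon:
        return random.choice(empty_cells)                          # explore
    return max(empty_cells, key=lambda a: Q.get((state, a), 0.0))  # exploit
```

With ε = 0 this always exploits; with ε = 1 it always explores, which is exactly what sliding ε to 1 in the sim demonstrates.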
Q-table persists across games; only Reset All clears it.
Q-values are learned for O (the agent) only; X is the opponent (random in training, or you when playing).
Mathematical Details

TD Error (Temporal Difference Error):
δ = R + γ · max_a' Q(s', a') − Q(s, a)
The "surprise": the difference between expected and actual value.

- Positive δ: outcome better than expected → increase Q
- Negative δ: outcome worse than expected → decrease Q

Convergence Guarantee: Q-Learning is proven to converge to the optimal Q* if every state-action pair is visited infinitely often and the learning rate α decays appropriately (Σα = ∞, Σα² < ∞).
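The sign behavior of the TD error can be checked numerically; a tiny sketch with illustrative values (a move currently valued at 0.2 that then wins or loses):

```python
def td_error(q_sa, reward, gamma, max_q_next):
    """δ = R + γ·max_a' Q(s', a') − Q(s, a): the 'surprise' in the update."""
    return reward + gamma * max_q_next - q_sa

# Win (+1) from a move we valued at 0.2: better than expected, δ > 0, Q goes up.
# Loss (-1) from the same move: worse than expected, δ < 0, Q goes down.
win_surprise = td_error(0.2, 1.0, 0.9, 0.0)   # terminal, so max Q(s') = 0
loss_surprise = td_error(0.2, -1.0, 0.9, 0.0)
```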
Agent Moves by Q-Values, Not Rewards!

A common misconception: "The agent moves toward rewards." This is imprecise!
Key Insight: The agent cannot see rewards before acting — rewards are only revealed after the action is taken. The action is chosen based on Q-values (learned estimates), not direct reward lookup.
Timeline of Events:
1. Agent is at state s
2. Choose action based on Q(s, a) ← uses Q-values!
3. Execute action a
4. Environment gives reward R ← reward revealed AFTER
5. Agent arrives at state s'
6. Update Q-value using R ← learning happens

Analogy: Reward is like a teacher's grade: you only get it after submitting your answer. Q-value is like your study notes: you consult them to decide your answer. You don't choose answers based on grades (you don't know them yet!); you choose based on what you've learned from past grades.

Known Rewards vs. True RL

In this Tic-Tac-Toe sim, we defined the rewards (+1 / -1 / 0.5). The agent does not see them in advance; it discovers which moves lead to win/loss/draw through play.
Q-Learning shines when the environment's dynamics and reward structure are not known in advance and must be discovered through interaction.
Bootstrapping: How We Handle Unknown Future Rewards

The Q-value represents cumulative future reward:
Q(s, a) ≈ R + γ·R' + γ²·R'' + γ³·R''' + ...

Only the first term, R, is known at update time; every later reward is unknown.

The Problem: How do we calculate Q NOW when it depends on FUTURE rewards we don't know?

The Solution: Bootstrapping. We use our current estimate of future rewards!
The Recursive Trick:
Q(s, a) = R + γ · [R' + γ·R'' + γ²·R''' + ...]

The bracketed sum is exactly Q(s', a') for the best next action. So:

Q(s, a) = R + γ · max_a' Q(s', a')

where R is known and max_a' Q(s', a') is estimated from the Q-table!

Why Does This Work? Iteration! Each update makes the estimate slightly better:
Game 1: Q(s, a) = 0 (initial)
Game 2: Q(s, a) = 0.1 (after a win with α = 0.1: Q ← 0 + 0.1·[1 + 0 − 0])
...
Game N: Q(s, a) converges toward the true value

The "Ripple Effect": When the agent wins (+1) or loses (-1), that reward updates the Q-value for the last move. Earlier moves in the same game get updated using the next state's Q-values, so value propagates backward through the game.
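This backward propagation can be reproduced with two scalar updates per game on a hypothetical two-move game (α = 0.1 and γ = 0.9 are assumed example values, matching those used earlier in this page):

```python
# q_last: Q-value of the winning (terminal) move; q_first: the move before it.
alpha, gamma = 0.1, 0.9
q_first, q_last = 0.0, 0.0
for game in range(2000):
    # the earlier move bootstraps off the CURRENT estimate of the later one
    q_first += alpha * (0.0 + gamma * q_last - q_first)
    # the terminal move sees the real reward; no next state, so max Q(s') = 0
    q_last += alpha * (1.0 + gamma * 0.0 - q_last)
# q_last approaches the true value 1.0; q_first approaches γ·1.0 = 0.9
```

Early on, q_first barely moves because q_last is still near 0; as q_last grows, the value ripples backward, exactly as described above.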
The Magic of Bootstrapping: We're using our own (initially wrong) estimates to improve those same estimates, and mathematically this process converges to the true values! This is called Temporal Difference (TD) Learning.