Web Simulator | ShareTechnote

Web Simulation

Reinforcement Learning: Q-Learning with Tic-Tac-Toe

This interactive simulation demonstrates Q-Learning using Tic-Tac-Toe. You can see the agent's Q-table—its memory—directly: empty cells show how desirable the agent thinks each move is. Watch it learn which moves lead to wins (+1), losses (-1), or draws (0.5) through trial and error.

Sections

Why Tic-Tac-Toe for RL?
The Q-Learning Update (Bellman Equation)
Parameters and ε-Greedy
What You See in the Sim
Simulation
Usage Instructions
Experiments to Try
Mathematical Details
Agent Moves by Q-Values, Not Rewards!
Known Rewards vs. True RL
Bootstrapping: How We Handle Unknown Future Rewards
Real-World Applications
Extensions Beyond Basic Q-Learning
Key Insights
Limitations

Why Tic-Tac-Toe for RL?

Reinforcement Learning is often a "black box." Here, the state is the board (a 9-character string like X.O...X..), and the Q-table stores one value per possible move in each state. By overlaying Q-values on the board, you see the agent's "thought process"—which cells it prefers and how that changes as it learns.

State (s): Board configuration when it's the agent's turn (e.g. X.O...X..)
Action (a): Which empty cell to play (0–8)
Reward (R): +1 for win, -1 for loss, 0.5 for draw (only at game end)
Agent: Plays O (second player); X (opponent) always moves first. Training uses random X so O learns to respond.

The Q-Learning Update (Bellman Equation)

After each move, the agent updates the Q-value for the state-action pair it just used:

Q-Update (Bellman):

                            Q(s, a) ← Q(s, a) + α · [ R + γ · maxa' Q(s', a') − Q(s, a) ]
                        

α (alpha) = learning rate; γ (gamma) = discount factor; R = reward; s' = next state. For terminal states (win/loss/draw), there is no next state, so max Q(s', a') is 0.

Parameters and ε-Greedy

Parameter	Symbol	Range	Effect
Learning Rate	α (alpha)	0 to 1	How much new information overrides old. High = fast but unstable, Low = slow but stable
Discount Factor	γ (gamma)	0 to 1	How much to value future rewards. High = farsighted, Low = shortsighted
Exploration Rate	ε (epsilon)	0 to 1	Probability of taking random action. Balances exploration vs exploitation

ε-Greedy: With probability ε the agent picks a random empty cell; otherwise it picks the cell with the highest Q-value. Setting ε to 1 in the sim shows the exploration—exploitation trade-off clearly.

What You See in the Sim

Board: You play X (first); agent plays O. Q-value overlay: Empty cells show the agent's Q(s, a) for that move. Heatmap: Green = higher (good) Q-value; red = lower (bad). Neural Console: Stats, win-rate graph, log, and Detailed Info with the last Q-update formula (Plugged-in) and the action table. Q-Table frame: Full Q-table below; when a Q-update is shown in Detailed Info, the corresponding state row and action cell are highlighted and scrolled into view. Step: Advance one half-move (X or O). Fast Train: Run 1,000 games (X moves first, O learns). Hover an empty cell to see the Bellman breakdown (TD target and error).

Simulation

The interactive simulator is below. Use the controls to explore the concepts described above.

Board (you play X, agent plays O)

ε (explore) 0.30

α (learn) 0.10

γ (discount) 0.90

Train speed

Neural Console

States: 0 Games: 0 Wins: 0 Win rate: 0%

Q-table persists across games; only Reset All clears it.

Q-values are learned for O (the agent) only; X is the opponent (random in training, or you when playing).

Q(s,a) ← Q(s,a) + α·[R + γ·max Q(s',a') − Q(s,a)]

Detailed Info

Q(s,a) ← Q(s,a) + α[R + γ·max_a' Q(s',a') − Q(s,a)]

α= , γ= , ε=

State s =

Q(s,a) for all possible actions:

Q-Table

Usage Instructions

Play: You play X (first); click an empty cell. The agent (O) responds using its Q-table (ε-greedy).
Q-value overlay: Empty cells show Q(s, a) for that move. Green = higher (good); red = lower (bad).
Step: Advance one move (X or O). Use to see the Plugged-in formula and Q-table highlight for each Q-update.
Fast Train: Run 1,000 games (X moves first, random X; O learns). Win rate and state count update in the Neural Console.
Q-Table frame: Lists all states and Q(s,a) for actions 0–8. The row/cell for the last Q-update (in Detailed Info) is highlighted and auto-scrolled.
Sliders: ε (exploration), α (learning rate), γ (discount). Train speed: Slow or Instant.
Reset board: New game; keeps Q-table and stats.
Reset All: Clears board, Q-table, stats, graph, log, and Detailed Info. Use Fast Train or Step to train from scratch.
Persistence: The Q-table and stats are saved in the browser (localStorage) and restored on reload until you click Reset All.
Hover: Hover an empty cell for the Bellman breakdown (TD target and error).

Experiments to Try

Experiment	Setup	What to Observe
Fast Train	Click Fast Train (1k games), then play	Agent gets much stronger; Q-values favor wins and blocks
High exploration	Set ε = 1.0	Agent makes random moves; illustrates exploration vs exploitation
Step + Plugged-in	Use Step to advance; when O moves, check Detailed Info	Plugged-in shows the exact Bellman update; Q-Table frame highlights the updated state and action
Hover + Bellman	Hover an empty cell	Tooltip shows Q(s,a), TD target, and TD error for that move

Mathematical Details

TD Error (Temporal Difference Error):

                        δ = R + γ · maxa'Q(s', a') - Q(s, a)

                        The "surprise" - difference between expected and actual value.

                        Positive δ = outcome better than expected → increase Q

                        Negative δ = outcome worse than expected → decrease Q

Convergence Guarantee: Q-Learning is proven to converge to optimal Q* if:

All state-action pairs are visited infinitely often
Learning rate decreases appropriately (∑α = ∞, ∑α² < ∞)
In practice: ε-greedy exploration ensures sufficient exploration

Agent Moves by Q-Values, Not Rewards!

A common misconception: "The agent moves toward rewards." This is imprecise!

Common Saying	Precise Statement
"Agent moves toward reward"	Agent moves toward high Q-value
"Agent avoids negative reward"	Agent avoids low Q-value (learned from past negative rewards)
"Reward guides behavior"	Reward trains Q; Q guides behavior

Key Insight: The agent cannot see rewards before acting — rewards are only revealed after the action is taken. The action is chosen based on Q-values (learned estimates), not direct reward lookup.

                        Timeline of Events:

                        1. Agent is at state s

                        2. Choose action based on Q(s, a) ← Uses Q-values!

                        3. Execute action a

                        4. Environment gives reward R — Reward revealed AFTER

                        5. Agent arrives at state s'

                        6. Update Q-value using R ← Learning happens

Analogy: Reward is like a teacher's grade — you only get it after submitting your answer. Q-value is like your study notes — you consult them to decide your answer. You don't choose answers based on grades (you don't know them yet!), you choose based on what you've learned from past grades.

Known Rewards vs. True RL

In this Tic-Tac-Toe sim, we defined rewards (+1 / -1 / 0.5). The agent does not see them in advance; it discovers which moves lead to win/loss/draw through play.

Entity	Knows Rewards?	Notes
We (designers)	✓ Yes	We defined win/loss/draw rewards
Agent (true RL)	✗ No	Learns from experience via Q-updates

Q-Learning shines when:

Stochastic transitions: Same action → different outcomes
Unknown environment: Robot exploring new building
Delayed rewards: Chess — win/lose only at end
Complex reward shaping: Cumulative, discounted rewards

Bootstrapping: How We Handle Unknown Future Rewards

The Q-value represents cumulative future reward:

                        Q(s,a) ≈ R + γ·R' + γ²·R'' + γ³·R''' + ...

                             ↑     ↑       ↑         ↑

                           known  unknown  unknown  unknown

The Problem: How do we calculate Q NOW when it depends on FUTURE rewards we don't know?

The Solution — Bootstrapping: We use our current estimate of future rewards!

The Recursive Trick:

                            Q(s,a) = R + γ · [R' + γ·R'' + γ²·R''' + ...]

                                     ↑        └— This IS Q(s', a')!

                                  known!              This IS Q(s', a')!

                            So: Q(s,a) = R + γ · max Q(s', a')

                                           ↑              ↑

                                        known         estimated from Q-table!

Why Does This Work? Iteration! Each update makes the estimate slightly better:

                        Game 1:   Q(s,a) = 0      (initial)

                        Game 2:   Q(s,a) = 0.09   (after a win, α=0.1)

                        ...

                        Game N:   Q(s,a) converges toward true value

The "Ripple Effect": When the agent wins (+1) or loses (-1), that reward updates the Q-value for the last move. Earlier moves in the same game get updated using the next state's Q-values. Value propagates backward through the game.

The Magic of Bootstrapping: We're using our own (initially wrong) estimates to improve those same estimates, and mathematically this process converges to the true values! This is called Temporal Difference (TD) Learning.

Real-World Applications

Domain	Application
Robotics	Navigation, manipulation, walking robots
Games	Game AI (Atari, Go, Chess), NPC behavior
Finance	Portfolio optimization, trading strategies
Healthcare	Treatment planning, drug dosing
Autonomous Vehicles	Decision making, path planning
Recommendation Systems	Content personalization, ad placement

Extensions Beyond Basic Q-Learning

SARSA: On-policy variant (uses actual next action, not max)
Deep Q-Networks (DQN): Neural network approximates Q-function
Double DQN: Addresses overestimation bias
Policy Gradient: Directly optimize policy instead of values
Actor-Critic: Combines value and policy learning

Key Insights

Credit Assignment: How does the agent know which actions led to reward? Q-Learning propagates value backwards through the Bellman equation.
Delayed Rewards: The discount factor γ handles rewards that come many steps later.
Bootstrapping: We update estimates using other estimates - powerful but can propagate errors.
Tabular vs Function Approximation: This demo uses a table; real-world uses neural networks.

Limitations

Tabular Q-learning on a tiny game. Tic-Tac-Toe has only a few thousand reachable board states, so a full Q-table fits in memory. This approach does not scale to larger games — chess or Go need function approximation (Deep RL), which this demo does not implement.
Self-play / fixed-opponent learning. The agent learns against the training opponent it was given; its strength reflects that opponent's quality. It is not guaranteed to play the game-theoretic optimal (minimax) strategy unless trained long enough against strong play.
Single fixed reward scheme. Win / lose / draw rewards are predefined; the agent is not discovering an unknown reward structure, and reward shaping is not explored.
ε-greedy exploration only. Action selection uses simple ε-greedy; more advanced exploration (UCB, softmax, optimistic init) is not offered, and ε / α / γ are set manually.
Deterministic, fully observable. The board state is fully known and moves are deterministic — there is no hidden information, stochastic transitions, or partial observability.
Teaching tool, not a benchmark. Training-episode counts and visual overlays are tuned for clarity, not to demonstrate state-of-the-art RL performance.