Web Simulation 

 

 

 

 

 

Reinforcement Learning: Q-Learning with Tic-Tac-Toe 

This interactive simulation demonstrates Q-Learning using Tic-Tac-Toe. You can see the agent's Q-table—its memory—directly: empty cells show how desirable the agent thinks each move is. Watch it learn which moves lead to wins (+1), losses (-1), or draws (0.5) through trial and error.

Why Tic-Tac-Toe for RL?

Reinforcement Learning is often a "black box." Here, the state is the board (a 9-character string like X.O...X..), and the Q-table stores one value per possible move in each state. By overlaying Q-values on the board, you see the agent's "thought process"—which cells it prefers and how that changes as it learns.

  • State (s): Board configuration when it's the agent's turn (e.g. X.O...X..)
  • Action (a): Which empty cell to play (0–8)
  • Reward (R): +1 for win, -1 for loss, 0.5 for draw (only at game end)
  • Agent: Plays O (second player); X (opponent) always moves first. Training uses random X so O learns to respond.

The Q-Learning Update (Bellman Equation)

After each move, the agent updates the Q-value for the state-action pair it just used:

Q-Update (Bellman):
Q(s, a) ← Q(s, a) + α · [ R + γ · maxa' Q(s', a') − Q(s, a) ]

α (alpha) = learning rate; γ (gamma) = discount factor; R = reward; s' = next state. For terminal states (win/loss/draw), there is no next state, so max Q(s', a') is 0.

Parameters and ε-Greedy

Parameter

Symbol

Range

Effect

Learning Rate

α (alpha)

0 to 1

How much new information overrides old. High = fast but unstable, Low = slow but stable

Discount Factor

γ (gamma)

0 to 1

How much to value future rewards. High = farsighted, Low = shortsighted

Exploration Rate

ε (epsilon)

0 to 1

Probability of taking random action. Balances exploration vs exploitation

ε-Greedy: With probability ε the agent picks a random empty cell; otherwise it picks the cell with the highest Q-value. Setting ε to 1 in the sim shows the exploration—exploitation trade-off clearly.

What You See in the Sim

Board: You play X (first); agent plays O. Q-value overlay: Empty cells show the agent's Q(s, a) for that move. Heatmap: Green = higher (good) Q-value; red = lower (bad). Neural Console: Stats, win-rate graph, log, and Detailed Info with the last Q-update formula (Plugged-in) and the action table. Q-Table frame: Full Q-table below; when a Q-update is shown in Detailed Info, the corresponding state row and action cell are highlighted and scrolled into view. Step: Advance one half-move (X or O). Fast Train: Run 1,000 games (X moves first, O learns). Hover an empty cell to see the Bellman breakdown (TD target and error).

Simulation

The interactive simulator is below. Use the controls to explore the concepts described above.

Board (you play X, agent plays O)
0.30
0.10
0.90
Neural Console
States: 0 Games: 0 Wins: 0 Win rate: 0%
Q-table persists across games; only Reset All clears it.
Q-values are learned for O (the agent) only; X is the opponent (random in training, or you when playing).
Q(s,a) ← Q(s,a) + α·[R + γ·max Q(s',a') − Q(s,a)]
Detailed Info
Q(s,a) ← Q(s,a) + α[R + γ·maxa' Q(s',a') − Q(s,a)]
α= , γ= , ε=
State s =
Q(s,a) for all possible actions:
Q-Table

Usage Instructions

  1. Play: You play X (first); click an empty cell. The agent (O) responds using its Q-table (ε-greedy).
  2. Q-value overlay: Empty cells show Q(s, a) for that move. Green = higher (good); red = lower (bad).
  3. Step: Advance one move (X or O). Use to see the Plugged-in formula and Q-table highlight for each Q-update.
  4. Fast Train: Run 1,000 games (X moves first, random X; O learns). Win rate and state count update in the Neural Console.
  5. Q-Table frame: Lists all states and Q(s,a) for actions 0–8. The row/cell for the last Q-update (in Detailed Info) is highlighted and auto-scrolled.
  6. Sliders: ε (exploration), α (learning rate), γ (discount). Train speed: Slow or Instant.
  7. Reset board: New game; keeps Q-table and stats.
  8. Reset All: Clears board, Q-table, stats, graph, log, and Detailed Info. Use Fast Train or Step to train from scratch.
  9. Persistence: The Q-table and stats are saved in the browser (localStorage) and restored on reload until you click Reset All.
  10. Hover: Hover an empty cell for the Bellman breakdown (TD target and error).

Experiments to Try

Experiment

Setup

What to Observe

Fast Train

Click Fast Train (1k games), then play

Agent gets much stronger; Q-values favor wins and blocks

High exploration

Set ε = 1.0

Agent makes random moves; illustrates exploration vs exploitation

Step + Plugged-in

Use Step to advance; when O moves, check Detailed Info

Plugged-in shows the exact Bellman update; Q-Table frame highlights the updated state and action

Hover + Bellman

Hover an empty cell

Tooltip shows Q(s,a), TD target, and TD error for that move

Mathematical Details

TD Error (Temporal Difference Error):

δ = R + γ · maxa'Q(s', a') - Q(s, a)

The "surprise" - difference between expected and actual value.
Positive δ = outcome better than expected → increase Q
Negative δ = outcome worse than expected → decrease Q

Convergence Guarantee: Q-Learning is proven to converge to optimal Q* if:

  • All state-action pairs are visited infinitely often
  • Learning rate decreases appropriately (∑α = ∞, ∑α² < ∞)
  • In practice: ε-greedy exploration ensures sufficient exploration

Agent Moves by Q-Values, Not Rewards!

A common misconception: "The agent moves toward rewards." This is imprecise!

Common Saying

Precise Statement

"Agent moves toward reward"

Agent moves toward high Q-value

"Agent avoids negative reward"

Agent avoids low Q-value (learned from past negative rewards)

"Reward guides behavior"

Reward trains Q; Q guides behavior

Key Insight: The agent cannot see rewards before acting — rewards are only revealed after the action is taken. The action is chosen based on Q-values (learned estimates), not direct reward lookup.
Timeline of Events:

1. Agent is at state s
2. Choose action based on Q(s, a) ← Uses Q-values!
3. Execute action a
4. Environment gives reward R — Reward revealed AFTER
5. Agent arrives at state s'
6. Update Q-value using R ← Learning happens

Analogy: Reward is like a teacher's grade — you only get it after submitting your answer. Q-value is like your study notes — you consult them to decide your answer. You don't choose answers based on grades (you don't know them yet!), you choose based on what you've learned from past grades.

Known Rewards vs. True RL

In this Tic-Tac-Toe sim, we defined rewards (+1 / -1 / 0.5). The agent does not see them in advance; it discovers which moves lead to win/loss/draw through play.

Entity

Knows Rewards?

Notes

We (designers)

✓ Yes

We defined win/loss/draw rewards

Agent (true RL)

✗ No

Learns from experience via Q-updates

Q-Learning shines when:

  • Stochastic transitions: Same action → different outcomes
  • Unknown environment: Robot exploring new building
  • Delayed rewards: Chess — win/lose only at end
  • Complex reward shaping: Cumulative, discounted rewards

Bootstrapping: How We Handle Unknown Future Rewards

The Q-value represents cumulative future reward:

Q(s,a) ≈ R + γ·R' + γ²·R'' + γ³·R''' + ...
     ↑     ↑       ↑         ↑
   known  unknown  unknown  unknown

The Problem: How do we calculate Q NOW when it depends on FUTURE rewards we don't know?

The Solution — Bootstrapping: We use our current estimate of future rewards!

The Recursive Trick:

Q(s,a) = R + γ · [R' + γ·R'' + γ²·R''' + ...]
         ↑        └— This IS Q(s', a')!
      known!              This IS Q(s', a')!

So: Q(s,a) = R + γ · max Q(s', a')
               ↑              ↑
            known         estimated from Q-table!

Why Does This Work? Iteration! Each update makes the estimate slightly better:

Game 1:   Q(s,a) = 0      (initial)
Game 2:   Q(s,a) = 0.09   (after a win, α=0.1)
...
Game N:   Q(s,a) converges toward true value

The "Ripple Effect": When the agent wins (+1) or loses (-1), that reward updates the Q-value for the last move. Earlier moves in the same game get updated using the next state's Q-values. Value propagates backward through the game.

The Magic of Bootstrapping: We're using our own (initially wrong) estimates to improve those same estimates, and mathematically this process converges to the true values! This is called Temporal Difference (TD) Learning.

Real-World Applications

Domain

Application

Robotics

Navigation, manipulation, walking robots

Games

Game AI (Atari, Go, Chess), NPC behavior

Finance

Portfolio optimization, trading strategies

Healthcare

Treatment planning, drug dosing

Autonomous Vehicles

Decision making, path planning

Recommendation Systems

Content personalization, ad placement

Extensions Beyond Basic Q-Learning

  • SARSA: On-policy variant (uses actual next action, not max)
  • Deep Q-Networks (DQN): Neural network approximates Q-function
  • Double DQN: Addresses overestimation bias
  • Policy Gradient: Directly optimize policy instead of values
  • Actor-Critic: Combines value and policy learning

Key Insights

  • Credit Assignment: How does the agent know which actions led to reward? Q-Learning propagates value backwards through the Bellman equation.
  • Delayed Rewards: The discount factor γ handles rewards that come many steps later.
  • Bootstrapping: We update estimates using other estimates - powerful but can propagate errors.
  • Tabular vs Function Approximation: This demo uses a table; real-world uses neural networks.

Limitations

  • Tabular Q-learning on a tiny game. Tic-Tac-Toe has only a few thousand reachable board states, so a full Q-table fits in memory. This approach does not scale to larger games — chess or Go need function approximation (Deep RL), which this demo does not implement.
  • Self-play / fixed-opponent learning. The agent learns against the training opponent it was given; its strength reflects that opponent's quality. It is not guaranteed to play the game-theoretic optimal (minimax) strategy unless trained long enough against strong play.
  • Single fixed reward scheme. Win / lose / draw rewards are predefined; the agent is not discovering an unknown reward structure, and reward shaping is not explored.
  • ε-greedy exploration only. Action selection uses simple ε-greedy; more advanced exploration (UCB, softmax, optimistic init) is not offered, and ε / α / γ are set manually.
  • Deterministic, fully observable. The board state is fully known and moves are deterministic — there is no hidden information, stochastic transitions, or partial observability.
  • Teaching tool, not a benchmark. Training-episode counts and visual overlays are tuned for clarity, not to demonstrate state-of-the-art RL performance.