Reinforcement Learning: Q-Learning with Tic-Tac-Toe 

This interactive simulation demonstrates Q-Learning using Tic-Tac-Toe. You can see the agent's Q-table—its memory—directly: empty cells show how desirable the agent thinks each move is. Watch it learn which moves lead to wins (+1), losses (-1), or draws (0.5) through trial and error.

Why Tic-Tac-Toe for RL?

Reinforcement Learning is often a "black box." Here, the state is the board (a 9-character string like X.O...X..), and the Q-table stores one value per possible move in each state. By overlaying Q-values on the board, you see the agent's "thought process"—which cells it prefers and how that changes as it learns.

  • State (s): Board configuration when it's the agent's turn (e.g. X.O...X..)
  • Action (a): Which empty cell to play (0–8)
  • Reward (R): +1 for win, -1 for loss, 0.5 for draw (only at game end)
  • Agent: Plays O (second player); X (opponent) always moves first. Training uses random X so O learns to respond.

The Q-Learning Update (Bellman Equation)

After each move, the agent updates the Q-value for the state-action pair it just used:

Q-Update (Bellman):
Q(s, a) ← Q(s, a) + α · [ R + γ · max_a' Q(s', a') − Q(s, a) ]

α (alpha) = learning rate; γ (gamma) = discount factor; R = reward; s' = next state. For terminal states (win/loss/draw), there is no next state, so max_a' Q(s', a') is 0.
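The update above can be sketched in a few lines. This is a hypothetical illustration, not the sim's actual code: it assumes a dict-based Q-table keyed by (state, action) pairs, with states as 9-character board strings as described above.

```python
def empty_cells(state):
    """Indices 0-8 that are still playable ('.')."""
    return [i for i, c in enumerate(state) if c == "."]

def q_update(Q, s, a, reward, s_next, terminal, alpha=0.1, gamma=0.9):
    """Apply Q(s,a) <- Q(s,a) + alpha * [R + gamma * max_a' Q(s',a') - Q(s,a)]."""
    # Terminal states (win/loss/draw) have no next state, so the max term is 0.
    future = 0.0 if terminal else max(
        (Q.get((s_next, a2), 0.0) for a2 in empty_cells(s_next)),
        default=0.0,
    )
    td_error = reward + gamma * future - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return Q[(s, a)]

# Example: a terminal winning move for O with reward +1 and fresh Q-table.
Q = {}
q_update(Q, "XOX.O.X.O", 7, reward=1.0, s_next=None, terminal=True)
# Q for that state-action moves from 0 to 0 + 0.1 * (1 - 0) = 0.1
```

With an empty Q-table, each terminal win nudges the last move's value up by α · (1 − old value), which is where the learning-rate slider's effect comes from.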

Parameters and ε-Greedy

Parameter        | Symbol      | Range  | Effect
Learning Rate    | α (alpha)   | 0 to 1 | How much new information overrides old. High = fast but unstable; low = slow but stable
Discount Factor  | γ (gamma)   | 0 to 1 | How much to value future rewards. High = farsighted; low = shortsighted
Exploration Rate | ε (epsilon) | 0 to 1 | Probability of taking a random action. Balances exploration vs. exploitation

ε-Greedy: With probability ε the agent picks a random empty cell; otherwise it picks the empty cell with the highest Q-value. Sliding ε in the sim makes the exploration-exploitation trade-off visible: at ε = 1 every move is random (pure exploration), at ε = 0 the agent always exploits its current Q-values.
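The selection rule is simple enough to sketch directly. This is a hypothetical Python version (the sim itself runs in the browser); it assumes the same dict-based Q-table keyed by (state, action):

```python
import random

def epsilon_greedy(Q, state, epsilon):
    """Pick a move: random with probability epsilon, else highest Q(s, a)."""
    moves = [i for i, c in enumerate(state) if c == "."]
    if random.random() < epsilon:
        return random.choice(moves)                           # explore
    return max(moves, key=lambda a: Q.get((state, a), 0.0))   # exploit

# epsilon = 0.0 -> always the best known move (here, cell 4):
Q = {("X........", 4): 0.9, ("X........", 1): 0.2}
epsilon_greedy(Q, "X........", 0.0)
# epsilon = 1.0 -> always a random empty cell.
```

Note that unseen (state, action) pairs default to 0, so before training the "greedy" branch is effectively arbitrary; exploration is what fills the table with meaningful values.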

What You See in the Sim

  • Board: You play X (first); the agent plays O.
  • Q-value overlay: Empty cells show the agent's Q(s, a) for that move.
  • Heatmap: Green = higher (good) Q-value; red = lower (bad).
  • Neural Console: Stats, win-rate graph, log, and Detailed Info with the last Q-update formula (Plugged-in) and the action table.
  • Q-Table frame: Full Q-table below; when a Q-update is shown in Detailed Info, the corresponding state row and action cell are highlighted and scrolled into view.
  • Step: Advance one half-move (X or O).
  • Fast Train: Run 1,000 games (X moves first, O learns).
  • Hover: Hover an empty cell to see the Bellman breakdown (TD target and error).

(Interactive simulation appears here: the board with Q-value overlays, the Neural Console, and the Q-Table frame. The Q-table persists across games; only Reset All clears it. Q-values are learned for O, the agent, only; X is the opponent: random in training, or you when playing.)

Usage Instructions

  1. Play: You play X (first); click an empty cell. The agent (O) responds using its Q-table (ε-greedy).
  2. Q-value overlay: Empty cells show Q(s, a) for that move. Green = higher (good); red = lower (bad).
  3. Step: Advance one move (X or O). Use to see the Plugged-in formula and Q-table highlight for each Q-update.
  4. Fast Train: Run 1,000 games (X moves first, random X; O learns). Win rate and state count update in the Neural Console.
  5. Q-Table frame: Lists all states and Q(s,a) for actions 0–8. The row/cell for the last Q-update (in Detailed Info) is highlighted and auto-scrolled.
  6. Sliders: ε (exploration), α (learning rate), γ (discount). Train speed: Slow or Instant.
  7. Reset board: New game; keeps Q-table and stats.
  8. Reset All: Clears board, Q-table, stats, graph, log, and Detailed Info. Use Fast Train or Step to train from scratch.
  9. Persistence: The Q-table and stats are saved in the browser (localStorage) and restored on reload until you click Reset All.
  10. Hover: Hover an empty cell for the Bellman breakdown (TD target and error).

Experiments to Try

Experiment        | Setup                                                   | What to Observe
Fast Train        | Click Fast Train (1k games), then play                  | Agent gets much stronger; Q-values favor wins and blocks
High exploration  | Set ε = 1.0                                             | Agent makes random moves; illustrates exploration vs. exploitation
Step + Plugged-in | Use Step to advance; when O moves, check Detailed Info  | Plugged-in shows the exact Bellman update; Q-Table frame highlights the updated state and action
Hover + Bellman   | Hover an empty cell                                     | Tooltip shows Q(s,a), TD target, and TD error for that move

Mathematical Details

TD Error (Temporal Difference Error):

δ = R + γ · max_a' Q(s', a') − Q(s, a)

The "surprise": the difference between expected and actual value.
Positive δ = outcome better than expected → increase Q
Negative δ = outcome worse than expected → decrease Q
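The sign behavior is easy to check with a tiny helper (a hypothetical sketch; the function name is illustrative):

```python
def td_error(q_sa, reward, max_q_next, gamma=0.9):
    """delta = R + gamma * max_a' Q(s', a') - Q(s, a): the 'surprise'."""
    return reward + gamma * max_q_next - q_sa

# Win (+1) at a terminal state after expecting 0.2 -> positive delta (0.8),
# so Q is pulled up:
td_error(0.2, 1.0, 0.0)
# Loss (-1) after expecting 0.2 -> negative delta (-1.2), so Q is pulled down:
td_error(0.2, -1.0, 0.0)
```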

Convergence Guarantee: Q-Learning is proven to converge to optimal Q* if:

  • All state-action pairs are visited infinitely often
  • Learning rate decreases appropriately (∑α = ∞, ∑α² < ∞)
  • In practice: ε-greedy exploration ensures sufficient exploration
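The step-size conditions can be illustrated numerically. The classic schedule α_t = 1/t satisfies both: its sum grows without bound (like ln t) while the sum of squares stays bounded (approaching π²/6 ≈ 1.645). This snippet is an illustration of those conditions, not part of the sim:

```python
# alpha_t = 1/t: sum diverges (slowly), sum of squares converges.
n = 100_000
sum_alpha = sum(1 / t for t in range(1, n + 1))        # ~ ln(n) + 0.577 ≈ 12.09
sum_alpha_sq = sum(1 / t ** 2 for t in range(1, n + 1))  # -> pi^2 / 6 ≈ 1.6449
```

The sim keeps α fixed for simplicity, which is common in practice; a fixed α trades the convergence guarantee for continued adaptability.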

Agent Moves by Q-Values, Not Rewards!

A common misconception: "The agent moves toward rewards." This is imprecise!

Common Saying                  | Precise Statement
"Agent moves toward reward"    | Agent moves toward high Q-value
"Agent avoids negative reward" | Agent avoids low Q-value (learned from past negative rewards)
"Reward guides behavior"       | Reward trains Q; Q guides behavior

Key Insight: The agent cannot see rewards before acting — rewards are only revealed after the action is taken. The action is chosen based on Q-values (learned estimates), not direct reward lookup.
Timeline of Events:

1. Agent is at state s
2. Choose action based on Q(s, a) ← Uses Q-values!
3. Execute action a
4. Environment gives reward R — Reward revealed AFTER
5. Agent arrives at state s'
6. Update Q-value using R ← Learning happens
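The timeline above can be written as one function. This is a hypothetical sketch: `env_step` stands in for the environment and is assumed to return `(reward, next_state, done)`; the ordering of the steps is the point.

```python
import random

def play_and_learn(Q, state, env_step, epsilon=0.1, alpha=0.1, gamma=0.9):
    """One pass through the timeline: choose by Q, learn from R afterward."""
    moves = [i for i, c in enumerate(state) if c == "."]
    # Steps 1-2: the action is chosen from Q-values; no reward is visible yet.
    if random.random() < epsilon:
        action = random.choice(moves)
    else:
        action = max(moves, key=lambda m: Q.get((state, m), 0.0))
    # Steps 3-5: only after acting does the environment reveal R and s'.
    reward, s_next, done = env_step(state, action)
    # Step 6: the reward feeds learning, not the original choice.
    future = 0.0 if done else max(
        (Q.get((s_next, m), 0.0) for m, c in enumerate(s_next) if c == "."),
        default=0.0,
    )
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * future - old)
    return action, reward, s_next
```

Notice that `reward` does not exist until after `action` is fixed; the code makes the "Q chooses, R trains" ordering explicit.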

Analogy: Reward is like a teacher's grade — you only get it after submitting your answer. Q-value is like your study notes — you consult them to decide your answer. You don't choose answers based on grades (you don't know them yet!), you choose based on what you've learned from past grades.

Known Rewards vs. True RL

In this Tic-Tac-Toe sim, we defined rewards (+1 / -1 / 0.5). The agent does not see them in advance; it discovers which moves lead to win/loss/draw through play.

Entity Knows Rewards? Notes
We (designers) ✓ Yes We defined win/loss/draw rewards
Agent (true RL) ✗ No Learns from experience via Q-updates

Q-Learning shines when:

  • Stochastic transitions: Same action → different outcomes
  • Unknown environment: Robot exploring new building
  • Delayed rewards: Chess — win/lose only at end
  • Complex reward shaping: Cumulative, discounted rewards

Bootstrapping: How We Handle Unknown Future Rewards

The Q-value represents cumulative future reward:

Q(s,a) ≈ R + γ·R' + γ²·R'' + γ³·R''' + ...
     ↑     ↑       ↑         ↑
   known  unknown  unknown  unknown

The Problem: How do we calculate Q NOW when it depends on FUTURE rewards we don't know?

The Solution — Bootstrapping: We use our current estimate of future rewards!

The Recursive Trick:

Q(s,a) = R + γ · [ R' + γ·R'' + γ²·R''' + ... ]
         ↑         └── the bracketed sum IS Q(s', a')!
       known!

So: Q(s,a) = R + γ · max_a' Q(s', a')
               ↑                  ↑
            known        estimated from the Q-table!

Why Does This Work? Iteration! Each update makes the estimate slightly better:

Game 1:   Q(s,a) = 0      (initial)
Game 2:   Q(s,a) = 0.1    (after a win: 0 + 0.1·(1 − 0), with α = 0.1)
...
Game N:   Q(s,a) converges toward the true value

The "Ripple Effect": When the agent wins (+1) or loses (-1), that reward updates the Q-value for the last move. Earlier moves in the same game get updated using the next state's Q-values. Value propagates backward through the game.

The Magic of Bootstrapping: We're using our own (initially wrong) estimates to improve those same estimates, and mathematically this process converges to the true values! This is called Temporal Difference (TD) Learning.
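The ripple effect is easiest to see on a toy chain rather than the full Tic-Tac-Toe state space. This hypothetical example uses two non-terminal states s0 → s1 → terminal (+1 reward), one action per state, so Q can be keyed by state alone:

```python
# Toy chain: s0 -> s1 -> terminal (+1). Repeated episodes ripple the
# terminal reward backward through the bootstrapped estimates.
alpha, gamma = 0.5, 0.9
Q = {"s0": 0.0, "s1": 0.0}

for episode in range(50):
    # s0's update bootstraps on the CURRENT (initially wrong) estimate of s1:
    Q["s0"] += alpha * (0.0 + gamma * Q["s1"] - Q["s0"])
    # s1's move is followed by the terminal reward (+1), no future term:
    Q["s1"] += alpha * (1.0 + gamma * 0.0 - Q["s1"])

# Q["s1"] converges to 1.0 and Q["s0"] to gamma * 1.0 = 0.9.
```

In game 1, s0 learns nothing (Q["s1"] is still 0); only after s1's value rises does the reward propagate backward — exactly the ripple described above.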

Real-World Applications

Domain                 | Application
Robotics               | Navigation, manipulation, walking robots
Games                  | Game AI (Atari, Go, Chess), NPC behavior
Finance                | Portfolio optimization, trading strategies
Healthcare             | Treatment planning, drug dosing
Autonomous Vehicles    | Decision making, path planning
Recommendation Systems | Content personalization, ad placement

Extensions Beyond Basic Q-Learning

  • SARSA: On-policy variant (uses actual next action, not max)
  • Deep Q-Networks (DQN): Neural network approximates Q-function
  • Double DQN: Addresses overestimation bias
  • Policy Gradient: Directly optimize policy instead of values
  • Actor-Critic: Combines value and policy learning
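The Q-Learning/SARSA distinction comes down to one term in the TD target, which a short sketch (hypothetical helper names) makes concrete:

```python
# Q-Learning (off-policy): bootstraps on the BEST next action, regardless
# of what the behavior policy will actually do next.
def q_learning_target(Q, s_next, reward, actions, gamma=0.9):
    return reward + gamma * max(Q.get((s_next, a), 0.0) for a in actions)

# SARSA (on-policy): bootstraps on the action actually taken next,
# including exploratory moves.
def sarsa_target(Q, s_next, a_next, reward, gamma=0.9):
    return reward + gamma * Q.get((s_next, a_next), 0.0)

# With Q(s,0)=0.5 and Q(s,1)=0.1, Q-Learning's target uses 0.5 even if
# the epsilon-greedy policy happens to explore action 1 next.
Q = {("s", 0): 0.5, ("s", 1): 0.1}
q_learning_target(Q, "s", 0.0, [0, 1])   # uses max -> 0.9 * 0.5 = 0.45
sarsa_target(Q, "s", 1, 0.0)             # uses a_next -> 0.9 * 0.1 = 0.09
```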

Key Insights

  • Credit Assignment: How does the agent know which actions led to reward? Q-Learning propagates value backwards through the Bellman equation.
  • Delayed Rewards: The discount factor γ handles rewards that come many steps later.
  • Bootstrapping: We update estimates using other estimates; powerful, but errors can propagate.
  • Tabular vs. Function Approximation: This demo uses a table; real-world systems approximate Q with neural networks.