Hidden Markov Model (HMM) - Interactive Tutorial

A Hidden Markov Model (HMM) is a statistical model that extends the basic Markov Chain by introducing the concept of hidden states and observable emissions. While in a standard Markov Chain we can directly observe the state of the system, in an HMM the true state is hidden from us—we can only observe outputs (emissions) that depend probabilistically on the hidden state.

1. Mathematical Foundation

1.1 Components of an HMM

An HMM is formally specified by the following components; the model parameters are conventionally written as the tuple λ = (A, B, π):

  • S = {S0, S1, ..., SN-1} - Set of N hidden states
  • O = {O0, O1, ..., OM-1} - Set of M possible observations (emissions)
  • A = {aij} - State transition probability matrix, where aij = P(Sj at t+1 | Si at t)
  • B = {bi(k)} - Emission probability matrix, where bi(k) = P(Ok | Si)
  • π = {πi} - Initial state distribution, where πi = P(Si at t=0)
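To make the components concrete, they can be written down directly in code. The sketch below uses NumPy and a hypothetical two-state weather model (the state and observation names, and all probability values, are illustrative assumptions, not part of the formal definition):

```python
import numpy as np

# Hidden states: S0 = Sunny, S1 = Rainy (hypothetical weather model)
states = ["Sunny", "Rainy"]                # S, so N = 2
observations = ["Walk", "Shop", "Clean"]   # O, so M = 3

# A[i][j] = P(state j at t+1 | state i at t); each row sums to 1
A = np.array([[0.8, 0.2],
              [0.4, 0.6]])

# B[i][k] = P(observation k | state i); each row sums to 1
B = np.array([[0.6, 0.3, 0.1],
              [0.1, 0.4, 0.5]])

# pi[i] = P(state i at t=0)
pi = np.array([0.7, 0.3])

# Sanity checks: every row must be a valid probability distribution
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
assert np.isclose(pi.sum(), 1.0)
```

The row-sum assertions mirror the constraints Σj aij = 1 and Σk bi(k) = 1 stated below.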

1.2 The Transition Matrix (A) — Hidden → Hidden

The transition matrix A describes the probability of moving from one hidden state to another hidden state:

Key Point: The Transition Matrix operates entirely within the hidden layer. It determines how the system moves between hidden states that we cannot directly observe.

    A = | a0,0      a0,1      ...   a0,N-1    |     where Σj aij = 1 for all i
        | a1,0      a1,1      ...   a1,N-1    |
        | ...       ...       ...   ...       |
        | aN-1,0    aN-1,1    ...   aN-1,N-1  |

    aij = P(hidden state j at t+1 | hidden state i at t)

1.3 The Emission Matrix (B) — Hidden → Observed

The emission matrix B describes the probability of observing a particular output given the current hidden state:

Key Point: The Emission Matrix is the bridge between the hidden layer and the observable world. It determines what we actually see based on which hidden state the system is in.

    B = | b0(O0)      b0(O1)      ...   b0(OM-1)    |     where Σk bi(k) = 1 for all i
        | b1(O0)      b1(O1)      ...   b1(OM-1)    |
        | ...         ...         ...   ...         |
        | bN-1(O0)    bN-1(O1)    ...   bN-1(OM-1)  |

    bi(k) = P(observe Ok | in hidden state Si)

1.4 Information Flow Diagram

The following diagram illustrates how information flows in an HMM:

    HIDDEN LAYER (Unobservable):      S0    S1    S2    ← Transition Matrix (A) moves among these
                                      │     │     │
                                      ▼     ▼     ▼    ← Emission Matrix (B) controls these
    OBSERVABLE LAYER (What We See):   O0    O1    O2

Summary:

  • Transition Matrix (A): Hidden state → Hidden state (horizontal movement in diagram)
  • Emission Matrix (B): Hidden state → Observation (vertical arrows in diagram)

We can only observe the bottom row (O0, O1, O2, ...). The central challenge of an HMM is to infer the top row (S0, S1, S2, ...) from the observations!

2. The Three Fundamental Problems of HMMs

2.1 Problem 1: Evaluation (Forward Algorithm)

Given: An observation sequence O = (o1, o2, ..., oT) and model λ

Find: P(O|λ) - the probability of the observation sequence given the model

The Forward Algorithm computes this efficiently using dynamic programming:

Initialization: α1(i) = πi · bi(o1)

Recursion: αt+1(j) = [Σi αt(i) · aij] · bj(ot+1)

Termination: P(O|λ) = Σi αT(i)
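The three steps above translate almost line-for-line into code. A minimal sketch in NumPy, where `obs` is a sequence of observation indices and the example model values are hypothetical:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward algorithm: returns P(O|lambda) for a sequence of observation indices."""
    N = A.shape[0]
    T = len(obs)
    alpha = np.zeros((T, N))
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha[0] = pi * B[:, obs[0]]
    # Recursion: alpha_{t+1}(j) = [sum_i alpha_t(i) * a_ij] * b_j(o_{t+1})
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    # Termination: P(O|lambda) = sum_i alpha_T(i)
    return alpha[-1].sum()

# Hypothetical 2-state, 2-observation model
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
print(forward(A, B, pi, [0, 1, 0]))  # ≈ 0.0994
```

Dynamic programming is what makes this efficient: the naive alternative sums over all N^T state paths, while the recursion reuses the α values from the previous step.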

2.2 Problem 2: Decoding (Viterbi Algorithm)

Given: An observation sequence O and model λ

Find: The most likely hidden state sequence Q* = argmaxQ P(Q|O,λ)

The Viterbi Algorithm finds the optimal path through the hidden states:

Initialization: δ1(i) = πi · bi(o1), ψ1(i) = 0

Recursion: δt(j) = maxi [δt-1(i) · aij] · bj(ot)

ψt(j) = argmaxi [δt-1(i) · aij]

Termination: P* = maxi [δT(i)], qT* = argmaxi [δT(i)]

Backtracking: qt* = ψt+1(qt+1*)
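The recursion and backtracking steps can be sketched in NumPy as follows (the example model is hypothetical, chosen so that each state strongly prefers one observation):

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Viterbi algorithm: returns (best path probability, most likely state sequence)."""
    N = A.shape[0]
    T = len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    # Initialization: delta_1(i) = pi_i * b_i(o_1)
    delta[0] = pi * B[:, obs[0]]
    # Recursion: delta_t(j) = max_i [delta_{t-1}(i) * a_ij] * b_j(o_t)
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A        # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)            # best predecessor for each state j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # Termination and backtracking
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return delta[-1].max(), path[::-1]

# Hypothetical model: state 0 strongly emits O0, state 1 strongly emits O1
A = np.array([[0.7, 0.3], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
prob, path = viterbi(A, B, pi, [0, 0, 1, 1])
print(path)  # → [0, 0, 1, 1]
```

Note the structural similarity to the Forward algorithm: the only change is replacing the sum over predecessors with a max, plus the ψ bookkeeping needed to recover the path.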

2.3 Problem 3: Learning (Baum-Welch Algorithm)

Given: An observation sequence O

Find: Model parameters λ = (A, B, π) that maximize P(O|λ)

The Baum-Welch algorithm (a special case of the EM algorithm) iteratively updates the model parameters to maximize the likelihood. It is not demonstrated in this simulation but is crucial for training HMMs from data.

3. How to Construct/Estimate the Matrices

A critical question in HMM applications is: How do we determine the Transition Matrix (A) and Emission Matrix (B)? There are three main approaches:

3.1 Manual Specification (Domain Knowledge)

If you have expert knowledge about the domain, you can manually set the matrix values:

Example: Weather Model

  • You know from meteorological data that sunny days tend to follow sunny days (high self-transition ≈ 0.8)
  • Rainy days are more likely to follow rainy days (≈ 0.6)
  • On sunny days, people are more likely to go for walks; on rainy days, they are more likely to clean the house

Pros: Simple, interpretable, no training data needed
Cons: Subjective, may not reflect reality, requires domain expertise

3.2 Counting from Labeled Data (Supervised Learning)

If you have both observation sequences AND the corresponding true hidden state sequences (labeled data), you can directly count transitions and emissions:

Transition Matrix:
    aij = Count(state i → state j) / Count(state i)
    
Emission Matrix:
    bi(k) = Count(state i emits Ok) / Count(state i)

Initial Distribution:
    πi = Count(sequences starting with state i) / Total sequences

Example Calculation:

Given labeled sequence: S0→S0→S1→S1→S0

    Transition   Count                  Probability
    S0 → S0      1 out of 2 (from S0)   a00 = 0.5
    S0 → S1      1 out of 2 (from S0)   a01 = 0.5
    S1 → S1      1 out of 2 (from S1)   a11 = 0.5
    S1 → S0      1 out of 2 (from S1)   a10 = 0.5

Pros: Simple, directly computable, statistically principled
Cons: Requires labeled data (often expensive or unavailable), small datasets lead to overfitting
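The worked example above can be reproduced directly in code. A sketch that estimates the transition matrix from the labeled sequence S0→S0→S1→S1→S0 by counting (estimating the emission matrix works the same way, counting which observation each labeled state emitted):

```python
import numpy as np

def estimate_transitions(state_seq, n_states):
    """Estimate a_ij = Count(state i -> state j) / Count(state i) from labeled data."""
    counts = np.zeros((n_states, n_states))
    # Count every consecutive pair (i at t, j at t+1) in the sequence
    for i, j in zip(state_seq, state_seq[1:]):
        counts[i, j] += 1
    # Normalize each row by the number of transitions leaving that state
    return counts / counts.sum(axis=1, keepdims=True)

A_hat = estimate_transitions([0, 0, 1, 1, 0], n_states=2)
print(A_hat)  # [[0.5, 0.5], [0.5, 0.5]], matching the table above
```

In practice you would accumulate counts over many labeled sequences before normalizing, and apply smoothing (see the tips below) so that unseen transitions do not get probability zero.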

3.3 Baum-Welch Algorithm (Unsupervised Learning)

When you only have observation sequences (no hidden state labels), the Baum-Welch algorithm can estimate the parameters iteratively:

Algorithm Overview (Expectation-Maximization):

  1. Initialize: Set random (or heuristic) values for A, B, π
  2. E-step: Using current parameters, compute the expected number of transitions and emissions (using Forward-Backward algorithm)
  3. M-step: Update A, B, π to maximize the likelihood given the expected counts
  4. Repeat: Go to step 2 until convergence (likelihood stops improving)

Baum-Welch Iteration Flow:

    Initialize λ(0) → E-step (expected counts) → M-step (update λ) → Converged?
                        ↑________________ No, repeat ________________|

Pros: No labeled data required, finds locally optimal parameters
Cons: May converge to local optima, sensitive to initialization, computationally intensive
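The simulation does not implement Baum-Welch, but the E-step/M-step loop above can be sketched for a single observation sequence. This is a minimal, unscaled version (it will underflow on long sequences; see section 9 on numerical issues), and all example values are illustrative:

```python
import numpy as np

def baum_welch(obs, N, M, n_iter=50, seed=0):
    """Minimal single-sequence Baum-Welch (EM) sketch; unscaled, for short sequences."""
    obs = np.asarray(obs)
    T = len(obs)
    rng = np.random.default_rng(seed)
    # Step 1: random row-stochastic initialization
    A = rng.random((N, N)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((N, M)); B /= B.sum(axis=1, keepdims=True)
    pi = rng.random(N);     pi /= pi.sum()
    for _ in range(n_iter):
        # Step 2 (E-step): forward and backward passes
        alpha = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta = np.zeros((T, N))
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        likelihood = alpha[-1].sum()
        gamma = alpha * beta / likelihood        # gamma[t, i] = P(q_t = S_i | O, lambda)
        xi = np.zeros((T - 1, N, N))             # xi[t, i, j] = P(q_t=S_i, q_{t+1}=S_j | O)
        for t in range(T - 1):
            xi[t] = (alpha[t][:, None] * A
                     * (B[:, obs[t + 1]] * beta[t + 1])[None, :]) / likelihood
        # Step 3 (M-step): re-estimate parameters from expected counts
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        for k in range(M):
            B[:, k] = gamma[obs == k].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
    return A, B, pi

# Usage: fit a 2-state, 2-observation model to a short sequence
A_hat, B_hat, pi_hat = baum_welch([0, 0, 1, 1, 0, 0, 1, 1], N=2, M=2)
```

The γ and ξ quantities are exactly the "expected counts" from the E-step description: summing them over time and normalizing gives the updated A, B, and π.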

3.4 Comparison of Methods

    Method       Data Required                    Best Use Case                      Example
    Manual       None (domain knowledge)          Well-understood domains            Simple weather models
    Counting     Labeled (states + observations)  When labels are available          Part-of-speech tagging with an annotated corpus
    Baum-Welch   Unlabeled (observations only)    When only observations available   Speech recognition, gene finding

Practical Tips:

  • Multiple Runs: Run Baum-Welch multiple times with different initializations and keep the best result
  • Smoothing: Add small constants to counts to avoid zero probabilities (Laplace smoothing)
  • Validation: Use held-out data to check if the learned model generalizes
  • Hybrid: Combine manual knowledge with data-driven estimates

4. Key Differences: HMM vs Markov Chain

    Aspect             Markov Chain                      Hidden Markov Model
    State visibility   States are directly observable    States are hidden; only emissions are observed
    Output             State sequence                    Observation sequence (emissions)
    Parameters         Transition matrix A, initial π    Transition A, emission B, initial π
    Inference          Direct observation                Requires algorithms (Forward, Viterbi)
    Applications       Simple sequential processes       Speech recognition, bioinformatics, NLP

5. Applications of HMMs

  • Speech Recognition: Hidden states represent phonemes; observations are acoustic features
  • Part-of-Speech Tagging: Hidden states are POS tags; observations are words
  • Gene Finding: Hidden states indicate gene regions; observations are nucleotides
  • Financial Modeling: Hidden states represent market regimes; observations are price movements
  • Gesture Recognition: Hidden states model gesture phases; observations are sensor readings
[Interactive simulation panels: hidden state diagram (S0, S1, S2; the Transition Matrix (A) controls the arrows between nodes), initial distribution (π), transition matrix (A, row = from, col = to), emission diagram and emission matrix (B = P(observation | hidden state)), observation sequence ("what we see"), true hidden state sequence, current state and step counters, hidden state and observation visit distributions, Viterbi decoding of the most likely hidden state sequence, forward probability P(O|λ), and hidden state visit trajectory.]

6. Understanding the Simulation

6.1 The State Diagram

The circular nodes in the diagram represent hidden states. Unlike a regular Markov Chain where you can observe the state directly, in an HMM these states are not visible to an observer—hence "hidden". The arrows show transition probabilities between states.

Visual Elements:

  • Colored Circles: Hidden states (S0, S1, S2, ...)
  • Curved Arrows: Transition probabilities (thickness indicates probability)
  • Self-loops: Probability of staying in the same state
  • Labels: Probability values on each transition
  • Glow: Indicates the current hidden state
  • Yellow Packet: Animation showing state transition

6.2 The Three Matrices

Initial Distribution (π): The probability of starting in each hidden state. When you click "Reset", the system randomly selects an initial hidden state based on this distribution.

Transition Matrix (A) — Hidden → Hidden:

Each row shows the probability of transitioning from the current hidden state to any other hidden state.
Row i, Column j gives P(next hidden state = Sj | current hidden state = Si)
This matrix controls the arrows between circular nodes in the diagram.

Emission Matrix (B) — Hidden → Observed:

Each row shows the probability of emitting (observing) each observation given the current hidden state.
Row i, Column k gives P(observe Ok | in hidden state Si)
This matrix determines what observation appears in the "Observation Sequence" panel.

Important Distinction: In the simulation, S0, S1, S2 are HIDDEN states (shown in the state diagram but NOT directly observable in a real scenario). O0, O1, O2 are OBSERVATIONS (shown in the "Observation Sequence" panel — this is all a real observer would see).

6.3 The HMM Process

Each simulation step follows this process:

  1. Transition (Hidden → Hidden):
    From the current hidden state Si, use the Transition Matrix A (row i) to randomly select the next hidden state.
    Example: If in S0, look at row 0 of A to determine probabilities of moving to S0, S1, or S2
  2. Emission (Hidden → Observed):
    From the new hidden state Sj, use the Emission Matrix B (row j) to randomly emit an observation.
    Example: If now in S1, look at row 1 of B to determine probabilities of emitting O0, O1, or O2
  3. Record:
    The emitted observation is added to the Observation Sequence — this is what a real observer would actually see!
    The hidden state sequence is shown for learning purposes only; in reality, it would be unknown.
One Step Visualization:

    Si (current, hidden) —A→ Sj (next, hidden) —B→ Ok (observed!)
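The transition-then-emission step can be sketched in a few lines. The sampling helper below is a plausible reading of what the simulation does each step (the model values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng()

def hmm_step(state, A, B, rng=rng):
    """One HMM step: sample the next hidden state from A, then an observation from B."""
    next_state = rng.choice(A.shape[1], p=A[state])        # 1. Transition: row `state` of A
    observation = rng.choice(B.shape[1], p=B[next_state])  # 2. Emission: row `next_state` of B
    return next_state, observation

# Hypothetical 2-state, 3-observation model
A = np.array([[0.8, 0.2], [0.4, 0.6]])
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.4, 0.5]])
state, obs_seq = 0, []
for _ in range(5):
    state, o = hmm_step(state, A, B)
    obs_seq.append(o)  # 3. Record: only obs_seq would be visible to a real observer
print(obs_seq)
```

The hidden `state` variable is kept only for the loop itself; an observer of the output sees nothing but `obs_seq`, which is exactly the situation the Viterbi and Forward algorithms are designed for.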

6.4 The Viterbi Display

The Viterbi panel shows the δ (delta) values computed by the Viterbi algorithm at each time step. The highlighted cells indicate the most likely path through the hidden states. This is the algorithm's best guess of what the hidden states were, given only the observation sequence.

Interpretation: The "Most Likely Path" shown at the bottom is the Viterbi algorithm's reconstruction of the hidden state sequence. Compare it with the "True Hidden State Sequence" to see how well the algorithm performs.

6.5 The Forward Probability

The Forward Algorithm computes P(O|λ), the probability of observing the entire observation sequence given the model. This value typically becomes very small as the sequence grows (exponentially small), which is why it's displayed in scientific notation.

7. Usage Guide

7.1 Controls

    Control         Description
    Preset          Load predefined HMM configurations (Weather, Casino, Speech models)
    States          Change the number of hidden states (2-4)
    Observations    Change the number of possible observations (2-6)
    Step ▶          Execute one step: transition to a new state and emit an observation
    Run/Stop        Toggle continuous automatic simulation
    Reset           Clear all sequences and restart from the initial state
    Speed           Adjust the speed of automatic simulation
    Matrix Inputs   Edit probability values (rows should sum to 1.0)
    Normalize       Automatically normalize matrix rows to sum to 1.0

7.2 Presets Explained

  • Default: A generic 3-state, 3-observation model for exploration
  • Weather Model: Classic HMM example with hidden weather (Sunny/Rainy) and observable activities (Walk/Shop/Clean)
  • Casino Dice: A casino switching between a fair die and a loaded die; observations are die rolls (1-6)
  • Speech Model: Simplified phoneme model with Vowel/Consonant/Silent states and letter observations

8. Experiments to Try

8.1 Observe Viterbi Accuracy

Run the simulation and compare the "True Hidden State Sequence" with the "Most Likely Path" from Viterbi. Notice how the algorithm often correctly reconstructs the hidden states, but sometimes makes errors, especially when emission probabilities are similar across states.

8.2 Strong vs Weak Emissions

Modify the Emission Matrix to make observations more distinctive (e.g., state 0 only emits O0, state 1 only emits O1). Observe how the Viterbi algorithm becomes nearly perfect when emissions are unambiguous.

8.3 Forward Probability Decay

Watch the Forward Probability P(O|λ) as the observation sequence grows. Notice how it decreases exponentially. This is why HMMs often use log probabilities in practice.

8.4 Casino Detection

Load the Casino preset. The loaded die favors rolling 6. Watch how the Viterbi algorithm tries to detect when the casino is using the loaded die based on the sequence of rolls.

9. Mathematical Details

9.1 Complexity Analysis

  • Forward Algorithm: O(N²T) time, O(NT) space
  • Viterbi Algorithm: O(N²T) time, O(NT) space
  • Baum-Welch: O(N²T) per iteration

Where N = number of states, T = sequence length

9.2 Numerical Considerations

In practice, HMM implementations use log probabilities to avoid numerical underflow. The simulation uses raw probabilities for clarity but may show very small values for long sequences.
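The log-space version of the Forward algorithm replaces products with sums of logarithms and uses the log-sum-exp operation where the original recursion sums probabilities. A sketch (the model values are hypothetical):

```python
import numpy as np

def forward_log(A, B, pi, obs):
    """Forward algorithm in log space: returns log P(O|lambda), safe for long sequences."""
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    log_alpha = logpi + logB[:, obs[0]]
    for o in obs[1:]:
        # log alpha_{t+1}(j) = logsumexp_i(log alpha_t(i) + log a_ij) + log b_j(o)
        log_alpha = np.logaddexp.reduce(log_alpha[:, None] + logA, axis=0) + logB[:, o]
    return np.logaddexp.reduce(log_alpha)

# A 1000-step sequence would underflow to 0.0 in raw probabilities, but not in log space
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
print(forward_log(A, B, pi, [0, 1] * 500))  # a finite negative number
```

For Viterbi the trick is even simpler: since the recursion only multiplies and takes a max, working with summed log probabilities requires no log-sum-exp at all.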

9.3 Scaling

The Forward algorithm can use scaling factors ct = 1/Σi αt(i) to normalize the α values at each step, preventing underflow while still allowing computation of P(O|λ) = Πt (1/ct).

10. Limitations and Extensions

Current Simulation Limitations:

  • Does not implement the Baum-Welch learning algorithm
  • Limited to small state/observation spaces for visualization
  • Does not show the Backward algorithm
  • Uses raw probabilities (no log-space computation)

Extensions to Explore:

  • Continuous HMMs: Emissions follow continuous distributions (e.g., Gaussian)
  • Higher-order HMMs: Transitions depend on multiple previous states
  • Factorial HMMs: Multiple interacting hidden state chains
  • Input-Output HMMs: Transitions depend on external inputs