Hidden Markov Model (HMM) - Interactive Tutorial

A Hidden Markov Model (HMM) is a statistical model that extends the basic Markov Chain by introducing the concept of hidden states and observable emissions. While in a standard Markov Chain we can directly observe the state of the system, in an HMM the true state is hidden from us—we can only observe outputs (emissions) that depend probabilistically on the hidden state.

1. Mathematical Foundation

1.1 Components of an HMM

An HMM is formally specified by the following components; the model parameters are conventionally written as the tuple λ = (A, B, π):

  • S = {S0, S1, ..., SN-1} - Set of N hidden states
  • O = {O0, O1, ..., OM-1} - Set of M possible observations (emissions)
  • A = {aij} - State transition probability matrix, where aij = P(Sj at t+1 | Si at t)
  • B = {bi(k)} - Emission probability matrix, where bi(k) = P(Ok | Si)
  • π = {πi} - Initial state distribution, where πi = P(Si at t=0)
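To make the components concrete, they can be written down directly in code. The sketch below uses NumPy and a hypothetical two-state weather model (the state and observation names, and all probability values, are illustrative assumptions, not part of the formal definition):

```python
import numpy as np

# Hidden states: S0 = Sunny, S1 = Rainy (hypothetical weather model)
states = ["Sunny", "Rainy"]                # S, so N = 2
observations = ["Walk", "Shop", "Clean"]   # O, so M = 3

# A[i][j] = P(state j at t+1 | state i at t); each row sums to 1
A = np.array([[0.8, 0.2],
              [0.4, 0.6]])

# B[i][k] = P(observation k | state i); each row sums to 1
B = np.array([[0.6, 0.3, 0.1],
              [0.1, 0.4, 0.5]])

# pi[i] = P(state i at t=0)
pi = np.array([0.7, 0.3])

# Sanity checks: every row must be a valid probability distribution
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
assert np.isclose(pi.sum(), 1.0)
```

The row-sum assertions mirror the constraints Σj aij = 1 and Σk bi(k) = 1 stated below.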

1.2 The Transition Matrix (A) — Hidden → Hidden

The transition matrix A describes the probability of moving from one hidden state to another hidden state:

Key Point: The Transition Matrix operates entirely within the hidden layer. It determines how the system moves between hidden states that we cannot directly observe.

    A = | a0,0      a0,1      ...   a0,N-1    |     where Σj aij = 1 for all i
        | a1,0      a1,1      ...   a1,N-1    |
        | ...       ...       ...   ...       |
        | aN-1,0    aN-1,1    ...   aN-1,N-1  |

    aij = P(hidden state j at t+1 | hidden state i at t)

1.3 The Emission Matrix (B) — Hidden → Observed

The emission matrix B describes the probability of observing a particular output given the current hidden state:

Key Point: The Emission Matrix is the bridge between the hidden layer and the observable world. It determines what we actually see based on which hidden state the system is in.

    B = | b0(O0)      b0(O1)      ...   b0(OM-1)    |     where Σk bi(k) = 1 for all i
        | b1(O0)      b1(O1)      ...   b1(OM-1)    |
        | ...         ...         ...   ...         |
        | bN-1(O0)    bN-1(O1)    ...   bN-1(OM-1)  |

    bi(k) = P(observe Ok | in hidden state Si)

1.4 Information Flow Diagram

The following diagram illustrates how information flows in an HMM:

    HIDDEN LAYER (Unobservable):      S0    S1    S2    ← Transition Matrix (A) moves among these
                                      │     │     │
                                      ▼     ▼     ▼    ← Emission Matrix (B) controls these
    OBSERVABLE LAYER (What We See):   O0    O1    O2

Summary:

  • Transition Matrix (A): Hidden state → Hidden state (horizontal movement in diagram)
  • Emission Matrix (B): Hidden state → Observation (vertical arrows in diagram)

We can only observe the bottom row (O0, O1, O2, ...). The central challenge of an HMM is to infer the top row (S0, S1, S2, ...) from the observations!

2. The Three Fundamental Problems of HMMs

2.1 Problem 1: Evaluation (Forward Algorithm)

Given: An observation sequence O = (o1, o2, ..., oT) and model λ

Find: P(O|λ) - the probability of the observation sequence given the model

The Forward Algorithm computes this efficiently using dynamic programming:

Initialization: α1(i) = πi · bi(o1)

Recursion: αt+1(j) = [Σi αt(i) · aij] · bj(ot+1)

Termination: P(O|λ) = Σi αT(i)
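The three steps above translate almost line-for-line into code. A minimal sketch in NumPy, where `obs` is a sequence of observation indices and the example model values are hypothetical:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward algorithm: returns P(O|lambda) for a sequence of observation indices."""
    N = A.shape[0]
    T = len(obs)
    alpha = np.zeros((T, N))
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha[0] = pi * B[:, obs[0]]
    # Recursion: alpha_{t+1}(j) = [sum_i alpha_t(i) * a_ij] * b_j(o_{t+1})
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    # Termination: P(O|lambda) = sum_i alpha_T(i)
    return alpha[-1].sum()

# Hypothetical 2-state, 2-observation model
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
print(forward(A, B, pi, [0, 1, 0]))  # ≈ 0.0994
```

Dynamic programming is what makes this efficient: the naive alternative sums over all N^T state paths, while the recursion reuses the α values from the previous step.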

2.2 Problem 2: Decoding (Viterbi Algorithm)

Given: An observation sequence O and model λ

Find: The most likely hidden state sequence Q* = argmaxQ P(Q|O,λ)

The Viterbi Algorithm finds the optimal path through the hidden states:

Initialization: δ1(i) = πi · bi(o1), ψ1(i) = 0

Recursion: δt(j) = maxi [δt-1(i) · aij] · bj(ot)

ψt(j) = argmaxi [δt-1(i) · aij]

Termination: P* = maxi [δT(i)], qT* = argmaxi [δT(i)]

Backtracking: qt* = ψt+1(qt+1*)
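The recursion and backtracking steps can be sketched in NumPy as follows (the example model is hypothetical, chosen so that each state strongly prefers one observation):

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Viterbi algorithm: returns (best path probability, most likely state sequence)."""
    N = A.shape[0]
    T = len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    # Initialization: delta_1(i) = pi_i * b_i(o_1)
    delta[0] = pi * B[:, obs[0]]
    # Recursion: delta_t(j) = max_i [delta_{t-1}(i) * a_ij] * b_j(o_t)
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A        # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)            # best predecessor for each state j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # Termination and backtracking
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return delta[-1].max(), path[::-1]

# Hypothetical model: state 0 strongly emits O0, state 1 strongly emits O1
A = np.array([[0.7, 0.3], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
prob, path = viterbi(A, B, pi, [0, 0, 1, 1])
print(path)  # → [0, 0, 1, 1]
```

Note the structural similarity to the Forward algorithm: the only change is replacing the sum over predecessors with a max, plus the ψ bookkeeping needed to recover the path.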

2.3 Problem 3: Learning (Baum-Welch Algorithm)

Given: An observation sequence O

Find: Model parameters λ = (A, B, π) that maximize P(O|λ)

The Baum-Welch algorithm (a special case of the EM algorithm) iteratively updates the model parameters to maximize the likelihood. It is not demonstrated in this simulation but is crucial for training HMMs from data.

3. How to Construct/Estimate the Matrices

A critical question in HMM applications is: How do we determine the Transition Matrix (A) and Emission Matrix (B)? There are three main approaches:

3.1 Manual Specification (Domain Knowledge)

If you have expert knowledge about the domain, you can manually set the matrix values:

Example: Weather Model

  • You know from meteorological data that sunny days tend to follow sunny days (high self-transition ≈ 0.8)
  • Rainy days are more likely to follow rainy days (≈ 0.6)
  • On sunny days, people are more likely to go for walks; on rainy days, they are more likely to clean the house

Pros: Simple, interpretable, no training data needed
Cons: Subjective, may not reflect reality, requires domain expertise

3.2 Counting from Labeled Data (Supervised Learning)

If you have both observation sequences AND the corresponding true hidden state sequences (labeled data), you can directly count transitions and emissions:

Transition Matrix:
    aij = Count(state i → state j) / Count(state i)
    
Emission Matrix:
    bi(k) = Count(state i emits Ok) / Count(state i)

Initial Distribution:
    πi = Count(sequences starting with state i) / Total sequences

Example Calculation:

Given labeled sequence: S0→S0→S1→S1→S0

    Transition   Count                  Probability
    S0 → S0      1 out of 2 (from S0)   a00 = 0.5
    S0 → S1      1 out of 2 (from S0)   a01 = 0.5
    S1 → S1      1 out of 2 (from S1)   a11 = 0.5
    S1 → S0      1 out of 2 (from S1)   a10 = 0.5

Pros: Simple, directly computable, statistically principled
Cons: Requires labeled data (often expensive or unavailable), small datasets lead to overfitting
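The worked example above can be reproduced directly in code. A sketch that estimates the transition matrix from the labeled sequence S0→S0→S1→S1→S0 by counting (estimating the emission matrix works the same way, counting which observation each labeled state emitted):

```python
import numpy as np

def estimate_transitions(state_seq, n_states):
    """Estimate a_ij = Count(state i -> state j) / Count(state i) from labeled data."""
    counts = np.zeros((n_states, n_states))
    # Count every consecutive pair (i at t, j at t+1) in the sequence
    for i, j in zip(state_seq, state_seq[1:]):
        counts[i, j] += 1
    # Normalize each row by the number of transitions leaving that state
    return counts / counts.sum(axis=1, keepdims=True)

A_hat = estimate_transitions([0, 0, 1, 1, 0], n_states=2)
print(A_hat)  # [[0.5, 0.5], [0.5, 0.5]], matching the table above
```

In practice you would accumulate counts over many labeled sequences before normalizing, and apply smoothing (see the tips below) so that unseen transitions do not get probability zero.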

3.3 Baum-Welch Algorithm (Unsupervised Learning)

When you only have observation sequences (no hidden state labels), the Baum-Welch algorithm can estimate the parameters iteratively:

Algorithm Overview (Expectation-Maximization):

  1. Initialize: Set random (or heuristic) values for A, B, π
  2. E-step: Using current parameters, compute the expected number of transitions and emissions (using Forward-Backward algorithm)
  3. M-step: Update A, B, π to maximize the likelihood given the expected counts
  4. Repeat: Go to step 2 until convergence (likelihood stops improving)

Baum-Welch Iteration Flow:

    Initialize λ(0) → E-step (expected counts) → M-step (update λ) → Converged?
                        ↑________________ No, repeat ________________|

Pros: No labeled data required, finds locally optimal parameters
Cons: May converge to local optima, sensitive to initialization, computationally intensive
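The simulation does not implement Baum-Welch, but the E-step/M-step loop above can be sketched for a single observation sequence. This is a minimal, unscaled version (it will underflow on long sequences; see section 9 on numerical issues), and all example values are illustrative:

```python
import numpy as np

def baum_welch(obs, N, M, n_iter=50, seed=0):
    """Minimal single-sequence Baum-Welch (EM) sketch; unscaled, for short sequences."""
    obs = np.asarray(obs)
    T = len(obs)
    rng = np.random.default_rng(seed)
    # Step 1: random row-stochastic initialization
    A = rng.random((N, N)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((N, M)); B /= B.sum(axis=1, keepdims=True)
    pi = rng.random(N);     pi /= pi.sum()
    for _ in range(n_iter):
        # Step 2 (E-step): forward and backward passes
        alpha = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta = np.zeros((T, N))
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        likelihood = alpha[-1].sum()
        gamma = alpha * beta / likelihood        # gamma[t, i] = P(q_t = S_i | O, lambda)
        xi = np.zeros((T - 1, N, N))             # xi[t, i, j] = P(q_t=S_i, q_{t+1}=S_j | O)
        for t in range(T - 1):
            xi[t] = (alpha[t][:, None] * A
                     * (B[:, obs[t + 1]] * beta[t + 1])[None, :]) / likelihood
        # Step 3 (M-step): re-estimate parameters from expected counts
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        for k in range(M):
            B[:, k] = gamma[obs == k].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
    return A, B, pi

# Usage: fit a 2-state, 2-observation model to a short sequence
A_hat, B_hat, pi_hat = baum_welch([0, 0, 1, 1, 0, 0, 1, 1], N=2, M=2)
```

The γ and ξ quantities are exactly the "expected counts" from the E-step description: summing them over time and normalizing gives the updated A, B, and π.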

3.4 Comparison of Methods

    Method       Data Required                    Best Use Case                      Example
    Manual       None (domain knowledge)          Well-understood domains            Simple weather models
    Counting     Labeled (states + observations)  When labels are available          Part-of-speech tagging with an annotated corpus
    Baum-Welch   Unlabeled (observations only)    When only observations available   Speech recognition, gene finding

Practical Tips:

  • Multiple Runs: Run Baum-Welch multiple times with different initializations and keep the best result
  • Smoothing: Add small constants to counts to avoid zero probabilities (Laplace smoothing)
  • Validation: Use held-out data to check if the learned model generalizes
  • Hybrid: Combine manual knowledge with data-driven estimates

4. Key Differences: HMM vs Markov Chain

    Aspect             Markov Chain                      Hidden Markov Model
    State visibility   States are directly observable    States are hidden; only emissions are observed
    Output             State sequence                    Observation sequence (emissions)
    Parameters         Transition matrix A, initial π    Transition A, emission B, initial π
    Inference          Direct observation                Requires algorithms (Forward, Viterbi)
    Applications       Simple sequential processes       Speech recognition, bioinformatics, NLP

5. Applications of HMMs

  • Speech Recognition: Hidden states represent phonemes; observations are acoustic features
  • Part-of-Speech Tagging: Hidden states are POS tags; observations are words
  • Gene Finding: Hidden states indicate gene regions; observations are nucleotides
  • Financial Modeling: Hidden states represent market regimes; observations are price movements
  • Gesture Recognition: Hidden states model gesture phases; observations are sensor readings
[Interactive simulation panels: hidden state diagram (S0, S1, S2; the Transition Matrix (A) controls the arrows between nodes), initial distribution (π), transition matrix (A, row = from, col = to), emission diagram and emission matrix (B = P(observation | hidden state)), observation sequence ("what we see"), true hidden state sequence, current state and step counters, hidden state and observation visit distributions, Viterbi decoding of the most likely hidden state sequence, forward probability P(O|λ), and hidden state visit trajectory.]

6. Understanding the Simulation

6.1 The State Diagram

The circular nodes in the diagram represent hidden states. Unlike a regular Markov Chain where you can observe the state directly, in an HMM these states are not visible to an observer—hence "hidden". The arrows show transition probabilities between states.

Visual Elements:

  • Colored Circles: Hidden states (S0, S1, S2, ...)
  • Curved Arrows: Transition probabilities (thickness indicates probability)
  • Self-loops: Probability of staying in the same state
  • Labels: Probability values on each transition
  • Glow: Indicates the current hidden state
  • Yellow Packet: Animation showing state transition

6.2 The Three Matrices

Initial Distribution (π): The probability of starting in each hidden state. When you click "Reset", the system randomly selects an initial hidden state based on this distribution.

Transition Matrix (A) — Hidden → Hidden:

Each row shows the probability of transitioning from the current hidden state to any other hidden state.
Row i, Column j gives P(next hidden state = Sj | current hidden state = Si)
This matrix controls the arrows between circular nodes in the diagram.

Emission Matrix (B) — Hidden → Observed:

Each row shows the probability of emitting (observing) each observation given the current hidden state.
Row i, Column k gives P(observe Ok | in hidden state Si)
This matrix determines what observation appears in the "Observation Sequence" panel.

Important Distinction: In the simulation, S0, S1, S2 are HIDDEN states (shown in the state diagram but NOT directly observable in a real scenario). O0, O1, O2 are OBSERVATIONS (shown in the "Observation Sequence" panel — this is all a real observer would see).

6.3 The HMM Process

Each simulation step follows this process:

  1. Transition (Hidden → Hidden):
    From the current hidden state Si, use the Transition Matrix A (row i) to randomly select the next hidden state.
    Example: If in S0, look at row 0 of A to determine probabilities of moving to S0, S1, or S2
  2. Emission (Hidden → Observed):
    From the new hidden state Sj, use the Emission Matrix B (row j) to randomly emit an observation.
    Example: If now in S1, look at row 1 of B to determine probabilities of emitting O0, O1, or O2
  3. Record:
    The emitted observation is added to the Observation Sequence — this is what a real observer would actually see!
    The hidden state sequence is shown for learning purposes only; in reality, it would be unknown.
One Step Visualization:

    Si (current, hidden) —A→ Sj (next, hidden) —B→ Ok (observed!)
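The transition-then-emission step can be sketched in a few lines. The sampling helper below is a plausible reading of what the simulation does each step (the model values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng()

def hmm_step(state, A, B, rng=rng):
    """One HMM step: sample the next hidden state from A, then an observation from B."""
    next_state = rng.choice(A.shape[1], p=A[state])        # 1. Transition: row `state` of A
    observation = rng.choice(B.shape[1], p=B[next_state])  # 2. Emission: row `next_state` of B
    return next_state, observation

# Hypothetical 2-state, 3-observation model
A = np.array([[0.8, 0.2], [0.4, 0.6]])
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.4, 0.5]])
state, obs_seq = 0, []
for _ in range(5):
    state, o = hmm_step(state, A, B)
    obs_seq.append(o)  # 3. Record: only obs_seq would be visible to a real observer
print(obs_seq)
```

The hidden `state` variable is kept only for the loop itself; an observer of the output sees nothing but `obs_seq`, which is exactly the situation the Viterbi and Forward algorithms are designed for.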

6.4 The Viterbi Display

The Viterbi panel shows the δ (delta) values computed by the Viterbi algorithm at each time step. The highlighted cells indicate the most likely path through the hidden states. This is the algorithm's best guess of what the hidden states were, given only the observation sequence.

Interpretation: The "Most Likely Path" shown at the bottom is the Viterbi algorithm's reconstruction of the hidden state sequence. Compare it with the "True Hidden State Sequence" to see how well the algorithm performs.

6.5 The Forward Probability

The Forward Algorithm computes P(O|λ), the probability of observing the entire observation sequence given the model. This value typically becomes very small as the sequence grows (exponentially small), which is why it's displayed in scientific notation.

7. Usage Guide

7.1 Controls

    Control         Description
    Preset          Load predefined HMM configurations (Weather, Casino, Speech models)
    States          Change the number of hidden states (2-4)
    Observations    Change the number of possible observations (2-6)
    Step ▶          Execute one step: transition to a new state and emit an observation
    Run/Stop        Toggle continuous automatic simulation
    Reset           Clear all sequences and restart from the initial state
    Speed           Adjust the speed of automatic simulation
    Matrix Inputs   Edit probability values (rows should sum to 1.0)
    Normalize       Automatically normalize matrix rows to sum to 1.0

7.2 Presets Explained

  • Default: A generic 3-state, 3-observation model for exploration
  • Weather Model: Classic HMM example with hidden weather (Sunny/Rainy) and observable activities (Walk/Shop/Clean)
  • Casino Dice: A casino switching between a fair die and a loaded die; observations are die rolls (1-6)
  • Speech Model: Simplified phoneme model with Vowel/Consonant/Silent states and letter observations

8. Experiments to Try

8.1 Observe Viterbi Accuracy

Run the simulation and compare the "True Hidden State Sequence" with the "Most Likely Path" from Viterbi. Notice how the algorithm often correctly reconstructs the hidden states, but sometimes makes errors, especially when emission probabilities are similar across states.

8.2 Strong vs Weak Emissions

Modify the Emission Matrix to make observations more distinctive (e.g., state 0 only emits O0, state 1 only emits O1). Observe how the Viterbi algorithm becomes nearly perfect when emissions are unambiguous.

8.3 Forward Probability Decay

Watch the Forward Probability P(O|λ) as the observation sequence grows. Notice how it decreases exponentially. This is why HMMs often use log probabilities in practice.

8.4 Casino Detection

Load the Casino preset. The loaded die favors rolling 6. Watch how the Viterbi algorithm tries to detect when the casino is using the loaded die based on the sequence of rolls.

9. Mathematical Details

9.1 Complexity Analysis

  • Forward Algorithm: O(N²T) time, O(NT) space
  • Viterbi Algorithm: O(N²T) time, O(NT) space
  • Baum-Welch: O(N²T) per iteration

Where N = number of states, T = sequence length

9.2 Numerical Considerations

In practice, HMM implementations use log probabilities to avoid numerical underflow. The simulation uses raw probabilities for clarity but may show very small values for long sequences.
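The log-space version of the Forward algorithm replaces products with sums of logarithms and uses the log-sum-exp operation where the original recursion sums probabilities. A sketch (the model values are hypothetical):

```python
import numpy as np

def forward_log(A, B, pi, obs):
    """Forward algorithm in log space: returns log P(O|lambda), safe for long sequences."""
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    log_alpha = logpi + logB[:, obs[0]]
    for o in obs[1:]:
        # log alpha_{t+1}(j) = logsumexp_i(log alpha_t(i) + log a_ij) + log b_j(o)
        log_alpha = np.logaddexp.reduce(log_alpha[:, None] + logA, axis=0) + logB[:, o]
    return np.logaddexp.reduce(log_alpha)

# A 1000-step sequence would underflow to 0.0 in raw probabilities, but not in log space
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
print(forward_log(A, B, pi, [0, 1] * 500))  # a finite negative number
```

For Viterbi the trick is even simpler: since the recursion only multiplies and takes a max, working with summed log probabilities requires no log-sum-exp at all.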

9.3 Scaling

The Forward algorithm can use scaling factors ct = 1/Σi αt(i) to normalize the α values at each step, preventing underflow while still allowing computation of P(O|λ) = Πt (1/ct).

10. Limitations and Extensions

Current Simulation Limitations:

  • Does not implement the Baum-Welch learning algorithm
  • Limited to small state/observation spaces for visualization
  • Does not show the Backward algorithm
  • Uses raw probabilities (no log-space computation)

Extensions to Explore:

  • Continuous HMMs: Emissions follow continuous distributions (e.g., Gaussian)
  • Higher-order HMMs: Transitions depend on multiple previous states
  • Factorial HMMs: Multiple interacting hidden state chains
  • Input-Output HMMs: Transitions depend on external inputs