CS621 – Lecture 7
Hidden Markov Models
Areas of AI and their interdependencies
(Diagram: Search, Vision, Planning, Machine Learning, Knowledge Representation, Logic, Expert Systems, Robotics, NLP, and their interdependencies.)
Slide from Prof. Bhattacharyya's lectures
So far, in the course CS621 ...
● Emergence of Artificial Intelligence
● Search – informed and uninformed
● Predicate Calculus and inference using resolution, forward chaining, backward chaining
Machine learning
● Learning from data – generating hypotheses from data
● Inductive, rather than deductive
● Consider the following example:
– If I have a biased coin with p(h) = 0.6, what is the likelihood of the sequence HHTHT? Assume each toss is independent.
– Now, turn the question around: if I toss a coin and get the above sequence, what can be said about the coin?
● If I ignore any prior knowledge about the coin, I would estimate p(h) as #H/total tosses, and likewise p(t) as #T/total tosses. These are my "estimators" of the parameters (p(h) and p(t)) of the "model". These estimators produce maximum likelihood estimates (MLE) of the parameters – the name derives from the fact that the estimates so obtained maximize the likelihood of the data when plugged into the model.
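As a quick illustration (a sketch, not from the lecture; names are illustrative), both quantities can be computed directly: the likelihood of HHTHT under p(h) = 0.6, and the MLE of p(h) from the counts in that same sequence.

```python
# Likelihood of an independent coin-toss sequence, and the MLE of p(h) from counts.
def sequence_likelihood(seq, p_h):
    """P(seq) when tosses are independent and P(H) = p_h."""
    prob = 1.0
    for toss in seq:
        prob *= p_h if toss == "H" else (1.0 - p_h)
    return prob

seq = "HHTHT"
print(sequence_likelihood(seq, 0.6))  # 0.6 * 0.6 * 0.4 * 0.6 * 0.4 = 0.03456
print(seq.count("H") / len(seq))      # MLE of p(h) = #H / total tosses = 0.6
```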
● Now consider the following case:
– I have two biased coins: one always produces H and the other always produces T. Let me call
them c1 and c2, respectively.
– I toss n times with each toss resulting from either c1 or c2. If the probability of my choosing c1
immediately after having tossed c2 is 0.6 and that of choosing c2 after c1 is 0.1, what is the
likelihood of the sequence HTTHHT?
– Could you tell me the sequence of coins that would have resulted in this observation sequence? Easy – every H must come from c1 and every T from c2.
● State C1: coin c1 is tossed (only produces H)
● State C2: coin c2 is tossed (only produces T)
(State diagram: self-loops P(C1|C1) = 0.9 and P(C2|C2) = 0.4; cross transitions P(C2|C1) = 0.1 and P(C1|C2) = 0.6.)
If the starting state is C1,
P(HHHTTHTH|Model) = 1 × 0.9 × 0.9 × 0.1 × 0.4 × 0.6 × 0.1 × 0.6
(the first H is certain because C1 always produces H; each remaining factor is the transition probability between consecutive states in the state sequence C1 C1 C1 C2 C2 C1 C2 C1).
Markov chains
● In general,
– Set of states S={S1...SN}: Process moves from one state to another
– Matrix of transition probabilities A=(aij); aij=P(Sj|Si): It defines (probabilistically) how the process moves among states
– Vector of initial probabilities p=(pi); pi=P(Si): It defines (probabilistically) the initial state
Slide borrowed from Javier Garcia-Frias's tutorial
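As a minimal sketch (not from the slides; variable names are illustrative), the probability of a state sequence in a Markov chain is the initial probability of the first state multiplied by the successive transition probabilities. The two-coin chain above is used as the example data.

```python
# Probability of a state sequence under a Markov chain: p(s1) * prod_t a(s_{t-1}, s_t).
def chain_probability(state_seq, p, A):
    prob = p[state_seq[0]]                      # initial probability
    for prev, cur in zip(state_seq, state_seq[1:]):
        prob *= A[prev][cur]                    # transition probability
    return prob

# Two-coin example: C1 always emits H, C2 always emits T, and we start in C1.
p = {"C1": 1.0, "C2": 0.0}
A = {"C1": {"C1": 0.9, "C2": 0.1}, "C2": {"C1": 0.6, "C2": 0.4}}
# The observation HHHTTHTH corresponds to the state sequence C1 C1 C1 C2 C2 C1 C2 C1.
print(chain_probability(["C1", "C1", "C1", "C2", "C2", "C1", "C2", "C1"], p, A))
```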
Modeling more complicated sequences: Hidden Markov Models
● Consider: I have two biased coins
– Coin 1 produces H with probability 0.4
– Coin 2 produces H with probability 0.6
– P(C1|C1) = .9
– P(C2|C2) = .6
● What is the likelihood of observing the sequence HHHTTHT?
● In other words, how can we calculate P(O|M)?
● State S1: C1 is tossed
– H is produced with probability P(H|S1)
– T is produced with probability 1 - P(H|S1)
● State S2: C2 is tossed
– H is produced with probability P(H|S2)
– T is produced with probability 1 - P(H|S2)
● Assuming the starting state as S1, how can we calculate P(HHHTTHT|model)?
● Note: unlike the Markov chain case, the state sequence is not the same as the observation sequence (hence the name "hidden" Markov model)
● P(H|M) = P(H|S1)
● P(HH|M) = P(H|S1) × 0.9 × P(H|S1) + P(H|S1) × 0.1 × P(H|S2)
● P(HHT|M) = ?
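For short sequences, P(O|M) can be computed by brute force: sum the joint probability of the observations and every possible hidden state sequence. Below is a minimal sketch (not from the lecture; names are illustrative) for the two-coin HMM above, with the starting state fixed to S1; it answers P(HHT|M) directly.

```python
# Brute-force P(O|M): sum over all hidden state sequences of the two-coin HMM.
from itertools import product

p_h = {"S1": 0.4, "S2": 0.6}                      # emission: P(H | state)
trans = {"S1": {"S1": 0.9, "S2": 0.1},
         "S2": {"S1": 0.4, "S2": 0.6}}            # transition: P(next | current)
start = "S1"                                      # starting state assumed to be S1

def emit_prob(state, symbol):
    """P(symbol | state) for symbol in {'H', 'T'}."""
    return p_h[state] if symbol == "H" else 1.0 - p_h[state]

def brute_force_likelihood(obs):
    """Sum P(obs, states) over every state sequence that starts in `start`."""
    total = 0.0
    for rest in product(["S1", "S2"], repeat=len(obs) - 1):
        states = (start,) + rest
        p = emit_prob(states[0], obs[0])
        for t in range(1, len(obs)):
            p *= trans[states[t - 1]][states[t]] * emit_prob(states[t], obs[t])
        total += p
    return total

print(brute_force_likelihood("HH"))       # matches P(HH|M) expanded above: 0.168
print(brute_force_likelihood("HHT"))      # answers the question P(HHT|M) = ?
print(brute_force_likelihood("HHHTTHT"))  # the full observation sequence
```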
● In general, there are three computational tasks associated with HMM:
– Finding probability of an observation sequence given the model P(O|M)
– Finding the probability of a state sequence given the observation sequence; or more precisely, finding the most likely state sequence for a given observation sequence
– Given a large collection of observation sequences and their corresponding state sequences (or just observation sequences alone), finding transition, emission, and initial probabilities.
Applications of HMM
● For sequence labeling:
– A sequence classifier or sequence labeler is a
model whose job is to assign some label or class to each unit in a sequence.
● Examples:
– Language: part-of-speech tagging, named-entity recognition
– Speech: phoneme boundary detection
– Bioinformatics: profile HMM, gene prediction, protein structure prediction
Example problem for HMM:
POS tagging
● Parts of speech: noun, verb, adjective, adverb, auxiliary verb, preposition,
determiner, etc.
● Problem: given a sentence, find the part of speech of each word in it. (Why is this important?)
● Example:
– Ram was eating rice.
– Ram(NNP) was(AUX) eating(VBG) rice(NNS).
Problem formulation
● What is the observation sequence: the sequence of words
● What is the state sequence: the sequence of POS tags
● States: parts of speech
● Problem formulation: find the most likely state sequence (POS tag sequence) for the given sentence (word sequence); a decoding sketch follows the diagram below.
(State transition diagram over tags: start_of_sent, noun, verb, adj, adv, end_of_sent, with transition probabilities such as 0.7 and 0.5 on the edges leaving Start.)
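The decoding sketch referenced above: a minimal Viterbi implementation (not from the lecture) that finds the most likely tag sequence given start, transition, and emission probabilities. The toy probability tables in the usage example are invented purely for illustration.

```python
# Minimal Viterbi decoding sketch for POS tagging (illustrative numbers only).
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for `words`.

    start_p[t]      = P(t | start_of_sent)
    trans_p[t1][t2] = P(t2 | t1)
    emit_p[t][w]    = P(w | t)   (a tiny floor is used for unseen words)
    """
    # delta[t] = probability of the best tag sequence ending in tag t so far.
    delta = {t: start_p[t] * emit_p[t].get(words[0], 1e-6) for t in tags}
    back = []                                   # back pointers, one dict per position
    for w in words[1:]:
        prev, scores, pointers = delta, {}, {}
        for t in tags:
            best = max(tags, key=lambda p: prev[p] * trans_p[p][t])
            scores[t] = prev[best] * trans_p[best][t] * emit_p[t].get(w, 1e-6)
            pointers[t] = best
        delta = scores
        back.append(pointers)
    # Trace back from the best final tag.
    seq = [max(tags, key=lambda t: delta[t])]
    for pointers in reversed(back):
        seq.append(pointers[seq[-1]])
    return list(reversed(seq))

# Toy usage with two tags and invented probabilities:
tags = ["NN", "VB"]
start_p = {"NN": 0.7, "VB": 0.3}
trans_p = {"NN": {"NN": 0.3, "VB": 0.7}, "VB": {"NN": 0.6, "VB": 0.4}}
emit_p = {"NN": {"Ram": 0.4, "rice": 0.4}, "VB": {"eats": 0.8}}
print(viterbi(["Ram", "eats", "rice"], tags, start_p, trans_p, emit_p))  # ['NN', 'VB', 'NN']
```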
HMM: formal model specification
● A five-tuple (S, K, Π, A, B)
S = {s1, s2, …, sN} – states
K = {k1, k2, …, kM} = {1, 2, …, M} – output alphabet
Π = {πi}, i ∈ S – initial state probabilities
A = {aij}, i, j ∈ S – state transition probabilities
B = {bik}, i ∈ S, k ∈ K – symbol emission probabilities
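As an aside, here is one way (a sketch, not from the lecture; field names are assumptions) the five-tuple could be held in code:

```python
# Holder for the HMM five-tuple (S, K, Pi, A, B); field names are illustrative.
from dataclasses import dataclass

@dataclass
class HMM:
    states: list     # S = {s1, ..., sN}
    alphabet: list   # K = {k1, ..., kM}, the output symbols
    pi: dict         # pi[s]    = P(initial state is s)
    A: dict          # A[s][s2] = P(next state s2 | current state s)
    B: dict          # B[s][k]  = P(emit symbol k | state s)
```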
An HMM is characterized by three fundamental problems:
● Problem 1 (Evaluation): Given the observation sequence O = o1,…,oT and an HMM model M = (A, B, Π), how do we compute the probability of O given the model?
● Problem 2 (Decoding): Given the observation sequence O = o1,…,oT and an HMM model M = (A, B, Π), how do we find the state sequence that best explains the observations?
● Problem 3 (Learning): How do we find the model parameters (A, B, Π) given labeled or unlabeled training sequences?
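For the supervised variant of Problem 3 (labeled training sequences), the parameters reduce to relative-frequency counts, in the spirit of the MLE discussion earlier; the unlabeled case requires the Baum-Welch (EM) algorithm instead. A minimal counting sketch, with invented names:

```python
# Supervised HMM parameter estimation by counting (relative frequencies).
from collections import Counter, defaultdict

def estimate_parameters(labeled_data):
    """labeled_data: list of (observations, states) pairs of equal length.
    Returns MLE estimates of (A, B, pi) as nested dictionaries."""
    trans, emit, init = defaultdict(Counter), defaultdict(Counter), Counter()
    for obs, states in labeled_data:
        init[states[0]] += 1                       # count initial states
        for o, s in zip(obs, states):
            emit[s][o] += 1                        # count emissions
        for prev, cur in zip(states, states[1:]):
            trans[prev][cur] += 1                  # count transitions
    normalize = lambda c: {k: v / sum(c.values()) for k, v in c.items()}
    A = {s: normalize(c) for s, c in trans.items()}
    B = {s: normalize(c) for s, c in emit.items()}
    pi = normalize(init)
    return A, B, pi
```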
Trellis representation
(Trellis diagram: states S1, S2, …, Sn on the vertical axis; time steps 1, 2, …, T on the horizontal axis.)
Problem 1: finding P(O|M)
● A naive implementation enumerates all N^T possible state sequences
● Takes O(N^T) time
● Number of multiplications: (2T - 1) × N^T
● Number of additions: N^T - 1
● For N = 5 and T = 100, this requires around 10^72 computations!
● Need a better algorithm
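The standard fix is the forward algorithm, which reuses partial sums along the trellis and brings the cost down to O(N²T). A minimal sketch (not the lecture's notation; dictionary parameters as in the earlier examples), which reproduces the two-coin HMM value computed by brute force above:

```python
# Forward algorithm: computes P(O|M) in O(N^2 * T) time by summing the trellis
# column by column instead of enumerating all N^T state sequences.
def forward_probability(obs, states, pi, A, B):
    """alpha[s] after processing o1..ot equals P(o1..ot, state at time t = s)."""
    alpha = {s: pi[s] * B[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * A[r][s] for r in states) * B[s][o]
                 for s in states}
    return sum(alpha.values())

# Two-coin HMM from earlier, with the starting state fixed to S1:
states = ["S1", "S2"]
pi = {"S1": 1.0, "S2": 0.0}
A = {"S1": {"S1": 0.9, "S2": 0.1}, "S2": {"S1": 0.4, "S2": 0.6}}
B = {"S1": {"H": 0.4, "T": 0.6}, "S2": {"H": 0.6, "T": 0.4}}
print(forward_probability("HHHTTHT", states, pi, A, B))  # same value as brute force
```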