CS621 – Lecture 7
Hidden Markov Models
Areas of AI and their interdependencies
(Diagram: Search, Vision, Planning, Machine Learning, Knowledge Representation, Logic, Expert Systems, Robotics, NLP, and their interdependencies.)
Slide from Prof. Bhattacharyya's lectures
So far, in the course CS621 ...
● Emergence of Artificial Intelligence
● Search – informed and uninformed
● Predicate Calculus and inference using resolution, forward chaining, backward chaining
Machine learning
● Learning from data – generating hypotheses from data
● Inductive, rather than deductive
● Consider the following example:
– If I have a biased coin with p(h) = 0.6, what is the likelihood of the sequence HHTHT? Assume each toss is independent.
– Now, turn the question around: if I toss a coin and get the above sequence, what can be said about the coin?
● If I ignore any prior knowledge about the coin, I would estimate p(h) as #H/total tosses, and likewise p(t) as #T/total tosses. These are my "estimators" of the parameters (p(h) and p(t)) of the "model". These estimators produce maximum likelihood estimates (MLE) of the parameters – the name derives from the fact that the estimates so obtained maximize the likelihood of the data when plugged into the model.
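As a quick illustration (a sketch, not from the lecture; names are illustrative), both quantities can be computed directly: the likelihood of HHTHT under p(h) = 0.6, and the MLE of p(h) from the counts in that same sequence.

```python
# Likelihood of an independent coin-toss sequence, and the MLE of p(h) from counts.
def sequence_likelihood(seq, p_h):
    """P(seq) when tosses are independent and P(H) = p_h."""
    prob = 1.0
    for toss in seq:
        prob *= p_h if toss == "H" else (1.0 - p_h)
    return prob

seq = "HHTHT"
print(sequence_likelihood(seq, 0.6))  # 0.6 * 0.6 * 0.4 * 0.6 * 0.4 = 0.03456
print(seq.count("H") / len(seq))      # MLE of p(h) = #H / total tosses = 0.6
```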
● Now consider the following case:
– I have two biased coins: one always produces H and the other always produces T. Let me call
them c1 and c2, respectively.
– I toss n times with each toss resulting from either c1 or c2. If the probability of my choosing c1
immediately after having tossed c2 is 0.6 and that of choosing c2 after c1 is 0.1, what is the
likelihood of the sequence HTTHHT?
– Could you tell me the sequence of coins that would have resulted in this observation sequence? Easy – every H must come from c1 and every T from c2.
● State C1: coin c1 is tossed (only produces H)
● State C2: coin c2 is tossed (only produces T)
(State diagram: self-loops P(C1|C1) = 0.9 and P(C2|C2) = 0.4; cross transitions P(C2|C1) = 0.1 and P(C1|C2) = 0.6.)
If the starting state is C1,
P(HHHTTHTH|Model) = 1 × 0.9 × 0.9 × 0.1 × 0.4 × 0.6 × 0.1 × 0.6
(the first H is certain because C1 always produces H; each remaining factor is the transition probability between consecutive states in the state sequence C1 C1 C1 C2 C2 C1 C2 C1).
Markov chains
● In general,
– Set of states S={S1...SN}: Process moves from one state to another
– Matrix of transition probabilities A=(aij); aij=P(Sj|Si): It defines (probabilistically) how the process moves among states
– Vector of initial probabilities p=(pi); pi=P(Si): It defines (probabilistically) the initial state
Slide borrowed from Javier Garcia-Frias's tutorial
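As a minimal sketch (not from the slides; variable names are illustrative), the probability of a state sequence in a Markov chain is the initial probability of the first state multiplied by the successive transition probabilities. The two-coin chain above is used as the example data.

```python
# Probability of a state sequence under a Markov chain: p(s1) * prod_t a(s_{t-1}, s_t).
def chain_probability(state_seq, p, A):
    prob = p[state_seq[0]]                      # initial probability
    for prev, cur in zip(state_seq, state_seq[1:]):
        prob *= A[prev][cur]                    # transition probability
    return prob

# Two-coin example: C1 always emits H, C2 always emits T, and we start in C1.
p = {"C1": 1.0, "C2": 0.0}
A = {"C1": {"C1": 0.9, "C2": 0.1}, "C2": {"C1": 0.6, "C2": 0.4}}
# The observation HHHTTHTH corresponds to the state sequence C1 C1 C1 C2 C2 C1 C2 C1.
print(chain_probability(["C1", "C1", "C1", "C2", "C2", "C1", "C2", "C1"], p, A))
```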
Modeling more complicated sequences: Hidden Markov Models
● Consider: I have two biased coins
– Coin 1 produces H with probability 0.4
– Coin 2 produces H with probability 0.6
– P(C1|C1) = .9
– P(C2|C2) = .6
● What is the likelihood of observing the sequence HHHTTHT?
● In other words, how can we calculate P(O|M)?
● State S1: C1 is tossed
– H is produced with probability P(H|S1)
– T is produced with probability 1 - P(H|S1)
● State S2: C2 is tossed
– H is produced with probability P(H|S2)
– T is produced with probability 1 - P(H|S2)
● Assuming the starting state as S1, how can we calculate P(HHHTTHT|model)?
● Note: unlike the Markov chain case, the state sequence is not the same as the observation sequence (hence the name "hidden" Markov model)
● P(H|M) = P(H|S1)
● P(HH|M) = P(H|S1) × 0.9 × P(H|S1) + P(H|S1) × 0.1 × P(H|S2)
● P(HHT|M) = ?
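For short sequences, P(O|M) can be computed by brute force: sum the joint probability of the observations and every possible hidden state sequence. Below is a minimal sketch (not from the lecture; names are illustrative) for the two-coin HMM above, with the starting state fixed to S1; it answers P(HHT|M) directly.

```python
# Brute-force P(O|M): sum over all hidden state sequences of the two-coin HMM.
from itertools import product

p_h = {"S1": 0.4, "S2": 0.6}                      # emission: P(H | state)
trans = {"S1": {"S1": 0.9, "S2": 0.1},
         "S2": {"S1": 0.4, "S2": 0.6}}            # transition: P(next | current)
start = "S1"                                      # starting state assumed to be S1

def emit_prob(state, symbol):
    """P(symbol | state) for symbol in {'H', 'T'}."""
    return p_h[state] if symbol == "H" else 1.0 - p_h[state]

def brute_force_likelihood(obs):
    """Sum P(obs, states) over every state sequence that starts in `start`."""
    total = 0.0
    for rest in product(["S1", "S2"], repeat=len(obs) - 1):
        states = (start,) + rest
        p = emit_prob(states[0], obs[0])
        for t in range(1, len(obs)):
            p *= trans[states[t - 1]][states[t]] * emit_prob(states[t], obs[t])
        total += p
    return total

print(brute_force_likelihood("HH"))       # matches P(HH|M) expanded above: 0.168
print(brute_force_likelihood("HHT"))      # answers the question P(HHT|M) = ?
print(brute_force_likelihood("HHHTTHT"))  # the full observation sequence
```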
● In general, there are three computational tasks associated with HMM:
– Finding probability of an observation sequence given the model P(O|M)
– Finding the probability of a state sequence given the observation sequence; or more precisely, finding the most likely state sequence for a given observation sequence
– Given a large collection of observation sequences and their corresponding state sequences (or just observation sequences alone), finding transition, emission, and initial probabilities.
Applications of HMM
● For sequence labeling:
– A sequence classifier or sequence labeler is a
model whose job is to assign some label or class to each unit in a sequence.
● Examples:
– Language: part-of-speech tagging, named-entity recognition
– Speech: phoneme boundary detection
– Bioinformatics: profile HMM, gene prediction, protein structure prediction
Example problem for HMM:
POS tagging
● Parts of speech: noun, verb, adjective, adverb, auxiliary verb, preposition,
determiner, etc.
● Problem: given a sentence, find the part of speech of each word in it. (Why is this important?)
● Example:
– Ram was eating rice.
– Ram(NNP) was(AUX) eating(VBG) rice(NNS).
Problem formulation
● What is the observation sequence: the sequence of words
● What is the state sequence: the sequence of POS tags
● States: parts of speech
● Problem formulation: find the most likely state sequence (POS tag sequence) for the given sentence (word sequence); a decoding sketch follows the diagram below.
(State transition diagram over tags: start_of_sent, noun, verb, adj, adv, end_of_sent, with transition probabilities such as 0.7 and 0.5 on the edges leaving Start.)
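The decoding sketch referenced above: a minimal Viterbi implementation (not from the lecture) that finds the most likely tag sequence given start, transition, and emission probabilities. The toy probability tables in the usage example are invented purely for illustration.

```python
# Minimal Viterbi decoding sketch for POS tagging (illustrative numbers only).
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for `words`.

    start_p[t]      = P(t | start_of_sent)
    trans_p[t1][t2] = P(t2 | t1)
    emit_p[t][w]    = P(w | t)   (a tiny floor is used for unseen words)
    """
    # delta[t] = probability of the best tag sequence ending in tag t so far.
    delta = {t: start_p[t] * emit_p[t].get(words[0], 1e-6) for t in tags}
    back = []                                   # back pointers, one dict per position
    for w in words[1:]:
        prev, scores, pointers = delta, {}, {}
        for t in tags:
            best = max(tags, key=lambda p: prev[p] * trans_p[p][t])
            scores[t] = prev[best] * trans_p[best][t] * emit_p[t].get(w, 1e-6)
            pointers[t] = best
        delta = scores
        back.append(pointers)
    # Trace back from the best final tag.
    seq = [max(tags, key=lambda t: delta[t])]
    for pointers in reversed(back):
        seq.append(pointers[seq[-1]])
    return list(reversed(seq))

# Toy usage with two tags and invented probabilities:
tags = ["NN", "VB"]
start_p = {"NN": 0.7, "VB": 0.3}
trans_p = {"NN": {"NN": 0.3, "VB": 0.7}, "VB": {"NN": 0.6, "VB": 0.4}}
emit_p = {"NN": {"Ram": 0.4, "rice": 0.4}, "VB": {"eats": 0.8}}
print(viterbi(["Ram", "eats", "rice"], tags, start_p, trans_p, emit_p))  # ['NN', 'VB', 'NN']
```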
HMM: formal model specification
● A five-tuple (S, K, Π, A, B)
S = {s1, s2, …, sN} – states
K = {k1, k2, …, kM} = {1, 2, …, M} – output alphabet
Π = {πi}, i ∈ S – initial state probabilities
A = {aij}, i, j ∈ S – state transition probabilities
B = {bik}, i ∈ S, k ∈ K – symbol emission probabilities
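As an aside, here is one way (a sketch, not from the lecture; field names are assumptions) the five-tuple could be held in code:

```python
# Holder for the HMM five-tuple (S, K, Pi, A, B); field names are illustrative.
from dataclasses import dataclass

@dataclass
class HMM:
    states: list     # S = {s1, ..., sN}
    alphabet: list   # K = {k1, ..., kM}, the output symbols
    pi: dict         # pi[s]    = P(initial state is s)
    A: dict          # A[s][s2] = P(next state s2 | current state s)
    B: dict          # B[s][k]  = P(emit symbol k | state s)
```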
An HMM is characterized by three fundamental problems:
● Problem 1 (Evaluation): Given the observation sequence O = o1,…,oT and an HMM model M = (A, B, Π), how do we compute the probability of O given the model?
● Problem 2 (Decoding): Given the observation sequence O = o1,…,oT and an HMM model M = (A, B, Π), how do we find the state sequence that best explains the observations?
● Problem 3 (Learning): How do we find the model parameters (A, B, Π) given labeled or unlabeled training sequences?
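For the supervised variant of Problem 3 (labeled training sequences), the parameters reduce to relative-frequency counts, in the spirit of the MLE discussion earlier; the unlabeled case requires the Baum-Welch (EM) algorithm instead. A minimal counting sketch, with invented names:

```python
# Supervised HMM parameter estimation by counting (relative frequencies).
from collections import Counter, defaultdict

def estimate_parameters(labeled_data):
    """labeled_data: list of (observations, states) pairs of equal length.
    Returns MLE estimates of (A, B, pi) as nested dictionaries."""
    trans, emit, init = defaultdict(Counter), defaultdict(Counter), Counter()
    for obs, states in labeled_data:
        init[states[0]] += 1                       # count initial states
        for o, s in zip(obs, states):
            emit[s][o] += 1                        # count emissions
        for prev, cur in zip(states, states[1:]):
            trans[prev][cur] += 1                  # count transitions
    normalize = lambda c: {k: v / sum(c.values()) for k, v in c.items()}
    A = {s: normalize(c) for s, c in trans.items()}
    B = {s: normalize(c) for s, c in emit.items()}
    pi = normalize(init)
    return A, B, pi
```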
Trellis representation
(Trellis diagram: states S1, S2, …, Sn on the vertical axis; time steps 1, 2, …, T on the horizontal axis.)
Problem 1: finding P(O|M)
● A naive implementation enumerates all N^T possible state sequences
● Takes O(N^T) time
● Number of multiplications: (2T - 1) × N^T
● Number of additions: N^T - 1
● For N = 5 and T = 100, this requires around 10^72 computations!
● Need a better algorithm
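The standard fix is the forward algorithm, which reuses partial sums along the trellis and brings the cost down to O(N²T). A minimal sketch (not the lecture's notation; dictionary parameters as in the earlier examples), which reproduces the two-coin HMM value computed by brute force above:

```python
# Forward algorithm: computes P(O|M) in O(N^2 * T) time by summing the trellis
# column by column instead of enumerating all N^T state sequences.
def forward_probability(obs, states, pi, A, B):
    """alpha[s] after processing o1..ot equals P(o1..ot, state at time t = s)."""
    alpha = {s: pi[s] * B[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * A[r][s] for r in states) * B[s][o]
                 for s in states}
    return sum(alpha.values())

# Two-coin HMM from earlier, with the starting state fixed to S1:
states = ["S1", "S2"]
pi = {"S1": 1.0, "S2": 0.0}
A = {"S1": {"S1": 0.9, "S2": 0.1}, "S2": {"S1": 0.4, "S2": 0.6}}
B = {"S1": {"H": 0.4, "T": 0.6}, "S2": {"H": 0.6, "T": 0.4}}
print(forward_probability("HHHTTHT", states, pi, A, B))  # same value as brute force
```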