
Graphical Models for Sequence Labeling in NLP

M.Tech Seminar Report

Submitted by

Anup Abhay Kulkarni

Roll No: 08305045

Under the guidance of

Prof. Pushpak Bhattacharyya

Department of Computer Science and Engineering Indian Institute of Technology Bombay

Mumbai

2009


Abstract

In NLP tasks it is important to understand the interaction between words and their labels. More information can be obtained from the context than from treating each word as an isolated unit. It has therefore proved beneficial to represent an NLP task in the form of a graph and to train a probabilistic model on that graph. In this report, we discuss how NLP tasks can be represented and solved using different probabilistic graphical models.


Table of Contents

1 Introduction
  1.1 POS Tagging
  1.2 Chunking
  1.3 Named Entity Recognition

2 Random Fields
  2.1 Labeling Problem
  2.2 Neighboring System
  2.3 Random Fields
    2.3.1 Markov Random Fields
    2.3.2 Gibbs Random Field
  2.4 Gibbs-Markov Equivalence
    2.4.1 Proof of GRF → MRF
    2.4.2 Proof of MRF → GRF

3 Maximum Entropy Model
  3.1 HMM to MEMM
    3.1.1 Problems in Traditional HMM
    3.1.2 Basics of Maximum Entropy
  3.2 Formulation
    3.2.1 Formulating the Dual Objective
  3.3 Training
    3.3.1 Parameter Estimation
    3.3.2 Training Algorithm
  3.4 Inferencing
  3.5 Application
    3.5.1 Feature Selection for POS tagging
    3.5.2 Example

4 Cyclic Dependency Network
  4.1 From HMM to Cyclic Dependency Network
  4.2 Cyclic Dependency Network
  4.3 Inferencing in Cyclic Dependency Network

5 Conditional Random Fields
  5.1 Label Bias Problem
  5.2 Conditional Random Field
    5.2.1 Formulation
  5.3 Training
    5.3.1 Parameter Estimation
    5.3.2 Feature Selection
  5.4 Inferencing
  5.5 Application
    5.5.1 Chunking
    5.5.2 Named Entity Recognition

6 Conclusion


Chapter 1

Introduction

The ultimate goal of natural language processing is to understand language, but that is a hard task to achieve. The problem is therefore broken down into the relatively simple and computationally tractable form of sequence labeling. Sequence labeling looks for patterns in sequences of text and attempts to label them accordingly. We will discuss three main sequence labeling tasks:

• POS Tagging

• Chunking

• Named Entity Recognition

1.1 POS Tagging

One of the important tasks in natural language processing is to classify each word into its lexical category. In the simplest form the lexical categories are Noun, Verb, Adjective, Adverb, etc., but depending upon the usage and morphology of different words, these are subdivided into a larger number of POS tags. POS tagging is the task of tagging each word with its correct POS.

Example:

The/AT postman/NN collected/VBD letters/NNS and/CNJ left/VBD.

In the example given above, "The" is an article, so it is marked AT. "Collected" and "left" are verbs in their past tense forms and hence tagged VBD. Though "postman" and "letters" are both nouns, "postman" is singular while "letters" is plural, so they are tagged differently. In this example it looks as though every word has a unique POS tag. Consider another example in which the same word is tagged differently depending upon the role it plays.

He/PN plays/VBZ violin/NN in/PP the/AT play/NN.

Here the word "play" appears twice and is tagged differently. The first time it is used as a verb and hence tagged VBZ (verb in 3rd person singular form), whereas the second time it is used as a noun and tagged NN.

These two examples show how POS tags are assigned and why tagging is more than simply keeping a dictionary of words and their POS tags. To correctly identify the POS tag of a word we need to know what role that word plays in the sentence. To identify the role we need to identify certain rules of the language which help determine the correct tags.

Example: In English an article is followed by a noun, so if we could correctly identify articles then we could correctly label the nouns following them. The word "been" is always followed by a verb in past participle or gerund form.

We use graphical models to find such rules from the training dataset.


1.2 Chunking

Chunking is the task of understanding the structure of a sentence without fully parsing it. The output of chunking is a segmentation of the sentence into groups of words that belong to the same grammatical unit. The labels given after chunking are grammatical phrase types prefixed by B, I or O: B begins a phrase, I continues the current phrase, and O is outside any phrase. The question, then, is how graphical models can be used for chunking.

Example:

He <B-NP> reckons <B-VP> the <B-NP> current <I-NP> account <I-NP> deficit <I-NP> will <B-VP> narrow <I-VP> . <O>

In this example "He" begins a noun phrase. "Reckons" begins a new verb phrase. "The" starts a new noun phrase, which is continued by "current account deficit". The presence of auxiliary verbs and verbs in infinitive form demands I-VP, i.e. continuing the same verb phrase.

It can be seen that, since I means "continues the current phrase", I will never occur as the label of the first word in a sentence. Also, I-NP (continuing the current noun phrase) can never directly follow I-VP (continuing the current verb phrase): there must be a B-NP label in between, marking the beginning of the new noun phrase. Thus the labels associated with the words are not independent of each other, and their mutual interaction can be understood by constructing a graph structure over them. Since a chunk is likely to contain semantically related words, chunking plays an important role in information extraction.

1.3 Named Entity Recognition

A named entity, as the name suggests, is an entity with a name. The entities are proper nouns, and the task is to classify them as the name of a person, the name of a location, the name of an organization, etc. Recognizing named entities plays a very important role in question answering systems, information extraction systems, and search engines.

Example:

I bought 100 shares of IBM Corp <ORG>.

In the above example "IBM Corp." is tagged as the name of an organization. Now let us consider an example in which the same word is labeled with different tags.

IIT Bombay <ORG> is located in Powai which is in north Bombay <LOC>.

In this example "Bombay" appears twice: the first occurrence is tagged as part of the name of an organization and the second as a location.

Wesley <PER> went to buy the book published by Addison Wesley <ORG>.

Again, "Wesley" is first tagged as the name of a person, whereas the second time it is tagged as part of the name of an organization.


Chapter 2

Random Fields

2.1 Labeling Problem

The labeling problem is formally defined as assigning a label from a set L to each node y_i of a graph. The set Y = {y_1, y_2, ..., y_n} is called a labeling or configuration.

If the number of possible labels at each node is |L| = l, then the total number of possible labelings is

l × l × ... × l = l^n

where n is the number of nodes.

Example: Consider POS tagging with the label set L = {N, V, ADJ, ADV, PP}, where N is noun, V is verb, ADJ is adjective, ADV is adverb, and PP is preposition. For the example sentence "IIT-Bombay is one of the best institute in India" there are 9 words to label and each word can be given 5 possible tags, so there are 5^9 possible labelings.

In a sequence labeling task, an image or a text is treated as a graph, with every pixel or every word being a node. The edges of this graph are defined implicitly by considering the mutual interaction amongst the nodes. E.g., in image labeling the task is to label each pixel with the object in the image to which it belongs; in text labeling the task is to find which phrase of the sentence a word belongs to. Edges are then defined between the words (or pixels) which directly influence, or are assumed to influence, the labels of other words (or pixels).

2.2 Neighboring System

Once we define a graph over the text data, we define a neighboring system on that graph. A neighboring system Ne is defined as

Ne = {Ne(i) | ∀ i ∈ Y}

where each Ne(i) is the set of nodes directly connected to i. Ne(i) has the following properties:

• A site is not a neighbor of itself: i ∉ Ne(i)

• Neighborhood is mutual: if j ∈ Ne(i) then i ∈ Ne(j)

In the case of images, the neighboring set of a pixel can be defined by considering the 4-connectivity or 8-connectivity of the pixel, as shown in the figure. In the case of text data the graph is usually a chain graph, so each node has 2-connectivity: each word is connected to its previous word and its next word.

In graph theory a clique is a fully connected subgraph. In this discussion we take all possible cliques into consideration, so the set of cliques is

C = {c_1, c_2, c_3, ...}

where c_1 is the set of single nodes, c_2 the set of pairs of connected nodes, c_3 the set of triples of mutually neighboring nodes, and so on. Since in text applications the graph is usually a chain graph, only cliques up to size 2 are considered.
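For a chain graph this clique structure can be enumerated directly; a small Python sketch (the function name is ours):

```python
def chain_cliques(words):
    """Enumerate the cliques of a chain graph over the given words.

    In a chain, the only cliques are single nodes (size 1) and pairs of
    adjacent nodes (size 2); no three nodes are mutually connected."""
    singles = [(i,) for i in range(len(words))]
    pairs = [(i, i + 1) for i in range(len(words) - 1)]
    return singles + pairs

sentence = "IIT-Bombay is one of the best institute in India".split()
cliques = chain_cliques(sentence)  # 9 node cliques + 8 edge cliques
```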


2.3 Random Fields

A random field is a generalization of a stochastic process in which the underlying parameter need no longer be a simple real number but can be a multidimensional vector. In simple words, when instead of a single random variable over a sample space we have a list or vector of random variables, the corresponding random process is called a random field.

Let Y = {y_1, y_2, ..., y_n} be a set of random variables defined over the set S, in which each random variable y_i takes a value l_i ∈ L. The probability associated with a random variable is denoted \Pr(y_i = l_i), whereas the probability of the joint event is given by

\Pr(Y) = \Pr(y_1, y_2, \dots, y_n)

2.3.1 Markov Random Fields

A Markov random field is a model of the (full) joint probability distribution of a set of random variables having the Markov property. A random field is said to be a Markov Random Field (MRF) when it satisfies the following properties:

• Positivity: every configuration has nonzero probability,

\Pr(y) > 0 \quad \forall y \in Y

• Markovianity:

\Pr(y_i \mid y_{S - \{i\}}) = \Pr(y_i \mid y_{Ne(i)})

where Ne(i) is the set of all neighbors of node i in the graph.

The Markov property says that the probability of a node taking a particular label given the labels of all the other nodes in the graph equals the probability of the node taking that label given only the labels of its neighbors.

Example: Consider the same chain graph for the sentence "IIT-Bombay is one of the best institute in India". By the Markov property, the label of the word "institute" given all the other labels in the sentence is the same as its label given only the labels of "best" and "in", as shown in Fig. 2.1.

Figure 2.1: Graphical model induced on a sentence. The word "institute" depends only on "best" and "in".

2.3.2 Gibbs Random Field

A random field is said to be a Gibbs random field (GRF) if and only if its configurations obey a Gibbs distribution. The Gibbs distribution is defined as

\Pr(y) = \frac{\exp(-U(y)/T)}{Z}

where Z, called the normalizing constant, is defined as

Z = \sum_{y \in Y} \exp(-U(y)/T)

T is called the temperature and is 1 unless explicitly mentioned otherwise. U(y) is the energy function, defined as the sum of all clique potentials:

U(y) = \sum_{c \in C} V_c(y)

The value of V_c(y) depends upon the local configuration of the clique; these potentials are redefined as features in the later discussion. The energy function can also be written as

U(y) = \sum_{c \in C_1} V_1(y) + \sum_{c \in C_2} V_2(y) + \dots

where V_1 is the potential over single nodes (node potential) and V_2 is the potential over pairs of neighboring nodes (edge potential).
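The Gibbs distribution above can be checked numerically on a toy chain; a minimal sketch with invented node and edge potentials (any real values would do):

```python
import itertools
import math

labels = [0, 1]
n = 3                                    # three-node chain
V1 = {0: 0.5, 1: 1.0}                    # node potentials (illustrative)
V2 = {(a, b): 0.0 if a == b else 1.5     # edge potentials: disagreement costs more
      for a in labels for b in labels}

def energy(y):
    """U(y): sum of clique potentials over nodes and adjacent pairs."""
    return (sum(V1[l] for l in y)
            + sum(V2[(y[i], y[i + 1])] for i in range(len(y) - 1)))

T = 1.0
configs = list(itertools.product(labels, repeat=n))
Z = sum(math.exp(-energy(y) / T) for y in configs)   # normalizing constant

def pr(y):
    return math.exp(-energy(y) / T) / Z

# The l^n configurations form a proper distribution, and the lowest-energy
# configuration is the most probable one.
best = max(configs, key=pr)
```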

2.4 Gibbs-Markov Equivalence

An MRF is characterized by its local property (the Markovianity), whereas a GRF is characterized by its global property (the Gibbs distribution). The Hammersley-Clifford theorem establishes the equivalence of these two properties: a random field F is an MRF if and only if F is a GRF.

2.4.1 Proof of GRF → MRF

Let D_i = Ne(i) ∪ {X_i} be the set consisting of the neighbors of X_i together with X_i itself. We have to prove that if the Gibbs distribution is satisfied then the Markov property holds; that is, given the Gibbs distribution we have to arrive at

\Pr(X_i \mid X_{S-i}) = \Pr(X_i \mid X_{Ne(i)})   (2.1)

Starting with the LHS,

\Pr(X_i \mid X_{S-i}) = \frac{\Pr(X_i, X_{S-i})}{\Pr(X_{S-i})}
= \frac{\Pr(X_i, X_{S-i})}{\sum_{X_i} \Pr(X_i, X_{S-i})}
= \frac{\exp(-\sum_{c \in C} V_c(X)) / Z}{\sum_{X_i} \exp(-\sum_{c \in C} V_c(X)) / Z}
= \frac{\exp(-\sum_{c \in C} V_c(X))}{\sum_{X_i} \exp(-\sum_{c \in C} V_c(X))}

Now split the set of cliques C into the set A of cliques that contain node X_i and the set B of cliques that do not contain X_i. Using this,

\Pr(X_i \mid X_{S-i}) = \frac{\exp(-\sum_{c \in A} V_c(X)) \times \exp(-\sum_{c \in B} V_c(X))}{\sum_{X_i} \exp(-\sum_{c \in A} V_c(X)) \times \exp(-\sum_{c \in B} V_c(X))}

Since the cliques in B do not contain node X_i, the term \exp(-\sum_{c \in B} V_c(X)) does not depend on X_i; it can be brought out of the summation and cancelled:

\Pr(X_i \mid X_{S-i}) = \frac{\exp(-\sum_{c \in A} V_c(X))}{\sum_{X_i} \exp(-\sum_{c \in A} V_c(X))}

The remaining terms involve only the cliques that contain X_i, and every such clique lies entirely within D_i. We arrive at the same expression from the RHS:

\Pr(X_i \mid X_{Ne(i)}) = \frac{\Pr(X_i, X_{Ne(i)})}{\Pr(X_{Ne(i)})}
= \frac{\exp(-\sum_{c \in A} V_c(X))}{\sum_{X_i} \exp(-\sum_{c \in A} V_c(X))}

Thus 2.1 is proved.

2.4.2 Proof of MRF → GRF

For this the Möbius inversion principle is used:

F(A) = \sum_{B \subseteq A} G(B) \iff G(B) = \sum_{C \subseteq B} (-1)^{|B| - |C|} F(C)

Written for the energy function U(X) and the potentials V(X), this reads

U(x_V) = \sum_{B \subseteq V} V(x_B) \iff V(x_B) = \sum_{C \subseteq B} (-1)^{|B| - |C|} U(x_C)

where V is the node set of our graph and B ⊆ V is a subset of it.

The Gibbs distribution requires the energy to be defined over the cliques of the graph. So we have to prove that

V(x_B) = 0   (2.2)

whenever B is not completely connected, i.e. B is not a clique.

When B is not a clique, let Z_1 and Z_2 be two nodes of B that are not connected, and let S be the remaining nodes, so that {Z_1} ∪ S ∪ {Z_2} = B. Now V(x_B) = \sum_{C \subseteq B} (-1)^{|B| - |C|} U(x_C) can be split according to whether C contains Z_1, Z_2, both, or neither:

V(x_B) = \sum_{C \subseteq S} (-1)^{|B| - |C|} U(x_C)
+ \sum_{C \subseteq S} (-1)^{|B| - |Z_1 C|} U(x_{Z_1 C})
+ \sum_{C \subseteq S} (-1)^{|B| - |C Z_2|} U(x_{C Z_2})
+ \sum_{C \subseteq S} (-1)^{|B| - |Z_1 C Z_2|} U(x_{Z_1 C Z_2})

Adjusting the exponents of (−1),

V(x_B) = \sum_{C \subseteq S} (-1)^{|B| - |C|} \left[ U(x_C) - U(x_{Z_1 C}) - U(x_{C Z_2}) + U(x_{Z_1 C Z_2}) \right]   (2.3)

We now prove that the bracketed term [.] is 0. Taking exp of −[.] and using \Pr(x) = \exp(-U(x))/Z,

\exp(-[.]) = \frac{\exp(-U(x_C)) \times \exp(-U(x_{Z_1 C Z_2}))}{\exp(-U(x_{Z_1 C})) \times \exp(-U(x_{C Z_2}))}
= \frac{\frac{\exp(-U(x_C))}{Z} \times \frac{\exp(-U(x_{Z_1 C Z_2}))}{Z}}{\frac{\exp(-U(x_{Z_1 C}))}{Z} \times \frac{\exp(-U(x_{C Z_2}))}{Z}}
= \frac{\Pr(x_C)}{\Pr(x_{Z_1 C})} \cdot \frac{\Pr(x_{Z_1 C Z_2})}{\Pr(x_{C Z_2})}

Using Bayes' theorem,

\exp(-[.]) = \frac{\Pr(x_{Z_1} \mid x_{C Z_2})}{\Pr(x_{Z_1} \mid x_C)}

Since X_{Z_1} and X_{Z_2} are not directly connected, the Markov property gives \Pr(x_{Z_1} \mid x_{C Z_2}) = \Pr(x_{Z_1} \mid x_C), so \exp(-[.]) = 1 and the bracketed term in 2.3 is 0. So by 2.2 every MRF is a GRF.


Chapter 3

Maximum Entropy Model

3.1 HMM to MEMM

3.1.1 Problems in Traditional HMM

[5] discusses the need to graduate from Hidden Markov Models. In a Hidden Markov Model, observation probabilities are expressed as a multinomial distribution over the observation data and the parameters are then estimated. There are two main problems with this approach:

1. Though the input is raw text, many important characteristics of it can be used in constructing the model, such as capitalization, suffixes, position on the page, and the POS tag of the word. Example: for unseen words we can make use of the suffix of the unknown word and look at the distribution over labels of that suffix in the corpus; similarly for capitalization. This extra information is not used by a traditional HMM.

2. In almost all text labeling applications, the problem is to find the label (or label sequence) given the observation sequence. So even though the problem requires a conditional model, the HMM models the joint probability. Due to this joint modeling, in the dependency graph of an HMM the observation depends upon the label, whereas intuitively the label should depend on the observation.

With these drawbacks of the traditional HMM, we look for a new probability model.

Figure 3.1: (A) Dependency graph for a traditional HMM; (B) dependency graph for the Maximum Entropy Markov Model (MEMM)

3.1.2 Basics of Maximum Entropy

Consider one sequence labeling task: POS tagging, in which we tag every word in the sentence with its part-of-speech tag. We are given an observed dataset in which words are correctly tagged by human experts. Our main aim is to build a model which correctly fits the observed data.

For simplicity, suppose we have only 5 tags, namely Noun (N), Verb (V), Adjective (ADJ), Adverb (ADV) and Preposition (PP). So given a word W, we have only 5 choices of tag. This imposes one constraint on the probability model:

Pr(N|W) + Pr(V|W) + Pr(ADJ|W) + Pr(ADV|W) + Pr(PP|W) = 1


Given only this equation, infinitely many probability models are possible. One model might make very strong assumptions and assign Pr(N|W) = 1 and 0 to the others, which means it would assign the POS tag N to every word. Another model might assign equal probabilities to Noun and Verb and 0 to the others:

Pr(N|W) + Pr(V|W) = 1

Both of these models make extra assumptions beyond what is observed from the data. Yet another model can be given as

Pr(N|W) = Pr(V|W) = Pr(ADJ|W) = Pr(ADV|W) = Pr(PP|W) = 1/5

which assumes every tag is equally possible. Now, suppose it is seen from the observed data that the word W is tagged as Verb 70% of the time. We can incorporate this information into our model as:

Pr(V|W) = 0.7
Pr(N|W) + Pr(ADJ|W) + Pr(ADV|W) + Pr(PP|W) = 0.3

We still need a model for how the remaining 0.3 is divided among N, ADJ, ADV and PP, but for this we do not have any information available from the observed data. In general, the question we need to answer is what to assume about unknown information. Here we make use of the maximum entropy principle, which states that:

if incomplete information about a probability distribution is available, the only unbiased assumption that can be made is a distribution which is as uniform as possible given the known information as constraints.

Since uniformity of a probability distribution corresponds to maximal entropy, we select the probability model which maximizes entropy under the given constraints.
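The principle can be illustrated numerically: among models satisfying Pr(V|W) = 0.7, the one that spreads the remaining 0.3 uniformly has the highest entropy. The three candidate models below are invented for illustration:

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum p log p (0 log 0 taken as 0)."""
    return -sum(q * math.log(q) for q in p.values() if q > 0)

# Three models consistent with the single constraint Pr(V|W) = 0.7:
spiky   = {"N": 0.3,   "V": 0.7, "ADJ": 0.0,   "ADV": 0.0,   "PP": 0.0}
skewed  = {"N": 0.15,  "V": 0.7, "ADJ": 0.1,   "ADV": 0.04,  "PP": 0.01}
uniform = {"N": 0.075, "V": 0.7, "ADJ": 0.075, "ADV": 0.075, "PP": 0.075}
```

The maximum entropy principle picks the model that is uniform over the tags about which the data says nothing.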

3.2 Formulation

In this section we formally define the Maximum Entropy model and derive its form as discussed by [2]. To state it again, our goal is to come up with a model which correctly fits the observed data. We are given N samples in the form of (word, label) pairs: (x_1, y_1), (x_2, y_2), ..., (x_N, y_N). We use these training examples to calculate the empirical probability distribution

\tilde{\Pr}(x, y) = \frac{\text{number of times } (x, y) \text{ occurs in the sample}}{N}

An important part of this model is the statistics collected from the sample data. These statistics should give us maximum possible information about the trend between a word and its label.

Example: In most cases an article (a, an, the) is followed by a noun. Or, if the word 'play' is preceded by a noun, then 'play' is a verb.

Such information about the interaction between words and labels is captured using "features". A feature is an indicator function of the form

f(x, y) = 1 if x = 'play' and y = verb, 0 otherwise

We populate these features from the training data and put the constraint on our model that the expected value of each feature under the model should equal its expected value under the training data:

\Pr(f) = \tilde{\Pr}(f)   (3.1)

where the expected value of feature f under the empirical distribution is defined as

\tilde{\Pr}(f) = \sum_{x,y} \tilde{\Pr}(x, y) f(x, y)

and similarly the expected value of feature f under our probability model is

\Pr(f) = \sum_{x,y} \Pr(x, y) f(x, y)


Using the chain rule of probability,

\Pr(f) = \sum_{x,y} \Pr(x) \Pr(y \mid x) f(x, y)

Here the assumption is made that the probability of a word under the model is exactly equal to its probability under the empirical distribution, \Pr(x) = \tilde{\Pr}(x), so

\Pr(f) = \sum_{x,y} \tilde{\Pr}(x) \Pr(y \mid x) f(x, y)

Putting this formula in 3.1,

\sum_{x,y} \tilde{\Pr}(x, y) f(x, y) = \sum_{x,y} \tilde{\Pr}(x) \Pr(y \mid x) f(x, y)   (3.2)

This is one constraint on our model. Another constraint is imposed by the definition of probability:

\sum_y \Pr(y \mid x) = 1   (3.3)

We maximize the entropy of the model under the constraints 3.2 and 3.3. The conditional entropy is defined as

H(y \mid x) = -\sum_{x,y} \tilde{\Pr}(x) \Pr(y \mid x) \log \Pr(y \mid x)

Our probability model is then the one that maximizes this conditional entropy:

\Pr{}^*(y \mid x) = \arg\max_{\Pr(y \mid x)} H(y \mid x)

3.2.1 Formulating the Dual Objective

In order to maximize the entropy subject to the constraints, we introduce a Lagrange multiplier for every constraint. Let λ_1, ..., λ_m be the Lagrange multipliers associated with 3.2, one for each feature f_1, ..., f_m, and let λ_{m+1} be the multiplier associated with 3.3. Then our dual objective is:

D(\Pr(y \mid x), \lambda) = H(y \mid x) + \sum_{i=1}^{m} \lambda_i \left( \Pr(f_i) - \tilde{\Pr}(f_i) \right) + \lambda_{m+1} \left( \sum_y \Pr(y \mid x) - 1 \right)

where m is the total number of features. We now differentiate this to find the optimal value.

\frac{\partial H(y \mid x)}{\partial \Pr(y \mid x)} = \frac{\partial}{\partial \Pr(y \mid x)} \left( -\sum_{x,y} \tilde{\Pr}(x) \Pr(y \mid x) \log \Pr(y \mid x) \right) = -\tilde{\Pr}(x) (\log \Pr(y \mid x) + 1)

\frac{\partial}{\partial \Pr(y \mid x)} \sum_{i=1}^{m} \lambda_i \left( \Pr(f_i) - \tilde{\Pr}(f_i) \right) = \frac{\partial}{\partial \Pr(y \mid x)} \sum_{i=1}^{m} \lambda_i \left( \sum_{x,y} \tilde{\Pr}(x) \Pr(y \mid x) f_i(x, y) - \sum_{x,y} \tilde{\Pr}(x, y) f_i(x, y) \right) = \sum_{i=1}^{m} \lambda_i \tilde{\Pr}(x) f_i(x, y)

\frac{\partial}{\partial \Pr(y \mid x)} \lambda_{m+1} \left( \sum_y \Pr(y \mid x) - 1 \right) = \lambda_{m+1}

The derivative of the complete dual is therefore

\frac{\partial D(\Pr(y \mid x), \lambda)}{\partial \Pr(y \mid x)} = -\tilde{\Pr}(x) (1 + \log \Pr(y \mid x)) + \sum_{i=1}^{m} \lambda_i \tilde{\Pr}(x) f_i(x, y) + \lambda_{m+1}   (3.4)


Equating 3.4 to 0,

\tilde{\Pr}(x) (1 + \log \Pr(y \mid x)) = \sum_{i=1}^{m} \lambda_i \tilde{\Pr}(x) f_i(x, y) + \lambda_{m+1}

1 + \log \Pr(y \mid x) = \sum_{i=1}^{m} \lambda_i f_i(x, y) + \frac{\lambda_{m+1}}{\tilde{\Pr}(x)}

\Pr(y \mid x) = \exp\left( \sum_{i=1}^{m} \lambda_i f_i(x, y) + \frac{\lambda_{m+1}}{\tilde{\Pr}(x)} - 1 \right)

\Pr(y \mid x) = \exp\left( \sum_{i=1}^{m} \lambda_i f_i(x, y) \right) \times \exp\left( \frac{\lambda_{m+1}}{\tilde{\Pr}(x)} - 1 \right)   (3.5)

Putting this in 3.3,

\sum_y \Pr(y \mid x) = 1

\sum_y \exp\left( \sum_{i=1}^{m} \lambda_i f_i(x, y) \right) \times \exp\left( \frac{\lambda_{m+1}}{\tilde{\Pr}(x)} - 1 \right) = 1

\exp\left( \frac{\lambda_{m+1}}{\tilde{\Pr}(x)} - 1 \right) = \frac{1}{\sum_y \exp\left( \sum_{i=1}^{m} \lambda_i f_i(x, y) \right)}

Putting this value in 3.5 we get our final probability model:

\Pr(y \mid x) = \frac{\exp\left( \sum_{i=1}^{m} \lambda_i f_i(x, y) \right)}{\sum_y \exp\left( \sum_{i=1}^{m} \lambda_i f_i(x, y) \right)}   (3.6)

Once we have the model we are interested in, our next task is to find the values of the parameters λ_1, ..., λ_m, for which the Improved Iterative Scaling (IIS) training algorithm is used.
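Equation 3.6 is a softmax over weighted feature sums; a minimal sketch with two toy features (the feature functions and weights below are invented):

```python
import math

def maxent_prob(x, y, feats, lam, labels):
    """Pr(y|x) from equation (3.6): normalized exp of weighted feature sums."""
    score = lambda yy: math.exp(sum(l * f(x, yy) for l, f in zip(lam, feats)))
    return score(y) / sum(score(yy) for yy in labels)

# Two toy indicator features for the word "play".
f1 = lambda x, y: 1.0 if x == "play" and y == "V" else 0.0
f2 = lambda x, y: 1.0 if x == "play" and y == "N" else 0.0
labels = ["N", "V", "ADJ"]
p = {y: maxent_prob("play", y, [f1, f2], [2.0, 1.0], labels) for y in labels}
```

With weight 2.0 on the verb feature and 1.0 on the noun feature, the model prefers V over N over ADJ while still forming a proper distribution.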

3.3 Training

3.3.1 Parameter Estimation

The parameters are estimated using the Improved Iterative Scaling algorithm [1]. We select the Λ that minimizes D_{KL}(\tilde{\Pr}(x, y) \,\|\, \Pr_\Lambda(y \mid x)):

D_{KL}(\tilde{\Pr}(x, y) \,\|\, \Pr_\Lambda(y \mid x)) = \sum_{x,y} \tilde{\Pr}(x, y) \log \frac{\tilde{\Pr}(x, y)}{\Pr_\Lambda(y \mid x)}
= \sum_{x,y} \tilde{\Pr}(x, y) \log \tilde{\Pr}(x, y) - \sum_{x,y} \tilde{\Pr}(x, y) \log \Pr_\Lambda(y \mid x)

Since \sum_{x,y} \tilde{\Pr}(x, y) \log \tilde{\Pr}(x, y) does not change with Λ, the divergence is minimized when \sum_{x,y} \tilde{\Pr}(x, y) \log \Pr_\Lambda(y \mid x) is maximized. Let

O(\Lambda) = \sum_{x,y} \tilde{\Pr}(x, y) \log \Pr_\Lambda(y \mid x)

Now, suppose Λ is changed by ∆ (that means each λ_i is changed by δ_i):

O(\Lambda + \Delta) = \sum_{x,y} \tilde{\Pr}(x, y) \log \Pr_{\Lambda + \Delta}(y \mid x)

The change in the objective from Λ to Λ + ∆ is

O(\Lambda + \Delta) - O(\Lambda) = \sum_{x,y} \tilde{\Pr}(x, y) \log \Pr_{\Lambda + \Delta}(y \mid x) - \sum_{x,y} \tilde{\Pr}(x, y) \log \Pr_\Lambda(y \mid x)

Putting in the values from 3.6 and writing Z_\Lambda(x) = \sum_y \exp(\sum_{i=1}^{m} \lambda_i f_i(x, y)),

O(\Lambda + \Delta) - O(\Lambda) = \sum_{x,y} \tilde{\Pr}(x, y) \log \frac{\exp(\sum_i (\lambda_i + \delta_i) f_i(x, y))}{Z_{\Lambda + \Delta}(x)} - \sum_{x,y} \tilde{\Pr}(x, y) \log \frac{\exp(\sum_i \lambda_i f_i(x, y))}{Z_\Lambda(x)}
= \sum_{x,y} \tilde{\Pr}(x, y) \sum_i \delta_i f_i(x, y) - \sum_{x,y} \tilde{\Pr}(x, y) \log \frac{Z_{\Lambda + \Delta}(x)}{Z_\Lambda(x)}
= \sum_{x,y} \tilde{\Pr}(x, y) \sum_i \delta_i f_i(x, y) - \sum_x \tilde{\Pr}(x) \log \frac{Z_{\Lambda + \Delta}(x)}{Z_\Lambda(x)}

We now make use of the inequality, for a > 0,

-\log(a) \geq 1 - a

which gives

O(\Lambda + \Delta) - O(\Lambda) \geq \sum_{x,y} \tilde{\Pr}(x, y) \sum_i \delta_i f_i(x, y) + 1 - \sum_x \tilde{\Pr}(x) \frac{Z_{\Lambda + \Delta}(x)}{Z_\Lambda(x)}
= \sum_{x,y} \tilde{\Pr}(x, y) \sum_i \delta_i f_i(x, y) + 1 - \sum_x \tilde{\Pr}(x) \sum_y \Pr_\Lambda(y \mid x) \exp\left( \sum_i \delta_i f_i(x, y) \right)

Renaming the RHS A(Λ, ∆):

A(\Lambda, \Delta) = \sum_{x,y} \tilde{\Pr}(x, y) \sum_i \delta_i f_i(x, y) + 1 - \sum_x \tilde{\Pr}(x) \sum_y \Pr_\Lambda(y \mid x) \exp\left( \sum_i \delta_i f_i(x, y) \right)

Since O(Λ + ∆) − O(Λ) ≥ A(Λ, ∆), any ∆ for which A(Λ, ∆) > 0 improves the objective. Let the total count of features be T(x, y) = \sum_i f_i(x, y). Using this, we can write A(Λ, ∆) as

A(\Lambda, \Delta) = \sum_{x,y} \tilde{\Pr}(x, y) \sum_i \delta_i f_i(x, y) + 1 - \sum_x \tilde{\Pr}(x) \sum_y \Pr_\Lambda(y \mid x) \exp\left( T(x, y) \sum_i \frac{f_i(x, y)}{T(x, y)} \delta_i \right)

By the definition of T(x, y), the terms f_i(x, y)/T(x, y) form a probability distribution over the features. Hence we can apply Jensen's inequality,

\exp\left( \sum_x p(x) q(x) \right) \leq \sum_x p(x) \exp(q(x))

to obtain

A(\Lambda, \Delta) \geq \sum_{x,y} \tilde{\Pr}(x, y) \sum_i \delta_i f_i(x, y) + 1 - \sum_x \tilde{\Pr}(x) \sum_y \Pr_\Lambda(y \mid x) \sum_i \frac{f_i(x, y)}{T(x, y)} \exp(\delta_i T(x, y))

Let the RHS of this inequality be B(∆, Λ). Since O(Λ + ∆) − O(Λ) ≥ A(Λ, ∆) ≥ B(∆, Λ), B(∆, Λ) gives a lower bound on O(Λ + ∆) − O(Λ).

Differentiating B(∆, Λ) with respect to δ_i gives

\frac{\partial B(\Delta, \Lambda)}{\partial \delta_i} = \sum_{x,y} \tilde{\Pr}(x, y) f_i(x, y) - \sum_x \tilde{\Pr}(x) \sum_y \Pr_\Lambda(y \mid x) f_i(x, y) \exp(\delta_i T(x, y))

Equating to 0,

\sum_{x,y} \tilde{\Pr}(x, y) f_i(x, y) = \sum_x \tilde{\Pr}(x) \sum_y \Pr_\Lambda(y \mid x) f_i(x, y) \exp(\delta_i T(x, y))   (3.7)

Solving this equation we get the update δ_i.

3.3.2 Training Algorithm

Using the parameter estimation objective given in the previous discussion, we can write the training algorithm for the maximum entropy model as:

Algorithm 1 Algorithm for training a MaxEnt model

Input: feature functions f_1, f_2, ..., f_m; empirical distribution \tilde{\Pr}(x, y)
Output: optimal parameter values Λ; optimal model \Pr_\Lambda(y \mid x)
repeat
  for i = 1 to m do
    let δ_i be the solution to 3.7
    update λ_i^{n+1} ← λ_i^n + δ_i
  end for
until Λ has converged

The key step in the algorithm is finding the solution to 3.7. If T(x, y) = M for all x, y, then 3.7 can be solved in closed form:

\delta_i = \frac{1}{M} \log \frac{\sum_{x,y} \tilde{\Pr}(x, y) f_i(x, y)}{\sum_x \tilde{\Pr}(x) \sum_y \Pr_\Lambda(y \mid x) f_i(x, y)}

which becomes the update rule.

3.4 Inferencing

The Maximum Entropy model is a discriminative model, as opposed to the HMM, which is a generative model; graphical representations of both are given in Fig. 3.1. Being a generative model, the HMM assumes that the observation is generated by the state, whereas the Maximum Entropy model makes no such statement. In the alpha-beta notation, the recurrence for the probability of being in state y for the HMM was

\alpha_{t+1}(y) = \sum_{y' \in Y} \Pr(y \mid y') \Pr(x_{t+1} \mid y) \, \alpha_t(y')

but in the case of the Maximum Entropy model this equation changes to

\alpha_{t+1}(y) = \sum_{y' \in Y} \Pr_{y'}(y \mid x_{t+1}) \, \alpha_t(y')

where \Pr_{y'}(y \mid x_{t+1}) means we use the edge features f(y', y). Thus we can write the equation for the entire tag sequence as

\Pr(Y \mid X) = \Pr_{start}(y_1 \mid X) \times \Pr_{y_1}(y_2 \mid X) \times \Pr_{y_2}(y_3 \mid X) \times \dots

or in general

\Pr(Y \mid X) = \prod_i \Pr_{y_{i-1}}(y_i \mid X)
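The forward recursion above can be sketched with the local conditionals given as a toy table (the numbers are invented; a real MEMM would compute them from equation 3.6):

```python
states = ["N", "V"]
# Toy local model Pr_{y'}(y | x); for simplicity it ignores the observation x.
local = {("N", "N"): 0.3, ("N", "V"): 0.7,
         ("V", "N"): 0.6, ("V", "V"): 0.4}

def forward(xs, start):
    """Run the MEMM forward recursion over the observation sequence xs."""
    alpha = dict(start)                       # alpha_1(y) = Pr_start(y | x_1)
    for x in xs[1:]:
        alpha = {y: sum(local[(yp, y)] * alpha[yp] for yp in states)
                 for y in states}
    return alpha

alpha = forward(["he", "plays", "violin"], {"N": 0.8, "V": 0.2})
```

Because each local conditional sums to 1 over the next state, the alphas remain a proper distribution at every step.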


Condition         Features
x_i is not rare   x_i = W & y_i = K
x_i is rare       W is a prefix of x_i, |W| ≤ 4 & y_i = K
                  W is a suffix of x_i, |W| ≤ 4 & y_i = K
                  x_i contains a number & y_i = K
                  x_i contains an upper-case character & y_i = K
                  x_i contains a hyphen & y_i = K
∀ x_i             y_{i−1} = T & y_i = K
                  y_{i−2} y_{i−1} = T_2 T_1 & y_i = K
                  x_{i−2} = W & y_i = K
                  x_{i−1} = W & y_i = K
                  x_{i+1} = W & y_i = K
                  x_{i+2} = W & y_i = K

Table 3.1: Table summarizing the features used by [6]

3.5 Application

This section discusses the application of the Maximum Entropy Markov model to POS tagging as described in [6]. As we have seen in earlier sections, features play an important role in maximum entropy models, so emphasis is given to which features are used for the POS tagging application. One of the important issues in POS tagging is how to handle words which do not appear in the training dataset; this section also discusses the unknown-word features used by [6].

3.5.1 Feature Selection for POS tagging

Features mainly depend upon the local history of the current word. Typically, for tagging the word x_i the history h_i is chosen as

h_i = (x_i, x_{i-1}, x_{i-2}, y_{i-1}, y_{i-2})

The typical feature set for the POS tagging application is given in Table 3.1.

• When the word x_i is not rare, which means x_i occurs in the training dataset, we can use the information in the proximity of that word for modeling the features related to it. But,

• when the word x_i is rare, which means x_i does not occur in the training dataset, we cannot use that information. In order to find features for such a word we look at the proximity of words which "look like" the new word x_i; by "look like" we mean we check the suffixes and prefixes of the word. Example: if a word has the suffix "tion" then it is most likely a noun. Also, unknown words are usually proper nouns; to check whether a word is a proper noun, we check whether it contains a capital character — a word beginning with a capital character is more likely to be a proper noun. Numbers are given a separate tag.

• Along with the word-based features discussed in the previous points, there are features which take care of the interaction between successive labels. In sequence labeling it is believed that the label at position i is influenced by its neighbors, so the tags of the two preceding words, y_{i−2} y_{i−1}, are considered while fixing y_i. This is called the tri-gram assumption: co-occurrence of up to 3 labels is considered.
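The unknown-word features of Table 3.1 can be sketched as a small extraction function (the helper name and feature encoding are ours):

```python
def rare_word_features(word, tag):
    """Extract unknown-word features in the spirit of Table 3.1: prefixes and
    suffixes of length up to 4, plus word-shape features."""
    k_max = min(4, len(word))
    feats = [("prefix", word[:k], tag) for k in range(1, k_max + 1)]
    feats += [("suffix", word[-k:], tag) for k in range(1, k_max + 1)]
    if any(ch.isdigit() for ch in word):
        feats.append(("has-number", tag))
    if any(ch.isupper() for ch in word):
        feats.append(("has-uppercase", tag))
    if "-" in word:
        feats.append(("has-hyphen", tag))
    return feats

feats = rare_word_features("Information", "N")
```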

3.5.2 Example

Consider an example sentence: "Young generation chooses Information Technology." Suppose we are currently tagging the word "Information". The Maximum Entropy model will check which features fire for each possible tag of this word:


• x_i = Information and y_i = N

• x_i = Information and y_i = V

• x_i = Information and y_i = ADJ

• x_i = Information and y_i = ADV

It will also check whether in the training data the word "Information" is preceded by

• x_{i−1} = chooses

• x_{i−2} = generation

Following the tri-gram feature, the model will also consider the likely label sequence of "generation chooses" and whether that sequence can be followed by "Information". Also, given the capitalized first character and the noun-preferring suffix "tion", features for the proper noun tag will fire.


Chapter 4

Cyclic Dependency Network

4.1 From HMM to Cyclic Dependency Network

In a sequence labeling task, the probability of a label sequence is obtained from the probabilities of local label configurations. If we consider a sentence as a chain graph, then this local model consists only of the neighbors of the node under investigation: the label y_i is influenced by both y_{i−1} and y_{i+1}. However, a Hidden Markov Model considers only one direction of influence: in a left-to-right HMM, y_i is influenced by y_{i−1}, whereas in a right-to-left HMM, y_i is influenced by y_{i+1}. In a unidirectional HMM the interaction between y_{i−1} and y_i is explicit, while the interaction between y_i and y_{i+1} is only implicitly considered while calculating the label at y_{i+1}. It is intuitive that one should consider the effect of both y_{i−1} and y_{i+1} while deciding the label at y_i. A cyclic dependency network jointly considers the effect of both the preceding and the succeeding labels on the current label.

Figure 4.1: (A) Left-right unidirectional CMM, (B) right-left CMM, (C) cyclic dependency network

Example: The example given in [Manning] illustrates the above phenomenon. In the sequence "Will to fight", 'will' plays the role of a noun rather than a modal verb. But this effect is not captured by a unidirectional HMM, because Pr(<MOD> | will, <start>) is higher than Pr(<Noun> | will, <start>) and the lexical probability Pr(<TO> | to) is 1; hence the unidirectional model fails to tag 'will' as <Noun>.

4.2 Cyclic Dependency Network

A very simple unidirectional dependency network is shown in Fig. 4.2. We can factorize such graphs easily using the chain rule as

\Pr(X) = \prod_i \Pr(X_i \mid Pa(X_i))

where Pa(X_i) denotes the parents of node X_i.

Figure 4.2: (A) Left-right unidirectional dependency network, (B) right-left unidirectional dependency network, (C) cyclic dependency network

Using this formula for the network in Fig. 4.2(A) we get Pr(a, b) = Pr(a) Pr(b|a). Similarly, for the network in Fig. 4.2(B) we can write Pr(a, b) = Pr(b) Pr(a|b). But if we try to apply the chain rule to the network in Fig. 4.2(C) we get Pr(a, b) = Pr(b|a) Pr(a|b). [Manning] gives a very simple observation set showing that the chain rule does not hold for cyclic dependency networks such as Fig. 4.2(C).

Example: Suppose we have two random variables a, b represented by the dependency network in Fig. 4.2(C), and the observed sample over (a, b) is <(1,1), (1,1), (1,1), (1,2), (2,1), (3,3)>. In this sample Pr(a = 1, b = 1) = 0.5, while Pr(a = 1 | b = 1) = 3/4 and Pr(b = 1 | a = 1) = 3/4. So applying the chain rule we get

Pr(a = 1, b = 1) = Pr(a = 1 | b = 1) Pr(b = 1 | a = 1) = 3/4 × 3/4 = 9/16

which is wrong.

Thus we cannot use the chain rule to compute the joint distribution over random variables that follow a cyclic dependency.
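The counterexample can be reproduced numerically from the sample counts:

```python
from collections import Counter

# The observed sample over (a, b) from [Manning]'s counterexample.
sample = [(1, 1), (1, 1), (1, 1), (1, 2), (2, 1), (3, 3)]
n = len(sample)
joint = Counter(sample)

pr_ab = joint[(1, 1)] / n                      # true joint Pr(a=1, b=1)
pr_a_given_b = joint[(1, 1)] / sum(c for (a, b), c in joint.items() if b == 1)
pr_b_given_a = joint[(1, 1)] / sum(c for (a, b), c in joint.items() if a == 1)
chain_rule_product = pr_a_given_b * pr_b_given_a   # the "chain rule" estimate
```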

4.3 Inferencing in Cyclic Dependency Network

Though we cannot easily compute the joint probability distribution, our main aim is not to find the joint probability but to find the highest scoring sequence of labels. Hence we can use the product term prod_i Pr(X_i | Pa(X_i)) to score the sequences labeled by our model:

score(X) = prod_i Pr(X_i | Pa(X_i))

Using this score we can use a modified Viterbi algorithm to find the maximizing label sequence in polynomial time. The pseudocode for the algorithm is given in [Manning] as:

Algorithm 2 Pseudocode for inference in the first-order bidirectional cyclic dependency model

function bestScore()
    return bestScoreSub(n + 2, <end, end, end>);

function bestScoreSub(i + 1, <t_{i-1}, t_i, t_{i+1}>)
    /* left boundary case */
    if i = -1 then
        if <t_{i-1}, t_i, t_{i+1}> == <start, start, start> then
            return 1;
        else
            return 0;
        end if
    end if
    /* recursive case */
    return max_{t_{i-2}} bestScoreSub(i, <t_{i-2}, t_{i-1}, t_i>) * Pr(t_i | t_{i-1}, t_{i+1}, w_i);

The state diagram for the inference algorithm can be seen below. In this diagram, 1, 2 and 3 are sites for which labels are to be inferred. Each of these sites can take labels A, B and C. The base case for the recursion is defined as <t_{i-1}, t_i, t_{i+1}> == <start, start, start> when i = -1.

Figure 4.3: State diagram for the inference algorithm for the Cyclic Dependency Network

In the next step, when i = 0, the best score is obtained by the recursive step bestScoreSub(i, <t_{i-2}, t_{i-1}, t_i>) * Pr(t_i | t_{i-1}, t_{i+1}, w_i). bestScoreSub returns 1 because i = -1. Pr(t_i | t_{i-1}, t_{i+1}, w_i) is calculated from the model. Each time the window is at position i, instead of receiving Pr(t_i | t_{i-1}, t_{i+1}, w_i) one receives the score for Pr(t_{i-1} | t_{i-2}, t_i, w_{i-1}).

Example: When the window is at state 2A, the score received is the maximum over all states at i - 2 (here only the single state start) of the best scores for <start, 1A, 2A>, <start, 1B, 2A> and <start, 1C, 2A>. Similarly, when the window is at state 2B, the scores received are the best scores for <start, 1A, 2B>, <start, 1B, 2B> and <start, 1C, 2B>, and the process continues.
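The recursion above can equivalently be run as a dynamic program over sliding tag windows, multiplying in the factor at position i once its right context is fixed. Below is a minimal sketch: the local model is a made-up scoring function standing in for Pr(t_i | t_{i-1}, t_{i+1}, w_i), and the result is checked against brute-force enumeration of all tag sequences.

```python
import itertools
import math

# Toy instance: n positions, tags 'A'/'B'; local_score stands in for the
# trained model Pr(t_i | t_{i-1}, t_{i+1}, w_i); its numbers are made up.
n, TAGS = 3, ('A', 'B')

def local_score(t_prev, t, t_next, i):
    s = 0.6 if t == 'A' else 0.4   # hypothetical lexical preference
    if t == t_prev:
        s *= 1.5                   # agreement with the left neighbour
    if t == t_next:
        s *= 1.2                   # agreement with the right neighbour
    return s

def seq_score(ts):
    """Product of all n local factors for a full tag sequence t_1..t_n."""
    padded = ('start',) + tuple(ts) + ('end',)
    return math.prod(local_score(padded[i - 1], padded[i], padded[i + 1], i)
                     for i in range(1, n + 1))

def best_score():
    """DP over windows <t_{i-1}, t_i>: the factor at position i is
    multiplied in once its right context t_{i+1} is chosen."""
    best = {('start', b): 1.0 for b in TAGS}
    for i in range(1, n + 1):
        nxt = {}
        for c in (TAGS if i < n else ('end',)):
            for (a, b), v in best.items():
                cand = v * local_score(a, b, c, i)
                if cand > nxt.get((b, c), 0.0):
                    nxt[(b, c)] = cand
        best = nxt
    return max(best.values())

# Sanity check against brute-force enumeration of all |TAGS|^n sequences.
brute = max(seq_score(ts) for ts in itertools.product(TAGS, repeat=n))
```

The DP visits O(n |TAGS|^3) windows instead of the |TAGS|^n full sequences, which is the point of the modified Viterbi algorithm.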


Chapter 5

Conditional Random Fields

5.1 Label Bias Problem

As we saw in chapter 3, Maximum Entropy models normalize scores at each stage, imposing the constraint sum_y Pr(y|x) = 1. This leads to the label bias problem. Consider Fig. 5.1A, in which all the state transitions are given scores. In Fig. 5.1B all those scores are normalized after every state, and the corresponding probability values are obtained. Maximum Entropy models use these probabilities to find the highest scoring label sequence. The label bias problem occurs when a particular state is biased towards only a few of all the possible next states. Consider state 1A, for example. This state gives a score of 10 to both 2A and 2B, and a score of 0 to all the other states in stage 2. Since scores are normalized after every state, the probability of reaching 2A and 2B becomes 0.5.

On the other hand, consider state 1B, which gives a score of 35 to 2A and 2B, and 10 to all the others. After normalization the scores become 0.35 for 2A and 2B, and 0.1 for all the others. Though the actual transition scores from 1A are smaller than those from 1B, after normalization their probability values go up, and hence the maximum entropy model ends up choosing the path with the lower score. The solution to this problem is not to normalize the scores after each state, but to normalize at the end with respect to all possible path scores. This gives rise to Conditional Random Fields.
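The arithmetic of this example is easy to check; a small sketch, assuming five successor states 2A..2E (the count that matches the normalized values 0.35/0.1 quoted above):

```python
# Raw transition scores from the label-bias example.
scores_1A = {'2A': 10, '2B': 10, '2C': 0, '2D': 0, '2E': 0}
scores_1B = {'2A': 35, '2B': 35, '2C': 10, '2D': 10, '2E': 10}

def local_normalize(scores):
    """Per-state normalization, as a Maximum Entropy model performs."""
    z = sum(scores.values())
    return {s: v / z for s, v in scores.items()}

p_1A = local_normalize(scores_1A)   # 2A -> 0.5
p_1B = local_normalize(scores_1B)   # 2A -> 0.35

# After per-state normalization the weaker raw transition (score 10 from 1A)
# outweighs the stronger one (score 35 from 1B): 0.5 > 0.35.
```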

Figure 5.1: (A) Transitions with un-normalized scores, (B) transitions with probability values

5.2 Conditional Random Field

Conditional Random Fields (CRF) are defined by [Lafferty] as follows. Let G = (V, E) be a graph such that Y = (Y_v), v in V, so that Y is indexed by the vertices of G. Then (X, Y) is a conditional random field when every random variable Y_v, conditioned on X, obeys the Markov property with respect to the graph G.

Here X is the input sequence, also called the observation sequence, a set of observations X = (x_1, x_2, ...). And Y is the output sequence, also called the label sequence, Y = (y_1, y_2, ...).

5.2.1 Formulation

The expression for a CRF is derived as

Pr(Y|X) = Pr(X, Y) / Pr(X) = Pr(X, Y) / sum_Y Pr(X, Y)

By definition every CRF is an MRF, and as seen in chapter 2, every MRF follows a Gibbs distribution, so

Pr(Y|X) = [(1/Z) exp(sum_{c in C} V_c(X_c, Y_c))] / [(1/Z) sum_Y exp(sum_{c in C} V_c(X_c, Y_c))]
        = exp(sum_{c in C} V_c(X_c, Y_c)) / sum_Y exp(sum_{c in C} V_c(X_c, Y_c))

So a CRF can be mathematically written as

Pr(Y|X) = exp(sum_{c in C} V_c(X_c, Y_c)) / Z(X)    (5.1)

where V_c(X_c, Y_c) is the clique potential over clique c. As discussed in previous chapters, since in text labeling tasks the cliques are nodes and pairs of nodes, V_c(X_c, Y_c) particularly takes the form of node potentials and edge potentials. Such potentials are specified using local features f_i and corresponding weights lambda_i. Each feature can be a node feature, also called a state feature, s(y_j, X, j); or an edge feature, also called a transition feature, t(y_{j-1}, y_j, X, j), for every position j.

And Z(X) = sum_Y exp(sum_{c in C} V_c(X_c, Y_c)). It can be seen in the CRF equation that the normalizing constant Z(X) is over all possible state sequences Y. Thus scores are not normalized at every state but kept as they are until the end, where they are finally converted to a probability distribution by dividing by the normalizing constant. Thus CRFs do not suffer from the label bias problem.

Suppose there are M features f_1, ..., f_M in total, with corresponding weights lambda_1, ..., lambda_M. The clique potentials can then be obtained by summing up lambda_i f_i, and hence (5.1) can be written as

Pr(Y|X) = exp(sum_{j=1}^n sum_{i=1}^M lambda_i f_i(y_{j-1}, y_j, X, j)) / Z(X)    (5.2)

where Z(X) = sum_Y exp(sum_{j=1}^n sum_{i=1}^M lambda_i f_i(y_{j-1}, y_j, X, j)).

The task of finding features and appropriate weights is carried out during training.
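Equation 5.2 can be evaluated by brute force on a tiny example; in the sketch below the labels, sentence, features, and weights are all made up, and Z(X) is computed by enumerating every label sequence.

```python
import itertools
import math

# A tiny linear-chain CRF evaluated by brute force, directly following
# Eq. 5.2. Hypothetical setup: labels {'N','V'}, a 3-word sentence,
# two made-up features with weights lambda = (1.5, 2.0).
LABELS = ('N', 'V')
X = ('dogs', 'chase', 'cats')

def f1(y_prev, y, X, j):     # transition feature: N followed by V
    return 1.0 if (y_prev, y) == ('N', 'V') else 0.0

def f2(y_prev, y, X, j):     # state feature: word 'chase' tagged V
    return 1.0 if (X[j] == 'chase' and y == 'V') else 0.0

features, lambdas = (f1, f2), (1.5, 2.0)

def unnorm(Y):
    """exp of the weighted feature sum over all positions j (Eq. 5.2 numerator)."""
    s = 0.0
    for j in range(len(X)):
        y_prev = Y[j - 1] if j > 0 else 'start'
        s += sum(l * f(y_prev, Y[j], X, j) for l, f in zip(lambdas, features))
    return math.exp(s)

# Z(X): sum of the unnormalized scores over ALL label sequences.
Z = sum(unnorm(Y) for Y in itertools.product(LABELS, repeat=len(X)))

def pr(Y):
    return unnorm(Y) / Z
```

Because normalization happens once, globally, the resulting Pr(Y|X) sums to 1 over all sequences; this is exactly the global normalization that avoids the label bias problem.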

5.3 Training

During training, data is given in the form of observation sequence and label sequence pairs T = (X, Y). The CRF is formulated using a set of features and corresponding weights. The two main tasks carried out during the training phase are

• Parameter Estimation

• Efficient Feature Selection


5.3.1 Parameter Estimation

During parameter estimation the lambda_i are estimated so that the probability model best describes the training data. Methods for parameter estimation are:

1. Penalized Maximum Likelihood Method
2. L-BFGS Method
3. Voted Perceptron Method

Penalized Maximum Likelihood Method

In the penalized maximum likelihood method, the log-likelihood of the probability model is maximized with a penalty added to avoid overfitting. The log-likelihood of a CRF is

L(T) = sum_{(X,Y) in T} log Pr(Y|X)
     = sum_{(X,Y) in T} log [ exp(sum_{j=1}^n sum_{i=1}^M lambda_i f_i(y_{j-1}, y_j, X, j)) / sum_{Y'} exp(sum_{j=1}^n sum_{i=1}^M lambda_i f_i(y'_{j-1}, y'_j, X, j)) ]

To avoid overfitting, the likelihood is penalized with the term -sum_{i=1}^M lambda_i^2 / 2sigma^2:

L(T) = sum_{(X,Y) in T} sum_{j=1}^n sum_{i=1}^M lambda_i f_i(y_{j-1}, y_j, X, j)
       - sum_{(X,Y) in T} log ( sum_{Y'} exp(sum_{j=1}^n sum_{i=1}^M lambda_i f_i(y'_{j-1}, y'_j, X, j)) )
       - sum_{i=1}^M lambda_i^2 / 2sigma^2

The partial derivatives of the three terms with respect to lambda_k are

d/d(lambda_k) [first term] = sum_{(X,Y) in T} sum_{j=1}^n f_k(y_{j-1}, y_j, X, j)

d/d(lambda_k) [second term] = - sum_{(X,Y) in T} sum_{Y'} Pr_Lambda(Y'|X) sum_{j=1}^n f_k(y'_{j-1}, y'_j, X, j)

d/d(lambda_k) [penalty term] = - lambda_k / sigma^2

Equating dL(T)/d(lambda_k) to 0,

sum_{(X,Y) in T} sum_{j=1}^n f_k(y_{j-1}, y_j, X, j) - sum_{(X,Y) in T} sum_{Y'} Pr_Lambda(Y'|X) sum_{j=1}^n f_k(y'_{j-1}, y'_j, X, j) - lambda_k / sigma^2 = 0

that is, the observed feature count must match the expected feature count up to the penalty term. Solving these equations, the optimal Lambda is obtained.

The expectation term in the above equation involves a summation over all possible tag sequences. This can be efficiently calculated using the alpha-beta message passing given by [3]:

alpha_j(s|X) = sum_{s' in Pred(s)} alpha_{j-1}(s'|X) * exp(sum_{i=1}^M lambda_i f_i(y_{j-1} = s', y_j = s, X, j))

beta_j(s|X) = sum_{s' in Succ(s)} beta_{j+1}(s'|X) * exp(sum_{i=1}^M lambda_i f_i(y_j = s, y_{j+1} = s', X, j+1))


Using these definitions,

sum_{(X,Y) in T} sum_{Y'} Pr_Lambda(Y'|X) sum_{j=1}^n f_k(y'_{j-1}, y'_j, X, j)
= sum_{(X,Y) in T} (1 / Z_Lambda(X)) sum_{j=1}^n sum_{s in S} sum_{s' in Succ(s)} f_k(s, s', X, j) * alpha_j(s|X) exp(sum_{i=1}^M lambda_i f_i(y_j = s, y_{j+1} = s', X, j)) beta_{j+1}(s'|X)
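As a sketch of how the alpha-beta quantities yield both Z(X) and an expected feature count, the toy model below has a single hypothetical weighted feature (weight 1.2, firing when the label repeats); none of the numbers come from the report, and the alpha-beta results are checked against brute-force enumeration.

```python
import itertools
import math

# Alpha-beta (forward-backward) computation of Z(X) and an expected feature
# count, checked against brute force. Toy model: labels {'A','B'}, n = 3.
LABELS, n = ('A', 'B'), 3

def psi(s_prev, s, j):
    """exp(sum_i lambda_i f_i) for a single made-up feature of weight 1.2."""
    return math.exp(1.2 if s_prev == s else 0.0)

def f_k(s_prev, s, j):
    """The feature whose expectation we want: fires when the label repeats."""
    return 1.0 if s_prev == s else 0.0

# Forward (alpha) pass: sum over predecessors.
alpha = [{s: psi('start', s, 0) for s in LABELS}]
for j in range(1, n):
    alpha.append({s: sum(alpha[j - 1][sp] * psi(sp, s, j) for sp in LABELS)
                  for s in LABELS})
Z = sum(alpha[n - 1].values())

# Backward (beta) pass: sum over successors.
beta = [None] * n
beta[n - 1] = {s: 1.0 for s in LABELS}
for j in range(n - 2, -1, -1):
    beta[j] = {s: sum(psi(s, sn, j + 1) * beta[j + 1][sn] for sn in LABELS)
               for s in LABELS}

# Expected count of f_k over transitions, via alpha * factor * beta / Z.
expected = sum(alpha[j][s] * psi(s, sn, j + 1) * beta[j + 1][sn]
               * f_k(s, sn, j + 1)
               for j in range(n - 1) for s in LABELS for sn in LABELS) / Z

# Brute-force check over all label sequences.
def weight(Y):
    Yp = ('start',) + Y
    return math.prod(psi(Yp[j], Yp[j + 1], j) for j in range(n))

Zb = sum(weight(Y) for Y in itertools.product(LABELS, repeat=n))
Eb = sum(weight(Y) * sum(f_k(Y[j], Y[j + 1], j + 1) for j in range(n - 1))
         for Y in itertools.product(LABELS, repeat=n)) / Zb
```

The alpha-beta version touches O(n |S|^2) transitions rather than the |S|^n full sequences, which is what makes the gradient computation tractable.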

L-BFGS Method

Using Taylor's expansion,

L(Lambda + Delta) = L(Lambda) + Delta^T grad(Lambda) + (1/2) Delta^T H(Lambda) Delta

Differentiating with respect to Delta,

d/d(Delta) L(Lambda + Delta) = grad(Lambda) + H(Lambda) Delta

and equating to 0 to get the optimal step Delta:

grad(Lambda) + H(Lambda) Delta = 0
Delta = -H^{-1}(Lambda) grad(Lambda)

so the update rule becomes

Lambda_{k+1} = Lambda_k + Delta

The inverse of the Hessian is computationally hard to compute, so it is approximated. Writing B = H^{-1}(Lambda) for notational convenience,

B_{k+1} = B_k + (s_k s_k^T / y_k^T s_k)(y_k^T B_k y_k / y_k^T s_k + 1) - (1 / y_k^T s_k)(s_k y_k^T B_k + B_k y_k s_k^T)

where

s_k = Lambda_k - Lambda_{k-1}

y_k = grad_k - grad_{k-1}

Voted Perceptron Method

The perceptron uses misclassified examples to train the weights. In a CRF, after every iteration t the training examples are labeled using the current weights Lambda_t. This labeled observation sequence is used to obtain a feature vector. The difference between the feature vector obtained from the actual training data and the feature vector obtained from the probability model is taken as the step by which to change the weights:

Delta_t = F(Y, X) - F(Y', X)

where Y' is the best-scoring label sequence obtained using the model, and

Lambda_{t+1} = Lambda_t + Delta_t

In the voted perceptron, to avoid overfitting, the unweighted average of the Lambda_t over all iterations is taken, and that average is used as the final weights.
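A minimal sketch of the averaged variant of this update for a chain labeler; the feature map F, the two training pairs, and the brute-force decoder are all made up for illustration.

```python
import itertools

# Averaged structured-perceptron sketch for a toy chain labeler.
LABELS = ('N', 'V')

def F(Y, X):
    """Global feature vector: counts of (word, label) and (label, label)."""
    feats = {}
    for j, (w, y) in enumerate(zip(X, Y)):
        feats[(w, y)] = feats.get((w, y), 0) + 1
        if j > 0:
            feats[(Y[j - 1], y)] = feats.get((Y[j - 1], y), 0) + 1
    return feats

def score(weights, Y, X):
    return sum(weights.get(k, 0.0) * v for k, v in F(Y, X).items())

def decode(weights, X):
    """Brute-force argmax over label sequences (fine for this tiny demo)."""
    return max(itertools.product(LABELS, repeat=len(X)),
               key=lambda Y: score(weights, Y, X))

data = [(('dogs', 'bark'), ('N', 'V')), (('cats', 'sleep'), ('N', 'V'))]
weights, totals, steps = {}, {}, 0
for _ in range(5):
    for X, Y in data:
        Yp = decode(weights, X)
        if Yp != Y:  # update only on mistakes: Lambda += F(Y,X) - F(Y',X)
            for k, v in F(Y, X).items():
                weights[k] = weights.get(k, 0.0) + v
            for k, v in F(Yp, X).items():
                weights[k] = weights.get(k, 0.0) - v
        for k, v in weights.items():   # accumulate for the final average
            totals[k] = totals.get(k, 0.0) + v
        steps += 1
avg_weights = {k: v / steps for k, v in totals.items()}
```

The averaging at the end is the anti-overfitting step described above: the returned weights are the unweighted mean of the weight vectors seen after every example.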


5.3.2 Feature Selection

Typically the features are sets of hand-crafted rules in the form of templates, some of which are discussed in [6]. [4] give an algorithm to select, at each step, the features that will most increase the log-likelihood:

1. Start with no features.
2. Consider a set of new candidate features.
3. Select for inclusion only those candidate features that will most increase the log-likelihood of the correct state path.
4. Train weights for all included features.
5. Iterate from step 2.

The probability after adding a new feature g with weight mu can be efficiently calculated using the formula given by [4]:

Pr_{Lambda+g,mu}(Y|X) = Pr_Lambda(Y|X) exp(sum_{t=1}^T mu g(y_{t-1}, y_t, X, t)) / Z_X(Lambda, g, mu)

where

Z_X(Lambda, g, mu) = sum_Y Pr_Lambda(Y|X) exp(sum_{t=1}^T mu g(y_{t-1}, y_t, X, t))

The log-likelihood of the training data after adding the new feature is

L_{Lambda,g,mu} = ( sum_{i=1}^N log Pr_{Lambda,g,mu}(y^(i)|x^(i)) ) - mu^2/2sigma^2 - sum_{k=1}^K lambda_k^2/2sigma^2

which is approximated as

L_{Lambda,g,mu} = ( sum_{i=1}^N sum_{t=1}^{T_i} log Pr_{Lambda,g,mu}(y_t^(i)|x^(i), t) ) - mu^2/2sigma^2 - sum_{k=1}^K lambda_k^2/2sigma^2

Rather than summing over all output variables of all instances, sum_{i=1}^N sum_{t=1}^{T_i}, it is more efficient to sum only over the variables that are not correctly labeled. Let x^(i), i = 1...M, be such variables; then the likelihood is

L_{Lambda,g,mu} = ( sum_{i=1}^M log Pr_{Lambda,g,mu}(y_t^(i)|x^(i), t) ) - mu^2/2sigma^2 - sum_{k=1}^K lambda_k^2/2sigma^2

The gain of a feature is defined as the improvement in log-likelihood:

G_Lambda(g, mu) = L_{Lambda,g,mu} - L_Lambda
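A sketch of evaluating a candidate feature's gain using the formulas above; here Pr_Lambda(Y|X) is given as an explicit toy distribution over the four label sequences of length 2, and the candidate feature g, the weight mu, and sigma^2 are all assumptions made for illustration.

```python
import itertools
import math

# Toy setup: one training sentence of length T = 2, uniform base model.
LABELS, T = ('A', 'B'), 2
X, Y_true = ('w1', 'w2'), ('A', 'B')

seqs = list(itertools.product(LABELS, repeat=T))
pr_lambda = {Y: 1.0 / len(seqs) for Y in seqs}   # Pr_Lambda(Y|X), uniform

def g(y_prev, y, X, t):
    """Candidate transition feature: fires on an A -> B transition."""
    return 1.0 if (y_prev, y) == ('A', 'B') else 0.0

def g_sum(Y):
    Yp = ('start',) + Y
    return sum(g(Yp[t], Yp[t + 1], X, t) for t in range(T))

def pr_with_g(Y, mu):
    """Pr_{Lambda+g,mu}(Y|X) via the re-weighting formula above."""
    z = sum(pr_lambda[Yq] * math.exp(mu * g_sum(Yq)) for Yq in seqs)
    return pr_lambda[Y] * math.exp(mu * g_sum(Y)) / z

def gain(mu, sigma2=10.0):
    """Improvement in penalized log-likelihood of the true labeling."""
    return (math.log(pr_with_g(Y_true, mu)) - mu * mu / (2 * sigma2)) \
        - math.log(pr_lambda[Y_true])
```

A feature that fires only on the correct sequence should show a positive gain for a suitable mu > 0, which is what the induction algorithm looks for when ranking candidates.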

5.4 Inferencing

The problem of inference is to find the most likely sequence of Y given the observations X. The quantity delta_j(s|X), the highest score along a path ending in s at position j, is defined as

delta_j(s|X) = max_{y_1, y_2, ..., y_{j-1}} Pr(y_1, y_2, ..., y_j = s | X)

This can be efficiently computed using a recursive algorithm:

1. Initialization: The values for all steps from the start state (bottom) to all possible first states s are set to the corresponding factor values.

for all s in S: delta_1(s) = V_1(X, bottom, s)
psi_1(s) = bottom


Label predicates q(y_{i-1}, y_i):
true; y_i = y; y_i = y and y_{i-1} = y'; c(y_i) = c

Observation predicates p(x, i):
w_i = w; w_{i-1} = w; w_{i+1} = w; w_{i-2} = w; w_{i+2} = w;
w_i = w and w_{i-1} = w'; w_i = w and w_{i+1} = w';
t_i = t; t_{i-1} = t; t_{i-2} = t; t_{i+1} = t; t_{i+2} = t;
t_i = t and t_{i-1} = t'; t_i = t and t_{i+1} = t'

Table 5.1: Table summarizing the features used by [7]

2. Recursion: The values for the next steps are computed from the current values and the maximum over all possible preceding states s':

for all s in S, 1 < j <= n: delta_j(s) = max_{s' in S} delta_{j-1}(s') V_j(X, s', s)
psi_j(s) = argmax_{s' in S} delta_{j-1}(s') V_j(X, s', s)

3. Termination:

p* = max_{s in S} delta_n(s)
y*_n = argmax_{s in S} delta_n(s)

4. Path backtracking: The optimal path is recovered from the lattice using the track-keeping values psi_t:

y*_t = psi_{t+1}(y*_{t+1}), where t = n-1, n-2, ..., 1
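The four steps above can be sketched directly; the potential V below is a made-up toy factor (not a trained model), and the decoded path is checked against brute-force enumeration.

```python
import itertools
import math

# Viterbi decoding following steps 1-4, on a toy transition potential.
STATES, n = ('A', 'B'), 4

def V(j, s_prev, s):
    """Hypothetical clique potential exp(weighted features) for s_prev -> s."""
    return math.exp((0.8 if s_prev == s else 0.1) + (0.3 if s == 'A' else 0.0))

# 1. Initialization ('start' plays the role of the bottom symbol).
delta = [{s: V(1, 'start', s) for s in STATES}]
psi = [{s: 'start' for s in STATES}]

# 2. Recursion: max and argmax over preceding states.
for j in range(2, n + 1):
    delta.append({s: max(delta[-1][sp] * V(j, sp, s) for sp in STATES)
                  for s in STATES})
    psi.append({s: max(STATES, key=lambda sp: delta[-2][sp] * V(j, sp, s))
                for s in STATES})

# 3. Termination.
best_last = max(STATES, key=lambda s: delta[-1][s])

# 4. Path backtracking via the track-keeping values psi.
path = [best_last]
for j in range(n - 1, 0, -1):
    path.append(psi[j][path[-1]])
path.reverse()

# Brute-force check over all |STATES|^n sequences.
def seq_score(Y):
    Yp = ('start',) + Y
    return math.prod(V(j, Yp[j - 1], Yp[j]) for j in range(1, n + 1))

brute = max(itertools.product(STATES, repeat=n), key=seq_score)
```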

5.5 Application

5.5.1 Chunking

[7] discuss the application of CRFs to chunking of text. Each word can take one of the tags B, I and O. The CRF is modeled as a second-order Markov network, which means the chunk tag at position i depends on the chunk tags at positions i-1 and i-2. This is achieved by making the CRF labels pairs of consecutive chunk tags, like BI, BO, IO, etc. So the CRF labels each word with two chunk tags: the label at word i is y_i = c_{i-1}c_i, where c_i is the actual chunk tag of word i, and the label at position 0 is explicitly taken to be O. Features are divided into two sets and represented as

f(y_{i-1}, y_i, X, i) = p(x, i) q(y_{i-1}, y_i)

where p(x, i) takes into account the words and POS tags, whereas q(y_{i-1}, y_i) takes into account the influence of one label on the other. The feature templates are summarized in Table 5.1.
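The factored feature construction can be sketched as follows; the sentence, the predicates, and the helper names (make_p, make_q) are illustrative assumptions covering only a small subset of the templates in Table 5.1.

```python
# Sketch of the factored features f(y_{i-1}, y_i, X, i) = p(x, i) * q(y_{i-1}, y_i)
# used for chunking. Labels are pairs of consecutive chunk tags.
CHUNK_TAGS = ('B', 'I', 'O')
LABELS = [a + b for a in CHUNK_TAGS for b in CHUNK_TAGS]  # 'BB', 'BI', ..., 'OO'

def make_p(word):
    """Observation predicate w_i = word (one template from Table 5.1)."""
    return lambda X, i: 1.0 if X[i][0] == word else 0.0

def make_q(label):
    """Label predicate y_i = label (one template from Table 5.1)."""
    return lambda y_prev, y: 1.0 if y == label else 0.0

# Each token is a (word, POS-tag) pair, as the p predicates may use both.
X = [('He', 'PRP'), ('reckons', 'VBZ'), ('the', 'DT'), ('deficit', 'NN')]

def f(p, q, y_prev, y, X, i):
    """A full feature is simply the product of its two predicates."""
    return p(X, i) * q(y_prev, y)

# Example: the feature (w_i = 'the') x (y_i = 'OB') fires at position 2
# when the word 'the' starts a new chunk after an O tag.
value = f(make_p('the'), make_q('OB'), 'OO', 'OB', X, 2)
```

Factoring every feature as p times q keeps the templates small: any observation predicate can combine with any label predicate, so the full feature set is the cross product of the two lists.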

References
