(1)

Data Driven Probabilistic Dependency Parsing

Geetanjali Rakshit

Dept of Computer Science and Engg IIT Bombay

Lecture 19 in CS626, 4th Sept, 2014

(2)

Dependency Parse

• A dependency parse is a directed acyclic graph, which may be unlabeled or labeled with dependency relations.

• All words are either directly or indirectly dependent on the verb.

[Figure: unlabeled and labeled dependency parses of "John hit the ball"; the labeled parse marks the relations sbj, obj and mod on the arcs.]
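To make the representation concrete, here is a minimal sketch (not part of the original lecture) of how such a parse can be stored as a head array plus optional arc labels. The variable names, and the attachment of "the" to "ball", are illustrative assumptions based on the standard analysis of this sentence.

```python
# A hypothetical minimal encoding (indices: 0 = ROOT, 1 = John, 2 = hit, 3 = the,
# 4 = ball); heads[i] is the position of word i's parent, labels[i] its relation.
words  = ["<ROOT>", "John", "hit", "the", "ball"]
heads  = [-1, 2, 0, 4, 2]                        # the unlabeled parse
labels = [None, "sbj", None, "mod", "obj"]       # adding relations gives the labeled parse

# The unlabeled parse is just the set of directed edges (head, dependent):
edges = {(heads[i], i) for i in range(1, len(words))}
print(sorted(edges))   # [(0, 2), (2, 1), (2, 4), (4, 3)]
```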

(3)

Applications

• relation extraction

• machine translation

• synonym generation

• lexical resource augmentation

• paraphrase acquisition

(4)

Projective Parsing

• When the words of the sentence are laid out in linear order, the edges can be drawn above the words without crossings, i.e., each word and its descendants form a contiguous substring of the sentence (a small programmatic check of this property is sketched below).

root John hit the ball with the bat

• Projective trees are sufficient to analyze most sentence types in English.
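The following sketch (my own helper, not MSTParser code) checks projectivity of a head-array encoding by verifying that every word strictly between a head and its dependent is a descendant of that head, which is equivalent to "no crossing arcs". The head assignment for the example sentence is one plausible analysis, not taken from the slide's figure.

```python
# A small projectivity check over a head-array encoding; assumes heads encode a tree.
def is_projective(heads):
    """heads[i] = head of word i; position 0 is an artificial ROOT, heads[0] is unused."""
    n = len(heads)
    for d in range(1, n):
        h = heads[d]
        for k in range(min(h, d) + 1, max(h, d)):
            a = k                         # climb from k towards the root ...
            while a > 0 and a != h:
                a = heads[a]
            if a != h:                    # ... it must pass through h
                return False
    return True

# "John hit the ball with the bat" under one plausible head assignment
# (0 = ROOT, 1 = John, 2 = hit, 3 = the, 4 = ball, 5 = with, 6 = the, 7 = bat):
heads = [-1, 2, 0, 4, 2, 2, 7, 5]
print(is_projective(heads))   # True: the edges can be drawn above the words without crossing
```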

(5)

Non-Projective Parsing

• In a non-projective parse, there are crossings between edges of the graph when they are drawn above the words.

root John saw a dog yesterday which was a Yorkshire terrier

• In languages with more flexible word order than English, such as German, Dutch and Czech, non-projective dependencies are more frequent.

(6)

MSTParser

• Main Idea: Dependency parsing can be formalised as the search for a maximum spanning tree in a directed graph.

• Work by Ryan McDonald et al.

• Algorithms

Eisner: For projective parsing

Chu Liu Edmonds: For non-projective parsing

• Works for both labeled and unlabeled parsing.

(7)

Edge-based Factorization

Let x = x_1 ⋯ x_n be a generic sentence, and let y be a generic dependency tree for x.

(i, j) ∈ y if there is a dependency in y from word x_i to word x_j, i.e., there is a directed edge from x_i to x_j in the tree (x_i is the parent of x_j).

The score of a dependency tree is defined as the sum of the scores of all edges in the tree, where the score of an edge is

s(i, j) = w · f(i, j)

with s(i, j) the score of edge (i, j), f(i, j) a high-dimensional feature representation of the edge, and w a weight vector.

The score of a dependency tree y for sentence x is therefore

s(x, y) = Σ_{(i,j) ∈ y} s(i, j) = Σ_{(i,j) ∈ y} w · f(i, j)
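The edge-factored scoring above can be sketched in a few lines of Python; the feature templates, function names, and weights below are illustrative placeholders rather than MSTParser's actual implementation.

```python
# A minimal sketch of edge-factored scoring with a sparse weight vector.
from collections import Counter

def edge_features(words, i, j):
    """A toy feature representation f(i, j) for the arc from head i to dependent j."""
    return Counter({
        f"p-word={words[i]}|c-word={words[j]}": 1.0,
        f"p-word={words[i]}": 1.0,
        f"c-word={words[j]}": 1.0,
    })

def edge_score(w, words, i, j):
    """s(i, j) = w . f(i, j), with w stored as a sparse mapping feature -> weight."""
    return sum(w.get(name, 0.0) * val for name, val in edge_features(words, i, j).items())

def tree_score(w, words, tree):
    """s(x, y) = sum of s(i, j) over the edges (i, j) of the tree y."""
    return sum(edge_score(w, words, i, j) for (i, j) in tree)

words = ["<ROOT>", "John", "hit", "the", "ball"]
tree = [(0, 2), (2, 1), (2, 4), (4, 3)]          # ROOT->hit, hit->John, hit->ball, ball->the
w = {"p-word=hit|c-word=ball": 2.0, "c-word=John": 1.0}
print(tree_score(w, words, tree))                 # 3.0
```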

(8)

Modeling Questions

• Assuming an appropriate feature representation f(i, j) as well as a weight vector w, dependency parsing is the task of finding the dependency tree y with the highest score for a given sentence x.

• dt(x) is the set of possible dependency trees for the input sentence x, and best_k(x; w) is the set of the k highest-scoring dependency trees in dt(x).

y* = argmax_{y ∈ dt(x)} s(x, y)

• Questions

  How to find the dependency tree y with the highest score for sentence x?
  How to learn an appropriate weight vector w from the training data?
  What feature representation f(i, j) should be used?

(9)

Features

root John hit the ball with the bat

The score is given by:

s(x, y) = Σ_{(i,j) ∈ y} s(i, j) = Σ_{(i,j) ∈ y} w · f(i, j)

Features can be real-valued or binary. Example of a binary feature:

f_k(i, j) = 1 if x_i = "hit" and x_j = "ball", and 0 otherwise
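For instance, the binary feature above could be rendered as the following toy indicator function (the function name and the indexing, with word positions starting at 1 and position 0 reserved for ROOT, are my own):

```python
# A toy rendering of the binary word-pair indicator feature.
def f_hit_ball(words, i, j):
    """1 if the parent word is 'hit' and the child word is 'ball', else 0."""
    return 1.0 if words[i] == "hit" and words[j] == "ball" else 0.0

words = ["<ROOT>", "John", "hit", "the", "ball", "with", "the", "bat"]
print(f_hit_ball(words, 2, 4))   # 1.0: the arc hit -> ball fires the feature
print(f_hit_ball(words, 2, 7))   # 0.0: the arc hit -> bat does not
```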

(10)

Closer look at features

p-word: word of parent node in dependency tree.

c-word: word of child node.

p-pos: POS of parent node.

c-pos: POS of child node.

p-pos+1: POS to the right of parent in sentence.

p-pos-1: POS to the left of parent.

c-pos+1: POS to the right of child.

c-pos-1: POS to the left of child.

b-pos: POS of a word in between parent and child nodes.

(11)

Closer look at features

Basic Unigram Features
  p-word, p-pos
  p-word
  p-pos
  c-word, c-pos
  c-word

Basic Bigram Features
  p-word, p-pos, c-word, c-pos
  p-pos, c-word, c-pos
  p-word, c-word, c-pos
  p-word, p-pos, c-pos
  p-word, p-pos, c-word
  p-word, c-word

In Between POS Features
  p-pos, b-pos, c-pos

Surrounding Word POS Features
  p-pos, p-pos+1, c-pos-1, c-pos
  p-pos-1, p-pos, c-pos-1, c-pos
  p-pos, p-pos+1, c-pos, c-pos+1
  p-pos-1, p-pos, c-pos, c-pos+1
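A rough sketch of how a subset of these templates might be instantiated is shown below; it is illustrative only (the function, the <NULL> boundary marker, and the choice of which bigram and surrounding-POS templates to include are mine, not MSTParser's code).

```python
# A sketch of feature extraction for the arc parent i -> child j over a tagged sentence.
def extract_features(words, tags, i, j):
    def pos(k):                                   # POS at position k, or a boundary marker
        return tags[k] if 0 <= k < len(tags) else "<NULL>"

    feats = [
        # basic unigram features
        ("p-word,p-pos", words[i], tags[i]),
        ("p-word", words[i]),
        ("p-pos", tags[i]),
        ("c-word,c-pos", words[j], tags[j]),
        ("c-word", words[j]),
        # a couple of the basic bigram features
        ("p-word,p-pos,c-word,c-pos", words[i], tags[i], words[j], tags[j]),
        ("p-word,c-word", words[i], words[j]),
        # one of the surrounding-word POS features
        ("p-pos,p-pos+1,c-pos-1,c-pos", tags[i], pos(i + 1), pos(j - 1), tags[j]),
    ]
    # in-between POS features: one feature for each POS between parent and child
    lo, hi = min(i, j), max(i, j)
    feats += [("p-pos,b-pos,c-pos", tags[i], tags[k], tags[j]) for k in range(lo + 1, hi)]
    return feats

words = ["<ROOT>", "John", "hit", "the", "ball"]
tags = ["<ROOT>", "NNP", "VBD", "DT", "NN"]
for f in extract_features(words, tags, 2, 4):     # features for the arc hit -> ball
    print(f)
```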

(12)

Learning the weights

MSTParser uses MIRA, the Margin Infused Relaxed Algorithm (Crammer and Singer, 2003).

• It is an online learning algorithm: a single training instance is considered on each iteration, and the parameters are updated by applying an algorithm-specific update rule to that instance.

• It returns an averaged weight vector, obtained by averaging the weight vectors from every iteration of training.

(13)

MIRA

MIRA attempts to keep the new weight vector as close as possible to the old weight vector, subject to correctly classifying the instance under consideration with a margin given by the loss of the incorrect classifications.

min ‖w^(i+1) − w^(i)‖

s.t. s(x_t, y_t) − s(x_t, y′) ≥ L(y_t, y′), ∀ y′ ∈ dt(x_t)

L(y_t, y′): the loss of a tree, defined as the number of words with incorrect parents relative to the correct tree.

dt(x) is defined as the set of possible dependency trees for the input sentence x. Instead of dt(x_t), best_k(x_t; w) is used to reduce complexity, where best_k(x; w) is the set of k dependency trees in dt(x) that are given the highest scores by the weight vector w.

The more errors a tree has, the farther away its score must be from the score of the correct tree.

(14)

MIRA Learning Algorithm

Training data: T = {(x_t, y_t)}, t = 1 .. T

1. w^(0) = 0; v = 0; i = 0
2. for n : 1 .. N
3.   for t : 1 .. T
4.     w^(i+1) = arg min_w ‖w − w^(i)‖
           s.t. s(x_t, y_t) − s(x_t, y′) ≥ L(y_t, y′), ∀ y′ ∈ best_k(x_t; w^(i))
5.     v = v + w^(i+1)
6.     i = i + 1
7. w = v / (N ∗ T)
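The loop can be sketched as below for the k = 1 case, where the quadratic program with a single constraint has the standard closed-form update w ← w + τ·(f(x, y_t) − f(x, y′)) with τ = max(0, (loss − margin) / ‖Δf‖²). The helper names decode, features, and tree_loss are placeholders for the parser's argmax, the feature map f(x, y), and the loss L(y, y′); this is a sketch under those assumptions, not the MSTParser code.

```python
# A sketch of 1-best MIRA training with weight averaging, over sparse dict vectors.
def sparse_sub(a, b):
    """Sparse vector difference a - b over dicts feature -> value."""
    out = dict(a)
    for k, v in b.items():
        out[k] = out.get(k, 0.0) - v
    return out

def dot(a, b):
    """Sparse dot product of two dicts."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(k, 0.0) for k, v in a.items())

def mira_train(data, decode, features, tree_loss, n_epochs=10):
    """data: list of (x, y_gold) pairs. Returns the averaged weight vector as a dict."""
    w, v, seen = {}, {}, 0
    for _ in range(n_epochs):
        for x, y_gold in data:
            y_pred = decode(x, w)                            # best tree under current w
            if y_pred != y_gold:
                delta = sparse_sub(features(x, y_gold), features(x, y_pred))
                margin = dot(w, delta)                       # s(x, y_gold) - s(x, y_pred)
                norm = dot(delta, delta)
                tau = max(0.0, (tree_loss(y_gold, y_pred) - margin) / norm) if norm else 0.0
                for k, val in delta.items():                 # w <- w + tau * delta
                    w[k] = w.get(k, 0.0) + tau * val
            for k, val in w.items():                         # accumulate for averaging
                v[k] = v.get(k, 0.0) + val
            seen += 1
    return {k: val / seen for k, val in v.items()}           # averaged weight vector
```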

(15)

Eisner Algorithm: finding the dependency tree with the highest score for a sentence x

• Runtime of O(n³); it has been employed successfully in both generative and discriminative parsing models.

• Similar to CKY method

• Instead of storing subtrees, it stores spans

• Non-constituent spans will be concatenated into larger spans

• Span = substring where no internal word links to any word outside of the span

• A span consists of:
  - two or more adjacent words
  - tags for all of these words
  - a list of all dependency links between words in the span

(16)

• Each internal word has a parent in the span

• A span of the dependency parse has either one parentless endword or two parentless endwords

• In a span, only the endwords are active (meaning they still need a parent)

• Internal part of the span is grammatically inert

• Covered-concatenation: if span a ends on the same word i that starts span b, then the parser tries to combine the two.

• No cycles, multiple parents, or crossing links are allowed within a span.

(17)

The Algorithm

• Build an incomplete subtree by merging two adjacent complete subtrees a and b, where a is right-facing and b is left-facing, and adding a dependency between their outer endwords. The new subtree is left-facing if we add a dependency from the right edge of b to the left edge of a; otherwise it is right-facing.

• Build a complete left-facing subtree by merging a complete left-facing subtree with an adjacent incomplete left-facing subtree.

• Build a complete right-facing subtree by merging an incomplete right-facing subtree with an adjacent complete right-facing subtree.

(18)

Notations

C[s, t, d, c]: score of the best subtree from s to t with direction d and completeness c,

where

- s, t: indices, in the range [1, n]

- d: direction, d ∈ {←, →}
    ←: the head of the span is its right endword
    →: the head of the span is its left endword

- c: completeness, c ∈ {0, 1}
    1: the subtree will not take any more dependencies (complete)
    0: the subtree still needs to be completed (incomplete)

(19)

Calculating Scores

• C[s, t, ←, 0] = s(t, s) + max_{s ≤ q < t} ( C[s, q, →, 1] + C[q+1, t, ←, 1] )

• C[s, t, →, 0] = s(s, t) + max_{s ≤ q < t} ( C[s, q, →, 1] + C[q+1, t, ←, 1] )

• C[s, t, ←, 1] = max_{s ≤ q < t} ( C[s, q, ←, 1] + C[q, t, ←, 0] )

• C[s, t, →, 1] = max_{s < q ≤ t} ( C[s, q, →, 0] + C[q, t, →, 1] )

• The score of the final parse is C[1, n, →, 1].
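These recurrences translate almost directly into a small dynamic program. The sketch below (0-indexed, with ROOT at position 0; the function name and table layout are mine) computes only the best score C[0, n−1, →, 1]; backpointers for recovering the tree are omitted for brevity. It is run on the score table from the "plastic cup holders" example on the following slides and agrees with the hand computation there.

```python
# A minimal sketch of the Eisner recurrences, returning only the best parse score.
NEG_INF = float("-inf")

def eisner_best_score(scores):
    """scores[h][d] = s(h, d), the score of the arc from head h to dependent d."""
    n = len(scores)                                      # number of tokens, incl. ROOT
    # C[s][t][c][d]: c is completeness (0 incomplete, 1 complete), d is direction
    # (0 for <-, head on the right; 1 for ->, head on the left).
    C = [[[[NEG_INF, NEG_INF] for _ in range(2)] for _ in range(n)] for _ in range(n)]
    for i in range(n):
        C[i][i][1][0] = C[i][i][1][1] = 0.0              # one-word spans are complete
    for length in range(1, n):
        for s in range(n - length):
            t = s + length
            # incomplete spans: join two complete halves and add the arc (s,t) or (t,s)
            best_join = max(C[s][q][1][1] + C[q + 1][t][1][0] for q in range(s, t))
            C[s][t][0][0] = best_join + scores[t][s]     # <- : t becomes the head of s
            C[s][t][0][1] = best_join + scores[s][t]     # -> : s becomes the head of t
            # complete spans: absorb an adjacent incomplete span of the same direction
            C[s][t][1][0] = max(C[s][q][1][0] + C[q][t][0][0] for q in range(s, t))
            C[s][t][1][1] = max(C[s][q][0][1] + C[q][t][1][1] for q in range(s + 1, t + 1))
    return C[0][n - 1][1][1]

# Score table from the "plastic cup holders" example
# (0 = ROOT, 1 = plastic, 2 = cup, 3 = holders); diagonal entries are never used.
S = [[NEG_INF, 1, 1, 1],
     [NEG_INF, NEG_INF, -1, -1],
     [NEG_INF, 2, NEG_INF, -1],
     [NEG_INF, 0, 4, NEG_INF]]
print(eisner_best_score(S))   # 7.0, matching the hand computation C[1, 4, ->, 1] = 7
```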

(20)

Example

plastic cup holders

Let the log potentials s(i, j), the score of the edge from parent word i to child word j, be given as below (rows are parents i, columns are children j):

s(i, j)      1 ROOT   2 plastic   3 cup   4 holders
1 ROOT          -         1         1         1
2 plastic      −∞         -        −1        −1
3 cup          −∞         2         -        −1
4 holders      −∞         0         4         -

(21)


C[1, 2, ←, 0] = C[1, 1, →, 1] + C[2, 2, ←, 1] + s(2, 1) = s(2, 1) = −∞
C[1, 2, →, 0] = C[1, 1, →, 1] + C[2, 2, ←, 1] + s(1, 2) = s(1, 2) = 1
C[1, 2, ←, 1] = C[1, 1, ←, 1] + C[1, 2, ←, 0] = C[1, 2, ←, 0] = s(2, 1) = −∞
C[1, 2, →, 1] = C[1, 2, →, 0] + C[2, 2, →, 1] = C[1, 2, →, 0] = s(1, 2) = 1

C[2, 3, ←, 0] = s(3, 2) = 2        C[2, 3, →, 0] = s(2, 3) = −1
C[2, 3, ←, 1] = s(3, 2) = 2        C[2, 3, →, 1] = s(2, 3) = −1

C[3, 4, ←, 0] = s(4, 3) = 4        C[3, 4, →, 0] = s(3, 4) = −1
C[3, 4, ←, 1] = s(4, 3) = 4        C[3, 4, →, 1] = s(3, 4) = −1

(22)


C[1, 3, ←, 0] = s(3, 1) + max( C[1, 1, →, 1] + C[2, 3, ←, 1], C[1, 2, →, 1] + C[3, 3, ←, 1] )
             = s(3, 1) + max( s(3, 2), s(1, 2) ) = s(3, 1) + s(3, 2) = −∞

C[1, 3, →, 0] = s(1, 3) + max( C[1, 1, →, 1] + C[2, 3, ←, 1], C[1, 2, →, 1] + C[3, 3, ←, 1] )
             = s(1, 3) + s(3, 2) = 3

C[1, 3, ←, 1] = max( C[1, 1, ←, 1] + C[1, 3, ←, 0], C[1, 2, ←, 1] + C[2, 3, ←, 0] )
             = max( 0 − ∞, −∞ + 2 ) = −∞

C[1, 3, →, 1] = max( C[1, 2, →, 0] + C[2, 3, →, 1], C[1, 3, →, 0] + C[3, 3, →, 1] )
             = max( s(1, 2) + s(2, 3), s(1, 3) + s(3, 2) ) = max( 0, 3 ) = 3

(23)


C[2, 4, ←, 0] = s(4, 2) + max( C[2, 2, →, 1] + C[3, 4, ←, 1], C[2, 3, →, 1] + C[4, 4, ←, 1] )
             = s(4, 2) + max( s(4, 3), s(2, 3) ) = 0 + 4 = 4

C[2, 4, →, 0] = s(2, 4) + max( s(4, 3), s(2, 3) ) = −1 + 4 = 3

C[2, 4, ←, 1] = max( C[2, 2, ←, 1] + C[2, 4, ←, 0], C[2, 3, ←, 1] + C[3, 4, ←, 0] )
             = max( s(4, 2) + s(4, 3), s(3, 2) + s(4, 3) ) = max( 4, 6 ) = 6

C[2, 4, →, 1] = max( C[2, 3, →, 0] + C[3, 4, →, 1], C[2, 4, →, 0] + C[4, 4, →, 1] )
             = max( s(2, 3) + s(3, 4), s(2, 4) + s(4, 3) ) = max( −2, 3 ) = 3

(24)


C[1, 4, ←, 0] = s(4, 1) + … = −∞

C[1, 4, →, 0] = s(1, 4) + max( C[1, 1, →, 1] + C[2, 4, ←, 1], C[1, 2, →, 1] + C[3, 4, ←, 1], C[1, 3, →, 1] + C[4, 4, ←, 1] )
             = s(1, 4) + max( s(3, 2) + s(4, 3), s(1, 2) + s(4, 3), s(1, 3) + s(3, 2) )
             = 1 + 6 = 7

C[1, 4, ←, 1] = max( 0 − ∞, −∞ + 4, −∞ + 4 ) = −∞

C[1, 4, →, 1] = max( C[1, 2, →, 0] + C[2, 4, →, 1], C[1, 3, →, 0] + C[3, 4, →, 1], C[1, 4, →, 0] + C[4, 4, →, 1] )
             = max( s(1, 2) + s(2, 4) + s(4, 3), s(1, 3) + s(3, 2) + s(3, 4), s(1, 4) + s(4, 3) + s(3, 2) )
             = max( 4, 2, 7 ) = 7

(25)

Final tree

ROOT → holders, holders → cup, cup → plastic
(score = s(1, 4) + s(4, 3) + s(3, 2) = 1 + 4 + 2 = 7, matching C[1, 4, →, 1])
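As a quick sanity check of the worked example, the chosen arcs can be summed directly from the score table (a trivial snippet, not part of the slides):

```python
# Summing the chosen arcs from the score table (1 = ROOT, 2 = plastic, 3 = cup,
# 4 = holders) reproduces the best score found by the dynamic program.
s = {(1, 4): 1, (4, 3): 4, (3, 2): 2}      # s(parent, child) for the arcs of the tree
tree = [(1, 4), (4, 3), (3, 2)]            # ROOT->holders, holders->cup, cup->plastic
print(sum(s[arc] for arc in tree))         # 7, equal to C[1, 4, ->, 1]
```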

(26)

Chu Liu Edmonds Algorithm

• Find the highest-scoring incoming arc for each vertex.

• If the result is a tree, then we have found the MST.

• If it is not a tree, identify a cycle and contract its nodes into a pseudo-node.

• Recalculate the arc weights into and out of the cycle and recursively call the algorithm on the new graph. The key property is that the weight of the MST of the contracted graph is equal to the weight of the MST of the original graph, so the MST of the original graph can be recovered from it.

(27)

Example

John saw Mary.

(28)

Steps

• For each word, find the highest incoming edge.

• Cycle exists!!

(29)

Next Steps

• Contract cycle to a single node and recalculate weights.

• The new node represents the contraction of the vertices in the cycle.

(30)

Getting the tree

(31)

The Algorithm

Chu-Liu-Edmonds(G, s)
  Graph G = (V, E)
  Edge weight function s : E → R
  1. Let M = {(x*, x) : x ∈ V, x* = argmax_{x′} s(x′, x)}
  2. Let G_M = (V, M)
  3. If G_M has no cycles, then it is an MST: return G_M
  4. Otherwise, find a cycle C in G_M
  5. Let G_C = contract(G, C, s)
  6. Let y = Chu-Liu-Edmonds(G_C, s)
  7. Find the vertex x ∈ C such that (x′, c) ∈ y corresponds to an edge (x′, x), and (x″, x) ∈ C
  8. Return y ∪ C − {(x″, x)}

contract(G = (V, E), C, s)
  1. Let G_C be the subgraph of G excluding the nodes in C
  2. Add a node c to G_C representing the cycle C
  3. For each x ∈ V − C with an edge (x′, x) ∈ E from some x′ ∈ C:
       add edge (c, x) to G_C with s(c, x) = max_{x′ ∈ C} s(x′, x)
  4. For each x ∈ V − C with an edge (x, x′) ∈ E to some x′ ∈ C:
       add edge (x, c) to G_C with s(x, c) = max_{x′ ∈ C} [ s(x, x′) − s(a(x′), x′) + s(C) ]
     where a(v) is the predecessor of v in C and s(C) = Σ_{v ∈ C} s(a(v), v)
  5. Return G_C
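A compact recursive rendering of this pseudocode is sketched below (my own implementation, not the MSTParser code). The arc scores in the usage example are illustrative numbers for "John saw Mary" in the spirit of McDonald et al. (2005), not necessarily the exact figures from the lost slide graphic.

```python
# A sketch of Chu-Liu-Edmonds over a dict of arc scores {(head, dep): score}.
def _find_cycle(best_in):
    """Return one cycle (list of nodes) in the 'best incoming arc' graph, or None."""
    for start in best_in:
        path, v = [], start
        while v in best_in and v not in path:
            path.append(v)
            v = best_in[v]
        if v in path:                                       # walked back onto the path
            return path[path.index(v):]
    return None

def chu_liu_edmonds(nodes, arcs, root=0):
    """Return {dependent: head} for the maximum spanning arborescence rooted at root."""
    # Steps 1-2: greedily pick the highest-scoring incoming arc for every non-root node.
    best_in = {}
    for (h, d), w in arcs.items():
        if d != root and (d not in best_in or w > arcs[(best_in[d], d)]):
            best_in[d] = h
    # Steps 3-4: if the greedy choice is a tree, we are done; otherwise find a cycle.
    cycle = _find_cycle(best_in)
    if cycle is None:
        return best_in
    # Step 5: contract the cycle C into a pseudo-node c, rescoring arcs in and out of C.
    c, in_c = max(nodes) + 1, set(cycle)
    cycle_score = sum(arcs[(best_in[v], v)] for v in in_c)
    new_arcs, origin = {}, {}
    for (h, d), w in arcs.items():
        if h in in_c and d not in in_c:                     # arc leaving the cycle
            key, w2 = (c, d), w
        elif h not in in_c and d in in_c:                   # arc entering the cycle
            key, w2 = (h, c), w - arcs[(best_in[d], d)] + cycle_score
        elif h not in in_c and d not in in_c:               # arc untouched by C
            key, w2 = (h, d), w
        else:
            continue                                        # arc inside the cycle
        if key not in new_arcs or w2 > new_arcs[key]:
            new_arcs[key], origin[key] = w2, (h, d)
    # Step 6: recurse on the contracted graph.
    sub = chu_liu_edmonds([v for v in nodes if v not in in_c] + [c], new_arcs, root)
    # Steps 7-8: expand c back into the cycle, breaking it at the externally headed node.
    heads, entering = {}, None
    for d, h in sub.items():
        oh, od = origin[(h, d)]
        if d == c:
            entering = (oh, od)                             # the one arc chosen into the cycle
        else:
            heads[od] = oh
    for v in in_c:
        heads[v] = best_in[v]                               # keep the cycle's internal arcs ...
    heads[entering[1]] = entering[0]                        # ... except where the cycle is broken
    return heads

# Illustrative scores for "John saw Mary" (0 = root, 1 = John, 2 = saw, 3 = Mary).
arcs = {(0, 1): 9, (0, 2): 10, (0, 3): 9,
        (1, 2): 30, (2, 1): 30, (1, 3): 3,
        (3, 1): 3, (2, 3): 30, (3, 2): 0}
print(chu_liu_edmonds([0, 1, 2, 3], arcs))   # saw's head is root; John's and Mary's head is saw
```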

(32)

Difference between the Eisner Algorithm and the Chu-Liu-Edmonds Algorithm

• The Eisner algorithm is a bottom-up dynamic programming algorithm, while the Chu-Liu-Edmonds algorithm is a greedy recursive one.

• A bottom-up algorithm is necessary for the projective case since it must maintain the nested structural constraint, which is unnecessary for the non-projective case.

(33)

References

McDonald, Ryan, Kevin Lerman, and Fernando Pereira. "Multilingual dependency analysis with a two-stage discriminative parser." Proceedings of the Tenth Conference on Computational Natural Language Learning. Association for Computational Linguistics, 2006.

McDonald, Ryan T., and Fernando C. N. Pereira. "Online learning of approximate dependency parsing algorithms." EACL, 2006.

McDonald, Ryan, et al. "Non-projective dependency parsing using spanning tree algorithms." Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2005.

McDonald, Ryan, Koby Crammer, and Fernando Pereira. "Online large-margin training of dependency parsers." Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2005.

Eisner, Jason M. "Three new probabilistic models for dependency parsing: An exploration." Proceedings of the 16th Conference on Computational Linguistics, Volume 1, pp. 340-345. Association for Computational Linguistics, 1996.

Jacob Eisenstein. "Eisner's Algorithm for Projective Dependency Parsing: An Example Worksheet." http://www.cc.gatech.edu/~jeisenst/classes/cs7650_sp12/eisner_worksheet.pdf
