• No results found

Unsupervised Stemming

N/A
N/A
Protected

Academic year: 2022

Share "Unsupervised Stemming"

Copied!
27
0
0

Loading.... (view fulltext now)

Full text

(1)

Vasudevan N., Research Scholar,

CSE Dept., IIT Bombay

28/8/14

(lecture given in the CS626 course on NLP)

Unsupervised Stemming

(2)

Introduction

Stemming – Identifying stem by removing inflectional suffix

boys boy

Stemming is relevant for NLP and IR applications

Corpus based stemming – for resource scarce languages

Linguistic rules may not be available

Unsupervised stemming – unavailability of annotated corpus

Is it possible to do stemming with only a list of words?

(3)

Unsupervised Stemming using Minimum

Description Length

[Goldsmith, 2000]

(4)

Minimum Description Length (MDL)

Data is given – How to select model ?

Select the model which leads to best compression of data

No other assumption about underlying truth

Basic Idea of MDL based Learning

Learning is finding regularity

Regularity can be used to compress

More learning More compression

MDL is a practical Kolmogorov complexity

(5)

Kolmogorov Complexity

Kolmogorov complexity of a sequence, S

Length of smallest p s.t., U(p)=S

Where U is the universal turing machine

Length of the shortest program that prints S

E.g., S = “010101010101010101010101”

Program 1:- print “010101010101010101010101”

Program 2:- for i=1 to 12: print “01”

Less complexity More regularity

Kolmogorov complexity is uncomputable

(6)

Kolmogorov Complexity to MDL

H is the set of models/hypothesis and D is the data then best model is

L(h): Length of description of h

L(D/h): Length of description of D when encoded according to h

( ) ( / )

min L h L D h Arg

H h

(7)

Length

How to calculate the length of model and data?

Information inequality:

If the data are distributed according to P,

The code with length −log P achieves the minimum expected code length

Number of bits (length) required to represent any element in list [e1,e2,…,en] is log(n)

In the absence of any other information assume uniform distribution

Pr(e1)=1/n, Pr(e2)=1/n, …, Pr(en)=1/n

Code length = -log(1/n) = log(n)

(8)

MDL for Stemming

Given a list of word forms split each word forms into its stem and suffix

Word forms (W) Cat

Cats Dog Dogs Girl Girls Boy

Splits

Cat+NULL Cat+s

Dog+NULL Dog+s

Girl+NULL Girl+s

Boy+NULL

(9)

Model

Stemming model have a stem list, a suffix list and a signature list to encode the list of splits

Signature is a list of all the suffixes with which a stem appears in the given corpus.

Splits

Cat+NULL Cat+s

Dog+NULL Dog+s

Girl+NULL Girl+s

Boy+NULL

Stems (T) Cat

Dog Girl Boy

Suffixes (F) s

NULL

Signatures (∑)

) (

) ( )

( ) (

) (

NULL ptr

s ptr girl

ptr dog ptr

cat ptr

ptr(boy)  ptr(NULL)

(10)

Length of Model – Stem List

L(Model)=L(T)+L(F)+L()

Represent the stem list and each elements in the stem list

Where |T| is the number of stems in T

= log(4)+(3+3+4+3)log(26) ≈ 62.1 Stems (T)

Cat Dog Girl Boy

  

T t

t length T

T

L( ) log ( ) log(26)

(11)

Length of Model – Suffix List

Represent the suffix list and each elements in the suffix list

Where |F| is the number of suffixes in F

= log(2)+(1+0)log(26) ≈ 5.7

  

F f

f length F

F

L( ) log ( ) log(26)

Suffixes (F) s

NULL

(12)

Length of Model – Signature List

L(Model)=L(T)+L(F)+L()

Represent the signature list and each elements in the signature list

Each signature contains a list of stem pointers and a list of suffix pointers

Each stem should be in exactly one signature

Suffix can be in different signature

Different suffix ptr for different signatures

  

) ( log

)

( L

L

Signatures (∑)

) (

) ( )

( ) (

) (

NULL ptr

s ptr girl

ptr dog ptr

cat ptr

ptr(boy)  ptr(NULL)

(13)

Length of Model – Signature List

|T(σ)|: number of stems in σ

|F(σ)|: number of suffixes in σ

|W| : number of words in corpus

|t| : number of words with t

|W(σ)|: number of words in σ

|W(f/σ)|: number of words in σ with suffix f

) ( )

( ( / )

) log (

) ) ( log(

log )

) ( log(

) (

F f T

t W f

F W t

T W L

   

 

( ) ( )

) / Pr(

log )

) ( log(

) Pr(

log )

) ( log(

) (

F f T

t

f F

t T

L

(14)

Length of Model - Signature List

| | : number of signatures

|T(σ)|: number of stems in σ

|F(σ)|: number of suffixes in σ

|W| : number of words in corpus

|t| : number of words with t

|W(σ)|: number of words in σ

|W(f/σ)|: number of words in σ with suffix f

Length of first signature: log(3)+log(2)+ (7/2+7/2 +7/2)+(6/3+6/3) ≈ 17

Length of second signature: log(1) + log(1) + (7/1) + (1/1) = 8

L(∑) = 17 + 8 = 25

) ( )

( ( / )

) log (

log )

) ( log(

) ) ( log(

) log(

) (

F f T

t W f

W t

F W T

L

Signatures (∑)

) (

) ( )

( ) (

) (

NULL ptr

s ptr girl

ptr dog ptr

cat ptr

ptr(boy)  ptr(NULL)

(15)

Length of Data(Corpus)

f t

w W

f W W

t W W

W

) (

) / ( )

( ) ( )

log (

 

f t w

f

t) Pr( / ) Pr(

) Pr(

log

Where σ is the signature of w

|W(σ)|: number of words in σ

|W(f/σ)|: number of words in σ with suffix f

|W| : number of words in corpus

|W(t)|: number of words in corpus with stem t

f t

w W f

W t

W W

) / (

) ( )

log (

(16)

Length of Data(Corpus)

Splits

Cat+NULL Cat+s

Dog+NULL Dog+s

Girl+NULL Girl+s

Boy+NULL

f t

w W f

W t

W W

) / (

) ( )

log (

65 . 1 19

1 1 log 7 3

6 2

log 7

6

|W(σ)|: number of words in σ

|W(f/σ)|: number of words in σ with suffix f

|W| : number of words in corpus

|W(t)|: number of words in corpus with stem t

(17)

Stem Identification Process

Identify splits by using different heuristics

One of the heuristics is,

Initialize probability to each possible splits of each words

Select maximum probable split fro each word

Recalculate probability of each split t+f of each word w as

Z is the normalization factor

Iterate until there is no improvement

Select best split based on MDL

Iteratively modify splits and further minimize MDL

Identify stems of each words from its signature Z

elen(t)freq(t)len( f )freq( f )

(18)

Minimum Morpheme Set Model for

Stemming

[Vasudevan N and Pushpak Bhattacharyya]

(19)

Problem Definition

Input: List of word forms such that

Frequency of every stem and suffix in the word list should be at least 2

Output: Split of every word forms in the word list

Main idea:

Minimize the number of distinct stems and suffixes

Minimize the number of splits with non-empty suffixes

(20)

Some Notations

Split: Outcome of process of segmentation

E.g., b+oy, bo+y, boy+NULL

Correct split: Linguistically correct split

E.g., boy+s, boy+NULL

Splitset: Set of splits obtained from whole input word list

For every word – exactly one split

{bo+ys, girl+NULL, play+ing} from {boys, girl, playing}

Correct splitset: Set of correct splits from the whole input word list

{boy+s, girl+NULL, play+ing} from {boys, girl, playing}

(21)

Some Notations

T(splitset): Set of all identified stems from splitset

T({bo+ys, girl+NULL, play+ing})={bo,girl,play}

X(splitset): Set of all identified stems from splitset

X({bo+ys, girl+NULL, play+ing})={ys,NULL,ing}

S(splitset): Set of all splits with non-empty suffix from splitset

S({bo+ys, girl+NULL, play+ing})={bo+ys, play+ing}

(22)

Motivating Example

Word forms (W) Boy

Girls Toy Boys Girl Toys

Splitset - 1 Boy+NULL Girl+s

Toy+NULL Bo+ys

Girl+NULL To+ys

Splitset - 2 Boy+NULL Girl+s

Toy+NULL Boy+s

Girl+NULL Toy+s

3 stems 2 suffixes 5 stems 3 suffixes

(23)

Motivating Example

Word forms (W) Moss

Mosses Gas Gases

Splitset – 1 Mos+s

Mos+ses Ga+s Ga+ses

2 stems, 2 suffixes 2 splits with non- empty suffix

2 stems, 2 suffixes 4 splits with non- empty suffix

Splitset - 2 Mos+s

Mos+ses Ga+s Ga+ses

(24)

Optimization Problem

Min{|T(splitset)|, |X(splitset)|, |S(splitset)|}, s.t.,

For all t in T(splitset) freq(t) ≥ 2

For all x in X(splitset) freq(x) ≥ 2

Multi objectives to a single aggregate objective

Min{λt | T(splitset)|+ λx |X(splitset)|+ λs |S(splitset)|}

This optimization problem is NP-Hard

Move to a probabilistic version

(25)

Probabilistic Model

Probability distribution Prw for each word w in the corpus

Prboy(b+oy) + Prboy(bo+y) + Prboy(boy+NULL) = 1

Learn the probability values by solving an optimization problem

Objective: Minimize,

Number of distinct prefixes with expected frequency ≥ 2

Number of distinct suffixes with expected frequency ≥ 2

Expected number of splits with non-empty suffix

Constraints: Expected frequency of any prefix,

Either ≥ 2 (selected as stem)

Or ≤ 0.00001 (not selected as stem)

Constraints: Expected frequency of any suffix,

Either ≥ 2 (selected as suffix)

Or ≤ 0.00001 (not selected as suffix)

(26)

Stem Identification Process

Aggregate multiple objectives using weight parameters

Formulate the above learning problem as a Mixed Integer Programming

Select the most probable split based on the learned probability

Tune the weight parameters based on the performance

Identify stems with tuned weight parameters

(27)

Summary

Discussed about the idea of MDL

An MDL based stemming approach is discussed

Stemming approach by minimizing number of distinct morphemes is also discussed

References

Related documents

 Mitesh Khapra, Sapan Shah, Piyush Kedia and Pushpak Bhattacharyya, Domain-Specific Word Sense Disambiguation Combining Corpus Based and Wordnet Based Parameters, 5th

Since ∧ is associative, commutative and absorbs multiple occurrences, a CNF formula may be referred as a set of clauses..

„ Step1: From each sense marked sentence containing the ambiguous word , a training example is constructed using:. POS of w as well as POS of

Step1: From each sense marked sentence containing the ambiguous word , a training example is constructed using:. POS of w as well as POS of

Microsoft Word or MS-WORD (often called Word) is a graphical word processing program that users can type with.. It is made by

Further, the SI shall ensure that the PoS devices, application software functionalities and any other components, equipment, peripherals involved in implementation

Several shapes like square, rectangle, star, arrows etc. can be selected from the shapes menu and inserted in the document by a click of mouse. When a picture or shape is inserted,

To overcome the taxonomical conflicts the morphological, anatomical, stomatal, palynological and seed cover studies were performed, the results of which will help to put forth