Vasudevan N., Research Scholar,
CSE Dept., IIT Bombay
28/8/14
(lecture given in the CS626 course on NLP)
Unsupervised Stemming
Introduction
Stemming – Identifying stem by removing inflectional suffix
boys → boy
Stemming is relevant for NLP and IR applications
Corpus based stemming – for resource scarce languages
Linguistic rules may not be available
Unsupervised stemming – unavailability of annotated corpus
Is it possible to do stemming with only a list of words?
Unsupervised Stemming using Minimum
Description Length
[Goldsmith, 2000]Minimum Description Length (MDL)
Data is given – How to select model ?
Select the model which leads to best compression of data
No other assumption about underlying truth
Basic Idea of MDL based Learning
Learning is finding regularity
Regularity can be used to compress
More learning → More compression
MDL is a practical Kolmogorov complexity
Kolmogorov Complexity
Kolmogorov complexity of a sequence, S
Length of smallest p s.t., U(p)=S
Where U is the universal turing machine
Length of the shortest program that prints S
E.g., S = “010101010101010101010101”
Program 1:- print “010101010101010101010101”
Program 2:- for i=1 to 12: print “01”
Less complexity → More regularity
Kolmogorov complexity is uncomputable
Kolmogorov Complexity to MDL
H is the set of models/hypothesis and D is the data then best model is
L(h): Length of description of h
L(D/h): Length of description of D when encoded according to h
( ) ( / )
min L h L D h Arg
H h
Length
How to calculate the length of model and data?
Information inequality:
If the data are distributed according to P,
The code with length −log P achieves the minimum expected code length
Number of bits (length) required to represent any element in list [e1,e2,…,en] is log(n)
In the absence of any other information assume uniform distribution
Pr(e1)=1/n, Pr(e2)=1/n, …, Pr(en)=1/n
Code length = -log(1/n) = log(n)
MDL for Stemming
Given a list of word forms split each word forms into its stem and suffix
Word forms (W) Cat
Cats Dog Dogs Girl Girls Boy
Splits
Cat+NULL Cat+s
Dog+NULL Dog+s
Girl+NULL Girl+s
Boy+NULL
Model
Stemming model have a stem list, a suffix list and a signature list to encode the list of splits
Signature is a list of all the suffixes with which a stem appears in the given corpus.
Splits
Cat+NULL Cat+s
Dog+NULL Dog+s
Girl+NULL Girl+s
Boy+NULL
Stems (T) Cat
Dog Girl Boy
Suffixes (F) s
NULL
Signatures (∑)
) (
) ( )
( ) (
) (
NULL ptr
s ptr girl
ptr dog ptr
cat ptr
ptr(boy) ptr(NULL)
Length of Model – Stem List
L(Model)=L(T)+L(F)+L(∑)
Represent the stem list and each elements in the stem list
Where |T| is the number of stems in T
= log(4)+(3+3+4+3)log(26) ≈ 62.1 Stems (T)
Cat Dog Girl Boy
T t
t length T
T
L( ) log ( ) log(26)
Length of Model – Suffix List
Represent the suffix list and each elements in the suffix list
Where |F| is the number of suffixes in F
= log(2)+(1+0)log(26) ≈ 5.7
F f
f length F
F
L( ) log ( ) log(26)
Suffixes (F) s
NULL
Length of Model – Signature List
L(Model)=L(T)+L(F)+L(∑)
Represent the signature list and each elements in the signature list
Each signature contains a list of stem pointers and a list of suffix pointers
Each stem should be in exactly one signature
Suffix can be in different signature
Different suffix ptr for different signatures
) ( log
)
( L
L
Signatures (∑)
) (
) ( )
( ) (
) (
NULL ptr
s ptr girl
ptr dog ptr
cat ptr
ptr(boy) ptr(NULL)
Length of Model – Signature List
|T(σ)|: number of stems in σ
|F(σ)|: number of suffixes in σ
|W| : number of words in corpus
|t| : number of words with t
|W(σ)|: number of words in σ
|W(f/σ)|: number of words in σ with suffix f
) ( )
( ( / )
) log (
) ) ( log(
log )
) ( log(
) (
F f T
t W f
F W t
T W L
( ) ( )
) / Pr(
log )
) ( log(
) Pr(
log )
) ( log(
) (
F f T
t
f F
t T
L
Length of Model - Signature List
| ∑ | : number of signatures
|T(σ)|: number of stems in σ
|F(σ)|: number of suffixes in σ
|W| : number of words in corpus
|t| : number of words with t
|W(σ)|: number of words in σ
|W(f/σ)|: number of words in σ with suffix f
Length of first signature: log(3)+log(2)+ (7/2+7/2 +7/2)+(6/3+6/3) ≈ 17
Length of second signature: log(1) + log(1) + (7/1) + (1/1) = 8
L(∑) = 17 + 8 = 25
) ( )
( ( / )
) log (
log )
) ( log(
) ) ( log(
) log(
) (
F f T
t W f
W t
F W T
L
Signatures (∑)
) (
) ( )
( ) (
) (
NULL ptr
s ptr girl
ptr dog ptr
cat ptr
ptr(boy) ptr(NULL)
Length of Data(Corpus)
f t
w W
f W W
t W W
W
) (
) / ( )
( ) ( )
log (
f t w
f
t) Pr( / ) Pr(
) Pr(
log
Where σ is the signature of w
|W(σ)|: number of words in σ
|W(f/σ)|: number of words in σ with suffix f
|W| : number of words in corpus
|W(t)|: number of words in corpus with stem t
f t
w W f
W t
W W
) / (
) ( )
log (
Length of Data(Corpus)
Splits
Cat+NULL Cat+s
Dog+NULL Dog+s
Girl+NULL Girl+s
Boy+NULL
f t
w W f
W t
W W
) / (
) ( )
log (
65 . 1 19
1 1 log 7 3
6 2
log 7
6
|W(σ)|: number of words in σ
|W(f/σ)|: number of words in σ with suffix f
|W| : number of words in corpus
|W(t)|: number of words in corpus with stem t
Stem Identification Process
Identify splits by using different heuristics
One of the heuristics is,
Initialize probability to each possible splits of each words
Select maximum probable split fro each word
Recalculate probability of each split t+f of each word w as
Z is the normalization factor
Iterate until there is no improvement
Select best split based on MDL
Iteratively modify splits and further minimize MDL
Identify stems of each words from its signature Z
elen(t)freq(t)len( f )freq( f )
Minimum Morpheme Set Model for
Stemming
[Vasudevan N and Pushpak Bhattacharyya]Problem Definition
Input: List of word forms such that
Frequency of every stem and suffix in the word list should be at least 2
Output: Split of every word forms in the word list
Main idea:
Minimize the number of distinct stems and suffixes
Minimize the number of splits with non-empty suffixes
Some Notations
Split: Outcome of process of segmentation
E.g., b+oy, bo+y, boy+NULL
Correct split: Linguistically correct split
E.g., boy+s, boy+NULL
Splitset: Set of splits obtained from whole input word list
For every word – exactly one split
{bo+ys, girl+NULL, play+ing} from {boys, girl, playing}
Correct splitset: Set of correct splits from the whole input word list
{boy+s, girl+NULL, play+ing} from {boys, girl, playing}
Some Notations
T(splitset): Set of all identified stems from splitset
T({bo+ys, girl+NULL, play+ing})={bo,girl,play}
X(splitset): Set of all identified stems from splitset
X({bo+ys, girl+NULL, play+ing})={ys,NULL,ing}
S(splitset): Set of all splits with non-empty suffix from splitset
S({bo+ys, girl+NULL, play+ing})={bo+ys, play+ing}
Motivating Example
Word forms (W) Boy
Girls Toy Boys Girl Toys
Splitset - 1 Boy+NULL Girl+s
Toy+NULL Bo+ys
Girl+NULL To+ys
Splitset - 2 Boy+NULL Girl+s
Toy+NULL Boy+s
Girl+NULL Toy+s
3 stems 2 suffixes 5 stems 3 suffixes
Motivating Example
Word forms (W) Moss
Mosses Gas Gases
Splitset – 1 Mos+s
Mos+ses Ga+s Ga+ses
2 stems, 2 suffixes 2 splits with non- empty suffix
2 stems, 2 suffixes 4 splits with non- empty suffix
Splitset - 2 Mos+s
Mos+ses Ga+s Ga+ses
Optimization Problem
Min{|T(splitset)|, |X(splitset)|, |S(splitset)|}, s.t.,
For all t in T(splitset) freq(t) ≥ 2
For all x in X(splitset) freq(x) ≥ 2
Multi objectives to a single aggregate objective
Min{λt | T(splitset)|+ λx |X(splitset)|+ λs |S(splitset)|}
This optimization problem is NP-Hard
Move to a probabilistic version
Probabilistic Model
Probability distribution Prw for each word w in the corpus
Prboy(b+oy) + Prboy(bo+y) + Prboy(boy+NULL) = 1
Learn the probability values by solving an optimization problem
Objective: Minimize,
Number of distinct prefixes with expected frequency ≥ 2
Number of distinct suffixes with expected frequency ≥ 2
Expected number of splits with non-empty suffix
Constraints: Expected frequency of any prefix,
Either ≥ 2 (selected as stem)
Or ≤ 0.00001 (not selected as stem)
Constraints: Expected frequency of any suffix,
Either ≥ 2 (selected as suffix)
Or ≤ 0.00001 (not selected as suffix)
Stem Identification Process
Aggregate multiple objectives using weight parameters
Formulate the above learning problem as a Mixed Integer Programming
Select the most probable split based on the learned probability
Tune the weight parameters based on the performance
Identify stems with tuned weight parameters
Summary
Discussed about the idea of MDL
An MDL based stemming approach is discussed
Stemming approach by minimizing number of distinct morphemes is also discussed