Bayesian Learning
[Read Ch. 6]
[Suggested exercises: 6.1, 6.2, 6.6]
Bayes Theorem
MAP, ML hypotheses
MAP learners
Minimum description length principle
Bayes optimal classifier
Naive Bayes learner
Example: Learning over text data
Bayesian belief networks
Expectation Maximization algorithm
Two Roles for Bayesian Methods
Provides practical learning algorithms:
Naive Bayes learning
Bayesian belief network learning
Combine prior knowledge (prior probabilities) with observed data
Requires prior probabilities
Provides useful conceptual framework
Provides "gold standard" for evaluating other learning algorithms
Additional insight into Occam's razor
Bayes Theorem
P(h|D) = P(D|h) P(h) / P(D)

P(h) = prior probability of hypothesis h
P(D) = prior probability of training data D
P(h|D) = probability of h given D
P(D|h) = probability of D given h
Choosing Hypotheses
P(h|D) = P(D|h) P(h) / P(D)

Generally want the most probable hypothesis given the training data: the maximum a posteriori hypothesis h_MAP:

h_MAP = argmax_{h ∈ H} P(h|D)
      = argmax_{h ∈ H} P(D|h) P(h) / P(D)
      = argmax_{h ∈ H} P(D|h) P(h)

If assume P(h_i) = P(h_j) then can further simplify, and choose the maximum likelihood (ML) hypothesis:

h_ML = argmax_{h_i ∈ H} P(D|h_i)

Bayes Theorem
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the
cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present.
Furthermore, .008 of the entire population have this cancer.

P(cancer) = .008          P(¬cancer) = .992
P(+|cancer) = .98         P(−|cancer) = .02
P(+|¬cancer) = .03        P(−|¬cancer) = .97
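The posterior for a positive test follows directly from these numbers. A minimal Python sketch of the computation:

    # P(cancer | +) by Bayes theorem, using the numbers from the example
    p_cancer = 0.008
    p_not_cancer = 1 - p_cancer              # 0.992
    p_pos_given_cancer = 0.98                # correct positive rate
    p_pos_given_not_cancer = 1 - 0.97        # 0.03 false positive rate

    # P(+) by the theorem of total probability
    p_pos = p_pos_given_cancer * p_cancer + p_pos_given_not_cancer * p_not_cancer

    print(p_pos_given_cancer * p_cancer / p_pos)   # ~0.21, so h_MAP = ¬cancer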
Basic Formulas for Probabilities

Product rule: probability P(A ∧ B) of a conjunction of two events A and B:

P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)

Sum rule: probability of a disjunction of two events A and B:

P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

Theorem of total probability: if events A_1, ..., A_n are mutually exclusive with sum_{i=1}^n P(A_i) = 1, then

P(B) = sum_{i=1}^n P(B|A_i) P(A_i)

Brute Force MAP Hypothesis Learner
1. For each hypothesis h in H, calculate the posterior probability

   P(h|D) = P(D|h) P(h) / P(D)

2. Output the hypothesis h_MAP with the highest posterior probability

   h_MAP = argmax_{h ∈ H} P(h|D)
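This procedure is direct to express in code. A minimal sketch (the function names and dictionary representation are illustrative, not from the text):

    # Brute-force MAP learner: score every hypothesis in H
    def brute_force_map(hypotheses, prior, likelihood, data):
        """prior(h) = P(h); likelihood(data, h) = P(D|h)."""
        # P(D) normalizes the posterior but does not change the argmax
        p_data = sum(likelihood(data, h) * prior(h) for h in hypotheses)
        posterior = {h: likelihood(data, h) * prior(h) / p_data
                     for h in hypotheses}
        return max(posterior, key=posterior.get), posterior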
Relation to Concept Learning

Consider our usual concept learning task:
instance space X, hypothesis space H, training examples D
consider the Find-S learning algorithm (outputs most specific hypothesis from the version space VS_{H,D})

What would Bayes rule produce as the MAP hypothesis?
Does Find-S output a MAP hypothesis?
Relation to Concept Learning

Assume fixed set of instances ⟨x_1, ..., x_m⟩.
Assume D is the set of classifications: D = ⟨c(x_1), ..., c(x_m)⟩.

Choose P(D|h):
P(D|h) = 1 if h consistent with D
P(D|h) = 0 otherwise

Choose P(h) to be the uniform distribution:
P(h) = 1/|H| for all h in H

Then:

P(h|D) = 1/|VS_{H,D}|  if h is consistent with D
P(h|D) = 0             otherwise

Evolution of Posterior Probabilities
[Figure: three panels (a), (b), (c) showing the posterior over hypotheses evolving from the uniform prior P(h), to P(h|D1), to P(h|D1, D2), with probability concentrating on the hypotheses consistent with the data.]
Characterizing Learning Algorithms by Equivalent MAP Learners
[Figure: an inductive system (the Candidate Elimination algorithm, taking training examples D and hypothesis space H as input and producing output hypotheses) is equivalent to a Bayesian inference system (the brute force MAP learner with the same inputs) whose prior assumptions are made explicit: P(h) uniform, P(D|h) = 1 if consistent, 0 otherwise.]
Learning A Real Valued Function
[Figure: noisy training examples of a real-valued target f, with the maximum likelihood hypothesis h_ML fit through them; e is the noise on each target value y.]

Consider any real-valued target function f.
Training examples ⟨x_i, d_i⟩, where d_i is a noisy training value:

d_i = f(x_i) + e_i

e_i is a random variable (noise) drawn independently for each x_i according to some Gaussian distribution with mean 0.

Then the maximum likelihood hypothesis h_ML is the one that minimizes the sum of squared errors:

h_ML = argmin_{h ∈ H} sum_{i=1}^m (d_i − h(x_i))^2

Learning A Real Valued Function
h_ML = argmax_{h ∈ H} p(D|h)
     = argmax_{h ∈ H} prod_{i=1}^m p(d_i|h)
     = argmax_{h ∈ H} prod_{i=1}^m (1/sqrt(2πσ^2)) e^{−(1/2)((d_i − h(x_i))/σ)^2}

Maximize natural log of this instead...

h_ML = argmax_{h ∈ H} sum_{i=1}^m [ ln(1/sqrt(2πσ^2)) − (1/2)((d_i − h(x_i))/σ)^2 ]
     = argmax_{h ∈ H} sum_{i=1}^m −(1/2)((d_i − h(x_i))/σ)^2
     = argmax_{h ∈ H} sum_{i=1}^m −(d_i − h(x_i))^2
     = argmin_{h ∈ H} sum_{i=1}^m (d_i − h(x_i))^2
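So under Gaussian noise, any least-squares fit is a maximum likelihood fit. A minimal sketch (the target f, the noise level, and the polynomial hypothesis space are assumptions for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 50)
    d = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=x.shape)  # d_i = f(x_i) + e_i

    coeffs = np.polyfit(x, d, deg=3)     # minimizes sum_i (d_i - h(x_i))^2
    h_ml = np.polyval(coeffs, x)
    print(np.sum((d - h_ml) ** 2))       # the minimized sum of squared errors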
Learning to Predict Probabilities

Consider predicting survival probability from patient data.

Training examples ⟨x_i, d_i⟩, where d_i is 1 or 0.
Want to train neural network to output a probability given x_i (not a 0 or 1).

In this case can show

h_ML = argmax_{h ∈ H} sum_{i=1}^m [ d_i ln h(x_i) + (1 − d_i) ln(1 − h(x_i)) ]

Weight update rule for a sigmoid unit:

w_jk ← w_jk + Δw_jk

where

Δw_jk = η sum_{i=1}^m (d_i − h(x_i)) x_ijk
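A minimal sketch of this update for a single sigmoid unit (the data shapes and learning rate η are assumptions for illustration):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gradient_ascent_step(w, X, d, eta=0.1):
        """One step up the log likelihood sum_i [d_i ln h(x_i) + (1-d_i) ln(1-h(x_i))].
        X: one row per example; w: one weight per input; d: 0/1 targets."""
        h = sigmoid(X @ w)                # h(x_i) for every example
        return w + eta * (X.T @ (d - h))  # delta w_j = eta * sum_i (d_i - h(x_i)) x_ij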
Minimum Description Length Principle

Occam's razor: prefer the shortest hypothesis.
MDL: prefer the hypothesis h that minimizes

h_MDL = argmin_{h ∈ H} L_{C1}(h) + L_{C2}(D|h)

where L_C(x) is the description length of x under encoding C.

Example: H = decision trees, D = training data labels
L_{C1}(h) is # bits to describe tree h
L_{C2}(D|h) is # bits to describe D given h
Note L_{C2}(D|h) = 0 if examples classified perfectly by h. Need only describe exceptions.

Hence h_MDL trades off tree size for training errors.

Minimum Description Length Principle
h_MAP = argmax_{h ∈ H} P(D|h) P(h)
      = argmax_{h ∈ H} [ log2 P(D|h) + log2 P(h) ]
      = argmin_{h ∈ H} [ − log2 P(D|h) − log2 P(h) ]     (1)

Interesting fact from information theory: the optimal (shortest expected coding length) code for an event with probability p is −log2 p bits.

So interpret (1):
− log2 P(h) is length of h under optimal code
− log2 P(D|h) is length of D given h under optimal code

→ prefer the hypothesis that minimizes

length(h) + length(misclassifications)

Most Probable Classification of New Instances

So far we've sought the most probable hypothesis given the data D (i.e., h_MAP).
Given new instance x, what is its most probable classification?
h_MAP(x) is not the most probable classification!

Consider:
Three possible hypotheses: P(h_1|D) = .4, P(h_2|D) = .3, P(h_3|D) = .3
Given new instance x: h_1(x) = +, h_2(x) = −, h_3(x) = −
What's the most probable classification of x?

Bayes Optimal Classifier

Bayes optimal classification:
argmax_{v_j ∈ V} sum_{h_i ∈ H} P(v_j|h_i) P(h_i|D)

Example:

P(h_1|D) = .4, P(−|h_1) = 0, P(+|h_1) = 1
P(h_2|D) = .3, P(−|h_2) = 1, P(+|h_2) = 0
P(h_3|D) = .3, P(−|h_3) = 1, P(+|h_3) = 0

therefore

sum_{h_i ∈ H} P(+|h_i) P(h_i|D) = .4
sum_{h_i ∈ H} P(−|h_i) P(h_i|D) = .6

and

argmax_{v_j ∈ V} sum_{h_i ∈ H} P(v_j|h_i) P(h_i|D) = −
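The example is a three-line computation. A minimal sketch of the weighted vote (the dictionaries are just one way to encode the example):

    # Bayes optimal classification for the three-hypothesis example
    posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}    # P(h_i | D)
    prediction = {"h1": "+", "h2": "-", "h3": "-"}   # h_i(x)

    votes = {"+": 0.0, "-": 0.0}
    for h, p in posterior.items():
        votes[prediction[h]] += p                    # P(v|h_i) is 0 or 1 here
    print(max(votes, key=votes.get), votes)          # '-' wins with weight 0.6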
Gibbs Classifier

Bayes optimal classifier provides best result, but can be expensive if there are many hypotheses.

Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h|D)
2. Use this to classify new instance

Surprising fact: assume target concepts are drawn at random from H according to priors on H. Then:

E[error_Gibbs] ≤ 2 E[error_BayesOptimal]

Suppose correct, uniform prior distribution over H; then:
Pick any hypothesis from VS, with uniform probability
Its expected error no worse than twice Bayes optimal
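A minimal sketch of the Gibbs step, assuming hypotheses are callables and the posteriors are given as weights:

    import random

    def gibbs_classify(x, hypotheses, posteriors):
        # sample one hypothesis with probability P(h|D), then classify with it
        h = random.choices(hypotheses, weights=posteriors, k=1)[0]
        return h(x)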
Naive Bayes Classifier

Along with decision trees, neural networks, and nearest neighbor, one of the most practical learning methods.

When to use:
Moderate or large training set available
Attributes that describe instances are conditionally independent given classification

Successful applications:
Diagnosis
Classifying text documents
Naive Bayes Classifier

Assume target function f : X → V, where each instance x is described by attributes ⟨a_1, a_2 ... a_n⟩.

Most probable value of f(x) is:

v_MAP = argmax_{v_j ∈ V} P(v_j | a_1, a_2 ... a_n)
      = argmax_{v_j ∈ V} P(a_1, a_2 ... a_n | v_j) P(v_j) / P(a_1, a_2 ... a_n)
      = argmax_{v_j ∈ V} P(a_1, a_2 ... a_n | v_j) P(v_j)

Naive Bayes assumption:

P(a_1, a_2 ... a_n | v_j) = prod_i P(a_i | v_j)

which gives the Naive Bayes classifier:

v_NB = argmax_{v_j ∈ V} P(v_j) prod_i P(a_i | v_j)

Naive Bayes Algorithm
Naive_Bayes_Learn(examples)
  For each target value v_j
    P̂(v_j) ← estimate P(v_j)
    For each attribute value a_i of each attribute a
      P̂(a_i|v_j) ← estimate P(a_i|v_j)

Classify_New_Instance(x)

v_NB = argmax_{v_j ∈ V} P̂(v_j) prod_{a_i ∈ x} P̂(a_i|v_j)
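A minimal Python sketch of this pair of procedures, using raw frequency estimates (the data representation, a list of (attribute tuple, target) pairs, is an assumption for illustration):

    from collections import Counter, defaultdict

    def naive_bayes_learn(examples):
        """examples: list of (attrs, v) pairs; estimates by raw frequency."""
        class_counts = Counter(v for _, v in examples)
        attr_counts = defaultdict(Counter)     # attr_counts[v][(i, a)] = count
        for attrs, v in examples:
            for i, a in enumerate(attrs):
                attr_counts[v][(i, a)] += 1
        priors = {v: c / len(examples) for v, c in class_counts.items()}
        return priors, attr_counts, class_counts

    def classify_new_instance(x, priors, attr_counts, class_counts):
        def score(v):  # P^(v) * prod_i P^(a_i|v)
            p = priors[v]
            for i, a in enumerate(x):
                p *= attr_counts[v][(i, a)] / class_counts[v]
            return p
        return max(priors, key=score)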
Naive Bayes: Example

Consider PlayTennis again, and new instance

⟨Outlk = sun, Temp = cool, Humid = high, Wind = strong⟩

Want to compute:

v_NB = argmax_{v_j ∈ V} P(v_j) prod_i P(a_i|v_j)

P(y) P(sun|y) P(cool|y) P(high|y) P(strong|y) = .005
P(n) P(sun|n) P(cool|n) P(high|n) P(strong|n) = .021

→ v_NB = n
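Checking the arithmetic, assuming the frequency estimates from the usual 14-example PlayTennis table (9 yes / 5 no):

    # among "yes": sun 2/9, cool 3/9, high 3/9, strong 3/9
    # among "no":  sun 3/5, cool 1/5, high 4/5, strong 3/5
    p_yes = (9/14) * (2/9) * (3/9) * (3/9) * (3/9)
    p_no = (5/14) * (3/5) * (1/5) * (4/5) * (3/5)
    print(round(p_yes, 3), round(p_no, 3))   # 0.005 0.021 -> v_NB = n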
Naive Bayes: Subtleties
1. Conditional independence assumption is often violated:

P(a_1, a_2 ... a_n | v_j) = prod_i P(a_i | v_j)

...but it works surprisingly well anyway. Note don't need estimated posteriors P̂(v_j|x) to be correct; need only that

argmax_{v_j ∈ V} P̂(v_j) prod_i P̂(a_i|v_j) = argmax_{v_j ∈ V} P(v_j) P(a_1 ..., a_n|v_j)

see [Domingos & Pazzani, 1996] for analysis
Naive Bayes posteriors often unrealistically close to 1 or 0
Naive Bayes: Subtleties
2. What if none of the training instances with target value v_j have attribute value a_i? Then

P̂(a_i|v_j) = 0, and...

P̂(v_j) prod_i P̂(a_i|v_j) = 0

Typical solution is Bayesian estimate for P̂(a_i|v_j):

P̂(a_i|v_j) ← (n_c + m p) / (n + m)

where
n is number of training examples for which v = v_j
n_c is number of examples for which v = v_j and a = a_i
p is prior estimate for P̂(a_i|v_j)
m is weight given to prior (i.e., number of "virtual" examples)
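The m-estimate is one line of code; a small sketch with an illustrative choice of p and m:

    def m_estimate(n_c, n, p, m):
        """(n_c + m*p) / (n + m): blends observed frequency with prior p."""
        return (n_c + m * p) / (n + m)

    # e.g., attribute value never seen with this class: nonzero instead of 0
    print(m_estimate(n_c=0, n=5, p=1/3, m=3))   # 0.125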
Learning to Classify Text
Why?
Learn which news articles are of interest
Learn to classify web pages by topic
Naive Bayes is among the most effective algorithms.

What attributes shall we use to represent text documents?
Learning to Classify Text
Target concept Interesting? : Document → {+, −}

1. Represent each document by vector of words: one attribute per word position in document
2. Learning: use training examples to estimate P(+), P(−), P(doc|+), P(doc|−)

Naive Bayes conditional independence assumption:

P(doc|v_j) = prod_{i=1}^{length(doc)} P(a_i = w_k | v_j)

where P(a_i = w_k | v_j) is probability that word in position i is w_k, given v_j.

One more assumption: P(a_i = w_k | v_j) = P(a_m = w_k | v_j), for all i, m
Learn_naive_Bayes_text(Examples, V)

1. Collect all words and other tokens that occur in Examples
   Vocabulary ← all distinct words and other tokens in Examples
2. Calculate the required P(v_j) and P(w_k|v_j) probability terms
   For each target value v_j in V do
   - docs_j ← subset of Examples for which the target value is v_j
   - P(v_j) ← |docs_j| / |Examples|
   - Text_j ← a single document created by concatenating all members of docs_j
   - n ← total number of words in Text_j (counting duplicate words multiple times)
   - for each word w_k in Vocabulary
     n_k ← number of times word w_k occurs in Text_j
     P(w_k|v_j) ← (n_k + 1) / (n + |Vocabulary|)

Classify_naive_Bayes_text(Doc)

positions ← all word positions in Doc that contain tokens found in Vocabulary
Return v_NB, where

v_NB = argmax_{v_j ∈ V} P(v_j) prod_{i ∈ positions} P(a_i|v_j)
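A compact Python sketch of both procedures (documents as word lists; log probabilities avoid underflow, a standard trick not spelled out on the slide):

    from collections import Counter
    from math import log

    def learn_naive_bayes_text(examples):
        """examples: list of (words, label) pairs."""
        vocabulary = {w for words, _ in examples for w in words}
        prior, cond = {}, {}
        for v in {label for _, label in examples}:
            docs = [words for words, label in examples if label == v]
            prior[v] = len(docs) / len(examples)
            text = Counter(w for words in docs for w in words)
            n = sum(text.values())
            cond[v] = {w: (text[w] + 1) / (n + len(vocabulary)) for w in vocabulary}
        return vocabulary, prior, cond

    def classify_naive_bayes_text(doc, vocabulary, prior, cond):
        scores = {v: log(prior[v]) + sum(log(cond[v][w]) for w in doc if w in vocabulary)
                  for v in prior}
        return max(scores, key=scores.get)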
Twenty NewsGroups

Given 1000 training documents from each group, learn to classify new documents according to which newsgroup it came from:

comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x
misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
alt.atheism, soc.religion.christian, talk.religion.misc, talk.politics.mideast, talk.politics.misc, talk.politics.guns
sci.space, sci.crypt, sci.electronics, sci.med

Naive Bayes: 89% classification accuracy
Article from rec.sport.hockey
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!ogicse!uwm.edu From: xxx@yyy.zzz.edu (John Doe)
Subject: Re: This year's biggest and worst (opinion)...
Date: 5 Apr 93 09:53:39 GMT
I can only comment on the Kings, but the most obvious candidate for pleasant surprise is Alex Zhitnik. He came highly touted as a defensive
defenseman, but he's clearly much more than that.
Great skater and hard shot (though wish he were more accurate). In fact, he pretty much allowed the Kings to trade away that huge defensive
liability Paul Coffey. Kelly Hrudey is only the biggest disappointment if you thought he was any good to begin with. But, at best, he's only a mediocre goaltender. A better choice would be Tomas Sandstrom, though not through any fault of his own, but because some thugs in Toronto decided
Learning Curve for 20 Newsgroups
[Figure: learning curve for the 20News data, plotting accuracy vs. training set size (1/3 withheld for test) for Bayes, TFIDF, and PRTFIDF.]
Bayesian Belief Networks
Interesting because:
Naive Bayes assumption of conditional independence too restrictive
But it's intractable without some such assumptions...
Bayesian Belief networks describe conditional independence among subsets of variables
! allows combining prior knowledge about
(in)dependencies among variables with observed training data
(also called Bayes Nets)
Conditional Independence
Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if

(∀ x_i, y_j, z_k) P(X = x_i | Y = y_j, Z = z_k) = P(X = x_i | Z = z_k)

More compactly, we write P(X|Y, Z) = P(X|Z).

Example: Thunder is conditionally independent of Rain, given Lightning:

P(Thunder | Rain, Lightning) = P(Thunder | Lightning)

Naive Bayes uses conditional independence to justify:

P(X, Y | Z) = P(X | Y, Z) P(Y | Z)
            = P(X | Z) P(Y | Z)

Bayesian Belief Network
[Figure: network over Storm, BusTourGroup, Lightning, Campfire, Thunder, ForestFire, with the conditional probability table for Campfire:]

           S,B    S,¬B   ¬S,B   ¬S,¬B
    C      0.4    0.1    0.8    0.2
    ¬C     0.6    0.9    0.2    0.8

Network represents a set of conditional independence assertions:
Each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors.
Directed acyclic graph.
Bayesian Belief Network
[Figure: same network and Campfire table as above.]
Represents joint probability distribution over all variables, e.g., P(Storm, BusTourGroup, ..., ForestFire).

In general,

P(y_1, ..., y_n) = prod_{i=1}^n P(y_i | Parents(Y_i))

where Parents(Y_i) denotes immediate predecessors of Y_i in graph.

So the joint distribution is fully defined by the graph, plus the P(y_i | Parents(Y_i)).
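A minimal sketch of this factorization, restricted to Storm, BusTourGroup, and Campfire for brevity; the Campfire table is the one in the figure, while the two root priors are made-up placeholders:

    # P(Storm, BusTourGroup, Campfire) = P(S) P(B) P(C | S, B)
    p_storm = 0.3                                          # placeholder prior
    p_bus = 0.5                                            # placeholder prior
    p_campfire = {(True, True): 0.4, (True, False): 0.1,   # CPT from the figure
                  (False, True): 0.8, (False, False): 0.2}

    def joint(storm, bus, campfire):
        p = (p_storm if storm else 1 - p_storm) * (p_bus if bus else 1 - p_bus)
        pc = p_campfire[(storm, bus)]
        return p * (pc if campfire else 1 - pc)

    print(joint(storm=True, bus=False, campfire=True))     # 0.3 * 0.5 * 0.1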
Inference in Bayesian Networks

How can one infer the (probabilities of) values of one or more network variables, given observed values of others?
Bayes net contains all information needed for this inference
If only one variable with unknown value, easy to infer it
In general case, problem is NP-hard. In practice, can succeed in many cases:
Exact inference methods work well for some network structures
Monte Carlo methods "simulate" the network randomly to calculate approximate solutions
Learning of Bayesian Networks
Several variants of this learning task
Network structure might be known or unknown
Training examples might provide values of all network variables, or just some
If structure known and observe all variables
Then it's as easy as training a Naive Bayes classifier
Learning Bayes Nets
Suppose structure known, variables partially observable
e.g., observe ForestFire, Storm, BusTourGroup, Thunder, but not Lightning, Campfire...
Similar to training neural network with hidden units
In fact, can learn network conditional probability tables using gradient ascent!
Converge to network h that (locally) maximizes P(D|h)

Gradient Ascent for Bayes Nets
Let w_ijk denote one entry in the conditional probability table for variable Y_i in the network:

w_ijk = P(Y_i = y_ij | Parents(Y_i) = the list u_ik of values)

e.g., if Y_i = Campfire, then u_ik might be ⟨Storm = T, BusTourGroup = F⟩

Perform gradient ascent by repeatedly:

1. Update all w_ijk using training data D:

   w_ijk ← w_ijk + η sum_{d ∈ D} P_h(y_ij, u_ik | d) / w_ijk

2. Then renormalize the w_ijk to assure sum_j w_ijk = 1 and 0 ≤ w_ijk ≤ 1

More on Learning Bayes Nets
EM algorithm can also be used. Repeatedly:
1. Calculate probabilities of unobserved variables, assuming h
2. Calculate new w_ijk to maximize E[ln P(D|h)], where D now includes both observed and (calculated probabilities of) unobserved variables

When structure unknown...
Algorithms use greedy search to add/subtract edges and nodes
Active research topic
Summary: Bayesian Belief Networks
Combine prior knowledge with observed data
Impact of prior knowledge (when correct!) is to lower the sample complexity
Active research area
- Extend from boolean to real-valued variables
- Parameterized distributions instead of tables
- Extend to first-order instead of propositional systems
- More effective inference methods
- ...

Expectation Maximization (EM)
When to use:
Data is only partially observable
Unsupervised clustering (target value unobservable)
Supervised learning (some instance attributes unobservable)
Some uses:
Train Bayesian Belief Networks
Unsupervised clustering (AUTOCLASS)
Learning Hidden Markov Models
Generating Data from Mixture of k Gaussians
[Figure: p(x) for a mixture of k Gaussians along x.]

Each instance x generated by
1. Choosing one of the k Gaussians with uniform probability
2. Generating an instance at random according to that Gaussian
EM for Estimating k Means
Given:
Instances from X generated by mixture of k Gaussian distributions
Unknown means ⟨μ_1, ..., μ_k⟩ of the k Gaussians
Don't know which instance x_i was generated by which Gaussian

Determine:
Maximum likelihood estimates of ⟨μ_1, ..., μ_k⟩

Think of full description of each instance as y_i = ⟨x_i, z_i1, z_i2⟩, where
z_ij is 1 if x_i generated by jth Gaussian
x_i observable
z_ij unobservable

EM for Estimating k Means
EM Algorithm: Pick random initial h = ⟨μ_1, μ_2⟩, then iterate:

E step: Calculate the expected value E[z_ij] of each hidden variable z_ij, assuming the current hypothesis h = ⟨μ_1, μ_2⟩ holds:

E[z_ij] = p(x = x_i | μ = μ_j) / sum_{n=1}^2 p(x = x_i | μ = μ_n)
        = e^{−(x_i − μ_j)^2 / (2σ^2)} / sum_{n=1}^2 e^{−(x_i − μ_n)^2 / (2σ^2)}

M step: Calculate a new maximum likelihood hypothesis h′ = ⟨μ_1′, μ_2′⟩, assuming the value taken on by each hidden variable z_ij is its expected value E[z_ij] calculated above. Then replace h = ⟨μ_1, μ_2⟩ by h′ = ⟨μ_1′, μ_2′⟩:

μ_j ← sum_{i=1}^m E[z_ij] x_i / sum_{i=1}^m E[z_ij]
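A minimal sketch of these two steps for a two-Gaussian mixture with known σ = 1 (the synthetic data and true means 0 and 4 are assumptions for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(0, 1, 100), rng.normal(4, 1, 100)])
    mu = np.array([-1.0, 1.0])                 # initial hypothesis h

    for _ in range(20):
        # E step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2))
        w = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2)
        w /= w.sum(axis=1, keepdims=True)
        # M step: mu_j <- sum_i E[z_ij] x_i / sum_i E[z_ij]
        mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)

    print(mu)                                  # approaches the true means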
EM Algorithm

Converges to local maximum likelihood h and provides estimates of hidden variables z_ij.

In fact, local maximum in E[ln P(Y|h)]:
Y is complete (observable plus unobservable variables) data
Expected value is taken over possible values of unobserved variables in Y
General EM Problem
Given:
Observed data X = {x_1, ..., x_m}
Unobserved data Z = {z_1, ..., z_m}
Parameterized probability distribution P(Y|h), where
- Y = {y_1, ..., y_m} is the full data, y_i = x_i ∪ z_i
- h are the parameters

Determine:
h that (locally) maximizes E[ln P(Y|h)]

Many uses:
Train Bayesian belief networks
Unsupervised clustering (e.g., k means)
Hidden Markov Models
General EM Method
Define likelihood function Q(h′|h), which calculates Y = X ∪ Z using observed X and current parameters h to estimate Z:

Q(h′|h) ← E[ln P(Y|h′) | h, X]

EM Algorithm:

Estimation (E) step: Calculate Q(h′|h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y:

Q(h′|h) ← E[ln P(Y|h′) | h, X]

Maximization (M) step: Replace hypothesis h by the hypothesis h′ that maximizes this Q function:

h ← argmax_{h′} Q(h′|h)