
Maximum Margin Classifiers

with Specified False-Positive and False-Negative Error Rates

Saketha Nath J, Dr. Chiranjib Bhattacharyya

Department of Computer Science and Automation, Indian Institute of Science

SDM-07


Outline

1 Introduction
  Motivation
  Previous Work
  Proposed Approach

2 Theory
  Formulation
  Geometric Interpretation
  Iterative Algorithm
  Non-linear Classifiers

3 Results

4 Conclusions


Biased Classification - Heart Disease


Biased Classification - SVM Solution


Biased Classification - Not another BOEING IDS


Biased Classification - Desired features

Preferential bias to one class.

Guarantees on the false-positive and false-negative error rates.

Must handle unbalanced data.


We restrict the discussion to SVMs and their variants:

Sampling methods [Chawla et al.-04, Kubat et al.-97]: under-sampling and over-sampling have problems of their own.

Methods like varying C₊, C₋ and b [F. Bach et al.-05, Provost-00]: hard to build direct quantitative connections to performance.

Biased Minimax Probability Machine [Huang et al.-04]: difficult to control the bias towards one class.


Biased Classification - Proposed Approach

Uses second-order moments of the class-conditional densities ((µᵢ, Σᵢ), i = 1, 2). (The effect of unbalanced data is reduced.)

[Figure: the two classes summarized by their means µ₁, µ₂ and covariances Σ₁, Σ₂.]

Uses η₁, η₂: the tolerable false-negative and false-positive error rates. (By varying ηᵢ, bias can be introduced.)

Does maximum margin classification. (Implies good generalization.)


SVM                                             Proposed Method
Solves a convex Quadratic Program               Solves a convex Second Order Cone Program (SOCP)
Minimizes distance between convex hulls         Minimizes distance between ellipsoids
Fast iterative algorithms exist                 A fast geometric iterative algorithm is proposed
(Nearest Point Algorithm, SMO, etc.)
Non-linear classifiers                          Non-linear classifiers



New Formulation

[Figure: separating hyperplane wᵀx − b = 0 with margin hyperplanes wᵀx − b = 1 and wᵀx − b = −1, defining the half-spaces H₁ (wᵀx − b ≥ 1) and H₂ (wᵀx − b ≤ −1); the classes are modelled as random vectors X₁ ∼ (µ₁, Σ₁) and X₂ ∼ (µ₂, Σ₂).]

We would like to constrain wᵀX₁ − b ≥ 1 and wᵀX₂ − b ≤ −1, but this is impossible since X₁ and X₂ are random vectors.

Instead constrain:

Prob(X₁ ∈ H₂) ≤ η₁,   Prob(X₂ ∈ H₁) ≤ η₂

Proposed formulation (schematically):

    min_{w,b}   Complexity
    s.t.        classify X₁, X₂ correctly,
                X₁ ∼ (µ₁, Σ₁),  X₂ ∼ (µ₂, Σ₂)

Proposed formulation:

    min_{w,b}   (1/2) ||w||₂²
    s.t.        Prob(X₁ ∈ H₂) ≤ η₁,
                Prob(X₂ ∈ H₁) ≤ η₂,
                X₁ ∼ (µ₁, Σ₁),  X₂ ∼ (µ₂, Σ₂)                               (1)

The multivariate one-sided Chebyshev inequality [Marshall et al.-60] gives

    Prob(X₁ ∈ H₂) ≤ η₁   ⇐   wᵀµ₁ − b ≥ 1 + κ₁ √(wᵀΣ₁w),   where κᵢ = √((1 − ηᵢ)/ηᵢ)

The formulation can be re-written as a deterministic optimization problem using the Chebyshev inequality:

    min_{w,b}   (1/2) ||w||₂²
    s.t.        wᵀµ₁ − b ≥ 1 + κ₁ √(wᵀΣ₁w),
                b − wᵀµ₂ ≥ 1 + κ₂ √(wᵀΣ₂w)                                  (2)

This can be cast into standard SOCP form (Σᵢ = CᵢᵀCᵢ):

    min_{w,b,t}  t
    s.t.        t ≥ ||w||₂,
                wᵀµ₁ − b ≥ 1 + κ₁ ||C₁w||₂,
                b − wᵀµ₂ ≥ 1 + κ₂ ||C₂w||₂                                  (3)
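As an illustration only, here is a minimal sketch of formulation (2) in Python with numpy and cvxpy; the solver choice, the function name biased_maxmargin, and the Cholesky construction of the Cᵢ are our assumptions (the paper itself reports solving the SOCP with SeDuMi):

    import numpy as np
    import cvxpy as cp

    def biased_maxmargin(mu1, Sigma1, eta1, mu2, Sigma2, eta2):
        """Sketch of formulation (2): maximum-margin classifier with specified
        false-negative (eta1) and false-positive (eta2) error rates.
        Assumes Sigma1, Sigma2 are positive definite."""
        kappa1 = np.sqrt((1 - eta1) / eta1)          # kappa_i = sqrt((1 - eta_i)/eta_i)
        kappa2 = np.sqrt((1 - eta2) / eta2)
        C1 = np.linalg.cholesky(Sigma1).T            # Sigma_i = C_i^T C_i
        C2 = np.linalg.cholesky(Sigma2).T

        w = cp.Variable(mu1.shape[0])
        b = cp.Variable()
        constraints = [
            mu1 @ w - b >= 1 + kappa1 * cp.norm(C1 @ w, 2),   # Prob(X1 in H2) <= eta1
            b - mu2 @ w >= 1 + kappa2 * cp.norm(C2 @ w, 2),   # Prob(X2 in H1) <= eta2
        ]
        problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
        problem.solve()
        return w.value, b.value

If the requested error rates are too demanding for the given moments, the program is infeasible and w.value is returned as None.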

Dual - Geometric Interpretation

The dual of the proposed formulation is:

    min_{u₁,u₂}  ||z₁ − z₂||₂²
    s.t.         z₁ ∈ B₁(µ₁, C₁, κ₁),  z₂ ∈ B₂(µ₂, C₂, κ₂),
                 Bᵢ(µᵢ, Cᵢ, κᵢ) = {zᵢ | zᵢ = µᵢ − κᵢCᵢᵀuᵢ, ||uᵢ||₂ ≤ 1}       (4)

This is the problem of finding the distance between two ellipsoids:

1 The shape of the ellipsoids is controlled by Σᵢ
2 The size of the ellipsoids is controlled by ηᵢ
3 The location of the ellipsoids is controlled by µᵢ

Analogous to the SVM, where the dual is the problem of finding the distance between two convex hulls.

Effect of ηᵢ

Note that both the values of w and b change with ηᵢ.
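To make this concrete, the hypothetical biased_maxmargin sketch from above can be swept over η₁ with η₂ fixed; the moments below are made-up toy values, not taken from the paper:

    import numpy as np

    # toy second-order moments for the two classes (invented values)
    mu1, Sigma1 = np.array([2.0, 2.0]), np.eye(2)
    mu2, Sigma2 = np.array([-2.0, -2.0]), 2.0 * np.eye(2)

    # tighten the tolerable false-negative rate while keeping eta2 fixed
    for eta1 in (0.5, 0.3, 0.1):
        w, b = biased_maxmargin(mu1, Sigma1, eta1, mu2, Sigma2, eta2=0.3)
        print(f"eta1={eta1}: w={np.round(w, 3)}, b={b:.3f}")

Both the orientation w and the offset b of the hyperplane shift as η₁ is decreased.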

Iterative Algorithm

A simple iterative algorithm, based on [Lin and Han 02]:

The closest points on two spheres are the points where the line segment joining their centers intersects the spheres.
The algorithm can be viewed as iteratively approximating the two ellipsoids locally by spheres.
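To make the sphere-approximation idea concrete, here is a rough numpy sketch for the dual ellipsoids Bᵢ = {z : (z − µᵢ)ᵀΣᵢ⁻¹(z − µᵢ) ≤ κᵢ²}, assuming they do not intersect. The tangent-ball radius used below comes from a generic descent-lemma bound, so this illustrates the idea rather than reproducing the exact construction of [Lin and Han 02]; the function names are our own:

    import numpy as np

    def ellipsoid_closest_points(mu1, Sigma1, kappa1, mu2, Sigma2, kappa2,
                                 tol=1e-8, max_iter=500):
        """Sketch: closest points between the ellipsoids
        B_i = {z : (z - mu_i)^T Sigma_i^{-1} (z - mu_i) <= kappa_i^2},
        assumed disjoint, via local sphere approximations."""
        A1 = np.linalg.inv(Sigma1) / kappa1**2    # B_i = {f_i <= 0}, f_i(z) = (z - mu_i)^T A_i (z - mu_i) - 1
        A2 = np.linalg.inv(Sigma2) / kappa2**2
        L1 = 2.0 * np.linalg.eigvalsh(A1).max()   # Lipschitz constants of the gradients of f_i
        L2 = 2.0 * np.linalg.eigvalsh(A2).max()

        def hit_boundary(p_in, p_out, A, mu):
            # intersection of the segment [p_in, p_out] with {f = 0}; p_in lies inside
            d, v = p_out - p_in, p_in - mu
            a, b, c = d @ A @ d, 2.0 * (v @ A @ d), v @ A @ v - 1.0
            t = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)   # the root in (0, 1]
            return p_in + t * d

        # initialize on the segment joining the two centers
        z1 = hit_boundary(mu1, mu2, A1, mu1)
        z2 = hit_boundary(mu2, mu1, A2, mu2)
        for _ in range(max_iter):
            c1 = z1 - 2.0 * A1 @ (z1 - mu1) / L1  # center of a ball inside B_1, tangent at z1
            c2 = z2 - 2.0 * A2 @ (z2 - mu2) / L2  # center of a ball inside B_2, tangent at z2
            # the closest points of the two balls lie on [c1, c2]; re-intersect with the ellipsoids
            z1_new = hit_boundary(c1, c2, A1, mu1)
            z2_new = hit_boundary(c2, c1, A2, mu2)
            if max(np.linalg.norm(z1_new - z1), np.linalg.norm(z2_new - z2)) < tol:
                return z1_new, z2_new
            z1, z2 = z1_new, z2_new
        return z1, z2

By the geometric interpretation of the dual, the optimal separating direction w is parallel to z₁ − z₂ at the closest points, in analogy with the convex-hull picture for the SVM dual.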


Solving two simple 1-D quadratic equations, one can obtain the iterates.
One can prove the convergence of the sequence of points to the optimum.
We derive the formulae and details for our form of ellipsoids.

Computational considerations:
Initialization cost is high.
Calculation of the coefficients of the 1-D quadratic equations involves matrix-vector operations.
The matrices must be stored.

Non-Linear Classifiers

Let T₁ = [x₁₁, . . . , x₁,m₁] be the n × m₁ data matrix for class 1.
Similarly, let T₂ = [x₂₁, . . . , x₂,m₂] be the n × m₂ data matrix for the other class.
Let the i-th row, j-th column entry of the matrix K₁₂ be K₁₂(i, j) = x₁ᵢᵀx₂ⱼ.

The empirical estimates of the means and covariances are (eᵢ is the mᵢ-vector of ones, Iᵢ the mᵢ × mᵢ identity):

    µ₁ = (1/m₁) T₁e₁,   µ₂ = (1/m₂) T₂e₂,
    Σ₁ = (1/m₁) (T₁ − µ₁e₁ᵀ)(T₁ − µ₁e₁ᵀ)ᵀ = (1/m₁) T₁ (I₁ − e₁e₁ᵀ/m₁)² T₁ᵀ,
    Σ₂ = (1/m₂) T₂ (I₂ − e₂e₂ᵀ/m₂)² T₂ᵀ

w can be written as [T₁, T₂]s + w⊥, where w⊥ is the component of w orthogonal to the training data points and s is the vector of combining coefficients.

The terms involving w in the constraints of (2) can then be written as:

    wᵀµ₁ = sᵀg₁,   g₁ = [K₁₁e₁/m₁ ; K₂₁e₁/m₁],
    wᵀµ₂ = sᵀg₂,   g₂ = [K₁₂e₂/m₂ ; K₂₂e₂/m₂],
    wᵀΣᵢw = sᵀGᵢs,
    G₁ = (1/m₁) [K₁₁ ; K₂₁] (I₁ − e₁e₁ᵀ/m₁)² [K₁₁ , K₁₂],
    G₂ = (1/m₂) [K₁₂ ; K₂₂] (I₂ − e₂e₂ᵀ/m₂)² [K₂₁ , K₂₂]
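As a concrete reading of these expressions, here is a small numpy sketch (the array layout, with class-1 points listed first, and the function name kernel_moments are our own assumptions) that builds g₁, g₂, G₁ and G₂ from the full kernel matrix over the training set:

    import numpy as np

    def kernel_moments(K, m1):
        """Given the (m1+m2) x (m1+m2) kernel matrix K over the training points
        (class 1 first, then class 2), return g1, g2, G1, G2 as defined above."""
        m2 = K.shape[0] - m1
        K1 = K[:, :m1]                              # [K11 ; K21]: columns for class 1
        K2 = K[:, m1:]                              # [K12 ; K22]: columns for class 2
        e1, e2 = np.ones(m1), np.ones(m2)
        P1 = np.eye(m1) - np.outer(e1, e1) / m1     # centering matrices
        P2 = np.eye(m2) - np.outer(e2, e2) / m2
        g1 = K1 @ e1 / m1                           # g1 = [K11 e1/m1 ; K21 e1/m1]
        g2 = K2 @ e2 / m2
        G1 = K1 @ (P1 @ P1) @ K1.T / m1             # G1 = (1/m1)[K11;K21] P1^2 [K11,K12]
        G2 = K2 @ (P2 @ P2) @ K2.T / m2             # (note [K11,K12] = [K11;K21]^T since K is symmetric)
        return g1, g2, G1, G2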

w⊥ = 0 because:
the constraints do not involve w⊥, and
the objective is to minimize ||w||₂².

Hence we can solve in any feature space given by φ : Rⁿ → Rᵈ, as long as the dot products φ(xᵢ)ᵀφ(xⱼ) are available.
One way of specifying the dot product is by kernel functions satisfying positive-definiteness conditions.

The formulation (2) can then be written as (K = [K₁₁, K₁₂ ; K₂₁, K₂₂]):

    min_{s,b}   (1/2) sᵀKs
    s.t.        sᵀg₁ − b ≥ 1 + κ₁ √(sᵀG₁s),
                b − sᵀg₂ ≥ 1 + κ₂ √(sᵀG₂s)                                  (5)

Assuming K is positive definite, K = LᵀL, where L is a full square matrix. Re-parametrize formulation (5) in terms of a new variable v = Ls:

    min_{v,b}   (1/2) ||v||₂²
    s.t.        vᵀh₁ − b ≥ 1 + κ₁ √(vᵀH₁v),
                b − vᵀh₂ ≥ 1 + κ₂ √(vᵀH₂v)                                  (6)

    where hᵢ = L⁻ᵀgᵢ and Hᵢ = L⁻ᵀGᵢL⁻¹.

The above formulation is similar to the original formulation (2). Solve using SeDuMi or the iterative algorithm (we extend the iterative algorithm to the kernelized case).
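A minimal sketch of this re-parametrization, again with cvxpy standing in for the SOCP solver reported in the paper (SeDuMi); the small ridge added to K and the eigen-factorization of Hᵢ into FᵢᵀFᵢ are implementation conveniences assumed here, not part of the paper:

    import numpy as np
    import cvxpy as cp

    def kernel_biased_maxmargin(K, g1, G1, eta1, g2, G2, eta2, ridge=1e-10):
        """Sketch of the kernelized formulations (5)/(6): solve in v = L s,
        then map back to the combining coefficients s."""
        m = K.shape[0]
        kappa1 = np.sqrt((1 - eta1) / eta1)
        kappa2 = np.sqrt((1 - eta2) / eta2)
        L = np.linalg.cholesky(K + ridge * np.eye(m)).T     # K = L^T L, L upper-triangular
        Linv = np.linalg.inv(L)
        h1, h2 = Linv.T @ g1, Linv.T @ g2                   # h_i = L^{-T} g_i

        def factor(G):
            # H = L^{-T} G L^{-1}; return F with F^T F = H, so sqrt(v' H v) = ||F v||_2
            vals, V = np.linalg.eigh(Linv.T @ G @ Linv)
            return (V * np.sqrt(np.clip(vals, 0.0, None))).T

        F1, F2 = factor(G1), factor(G2)
        v = cp.Variable(m)
        b = cp.Variable()
        constraints = [h1 @ v - b >= 1 + kappa1 * cp.norm(F1 @ v, 2),
                       b - h2 @ v >= 1 + kappa2 * cp.norm(F2 @ v, 2)]
        cp.Problem(cp.Minimize(0.5 * cp.sum_squares(v)), constraints).solve()
        s = Linv @ v.value                                  # v = L s  =>  s = L^{-1} v
        return s, b.value

Since w = [T₁, T₂]s in the feature space, a new point x is then classified from the sign of Σⱼ sⱼ k(xⱼ, x) − b.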


Results

Table: Results on benchmark datasets, comparing the errors obtained with L-SVM and L-SOCP.

    Dataset      Method    C₊ / η₁    C₋ / η₂    % err
    PIMA         L-SVM     5.5        4.5        22.53
                 L-SOCP    0.1        0.5        23.44
    B. Cancer    L-SVM     5          5          5.1
                 L-SOCP    0.76       0.76       2.99

Table: Results on benchmark datasets, comparing the performance of the algorithms.

    Dataset                   η₁     η₂     NL-SOCP % err       L-SOCP % err
                                            +ve      -ve        +ve      -ve
    Breast Cancer             0.9    0.3    12.74    00.56      16.98    00.00
    m = 569, n = 30,          0.7    0.3    10.85    01.12      13.68    00.00
    m₁ = 212, m₂ = 357,       0.5    0.3    04.72    01.96      05.19    00.84
    σ = 0.032                 0.3    0.3    03.30    02.24      03.77    02.80
                              0.1    0.3    02.36    04.20      ×        ×
    Ring Norm                 0.9    0.7    30.14    31.94      ×        ×
    m = 400, n = 2,           0.7    0.7    20.57    36.65      ×        ×
    m₁ = 209, m₂ = 191,       0.5    0.7    12.44    41.89      ×        ×
    σ = 3                     0.3    0.7    10.05    46.07      ×        ×
                              0.1    0.7    07.66    47.12      ×        ×


Conclusions

We presented a maximum margin classifier whose probability of misclassification, for each of the two classes, is bounded above by a specified value.

Can be applied to the case of biased, unbalanced data classification: vary η₁, η₂ to introduce bias.

Convex SOCP formulation; efficient solvers exist.

Performance is comparable to that of SVMs. However, the parameters introducing bias, η₁ and η₂, have an elegant geometric interpretation.

A simple geometric iterative algorithm can be derived to solve the dual.

Can be extended to non-linear datasets; solved efficiently using SeDuMi or the iterative algorithm.

THANK YOU


Derivation of the SOC constraint

One-sided Chebyshev’s Inequality

Given a random variable X and ε ≥ 0, the one-sided Chebyshev inequality gives:

    P(X − E[X] ≤ ε) ≥ ε² / (ε² + var(X))

Theorem

Let X be an n-dimensional random vector with mean µ ∈ Rⁿ and covariance Σ ∈ Rⁿˣⁿ. Given w ∈ Rⁿ, w ≠ 0, and b ∈ R, let

    H(w, b) = {z ∈ Rⁿ | wᵀz < b}

be a half-space. Then

    Prob(X ∈ H) ≥ s² / (s² + wᵀΣw)                                          (7)

where s = (b − wᵀµ)₊ and (x)₊ = max(x, 0).

Proof

Consider the case b > wᵀµ (the other case is trivial). In the one-sided Chebyshev inequality, take the scalar random variable to be wᵀX₁ (so that its mean is wᵀµ₁ and its variance is wᵀΣ₁w) and ε to be b − wᵀµ₁. Then

    P(X₁ ∈ H) = P(wᵀX₁ − wᵀµ₁ ≤ ε) ≥ ε² / (ε² + wᵀΣ₁w),

and the result follows.
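As a quick numerical sanity check of (7), a toy example of our own (the bound is distribution-free; a Gaussian is used here simply because it is easy to sample with the stated moments):

    import numpy as np

    rng = np.random.default_rng(0)

    mu = np.array([1.0, 0.0])
    Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
    w = np.array([1.0, -1.0])
    b = 4.0                                    # half-space H = {z : w'z < b}

    s = max(b - w @ mu, 0.0)
    bound = s**2 / (s**2 + w @ Sigma @ w)      # right-hand side of (7)

    X = rng.multivariate_normal(mu, Sigma, size=200_000)
    freq = np.mean(X @ w < b)                  # empirical Prob(X in H) for this distribution
    print(f"lower bound: {bound:.3f}, empirical: {freq:.3f}")   # expect roughly 0.789 vs 0.974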
