Maximum Margin Classifiers
with Specified False-Positive and False-Negative Error Rates
Saketha Nath J, Dr. Chiranjib Bhattacharyya
Department of Computer Science and Automation, Indian Institute of Science
SDM-07
Outline
1 Introduction
    Motivation
    Previous Work
    Proposed Approach
2 Theory
    Formulation
    Geometric Interpretation
    Iterative Algorithm
    Non-linear Classifiers
3 Results
4 Conclusions
Biased Classification - Heart Disease
Biased Classification - SVM Solution
Biased Classification - Not another BOEING IDS
Biased Classification - Desired features
Preferential bias towards one class.
Guarantees on the false-positive and false-negative error rates.
Must handle unbalanced data.
Biased Classification - Previous Work
We restrict the discussion to SVMs and their variants:
Sampling methods [Chawla et al.-04, Kubat et al.-97]: under-sampling and over-sampling have their own problems.
Methods that vary C₊, C₋, and b [F. Bach et al.-05, Provost-00]: hard to build direct quantitative connections to performance.
Biased Minimax Probability Machine [Huang et al.-04]: difficult to control the bias towards one class.
Biased Classification - Proposed Approach
Uses second-order moments of the class-conditional densities, (µᵢ, Σᵢ), i = 1, 2. (This reduces the effect of unbalanced data.)
[Figure: class means µ₁, µ₂ and covariance entries σ₁₁, σ₂₁, σ₁₂, σ₂₂]
Uses η₁ and η₂, the tolerable false-negative and false-positive error rates. (Bias can be introduced by varying ηᵢ.)
Performs maximum-margin classification. (This implies good generalization.)
SVM                                             Proposed Method
Solves a convex quadratic program               Solves a convex second-order cone program (SOCP)
Minimizes the distance between convex hulls     Minimizes the distance between ellipsoids
Fast iterative algorithms exist                 A fast geometric iterative algorithm is proposed
(Nearest Point Algorithm, SMO, etc.)
Non-linear classifiers                          Non-linear classifiers
New Formulation
[Figure: hyperplanes w⊤x − b = 0, w⊤x − b = 1, w⊤x − b = −1; half-spaces H₁ (w⊤x − b ≥ 1) and H₂ (w⊤x − b ≤ −1); classes X₁ ∼ (µ₁, Σ₁) and X₂ ∼ (µ₂, Σ₂)]
We must constrain w⊤X₁ − b ≥ 1 and w⊤X₂ − b ≤ −1. This is impossible to enforce, since X₁ and X₂ are random vectors.
Instead, constrain:
    Prob(X₁ ∈ H₂) ≤ η₁,   Prob(X₂ ∈ H₁) ≤ η₂
New Formulation
Proposed formulation, schematically:
    min_{w,b}   Complexity
    s.t.        Classify X₁, X₂ correctly,   X₁ ∼ (µ₁, Σ₁),  X₂ ∼ (µ₂, Σ₂)
Concretely:
    min_{w,b}   (1/2) ‖w‖₂²
    s.t.        Prob(X₁ ∈ H₂) ≤ η₁
                Prob(X₂ ∈ H₁) ≤ η₂
                X₁ ∼ (µ₁, Σ₁),  X₂ ∼ (µ₂, Σ₂)        (1)
New Formulation
The multivariate one-sided Chebyshev inequality [Marshall et al.-60] gives:
    Prob(X₁ ∈ H₂) ≤ η₁   ⇐   w⊤µ₁ − b ≥ 1 + κ₁ √(w⊤Σ₁w)
where κᵢ = √((1 − ηᵢ)/ηᵢ); for example, ηᵢ = 0.1 gives κᵢ = 3. (Derivation in the appendix.)
Using the Chebyshev inequality, the formulation can be re-written as a deterministic optimization problem:
    min_{w,b}   (1/2) ‖w‖₂²
    s.t.        w⊤µ₁ − b ≥ 1 + κ₁ √(w⊤Σ₁w)
                b − w⊤µ₂ ≥ 1 + κ₂ √(w⊤Σ₂w)        (2)
This can be cast into standard SOCP form (with Σᵢ = CᵢCᵢ⊤):
    min_{w,b,t}   t
    s.t.          t ≥ ‖w‖₂
                  w⊤µ₁ − b ≥ 1 + κ₁ ‖C₁⊤w‖₂
                  b − w⊤µ₂ ≥ 1 + κ₂ ‖C₂⊤w‖₂        (3)
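As an illustrative sketch (not the authors' code), formulation (2) can be solved with CVXPY; the Cholesky factorization for Cᵢ, the jitter term, and all names are my choices:

```python
import numpy as np
import cvxpy as cp

def biased_socp(mu1, Sigma1, mu2, Sigma2, eta1, eta2):
    """Sketch of formulation (2): maximum-margin classifier with
    Chebyshev-based bounds eta1, eta2 on the two error rates."""
    n = mu1.shape[0]
    kappa1 = np.sqrt((1 - eta1) / eta1)   # kappa_i = sqrt((1 - eta_i) / eta_i)
    kappa2 = np.sqrt((1 - eta2) / eta2)
    # Sigma_i = C_i C_i^T via Cholesky; small jitter keeps it well-posed (assumption)
    C1 = np.linalg.cholesky(Sigma1 + 1e-8 * np.eye(n))
    C2 = np.linalg.cholesky(Sigma2 + 1e-8 * np.eye(n))
    w, b = cp.Variable(n), cp.Variable()
    constraints = [
        w @ mu1 - b >= 1 + kappa1 * cp.norm(C1.T @ w, 2),  # Prob(X1 in H2) <= eta1
        b - w @ mu2 >= 1 + kappa2 * cp.norm(C2.T @ w, 2),  # Prob(X2 in H1) <= eta2
    ]
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints).solve()
    return w.value, b.value
```

CVXPY passes this to a conic solver, which handles the second-order cone constraints in (3) directly.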
Dual - Geometric Interpretation
The dual of the proposed formulation is:
    min_{u₁,u₂}   ‖z₁ − z₂‖₂²
    s.t.          z₁ ∈ B₁(µ₁, C₁, κ₁),  z₂ ∈ B₂(µ₂, C₂, κ₂)
    where  Bᵢ(µᵢ, Cᵢ, κᵢ) = {zᵢ | zᵢ = µᵢ − κᵢ Cᵢ uᵢ, ‖uᵢ‖₂ ≤ 1}        (4)
This is the problem of finding the distance between two ellipsoids:
1 The shape of each ellipsoid is controlled by Σᵢ
2 The size of each ellipsoid is controlled by ηᵢ
3 The location of each ellipsoid is given by µᵢ
This is analogous to the SVM, whose dual is the problem of finding the distance between two convex hulls.
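The geometry can be checked numerically. A minimal sketch of dual (4), again in CVXPY (function and variable names are mine, not from the slides):

```python
import cvxpy as cp

def ellipsoid_distance_dual(mu1, C1, kappa1, mu2, C2, kappa2):
    """Sketch of dual (4): minimize ||z1 - z2||_2^2 over the ellipsoids
    B_i = { mu_i - kappa_i C_i u_i : ||u_i||_2 <= 1 }."""
    n = mu1.shape[0]
    u1, u2 = cp.Variable(n), cp.Variable(n)
    z1 = mu1 - kappa1 * C1 @ u1
    z2 = mu2 - kappa2 * C2 @ u2
    prob = cp.Problem(cp.Minimize(cp.sum_squares(z1 - z2)),
                      [cp.norm(u1, 2) <= 1, cp.norm(u2, 2) <= 1])
    prob.solve()
    return z1.value, z2.value, prob.value
```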
Effect of ηᵢ
[Figure: classifiers obtained for several values of ηᵢ]
Note that both the values of w and b change with ηᵢ.
Iterative Algorithm
A simple iterative algorithm, based on [Lin and Han-02]:
The minimum-distance points on two spheres are the points of intersection of the line segment joining the centers with the spheres.
The algorithm can be viewed as iteratively approximating the two ellipsoids locally by spheres.
The iterates are obtained by solving two simple one-dimensional quadratic equations.
The convergence of the sequence of points to the optimum can be proved.
We derive the formulae and details for our form of ellipsoids.
Computational costs:
Initialization cost is high.
Calculating the coefficients of the 1-d quadratic equations involves matrix-vector operations.
The matrices must be stored.
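The slides omit the explicit update formulae. The following is a rough sketch of the flavor of such a sphere-approximation iteration, assuming disjoint ellipsoids; the conservative tangent-ball radius and the fixed iteration count are my simplifications, not the authors' derivation:

```python
import numpy as np

def hit_boundary(p, q, mu, M):
    """First crossing of the segment p -> q with the boundary
    (z - mu)^T M (z - mu) = 1, assuming p lies strictly inside."""
    d, dp = q - p, p - mu
    a, b, c = d @ M @ d, 2 * (d @ M @ dp), dp @ M @ dp - 1.0
    t = (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)  # positive root of the 1-d quadratic
    return p + t * d

def ellipsoid_distance(mu1, M1, mu2, M2, iters=500):
    """Sphere-approximation iteration for the distance between the disjoint
    ellipsoids E_i = {z : (z - mu_i)^T M_i (z - mu_i) <= 1}.
    With M_i = (kappa_i^2 Sigma_i)^{-1} these are the B_i of dual (4)."""
    def radius(M):
        ev = np.linalg.eigvalsh(M)      # ascending; semi-axis lengths are 1/sqrt(ev)
        return np.sqrt(ev[0]) / ev[-1]  # = a_min^2 / a_max: safe inscribed-ball radius
    r1, r2 = radius(M1), radius(M2)
    c1, c2 = mu1.astype(float), mu2.astype(float)      # start from the centers
    for _ in range(iters):
        x = hit_boundary(c1, c2, mu1, M1)              # one 1-d quadratic per ellipsoid
        y = hit_boundary(c2, c1, mu2, M2)
        n1 = M1 @ (x - mu1); n1 /= np.linalg.norm(n1)  # outward normal at x
        n2 = M2 @ (y - mu2); n2 /= np.linalg.norm(n2)
        c1, c2 = x - r1 * n1, y - r2 * n2              # centers of tangent inscribed balls
    return x, y, np.linalg.norm(x - y)
```

Each update moves along the segment joining the two ball centers, exactly the geometric picture described above.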
Non-Linear Classifiers
Let T₁ = [x₁₁, …, x₁m₁] be the n × m₁ data matrix for class 1. Similarly, let T₂ = [x₂₁, …, x₂m₂] be the n × m₂ data matrix for the other class.
Let the i-th row, j-th column entry of the matrix K₁₂ be K₁₂(i, j) = x₁ᵢ⊤ x₂ⱼ.
The empirical estimates of the means and covariances are:
    µ₁ = (1/m₁) T₁ e₁,   µ₂ = (1/m₂) T₂ e₂
    Σ₁ = (1/m₁)(T₁ − µ₁e₁⊤)(T₁⊤ − e₁µ₁⊤) = (1/m₁) T₁ (I₁ − e₁e₁⊤/m₁)² T₁⊤
    Σ₂ = (1/m₂) T₂ (I₂ − e₂e₂⊤/m₂)² T₂⊤
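A direct transcription of these estimates (a sketch; T is the n × m data matrix of one class):

```python
import numpy as np

def empirical_moments(T):
    """Empirical mean and covariance as on the slide: T is n x m,
    mu = (1/m) T e, Sigma = (1/m) T (I - e e^T / m)^2 T^T."""
    n, m = T.shape
    e = np.ones(m)
    mu = T @ e / m
    P = np.eye(m) - np.outer(e, e) / m  # centering projector (idempotent)
    Sigma = T @ P @ P @ T.T / m         # squared projector, as written on the slide
    return mu, Sigma
```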
Non-Linear Classifiers
w can be written as [T₁, T₂]s + w⊥, where
    w⊥ is the component of w orthogonal to the training data points
    s is the vector of combining coefficients
The terms involving w in the constraints of (2) can then be written as:
    w⊤µ₁ = s⊤g₁,   g₁ = [K₁₁e₁/m₁ ; K₂₁e₁/m₁]
    w⊤µ₂ = s⊤g₂,   g₂ = [K₁₂e₂/m₂ ; K₂₂e₂/m₂]
    w⊤Σᵢw = s⊤Gᵢs
    G₁ = (1/m₁) [K₁₁ ; K₂₁] (I₁ − e₁e₁⊤/m₁)² [K₁₁, K₁₂]
    G₂ = (1/m₂) [K₁₂ ; K₂₂] (I₂ − e₂e₂⊤/m₂)² [K₂₁, K₂₂]
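As a sketch, the gᵢ and Gᵢ can be assembled directly from the Gram blocks; the block names follow the slides, the function itself is mine:

```python
import numpy as np

def kernel_terms(K11, K12, K22):
    """g_i and G_i from the Gram blocks K_ab between classes a and b
    (K21 = K12^T by symmetry of the kernel)."""
    m1, m2 = K11.shape[0], K22.shape[0]
    K21 = K12.T
    e1, e2 = np.ones(m1), np.ones(m2)
    g1 = np.concatenate([K11 @ e1, K21 @ e1]) / m1  # [K11 e1/m1 ; K21 e1/m1]
    g2 = np.concatenate([K12 @ e2, K22 @ e2]) / m2
    P1 = np.eye(m1) - np.outer(e1, e1) / m1         # (I1 - e1 e1^T / m1)
    P2 = np.eye(m2) - np.outer(e2, e2) / m2
    S1 = np.vstack([K11, K21])                      # [K11 ; K21], (m1+m2) x m1
    S2 = np.vstack([K12, K22])
    G1 = S1 @ P1 @ P1 @ S1.T / m1                   # (1/m1)[K11;K21] P1^2 [K11, K12]
    G2 = S2 @ P2 @ P2 @ S2.T / m2
    return g1, g2, G1, G2
```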
Non-Linear Classifiers
w⊥ = 0, because:
    the constraints do not involve w⊥
    the objective is to minimize ‖w‖₂²
Hence we can solve in any feature space given by φ : Rⁿ → R^d, as long as the dot products φ(xᵢ)⊤φ(xⱼ) are available.
One way of specifying the dot products is by kernel functions satisfying positive-definiteness conditions.
The formulation (2) can then be written as (with K = [K₁₁, K₁₂ ; K₂₁, K₂₂]):
    min_{s,b}   (1/2) s⊤Ks
    s.t.        s⊤g₁ − b ≥ 1 + κ₁ √(s⊤G₁s)
                b − s⊤g₂ ≥ 1 + κ₂ √(s⊤G₂s)        (5)
Non-Linear Classifiers
Assuming K is positive definite, write K = L⊤L, where L is a full square matrix. Re-parametrize formulation (5) in terms of the new variable v = Ls:
    min_{v,b}   (1/2) ‖v‖₂²
    s.t.        v⊤h₁ − b ≥ 1 + κ₁ √(v⊤H₁v)
                b − v⊤h₂ ≥ 1 + κ₂ √(v⊤H₂v)        (6)
where hᵢ = L⁻⊤gᵢ and Hᵢ = L⁻⊤GᵢL⁻¹.
The above formulation has the same form as the original formulation (2). Solve it using SeDuMi or the iterative algorithm (we extend the iterative algorithm to the kernelized case).
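A sketch of this change of variables; the jitter term and the use of SciPy's triangular solver are my choices:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def reparametrize(K, g1, g2, G1, G2, jitter=1e-8):
    """v = L s with K = L^T L, as in formulation (6):
    h_i = L^{-T} g_i, H_i = L^{-T} G_i L^{-1}."""
    m = K.shape[0]
    L = cholesky(K + jitter * np.eye(m))      # upper triangular, so K = L^T L
    h1 = solve_triangular(L, g1, trans='T')   # L^{-T} g1
    h2 = solve_triangular(L, g2, trans='T')
    Linv = solve_triangular(L, np.eye(m))     # L^{-1}
    H1 = Linv.T @ G1 @ Linv                   # L^{-T} G1 L^{-1}
    H2 = Linv.T @ G2 @ Linv
    return h1, h2, H1, H2
```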
Results
Table: Results on benchmark datasets, comparing the errors obtained with L-SVM and L-SOCP.

Dataset     Method    C₊/η₁    C₋/η₂    % err
PIMA        L-SVM     5.5      4.5      22.53
            L-SOCP    0.1      0.5      23.44
B. Cancer   L-SVM     5        5        5.1
            L-SOCP    0.76     0.76     2.99
Table: Results on benchmark datasets, comparing the performance of the algorithms.

Dataset                  η₁     η₂     NL-SOCP % err       L-SOCP % err
                                       +ve      -ve        +ve      -ve
Breast Cancer            0.9    0.3    12.74    00.56      16.98    00.00
m = 569, n = 30,         0.7    0.3    10.85    01.12      13.68    00.00
m1 = 212, m2 = 357,      0.5    0.3    04.72    01.96      05.19    00.84
σ = 0.032                0.3    0.3    03.30    02.24      03.77    02.80
                         0.1    0.3    02.36    04.20      ×        ×
Ring Norm                0.9    0.7    30.14    31.94      ×        ×
m = 400, n = 2,          0.7    0.7    20.57    36.65      ×        ×
m1 = 209, m2 = 191,      0.5    0.7    12.44    41.89      ×        ×
σ = 3                    0.3    0.7    10.05    46.07      ×        ×
                         0.1    0.7    07.66    47.12      ×        ×
Conclusions
We presented a maximum-margin classifier whose probability of misclassification, for each of the two classes, is bounded above by a specified value.
It can be applied to biased, unbalanced data classification; bias is introduced by varying η₁ and η₂.
The formulation is a convex SOCP, for which efficient solvers exist.
Performance is comparable to that of SVMs; moreover, the bias parameters η₁, η₂ have an elegant geometric interpretation.
A simple geometric iterative algorithm can be derived to solve the dual.
The method extends to non-linear datasets and can be solved efficiently using SeDuMi or the iterative algorithm.
THANK YOU
Derivation of the SOC constraint
One-sided Chebyshev inequality
Given a random variable X and ε ≥ 0, the one-sided Chebyshev inequality gives:
    P(X − E[X] ≤ ε) ≥ ε² / (ε² + var(X))
Theorem
Let X be an n-dimensional random vector with mean µ ∈ Rⁿ and covariance Σ ∈ R^{n×n}. Given w ∈ Rⁿ, w ≠ 0, and b ∈ R, let
    H(w, b) = {z | w⊤z < b, z ∈ Rⁿ}
be a half-space. Then
    Prob(X ∈ H) ≥ s² / (s² + w⊤Σw)        (7)
where s = (b − w⊤µ)₊ and (x)₊ = max(x, 0).
Derivation of the SOC constraint
Proof
Consider the case b > w⊤µ (the other case is trivial). In the one-sided Chebyshev inequality, take X to be w⊤X₁ (so that E[X] = w⊤µ₁ and var(X) = w⊤Σ₁w) and ε to be b − w⊤µ₁. Then
    P(X₁ ∈ H) = P(X − E[X] ≤ ε) ≥ ε² / (ε² + var(X))
Hence the result follows.
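For completeness, one way to pass from the theorem to the SOC constraint in (2); this intermediate step is not spelled out on the slides, so treat it as a reconstruction:

```latex
% Require the margin constraint with probability at least 1 - eta_1:
%   Prob(w^T X_1 - b >= 1) >= 1 - eta_1.
% Applying the theorem to the half-space {z : w^T z >= b + 1}
% (i.e., with -w and -(b+1)) gives, for s = (w^T mu_1 - b - 1)_+ :
\mathrm{Prob}\bigl(w^{\top}X_{1}-b \ge 1\bigr)
  \;\ge\; \frac{s^{2}}{s^{2}+w^{\top}\Sigma_{1}w},
\qquad
\frac{s^{2}}{s^{2}+w^{\top}\Sigma_{1}w} \ge 1-\eta_{1}
\;\Longleftrightarrow\;
s \ge \underbrace{\sqrt{\tfrac{1-\eta_{1}}{\eta_{1}}}}_{\kappa_{1}}
      \sqrt{w^{\top}\Sigma_{1}w},
% which is exactly  w^T mu_1 - b >= 1 + kappa_1 sqrt(w^T Sigma_1 w).
```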