Clustering Algorithms
Vaibhav Selot (04305813)
Md Shabeeruddin (04305901)
Roadmap
» What is Clustering?
» Motivation for Clustering Algorithms.
» Components of Clustering task.
» Issues in Clustering Algorithms.
» Clustering Algorithms.
» Conclusion.
What is Clustering?
Clustering is the unsupervised classification of patterns based on similarity.
A cluster is therefore a collection of objects which are similar
to one another and dissimilar to the objects belonging to
other clusters.
Motivation
• Supervised classification vs. unsupervised classification:
1. Supervised: provided with preclassified (labeled) patterns; the problem is to correctly classify a new pattern.
2. Unsupervised: no such supervision; the problem is to group a given collection of unlabeled patterns into meaningful clusters.
• Exploratory data analysis applications such as Data Mining, Face Recognition, Signature Matching, Fingerprint Identification, Image Segmentation, and Knowledge Acquisition have little prior knowledge about the data.
• Clustering can be used for exploring interrelationships among the data points in these applications.
Notations
Pattern X = (x1, . . . , xd), where
xi = a feature or attribute,
d = dimensionality of the pattern space.
Pattern set P = {X1, . . . , Xn};
P is an n x d pattern matrix.
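The notation above can be made concrete with a small sketch (the values below are purely illustrative):

```python
# A pattern set of n = 4 patterns in d = 2 dimensions,
# stored as an n x d pattern matrix (illustrative values).
P = [
    [1.0, 2.0],
    [1.5, 1.8],
    [8.0, 8.0],
    [9.0, 7.5],
]
n = len(P)      # number of patterns
d = len(P[0])   # dimensionality of the pattern space
```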
Components of Clustering Tasks
● Pattern representation and feature selection/extraction.
● Similarity measure.
● Clustering algorithm.
● Data abstraction.
● Assessment of output.
Pattern Representation and Feature Selection/Extraction
● Patterns are represented by d-dimensional feature vectors.
● The most discriminating features are selected.
● New features are computed using feature extraction techniques.
● To reduce the dimensionality of the problem space, only a subset of features is selected for the clustering algorithm.
● Feature selection/extraction needs a good deal of domain knowledge.
Similarity Measures
A similarity measure is a function which takes two sets of data items as input, and returns as output a similarity measure between them.
The conventional similarity measure is distance:
1) Minkowski distance
2) Manhattan distance
3) Euclidean distance
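The three conventional distances can be sketched in a few lines; Manhattan and Euclidean distance are the Minkowski distance with p = 1 and p = 2 respectively (function names are illustrative):

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two patterns x and y."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def manhattan(x, y):
    """Manhattan (city-block) distance: Minkowski with p = 1."""
    return minkowski(x, y, 1)

def euclidean(x, y):
    """Euclidean distance: Minkowski with p = 2."""
    return minkowski(x, y, 2)
```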
4) Mutual Neighbor Distance (MND)
s(Xi, Xj) = f(Xi, Xj, ξ), where ξ is the context.
MND(Xi, Xj) = NN(Xi, Xj) + NN(Xj, Xi), where NN(Xi, Xj) is the neighbor number of Xj with respect to Xi.
In Fig. 1: NN(A,B) = NN(B,A) = 1; NN(B,C) = 1, NN(C,B) = 2; so MND(A,B) = 2 and MND(B,C) = 3.
In Fig. 2: MND(A,B) = 5 and MND(B,C) = 3.
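A minimal sketch of MND, assuming Euclidean distance and distinct pairwise distances (the three collinear points play the roles of A, B, C in a Fig.-1-style configuration):

```python
def neighbor_number(points, i, j):
    """NN(Xi, Xj): rank of Xj among Xi's neighbors sorted by distance
    (1 = nearest). Assumes distinct pairwise distances for simplicity."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    others = sorted((k for k in range(len(points)) if k != i),
                    key=lambda k: dist(points[i], points[k]))
    return others.index(j) + 1

def mnd(points, i, j):
    """Mutual Neighbor Distance: NN(Xi, Xj) + NN(Xj, Xi)."""
    return neighbor_number(points, i, j) + neighbor_number(points, j, i)
```

For points A = (0, 0), B = (1, 0), C = (3, 0): A and B are each other's nearest neighbors, so MND(A, B) = 2, while MND(B, C) = 3.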
5) Conceptual measure
s(Xi, Xj) = f(Xi, Xj, ξ, ζ), where
ξ is the context and
ζ is a set of predefined concepts.
Clustering Algorithm
• The clustering algorithm uses a similarity measure.
• The choice of clustering algorithm depends on the desired properties of the final clustering result.
• Time and space complexity also affect this choice.
Data Abstraction & Assessment of Output
• Each cluster resulting from clustering is compactly described in terms of representative patterns such as centroids.
• Assessment of the clustering output is based on a specific criterion of optimality; the criterion selected depends on the domain.
Issues in Clustering Techniques
1. Agglomerative vs. divisive
2. Monothetic vs. polythetic
3. Hard vs. fuzzy
4. Deterministic vs. stochastic
5. Incremental vs. non-incremental
Taxonomy of Clustering Techniques
Hierarchical Clustering
Produces a nested series of partitions, based on a criterion for merging/splitting clusters by similarity.
Algorithm
1. Start with each data item in a distinct cluster.
2. Merge the two clusters with minimum distance.
3. Repeat step 2 until the clustering criterion is satisfied.
Clustering criterion:
• the number of desired clusters, or
• a threshold on distance.
Distance measurement
Single-link clustering: the minimum over all pairwise distances between the data items of the two clusters.
Complete-link clustering: the maximum over all pairwise distances between the data items of the two clusters.
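The agglomerative algorithm with both linkage options can be sketched as follows, assuming Euclidean distance and using the desired number of clusters k as the stopping criterion (a naive O(n^3)-style sketch, not an optimized implementation):

```python
def agglomerative(points, k, linkage="single"):
    """Agglomerative clustering sketch: start with each data item in its
    own cluster and repeatedly merge the closest pair of clusters until
    k clusters remain. linkage 'single' uses the minimum pairwise
    distance between clusters, 'complete' uses the maximum."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pair = [dist(a, b) for a in clusters[i] for b in clusters[j]]
                d_ij = min(pair) if linkage == "single" else max(pair)
                if best is None or d_ij < best[0]:
                    best = (d_ij, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the two closest clusters
    return clusters
```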
Dendrogram (figure).
Disadvantage of single-link: the chaining effect.
(Figure: single-link vs. complete-link clustering results.)
Disadvantage of hierarchical clustering
For large data sets, constructing the dendrogram is computationally expensive.
Partitional Clustering
Identifies the partition that optimizes a criterion function.
Criterion function (squared error):
e²(P, L) = Σ_{j=1..K} Σ_{i=1..nj} || Xi^(j) − Cj ||²
where
Xi^(j) is the i-th pattern belonging to the j-th cluster,
Cj is the centroid of the j-th cluster,
nj is the number of patterns in the j-th cluster,
K is the number of clusters,
P is the pattern set, and L is the set of clusters.
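The squared-error criterion can be sketched directly from the formula, assuming patterns are given as coordinate tuples grouped by cluster:

```python
def squared_error(clusters):
    """Squared-error criterion: sum over clusters of the squared
    distances from each pattern to its cluster centroid C_j."""
    total = 0.0
    for cluster in clusters:
        d = len(cluster[0])
        # Centroid C_j: per-coordinate mean of the cluster's patterns.
        centroid = [sum(p[t] for p in cluster) / len(cluster) for t in range(d)]
        total += sum(sum((p[t] - centroid[t]) ** 2 for t in range(d))
                     for p in cluster)
    return total
```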
kmeans Clustering Algorithm
1. Start with k clusters and initialize each with a randomly chosen pattern.
2. Assign each pattern to the closest cluster center.
3. Recompute the cluster centers using the current cluster memberships.
4. If a convergence criterion is not met, go to step 2.
Typical convergence criteria are: no (or minimal) reassignment of patterns to new clusters, or minimal decrease in squared error.
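The steps above can be sketched as follows, assuming Euclidean distance and using "no reassignment" as the convergence criterion (a minimal sketch, not a production implementation):

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """k-means sketch: random initial centers, assign each pattern to
    its closest center, recompute centers, repeat until assignments
    stop changing (or max_iter is reached)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1: random initial centers
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each pattern to the closest cluster center.
        new_assignment = [
            min(range(k), key=lambda j: sum((p[t] - centers[j][t]) ** 2
                                            for t in range(len(p))))
            for p in points
        ]
        if new_assignment == assignment:   # convergence: no reassignment
            break
        assignment = new_assignment
        # Step 3: recompute centers from the current memberships.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                    # keep old center if cluster empties
                centers[j] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return centers, assignment
```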
Example (figure).
Disadvantages:
1. The number of clusters must be known in advance.
2. The algorithm is sensitive to the selection of the initial partition.
3. It is not suitable for discovering clusters with non-convex shapes.
kmeans Clustering Algorithm
1. Will the algorithm terminate?
Yes: in each step we shift a data item Xi from cluster Cm to cluster Cn only when |Xi − Cn| < |Xi − Cm|, so the squared error strictly decreases; since there are only finitely many partitions, the algorithm must terminate.
2. Will it find an optimal clustering?
Not necessarily; it may converge to a local minimum of the squared error, depending on the initial partition.
kmeans Clustering Algorithm
3. Why choose the centroid when computing the sum of squared error?
At a minimum of the error, the partial derivative of the error with respect to the center location must be zero; solving this condition gives the centroid (mean) of the cluster's patterns.
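The derivative-zero argument can be illustrated numerically in one dimension: the centroid yields a strictly lower sum of squared errors than any center shifted away from it (values below are illustrative):

```python
def sse(points, c):
    """Sum of squared distances from 1-D points to a candidate center c."""
    return sum((x - c) ** 2 for x in points)

pts = [1.0, 2.0, 6.0]
centroid = sum(pts) / len(pts)   # 3.0, where the derivative is zero
# The centroid beats any center shifted slightly in either direction:
assert sse(pts, centroid) < sse(pts, centroid + 0.1)
assert sse(pts, centroid) < sse(pts, centroid - 0.1)
```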
Time Complexities
• Single-link algorithm: O(n²).
• Complete-link algorithm: O(n³): n merging steps, each requiring O(n²) comparisons.
• k-means algorithm: O(n): each iteration requires O(n), and the number of iterations is constant.
Conclusions
• The k-means algorithm is very efficient in terms of computational time, but it is very sensitive to the initial choice of centers and is less versatile.
• Hierarchical algorithms, on the other hand, are more versatile but computationally expensive.
• In practice, k-means and its variants are used for large data sets ahead of hierarchical algorithms.
References
1. A.K. Jain, M.N. Murty, and P.J. Flynn. Data Clustering: A Review. ACM Computing Surveys, 1999.
2. Andrew W. Moore. K-means and Hierarchical Clustering. www.cs.cmu.edu/~awm/tutorials.
3. Hassan. Clustering Algorithms. http://www.carleton.ca/~hmasum/clustering.html
Observing Performance of various Clustering algorithms on Sample data sets