Clustering Algorithms
Vaibhav Selot (04305813)
Md Shabeeruddin (04305901)
Roadmap
» What is Clustering?
» Motivation for Clustering Algorithms.
» Components of Clustering task.
» Issues in Clustering Algorithms.
» Clustering Algorithms.
» Conclusion.
What is Clustering?
Clustering is the unsupervised classification of patterns based on similarity.
A cluster is therefore a collection of objects which are similar
to one another and dissimilar to the objects belonging to
other clusters.
Motivation
• Supervised classification vs. unsupervised classification:
1. Supervised: provided with preclassified (labeled) patterns; the problem is to correctly classify a new pattern.
2. Unsupervised: no such supervision; the problem is to group a given collection of unlabeled patterns into meaningful clusters.
• Exploratory data analysis applications such as Data Mining, Face Recognition, Signature Matching, Fingerprint Identification, Image Segmentation, and Knowledge Acquisition have little prior knowledge about the data.
• Clustering can be used for exploring interrelationships among the data points in these applications.
Notations
Pattern X = (x1, . . . , xd), where
xi = a feature or attribute,
d = dimensionality of the pattern space.
Pattern set P = {X1, . . . , Xn};
P is an n x d pattern matrix.
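The notation above can be made concrete with a small sketch (the values below are purely illustrative):

```python
# A pattern set of n = 4 patterns in d = 2 dimensions,
# stored as an n x d pattern matrix (illustrative values).
P = [
    [1.0, 2.0],
    [1.5, 1.8],
    [8.0, 8.0],
    [9.0, 7.5],
]
n = len(P)      # number of patterns
d = len(P[0])   # dimensionality of the pattern space
```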
Components of Clustering Tasks
● Pattern representation and feature selection/extraction.
● Similarity measure.
● Clustering algorithm.
● Data abstraction.
● Assessment of output.
Pattern Representation and Feature Selection/Extraction
● Patterns are represented by d-dimensional feature vectors.
● The most discriminating features are selected.
● New features are computed using feature extraction techniques.
● To reduce the dimensionality of the problem space, only a subset of features is selected for the clustering algorithm.
● Feature selection/extraction needs a good deal of domain knowledge.
Similarity Measures
A similarity measure is a function which takes two sets of data items as input, and returns as output a similarity measure between them.
The conventional similarity measure is distance:
1) Minkowski distance
2) Manhattan distance
3) Euclidean distance
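The three conventional distances can be sketched in a few lines; Manhattan and Euclidean distance are the Minkowski distance with p = 1 and p = 2 respectively (function names are illustrative):

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two patterns x and y."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def manhattan(x, y):
    """Manhattan (city-block) distance: Minkowski with p = 1."""
    return minkowski(x, y, 1)

def euclidean(x, y):
    """Euclidean distance: Minkowski with p = 2."""
    return minkowski(x, y, 2)
```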
4) Mutual Neighbor Distance (MND)
s(Xi, Xj) = f(Xi, Xj, ξ), where ξ is the context.
MND(Xi, Xj) = NN(Xi, Xj) + NN(Xj, Xi), where NN(Xi, Xj) is the neighbor number of Xj with respect to Xi.
In Fig. 1: NN(A,B) = NN(B,A) = 1; NN(B,C) = 1, NN(C,B) = 2; so MND(A,B) = 2 and MND(B,C) = 3.
In Fig. 2: MND(A,B) = 5 and MND(B,C) = 3.
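A minimal sketch of MND, assuming Euclidean distance and distinct pairwise distances (the three collinear points play the roles of A, B, C in a Fig.-1-style configuration):

```python
def neighbor_number(points, i, j):
    """NN(Xi, Xj): rank of Xj among Xi's neighbors sorted by distance
    (1 = nearest). Assumes distinct pairwise distances for simplicity."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    others = sorted((k for k in range(len(points)) if k != i),
                    key=lambda k: dist(points[i], points[k]))
    return others.index(j) + 1

def mnd(points, i, j):
    """Mutual Neighbor Distance: NN(Xi, Xj) + NN(Xj, Xi)."""
    return neighbor_number(points, i, j) + neighbor_number(points, j, i)
```

For points A = (0, 0), B = (1, 0), C = (3, 0): A and B are each other's nearest neighbors, so MND(A, B) = 2, while MND(B, C) = 3.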
5) Conceptual measure
s(Xi, Xj) = f(Xi, Xj, ξ, ζ), where
ξ is the context and
ζ is a set of predefined concepts.
Clustering Algorithm
• The clustering algorithm uses a similarity measure.
• The choice of clustering algorithm depends on the desired properties of the final clustering result.
• Time and space complexity also affect this choice.
Data Abstraction & Assessment of Output
• Each cluster resulting from clustering is compactly described in terms of representative patterns such as centroids.
• Assessment of the clustering output is based on a specific criterion of optimality; the criterion selected depends on the domain.
Issues in Clustering Techniques
1. Agglomerative vs. divisive
2. Monothetic vs. polythetic
3. Hard vs. fuzzy
4. Deterministic vs. stochastic
5. Incremental vs. non-incremental
Taxonomy of Clustering Techniques
Hierarchical Clustering
Produces a nested series of partitions, based on a criterion for merging/splitting clusters by similarity.
Algorithm
1. Start with each data item in a distinct cluster.
2. Merge the two clusters with minimum distance.
3. Repeat step 2 until the clustering criterion is satisfied.
Clustering criterion:
• the number of desired clusters, or
• a threshold on distance.
Distance measurement
Single-link clustering: the minimum over all pairwise distances between the data items of the two clusters.
Complete-link clustering: the maximum over all pairwise distances between the data items of the two clusters.
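The agglomerative algorithm with both linkage options can be sketched as follows, assuming Euclidean distance and using the desired number of clusters k as the stopping criterion (a naive O(n^3)-style sketch, not an optimized implementation):

```python
def agglomerative(points, k, linkage="single"):
    """Agglomerative clustering sketch: start with each data item in its
    own cluster and repeatedly merge the closest pair of clusters until
    k clusters remain. linkage 'single' uses the minimum pairwise
    distance between clusters, 'complete' uses the maximum."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pair = [dist(a, b) for a in clusters[i] for b in clusters[j]]
                d_ij = min(pair) if linkage == "single" else max(pair)
                if best is None or d_ij < best[0]:
                    best = (d_ij, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the two closest clusters
    return clusters
```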
Dendrogram (figure).
Disadvantage of single-link: the chaining effect.
(Figure: single-link vs. complete-link clustering results.)
Disadvantage of hierarchical clustering
For large data sets, constructing the dendrogram is computationally expensive.
Partitional Clustering
Identifies the partition that optimizes a criterion function.
Criterion function (squared error):
e²(P, L) = Σ_{j=1..K} Σ_{i=1..nj} || Xi^(j) − Cj ||²
where
Xi^(j) is the i-th pattern belonging to the j-th cluster,
Cj is the centroid of the j-th cluster,
nj is the number of patterns in the j-th cluster,
K is the number of clusters,
P is the pattern set, and L is the set of clusters.
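The squared-error criterion can be sketched directly from the formula, assuming patterns are given as coordinate tuples grouped by cluster:

```python
def squared_error(clusters):
    """Squared-error criterion: sum over clusters of the squared
    distances from each pattern to its cluster centroid C_j."""
    total = 0.0
    for cluster in clusters:
        d = len(cluster[0])
        # Centroid C_j: per-coordinate mean of the cluster's patterns.
        centroid = [sum(p[t] for p in cluster) / len(cluster) for t in range(d)]
        total += sum(sum((p[t] - centroid[t]) ** 2 for t in range(d))
                     for p in cluster)
    return total
```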
kmeans Clustering Algorithm
1. Start with k clusters and initialize each with a randomly chosen pattern.
2. Assign each pattern to the closest cluster center.
3. Recompute the cluster centers using the current cluster memberships.
4. If a convergence criterion is not met, go to step 2.
Typical convergence criteria are: no (or minimal) reassignment of patterns to new clusters, or minimal decrease in squared error.
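The steps above can be sketched as follows, assuming Euclidean distance and using "no reassignment" as the convergence criterion (a minimal sketch, not a production implementation):

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """k-means sketch: random initial centers, assign each pattern to
    its closest center, recompute centers, repeat until assignments
    stop changing (or max_iter is reached)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1: random initial centers
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each pattern to the closest cluster center.
        new_assignment = [
            min(range(k), key=lambda j: sum((p[t] - centers[j][t]) ** 2
                                            for t in range(len(p))))
            for p in points
        ]
        if new_assignment == assignment:   # convergence: no reassignment
            break
        assignment = new_assignment
        # Step 3: recompute centers from the current memberships.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                    # keep old center if cluster empties
                centers[j] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return centers, assignment
```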
Example (figure).
Disadvantages:
1. The number of clusters must be known in advance.
2. The algorithm is sensitive to the selection of the initial partition.
3. It is not suitable for discovering clusters with non-convex shapes.
kmeans Clustering Algorithm
1. Will the algorithm terminate?
Yes: in each step we shift a data item Xi from cluster Cm to cluster Cn only when |Xi − Cn| < |Xi − Cm|, so the squared error strictly decreases; since there are only finitely many partitions, the algorithm must terminate.
2. Will it find an optimal clustering?
Not necessarily; it may converge to a local minimum of the squared error, depending on the initial partition.
kmeans Clustering Algorithm
3. Why choose the centroid when computing the sum of squared error?
At a minimum of the error, the partial derivative of the error with respect to the center location must be zero; solving this condition gives the centroid (mean) of the cluster's patterns.
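The derivative-zero argument can be illustrated numerically in one dimension: the centroid yields a strictly lower sum of squared errors than any center shifted away from it (values below are illustrative):

```python
def sse(points, c):
    """Sum of squared distances from 1-D points to a candidate center c."""
    return sum((x - c) ** 2 for x in points)

pts = [1.0, 2.0, 6.0]
centroid = sum(pts) / len(pts)   # 3.0, where the derivative is zero
# The centroid beats any center shifted slightly in either direction:
assert sse(pts, centroid) < sse(pts, centroid + 0.1)
assert sse(pts, centroid) < sse(pts, centroid - 0.1)
```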
Time Complexities
• Single-link algorithm: O(n²).
• Complete-link algorithm: O(n³): n merging steps, each requiring O(n²) comparisons.
• k-means algorithm: O(n): each iteration requires O(n), and the number of iterations is constant.
Conclusions
• The k-means algorithm is very efficient in terms of computational time, but it is very sensitive to the initial choice of centers and is less versatile.
• Hierarchical algorithms, on the other hand, are more versatile but computationally expensive.
• In practice, k-means and its variants are used for large data sets ahead of hierarchical algorithms.
References
1. A.K. Jain, M.N. Murty, and P.J. Flynn. Data Clustering: A Review. ACM Computing Surveys, 1999.
2. Andrew W. Moore. K-means and Hierarchical Clustering. www.cs.cmu.edu/~awm/tutorials.
3. Hassan. Clustering Algorithms. http://www.carleton.ca/~hmasum/clustering.html
Observing Performance of various Clustering algorithms on Sample data sets