Clustering Algorithms

Academic year: 2022

(1)

Clustering Algorithms

Vaibhav Selot (04305813)

Md Shabeeruddin (04305901)

(2)

Roadmap

» What is Clustering?

» Motivation for Clustering Algorithms.

» Components of a clustering task.

» Issues in Clustering Algorithms.

» Clustering Algorithms.

» Conclusion.

(3)

What is Clustering?

Clustering is unsupervised classification of patterns based on similarity.

A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters.

(4)

Motivation

• Supervised classification vs. unsupervised classification:
  1. Supervised: provided with pre-classified (labeled) patterns; the problem is to correctly classify a new pattern.
  2. Unsupervised: no such supervision; the problem is to group a given collection of unlabeled patterns into meaningful clusters.

• Exploratory data analysis applications such as data mining, face recognition, signature matching, fingerprint matching, image segmentation, and knowledge acquisition have little prior knowledge about the data.

• Clustering can be used to explore the inter-relationships among the data points in these applications.

(5)

Notations

Pattern X = (x1, . . . , xd)

where xi = a feature or attribute
      d  = dimensionality of the pattern space

Pattern set P = {X1, . . . , Xn}

P is an n x d pattern matrix.
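The notation above can be made concrete with a small sketch (the coordinate values here are illustrative, not from the slides): a pattern set of n patterns, each a d-dimensional feature vector, stored row-wise as an n x d pattern matrix.

```python
# A hypothetical pattern set: n = 4 patterns, each a d = 2 dimensional
# feature vector (x1, x2), stacked row-wise as an n x d pattern matrix P.
P = [
    [1.0, 2.0],   # X1
    [1.5, 1.8],   # X2
    [8.0, 8.0],   # X3
    [8.2, 7.9],   # X4
]

n = len(P)      # number of patterns
d = len(P[0])   # dimensionality of the pattern space
print(n, d)     # 4 2
```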

(6)

Components of a Clustering Task

• Pattern representation and feature selection/extraction.

• Similarity measure.

• Clustering algorithm.

• Data abstraction.

• Assessment of output.

(7)

Pattern Representation and Feature Selection/Extraction

Patterns are represented by d-dimensional feature vectors.

The most discriminating features are selected.

New features are computed using feature extraction techniques.

To reduce the dimensionality of the problem space, only a subset of features is selected for the clustering algorithm.

Feature selection/extraction needs a good deal of domain knowledge.

(8)

Similarity Measures

A similarity measure is a function that takes two sets of data items as input and returns a similarity measure between them as output.

The conventional similarity measure is distance:

1) Minkowski distance
2) Manhattan distance
3) Euclidean distance
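The three distance measures above can be sketched in a few lines of Python (function names are mine, not from the slides); Manhattan and Euclidean distance are the Minkowski distance of order p = 1 and p = 2 respectively.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length patterns."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def manhattan(x, y):
    """Manhattan (city-block) distance: Minkowski with p = 1."""
    return minkowski(x, y, 1)

def euclidean(x, y):
    """Euclidean distance: Minkowski with p = 2."""
    return minkowski(x, y, 2)

x, y = (0.0, 0.0), (3.0, 4.0)
print(manhattan(x, y))  # 7.0
print(euclidean(x, y))  # 5.0
```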

(9)

Similarity Measures

4) Mutual Neighbor Distance (MND)

s(Xi, Xj) = f(Xi, Xj, ξ), where ξ is the context.

MND(Xi, Xj) = NN(Xi, Xj) + NN(Xj, Xi)

where NN(Xi, Xj) is the neighbor number of Xj with respect to Xi.

In Fig. 1: NN(A,B) = NN(B,A) = 1, NN(B,C) = 1, NN(C,B) = 2, so MND(A,B) = 2 and MND(B,C) = 3.

In Fig. 2: MND(A,B) = 5 and MND(B,C) = 3.
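A minimal sketch of MND, assuming Euclidean distance for the underlying neighbor ranking (the point layout below is illustrative and does not reproduce the slides' figures):

```python
import math

def neighbor_number(points, i, j):
    """NN(Xi, Xj): rank of Xj among the neighbors of Xi (1 = nearest),
    using Euclidean distance. points is a list of coordinate tuples."""
    others = sorted((k for k in range(len(points)) if k != i),
                    key=lambda k: math.dist(points[i], points[k]))
    return others.index(j) + 1

def mnd(points, i, j):
    """Mutual Neighbor Distance: NN(Xi, Xj) + NN(Xj, Xi)."""
    return neighbor_number(points, i, j) + neighbor_number(points, j, i)

# A and B are each other's nearest neighbors; C sits farther out.
pts = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]  # A, B, C
print(mnd(pts, 0, 1))  # MND(A, B) = 1 + 1 = 2
print(mnd(pts, 1, 2))  # MND(B, C) = 2 + 1 = 3
```

Note that MND is context-dependent: inserting extra points between B and C would change the neighbor ranks and hence the distance, which is the point of the two figures.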

(10)

Similarity Measures

5) Conceptual Measure

s(Xi, Xj) = f(Xi, Xj, ξ, ζ)

where ξ is the context and ζ is a set of predefined concepts.

(11)

Clustering Algorithm

• A clustering algorithm uses a similarity measure.

• The choice of clustering algorithm depends on the desired properties of the final clustering result.

• Time and space complexity also affect this choice.

(12)

Data Abstraction & Assessment of Output

• Each cluster produced by clustering is compactly described in terms of representative patterns, such as centroids.

• Assessment of the clustering output is based on a specific criterion of optimality. The criterion selected depends on the domain.

(13)

Issues in Clustering Techniques

1. Agglomerative vs. divisive
2. Monothetic vs. polythetic
3. Hard vs. fuzzy
4. Deterministic vs. stochastic
5. Incremental vs. non-incremental

(14)

Taxonomy of Clustering Techniques

(15)

Hierarchical Clustering

Produces a nested series of partitions, based on a criterion for merging/splitting clusters according to similarity.

Algorithm

1. Start with each data item in a distinct cluster.
2. Merge the two clusters with minimum distance.
3. Repeat step 2 until the clustering criterion is satisfied.

Clustering Criterion

• Number of desired clusters.
• Threshold on distance.

(16)

Hierarchical Clustering

Distance Measurement

Single-link clustering: the minimum of the pairwise distances between the data items of the two clusters.

Complete-link clustering: the maximum of the pairwise distances between the data items of the two clusters.

(17)

Hierarchical Clustering

Dendrogram

(18)

Hierarchical Clustering

Disadvantage of single-link: the chaining effect.

(Figures: single-link vs. complete-link)

(19)

Disadvantage of Hierarchical Clustering

For large data sets, constructing the dendrogram is computationally expensive.

Partitional Clustering

Identifies the partition that optimizes a criterion function.

Criterion function (squared error):

e^2(P, L) = \sum_{j=1}^{K} \sum_{i=1}^{n_j} \lVert X_i^{(j)} - C_j \rVert^2

where X_i^{(j)} is the i-th pattern belonging to the j-th cluster,
C_j is the centroid of the j-th cluster,
n_j is the number of patterns in the j-th cluster,
K is the number of clusters,
P is the pattern set, and L is the set of clusters.

(20)

Partitional Clustering

k-means Clustering Algorithm

1. Start with k clusters, each initialized with a randomly chosen pattern as its center.
2. Assign each pattern to the closest cluster center.
3. Recompute the cluster centers using the current cluster memberships. If a convergence criterion is not met, go to step 2.

Typical convergence criteria are: no (or minimal) reassignment of patterns to new clusters, or a minimal decrease in squared error.
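The three steps above can be sketched as follows (a minimal implementation assuming Euclidean distance; the sample points and the seed are illustrative):

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """k-means sketch following the three steps above."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # step 1: random initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # step 2: assign each pattern to the closest cluster center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[j].append(p)
        # step 3: recompute centers as the centroids of the current clusters
        new_centers = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:               # convergence: no center moved
            break
        centers = new_centers
    return centers, clusters

pts = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.1, 4.9)]
centers, clusters = kmeans(pts, 2)
print(centers)
```

The result depends on which patterns are drawn as initial centers, which is exactly the sensitivity to the initial partition listed among the disadvantages below.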

(21)

k-means Clustering Algorithm

Example

Disadvantages:

1. The number of clusters must be known in advance.
2. The algorithm is sensitive to the selection of the initial partition.
3. It is not suitable for discovering clusters with non-convex shapes.

(22)

k-means Clustering Algorithm

1. Will the algorithm terminate?

   In each step, we shift data item Xi from cluster Cm to Cn only when

   | Xi - Cn | < | Xi - Cm |

2. Will it find an optimal clustering?

(23)

k-means Clustering Algorithm

3. Why choose the centroid when calculating the sum of squared error?

   The partial derivative of the error with respect to the center location must be zero.
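The zero-derivative argument above can be written out explicitly. For a single cluster with patterns X_1, ..., X_n and center C, the squared error and its gradient are:

```latex
e^2(C) = \sum_{i=1}^{n} \lVert X_i - C \rVert^2, \qquad
\frac{\partial e^2}{\partial C} = -2 \sum_{i=1}^{n} (X_i - C) = 0
\;\Longrightarrow\; C = \frac{1}{n} \sum_{i=1}^{n} X_i .
```

So the error-minimizing center is exactly the centroid (mean) of the cluster's patterns.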

(24)

Time Complexities

Single-link algorithm: O(n^2)

Complete-link algorithm: O(n^3)
   n merging steps, each requiring O(n^2) comparisons.

k-means algorithm: O(n)
   Each iteration requires O(n); constant number of iterations.

(25)

Conclusions

The k-means algorithm is very efficient in terms of computational time, but it is very sensitive to the initial choice of centers and is less versatile.

On the other hand, hierarchical algorithms are more versatile, but they are computationally expensive.

In practice, k-means and its variants are used for large data sets ahead of hierarchical algorithms.

(26)

References

1. A.K. Jain, M.N. Murty, and P.J. Flynn. Data Clustering: A Review. ACM Computing Surveys, 1999.

2. Andrew W. Moore. K-means and Hierarchical Clustering. www.cs.cmu.edu/~awm/tutorials.

3. Hassan. Clustering Algorithms. http://www.carleton.ca/~hmasum/clustering.html

(27)

Proposal for Project

Observing the performance of various clustering algorithms on sample data sets.
