
(1)

Subject: Statistics

Paper: Multivariate Analysis

Module: Clustering Techniques - The Partitioning Methods

(2)

Development Team

Principal investigator: Dr. Bhaswati Ganguli, Professor, Department of Statistics, University of Calcutta

Paper co-ordinator: Dr. Sugata SenRoy, Professor, Department of Statistics, University of Calcutta

Content writer: Souvik Bandyopadhyay, Senior Lecturer, Indian Institute of Public Health, Hyderabad

Content reviewer: Dr. Kalyan Das, Professor, Department of Statistics, University of Calcutta

(3)

Partitioning

Partitioning is carried out when one is more interested in the cluster characteristics than in the individuals themselves.

Example

Given a mix of cells, partitioning techniques may be applied to separate the normal cells from those in different stages of a disease. Here the interest is in isolating the different types of cells rather than the individual cells themselves.

Thus the focus is on the cluster types and, unlike hierarchical clustering, distances between individuals are not of importance.

(4)

Aim in Partitioning

The primary aim in partitioning is to form separate clusters or groups such that

I the members within a cluster are as similar among themselves as possible

I the members of different clusters are as different as possible

This means that

I between cluster distances should be as large as possible

I distances between the members of a cluster should be as small as possible

(5)

Two major questions

The first question that arises in partitioning techniques is: How many clusters should be formed?

Very often, the answer is unknown to the researcher.

However, a decision on this is imperative since the usual partitioning techniques are based on a specified number of clusters.

Hence we will start by assuming that the n individuals are to be divided into k clusters and later on come back to see how k can be optimally chosen.

(6)

Two major questions

Having decided on the number of clusters, the second question is: How to allocate the individuals into these clusters?

Two broad and somewhat similar methods are:

I partitioning based on similarity

I a popular method is the k-means clustering

I alternatively the k-median method may be used

I partitioning by splitting the total variability

Usually the two techniques are combined: nearness to the cluster means is used to form the clusters, and the within cluster variability criterion is then used to check the optimality of the partition.

(7)

k-means Clustering

k-means clustering is a technique which aims to find more homogeneous subgroups within the data.

The idea is to divide the data into k distinct groups so that observations within a group are similar, whilst observations between groups are as different as possible.

The method is iterative, rather than hierarchical.

At each stage of the algorithm data points will be assigned to a fixed number of clusters. This is in contrast to hierarchical clustering, where the number of clusters ranges from the number of data points down to a single cluster.

(8)

k-means Clustering

The method starts by first selecting k nuclei around which the k initial clusters are to be formed. This can be done in several ways (a short code sketch follows the list):

I Take the first k observations

I Use every [n/k]th observation ([·] denotes the largest integer contained)

I Use any random set of k observations

I Select the k farthest observations (using some distance measure)

I Select some central observations from high density regions, i.e. regions where there are large clusters of observations
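To make these options concrete, here is a minimal NumPy sketch. It is not part of the original module; the function name init_nuclei and its arguments are illustrative, and only some of the listed strategies are implemented.

```python
import numpy as np

def init_nuclei(X, k, method="random", rng=None):
    """Pick k starting nuclei from the rows of X (an n x p array) using a
    few of the strategies listed above; returns a (k x p) array."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    if method == "first":                        # the first k observations
        idx = np.arange(k)
    elif method == "spaced":                     # every [n/k]-th observation
        idx = np.arange(0, n, n // k)[:k]
    elif method == "random":                     # a random set of k observations
        idx = rng.choice(n, size=k, replace=False)
    elif method == "farthest":                   # greedily pick mutually distant points
        idx = [int(rng.integers(n))]
        for _ in range(k - 1):
            d = np.linalg.norm(X[:, None, :] - X[idx], axis=2).min(axis=1)
            idx.append(int(np.argmax(d)))
        idx = np.array(idx)
    else:
        raise ValueError(f"unknown method: {method}")
    return X[idx]
```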

(9)

k-means Clustering

After the k nuclei are selected, each of the remaining (n − k) individuals is assigned to the cluster whose nucleus is closest to it.

Of course, the distances should be computed according to some pre-determined measure.

Once this preliminary assignment is made, the centroid of each cluster is obtained as

$$\bar{x}_j = \frac{1}{n_j}\sum_{i \in C_j} x_i, \qquad j = 1, \ldots, k,$$

where $n_j$ is the number of elements in the $j$th cluster $C_j$.

(10)

k-means Clustering

The next steps are :

I If any observation is closer to another cluster’s centroid than its own, then it is re-allocated to the other cluster.

I After each such re-allocation the centroids are re-calculated.

I The process is repeated until every individual is closest to the centroid of its own cluster.

I When no further re-allocation is needed, the k clusters constitute the final partitioning of the n individuals (a sketch of this loop is given below).
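The loop described above can be sketched as follows. This is an illustrative Python implementation, not the module's own code; it uses the centroid formula from the previous slide and stops when no point is re-allocated.

```python
import numpy as np

def kmeans(X, nuclei, max_iter=100):
    """Plain k-means loop as described above: assign each point to its
    nearest centroid, recompute the centroids, and repeat until no point
    is re-allocated."""
    centroids = np.array(nuclei, dtype=float)    # copy of the k starting nuclei
    labels = None
    for _ in range(max_iter):
        # distance of every point to every current centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                # no re-allocation: converged
        labels = new_labels
        # centroid of cluster j is the mean of its members
        for j in range(len(centroids)):
            members = X[labels == j]
            if len(members):                     # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```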

k-median partitioning

Here, instead of the distance from the centroid, the distance from the cluster medoids is considered. The rest of the technique is similar.

(11)

Example: k-means clustering

Start with simulated data and two arbitrary centers.

[Figure: scatter plot of the simulated data (x versus y) with the two arbitrary centers, Center 1 and Center 2, marked.]

(12)

Example (contd.)

Calculate the new centers and cluster the points.

[Figure: scatter plot of the points coloured by their current assignment to Group 1 and Group 2.]

(13)

Example (contd.)

Recalculate the new centers and cluster the points once more.

[Figure: scatter plot showing the updated assignment of the points to Group 1 and Group 2.]

(14)

Example (contd.)

The k-means algorithm has converged since no points move between the clusters in further iterations. Fix the clusters.

[Figure: scatter plot of the final clusters, Group 1 and Group 2.]
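The example above appears to have been produced in R; a comparable illustration can be sketched in Python with scikit-learn. The group centres and sample sizes below are assumed values chosen only to mimic the kind of data plotted above, not the module's actual simulation.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# two simulated groups; the centres (0, 0) and (4, 5) are illustrative values
g1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
g2 = rng.normal(loc=[4.0, 5.0], scale=1.0, size=(50, 2))
X = np.vstack([g1, g2])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)               # final group of each point
print(km.cluster_centers_)      # final centroids
print(km.inertia_)              # total within cluster sum of squares
```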

(15)

Clustering using within cluster variability

Here the idea is to separate the observations into clusters such that

I the within cluster variability is as small as possible

I the between cluster variability is as large as possible

Let

$$T = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})',$$

$$B = \sum_{j=1}^{k} n_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})',$$

$$W = \sum_{j=1}^{k} \sum_{i \in C_j} (x_i - \bar{x}_j)(x_i - \bar{x}_j)'.$$

(16)

Clustering using within cluster variability

Then T = B + W.

Since T is fixed, if W decreases, B increases.

Note

For the clusters to be distinct and homogeneous, W should be small or B should be large.

Hence the split should be such that

I W is small

I B is large
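A small numerical sketch can be used to verify the decomposition T = B + W for any partition. The helper below is illustrative; scatter_matrices is an assumed name, not something defined in the module.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Total (T), between (B) and within (W) scatter matrices for a
    partition given by integer labels; T should equal B + W."""
    xbar = X.mean(axis=0)
    T = (X - xbar).T @ (X - xbar)
    B = np.zeros_like(T)
    W = np.zeros_like(T)
    for j in np.unique(labels):
        Xj = X[labels == j]
        xbar_j = Xj.mean(axis=0)
        B += len(Xj) * np.outer(xbar_j - xbar, xbar_j - xbar)
        W += (Xj - xbar_j).T @ (Xj - xbar_j)
    return T, B, W

# T, B, W = scatter_matrices(X, labels)
# np.allclose(T, B + W)   # decomposition check -> True
```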

(17)

Methods of splitting

Several criteria can be used.

Minimize trace(W)

If $W_j$ is the within cluster variability of the $j$th cluster, then minimize

$$\mathrm{tr}\, W = \sum_{j=1}^{k} \mathrm{tr}\, W_j.$$

I The method is simple.

I However, it is not invariant to changes of scale, and hence different scalings of the variables may lead to different clusters.

(18)

Methods of splitting

Alternative: Minimize |W|

I The computations are much more difficult.

I However, it often gives the best results, primarily because it takes into account the correlations between the variables.

Another alternative: Maximize tr(BW⁻¹)

I Can be very simply obtained from its largest eigen-value.

I However, can be unreliable for elongated clusters.
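Given the scatter matrices from the earlier sketch, the three splitting criteria can be evaluated as follows. This is an illustrative sketch; it assumes the scatter_matrices helper defined above and a non-singular W.

```python
import numpy as np

def splitting_criteria(X, labels):
    """Evaluate the three splitting criteria for a given partition:
    tr(W) and |W| are to be minimized, tr(B W^-1) is to be maximized."""
    _, B, W = scatter_matrices(X, labels)        # helper from the earlier sketch
    return {
        "trace_W": float(np.trace(W)),
        "det_W": float(np.linalg.det(W)),
        "trace_BWinv": float(np.trace(B @ np.linalg.inv(W))),
    }
```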

(19)

Algorithm

I As in k-means clustering, the process begins by selecting k points and forming the k initial clusters

I Any one of the above three criteria is then decided upon and the corresponding value obtained

I Re-allocation of the farthest members in the clusters is made to adjoining (closer) clusters and the value re-calculated to check for improvement.

I The process is repeated till no further improvement in the value can be made (a simplified sketch of such a re-allocation loop follows).
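A much-simplified version of this re-allocation loop, using tr(W) as the criterion, might look as follows. This is an illustrative sketch, not the module's algorithm: it only ever tries moving the single farthest member, and it assumes the clusters are labelled 0, ..., k−1 and are never emptied.

```python
import numpy as np

def trace_W(X, labels):
    """Criterion value: tr(W), the total within cluster sum of squares."""
    return sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
               for j in np.unique(labels))

def reallocate(X, labels, max_moves=1000):
    """Greedy re-allocation: move the member farthest from its own centroid
    to the nearest adjoining cluster, keep the move only if tr(W) improves,
    and stop when it no longer does."""
    labels = labels.copy()                       # integer labels 0, ..., k-1 assumed
    k = labels.max() + 1
    best = trace_W(X, labels)
    for _ in range(max_moves):
        cents = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        d_own = np.linalg.norm(X - cents[labels], axis=1)
        i = int(np.argmax(d_own))                # farthest member from its own centroid
        d = np.linalg.norm(X[i] - cents, axis=1)
        d[labels[i]] = np.inf                    # exclude its current cluster
        trial = labels.copy()
        trial[i] = int(np.argmin(d))             # nearest adjoining cluster
        val = trace_W(X, trial)
        if val < best:
            labels, best = trial, val
        else:
            break
    return labels
```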

(20)

Choosing the number of clusters

Till now we had assumed that the number of clusters k is pre-determined. However, this is generally unknown. Hence a decision on the number of clusters needs to be made.

Unfortunately, nearly all such choices are either subjective or are based on special types of data structure. Thus we need to keep in mind the following.

The greater the number of clusters, the smaller the within cluster variability and the larger the between cluster variability.

However, too many clusters will defeat the whole purpose of clustering (we will end up with n one-member clusters). Optimally there should be only a few distinct, homogeneous clusters.

(21)

Looking for the Elbow

I Consider any of the clustering criteria as discussed above, say the minimum tr W criterion.

I Let W_k be the within cluster variability for the final k-cluster partitioning by this method, i.e. tr W_k is the minimum within cluster variability among all partitions into k clusters.

I Compute tr W_k for several k's, starting with k = 1 (trivial), 2, 3, ...

I It is obvious that the plot of tr W_k against k would be monotonically decreasing.

(22)

Looking for the Elbow

I However, the decline will be rapid in the initial stages when distinct clusters take shape but will be gradual when well-defined clusters start splitting into smaller ones simply because k is increased.

I Hence the elbow of the curve, at which the bend from a rapid decrease to a gradual decline takes place, can be taken as an optimal value of k.

I Of course, the criteria |W_k| or tr(B_k W_k⁻¹) can be similarly used, keeping in mind that for the latter the curve would be upward sloping (a sketch using the tr W_k criterion follows).
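A sketch of the elbow computation, assuming the simulated data X from the earlier example: scikit-learn's KMeans reports inertia_, which is exactly the total within cluster sum of squares tr W_k.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 11)
# inertia_ is the total within cluster sum of squares, i.e. tr(W_k)
tw = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), tw, marker="o")
plt.xlabel("k")
plt.ylabel("tr(W_k)  (Total Within SS)")
plt.title("Looking for the elbow")
plt.show()
```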

(23)

Calinski-Harabasz’s method (1974)

Define

$$C_k = \frac{\mathrm{tr}\, B_k/(k-1)}{\mathrm{tr}\, W_k/(n-k)} = \frac{\mathrm{tr}\, B_k}{\mathrm{tr}\, W_k} \cdot \frac{n-k}{k-1}$$

I tr B_k / tr W_k will always be increasing in k

I (n − k)/(k − 1) will always be decreasing in k

I the latter acts as a penalty for too many clusters

I Hence it is expected that C_k would initially rise and then fall

I Choose that k for which C_k peaks.

Special Cases

A monotonically increasing C_k suggests a hierarchical clustering; a monotonically decreasing C_k suggests no clustering.
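The index can be computed directly from the formula above, or via scikit-learn's calinski_harabasz_score, as in the sketch below (again assuming the simulated data X from the earlier example).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

ks = range(2, 11)                                # C_k is undefined for k = 1
ch = [calinski_harabasz_score(
          X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X))
      for k in ks]
best_k = list(ks)[int(np.argmax(ch))]            # the k at which C_k peaks
print(best_k, ch)
```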

(24)

Example (contd.): What should k be?

In our example, let us see what k should be.

I We will use the trace W criterion to choose the optimal number of clusters.

I Let us try out k = 1, 2, 3, . . . , 10 successively.

I For each we show the total within sum of squares criterion.

I See in the next diagram how it decreases as k increases.

I We then plot these against k and look for the elbow.

I The elbow or the kink in the curve gives the optimal k.

(25)

Example: Choosing k

[Output for k = 1: within SS = 490.7, between SS = 0, total SS = 490.7, together with a scatter plot of the data as a single cluster.]

(26)

Example: Optimal k

If we plot the total within sum of squares against k, we get the following.

[Figure: Total Within SS plotted against K for K = 1 to 10.]

Notice that the graph flattens quickly after k=3. Use 3 groups.

(27)

Summary

I The Partitioning method of clustering is introduced as an alternative to hierarchical clustering.

I The k-means clustering method is discussed in detail and illustrated with an example.

I Clustering based on within and between cluster variability is discussed.

I Methods for identifying the optimal number of clusters are described.

(28)

Thank You
