Subject: Statistics
Paper: Multivariate Analysis
Module: Clustering Techniques - The Partitioning Methods
Development Team
Principal investigator: Dr. Bhaswati Ganguli, Professor, Department of Statistics, University of Calcutta
Paper co-ordinator: Dr. Sugata SenRoy, Professor, Department of Statistics, University of Calcutta
Content writer: Souvik Bandyopadhyay, Senior Lecturer, Indian Institute of Public Health, Hyderabad
Content reviewer: Dr. Kalyan Das, Professor, Department of Statistics, University of Calcutta
Partitioning
Partitioning is carried out when one is more interested in the characteristics of the clusters than in the individuals themselves.
Example
Given a mix of cells, partitioning techniques may be applied to separate the normal cells from the ones in different stages of a disease. Here the interest is in isolating the different types of cells rather than in the individual cells themselves.
Thus the focus is on the cluster types and, unlike hierarchical clustering, distances between individuals are not of importance.
Aim in Partitioning
The primary aim in partitioning is to form separate clusters or groups such that
- the members within a cluster are as similar among themselves as possible
- the members of different clusters are as different as possible

This means that
- between-cluster distances should be as large as possible
- distances between the members of a cluster should be as small as possible
Two major questions
The first question that arises in partitioning is: how many clusters should be formed?
Very often, the answer is unknown to the researcher. However, a decision on this is imperative, since the usual partitioning techniques are based on a specified number of clusters.
Hence we will start by assuming that the n individuals are to be divided into k clusters, and later on come back to see how k can be optimally chosen.
Having decided on the number of clusters, the second question is: how should the individuals be allocated to these clusters?
Two broad and somewhat similar methods are:
- partitioning based on similarity
  - a popular method is k-means clustering
  - alternatively, the k-median method may be used
- partitioning by splitting the total variability
Usually the two techniques are combined: nearness to the means is used to form the clusters, and the within-cluster variability criterion is then used to check their optimality.
k-means Clustering
k-means clustering is a technique which aims to find more homogeneous subgroups within the data.
The idea is to divide the data into k distinct groups so that observations within a group are similar, whilst observations between groups are as different as possible.
The method is iterative, rather than hierarchical.
At each stage of the algorithm, data points are assigned to a fixed number of clusters. This is in contrast to hierarchical clustering, where the number of clusters ranges from the number of data points down to a single cluster.
The method starts by first selecting k nuclei around which the k initial clusters are to be formed. This can be done in several ways (a sketch of these options follows the list):
- take the first k observations
- use every [n/k]-th observation ([.] denotes the largest integer contained)
- use any random set of k observations
- select the k farthest observations (using some distance measure)
- select some central observations from high-density regions, i.e. regions where there are large clusters of observations
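The following is a minimal sketch of these selection strategies, assuming the data are the rows of an (n x p) NumPy array X; the function names are illustrative, not standard.

```python
import numpy as np

def first_k(X, k):
    # Take the first k observations as nuclei.
    return X[:k]

def every_nk_th(X, k):
    # Use every [n/k]-th observation ([.] = integer part); assumes n >= k.
    step = len(X) // k
    return X[::step][:k]

def random_k(X, k, seed=0):
    # Use a random set of k observations.
    rng = np.random.default_rng(seed)
    return X[rng.choice(len(X), size=k, replace=False)]

def farthest_k(X, k):
    # One greedy reading of "the k farthest observations": start from the
    # first point and repeatedly add the point farthest (in Euclidean
    # distance) from the nuclei chosen so far.
    nuclei = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in nuclei], axis=0)
        nuclei.append(X[np.argmax(d)])
    return np.array(nuclei)
```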
After the k nuclei are selected, each of the remaining (n - k) individuals is assigned to the cluster whose nucleus is closest to it.
Of course, the distances should be computed according to some pre-determined measure.
Once this preliminary assignment is made, the centroid of each cluster is obtained as
\bar{x}_j = \frac{1}{n_j} \sum_{i \in C_j} x_i, \qquad j = 1, \ldots, k,

where n_j is the number of elements in the jth cluster C_j.
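As a small illustration, the centroids can be computed directly from this formula, assuming `labels` holds the cluster index (0, ..., k-1) of every row of X:

```python
import numpy as np

def centroids(X, labels, k):
    # Mean of the rows of X belonging to each cluster j = 0, ..., k-1.
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])
```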
The next steps are (a sketch of the full iteration follows the list):
- If any observation is closer to another cluster's centroid than to its own, it is re-allocated to the other cluster.
- After each such re-allocation, the centroids are re-calculated.
- The process is repeated till each individual's own centroid is the one closest to it.
- With no further re-allocation needed, the k clusters constitute the final partitioning of the n individuals.
k-median partitioning
Here, instead of the distance from the centroid, the distance from the cluster medoids is considered. The rest of the technique is similar (a sketch of the medoid computation is given below).
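For concreteness, one common reading of the cluster medoid is the member of the cluster whose total distance to the other members is smallest; a sketch, assuming Euclidean distance:

```python
import numpy as np

def medoid(cluster):
    # Pairwise distances between the members of one cluster (m x m).
    d = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=2)
    # The medoid minimizes the total distance to the other members.
    return cluster[d.sum(axis=1).argmin()]
```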
Example: k-means clustering
Start with simulated data and two arbitrary centers.
[Figure: scatter plot of the simulated data (x against y), with the two arbitrary centers marked as Center 1 and Center 2]
Example (contd.)
Calculate the new centers and re-cluster the points around them.
[Figure: the points assigned to Group 1 and Group 2 after the first allocation, with the updated centers]
Example (contd.)
Recalculate the centers and re-cluster the points once more.
[Figure: the Group 1 and Group 2 clusters after the second re-allocation]
Example (contd.)
The k-means algorithm converges as no points move between the clusters in further iterations. Fix the clusters.
[Figure: the final Group 1 and Group 2 clusters after convergence]
Clustering using within cluster variability
Here the idea is to separate the observations into clusters such that
- the within-cluster variability is as small as possible
- the between-cluster variability is as large as possible
Let

T = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})',

B = \sum_{j=1}^{k} n_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})',

W = \sum_{j=1}^{k} \sum_{i \in C_j} (x_i - \bar{x}_j)(x_i - \bar{x}_j)'.
Then T = B + W.
Since T is fixed, if W decreases, B increases.
Note
For the clusters to be distinct and homogeneous, W should be small or B should be large.
Hence the split should be such that (a numerical check of the decomposition follows the list)
- W is small
- B is large
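The decomposition T = B + W can be verified numerically. A sketch, assuming X is an (n x p) NumPy array and `labels` assigns each row to a cluster:

```python
import numpy as np

def scatter_matrices(X, labels):
    # Total (T), between-cluster (B) and within-cluster (W) scatter matrices.
    xbar = X.mean(axis=0)
    T = (X - xbar).T @ (X - xbar)
    B = np.zeros_like(T)
    W = np.zeros_like(T)
    for j in np.unique(labels):
        Xj = X[labels == j]
        xbar_j = Xj.mean(axis=0)
        B += len(Xj) * np.outer(xbar_j - xbar, xbar_j - xbar)
        W += (Xj - xbar_j).T @ (Xj - xbar_j)
    return T, B, W

# For any partition, np.allclose(T, B + W) should come out True.
```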
Methods of splitting
Several criteria can be used.
Minimize trace(W)
If W_j is the within-cluster variability of the jth cluster, then minimize

\operatorname{tr} W = \sum_{j=1}^{k} \operatorname{tr} W_j.

- The method is simple.
- However, it is not invariant to orthogonal transformations, and hence different scale changes may lead to different clusters.
Alternative: Minimize |W|
- The computations are much more difficult.
- However, it often gives the best results, primarily because it takes into account the correlations between the variables.

Another alternative: Maximize tr(BW^{-1})
- Can be very simply obtained from its largest eigen-value.
- However, it can be unreliable for elongated clusters.
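For a given partition, all three criteria can be computed from the scatter matrices; a sketch, assuming the `scatter_matrices` function sketched earlier is in scope and W is nonsingular:

```python
import numpy as np

def splitting_criteria(X, labels):
    T, B, W = scatter_matrices(X, labels)
    return {
        "trace_W": np.trace(W),       # minimize: simple but scale-dependent
        "det_W": np.linalg.det(W),    # minimize: respects correlations
        "trace_BWinv": np.trace(B @ np.linalg.inv(W)),  # maximize
    }
```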
Algorithm
- As in k-means clustering, the process begins by selecting k points and forming the k initial clusters.
- Any one of the above three criteria is then decided upon and its value obtained.
- The farthest members of the clusters are re-allocated to adjoining (closer) clusters, and the value is re-calculated to check for improvement.
- The process is repeated till no further improvement in the value can be made (a sketch of one such re-allocation step follows the list).
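The following is one possible sketch of a single re-allocation step under the trace(W) criterion, assuming `labels` is an integer NumPy array with values 0, ..., k-1 and no cluster is reduced to emptiness; it is illustrative rather than optimized:

```python
import numpy as np

def improve_once(X, labels):
    # Current criterion value.
    _, _, W = scatter_matrices(X, labels)
    current = np.trace(W)
    # Centroids and distances of every point to every centroid.
    centers = np.array([X[labels == j].mean(axis=0)
                        for j in np.unique(labels)])
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    # Farthest member from its own centroid, moved to its next-closest cluster.
    i = d[np.arange(len(X)), labels].argmax()
    trial = labels.copy()
    trial[i] = np.argsort(d[i])[1]
    _, _, W_new = scatter_matrices(X, trial)
    if np.trace(W_new) < current:
        return trial, True   # improvement: accept the move
    return labels, False     # no improvement: stop
```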
Choosing the number of clusters
Till now we had assumed that the number of clusters k is pre-determined. However, this is generally unknown, and hence a decision on the number of clusters needs to be made.
Unfortunately, nearly all such choices are either subjective or are based on special types of data structure. Thus we need to keep in mind the following.
The more clusters there are, the smaller the within-cluster variability and the larger the between-cluster variability will be.
However, too many clusters will defeat the whole purpose of clustering (we will end up with n one-member clusters). Optimally, there should be only a few distinct, homogeneous clusters.
Looking for the Elbow
- Consider any of the clustering criteria discussed above, say the minimum tr W criterion.
- Let W_k be the within-cluster variability for the final k-cluster partitioning by this method, i.e. tr W_k is the minimum within-cluster variability among all partitions into k clusters.
- Compute tr W_k for several values of k, starting with k = 1 (trivial), 2, 3, ....
- Obviously, the plot of tr W_k against k will be monotonically decreasing.
- However, the decline will be rapid in the initial stages, when distinct clusters take shape, but gradual once well-defined clusters start splitting into smaller ones simply because k is increased.
- Hence the elbow of the curve, at which the bend from a rapid decrease to a gradual decline takes place, can be taken as an optimal value of k.
- Of course, the criteria |W_k| or tr(B_k W_k^{-1}) can be used similarly, keeping in mind that for the latter the curve would be upward sloping (a sketch of the elbow computation follows the list).
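A sketch of the elbow computation, assuming the `k_means` and `scatter_matrices` sketches above are in scope (a single random initialization per k is used here; in practice one would take the best of several):

```python
import numpy as np

def trace_W_curve(X, k_max=10, seed=0):
    # trace(W_k) for k = 1, ..., k_max; plot against k and look for the elbow.
    rng = np.random.default_rng(seed)
    curve = []
    for k in range(1, k_max + 1):
        init = X[rng.choice(len(X), size=k, replace=False)]
        labels, _ = k_means(X, init)
        _, _, W = scatter_matrices(X, labels)
        curve.append(np.trace(W))
    return np.array(curve)
```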
Calinski-Harabasz’s method (1974)
Define

C_k = \frac{\operatorname{tr} B_k / (k-1)}{\operatorname{tr} W_k / (n-k)} = \frac{\operatorname{tr} B_k}{\operatorname{tr} W_k} \cdot \frac{n-k}{k-1}.

- tr B_k / tr W_k will always be increasing in k.
- (n-k)/(k-1) will always be decreasing in k.
- The latter acts as a penalty for too many clusters.
- Hence it is expected that C_k would initially rise and then fall.
- Choose that k for which C_k peaks.
Special Cases
A monotonically increasing C_k suggests a hierarchical clustering. A monotonically decreasing C_k suggests no clustering.
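The index itself is a one-liner given the scatter matrices; a sketch (defined for k >= 2, and again assuming the `scatter_matrices` sketch above):

```python
import numpy as np

def calinski_harabasz(X, labels, k):
    # C_k = (tr B_k / (k-1)) / (tr W_k / (n-k)); requires k >= 2.
    n = len(X)
    _, B, W = scatter_matrices(X, labels)
    return (np.trace(B) / (k - 1)) / (np.trace(W) / (n - k))
```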
Example (contd.): What should k be?
In our example, let us see what k should be.
- We will use the trace W criterion to choose the optimal number of clusters.
- Let us try out k = 1, 2, 3, ..., 10 successively.
- For each we show the total within sum of squares criterion.
- See in the next diagram how it decreases as k increases.
- We then plot these against k and look for the elbow.
- The elbow, or the kink in the curve, gives the optimal k.
Example: Choosing k
For k = 1 (the trivial single-cluster case), the within SS is 490.7, the between SS is 0, and the total SS is 490.7.
[Figure: the data plotted as a single cluster for k = 1]
Example: Optimal k
If we plot the total within sum of squares values against k, we get the following.
[Figure: Total Within SS plotted against k for k = 1, ..., 10]
Notice that the graph flattens quickly after k = 3. So we use 3 groups.
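For completeness, a sketch reproducing this kind of elbow plot with scikit-learn, on stand-in simulated data with three groups (the original simulated data are not reproduced here; `inertia_` is the total within sum of squares, i.e. trace(W_k)):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Stand-in data: three well-separated Gaussian groups in the plane.
X = np.vstack([rng.normal(loc=(0, 0), scale=1.0, size=(50, 2)),
               rng.normal(loc=(4, 0), scale=1.0, size=(50, 2)),
               rng.normal(loc=(0, 6), scale=1.0, size=(50, 2))])

# Total within SS for k = 1, ..., 10; the curve should flatten after k = 3.
within_ss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
             for k in range(1, 11)]
```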
Summary
- The partitioning method of clustering is introduced as an alternative to hierarchical clustering.
- The k-means clustering method is discussed in detail and illustrated with an example.
- Clustering based on within- and between-cluster variability is discussed.
- Methods for identifying the optimal number of clusters are described.