This post is an introduction to clustering. No hard math, just simple pictures to get a rough idea of what it is.
This aims to set the basis for ensemble clustering, which will be the topic of another post.
In Machine Learning, clustering is an unsupervised learning task where the goal is to segment items \(X\) into groups without prior information. It is analogous to classification in supervised learning, where we have prior information \((X, Y)\).
Unsegmented points:
Items segmented into three groups:
Above, from a visual point of view, the clustering “seems” good. However, because we don’t have ground-truth assignments, maybe there are only two groups instead of three, or maybe the groups are formed differently. Below are two examples of other possible outcomes.
Three groups with a different assignment:
Two groups:
In 2D, it is easy to select the outcome that fits best. However, when we deal with more than 10 dimensions, visual selection becomes a nightmare, and it is almost impossible to rely on subjective human evaluation. We will explore this point later on.
There are many clustering algorithms, and many axes along which to classify them. One possible axis is the cluster description:
Distance-based / Density-based
This is only one axis; we could also sort clustering approaches by splitting procedure: partitional algorithms (the large majority) versus hierarchical ones. There are also probabilistic and deterministic algorithms.
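As a rough illustration of the distance-based versus density-based distinction, here is a minimal sketch (assuming scikit-learn is available; the dataset and parameter values are just illustrative) that runs K-means and DBSCAN on the same synthetic data. The two interleaved half-moons are recovered by density, not by distance to a center.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

# Two interleaved half-moons: groups defined by density, not by compact blobs.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Distance-based: each point goes to the nearest of k centers.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Density-based: clusters are dense regions separated by sparse ones.
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

print("K-means cluster sizes:", np.bincount(kmeans_labels))
print("DBSCAN cluster sizes :", np.bincount(dbscan_labels[dbscan_labels >= 0]))
```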
There are two kinds of assignment: hard and soft (or fuzzy) clustering. Here, we will only talk about hard clustering.
Soft clustering is helpful when there is a lot of noise. However, in the case of ensemble clustering, most algorithms require a hard partitioning as input.
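As a small sketch of the difference (scikit-learn assumed, with an illustrative blob dataset), a Gaussian mixture model returns soft membership probabilities, and taking the argmax turns them into the hard partition that most ensemble methods expect.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# Soft assignment: one membership probability per cluster for each item.
soft = gmm.predict_proba(X)          # shape (200, 3), rows sum to 1

# Hard assignment: keep only the most likely cluster.
hard = soft.argmax(axis=1)           # equivalent to gmm.predict(X)

print(soft[0].round(3), "->", hard[0])
```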
A fundamental question is “What is a good clustering?”. Because the task is unsupervised, there is no ground truth to compare to.
There are different ways to evaluate a clustering:
External validity is only possible when a complete dataset \(D = (X, Y)\) is available, so it is not always applicable. This kind of evaluation tells us how frequently the algorithm recovers the knowledge \(Y\). In a real data-mining situation, we would not know whether what is recovered is interesting or not until we perform a cluster analysis.
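For example, when labels \(Y\) are known, an external index such as the adjusted Rand index can measure how well the recovered partition matches them. A minimal sketch with scikit-learn (the dataset and algorithm choices are just placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, y = load_iris(return_X_y=True)          # here the labels Y happen to be known
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# 1.0 = perfect recovery of Y, around 0.0 = no better than chance.
print("Adjusted Rand index:", adjusted_rand_score(y, labels))
```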
We may want to perturb the clustering to see how reliable the obtained result is.
In many clustering algorithms (but not all), it is possible to define the desired number of clusters. We can change this number to get more or fewer clusters.
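A minimal sketch (scikit-learn assumed, illustrative data) of running the same algorithm with different requested numbers of clusters:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Same data, same algorithm, different requested number of clusters.
for k in (2, 3, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, "clusters requested ->", len(set(labels)), "groups found")
```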
The number of clusters is one kind of parameter; it is possible to play with others. For DBSCAN, we can change the number of items necessary to create a core point. We can also change the distance measure (Manhattan, Euclidean, Minkowski).
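For instance, with scikit-learn’s DBSCAN the `min_samples` parameter controls how many neighbors are needed to form a core point, and `metric` switches the distance measure. A small sketch (parameter values are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Same eps, different core-point requirements and distance measures.
for min_samples, metric in [(5, "euclidean"), (15, "euclidean"), (5, "manhattan")]:
    labels = DBSCAN(eps=1.0, min_samples=min_samples, metric=metric).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
    print(f"min_samples={min_samples}, metric={metric}: {n_clusters} clusters")
```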
The result of a clustering may depend on the initialization seed. For instance, in K-means, centers are initialized at random, so running the algorithm again can produce a different outcome.
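A quick sketch (scikit-learn assumed) showing that two runs with different seeds can give different partitions; `init="random"` and `n_init=1` make the effect visible, and the overlapping blobs are chosen on purpose:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=5, cluster_std=2.0, random_state=0)

# A single random initialization per run, so the seed really matters.
run1 = KMeans(n_clusters=5, init="random", n_init=1, random_state=1).fit_predict(X)
run2 = KMeans(n_clusters=5, init="random", n_init=1, random_state=2).fit_predict(X)

# 1.0 means identical partitions (up to label permutation); lower means they differ.
print("Agreement between runs:", adjusted_rand_score(run1, run2))
```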
For some algorithms, mostly prototype-based ones, it is possible to run the algorithm on a subset of the data. This is helpful, for computational reasons, when the dataset is very large.
It is possible to run the clustering algorithm multiple times with different training subsets. The leftover items are later assigned to the best-matching prototype.
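A minimal sketch of this idea (scikit-learn assumed; the subset size is arbitrary): fit the prototypes on a random subset, then assign every remaining item to its nearest center with `predict`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=10_000, centers=4, random_state=0)

# Fit the prototypes on a random 10% subset of the data.
rng = np.random.default_rng(0)
subset = rng.choice(len(X), size=len(X) // 10, replace=False)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X[subset])

# Assign every item (including the leftover 90%) to its nearest prototype.
labels = kmeans.predict(X)
print("Cluster sizes:", np.bincount(labels))
```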
It is also possible to remove some features or to change the normalization procedure in order to change the outcome.
A more general way to affect the processing is to use random projections.
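A small sketch (scikit-learn assumed, illustrative dimensions) of perturbing the representation before clustering, here with a Gaussian random projection; re-scaling or dropping features would play a similar role:

```python
from sklearn.datasets import make_blobs
from sklearn.random_projection import GaussianRandomProjection
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=500, n_features=20, centers=4, random_state=0)

base = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Project onto a random lower-dimensional space, then cluster the projection.
X_proj = GaussianRandomProjection(n_components=5, random_state=1).fit_transform(X)
projected = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_proj)

print("Agreement with the original partition:", adjusted_rand_score(base, projected))
```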
The big question is “How to evaluate the result?”. In classification, there are many metrics: accuracy, F1, precision, recall, …, there is a large choice! For clustering, there are essentially two ways to evaluate an algorithm: external validation, which compares the partition to known labels when they are available, and internal validation, which relies on geometric criteria such as compactness and separation.
So, evaluation is not easy.
The silhouette score is calculated for each item \(\mathbf{x}_i\), using two coefficients:
Calling \(\mathcal{C} = \{C_1, C_2, ..., C_k\}\) the partition with \(k\) clusters, the expressions of these two coefficients are:
\[a(i) = \frac{1}{|C_I| - 1} \sum_{j \in C_I,\, j \neq i} d(\mathbf{x}_i, \mathbf{x}_j), \qquad b(i) = \min_{C \in \mathcal{C},\, i \notin C} \frac{1}{|C|} \sum_{j \in C} d(\mathbf{x}_i, \mathbf{x}_j)\]where \(C_I\) is the cluster containing \(\mathbf{x}_i\). In other words, \(a(i)\) is the average distance from \(\mathbf{x}_i\) to the other items of its own cluster, and \(b(i)\) is the average distance to the items of the closest neighboring cluster.
Then, the score of item \(i\) is:
\[s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}\]The item’s score lies in \([-1, 1]\), indicating whether the point is well assigned (\(1\)) or not (\(-1\)).
The global clustering score is the average over clusters, each cluster having the same weight regardless of its size:
\[S_{tot} = \frac{1}{k} \sum_{C \in \mathcal{C}} \frac{1}{|C|} \sum_{i \in C} s(i)\]
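As a small sketch (scikit-learn assumed, illustrative data), `silhouette_samples` returns the per-item scores \(s(i)\); note that the usual `silhouette_score` averages over items, so the cluster-weighted average \(S_{tot}\) above is computed by hand here:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

s = silhouette_samples(X, labels)          # one score per item, in [-1, 1]

# Average over clusters, each cluster weighted equally regardless of its size.
S_tot = np.mean([s[labels == c].mean() for c in np.unique(labels)])

print("Item-averaged silhouette   :", silhouette_score(X, labels))
print("Cluster-averaged silhouette:", S_tot)
```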
1. J. MacQueen, “Some methods for classification and analysis of multivariate observations,” Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281-296, 1967.
2. L. Kaufman and P. J. Rousseeuw, “Clustering by means of medoids,” 1987.
3. R. T. Ng and J. Han, “CLARANS: a method for clustering objects for spatial data mining,” IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 5, pp. 1003-1016, 2002. doi:10.1109/TKDE.2002.1033770
4. H. Tong and U. Kang, “Big Data Clustering,” chapter 11, 2013.
5. D. A. Reynolds, “Gaussian Mixture Models,” in S. Z. Li and A. K. Jain (eds.), Encyclopedia of Biometrics, pp. 659-663, Springer US, 2009. ISBN 978-0-387-73003-5.
6. S. Guha, R. Rastogi, and K. Shim, “CURE: An Efficient Clustering Algorithm for Large Databases,” SIGMOD Record, vol. 27, no. 2, pp. 73-84, 1998.
7. M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in KDD, vol. 96, pp. 226-231, 1996.
8. L. McInnes et al., “hdbscan: Hierarchical density based clustering,” Journal of Open Source Software, 2(11), 205, 2017. doi:10.21105/joss.00205
9. Y. Cheng, “Mean Shift, Mode Seeking, and Clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, pp. 790-799, 1995.