This post is an introduction to clustering. No hard math, just simple pictures to get a rough idea of what it is.
This aims to set the basis for ensemble clustering, which will be the topic of another post.
In Machine Learning, clustering is an unsupervised learning task where the goal is to segment items \(X\) into groups without prior information. It is analogous to classification in supervised learning, where we have prior information \((X, Y)\).
Unsegmented points:
Items segmented into three groups:
Above, from a visual point of view, the clustering “seems” good. However, because we don’t have ground-truth assignments, maybe there are only two groups instead of three, or maybe the groups are formed differently. Below are two examples of other possible outcomes.
Three groups with a different assignment:
Two groups:
In 2D, it is easy to select the outcome that fits best. However, when we deal with more than 10 dimensions, visual selection becomes a nightmare, and it is almost impossible to rely on subjective human evaluation. We will explore this point later on.
There are many clustering algorithms, and many axes along which to classify them. One possible axis is the cluster description:
Distance-based / Density-based
This is only one axis; we could also sort clustering approaches by splitting procedure: partitional algorithms (the large majority) versus hierarchical ones. There are also probabilistic and deterministic algorithms.
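As a rough illustration of the distance-based versus density-based distinction, here is a minimal sketch (assuming scikit-learn is available; the dataset and parameter values are just illustrative) that runs K-means and DBSCAN on the same synthetic data. The two interleaved half-moons are recovered by density, not by distance to a center.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

# Two interleaved half-moons: groups defined by density, not by compact blobs.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Distance-based: each point goes to the nearest of k centers.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Density-based: clusters are dense regions separated by sparse ones.
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

print("K-means cluster sizes:", np.bincount(kmeans_labels))
print("DBSCAN cluster sizes :", np.bincount(dbscan_labels[dbscan_labels >= 0]))
```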
There are two kinds of assignment: hard and soft (or fuzzy) clustering. Here, we will only talk about hard clustering.
Soft clustering is helpful when there is a lot of noise. However, in the case of ensemble clustering, most algorithms require a hard partitioning as input.
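As a small sketch of the difference (scikit-learn assumed, with an illustrative blob dataset), a Gaussian mixture model returns soft membership probabilities, and taking the argmax turns them into the hard partition that most ensemble methods expect.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# Soft assignment: one membership probability per cluster for each item.
soft = gmm.predict_proba(X)          # shape (200, 3), rows sum to 1

# Hard assignment: keep only the most likely cluster.
hard = soft.argmax(axis=1)           # equivalent to gmm.predict(X)

print(soft[0].round(3), "->", hard[0])
```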
A fundamental question is “What is a good clustering?”. Because the task is unsupervised, there is no ground truth to compare to.
There are different ways to evaluate a clustering:
External validity is only possible when a complete dataset \(D = (X, Y)\) is available, so it is not always applicable. This kind of evaluation tells us how frequently the algorithm recovers the knowledge \(Y\). In a real data-mining situation, we would not know whether what is recovered is interesting or not until we perform a cluster analysis.
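For example, when labels \(Y\) are known, an external index such as the adjusted Rand index can measure how well the recovered partition matches them. A minimal sketch with scikit-learn (the dataset and algorithm choices are just placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, y = load_iris(return_X_y=True)          # here the labels Y happen to be known
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# 1.0 = perfect recovery of Y, around 0.0 = no better than chance.
print("Adjusted Rand index:", adjusted_rand_score(y, labels))
```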
We may want to perturb the clustering to see how reliable the obtained result is.
In many clustering algorithms (but not all), it is possible to define the desired number of clusters. We can change this number to get more or fewer clusters.
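A minimal sketch (scikit-learn assumed, illustrative data) of running the same algorithm with different requested numbers of clusters:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Same data, same algorithm, different requested number of clusters.
for k in (2, 3, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, "clusters requested ->", len(set(labels)), "groups found")
```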
The number of clusters is one kind of parameter; it is possible to play with others. For DBSCAN, we can change the number of items necessary to create a core point. We can also change the distance measure (Manhattan, Euclidean, Minkowski).
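For instance, with scikit-learn’s DBSCAN the `min_samples` parameter controls how many neighbors are needed to form a core point, and `metric` switches the distance measure. A small sketch (parameter values are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Same eps, different core-point requirements and distance measures.
for min_samples, metric in [(5, "euclidean"), (15, "euclidean"), (5, "manhattan")]:
    labels = DBSCAN(eps=1.0, min_samples=min_samples, metric=metric).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
    print(f"min_samples={min_samples}, metric={metric}: {n_clusters} clusters")
```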
The result of a clustering may depend on the initialization seed. For instance, in K-means, centers are initialized at random, so running the algorithm again can produce a different outcome.
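A quick sketch (scikit-learn assumed) showing that two runs with different seeds can give different partitions; `init="random"` and `n_init=1` make the effect visible, and the overlapping blobs are chosen on purpose:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=5, cluster_std=2.0, random_state=0)

# A single random initialization per run, so the seed really matters.
run1 = KMeans(n_clusters=5, init="random", n_init=1, random_state=1).fit_predict(X)
run2 = KMeans(n_clusters=5, init="random", n_init=1, random_state=2).fit_predict(X)

# 1.0 means identical partitions (up to label permutation); lower means they differ.
print("Agreement between runs:", adjusted_rand_score(run1, run2))
```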
For some algorithms, mostly prototype-based ones, it is possible to run the algorithm on a subset of the data. This is helpful, for computational reasons, when the dataset is very large.
It is possible to run the clustering algorithm multiple times with different training subsets. The leftover items are later assigned to the best-matching prototype.
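A minimal sketch of this idea (scikit-learn assumed; the subset size is arbitrary): fit the prototypes on a random subset, then assign every remaining item to its nearest center with `predict`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=10_000, centers=4, random_state=0)

# Fit the prototypes on a random 10% subset of the data.
rng = np.random.default_rng(0)
subset = rng.choice(len(X), size=len(X) // 10, replace=False)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X[subset])

# Assign every item (including the leftover 90%) to its nearest prototype.
labels = kmeans.predict(X)
print("Cluster sizes:", np.bincount(labels))
```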
It is also possible to remove some features or to change the normalization procedure in order to change the outcome.
A more general way to affect the processing is to use random projections.
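A small sketch (scikit-learn assumed, illustrative dimensions) of perturbing the representation before clustering, here with a Gaussian random projection; re-scaling or dropping features would play a similar role:

```python
from sklearn.datasets import make_blobs
from sklearn.random_projection import GaussianRandomProjection
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=500, n_features=20, centers=4, random_state=0)

base = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Project onto a random lower-dimensional space, then cluster the projection.
X_proj = GaussianRandomProjection(n_components=5, random_state=1).fit_transform(X)
projected = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_proj)

print("Agreement with the original partition:", adjusted_rand_score(base, projected))
```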
The big question is “How to evaluate the result?”. In classification, there are many metrics: accuracy, F1, precision, recall, …, there is a large choice! For clustering, there are essentially two ways to evaluate an algorithm: external validation, which compares the partition to known labels when they are available, and internal validation, which relies on geometric criteria such as compactness and separation.
So, evaluation is not easy.
The silhouette score is calculated for each item \(\mathbf{x}_i\), using two coefficients:
Calling \(\mathcal{C} = \{C_1, C_2, ..., C_k\}\) the partition with \(k\) clusters, the expressions of these two coefficients are:
\[a(i) = \frac{1}{|C_I| - 1} \sum_{j \in C_I,\, j \neq i} d(\mathbf{x}_i, \mathbf{x}_j), \qquad b(i) = \min_{C \in \mathcal{C},\, i \notin C} \frac{1}{|C|} \sum_{j \in C} d(\mathbf{x}_i, \mathbf{x}_j)\]where \(C_I\) is the cluster containing \(\mathbf{x}_i\). In other words, \(a(i)\) is the average distance from \(\mathbf{x}_i\) to the other items of its own cluster, and \(b(i)\) is the average distance to the items of the closest neighboring cluster.
Then, the score of item \(i\) is:
\[s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}\]The item’s score lies in \([-1, 1]\), indicating whether the point is well assigned (\(1\)) or not (\(-1\)).
The global clustering score is the average over clusters, each cluster having the same weight regardless of its size:
\[S_{tot} = \frac{1}{k} \sum_{C \in \mathcal{C}} \frac{1}{|C|} \sum_{i \in C} s(i)\]
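As a small sketch (scikit-learn assumed, illustrative data), `silhouette_samples` returns the per-item scores \(s(i)\); note that the usual `silhouette_score` averages over items, so the cluster-weighted average \(S_{tot}\) above is computed by hand here:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

s = silhouette_samples(X, labels)          # one score per item, in [-1, 1]

# Average over clusters, each cluster weighted equally regardless of its size.
S_tot = np.mean([s[labels == c].mean() for c in np.unique(labels)])

print("Item-averaged silhouette   :", silhouette_score(X, labels))
print("Cluster-averaged silhouette:", S_tot)
```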
1. J. MacQueen, “Some methods for classification and analysis of multivariate observations,” Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281-296, 1967.
2. L. Kaufman and P. J. Rousseeuw, “Clustering by means of medoids,” 1987.
3. R. T. Ng and J. Han, “CLARANS: a method for clustering objects for spatial data mining,” IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 5, pp. 1003-1016, 2002. doi:10.1109/TKDE.2002.1033770
4. H. Tong and U. Kang, “Big Data Clustering,” chapter 11, 2013.
5. D. A. Reynolds, “Gaussian Mixture Models,” in S. Z. Li and A. K. Jain (eds.), Encyclopedia of Biometrics, pp. 659-663, Springer US, 2009. ISBN 978-0-387-73003-5.
6. S. Guha, R. Rastogi, and K. Shim, “CURE: An Efficient Clustering Algorithm for Large Databases,” SIGMOD Record, vol. 27, no. 2, pp. 73-84, 1998.
7. M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in KDD, vol. 96, pp. 226-231, 1996.
8. L. McInnes et al., “hdbscan: Hierarchical density based clustering,” Journal of Open Source Software, 2(11), 205, 2017. doi:10.21105/joss.00205
9. Y. Cheng, “Mean Shift, Mode Seeking, and Clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, pp. 790-799, 1995.