The ensemble evaluation procedure is flawed. Here, we propose a set of simple experiments to highlight the problem.
The main experiment is the following:
We tested a bunch of consensus algorithms, simply to show that almost all of them are affected.
These two metrics let you read the result with zero effort.
The maximal NMI is \(1\). This target is impossible to reach: it would mean that all base clusterings look like the consensus one, i.e., that they are identical, which cannot happen since each base clustering has a different number of clusters.
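A quick way to convince yourself, assuming scikit-learn's implementation of NMI: make one clustering a strict refinement of another, so the two partitions agree as much as two partitions with different cluster counts can, and the NMI still stays below \(1\).

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
n_items = 1000

# A 5-cluster partition, and a 20-cluster strict refinement of it
# (each coarse cluster is split into 4 sub-clusters).
coarse = rng.integers(0, 5, size=n_items)
fine = coarse * 4 + rng.integers(0, 4, size=n_items)

print(normalized_mutual_info_score(coarse, coarse))  # 1.0: identical partitions
print(normalized_mutual_info_score(coarse, fine))    # < 1: cluster counts differ
```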
We performed the following experiment.
We selected a “large” dataset, here pendigits, with more than \(10,000\) items.
We repeated the following experiment 10 times:
We then averaged the result over the 10 runs, so sampling noise can be ignored.
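The per-run details are not spelled out here, so the following is only a minimal sketch of the protocol under assumptions: base clusterings come from k-means with \(k\) drawn uniformly in \([k_0, k_{max}]\), and `simple_consensus` is a stand-in (k-means on the one-hot encoded base labels), not one of the methods actually benchmarked. The names `run_once`, `simple_consensus`, `k_max`, and `n_base` are all hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_openml
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
X_full = fetch_openml("pendigits", version=1, as_frame=False).data

def simple_consensus(base_labels, k):
    # Stand-in consensus (an assumption, not the post's method):
    # cluster the items by their one-hot encoded base labels.
    onehot = np.hstack([np.eye(b.max() + 1)[b] for b in base_labels])
    return KMeans(n_clusters=k, n_init=5).fit_predict(onehot)

def run_once(n, k0=10, k_max=30, n_base=10):
    # Subsample n items, build an ensemble whose base clusterings have
    # different numbers of clusters, then score the consensus by its
    # mean NMI against the base clusterings.
    idx = rng.choice(len(X_full), size=n, replace=False)
    X = X_full[idx]
    base = [KMeans(n_clusters=int(rng.integers(k0, k_max + 1)),
                   n_init=5).fit_predict(X) for _ in range(n_base)]
    cons = simple_consensus(base, k0)
    return np.mean([normalized_mutual_info_score(cons, b) for b in base])

# Repeat 10 times per value of n and average, so sampling noise washes out.
scores = {n: np.mean([run_once(n) for _ in range(10)])
          for n in (500, 1000, 2000, 5000)}
print(scores)
```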
We ran the experiment for several values of \(n\), and we got the following chart:
Here, \(k_0 = 10\).
The chart shows a bunch of different algorithms. What you can see is that changing the number of items impacts the result. So, when we evaluate ensemble results over several datasets, part of the difference is linked to the dataset size.
The third experiment measures the impact of the number of classes, which is often used as the minimal number of clusters.
In the standard protocol, the lowest number of clusters \(k_0\) is set equal to the number of classes when the dataset can also be used for supervised learning. In this experiment, we tested different values for \(k_0\) (a sketch of this variation follows the table):
\(k_0\) | Plot
---|---
\(2\) | *(plot)*
\(5\) | *(plot)*
\(10\) | *(plot)*
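Using the hypothetical `run_once` helper sketched earlier, this variation would amount to changing only the lower bound of the ensemble:

```python
# Vary only the lower bound k0 used to generate the ensemble
# (reusing the sketch above; n is fixed here for illustration).
for k0 in (2, 5, 10):
    mean_nmi = np.mean([run_once(2000, k0=k0) for _ in range(10)])
    print(f"k0={k0:2d}  mean NMI={mean_nmi:.3f}")
```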
The general behavior is: the higher \(k_0\) is, the better the NMI, regardless of the consensus method.
This empirical result can be explained simply:
The maximal entropy of a clustering with \(k\) clusters is \(\log_2(k)\). In the NMI, the mutual information \(I(X, Y)\) is at most \(\min(H(X), H(Y))\), because \(X\) and \(Y\) cannot share more information than the smaller of their two entropies. So, if the consensus \(X\) has \(k\) clusters, \(NMI(X,Y) = \frac{2 I(X,Y)}{H(X) + H(Y)} \le \frac{2\log_2(k)}{\log_2(k) + H(Y)}\). The lower \(k\) is, the smaller this upper bound.
Therefore, selecting \(k_0 = |\mathcal{Y}|\) impacts the consensus results.
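The bound is easy to check numerically. A toy check, assuming scikit-learn's arithmetic-mean NMI (the \(\frac{2 I(X,Y)}{H(X)+H(Y)}\) form used above); the logarithm base cancels in the ratio, so base-2 entropies are consistent here:

```python
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
n = 5000
y = rng.integers(0, 50, size=n)            # base clustering Y with 50 clusters
H_y = entropy(np.bincount(y) / n, base=2)  # H(Y) in bits

for k in (2, 5, 10, 25):
    x = rng.integers(0, k, size=n)         # consensus X with k clusters
    bound = 2 * np.log2(k) / (np.log2(k) + H_y)
    nmi = normalized_mutual_info_score(x, y)
    print(f"k={k:2d}  bound={bound:.3f}  NMI={nmi:.3f}")
# The bound, and so the best achievable NMI, grows with k.
```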
We ran the experiment with \(k_0 = 2\) on two datasets:
pendigits
landsat
Dataset | Plot
---|---
Pendigits | *(plot)*
Landsat | *(plot)*
When you compare the two plots, the dataset effect is small. You see globally the same behavior: for a given consensus method, adding points moves the NMI in a single direction. Except for hierarchical clustering, where some values are very low, the NMI scores are relatively similar.
This experiment covers only two datasets, so it is difficult to claim “this is how it is in all situations”. It is always possible to find an example that contradicts the rule. However, this is the general behavior we have observed so far…