The ensemble evaluation procedure is flawed. Here, we propose a set of simple experiments to highlight the problem.
The main experiment is the following:
We tested a bunch of consensus algorithms, simply to show that almost all of them are affected.
These two metrics let you read the result with zero effort.
The maximal NMI is \(1\). This target is impossible to reach: it would mean that all base clusterings look like the consensus one, i.e., that they are identical, which cannot happen since each base clustering has a different number of clusters.
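A quick way to convince yourself, assuming scikit-learn's implementation of NMI: make one clustering a strict refinement of another, so the two partitions agree as much as two partitions with different cluster counts can, and the NMI still stays below \(1\).

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
n_items = 1000

# A 5-cluster partition, and a 20-cluster strict refinement of it
# (each coarse cluster is split into 4 sub-clusters).
coarse = rng.integers(0, 5, size=n_items)
fine = coarse * 4 + rng.integers(0, 4, size=n_items)

print(normalized_mutual_info_score(coarse, coarse))  # 1.0: identical partitions
print(normalized_mutual_info_score(coarse, fine))    # < 1: cluster counts differ
```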
We performed the following experiment.
We selected a “large” dataset, here pendigits, with more than \(10,000\) items.
We repeated the following experiment 10 times:
We then averaged the result over the 10 runs, so sampling noise can be ignored.
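The per-run details are not spelled out here, so the following is only a minimal sketch of the protocol under assumptions: base clusterings come from k-means with \(k\) drawn uniformly in \([k_0, k_{max}]\), and `simple_consensus` is a stand-in (k-means on the one-hot encoded base labels), not one of the methods actually benchmarked. The names `run_once`, `simple_consensus`, `k_max`, and `n_base` are all hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_openml
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
X_full = fetch_openml("pendigits", version=1, as_frame=False).data

def simple_consensus(base_labels, k):
    # Stand-in consensus (an assumption, not the post's method):
    # cluster the items by their one-hot encoded base labels.
    onehot = np.hstack([np.eye(b.max() + 1)[b] for b in base_labels])
    return KMeans(n_clusters=k, n_init=5).fit_predict(onehot)

def run_once(n, k0=10, k_max=30, n_base=10):
    # Subsample n items, build an ensemble whose base clusterings have
    # different numbers of clusters, then score the consensus by its
    # mean NMI against the base clusterings.
    idx = rng.choice(len(X_full), size=n, replace=False)
    X = X_full[idx]
    base = [KMeans(n_clusters=int(rng.integers(k0, k_max + 1)),
                   n_init=5).fit_predict(X) for _ in range(n_base)]
    cons = simple_consensus(base, k0)
    return np.mean([normalized_mutual_info_score(cons, b) for b in base])

# Repeat 10 times per value of n and average, so sampling noise washes out.
scores = {n: np.mean([run_once(n) for _ in range(10)])
          for n in (500, 1000, 2000, 5000)}
print(scores)
```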
We ran the experiment for several values of \(n\), and we got the following chart:
Here, \(k_0 = 10\).
The chart shows a bunch of different algorithms. What you can see is that changing the number of items impacts the result. So, when we evaluate ensemble results over several datasets, part of the difference is linked to the dataset size.
The third experiment measures the impact of the number of classes, which is often used as the minimal number of clusters.
In the standard protocol, the lowest number of clusters \(k_0\) is set equal to the number of classes when the dataset can also be used for supervised learning. In this experiment, we tested different values for \(k_0\) (a sketch of this variation follows the table):
\(k_0\) | Plot
---|---
\(2\) | *(plot)*
\(5\) | *(plot)*
\(10\) | *(plot)*
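Using the hypothetical `run_once` helper sketched earlier, this variation would amount to changing only the lower bound of the ensemble:

```python
# Vary only the lower bound k0 used to generate the ensemble
# (reusing the sketch above; n is fixed here for illustration).
for k0 in (2, 5, 10):
    mean_nmi = np.mean([run_once(2000, k0=k0) for _ in range(10)])
    print(f"k0={k0:2d}  mean NMI={mean_nmi:.3f}")
```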
The general behavior is: the higher \(k_0\) is, the better the NMI, regardless of the consensus method.
This empirical result can be explained simply:
The maximal entropy of a clustering with \(k\) clusters is \(\log_2(k)\). In the NMI, the mutual information \(I(X, Y)\) is at most \(\min(H(X), H(Y))\), because \(X\) and \(Y\) cannot share more information than the smaller of their two entropies. So, if the consensus \(X\) has \(k\) clusters, \(NMI(X,Y) = \frac{2 I(X,Y)}{H(X) + H(Y)} \le \frac{2\log_2(k)}{\log_2(k) + H(Y)}\). The lower \(k\) is, the smaller this upper bound.
Therefore, selecting \(k_0 = |\mathcal{Y}|\) impacts the consensus results.
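The bound is easy to check numerically. A toy check, assuming scikit-learn's arithmetic-mean NMI (the \(\frac{2 I(X,Y)}{H(X)+H(Y)}\) form used above); the logarithm base cancels in the ratio, so base-2 entropies are consistent here:

```python
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
n = 5000
y = rng.integers(0, 50, size=n)            # base clustering Y with 50 clusters
H_y = entropy(np.bincount(y) / n, base=2)  # H(Y) in bits

for k in (2, 5, 10, 25):
    x = rng.integers(0, k, size=n)         # consensus X with k clusters
    bound = 2 * np.log2(k) / (np.log2(k) + H_y)
    nmi = normalized_mutual_info_score(x, y)
    print(f"k={k:2d}  bound={bound:.3f}  NMI={nmi:.3f}")
# The bound, and so the best achievable NMI, grows with k.
```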
We ran the experiment with \(k_0 = 2\) on two datasets:
pendigits
landsat
Dataset | Plot
---|---
Pendigits | *(plot)*
Landsat | *(plot)*
When you compare the two plots, the dataset effect is small. You see globally the same behavior: for a given consensus method, adding points moves the NMI in a single direction. Except for hierarchical clustering, where some values are very low, the NMI scores are relatively similar.
This experiment covers only two datasets, so it is difficult to claim “this is how it is in all situations”. It is always possible to find an example that contradicts the rule. However, this is the general behavior we have observed so far…