Warning: This document is still a draft.

Ensemble Clustering Evaluation Issues.

Ensemble clustering methods are often evaluated using NMI and ARI. If these were good measures, a perfect consensus would score one and a very bad one would score zero. Very simple experiments show that they do not behave this way. This article demonstrates the problem and describes where it comes from. Then, we propose alternative measures for these consensus functions.



Introduction

If you don’t yet know what ensemble clustering is, please look at the previous article for a short introduction.

Table of Contents

TODO when article is not a draft anymore

Evaluation Protocol of Consensus Clustering

The general protocol is the following:

  1. Pick a clustering algorithm. Most of the time, \(k\)-Means is selected, as it is well known, simple and fast
  2. Create \(20\) base clusterings. To get diversity, select the number of clusters \(k\) at random. There are two options:
    • \(k \in [n_{\text{class}}, \min(100, \sqrt{n})]\), where \(n_{\text{class}}\) is the number of classes in the supervised setting. If completely unsupervised, use \(2\),
    • \(k \in [n_{\text{class}} - 2, n_{\text{class}}+2]\): this option is by far not recommended (over the \(20\) base clusterings, many would have the same number of clusters. In itself, this is not a problem. However, with \(k\)-Means, they are very likely to converge to the same partitioning, so you may end up with only \(5\) distinct clusterings out of the \(20\))
  3. Create the consensus clustering
  4. Perform the evaluation

This is the main protocol we can see across the literature.
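A minimal sketch of steps 1 to 3 could look as follows (assuming scikit-learn’s KMeans; `build_ensemble` and the consensus call are illustrative names, not a reference implementation):

import numpy as np
from sklearn.cluster import KMeans

def build_ensemble(X, n_base=20, k_min=2, k_max=None, seed=0):
    """Create n_base k-Means clusterings, each with a random number of clusters."""
    rng = np.random.default_rng(seed)
    if k_max is None:
        k_max = min(100, int(np.sqrt(len(X))))   # upper bound of the first option
    ensemble = []
    for _ in range(n_base):
        k = int(rng.integers(k_min, k_max + 1))  # random k for diversity
        ensemble.append(KMeans(n_clusters=k, n_init=10).fit_predict(X))
    return ensemble

# pi_star = some_consensus_function(ensemble)   # step 3: any consensus function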

The most frequently used evaluation metrics are the NMI and the ARI. We briefly introduce them in the next paragraphs.

Evaluation Metrics

NMI

The Normalized Mutual Information is defined as:

\[NMI(\pi_A, \pi_B) = 2 \frac{I(\pi_A, \pi_B)}{H(\pi_A) + H(\pi_B)}\]

for two partitionings \(\pi_A\) and \(\pi_B\). In natural language, \(I\) is the amount of shared information, while \(H\) is the amount of individual information. Therefore, if \(\pi_A\) is similar to \(\pi_B\), then \(I\) is maximal.

This value ranges in \([0, 1]\), where \(1\) is obtained for a perfect overlap.
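As a quick check (a sketch using scikit-learn’s implementation, with made-up toy labels), the NMI is \(1\) for a perfect overlap, whatever the cluster labels are:

from sklearn.metrics import normalized_mutual_info_score

pi_A = [0, 0, 1, 1, 2, 2]
pi_B = [2, 2, 0, 0, 1, 1]   # same partitioning, different label names

print(normalized_mutual_info_score(pi_A, pi_B))   # 1.0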

ARI

The Adjusted Rand Index is a measure based on counting pairs of items of four kinds:

  • \(a\): pairs of items which are together in \(\pi_A\) and in \(\pi_B\)
  • \(b\): pairs of items which are together in \(\pi_A\) but not in \(\pi_B\)
  • \(c\): pairs of items which are together in \(\pi_B\) but not in \(\pi_A\)
  • \(d\): pairs of items which are together neither in \(\pi_A\) nor in \(\pi_B\)
                            | together in \(\pi_A\) | not together in \(\pi_A\)
  together in \(\pi_B\)     | \(a\)                 | \(c\)
  not together in \(\pi_B\) | \(b\)                 | \(d\)

From here, you can compute the Rand index \(RI\):

\[RI(\pi_A, \pi_B) = \frac{a+d}{a + b + c + d}\]

In other words, it is the number of coherent pairs (always together, \(a\), plus never together, \(d\)) divided by the total number of pairs. This is also called the simple matching coefficient.
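A brute-force sketch counting the four kinds of pairs and computing the RI from them (for illustration only, with toy labels):

from itertools import combinations

def pair_counts(pi_A, pi_B):
    """Count the pairs a, b, c, d defined above."""
    a = b = c = d = 0
    for i, j in combinations(range(len(pi_A)), 2):
        same_A = pi_A[i] == pi_A[j]
        same_B = pi_B[i] == pi_B[j]
        if same_A and same_B:
            a += 1
        elif same_A:
            b += 1
        elif same_B:
            c += 1
        else:
            d += 1
    return a, b, c, d

a, b, c, d = pair_counts([0, 0, 0, 1, 1], [0, 0, 1, 1, 1])
print((a + d) / (a + b + c + d))   # Rand index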

The Adjusted Rand Index is a little more complicated, but it follows the same intuition.

\[ARI(\pi_A, \pi_B) = \frac{\sum_{i=1}^k \sum_{j=1}^l \binom{m_{ij}}{2} - t_{AB}}{\frac{1}{2}(t_A + t_B) - t_{AB}}\]

where:

  • \[t_A = \sum_{i=1}^k \binom{|\pi_A^i|}{2}\], where \(|\pi_A^i|\) is the size of the \(i\)-th cluster of \(\pi_A\)
  • \[t_B = \sum_{j=1}^{\ell} \binom{|\pi_B^j|}{2}\]
  • \[t_{AB} = \frac{2 t_A t_B}{n(n-1)}\]
  • \(m_{ij}\) is the number of items that belong both to the \(i\)-th cluster of \(\pi_A\) and to the \(j\)-th cluster of \(\pi_B\)

At first sight, the ARI formula does not look much like the RI, but it follows the same intuition, corrected for chance agreement. The ARI gives more weight to large clusters.
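Both indices are available in scikit-learn (same toy labels as above):

from sklearn.metrics import adjusted_rand_score, rand_score

pi_A = [0, 0, 0, 1, 1]
pi_B = [0, 0, 1, 1, 1]

print(rand_score(pi_A, pi_B))            # RI = (a + d) / (a + b + c + d)
print(adjusted_rand_score(pi_A, pi_B))   # chance-corrected version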

Source ARI +++

Consensus Evaluation

NMI and ARI are one-to-one evaluation metrics, not one-to-many.

To evaluate a consensus partitioning, we calculate the average over the ensemble:

\[\bar{M} = \frac{1}{n_{base}} \sum_{i=1}^{n_{base}} M(\pi^*, \pi_i)\]

where \(\pi^*\) is the consensus partitioning, and \(\pi_i\) the \(i\)-th base partitioning.
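A minimal sketch of this averaged evaluation, assuming the NMI is used as the 1:1 metric \(M\):

import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def average_score(pi_star, ensemble, metric=normalized_mutual_info_score):
    """Average similarity between the consensus and each base partitioning."""
    return np.mean([metric(pi_star, pi_i) for pi_i in ensemble])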


What’s Wrong with these Evaluation Measures ?

In classification, the best algorithm can reach \(100\%\) accuracy. ARI and NMI are supposed to be scoring functions ranging in \([0, 1]\), which is the case. For consensus evaluation, the final score is the average similarity between the consensus and all base clusterings. Here, it is impossible to get \(1\), unless your ensemble is made of copies of the exact same partitioning (with \(0\) diversity).

Worse, when you pick one base clustering and consider it as the consensus clustering, the best obtained score is close to the algorithmic consensus result. You may switch from \(89\%\) to \(90\%\) (best base clustering VS consensus clustering). The difference between the two is not worth the computational effort.

A Focus on NMI

Say our consensus clustering has \(k\) clusters. What happens if each of these clusters is split in two, leading to \(2k\) clusters ? What is the change in NMI ?

Let’s look at each term in the NMI formula:

  • \(H(\pi_i)\) is unchanged (hopefully)
  • \(H(\pi^{**})\) is modified, but not that much
  • \(I(\pi^{**}, \pi_i)\) is modified, but not that much.
\[H(\pi^{*}) = -\sum_{j=1}^{k} p_j\log p_j\]

TODO: check

\[\begin{array}{lll} H(\pi^{**}) &=& -\sum_{j=1}^{k} 2 \cdot \frac{p_j}{2}\log \frac{p_j}{2} \\ & = & -\sum_{j=1}^{k} p_j (\log p_j - \log 2) \\ & = & H(\pi^*) + \log 2 \\ & = & H(\pi^*) + 1 \end{array}\]

Well, as you can see, just a “+1” (using base-2 logarithms, so \(\log 2 = 1\)), which is very intuitive: for each cluster, you encode one extra bit saying whether the item goes left (0) or right (1).

For \(I = H(\pi^{**}) + H(\pi_i) - H(\pi^{**}, \pi_i)\), the easiest is to compute the joint entropy \(H(\pi^{**}, \pi_i)\):

\[\begin{array}{lll} H(\pi^{**}, \pi) &=& - \sum_{i, j} P(\pi^{**}=i, \pi=j) \log P(\pi^{**}=i, \pi=j) \\ &=& - \sum_{i, j} 2 \cdot \frac{1}{2} P(\pi^{*}=i, \pi=j) \left(\log P(\pi^{*}=i, \pi=j) - \log 2 \right) \\ &=& - \sum_{i, j} P_{ij} \log P_{ij} + \log 2 \sum_{i, j} P_{ij} \\ &=& H(\pi^{*}, \pi) + 1 \end{array}\]

Also here, no surprise in this result.

Therefore, \(I(\pi^{**}, \pi) = H(\pi^{**}) + H(\pi) - H(\pi^{**}, \pi)= H(\pi^*) + 1 + H(\pi) - H(\pi^*, \pi) - 1 = I(\pi^*, \pi)\).

So, if you divide the clusters of the consensus clustering in two, your NMI is just:

\[NMI = \frac{2I(\pi^{**}, \pi)}{H(\pi^*) + H(\pi) + 1}\]

The \(+1\) does not change the result that much. The \(H\) terms are not very large, but when the denominator moves from, say, \(30\) to \(31\), the impact on the NMI is negligible.

You can do the experiment yourself: you will see that the scores do not decay that much.
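Here is one possible version of this experiment (a sketch: the “consensus” is simulated by a reference labeling, and the base clustering is a noisy copy of it):

import numpy as np
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
n, k = 10_000, 10

pi_star = rng.integers(0, k, size=n)                       # consensus with k clusters
noise = rng.random(n) < 0.2                                # 20% of noisy assignments
pi_base = np.where(noise, rng.integers(0, k, size=n), pi_star)

# Split each consensus cluster in two at random: cluster j becomes 2j or 2j + 1
pi_split = 2 * pi_star + rng.integers(0, 2, size=n)

print(normalized_mutual_info_score(pi_base, pi_star))      # score with k clusters
print(normalized_mutual_info_score(pi_base, pi_split))     # score with 2k clusters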

Average

When performing consensus clustering, some algorithms let you select the desired number of clusters. In that case, if you select \(2k\) instead of \(k\), the result is even better.

In the previous paragraph, we assumed a random split. If you ask the algorithm for the best split with \(2k\) clusters, the result is even better.

TODO: Curve with a given algorithm and average NMI.

Here, you can see that the NMI is almost flat: it is unclear whether 30, 35, or 40 clusters is better.

The “best” cluster number is … the average number of clusters over the ensemble. This can be shown by performing \(100\) base clusterings (instead of \(20\)): the result depends on the selected range for \(k\).


Reflection on Ensemble Clustering

To define a new evaluation measure, we need to look back at some simple examples and ask what the expected outcome would be.

What could be the Best Outcome ?

Let’s start with a simple example, where we have two partitionings with two clusters each.

If the two partitionings are not identical, we end up with the following situation, where there are three main possible options for the consensus:

Option 1: There is a single cluster

Even if the number of clusters of the base clusterings was set to two, there is no reason why the consensus should have two clusters. You can always split a group in two, unless there is only one element. Here, as the groups do not fully overlap, this single-cluster assumption is reasonable.

Option 2: There are two clusters

But there are plenty of choices for these two clusters:

  • a: the first partitioning is the reference
  • b: the second partitioning is the reference
  • c: there is an enclave in the middle (imagine a half moon with a dot in the middle. The dot is the middle cluster)
  • d: there is noise, therefore the boundary is somewhere in between

Option 3: There are three clusters

This is also a possible configuration, and here there is a single trivial choice.

More Base Partitioning

If you have only two partitionings, it is almost impossible to pick the correct consensus clustering from the above list, because you don’t have enough evidence. If more partitionings are performed, we are able to find out which hypothesis best suits the data.

The next figures illustrate how it would look for each scenario:

(Figures: the three scenarios, with 1 cluster, 2 clusters, and 3 clusters.)
  • For a single cluster, there would be no rationale behind the splits
  • For two clusters + noise, the overlap between clusters of different partitioning is likely to be high, with some small variations
  • For three clusters, it is likely that sometimes we see \(\{\{C_1, C_2\}, \{C_3\}\}\), sometimes \(\{\{C_1, C_3\},\{C_2\}\}\) and also \(\{\{C_1\},\{C_2, C_3\}\}\)

Noise on Boundary

The 2-cluster case allows us to introduce one concept: noise. Partitionings are unlikely to have exactly the same boundary. Some items are in between, and there is no way to properly decide where they should go.

If we look at our 2-clusters example, we can identify three groups:

  • On the left, items clearly belong to the green group
  • On the right, items clearly belong to the blue group
  • In the middle, items for which we don’t know which cluster is best.

Consensus clustering leads to hard clusters, and NMI / ARI cannot be used to evaluate soft clusters. Therefore, as the split cannot handle this uncertainty (even if we know that some items are green 90% of the time, they still end up in the other group 10% of the time), the evaluation will be impacted. It would be reasonable to remove the in-between points before performing consensus evaluation.

Structure Correspondences

Even if the previous schema seems easy (some items misclassified), it could correspond to many different cases. We can identify at least three cases leading to this boundary uncertainty:

  1. The boundary between green and blue items is in between. The clusters are just too close, so items in the contact zone switch sides.
  2. Points in between the two main groups are just outliers. They belong to neither the one nor the other. Therefore, the boundary can be clearly defined after removing them.
  3. There is no boundary: all base clusterings fulfill the constraint of having two clusters, but a better option would have been one single cluster. Because of the oval shape of the group, the boundary between the two clusters is roughly at the same location every time.
(Figures: Option 1, Option 2, Option 3.)

In Options 1 and 2, removing those items would lead to a clean hard clustering and a better consensus. In Option 3, it would not help, and would create a hole within the group.

Too Many Partitionings

In Option 3 of the previous example, we assumed that, because of the oval shape of the group, the boundary would always be located near the same place. Therefore, ensemble analysis would give the impression that there are two clusters.

In real life, this is unlikely to be the case, or at least not that frequent. If there are too many partitionings, it is very likely that there will be no clear boundary.

Here, we just illustrate one group split three times more than it should be.

The external boundary is stable (well, we did not represent the other groups, just imagine them), but the internal boundaries change all the time. Therefore, the small clusters overlap weakly, which makes the recovery difficult.

When performing evidence accumulation clustering, the more clusters you add, the fewer \(1\)s the co-association matrix \(A\) contains.

So, the idea of removing items is unlikely to be helpful in that case. Nevertheless, it can be used to detect whether too many clusters have been searched for.


Proposition of New Scoring Method

If we look at our matrices representing clustering outcomes, we can define atomic clusters, i.e., the most fine-grained clustering you could obtain, where all items within a cluster are always clustered together.

This is a state reached by EAC + hierarchical agglomerative clustering, where only trivial merge choices remain.

  • There are groups that are larger than others, and items that are singletons.
  • There are groups which, if you remove one or two partitionings, merge with another cluster.

We can score a partitioning based on whether its removal helps to unify clusters. Similarly, we can weight a group of items based on its size, and on the effort necessary to merge it with another group.

MP: Micro-precision:

\[MP = \sum_{i=1}^k \frac{a_i}{n}\]

  • \(k\): number of clusters
  • \(n\): number of items

There is no Magic Recipe !

If the base clusterings are completely incoherent, there is no magic: consensus clustering will not help ! And if all base clusterings are identical, consensus clustering will not help either !! This is related to the concept of diversity.

The first thing to look at is the difficulty of the current ensemble.

I propose the following approach:

First, rank items by triviality: some items constantly change neighbors from one clustering to another.

Next, set a tolerance level: remove all items switching more than xy% of the time. This reduces the data from \(n\) to \(n_{xy}\) items.

Then, measure the average coherency. For instance, with the NMI:

\[Score_{xy} = \frac{2}{b(b-1)} \sum_{1 \leq i < j \leq b} NMI(\pi_i, \pi_j)\]
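A sketch of this score, where `mask` marks the \(n_{xy}\) items kept after removing the unstable ones (the base clusterings are assumed to be integer label arrays):

from itertools import combinations
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def score_xy(clusterings, mask=None):
    """Average pairwise NMI over the base clusterings, restricted to the kept items."""
    if mask is not None:
        clusterings = [np.asarray(c)[mask] for c in clusterings]
    pairs = list(combinations(clusterings, 2))
    return np.mean([normalized_mutual_info_score(a, b) for a, b in pairs])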

You vary xy over all possible values, from 1% to 100%. This allows you to create NMI x stability curves like the following:

In practice, we don’t evaluate every proportion, we just record the distinct values. Here, we work on the Landsat dataset, with 6435 elements. I performed 20 base clusterings, where diversity was obtained by sampling 1000 items at random to fit the cluster centers (\(k=10\)), then assigning the rest.

# Code for making the ensemble (X is the data matrix, e.g., the Landsat features)
import numpy as np
from sklearn.cluster import KMeans

n_base = 20          # number of base clusterings
ki = 10              # number of clusters per base clustering
n0 = 1000            # number of items sampled to fit the centers
xrange = np.arange(len(X))

lst_clusterings = []
for _ in range(n_base):
    print("-", end=" ")                                   # progress indicator
    md = KMeans(n_clusters=ki)
    locs = np.random.choice(xrange, n0, replace=False)    # random subsample
    Xi = X[locs]
    md.fit(Xi)                                            # fit the centers on the subsample

    lst_clusterings.append(md.predict(X).astype(int))     # assign all items

Here, the biggest group is made of 680 elements, and the second one has 664 elements (while the smallest ones have a single element). There are 641 codes, but you can only see 51 dots, because some groups have the same number of elements (mostly groups with one element).

The curve is “nice”, I would say: large, coherent groups.

Here, I just changed \(k=10\) to \(k=20\), leading to more small clusters.

And below, with \(k\) decreased to 6, we get a large plateau.

So, here you see that the shape of this curve depends on the base clusterings.

Issue with the Curve Shape

The problem is the following:

  • when you do not have enough clusters, the boundaries between clusters are likely to be true boundaries. However, some are missing.
  • when you have just the right number of clusters, then no boundaries are missing
  • when you have too many clusters, then true groups are split at random.

The last case explains why the curves lose their roundness.

Base x k Trade-off

If we have a large number of base clusterings, then we can study boundaries.

A setup of 10 base clusterings with 10 clusters each is different from 5 base clusterings with 20 clusters each, even if you get as many clusters in total. However, in practice, for a low number of clusters, the base clusterings converge to the same target, because most of the data mass pulls them toward the same solution.

Getting the items left

The simplest way to select items to remove (though not perfect) is to use a binary encoding. Each clustering is one-hot encoded and all one-hot-encoded clustering results are concatenated, leading to a large matrix. From this matrix we extract “words”, or binary codes, and count their frequency:

  • infrequent words are associated with items that flip frequently between clusters
  • frequent words correspond to large groups of items that are always clustered together.

The intuition about groups and sizes is very simple: if we remove all the items belonging to small groups, then it becomes very easy to cluster the items left over (clean groups), and we get a nice NMI value.
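A sketch of this filtering, where the “code” of an item is the tuple of labels it receives across the base clusterings (equivalent to its row in the concatenated one-hot matrix), and `min_size` is an arbitrary threshold:

import numpy as np
from collections import Counter

def stable_mask(clusterings, min_size=10):
    """Keep only items whose code (tuple of labels) is shared by at least min_size items."""
    codes = list(zip(*clusterings))        # one code per item
    counts = Counter(codes)
    return np.array([counts[c] >= min_size for c in codes])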

Issues with Codes

Codes allow us to count how many items are always clustered together. This strictness is also the reason why this method is not perfect.

If we have \(k\) base clusterings with \(d\) clusters each, there are at most \(d^k\) binary words. If most of them occur, then we cannot do anything with the codes.

If we have \(k_g\) “good” clusterings and \(k_b\) bad clusterings (items assigned at random), then the codes lead nowhere, as the bad clusterings hide the similarities between good clusters.

Fortunately, we can identify bad clusterings with the NMI (or the ARI, or any other relevant measure): a random clustering has an NMI close to \(0\) with any other clustering (random or not). Therefore, we can discard them gracefully.

Nevertheless, one clustering can be good on some parts of the data, and bad elsewhere. In that case, we need to take each cluster of a clustering alone against the other clusterings, and we can proceed the same way. If we take \(n\) items at random, there is some probability that they overlap the rest: for a random cluster, \(X^T X_i \approx \frac{X^T X}{k}\), i.e., the overlap is inversely proportional to the number of clusters.

By using a contrast function, we can compute a score:

\[\frac{1}{b} \sum_{\pi} \frac{1}{|\pi|}\sum_{C \in \pi} Contrast(C, C_r)\] \[Contrast(x, y) = \frac{x^T y - y^T y/k}{1 - y^T y/k}\] \[Contrast(x, y) = \frac{y^T y - x^T y/k}{y^T y/k}\]

The \(/k\) accounts for the randomness of selecting items at random. This is not adapted if clusters have different sizes…


A Better Partitioning

The previous part presents a way to evaluate whether we have too many clusters: if there are too many, the curve is (almost) a straight line between the max and the min value. This helps, but it requires many experiments to get the correct number of clusters.

Example

A definition of a good consensus is that \(\pi^*\) is reachable from a base partitioning using simple operations, such as:

  • merges
  • splits

If a cluster of a base partitioning is too large, we can accept to split it into several parts (but the parts cannot be merged). And if many clusters are too small, they can be merged together (but cannot be split afterwards).

Look at these two examples:

(Figures: Example 1 and Example 2.)

In the first example, the blue-ish clusters can be merged together to correspond to the big blue cluster of the consensus \(\pi^*\), and the orange cluster can be split into two parts to fit the green and pink clusters.

The second example does not work: the two blue-ish clusters can be merged together, but it is not possible to merge the red with the orange and then split this new cluster in two.

Existence

A consensus partitioning that matches all base partitionings in this way always exists. There are two trivial solutions:

  • a consensus with \(n\) clusters with each one element,
  • a consensus with a single cluster with \(n\) elements.

The first case is reached by dividing all clusters into atomic elements, while the second is reached by merging everything, in any order. However, neither of them is a good clustering.

Nevertheless, these states can be used as starting points to identify a closer consensus.

Distance

We can measure the distance from a base clustering \(\pi\) to the consensus \(\pi^*\) by counting the number of steps needed to reach it. The number of steps is at least \(|k(\pi) - k(\pi^*)|\) (i.e., the difference between the number of clusters in \(\pi\) and in \(\pi^*\)), assuming we can only do binary steps (merge two clusters, or split one in two).

In the first example, the distance would be equal to 3 (split the orange, merge two blues, merge two blues). An optimal partitioning must minimize this distance.

Errors

It is unlikely that no error would occur. It is almost certain that some items will be frequently misclassified, impacting the merges and splits of clusters. We can identify them.

(How ??) We may perform a relaxed version, accepting merge-split operations, and then identify problematic areas. Then, we can compare these areas across multiple clusterings.

Metrics

Well, we said a good consensus must be like this, but how do we compare it to others ? What would be a reasonable metric ? Moreover, in some applications, we may want a specific number of clusters. How do we set that ?

Reaching a Solution

We can bound the consensus shape by finding a lower bound and an upper-bound partitioning.

Lower bound partitioning \(\pi^-\)

The lower bound \(\pi^-\) would be obtained by dividing the partitioning made of a single cluster. Here, we will need to discard elements / accept some errors.

Upper Bound Partitioning \(\pi^+\)

Similarly, we will start from the top partitioning with \(n\) clusters.

Measure overlap


import numpy as np

def compare_clusterings(A, B):
    """
    Measure the overlap between clusters of A using the information of B.

    :param A, B: int arrays of n elements, where A[i] is the cluster assignment of item i.
    :return: Matrix of size k x k, where k is the number of clusters in A.
    """
    ka = max(A) + 1                # number of clusters in A
    kb = max(B) + 1                # number of clusters in B

    M = np.zeros((ka, ka))
    for i in range(ka):
        for j in range(ka):
            # Distribution of clusters i and j of A over the clusters of B
            arr0 = np.bincount(B[A == i], minlength=kb).astype(float)
            arr1 = np.bincount(B[A == j], minlength=kb).astype(float)

            # Cosine similarity between the two distributions
            M[i, j] = (arr0 @ arr1) / np.sqrt((arr0 @ arr0) * (arr1 @ arr1))

    return M
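As a usage sketch (with made-up toy labels): an entry \(M[i, j]\) close to \(0\) means that the second partitioning supports that clusters \(i\) and \(j\) of the first one are distinct.

import numpy as np

pi1 = np.array([0, 0, 0, 1, 1])
pi2 = np.array([0, 1, 1, 2, 3])

M = compare_clusterings(pi1, pi2)
print(M)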
(Figure: three example partitionings \(\pi_1\), \(\pi_2\) and \(\pi_3\).)

Instead of comparing one cluster of \(\pi_i\) with one cluster of \(\pi_j\), which corresponds to looking at the overlap, we can look at edges. Here, we want to measure whether \(\pi_j\) supports the evidence that clusters \(A\) and \(B\) of \(\pi_i\) are distinct.

Let’s start with \(\pi_1\) VS \(\pi_2\).

\(\pi^1\) VS \(\pi^2\)

We start by picking two clusters of \(\pi_1\). Here, \(\pi_1\) has only two clusters, so the choice is trivial.

For each, we build their corresponding vectors:

\[\mathbf{v}(\pi_1^A | \pi_2) = \mathbf{v}^{1|2}_A = \frac{1}{|A|}\left[|\pi^1_A \cap \pi^2_W|, |\pi^1_A \cap \pi^2_X|, |\pi^1_A \cap \pi^2_Y|, |\pi^1_A \cap \pi^2_Z| \right] = [0.6, 0.4, 0, 0]\] \[\mathbf{v}(\pi_1^B | \pi_2) = \mathbf{v}^{1|2}_B = \frac{1}{|B|}\left[|\pi^1_B \cap \pi^2_W|, |\pi^1_B \cap \pi^2_X|, |\pi^1_B \cap \pi^2_Y|, |\pi^1_B \cap \pi^2_Z| \right] = [0, 0, 0.75, 0.25]\]

Next, you compute the cosine similarity as:

\[Cos(\mathbf{v}_A, \mathbf{v}_B) = \frac{\mathbf{v}_A^T \mathbf{v}_B}{\sqrt{\mathbf{v}_A^T\mathbf{v}_A \mathbf{v}_B^T\mathbf{v}_B}} = 0\]

So \(\pi_2\) supports the evidence that A and B of \(\pi^1\) are two distinct clusters.

\(\pi^1\) VS \(\pi^3\)

If we compare \(\pi^1\) to \(\pi^3\), we get the same results, as:

\[\mathbf{v}^{1|3}_A = [0.6, 0.4, 0]\]

and

\[\mathbf{v}^{1|3}_B = [0, 0, 1]\]

therefore:

\[Cos(\mathbf{v}^{1|3}_A, \mathbf{v}^{1|3}_B) = 0\]

so \(\pi^3\) also supports that \(A\) and \(B\) are distinct.

\(\pi^3\) VS \(\pi^2\)

We have three clusters in \(\pi^3\), with vectors:

\[\mathbf{v}^{3|2}_C = [0, 0, 0.3, 0.7]\] \[\mathbf{v}^{3|2}_D = [0.6, 0.4, 0, 0]\] \[\mathbf{v}^{3|2}_E = [0.5, 0.5, 0, 0]\]

For C, there is supporting evidence that C is connected neither to D nor to E:

\[Cos(\mathbf{v}^{3|2}_C, \mathbf{v}^{3|2}_D) = 0\] \[Cos(\mathbf{v}^{3|2}_C, \mathbf{v}^{3|2}_E) = 0\]

However, \(Cos(\mathbf{v}^{3|2}_D, \mathbf{v}^{3|2}_E) = \frac{0.6 \times 0.5 + 0.4 \times 0.5}{\sqrt{(0.6^2 + 0.4^2)(2 \times 0.5^2)}} = \frac{0.5}{\sqrt{0.26}} \approx 1\)

Here, there is evidence against the separation of D from E.


Unclassified / To Remove

Trade-Off Between Clusters Number and Base Partitioning

In the first example, we limited ourselves to 3 clusters (because we could not spot more than that with two base partitionings of two clusters each). We could have seen four clusters, but not more. When adding more partitionings, we can see more clusters.

The bounds for \(b\) base partitionings with \(k\) clusters each are:

\[[k, k^b]\]

We have \(k\) clusters if there is a perfect overlap (often when \(k\) is small and the clustering algorithms converge to the same state). We reach a number close to \(k^b\) when \(k\) is large, where we end up splitting “true” clusters into pieces.


Removing Items

Most clustering and ensemble clustering algorithms output a complete clustering, i.e., all items are assigned to a given cluster.

It is possible to output an incomplete partitioning, where only \(70 \sim 90\%\) of the items are assigned to a cluster. The limit between blue and green items would then be straightforward. In our example, we get 100% accuracy on the \(80\%\) of items left.

Proposal: We could define a curve \((\text{proportion}, \text{accuracy})\), by looking at the co-clustering frequency.

\[prop(\alpha) = \frac{1}{n} \sum_{i=1}^n \max_{j \neq i} \left\{ \begin{array}{ll} 1 & \mbox{ if } A_{i,j} \geq \alpha \\ 0 & \mbox{ otherwise} \end{array} \right.\]

where \(A_{i,j}\) is the co-clustering frequency of item \(i\) with item \(j\).
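A sketch of this proportion, given a precomputed co-association matrix `A` of shape \(n \times n\):

import numpy as np

def prop(A, alpha):
    """Fraction of items co-clustered with at least one other item at frequency >= alpha."""
    B = A >= alpha
    np.fill_diagonal(B, False)         # ignore the trivial i == i entry
    return B.any(axis=1).mean()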


This concept is used by stability-based ensemble clustering approaches: there are items that always go together, and we are sure they always form a cluster (hierarchical bottom-up clustering always finds these clusters first).

Points in between the two groups can be assigned to the closest group based on their co-clustering frequency. Or they can be left alone, and we could say we have an incomplete partitioning where we have \(100 \%\) confidence that items go together.

However, this example is an easy one. Sometimes, you will get incoherent clusterings.

In this case, the best thing to do would be to discard the last clustering, which is too different from the others. The problem is that a clustering can be locally incoherent, i.e., some items are badly separated while others are normal. When discarding clusterings, we could end up in a situation where all base clusterings are discarded.

There are at least three options:

Measuring entropy changes

When assigning an item to a cluster, this can be represented by a binary vector: for \(k\) clusters, there are \(k\) slots in the vector, one of which is set to \(1\) while the others are left at \(0\). By combining multiple clusterings together, we get more \(1\)s, and the variety of binary vectors increases. We can measure this diversity with entropy.

If we get a large change when adding a clustering output, then this clustering brings a high diversity improvement.

Rather than adding all the clusters of a clustering at once, we can add a single cluster at a time, decomposing a clustering into its sub-parts. Then, we can weight each cluster and say “this one is completely new”.
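A sketch of this entropy measurement on the binary codes (reusing the tuple-of-labels representation introduced earlier); the diversity gain of a new clustering is the entropy change when it is added:

import numpy as np
from collections import Counter

def code_entropy(clusterings):
    """Entropy (in bits) of the distribution of codes over the items."""
    codes = list(zip(*clusterings))
    counts = np.array(list(Counter(codes).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# gain = code_entropy(ensemble + [new_pi]) - code_entropy(ensemble)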

Clusters VS Boundaries

When performing ensemble clustering, we tend to look at groups of coherent points, in a way “one against the rest”: we check the unity of a group of items.

There is an alternative, equivalent approach that looks at the problem from a different point of view: instead, we look at boundaries.


What Should be Considered ?

NMI is wrong: for consensus evaluation, it is just the average of a 1:1 metric.

There are several things to consider, and we need to look at general clustering problems:

  • What should be the number of clusters ? Should it be learned, or should it satisfy some rule of thumb or desire from the user ?
  • Clustering generally assumes “hard complete clusters”.
    • What about soft clusters, where one item can belong to several clusters at the same time, with a relative strength ?
    • What about incomplete clustering, where some items are not assigned to any clusters ?
    • What about multi-view ? Where instead of one solution, you offer several orthogonal solutions ?
  • Can we remove “bad clusters” from the initial set ?
  • Can we evaluate a consensus based on the removal necessary to reach its state ?

Another way to evaluate a consensus is in a supervised manner:

  • You build \(b_0\) base clusterings for building the consensus
  • You build \(b_1\) base clusterings for evaluation

If your algorithm still scores well on these unseen clusterings, then you are good.


Appendices

Definition

Hard cluster
a cluster to which items belong at 100%
Soft cluster
a cluster to which items may belong at less than 100%, so an item can belong to multiple clusters at the same time
Cluster
group of items
Partitioning
Separation of items into hard clusters
Also called a clustering
Hierarchical clustering
Clustering algorithm performing \(n-1\) steps building the dependencies between items as a rooted tree
Agglomerative clustering
Kind of hierarchical clustering, where the initial step is made of \(n\) clusters with one item each and the final step has one single cluster with the \(n\) items. At each step, the closest clusters are merged together.

Bibliography


