\(t\)-SNE is a nice embedding algorithm that helps reveal existing clusters. There are a few rules of thumb to make it work better; I will present them shortly. Then, I will present a method to tackle the problem of low contrast, i.e., when all clusters seem connected to each other.
It is recommended to create an embedding following these two steps:
1. PCA, keeping the two first eigenvectors (or more if your embedding has more dimensions);
2. tSNE, initialized with this PCA projection.

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
X1 = PCA().fit_transform(X)[:,:2]
Y = TSNE(init=X1).fit_transform(X)
When \(t\)-SNE is given a plain data array, this PCA initialization is used by default (see the "init" parameter in the scikit-learn tSNE documentation).
Another recommendation is to chain two runs: a first embedding with a high perplexity to capture the global structure, then a second one with a lower perplexity, initialized with the first:
Y0 = TSNE(perplexity=300).fit_transform(X)
Y1 = TSNE(perplexity=30, init=Y0).fit_transform(X)
Sometimes, clusters do not separate from each other and we get a kind of single, fully connected component. Such an embedding can still be clustered, but with low accuracy.
As you can see, there is a continuum between the different high-density locations.
In some way, this is "the truth": it is what you really get from your data. However, \(t\)-SNE is a tool for data analysis and visualization. The output will not be reused afterwards, therefore modifying it to amplify the clusters is not a big deal.
The idea of the trick is the following: \(t\)-SNE is unable to separate the items correctly because the distances between them are not sufficiently discriminative.
To improve the contrast, we will modify these distances. Normally, \(t\)-SNE works with a distance matrix defined by:
\[D_{i, j} = d(\mathbf{x}_i, \mathbf{x}_j)\]
To enhance the differences, the second embedding (the one with the small perplexity) is computed using the modified distance function:
\[D'_{i, j} = d(\mathbf{x}_i, \mathbf{x}_j) \times d^{tSNE}(\mathbf{y}_i, \mathbf{y}_j)\]
where \(d^{tSNE}(\mathbf{y}_i, \mathbf{y}_j) = \|\mathbf{y}_i - \mathbf{y}_j\|^2\) is the squared Euclidean distance between the items, measured in the first, high-perplexity embedding.
This way, thanks to the first high-perplexity embedding, we have a rough notion of proximity that the second run can take into account.
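To see why this amplifies the contrast, here is a small numerical illustration (the values are made up for the example). Take two pairs of points at the same original distance \(d(\mathbf{x}_i, \mathbf{x}_j) = 1\). If the first pair lies close together in the high-perplexity embedding (\(d^{tSNE} = 0.1\)) while the second pair lies far apart (\(d^{tSNE} = 10\)), the modified distances become:
\[D'_{\text{close}} = 1 \times 0.1 = 0.1, \qquad D'_{\text{far}} = 1 \times 10 = 10\]
so two pairs that looked identical in the original space now differ by a factor of \(100\), which gives the low-perplexity run a much stronger separation signal.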
import numpy as np
from sklearn.manifold import TSNE

# D: precomputed square matrix of pairwise distances, D[i, j] = d(x_i, x_j)
# First, high-perplexity embedding computed from D
Y0 = TSNE(perplexity=300, metric="precomputed", init="random").fit_transform(D)
# Squared Euclidean distances measured in this first embedding
D2 = ((Y0[:, np.newaxis] - Y0)**2).sum(axis=2)
# Enhanced low-perplexity embedding, using the modified distances D * D2
Y2 = TSNE(perplexity=30, init=Y0, metric="precomputed").fit_transform(D * D2)
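The snippet above assumes D is already available. One possible way to build it, not shown in the original code, is scikit-learn's pairwise_distances (the Euclidean metric here is an assumption; any metric suited to the data would work):
from sklearn.metrics import pairwise_distances

# Assumed construction of D: Euclidean pairwise distances between all rows of X
D = pairwise_distances(X, metric="euclidean")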
The drawback of this method is its slowness. Indeed, when TSNE() is given a regular \(n\)-dimensional data array, it can use an internal optimization trick. Here, because we feed a precomputed distance matrix, we first need the memory to store this full matrix, and then the optimization trick cannot be used.
As a workaround, it is possible to concatenate the first embedding to the dataset and to transform it the following way:
# Initial high-perplexity embedding
Y0 = TSNE(perplexity=300).fit_transform(X)
k = 10  # scaling factor
# Augmented dataset: original features plus the scaled embedding coordinates
X1 = np.concatenate([X, Y0 * k], axis=1)
Here, k should vary with the number of dimensions in X and the desired effect.
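The snippet stops at the augmented dataset; the final step is not shown above, but presumably the low-perplexity run is then applied to X1, initialized with the first embedding, along these lines:
# Assumed final step: low-perplexity t-SNE on the augmented dataset
Y1 = TSNE(perplexity=30, init=Y0).fit_transform(X1)
Because X1 is a plain array rather than a precomputed distance matrix, this version keeps the faster code path mentioned above.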
You can see that the clusters found in the enhanced embedding are coherent with what exists in the primary embedding.
The main dense areas are well recognized. However, the result is less reliable for items near the boundaries.
Instead of D * D2, we could modulate the strength of D2. Here are a few examples where D2 is replaced by D2**x. When increasing x above 1, the effect is limited.
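As a sketch, the exponent is simply applied to the embedding-distance term before the multiplication (the value of x below is only an illustration):
x = 0.5  # exponent controlling the strength of the contrast enhancement
Y2 = TSNE(perplexity=30, init=Y0, metric="precomputed").fit_transform(D * D2**x)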