Over time, I have seen many ways to measure similarity or distance, and I have tried to keep an inventory.
Here is a “disorganized” list. Some formulas have explanations, some do not. This is a work in progress.
Before starting, it might be useful to state the common vocabulary (just in case).
The variety of functions can be explained by the variety of data types and data structures. We have:
For an $\mathbb{R}^n$ space, the maximal 1D-projection distance:
\[\max(A,B) = \max \{|a_1-b_1|, \ldots, |a_n-b_n|\}\]
Distance in bits between two strings.
If they do not have the same length, each extra bit of the longer string can be counted as $1$.
\[d(x_1, x_2) = \sum XOR(x_1, x_2)\]
XOR table :

XOR | 0 | 1 |
---|---|---|
0 | 0 | 1 |
1 | 1 | 0 |
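A minimal Python sketch of this bitwise distance; the padding convention (each extra bit of the longer string counts as $1$) follows the note above, and the function name is mine:

```python
def hamming(a: str, b: str) -> int:
    """Hamming distance between two bit strings, e.g. "10110" and "10011"."""
    # Compare overlapping positions bit by bit (XOR = 1 when the bits differ) ...
    core = sum(x != y for x, y in zip(a, b))
    # ... and count every extra bit of the longer string as a mismatch.
    return core + abs(len(a) - len(b))

print(hamming("10110", "10011"))  # 2
```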
Maximal distance between two sets in a metric space.
$d_H(X,Y) = \max\left\{ \sup_{x \in X}\inf_{y \in Y} d(x,y),\ \sup_{y \in Y}\inf_{x \in X} d(x,y) \right\}$
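A brute-force sketch for finite point sets with Euclidean ground distance (the arrays and function name are my own illustration):

```python
import numpy as np

def hausdorff(X: np.ndarray, Y: np.ndarray) -> float:
    # Pairwise distance matrix: D[i, j] = ||X[i] - Y[j]||
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    # sup_x inf_y d(x, y) and sup_y inf_x d(x, y)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

X = np.array([[0.0, 0.0], [1.0, 0.0]])
Y = np.array([[0.0, 1.0], [3.0, 0.0]])
print(hausdorff(X, Y))  # 2.0
```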
Distance between two points on a sphere.
First :
\[hav(\theta) = \sin^2 \left(\frac{\theta}{2}\right) = \frac{1 - \cos(\theta)}{2}\]
\[hav^{-1}(h) = 2 \arcsin(\sqrt{h})\]
Next, for two points with latitude/longitude $(\varphi_1, \lambda_1)$ and $(\varphi_2, \lambda_2)$ on a sphere of radius $r$:
\[d = r \cdot hav^{-1}\left( hav(\varphi_2 - \varphi_1) + \cos(\varphi_1)\cos(\varphi_2)\, hav(\lambda_2 - \lambda_1) \right)\]
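A minimal sketch of this haversine distance (the coordinates and Earth radius below are only example values; the helper name is mine):

```python
from math import radians, sin, cos, asin, sqrt

def haversine(lat1, lon1, lat2, lon2, r=6371.0):
    """Great-circle distance in the same unit as r (default: kilometres)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    h = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * r * asin(sqrt(h))

print(round(haversine(48.8566, 2.3522, 51.5074, -0.1278)))  # Paris-London, ~344 km
```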
Distance between strings: the number of characters to add, remove, or change to transform one string into the other.
Example :
Distance(Alice, Claie) = 3
In order to calculate the distance, there are three common algorithms :
TODO
Use a matrix of size $(n+1) \times (m+1)$ (dynamic programming; see the sketch below)
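A minimal dynamic-programming sketch of this $(n+1) \times (m+1)$ matrix approach, reproducing the Alice/Claie example (the function name is mine):

```python
def levenshtein(s: str, t: str) -> int:
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i          # delete i characters
    for j in range(m + 1):
        d[0][j] = j          # insert j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[n][m]

print(levenshtein("Alice", "Claie"))  # 3
```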
For a set of $n$-dimensional variables $X = \{ \mathbf{x}_1, \ldots, \mathbf{x}_m \}$ with $\mathbf{x}_i = (x_{i1}, \ldots, x_{in})^T$,
mean $\mu = (\mu_1, \ldots, \mu_n)^T$ and covariance matrix $\Sigma$:
$D_M(\mathbf{x})= \sqrt{(\mathbf{x}-\mu)^T \Sigma^{-1} (\mathbf{x}-\mu)}$
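A small NumPy sketch of the Mahalanobis distance of a point to a sample (data and names are mine; it assumes the empirical covariance matrix is invertible):

```python
import numpy as np

def mahalanobis(x: np.ndarray, X: np.ndarray) -> float:
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)   # n x n covariance matrix Sigma
    delta = x - mu
    return float(np.sqrt(delta @ np.linalg.inv(cov) @ delta))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
print(mahalanobis(np.array([1.0, 0.0, 0.0]), X))
```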
Distance for regression purposes.
\[MSE = \frac{1}{N} \sum_{i=1}^N (f(x_i) - y_i)^2\]
With:
Measures how well a generative model trained on strings is able to produce realistic, human-like text.
$PPL(M) = \exp \left(-\frac{1}{N} \sum_{k=1}^N \log P_M(w_k \mid w_1, \ldots, w_{k-1})\right)$
Distance on the sphere
$\Phi, \lambda$: latitude, longitude
$\Delta \sigma$: central angle between the two points, seen from the centre of the sphere
$\Delta \sigma = \arccos(\sin\Phi_1 \cdot \sin\Phi_2 + \cos\Phi_1 \cdot \cos\Phi_2 \cdot \cos\Delta\lambda )$
$d= r \Delta \sigma$
For $P$ and $Q$ two probability distributions over the same space $\Omega$, with $P$ absolutely continuous with respect to $Q$:
$d_f(P \| Q) = \int_{\Omega} f \left( \frac{dP}{dQ}\right) dQ$
(Absolutely continuous means that $Q(A) = 0$ implies $P(A) = 0$ for every measurable set $A$.)
This is a distance between two probability distributions, i.e., two random variables sharing the same space / the same support.
Information geometry of divergence functions
Divergence measures for statistical processing
$D[\pmb{p}, \pmb{q}] = \frac{4}{1 - \alpha^2} \left(1 - \sum p_i^{\frac{1- \alpha}{2}} q_i^{\frac{1+ \alpha}{2}}\right)$
For $\beta > -1$:
\[D[\pmb{p}, \pmb{q}] = \frac{1}{\beta(\beta + 1)} \sum(p_i^{\beta+1} - q_i^{\beta+1} - (\beta+1)q_i^{\beta}(p_i - q_i))\]
For $\beta = 0$:
\[D[\pmb{p}, \pmb{q}] = \sum p_i \log\frac{p_i}{q_i}\]
For $\beta = -1$:
\[D[\pmb{p}, \pmb{q}] = \sum (\log \frac{q_i}{p_i} + \frac{p_i}{q_i} - 1)\]
Divergence:
\[BC(p, q) = \sum_{x \in X} \sqrt{p(x)q(x)}\]
with $BC \in [0, 1]$.
Distance:
\[D_B(p, q) = - \log(BC(p, q))\]
with $D_B \in [0, \infty)$.
Note: it does not satisfy the triangle inequality.
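A minimal sketch of the Bhattacharyya coefficient and distance for two discrete distributions on the same support (the arrays are example values):

```python
import numpy as np

def bhattacharyya(p: np.ndarray, q: np.ndarray) -> tuple[float, float]:
    bc = float(np.sum(np.sqrt(p * q)))   # coefficient, in [0, 1]
    return bc, float(-np.log(bc))        # (BC, D_B)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.1, 0.4, 0.5])
print(bhattacharyya(p, q))
```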
Accuracy of probabilistic prediction.
\[BS = \frac{1}{N} \sum^N_{t=1} (f_t - o_t)^2\]
Distance from the predicted probability to the actual class.
$U(x)$ is a real-valued, strictly convex function. The divergence of $x_1$ compared to $x_0$ is:
$D_U(x_1;x_0) = U(x_1) - U(x_0) - \left\langle \nabla U(x_0), (x_1 - x_0) \right\rangle$
Properties:
But not:
$D[\pmb{p}, \pmb{q}] = \sum (p_i \log \frac{p_i}{q_i} + (1 - p_i)\log \frac{1-p_i}{1-q_i})$
$D[\pmb{p}, \pmb{q}] = \sum (\log \frac{q_i}{p_i} + \frac{p_i}{q_i} - 1)$
$Hel^m(P,Q)= \sum_{i=1}^n \left( p_i^{\frac{1}{m}} - q_i^{\frac{1}{m}}\right)^m$
$H \in [0, 1]$
$\alpha$-divergence of Amari
\[HelD(x, y) = \sqrt{2 \sum_{i=1}^n (\sqrt{x_i} - \sqrt{y_i})^2}\]
The symmetric version of the KL divergence.
$D[\pmb{p}, \pmb{q}] = \frac{1}{2} KL(\pmb{p}, \pmb{m}) + \frac{1}{2} KL(\pmb{q}, \pmb{m})$
with \(M = \frac{P + Q}{2}\)
Another version is parametrized by \(\alpha \in [0, 1]\):
$D[\pmb{p}, \pmb{q}, \alpha] = \alpha KL(\pmb{p}, \pmb{m}) + (1-\alpha) KL(\pmb{q}, \pmb{m})$
where \(M = \alpha P + (1-\alpha) Q\)
V2
Similarity between two probability distributions, based on the KL divergence.
Other names:
$JSD(P | Q) = \frac{1}{2} D(P | M) + \frac{1}{2} D(Q | M)$
where $M = \frac{1}{2}(P + Q)$
Best case: if $P = Q$, then $JSD = 0$. Worst case: if $P$ does not overlap $Q$ at all, $JSD$ reaches its maximum ($\log 2$). It is a way to make the KL divergence symmetric: both $P$ and $Q$ are penalized against their mixture $M$.
For $n$ distributions:
$P = \{ P_1, \ldots, P_n \}$ are the probability distributions and $\mathbf{\pi} = \{ \pi_1, \ldots, \pi_n \}$ are the weights:
$JSD_{\pi}(P) = H \left( \sum_i^n \pi_i P_i \right) - \sum_i^n \pi_i H(P_i)$
where $H$ is the Shannon Entropy
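A minimal sketch of this generalized form with the Shannon entropy $H$ (distributions and weights below are example values):

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def jsd(dists: list[np.ndarray], weights: list[float]) -> float:
    pi = np.asarray(weights)
    mixture = np.sum([w * p for w, p in zip(pi, dists)], axis=0)
    return entropy(mixture) - float(np.sum([w * entropy(p) for w, p in zip(pi, dists)]))

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.0, 0.5, 0.5])
print(jsd([p, q], [0.5, 0.5]))  # ~0.347; the maximum for two weights 1/2 is log(2) ~ 0.693
```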
The most common divergence.
\[D[\pmb{p}, \pmb{q}] = \sum p_i \log\frac{p_i}{q_i}\]
Dissimilarity between two probability distributions: the dissimilarity of $Q$ according to $P$ (weighted by $P$, as if $P$ was the reference distribution).
\[D_{KL}(P\|Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}\]
\[D_{KL}(P\|Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} dx\]
When $P(i) = 0$, the contribution is $0$. When $Q(i)$ is small and $P(i) \neq 0$, the contribution is big, proportional to the ratio between the "correct" distribution $P$ and $Q$.
It costs more to have $Q$ smaller than $P$. You "gain" (reduce the cost) if $Q(i) > P(i)$, but this implies that somewhere else there will be an important cost for another probability.
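A minimal sketch of the discrete KL divergence, where the zero-probability terms of $P$ contribute $0$ as noted above (arrays are example values):

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    mask = p > 0                          # P(i) = 0 contributes nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.4, 0.6, 0.0])
q = np.array([0.3, 0.6, 0.1])
print(kl_divergence(p, q))  # small penalty, only from the first bin
```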
Generic formula:
\[D[\pmb{p}, \pmb{q}] = \sum p_i f\left(\frac{q_i}{p_i}\right)\]
$f(u) = |u-1|$
$f(u) = (\sqrt{u} -1)^2$
$D [\pmb{p}, \pmb{q}] =\frac{1}{2}\sum \frac{(p_i - q_i)^2}{p_i}$
Overlap of two sets (unweighted)
For two sets \(U\) and \(V\):
\[S(U, V) = \frac{|U \cap V|}{|U \cup V|}\]
Similarity between two sets.
Usage: Compare number of similar \(n\)-grams in two strings
Short formula:
\[DSC = \frac{2|U \cap V|}{|U|+|V|}\]
Long formula:
\[DSC(X,Y) = \frac{2a}{2a + g + h}\]
With:
(using the same terms as in DICE)
\[S(X, Y) = \frac{a}{a + 2(g+h)}\]
Overlap between two sets. This can be weighted.
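A small sketch of Jaccard and Dice similarity on character bigrams, matching the $n$-gram usage mentioned above (helper names are mine):

```python
def ngrams(s: str, n: int = 2) -> set[str]:
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(u: set, v: set) -> float:
    return len(u & v) / len(u | v)

def dice(u: set, v: set) -> float:
    return 2 * len(u & v) / (len(u) + len(v))

a, b = ngrams("night"), ngrams("nacht")
print(jaccard(a, b), dice(a, b))  # only the bigram "ht" is shared
```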
Useful to compare word vectors.
Measurement of the similarity of two vectors of dimension $n$ denoted \(\mathbf{x}\) and \(\mathbf{y}\).
\[\cos (\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x}^T \mathbf{y}}{\sqrt{\| \mathbf{x} \|^2 \| \mathbf{y}\|^2}} \in [-1,1]\]
where $\| \mathbf{x} \|^2 = \sum_i x_i^2$ is the squared Euclidean norm.
Word similarity examples: $\cos$(Women, Men) $= -1$ (dissimilarity), $\cos$(Stone, Men) $= 0$ (no relationship), $\cos$(Men, Guy) $= 1$ (similarity).
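A minimal sketch of cosine similarity (the vectors are example values):

```python
import numpy as np

def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(x, y))  # 1.0: same direction
```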
Distance between the grades of observers (raters) when labeling data.
\[\kappa = \frac{Pr(a) - Pr(e)}{1 - Pr(e)}\]
$Pr(e)$: random (chance) consensus; $Pr(a)$: relative consensus between observers.
$\kappa = 1$: total agreement; $\kappa = 0$: no agreement beyond chance.
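A minimal sketch of Cohen's kappa for two raters over binary labels (label arrays and names are my own examples):

```python
import numpy as np

def cohen_kappa(a: np.ndarray, b: np.ndarray) -> float:
    pr_a = np.mean(a == b)                                   # observed agreement
    labels = np.union1d(a, b)
    # chance agreement: probability both raters pick the same label at random
    pr_e = sum(np.mean(a == lab) * np.mean(b == lab) for lab in labels)
    return float((pr_a - pr_e) / (1 - pr_e))

rater1 = np.array([1, 0, 1, 1, 0, 1])
rater2 = np.array([1, 0, 0, 1, 0, 1])
print(cohen_kappa(rater1, rater2))  # ~0.67
```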
How well a dendrogram preserves the pairwise distances.
$c = \frac{\sum_{i<j} (x_{ij} - \bar{x})(t(i,j) - \bar{t})}{\sqrt{\sum_{i<j} (x_{ij} - \bar{x})^2 \sum_{i<j}(t(i,j) - \bar{t})^2}}$
with
2011, Batista — Source
\[CID(x,y) = ED(x, y) \cdot CF(x,y)\]
Chouakria-Douzal, A. and Nagabhushan, P. N. (2007)
Adaptive dissimilarity index for measuring time series proximity
\[CORT(X,Y) = \frac{\sum_t (x_{t+1} - x_t)(y_{t+1} - y_t)}{\sqrt{\sum_t (x_{t+1} - x_t)^2 \sum_t (y_{t+1} - y_t)^2}}\]
$\rho_{i,j}(\tau)$: cross-correlation between $x_i$ and $x_j$ with lag $\tau$
\[d_{i,j} = \sqrt{\frac{(1 - \rho_{i,j}^2(0))}{\sum_{\tau=1} \rho_{i,j}^2(\tau)}}\]
Similarity: $s_{i,j} = \exp(- d_{i,j})$
$(S,A,C,\pi)$
According to the Bellman equation :
$C(s_i)_{tot} = C(s_i) + \sum_{j \in S} \gamma A(i,j) C(s_j)$
$\gamma \in (0,1]$
$\gamma$ is replaced by a kind of Poisson decay: $e^{-\lambda t} = \gamma(t)$
For a recommender system, a list of the $k$ best recommendations.
\[DCG_k= \sum_{i=1}^k \frac{rel_i}{\log_2(1+i)}\]
$rel_i$: relevance of the $i$-th best item, $\in [0, 1]$
$IDCG$ is the ideal discounted gain:
\[IDCG_p = \sum_{i=1}^{|Rel_k|} \frac{rel_i}{\log_2(1+i)}\]
Obtained by ranking the most relevant documents first … To check
Or, instead, simply $\sum_{i=1}^k \frac{1}{\log_2(1 + i)}$ (the best value you can get when all relevances equal $1$).
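A minimal sketch of $DCG_k$ and its normalized version $NDCG_k$ (relevance values and function names below are my own examples):

```python
import numpy as np

def dcg(relevances: np.ndarray, k: int) -> float:
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))   # log2(1 + i) for i = 1..k
    return float(np.sum(rel / discounts))

def ndcg(relevances: np.ndarray, k: int) -> float:
    ideal = np.sort(relevances)[::-1]                  # best possible ordering
    return dcg(relevances, k) / dcg(ideal, k)

ranked_rel = np.array([1.0, 0.0, 0.6, 0.3])
print(ndcg(ranked_rel, k=4))
```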
Distance between two different vectors, measuring what it takes to transform one into the other. It can also be defined between two distributions: it is related to the minimal mass transport needed to turn one distribution into the other.
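As an illustration of the transport view, SciPy provides a one-dimensional earth mover's (Wasserstein) distance between empirical samples; the samples below are arbitrary examples:

```python
import numpy as np
from scipy.stats import wasserstein_distance

u = np.random.default_rng(0).normal(loc=0.0, scale=1.0, size=1000)
v = np.random.default_rng(1).normal(loc=0.5, scale=1.0, size=1000)
print(wasserstein_distance(u, v))  # roughly the 0.5 shift between the means
```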
On an $\mathbb{R}^n$ space :
With $\mathbf{x}_j = (a_{1,j}, \ldots, a_{n,j})$:
\[d(\mathbf{x}_1, \mathbf{x}_2) = \sqrt{\sum^n_i (a_{i, 1} - a_{i,2})^2}\]
A geodesic is a curve which follows the minimal-distance path.
$\gamma : I \to M$ where $I$ is the unit interval space and $M$ is the metric space.
$\gamma$ is a geodesic if there exists $v \geq 0$ such that, for all $t \in I$, there is a neighbourhood $J$ of $t$ with, for all $(t_1, t_2) \in J$:
$d(\gamma(t_1), \gamma(t_2)) = v |t_1 - t_2|$
Distance from the predicted classification to the effective classification, for a set of $J$ items whose classification value (before rounding) is $f_i$. If $f_i$ is close to $0$ or $1$, the error related to item $i$ is minimal; if $f_i$ is close to $0.5$, i.e., randomly chosen, the error is maximal.
\[I_G(f) = \sum_{i=1}^J f_i(1 - f_i)\]
Used in decision trees.
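A one-line sketch of the Gini impurity over the per-item values $f_i$ (the numbers are arbitrary examples):

```python
import numpy as np

def gini_impurity(f: np.ndarray) -> float:
    return float(np.sum(f * (1.0 - f)))

print(gini_impurity(np.array([0.9, 0.1, 0.5])))  # the 0.5 item contributes the most (0.25)
```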
Rank correlation.
$N_s$: number of pairs ranked in the same order
$N_d$: number of pairs ranked in the reverse order
$G = \frac{N_s - N_d}{N_s + N_d}$
$G = \frac{\max_i | Y_i - \bar{Y} |}{s}$
Test:
\[G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t_{\alpha/2N, N-2}^2}{N-2+ t^2}}\]
Distance between time series.
$corr(a, b) = \frac{P - \frac{AB}{n}}{\sqrt{A_2 - \frac{A^2}{n}} \sqrt{B_2 - \frac{B^2}{n}}}$
With:
For a set of points $U$,
$\hat{u}= JC(U) = \arg\min_{u_s \in U} \max_{u_i \in U} Dist(u_s, u_i)$
It is the most central point, i.e., the point whose maximal distance to the other points is the lowest.
Question :
How to find it efficiently ?
One heuristic: find the maximal distance between all points, then take the point near the middle of this longest axis.
Length of the shortest computer program that is able to generate the "string" / "byte string" from scratch in a given language.
Usage: with $k$-NN, as an outlier metric.
$k\text{-distance}(A)$: distance from $A$ to its $k$-th nearest neighbour.
$reachability_k(A,B) = \max \left\{ k\text{-distance}(B), d(A,B)\right\}$
If $A$ is not among the $k$ first neighbours of $B$, then take the distance $d(A,B)$. The reachability is, at minimum, the distance between $A$ and $B$ and, at most, the radius of the $k$ neighbours of $B$.
$lrd(A) = \left( \frac{\sum_{B \in N_k(A)} reachability_k(A,B)}{ | N_k(A) | } \right)^{-1}$
For $n$ vectors of dim $D$, $\mathbf{x} =(x_1, …, x_D)^T$
Measure the error $\epsilon (\mathbf{w}) = \sum_i \| \mathbf{x}_i - \sum_j w_{ij} \mathbf{x}_j \|^2$
Find the weights in order to preserve the mapping, via $\Psi(\mathbf{y}) = \sum_i \| \mathbf{y}_i - \sum_j w_{ij} \mathbf{y}_j \|^2$
Measures the difference between two distributions.
$D_{\alpha}(P|Q)= \frac{1}{\alpha -1} \log \sum_{i=1}^n p_i^{\alpha} q_i^{1 - \alpha}$
$H_{\alpha}(P)= \frac{1}{\alpha -1} \log \sum_{i=1}^n p_i^{\alpha}$
If $Q$ is uniform: $D_{\alpha}(P\|Q) = H_{\alpha}(P) - H_{\alpha}(Q)$
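A minimal sketch of the Rényi divergence of order $\alpha$ for discrete distributions (arrays and $\alpha$ are example values; it assumes $q_i > 0$):

```python
import numpy as np

def renyi_divergence(p: np.ndarray, q: np.ndarray, alpha: float) -> float:
    assert alpha > 0 and alpha != 1
    return float(np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1))

p = np.array([0.4, 0.6])
q = np.array([0.5, 0.5])
print(renyi_divergence(p, q, alpha=2.0))
```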
$d_{STS} = \sqrt{\sum_{k=1}^P \left( \frac{v_{k+1} - v_k}{t_{k+1} - t_k} - \frac{u_{k+1} - u_k}{t_{k+1} - t_k} \right)^2}$
The series need to be normalized first to obtain consistent distances.
Squared differences of the derivatives (slopes).
$T(A,B) = \frac{A \cdot B}{\| A \|^2 + \| B \|^2 - A \cdot B}$
$\delta(\mathbf{P}_r, \mathbf{P}_g) = \sup_{A \in \Sigma} | \mathbf{P}_r(A) - \mathbf{P}_g(A) |$
| X | |
---|---|---|
X | A | X |
| X | |
And for hexagonal cells (2D): 6 neighbours. Same distance as Moore in that case.
NW | N | NE |
---|---|---|
W | A | E |
SW | S | SE |
Paturi : clustering malware
$p$, the shortest program to compute $x$ from $y$ or $y$ from $x$.
\[|p| = \max\{K(x|y),K(y|x)\}\]
$K(x | y)$: algorithmic information of $x$ given $y$
To express a similarity, normalize (by the larger of the two complexities) and obtain the Normalized Information Distance:
\[NID(x,y) = \frac{\max\{K(x|y),K(y|x)\}}{\max\{K(x),K(y)\}}\]
As this distance is not computable in general, use a compressor $Z$ instead: $Z(x)$ is the bit length of $x$ compressed using a specific algorithm, such as gzip, PPMZ or others.
\[NCD_Z (x,y) = \frac{Z(xy) - \min\{Z(x),Z(y)\}}{\max\{Z(x),Z(y)\}}\]
NCD is robust to noise and parameter-free (apart from the choice of the compressor $Z$).
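A hedged sketch of NCD with `zlib` as the compressor $Z$ (any real compressor only approximates the ideal $K$; the strings are toy examples):

```python
import zlib

def Z(data: bytes) -> int:
    return len(zlib.compress(data, 9))   # compressed bit length up to a constant factor

def ncd(x: bytes, y: bytes) -> float:
    zx, zy, zxy = Z(x), Z(y), Z(x + y)
    return (zxy - min(zx, zy)) / max(zx, zy)

a = b"the quick brown fox jumps over the lazy dog" * 10
b = b"the quick brown fox jumps over the lazy cat" * 10
print(ncd(a, b))   # close to 0: the strings are highly similar
```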
Gives the semantic similarity of two words $x$ and $y$:
\[NGD(x,y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x,y)}{\log N - \min\{\log f(x), \log f(y)\}}\]
Use the page hit counts: $f(x)$: number of pages containing $x$; $f(x,y)$: number of pages containing $x$ AND $y$; $N$: number of pages indexed.
Intro image from stephanie2212 on Freepik.
>> You can subscribe to my mailing list here for a monthly update. <<