Over time, I have seen many ways to measure similarity or distance, and I have tried to keep an inventory.
Here is a “disorganized” list. Some formulas have explanations, some do not. This is a work in progress.
Before starting, it might be useful to state the common vocabulary (just in case).
The variety of functions can be explained by the variety of data types and data structures. We have:
For an $\mathbb{R}^n$ space, the maximal 1D-projection distance:
\[\max(A,B) = \max \{|a_1-b_1|, \ldots, |a_n-b_n|\}\]
Distance in bits between two strings.
If they do not have the same length, each extra bit of the longer string can be counted as $1$.
\[d(x_1, x_2) = \sum XOR(x_1, x_2)\]
XOR table :

XOR | 0 | 1 |
---|---|---|
0 | 0 | 1 |
1 | 1 | 0 |
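A minimal Python sketch of this bitwise distance; the padding convention (each extra bit of the longer string counts as $1$) follows the note above, and the function name is mine:

```python
def hamming(a: str, b: str) -> int:
    """Hamming distance between two bit strings, e.g. "10110" and "10011"."""
    # Compare overlapping positions bit by bit (XOR = 1 when the bits differ) ...
    core = sum(x != y for x, y in zip(a, b))
    # ... and count every extra bit of the longer string as a mismatch.
    return core + abs(len(a) - len(b))

print(hamming("10110", "10011"))  # 2
```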
Maximal distance between two sets in a metric space.
$d_H(X,Y) = \max\left\{ \sup_{x \in X}\inf_{y \in Y} d(x,y),\ \sup_{y \in Y}\inf_{x \in X} d(x,y) \right\}$
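A brute-force sketch for finite point sets with Euclidean ground distance (the arrays and function name are my own illustration):

```python
import numpy as np

def hausdorff(X: np.ndarray, Y: np.ndarray) -> float:
    # Pairwise distance matrix: D[i, j] = ||X[i] - Y[j]||
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    # sup_x inf_y d(x, y) and sup_y inf_x d(x, y)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

X = np.array([[0.0, 0.0], [1.0, 0.0]])
Y = np.array([[0.0, 1.0], [3.0, 0.0]])
print(hausdorff(X, Y))  # 2.0
```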
Distance between two points on a sphere.
First :
\[hav(\theta) = \sin^2 \left(\frac{\theta}{2}\right) = \frac{1 - \cos(\theta)}{2}\]
\[hav^{-1}(h) = 2 \arcsin(\sqrt{h})\]
Next, for two points with latitude/longitude $(\varphi_1, \lambda_1)$ and $(\varphi_2, \lambda_2)$ on a sphere of radius $r$:
\[d = r \cdot hav^{-1}\left( hav(\varphi_2 - \varphi_1) + \cos(\varphi_1)\cos(\varphi_2)\, hav(\lambda_2 - \lambda_1) \right)\]
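A minimal sketch of this haversine distance (the coordinates and Earth radius below are only example values; the helper name is mine):

```python
from math import radians, sin, cos, asin, sqrt

def haversine(lat1, lon1, lat2, lon2, r=6371.0):
    """Great-circle distance in the same unit as r (default: kilometres)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    h = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * r * asin(sqrt(h))

print(round(haversine(48.8566, 2.3522, 51.5074, -0.1278)))  # Paris-London, ~344 km
```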
Distance between strings: the number of characters to add, remove, or change to transform one string into the other.
Example :
Distance(Alice, Claie) = 3
In order to calculate the distance, there are three common algorithms :
TODO
Use a matrix of size $(n+1) \times (m+1)$ (dynamic programming; see the sketch below)
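A minimal dynamic-programming sketch of this $(n+1) \times (m+1)$ matrix approach, reproducing the Alice/Claie example (the function name is mine):

```python
def levenshtein(s: str, t: str) -> int:
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i          # delete i characters
    for j in range(m + 1):
        d[0][j] = j          # insert j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[n][m]

print(levenshtein("Alice", "Claie"))  # 3
```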
For a set of $n$-dimensional variables $X = \{ \mathbf{x}_1, \ldots, \mathbf{x}_m \}$ with $\mathbf{x}_i = (x_{i1}, \ldots, x_{in})^T$,
mean $\mu = (\mu_1, \ldots, \mu_n)^T$ and covariance matrix $\Sigma$:
$D_M(\mathbf{x})= \sqrt{(\mathbf{x}-\mu)^T \Sigma^{-1} (\mathbf{x}-\mu)}$
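A small NumPy sketch of the Mahalanobis distance of a point to a sample (data and names are mine; it assumes the empirical covariance matrix is invertible):

```python
import numpy as np

def mahalanobis(x: np.ndarray, X: np.ndarray) -> float:
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)   # n x n covariance matrix Sigma
    delta = x - mu
    return float(np.sqrt(delta @ np.linalg.inv(cov) @ delta))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
print(mahalanobis(np.array([1.0, 0.0, 0.0]), X))
```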
Distance for regression purposes.
\[MSE = \frac{1}{N} \sum_{i=1}^N (f(x_i) - y_i)^2\]
With:
Measures how well a generative model trained on strings is able to produce realistic, human-like text.
$PPL(M) = \exp \left(-\frac{1}{N} \sum_{k=1}^N \log P_M(w_k \mid w_1, \ldots, w_{k-1})\right)$
Distance on the sphere
$\Phi, \lambda$: latitude, longitude
$\Delta \sigma$: central angle between the two points, seen from the centre of the sphere
$\Delta \sigma = \arccos(\sin\Phi_1 \cdot \sin\Phi_2 + \cos\Phi_1 \cdot \cos\Phi_2 \cdot \cos\Delta\lambda )$
$d= r \Delta \sigma$
For $P$ and $Q$ two probability distributions over the same space $\Omega$, with $P$ absolutely continuous with respect to $Q$:
$d_f(P \| Q) = \int_{\Omega} f \left( \frac{dP}{dQ}\right) dQ$
(Absolutely continuous means that $Q(A) = 0$ implies $P(A) = 0$ for every measurable set $A$.)
This is a distance between two probability distributions, i.e., two random variables sharing the same space / the same support.
Information geometry of divergence functions
Divergence measures for statistical processing
$D[\pmb{p}, \pmb{q}] = \frac{4}{1 - \alpha^2} \left(1 - \sum p_i^{\frac{1- \alpha}{2}} q_i^{\frac{1+ \alpha}{2}}\right)$
For $\beta > -1$:
\[D[\pmb{p}, \pmb{q}] = \frac{1}{\beta(\beta + 1)} \sum(p_i^{\beta+1} - q_i^{\beta+1} - (\beta+1)q_i^{\beta}(p_i - q_i))\]
For $\beta = 0$:
\[D[\pmb{p}, \pmb{q}] = \sum p_i \log\frac{p_i}{q_i}\]
For $\beta = -1$:
\[D[\pmb{p}, \pmb{q}] = \sum (\log \frac{q_i}{p_i} + \frac{p_i}{q_i} - 1)\]
Divergence:
\[BC(p, q) = \sum_{x \in X} \sqrt{p(x)q(x)}\]
with $BC \in [0, 1]$.
Distance:
\[D_B(p, q) = - \log(BC(p, q))\]
with $D_B \in [0, \infty)$.
Note: it does not satisfy the triangle inequality.
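A minimal sketch of the Bhattacharyya coefficient and distance for two discrete distributions on the same support (the arrays are example values):

```python
import numpy as np

def bhattacharyya(p: np.ndarray, q: np.ndarray) -> tuple[float, float]:
    bc = float(np.sum(np.sqrt(p * q)))   # coefficient, in [0, 1]
    return bc, float(-np.log(bc))        # (BC, D_B)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.1, 0.4, 0.5])
print(bhattacharyya(p, q))
```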
Accuracy of probabilistic prediction.
\[BS = \frac{1}{N} \sum^N_{t=1} (f_t - o_t)^2\]
Distance from the predicted probability to the actual class.
$U(x)$ is a real-valued, strictly convex function. The divergence of $x_1$ compared to $x_0$ is:
$D_U(x_1;x_0) = U(x_1) - U(x_0) - \left\langle \nabla U(x_0), (x_1 - x_0) \right\rangle$
Properties:
But not:
$D[\pmb{p}, \pmb{q}] = \sum (p_i \log \frac{p_i}{q_i} + (1 - p_i)\log \frac{1-p_i}{1-q_i})$
$D[\pmb{p}, \pmb{q}] = \sum (\log \frac{q_i}{p_i} + \frac{p_i}{q_i} - 1)$
$Hel^m(P,Q)= \sum_{i=1}^n \left( p_i^{\frac{1}{m}} - q_i^{\frac{1}{m}}\right)^m$
$H \in [0, 1]$
$\alpha$-divergence of Amari
\[HelD(x, y) = \sqrt{2 \sum_{i=1}^n (\sqrt{x_i} - \sqrt{y_i})^2}\]
The symmetric version of the KL divergence.
$D[\pmb{p}, \pmb{q}] = \frac{1}{2} KL(\pmb{p}, \pmb{m}) + \frac{1}{2} KL(\pmb{q}, \pmb{m})$
with \(M = \frac{P + Q}{2}\)
Another version is parametrized by \(\alpha \in [0, 1]\):
$D[\pmb{p}, \pmb{q}, \alpha] = \alpha KL(\pmb{p}, \pmb{m}) + (1-\alpha) KL(\pmb{q}, \pmb{m})$
where \(M = \alpha P + (1-\alpha) Q\)
V2
Similarity between two probability distributions, based on the KL divergence.
Other names:
$JSD(P | Q) = \frac{1}{2} D(P | M) + \frac{1}{2} D(Q | M)$
where $M = \frac{1}{2}(P + Q)$
Best case: if $P = Q$, then $JSD = 0$. Worst case: if $P$ does not overlap $Q$ at all, $JSD$ reaches its maximum ($\log 2$). It is a way to make the KL divergence symmetric: both $P$ and $Q$ are penalized against their mixture $M$.
For $n$ distributions:
$P = \{ P_1, \ldots, P_n \}$ are the probability distributions and $\mathbf{\pi} = \{ \pi_1, \ldots, \pi_n \}$ are the weights:
$JSD_{\pi}(P) = H \left( \sum_i^n \pi_i P_i \right) - \sum_i^n \pi_i H(P_i)$
where $H$ is the Shannon Entropy
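A minimal sketch of this generalized form with the Shannon entropy $H$ (distributions and weights below are example values):

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def jsd(dists: list[np.ndarray], weights: list[float]) -> float:
    pi = np.asarray(weights)
    mixture = np.sum([w * p for w, p in zip(pi, dists)], axis=0)
    return entropy(mixture) - float(np.sum([w * entropy(p) for w, p in zip(pi, dists)]))

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.0, 0.5, 0.5])
print(jsd([p, q], [0.5, 0.5]))  # ~0.347; the maximum for two weights 1/2 is log(2) ~ 0.693
```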
The most common divergence.
\[D[\pmb{p}, \pmb{q}] = \sum p_i \log\frac{p_i}{q_i}\]
Dissimilarity between two probability distributions: the dissimilarity of $Q$ according to $P$ (weighted by $P$, as if $P$ was the reference distribution).
\[D_{KL}(P\|Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}\]
\[D_{KL}(P\|Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} dx\]
When $P(i) = 0$, the contribution is $0$. When $Q(i)$ is small and $P(i) \neq 0$, the contribution is big, proportional to the ratio between the "correct" distribution $P$ and $Q$.
It costs more to have $Q$ smaller than $P$. You "gain" (reduce the cost) if $Q(i) > P(i)$, but this implies that somewhere else there will be an important cost for another probability.
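A minimal sketch of the discrete KL divergence, where the zero-probability terms of $P$ contribute $0$ as noted above (arrays are example values):

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    mask = p > 0                          # P(i) = 0 contributes nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.4, 0.6, 0.0])
q = np.array([0.3, 0.6, 0.1])
print(kl_divergence(p, q))  # small penalty, only from the first bin
```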
Generic formula:
\[D[\pmb{p}, \pmb{q}] = \sum p_i f\left(\frac{q_i}{p_i}\right)\]
$f(u) = |u-1|$
$f(u) = (\sqrt{u} -1)^2$
$D [\pmb{p}, \pmb{q}] =\frac{1}{2}\sum \frac{(p_i - q_i)^2}{p_i}$
Overlap of two sets (unweighted)
For two sets \(U\) and \(V\):
\[S(U, V) = \frac{|U \cap V|}{|U \cup V|}\]
Similarity between two sets.
Usage: Compare number of similar \(n\)-grams in two strings
Short formula:
\[DSC = \frac{2|U \cap V|}{|U|+|V|}\]
Long formula:
\[DSC(X,Y) = \frac{2a}{2a + g + h}\]
With:
(using the same terms as in DICE)
\[S(X, Y) = \frac{a}{a + 2(g+h)}\]
Overlap between two sets. This can be weighted.
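A small sketch of Jaccard and Dice similarity on character bigrams, matching the $n$-gram usage mentioned above (helper names are mine):

```python
def ngrams(s: str, n: int = 2) -> set[str]:
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(u: set, v: set) -> float:
    return len(u & v) / len(u | v)

def dice(u: set, v: set) -> float:
    return 2 * len(u & v) / (len(u) + len(v))

a, b = ngrams("night"), ngrams("nacht")
print(jaccard(a, b), dice(a, b))  # only the bigram "ht" is shared
```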
Useful to compare word vectors.
Measurement of the similarity of two vectors of dimension $n$ denoted \(\mathbf{x}\) and \(\mathbf{y}\).
\[\cos (\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x}^T \mathbf{y}}{\sqrt{\| \mathbf{x} \|^2 \| \mathbf{y}\|^2}} \in [-1,1]\]
where $\| \mathbf{x} \|^2 = \sum_i x_i^2$ is the squared Euclidean norm.
Word similarity examples: $\cos$(Women, Men) $= -1$ (dissimilarity), $\cos$(Stone, Men) $= 0$ (no relationship), $\cos$(Men, Guy) $= 1$ (similarity).
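A minimal sketch of cosine similarity (the vectors are example values):

```python
import numpy as np

def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(x, y))  # 1.0: same direction
```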
Distance between the grades of observers (raters) when labeling data.
\[\kappa = \frac{Pr(a) - Pr(e)}{1 - Pr(e)}\]
$Pr(e)$: random (chance) consensus; $Pr(a)$: relative consensus between observers.
$\kappa = 1$: total agreement; $\kappa = 0$: no agreement beyond chance.
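A minimal sketch of Cohen's kappa for two raters over binary labels (label arrays and names are my own examples):

```python
import numpy as np

def cohen_kappa(a: np.ndarray, b: np.ndarray) -> float:
    pr_a = np.mean(a == b)                                   # observed agreement
    labels = np.union1d(a, b)
    # chance agreement: probability both raters pick the same label at random
    pr_e = sum(np.mean(a == lab) * np.mean(b == lab) for lab in labels)
    return float((pr_a - pr_e) / (1 - pr_e))

rater1 = np.array([1, 0, 1, 1, 0, 1])
rater2 = np.array([1, 0, 0, 1, 0, 1])
print(cohen_kappa(rater1, rater2))  # ~0.67
```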
How well a dendrogram preserves the pairwise distances.
$c = \frac{\sum_{i<j} (x_{ij} - \bar{x})(t(i,j) - \bar{t})}{\sqrt{\sum_{i<j} (x_{ij} - \bar{x})^2 \sum_{i<j}(t(i,j) - \bar{t})^2}}$
with
2011, Batista — Source
\[CID(x,y) = ED(x, y) \cdot CF(x,y)\]
Chouakria-Douzal, A. and Nagabhushan, P. N. (2007)
Adaptive dissimilarity index for measuring time series proximity
\[CORT(X,Y) = \frac{\sum_t (x_{t+1} - x_t)(y_{t+1} - y_t)}{\sqrt{\sum_t (x_{t+1} - x_t)^2 \sum_t (y_{t+1} - y_t)^2}}\]
$\rho_{i,j}(\tau)$: cross-correlation between $x_i$ and $x_j$ with lag $\tau$
\[d_{i,j} = \sqrt{\frac{(1 - \rho_{i,j}^2(0))}{\sum_{\tau=1} \rho_{i,j}^2(\tau)}}\]
Similarity: $s_{i,j} = \exp(- d_{i,j})$
$(S,A,C,\pi)$
According to the Bellman equation :
$C(s_i)_{tot} = C(s_i) + \sum_{j \in S} \gamma A(i,j) C(s_j)$
$\gamma \in (0,1]$
$\gamma$ is replaced by a kind of Poisson decay: $e^{-\lambda t} = \gamma(t)$
For a recommender system, a list of the $k$ best recommendations.
\[DCG_k= \sum_{i=1}^k \frac{rel_i}{\log_2(1+i)}\]
$rel_i$: relevance of the $i$-th best item, $\in [0, 1]$
$IDCG$ is the ideal discounted gain:
\[IDCG_p = \sum_{i=1}^{|Rel_k|} \frac{rel_i}{\log_2(1+i)}\]
Obtained by ranking the most relevant documents first … To check
Or, instead, simply $\sum_{i=1}^k \frac{1}{\log_2(1 + i)}$ (the best value you can get when all relevances equal $1$).
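A minimal sketch of $DCG_k$ and its normalized version $NDCG_k$ (relevance values and function names below are my own examples):

```python
import numpy as np

def dcg(relevances: np.ndarray, k: int) -> float:
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))   # log2(1 + i) for i = 1..k
    return float(np.sum(rel / discounts))

def ndcg(relevances: np.ndarray, k: int) -> float:
    ideal = np.sort(relevances)[::-1]                  # best possible ordering
    return dcg(relevances, k) / dcg(ideal, k)

ranked_rel = np.array([1.0, 0.0, 0.6, 0.3])
print(ndcg(ranked_rel, k=4))
```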
Distance between two different vectors, measuring what it takes to transform one into the other. It can also be defined between two distributions: it is related to the minimal mass transport needed to turn one distribution into the other.
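As an illustration of the transport view, SciPy provides a one-dimensional earth mover's (Wasserstein) distance between empirical samples; the samples below are arbitrary examples:

```python
import numpy as np
from scipy.stats import wasserstein_distance

u = np.random.default_rng(0).normal(loc=0.0, scale=1.0, size=1000)
v = np.random.default_rng(1).normal(loc=0.5, scale=1.0, size=1000)
print(wasserstein_distance(u, v))  # roughly the 0.5 shift between the means
```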
On an $\mathbb{R}^n$ space :
With $\mathbf{x}_j = (a_{1,j}, \ldots, a_{n,j})$:
\[d(\mathbf{x}_1, \mathbf{x}_2) = \sqrt{\sum^n_i (a_{i, 1} - a_{i,2})^2}\]
A geodesic is a curve which follows the minimal-distance path.
$\gamma : I \to M$ where $I$ is the unit interval space and $M$ is the metric space.
$\gamma$ is a geodesic if there exists $v \geq 0$ such that, for all $t \in I$, there is a neighbourhood $J$ of $t$ with, for all $(t_1, t_2) \in J$:
$d(\gamma(t_1), \gamma(t_2)) = v |t_1 - t_2|$
Distance from the predicted classification to the effective classification, for a set of $J$ items whose classification value (before rounding) is $f_i$. If $f_i$ is close to $0$ or $1$, the error related to item $i$ is minimal; if $f_i$ is close to $0.5$, i.e., randomly chosen, the error is maximal.
\[I_G(f) = \sum_{i=1}^J f_i(1 - f_i)\]
Used in decision trees.
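A one-line sketch of the Gini impurity over the per-item values $f_i$ (the numbers are arbitrary examples):

```python
import numpy as np

def gini_impurity(f: np.ndarray) -> float:
    return float(np.sum(f * (1.0 - f)))

print(gini_impurity(np.array([0.9, 0.1, 0.5])))  # the 0.5 item contributes the most (0.25)
```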
Rank correlation.
$N_s$: number of pairs ranked in the same order
$N_d$: number of pairs ranked in the reverse order
$G = \frac{N_s - N_d}{N_s + N_d}$
$G = \frac{\max_i | Y_i - \bar{Y} |}{s}$
Test:
\[G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t_{\alpha/2N, N-2}^2}{N-2+ t^2}}\]
Distance between time series.
$corr(a, b) = \frac{P - \frac{AB}{n}}{\sqrt{A_2 - \frac{A^2}{n}} \sqrt{B_2 - \frac{B^2}{n}}}$
With:
For a set of points $U$,
$\hat{u}= JC(U) = \arg\min_{u_s \in U} \max_{u_i \in U} Dist(u_s, u_i)$
It is the most central point, i.e., the point whose maximal distance to the other points is the lowest.
Question :
How to find it efficiently ?
One heuristic: find the maximal distance between all points, then take the point near the middle of this longest axis.
Length of the shortest computer program that is able to generate the "string" / "byte string" from scratch in a given language.
Usage: with $k$-NN, as an outlier metric.
$k\text{-distance}(A)$: distance from $A$ to its $k$-th nearest neighbour.
$reachability_k(A,B) = \max \left\{ k\text{-distance}(B), d(A,B)\right\}$
If $A$ is not among the $k$ first neighbours of $B$, then take the distance $d(A,B)$. The reachability is, at minimum, the distance between $A$ and $B$ and, at most, the radius of the $k$ neighbours of $B$.
$lrd(A) = \left( \frac{\sum_{B \in N_k(A)} reachability_k(A,B)}{ | N_k(A) | } \right)^{-1}$
For $n$ vectors of dim $D$, $\mathbf{x} =(x_1, …, x_D)^T$
Measure the error $\epsilon (\mathbf{w}) = \sum_i \| \mathbf{x}_i - \sum_j w_{ij} \mathbf{x}_j \|^2$
Find the weights in order to preserve the mapping, via $\Psi(\mathbf{y}) = \sum_i \| \mathbf{y}_i - \sum_j w_{ij} \mathbf{y}_j \|^2$
Measures the difference between two distributions.
$D_{\alpha}(P|Q)= \frac{1}{\alpha -1} \log \sum_{i=1}^n p_i^{\alpha} q_i^{1 - \alpha}$
$H_{\alpha}(P)= \frac{1}{\alpha -1} \log \sum_{i=1}^n p_i^{\alpha}$
If $Q$ is uniform: $D_{\alpha}(P\|Q) = H_{\alpha}(P) - H_{\alpha}(Q)$
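A minimal sketch of the Rényi divergence of order $\alpha$ for discrete distributions (arrays and $\alpha$ are example values; it assumes $q_i > 0$):

```python
import numpy as np

def renyi_divergence(p: np.ndarray, q: np.ndarray, alpha: float) -> float:
    assert alpha > 0 and alpha != 1
    return float(np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1))

p = np.array([0.4, 0.6])
q = np.array([0.5, 0.5])
print(renyi_divergence(p, q, alpha=2.0))
```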
$d_{STS} = \sqrt{\sum_{k=1}^P \left( \frac{v_{k+1} - v_k}{t_{k+1} - t_k} - \frac{u_{k+1} - u_k}{t_{k+1} - t_k} \right)^2}$
The series need to be normalized first to obtain consistent distances.
Squared differences of the derivatives (slopes).
$T(A,B) = \frac{A \cdot B}{\| A \|^2 + \| B \|^2 - A \cdot B}$
$\delta(\mathbf{P}_r, \mathbf{P}_g) = \sup_{A \in \Sigma} | \mathbf{P}_r(A) - \mathbf{P}_g(A) |$
| X | |
---|---|---|
X | A | X |
| X | |
And for hexagonal cells (2D): 6 neighbours. Same distance as Moore in that case.
NW | N | NE |
---|---|---|
W | A | E |
SW | S | SE |
Paturi : clustering malware
$p$, the shortest program to compute $x$ from $y$ or $y$ from $x$.
\[|p| = \max\{K(x|y),K(y|x)\}\]
$K(x | y)$: algorithmic information of $x$ given $y$
To express a similarity, normalize (by the larger of the two complexities) and obtain the Normalized Information Distance:
\[NID(x,y) = \frac{\max\{K(x|y),K(y|x)\}}{\max\{K(x),K(y)\}}\]
As this distance is not computable in general, use a compressor $Z$ instead: $Z(x)$ is the bit length of $x$ compressed using a specific algorithm, such as gzip, PPMZ or others.
\[NCD_Z (x,y) = \frac{Z(xy) - \min\{Z(x),Z(y)\}}{\max\{Z(x),Z(y)\}}\]
NCD is robust to noise and parameter-free (apart from the choice of the compressor $Z$).
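A hedged sketch of NCD with `zlib` as the compressor $Z$ (any real compressor only approximates the ideal $K$; the strings are toy examples):

```python
import zlib

def Z(data: bytes) -> int:
    return len(zlib.compress(data, 9))   # compressed bit length up to a constant factor

def ncd(x: bytes, y: bytes) -> float:
    zx, zy, zxy = Z(x), Z(y), Z(x + y)
    return (zxy - min(zx, zy)) / max(zx, zy)

a = b"the quick brown fox jumps over the lazy dog" * 10
b = b"the quick brown fox jumps over the lazy cat" * 10
print(ncd(a, b))   # close to 0: the strings are highly similar
```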
Gives the semantic similarity of two words $x$ and $y$:
\[NGD(x,y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x,y)}{\log N - \min\{\log f(x), \log f(y)\}}\]
Use the page hit counts: $f(x)$: number of pages containing $x$; $f(x,y)$: number of pages containing $x$ AND $y$; $N$: number of pages indexed.
Intro image from stephanie2212 on Freepik.
>> You can subscribe to my mailing list here for a monthly update. <<