Thesis

The thesis was defend the 10th of March 2022.

This thesis is the compilation of all publications that have been published during the thesis, plus some extra results and visuals.

See the “Papers” section for more details.

Patents

Pending: Patent about biometry and ID databases separation for identification.

2023

Procédé de réalisation d’une transaction, dispositifs et programmes correspondants. OEB

2022

Procédé d’échange sécurisé de données. OEB

Papers

Link to DBLP

2023

The theoretical limits of biometry ArXiv

Biometry has proved its capability in terms of recognition accuracy. Now, it is widely used for automated border control with the biometric passport, to unlock a smartphone or a computer with a fingerprint or a face recognition algorithm. While identity verification is widely democratized, pure identification with no additional clues is still a work in progress. The identification difficulty depends on the population size, as the larger the group is, the larger the confusion risk. For collision prevention, biometric traits must be sufficiently distinguishable to scale to considerable groups, and algorithms should be able to capture their differences accurately. Most biometric works are purely experimental, and it is impossible to extrapolate the results to a smaller or a larger group. In this work, we propose a theoretical analysis of the distinguishability problem, which governs the error rates of biometric systems. We demonstrate simple relationships between the population size and the number of independent bits necessary to prevent collision in the presence of noise. This work provides the lowest lower bound for memory requirements. The results are very encouraging, as the biometry of the whole Earth population can fit in a regular disk, leaving some space for noise and redundancy.

Linear time Evidence Accumulation Clustering with KMeans ArXiv

2021

Co-embedding: Discovering Communities on Bipartite Graphs Through Projection Springer / ArXiv

One problem when performing co-clustering of large matrices of user \(\times\) item or document \(\times\) word is the sparsity. Often, two vectors overlap thanks to one feature, but nothing more. This makes difficult to find relationships between items. To circumvent this issue, we need to look at high-order relationships, i.e., checking the overlap with the neighbors is not enough, we need to find how the features are related to each other. To achieve this target, we propose to project items in a low dimensional space. In this space, we can measure distance, enabling to compute a form of similarity, even if two records are not directly related. With this similarity between items in memory, we can now compute the similarity between features. We repeat the same process: we project features in the low dimensional space, compute features simlarity, and move on to items. This process is repeated until no major changes are observed. When natural clusters exist, they become visible in the embedding and can be clustered and analyzed.

Tagged Documents Co-Clustering ArXiv

When working with documents, we may want to cluster documents and at the same time to cluster their topics. When the description is very short, like in an image, it is difficult to get an accurate representation of this document. In this paper is described two things: first, a way to enhance document vectors with limited context, and second a bottom-up hierarchical algorithm. The latter has the property of searching for clusters of equivalent size, preventing from getting singleton clusters with difficult-to-classify items and a large big cluster.

Index t-SNE: Tracking Dynamics of High-Dimensional Datasets with Coherent Embeddings. Waset / ArXiv

\(t\)-SNE is a widely used embedding algorithm, enabling to get nice visualization with salient clusters. The main drawback of this algorithm is that it is non-parametric: two trials lead to two different results. This property make it unsuitable to visualize items evolving over time. This paper describe an approach to visualize “slice” of data, one after another, using the previous embedding as a reference for the latter.

Generating Local Maps of Science using Deep Bibliographic Coupling ArXiv

In this paper, I develop an approach to compare documents using their bibliography. Traditional methods use first-level bibliography and/or use metadata and content. Here, we describe an approach focusing on bibliography and exploiting indirect links.

Noise-Resilient Ensemble Learning using Evidence Accumulation Clustering ArXiv

Usually, ensemble clustering “collects” the clustering results of different clustering instance to get a “strong” consensus clustering. This paper propose the use of EAC to distributed multi-class classification. In a decentralized system, one would like exchange its results with its pairs and get some info in returns. However, because it requires some works and knowledge (data and working power to run the algorithm), people may not want to share their results directly for free. One way to “hide” the results is to exploit EAC: instead of sending raw clustering output (i.e., the predicted category), one send the binary co-association matrix. Unless there are two classes, someone cannot recover the initial labels unless it has some initial information. With several CA matrices from its pairs, someone can improve its clustering results. The algorithm is highly noise tolerant, i.e., if the pairs corrupt partially their data, the algorithm can still improve the accuracy.

How to compartiment secrets ePrint

The secret sharing technic develop by Shamir enables splitting a secret into several parts, where we need at least \(k\) parts to recover the secret. This paper is about optimal secret distribution and recovery strategies.