Connecting Graphs to Machine Learning
The thesis was defended on 10 March 2022.
This thesis is a compilation of all the publications produced during the doctoral work, together with some additional results and visuals.
See the “Papers” section for more details.
Pending: a patent on separating biometric and identity databases for identification.
Method for carrying out a transaction, and corresponding devices and programs. OEB
Method for the secure exchange of data. OEB
One problem when co-clustering large user \(\times\) item or document \(\times\) word matrices is sparsity. Often, two vectors overlap on a single feature and nothing more, which makes it difficult to find relationships between items. To circumvent this issue, we need to look at higher-order relationships: checking the overlap with the neighbors is not enough, we need to find how the features themselves are related to each other. To achieve this, we propose to project the items into a low-dimensional space. In this space we can measure distances, which provides a form of similarity even when two records are not directly related. With this similarity between items in hand, we can then compute the similarity between features. We repeat the same process: project the features into the low-dimensional space, compute the feature similarity, and move back to the items. This process is repeated until no major changes are observed. When natural clusters exist, they become visible in the embedding and can be clustered and analyzed.
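The alternating item/feature loop can be sketched as an iterated co-similarity computation (a simplified numpy sketch in the spirit of the description above, not the exact algorithm of the paper; the matrix `A` and the normalization choice are illustrative):

```python
import numpy as np

def normalize(S):
    # Scale so the diagonal is 1 (cosine-like normalization).
    d = np.sqrt(np.diag(S))
    d[d == 0] = 1.0
    return S / np.outer(d, d)

def co_similarity(A, n_iter=4):
    """Alternately propagate similarity between the rows (items)
    and the columns (features) of a binary matrix A."""
    n, m = A.shape
    Sc = np.eye(m)  # start: each feature is only similar to itself
    for _ in range(n_iter):
        Sr = normalize(A @ Sc @ A.T)  # item similarity via similar features
        Sc = normalize(A.T @ Sr @ A)  # feature similarity via similar items
    return Sr, Sc

# Toy data: items 0 and 2 share no feature, but both overlap with item 1.
A = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)
Sr, Sc = co_similarity(A)
# Sr[0, 2] is positive: the indirect chain item0 - item1 - item2
# gives the two unrelated items a measurable similarity.
```

The key point is that after one round trip, two items with no common feature can still score a positive similarity, because the features they use have become similar to each other.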
When working with documents, we may want to cluster the documents and, at the same time, cluster their topics. When the description is very short, as for an image, it is difficult to get an accurate representation of the document. This paper describes two things: first, a way to enhance document vectors with limited context, and second, a bottom-up hierarchical clustering algorithm. The latter searches for clusters of equivalent size, preventing singleton clusters of difficult-to-classify items from appearing next to one very large cluster.
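The size-balancing idea can be illustrated with a small sketch: always merge the currently smallest cluster into its nearest neighbor by centroid distance, which mechanically discourages singletons (a toy illustration of the balancing principle, not the paper's actual algorithm):

```python
import numpy as np

def balanced_agglomerative(X, k):
    """Bottom-up clustering that always merges the *smallest*
    cluster into its nearest neighbor, favoring clusters of
    comparable size and avoiding leftover singletons."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        # Pick the currently smallest cluster...
        s = min(range(len(clusters)), key=lambda c: len(clusters[c]))
        cs = np.mean(X[clusters[s]], axis=0)
        # ...and find its nearest other cluster by centroid distance.
        best, best_d = None, np.inf
        for j, c in enumerate(clusters):
            if j == s:
                continue
            d = np.linalg.norm(cs - np.mean(X[c], axis=0))
            if d < best_d:
                best, best_d = j, d
        clusters[best].extend(clusters[s])
        clusters.pop(s)
    return clusters

# Two well-separated groups of three points each.
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)
parts = balanced_agglomerative(X, k=2)
```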
\(t\)-SNE is a widely used embedding algorithm that produces appealing visualizations with salient clusters. Its main drawback is that it is non-parametric: two runs lead to two different results. This property makes it unsuitable for visualizing items that evolve over time. This paper describes an approach to visualize “slices” of data one after another, using the previous embedding as a reference for the next one.
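With scikit-learn's `TSNE`, the slice-by-slice idea can be sketched by passing the previous slice's embedding as the `init` of the next run (an illustrative sketch assuming slices of equal size and matching sample order; the paper's actual method may differ):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Three "time slices" of the same 50 items, drifting over time.
slices = [rng.normal(size=(50, 5)) + t for t in range(3)]

prev = "pca"  # standard initialization for the first slice
embeddings = []
for X in slices:
    Y = TSNE(n_components=2, init=prev, perplexity=15,
             random_state=0).fit_transform(X)
    embeddings.append(Y)
    prev = Y  # warm-start the next slice with this embedding
```

Because each run starts from the previous layout, points that barely moved in the data tend to stay in place on screen, so successive frames can be compared visually.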
In this paper, I develop an approach to comparing documents using their bibliographies. Traditional methods use the first-level bibliography and/or rely on metadata and content. Here, we describe an approach that focuses on the bibliography alone and exploits indirect links.
Usually, ensemble clustering “collects” the results of different clustering instances to build a “strong” consensus clustering. This paper proposes the use of EAC (Evidence Accumulation Clustering) for distributed multi-class classification. In a decentralized system, one would like to exchange results with one's peers and get some information in return. However, because producing these results requires work and knowledge (data and computing power to run the algorithm), people may not want to share them directly for free. One way to “hide” the results is to exploit EAC: instead of sending the raw clustering output (i.e., the predicted categories), one sends the binary co-association matrix. Unless there are only two classes, the initial labels cannot be recovered from it without some prior information. With several co-association matrices from its peers, a participant can improve its own clustering results. The algorithm is highly noise-tolerant: even if peers partially corrupt their data, it can still improve the accuracy.
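A minimal sketch of the co-association exchange (illustrative only; the label vectors are invented): each peer converts its labels into a binary co-association matrix, which says *which samples go together* without revealing *which class* they belong to, and the matrices are then accumulated.

```python
import numpy as np

def co_association(labels):
    """Binary co-association matrix: C[i, j] = 1 iff samples i and j
    fall in the same cluster. Sharing C instead of `labels`
    hides the actual class identities."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

# Two peers labeled the same 6 samples; the label *names* differ
# but the matrices only encode the grouping structure.
C1 = co_association([0, 0, 1, 1, 2, 2])
C2 = co_association([5, 5, 5, 1, 1, 1])

# Accumulate evidence: average the matrices; the result can be
# thresholded or fed to any similarity-based clustering algorithm.
C = (C1 + C2) / 2
```

Note that `co_association([0, 0, 1])` and `co_association([1, 1, 0])` produce the same matrix, which is exactly why the raw labels cannot be read back from it.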
The secret-sharing technique developed by Shamir enables splitting a secret into several parts, such that at least \(k\) parts are needed to recover the secret. This paper is about optimal secret distribution and recovery strategies.
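For reference, Shamir's scheme can be sketched in a few lines: the secret is the constant term of a random degree-\(k-1\) polynomial over a prime field, each share is one evaluation of that polynomial, and any \(k\) shares recover the secret by Lagrange interpolation at zero (this is the standard construction, not the distribution strategies studied in the paper):

```python
import random

P = 2**61 - 1  # a large prime defining the field GF(P)

def split(secret, n, k):
    """Split `secret` into n shares; any k of them recover it."""
    coeffs = [secret] + [random.randrange(P) for _ in range(k - 1)]
    # The share for participant x is the polynomial evaluated at x.
    return [(x, sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P)
            for x in range(1, n + 1)]

def recover(shares):
    """Lagrange interpolation at 0 over GF(P)."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        # pow(den, P - 2, P) is the modular inverse of den.
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

shares = split(12345, n=5, k=3)
assert recover(shares[:3]) == 12345  # any 3 of the 5 shares suffice
assert recover(shares[1:4]) == 12345
```

With fewer than \(k\) shares, every candidate secret remains equally likely, which is what makes the distribution and recovery strategies worth optimizing.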