Deciphering the code of viral-host adaptation through maximum entropy models, in collaboration with Benjamin D Greenbaum, Remi Monasson, and Simona Cocco, available in bioRxiv.
In this work, we investigate the genetic changes viruses undergo when they jump from one host to another, a critical topic in the study of emerging pathogens. We found that viruses from different families have distinct strategies for tuning their nucleotide usage when they infect the same host, and that this insight can be leveraged to predict which host a viral sequence came from, even when the sequence is very different from any known sequence. We also demonstrated how this information can be used to design better RNA vaccines and other nucleic acid-based therapies, and to track the evolution of viruses after a host jump.
Designing molecular RNA switches with Restricted Boltzmann machines, in collaboration with Jorge Fernandez-de-Cossio-Diaz, Pierre Hardouin, Francois-Xavier Lyonnet du Moutier, Bertrand Marchand, Yann Ponty, Bruno Sargueil, Remi Monasson, and Simona Cocco, available in bioRxiv.
Machine learning tools can be used to discover meaningful patterns in biological sequences, such as RNA molecules, that are related to the molecules' functions. In this work, we show how to use a specific type of machine learning tool, called Restricted Boltzmann Machines, to design RNA molecules that can switch between two different structures, and we show that the designed molecules perform the desired function in vitro.
Repeats Mimic Pathogen-Associated Patterns Across a Vast Evolutionary Landscape, in collaboration with Petr Šulc, Alexander Solovyov, Sajid A Marhon, Siyu Sun, Håvard T Lindholm, Raymond Chen, Amir Hosseini, Hua Jiang, Bao-Han Ly, Parinaz Mehdipour, Omar Abdel-Wahab, Nicolas Vabret, John LaCava, Daniel D De Carvalho, Rémi Monasson, Simona Cocco, and Benjamin D Greenbaum, available in bioRxiv.
In this work, we analyze the human genome using computational tools inspired by statistical physics to detect regions (in most cases located in the so-called "non-coding genome") that have viral-like patterns and that could play a role in interacting with the innate immune system. We also discuss their evolution within and outside the human genome, pointing out cases where possibly immunogenic patterns seem to be conserved during genome evolution or across organisms.
A transfer-learning approach to predict antigen immunogenicity and T-cell receptor specificity, in collaboration with Barbara Bravi, Andrea Di Gioacchino, Jorge Fernandez-de-Cossio-Diaz, Aleksandra M Walczak, Thierry Mora, Simona Cocco, and Rémi Monasson, published in eLife in 2023 (open access).
In this work we show how to use an implementation of transfer learning using Restricted Boltzmann Machines to analyze the changes between two datasets, and we apply this idea to predict antigen immunogenicity and T-cell receptor specificity, two well-known open problems in immunology.
Generative and interpretable machine learning for aptamer design and analysis of in vitro sequence selection, in collaboration with Jonah Procyk, Marco Molari, John S Schreck, Yu Zhou, Yan Liu, Rémi Monasson, Simona Cocco, and Petr Šulc, published in PLoS Computational Biology in 2022 (open access).
Aptamers are small DNA or RNA molecules selected in the lab to interact specifically with a protein of interest. For instance, thrombin-binding aptamers have been developed and tested as anticoagulant agents to be used during surgery. In this work, we show how machine learning helps us better understand the results of the experiments performed to obtain new aptamers, and how this added knowledge can be used to design new aptamers with the desired properties.
sgDI-tector: defective interfering viral genome bioinformatics for detection of coronavirus subgenomic RNAs, in collaboration with Rachel Legendre, Yannis Rahou, Valérie Najburg, Pierre Charneau, Benjamin D Greenbaum, Frédéric Tangy, Sylvie van der Werf, Simona Cocco, and Anastassia V Komarova, published in RNA in 2022 (open access).
SARS-CoV-2, like other coronaviruses, has a complex mechanism to translate its genetic code into proteins, which results in many sub-populations of viral genome fragments (subgenomic RNAs) that are replicated and translated independently of each other. Is it possible to detect these fragments and to quantify their expression level from the sequencing data coming out of infected cells? In this work, we took a pre-existing bioinformatic tool (DI-tector) and modified it to accomplish this task. We show that the resulting algorithm, sgDI-tector, works as well as other state-of-the-art approaches (and sometimes better), and we made the software publicly available to help other research groups detect subgenomic RNAs.
The heterogeneous landscape and early evolution of pathogen-associated CpG dinucleotides in SARS-CoV-2, in collaboration with Petr Šulc, Anastassia V Komarova, Benjamin D Greenbaum, Rémi Monasson, and Simona Cocco, published in Molecular Biology and Evolution in 2021 (open access).
Viruses and hosts evolve together, modifying their genetic code and adapting it to the changing environment. This process takes tens to hundreds of years, and it is particularly apparent when a virus jumps from one host to another. A well-known example is the usage of cytosines followed by guanines in the genome: humans have a surprisingly low number of these short motifs (called CpG motifs) in their genome, and can detect as foreign any genome with too many CpG motifs. For this reason, if a virus jumps into human hosts, it must have a low enough number of CpG motifs, and it will adapt by losing more and more such motifs. In this work, we discuss this virus-host interaction for SARS-CoV-2, and exploit the incredibly large public database of viral genomes to show that, in genomic regions with a high density of CpG motifs, the adaptation of the virus to the human host began immediately.
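To make the notion of "too many CpG motifs" concrete, here is a minimal Python sketch of a common, simple statistic: the observed-to-expected CpG ratio, which compares the number of CG dinucleotides with what the C and G frequencies alone would predict. This is an illustrative assumption and not necessarily the measure used in the paper; the function name `cpg_observed_expected` is hypothetical.

```python
def cpg_observed_expected(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G).

    Illustrative statistic only (an assumption, not the paper's method).
    Values well below 1 indicate CpG depletion.
    """
    # Treat RNA and DNA alike by mapping U to T; ignore case.
    seq = seq.upper().replace("U", "T")
    n = len(seq)
    n_c = seq.count("C")
    n_g = seq.count("G")
    if n < 2 or n_c == 0 or n_g == 0:
        return 0.0
    # "CG" cannot overlap with itself, so str.count finds every occurrence.
    n_cpg = seq.count("CG")
    return n_cpg * n / (n_c * n_g)


if __name__ == "__main__":
    # Toy sequences, not real genomic data.
    for toy in ("ACGTACGTACGT", "ACCTGGTCAGTC"):
        print(toy, round(cpg_observed_expected(toy), 3))
```

Applied to sliding windows along a genome, a ratio like this gives a rough picture of which regions are CpG-rich and which are depleted.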
Perils of embedding for sampling problems, in collaboration with Jeffrey Marshall and Eleanor G Rieffel, published in Physical Review Research in 2020 (open access).
Quantum annealers are analog quantum computers (that is, computers directly exploiting quantum phenomena to perform computations) that have been proposed as a tool to solve hard combinatorial optimization problems more efficiently. More recently, they have also been proposed as tools to sample from probability distributions, a problem known to be very hard in its most general setting. The quantum annealers available today, however, need a specific preprocessing step called minor embedding to deal with arbitrary problems, and this step can introduce errors, such as ambiguities, in the samples obtained. While many works have focused on how to mitigate these errors when the objective is to solve a combinatorial optimization problem, here we focus on the case of sampling. We show that the errors introduced through embedding are expected to be very relevant in the case of sampling, especially for large, interesting problems. Moreover, we suggest an empirical procedure involving post-processing on classical (as opposed to quantum) hardware to mitigate these issues.
Large deviation of the free energy in the p-spin glass spherical model, in collaboration with Mauro Pastore and Pietro Rotondo, published in Physical Review Research in 2019 (open access).
In the statistical physics of disordered systems, the focus is typically on estimating average values of relevant observables of the system for relevant values of the parameters. Here we used the technical tools developed for spin glasses, together with the theory of large deviations, to describe fluctuations far away from the average values, in the pedagogical case of the spherical p-spin model. In an attempt to better understand the approximations and assumptions usually made in spin glass computations, we also discuss an alternative view of some technical problems, suggesting that a replica symmetry breaking scheme is at play.
Selberg integrals in 1D random Euclidean optimization problems, in collaboration with Sergio Caracciolo, Enrico M Malatesta, and Luca G Molinari, published in Journal of Statistical Mechanics: Theory and Experiment in 2019 (preprint available in arXiv).
Selberg integrals are a famous family of integrals introduced by Atle Selberg. In this quite technical paper, we show how to use an extension of these integrals to compute the cost of some Euclidean combinatorial optimization problems, such as the assignment problem, in one spatial dimension, for any number of points.
Average optimal cost for the Euclidean TSP in one dimension, in collaboration with Sergio Caracciolo, Enrico M Malatesta, and Carlo Vanoni, published in Journal of Physics A: Mathematical and Theoretical in 2019 (preprint available in arXiv).
In this work, we consider the classic version of the Traveling Salesman Problem (different from the bipartite version that we considered in previous papers), and we manage to describe the geometry of the solution when the points are chosen uniformly at random on a line. Although this version of the problem is in general more difficult to handle with our methods, we were still able to obtain the average cost of the solution in the limit of a large number of points.
Exact value for the average optimal cost of the bipartite traveling salesman and two-factor problems in two dimensions, in collaboration with Riccardo Capelli, Sergio Caracciolo, and Enrico M Malatesta, published in Physical Review E in 2018 (preprint available in arXiv).
Solving the Traveling Salesman Problem in two dimensions means being able to tell, for any possible position of cities on a (flat) map, the shortest path visiting all of them and coming back to the starting point. This is hard, and indeed solving this problem efficiently would allow anybody to get a reward of a million dollars from the Clay Mathematics Institute (more info here). In this paper, we consider a variant of this problem (which is at least as hard as the problem itself), and show that, for a large number of cities, we can compute the length of the optimal tour by solving a much simpler problem (and no, this is not enough to win the prize!).
Plastic number and possible optimal solutions for an Euclidean 2-matching in one dimension, in collaboration with Sergio Caracciolo and Enrico M Malatesta, published in Journal of Statistical Mechanics: Theory and Experiment in 2018 (preprint available in arXiv).
In this work, we consider a combinatorial optimization problem known as the 2-matching. Given a set of points, it consists of joining them in loops (with more than 2 points each) so that the total length of the loops (or, better, their cost, which is a function of the length) is the smallest possible. This problem is in some sense intermediate between the matching problem (where we have segments connecting points instead of loops) and the traveling salesman problem (where there has to be a single loop). We showed that the solution, when the points are on a line, is composed of loops involving a small number of points, and that, depending on the specific positions of the points, it can be any of a large number of candidate solutions (a number that grows as the so-called plastic number raised to the number of points). Even though we could not characterize the solution exactly for general point positions, we were able to compute the average cost of the solution when the number of points goes to infinity.
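For readers unfamiliar with the plastic number: it is the unique real root of x^3 = x + 1, approximately 1.3247. The short Python sketch below computes it with Newton's method and illustrates how quickly a quantity growing like the plastic number raised to the number of points increases. It only illustrates the constant and the exponential growth; it does not reproduce the exact counting of candidate solutions from the paper, and the function name `plastic_number` is hypothetical.

```python
def plastic_number(tol: float = 1e-14, max_iter: int = 100) -> float:
    """Real root of x**3 - x - 1 = 0, found with Newton's method."""
    x = 1.5  # starting guess close to the root
    for _ in range(max_iter):
        step = (x**3 - x - 1.0) / (3.0 * x**2 - 1.0)
        x -= step
        if abs(step) < tol:
            break
    return x


if __name__ == "__main__":
    p = plastic_number()
    print(f"plastic number ~ {p:.10f}")
    # Exponential growth p**N (prefactors from the paper are not included).
    for n_points in (10, 50, 100):
        print(n_points, f"{p ** n_points:.3e}")
```

Even with a base this close to 1, the growth is exponential, which is why enumerating all candidate solutions quickly becomes impractical as the number of points increases.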
Solution for a bipartite Euclidean traveling-salesman problem in one dimension, in collaboration with Sergio Caracciolo, Marco Gherardi, and Enrico M Malatesta, published in Physical Review E in 2018 (preprint available in arXiv).
Euclidean combinatorial optimization problems are everywhere: for instance, they are routinely solved by Google Maps to find shortest routes to go to work, or to a party. Here we considered the problem of finding a cycle passing through a set of points on a line so that a given cost, which depends on the points' distances, is minimized. To do so, we proved some geometrical properties of the optimal solution for arbitrary points' positions, and managed to use these properties to compute the average cost of the solution when points are randomly placed on a line.
Unified Fock space representation of fractional quantum Hall states, in collaboration with Luca G Molinari, Vittorio Erba, and Pietro Rotondo, published in Physical Review B in 2017 (preprint available in arXiv).
The fractional quantum Hall effect (FQHE) is a physical phenomenon described in 1982, which is still a topic of active research today. Here we contribute to this line of research, showing how to use a single framework to describe two distinct cases of the simplest FQHE theory (the bosonic and fermionic Laughlin wave functions), and elucidating where (mathematically speaking) the differences between these two cases come from.
I co-organized a conference in Paris about Innate and Adaptive Recognition of Antigens and NeoAntigens, more info at the conference website.