Innovative algorithms for RNA research
Long-read sequencing enables the direct measurement of RNA molecules. However, this data must be processed to reconstruct the RNA sequences, their abundances, their chemical modifications and their potential variation across conditions.
In this project, we aim to explore novel computational algorithms that make it possible to study variation in RNA sequence, abundance, and chemical modifications using long-read sequencing data. We plan to apply these tools to uncover new molecular features in cancer and inherited disorders.
Computational challenges include graph algorithms and hashing methods to match, cluster, and interpret biological sequences that present frequent errors, and deep learning for complex signal interpretation to predict chemical modifications in RNA molecules.
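As a rough illustration of how hashing can compare error-prone sequences, the following minimal sketch estimates the similarity of two reads with MinHash over k-mers. MinHash is one common hashing technique and is chosen here only as an example; the function names, the k-mer length and the number of hash functions are all hypothetical and are not taken from the project itself.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Deterministic 64-bit hash of a k-mer; the seed selects the hash function."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(seq, k=5, num_hashes=64):
    """For each seeded hash function, keep the smallest hash over all k-mers."""
    kmer_set = kmers(seq, k)
    if not kmer_set:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in kmer_set) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of k-mer set Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    # Toy sequences invented for this example.
    reference = "ACGUACGGAUCCGAUUACGCUAGGCAUCGAUCGGAUAUCCGGAAUUCGCGAUAGCUAGCA"
    noisy_read = reference[:30] + "A" + reference[31:]  # one substitution error
    sig_ref = minhash_signature(reference)
    sig_noisy = minhash_signature(noisy_read)
    print(estimated_jaccard(sig_ref, sig_noisy))
```

Because a single sequencing error only changes the k-mers that overlap it, most signature slots still agree, so noisy copies of the same molecule remain close to each other while unrelated sequences do not.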
|Supervisor: Professor Eduardo Eyras|
Mining viral sequences in diverse metagenome assemblies using pattern recognition algorithms
Bacteria are under constant threat of being killed by bacteriophages. In this phage-bacteria arms race, bacteria have developed sophisticated defence systems against phages. Some of these systems, such as the Nobel Prize-winning CRISPR technology, have been harnessed as gene editing tools for gene therapy, molecular diagnostics and synthetic biology. To counter bacterial defences, phages have in turn acquired various anti-CRISPR systems and an anti-bacterial arsenal.
This project aims to detect rare fragments of phage genomes in a large dataset of metagenome assemblies using pattern recognition and potentially deep learning techniques. This will enable us to distinguish viral from bacterial sequences within microbial communities.
The main expected outcome of this project is the discovery of new viruses and potentially novel anti-bacterial defence systems.
This research will use pattern recognition algorithms and potentially deep learning to predict short viral repetitive sequences within diverse sets of microbial sequences. Programming skills in Python and C are required.
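To give a flavour of the simplest kind of pattern recognition involved, the sketch below indexes every k-mer in a contig and reports those that recur several times, together with the gaps between copies. It is a toy exact-match approach under assumed parameters. The function names, the repeat and the spacer sequences are invented for illustration, and real data would require tolerance to mismatches.

```python
from collections import defaultdict


def find_repeated_kmers(seq, k=12, min_copies=3):
    """Map each k-mer seen at least min_copies times to its start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) >= min_copies}


def gap_lengths(positions, k):
    """Lengths of the stretches separating consecutive copies of a k-mer."""
    return [b - a - k for a, b in zip(positions, positions[1:])]


if __name__ == "__main__":
    # Hypothetical contig: three copies of one repeat separated by unique spacers.
    repeat = "GTTTTAGAGCTA"
    spacers = ["ACCGTTGACA", "TGGACCATGC", "CATCGGTAAC"]
    contig = "".join(repeat + spacer for spacer in spacers)
    for kmer, pos in find_repeated_kmers(contig).items():
        print(kmer, pos, gap_lengths(pos, len(kmer)))
```

Evenly sized gaps between copies of the same short sequence are the kind of regularity that a more elaborate detector, or a learned model, could then score.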
|Supervisor: Dr Gaétan Burgio|
Computational featurisation of protein structure data with human genetic variation
Last year, the AlphaFold2 algorithm developed by DeepMind revolutionised the field of protein structural biology. Application of AlphaFold2 has provided a database of accurate, high-resolution structures for nearly all human proteins. With this dataset and an ecosystem of structural bioinformatics tools, we can computationally derive structural metrics of how human disease mutations affect protein function, interactions and stability. We will compute these at scale and then look for patterns in these data, especially in how they relate to disease incidence. This information will be used to train machine learning models that can identify new and undetected disease mutations from personal genome sequencing. With collaborators in the Department of Immunology and Infectious Disease, we have the opportunity to test these predictions in experimental systems that target mutations which cause human autoimmunity.
Challenges will include computing the outcomes of mutations on protein structures with algorithms that range from those that run in an HPC environment to those that will need to be screen-scraped from a multitude of web services. As the resultant data from these sources grow, there will be opportunities to identify patterns and structure in these data using ML and DL methodology.
|Supervisor: Dr Dan Andrews||Co-supervisor: Dr Vicki Athanasopoulos|
Reading cell history by the identification of non-canonical RNA modifications using nanopore signals
When RNA suffers chemical or radiation damage, cells do not repair it as they do DNA; instead, they rely on replenishing RNA with newly transcribed molecules. Thus, an RNA snapshot at any given moment carries useful information about recent cellular (and organismal) exposures, which could be applied to diagnosing the amount of cell stress and damage received, to forensics and to treatment monitoring. Furthermore, RNA can be conditionally modified depending on its environment, reflecting its intracellular localisation, condensation state, involvement in splicing and translation, and interactions with proteins.
To exploit this uniquely rich information, we propose to train neural networks capable of dissecting the canonical and non-canonical RNA modifications using long-read direct RNA sequencing (DRS) data. The immediate focus of the project will be on RNA UV damage and modifications resulting from protein interactions, with the potential to expand to any detectable signal type later on.
In DRS, each RNA molecule is measured by its capacity, nucleotide-by-nucleotide, to affect ionic current through a protein nanopore. The current signal is recorded and reflects the primary structure of the RNA molecule. DRS signals are complicated by current noise and timing irregularities, requiring the development of complex signal processing models akin to those used for speech and image recognition. This alone represents a challenging and exciting objective, in which any contribution to the underlying signal analysis principles or to their specific implementation can lead to the next breakthrough. Currently, DRS is applied to a narrow range of pre-defined modifications, each requiring specialised and extensive training of the pattern recognition models on a set of ‘pure’ data with and without said modification.
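To make the signal-processing problem concrete, here is a deliberately simplified sketch of two early steps that such analyses often involve: robust normalisation of a raw current trace, and splitting the trace into segments of roughly constant current. It is a toy under stated assumptions, not the method used in this project. The function names, the threshold and the example trace are invented, and practical tools rely on far more sophisticated models.

```python
from statistics import median


def mad_normalise(signal):
    """Centre on the median and scale by the median absolute deviation (MAD)."""
    med = median(signal)
    mad = median(abs(x - med) for x in signal)
    if mad == 0:
        raise ValueError("signal has zero spread")
    # 1.4826 makes the MAD comparable to a standard deviation for Gaussian noise.
    return [(x - med) / (1.4826 * mad) for x in signal]


def segment_events(signal, threshold, min_len=3):
    """Split a trace into (start, end, mean) segments.

    A new segment starts when a sample deviates from the running mean of the
    current segment by more than `threshold`, provided the current segment
    already holds at least `min_len` samples. `end` is exclusive.
    """
    if not signal:
        return []
    events = []
    start, total = 0, signal[0]
    for i in range(1, len(signal)):
        current_mean = total / (i - start)
        if abs(signal[i] - current_mean) > threshold and i - start >= min_len:
            events.append((start, i, current_mean))
            start, total = i, signal[i]
        else:
            total += signal[i]
    events.append((start, len(signal), total / (len(signal) - start)))
    return events


if __name__ == "__main__":
    # Invented trace with three current levels and small fluctuations.
    trace = [100.0, 101.0, 99.0, 100.0, 120.0, 121.0, 119.0, 120.0, 90.0, 91.0, 89.0]
    for start, end, level in segment_events(trace, threshold=10.0):
        print(start, end, level)
```

Note that the segments differ in length, which is a small-scale version of the timing irregularity described above: the number of samples per nucleotide is not fixed, so downstream models cannot assume a regular grid.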
In this project, we propose a novel multi-tier approach. In the first tier, we will use a data-driven, unbiased investigation, based on heuristics and machine-learning-based analysis and classification, to identify signals and their locations across the transcriptome (sites) that are characteristic of UV damage of RNA and of RNA-protein interactions. In the second tier, which could be a follow-up project, we will identify the specific modification types that underlie the signal classification obtained in the first tier, to resolve all functionally relevant, DRS-detectable modification types associated with UV exposure or protein interactions. The project can be expanded to include other conditions affecting the modification status of RNA, including targeted chemical and biochemical modification that can be used to interrogate the intracellular history record of RNA species.
|Supervisor: Dr Nikolay Shirokikh||Co-supervisor: Professor Eduardo Eyras|
RNA-binding proteins rewire transcriptomes during immune cell differentiation
Alternative polyadenylation (APA) is used by >70% of human genes and has emerged as a major mechanism for diversifying transcriptomes and regulating gene expression. We are investigating APA in individual cell types of the immune system, focusing specifically on CD8+ T cells that respond to virus infections, which comprise a complex mix of cell subtypes at various stages of differentiation. From the pattern of APA in their transcriptomes, we predicted a set of trans-acting RNA-binding proteins (RBPs) that are important in T cell differentiation. We combine single-cell assays, CRISPR gene knockouts in transgenic CD8+ T cells, computational analysis and machine learning models to understand the functional and phenotypic implications of new roles of trans-acting RBPs in regulating APA in the immune system.
Computational aspects include deep neural network models in genomics and single-cell RNA-seq analysis.
|Supervisor: A/Professor Jean (Jiayu) Wen|
Framework for spatial omics analysis
‘Spatial Quantification of Molecular Data in Python’ (Squidpy) is a newly developed Python-based framework for the analysis of spatially resolved omics data. Squidpy provides efficient infrastructure to store, manipulate and interactively visualize spatial omics data, such as transcriptome or multivariate protein measurements. Squidpy aims to bring the diversity of spatial data into a common data representation and to provide a common set of analysis and interactive visualization tools. In this internship program, we aim to implement Squidpy at the JCSMR to enable scalable analyses of both spatial neighborhood graphs and images, along with an interactive visualization module.
There is currently no integrated spatial analysis framework developed at the JCSMR.
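For readers unfamiliar with the term, a spatial neighborhood graph simply links each measured spot or cell to the spots closest to it in the tissue. The standard-library sketch below builds a k-nearest-neighbour version of such a graph from 2D coordinates. It is only a conceptual illustration with hypothetical names and coordinates. Squidpy provides its own graph construction (`squidpy.gr.spatial_neighbors`), which this sketch does not call or reproduce.

```python
import math


def knn_graph(coords, k=2):
    """Link each point to its k nearest neighbours by Euclidean distance.

    Returns a dict mapping a point index to a list of neighbour indices,
    nearest first. Ties are broken by the lower index.
    """
    graph = {}
    for i, p in enumerate(coords):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(coords) if j != i
        )
        graph[i] = [j for _, j in dists[:k]]
    return graph


if __name__ == "__main__":
    # Invented spot coordinates: a cluster of three and a distant pair.
    spots = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
    print(knn_graph(spots, k=1))
```

Once such a graph exists, questions like "which cell types tend to sit next to each other" reduce to counting labels along its edges.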
|Supervisor: Dr Woei Ming (Steve) Lee|