Bioinformatics Interest Group Seminar: Improving clinical omics applications using k-mer approaches

The past decade has seen a dramatic increase in the amount of sequencing data produced to dissect human disease and biology. However, the number of clinically actionable discoveries derived from these data is remarkably low. A pessimistic view would count only the 59 genes designated as medically actionable by the American College of Medical Genetics and Genomics. In retrospect, this result is unsurprising: many genetic components of human disease arise from the interaction of a large number of small variations, yet most studies are small and thus lack the statistical power to infer these interactions. The difficulty of assembling clinical studies with large cohorts is further compounded by major computational hurdles in exploiting them. These data are large, sensitive and heterogeneous. As a consequence, they cannot circulate freely between research labs, they are difficult to analyze with currently available software, and data from different studies are difficult to integrate. In this seminar, I will show how we use k-mer based approaches to address multiple issues pertaining to the use and interpretation of health-related sequencing data. Specifically, I will show how we can tune simple machine learning techniques to better explore sequencing data without requiring a reference genome, apply dimension reduction techniques to allow easy integration of data from multiple sources, implement indexing strategies to store more compact versions of sequencing data and, finally, build software that enables research groups to easily exchange the specific parts of sequencing data that are relevant to disease. The exchanged data are highly compressed and, importantly, preserve patient anonymity.