Novel large language models for biological sequences and their interactions
The identification of genomic alterations driving cancer progression has historically relied on finding recurrent mutations across patient cohorts. While some driver mutations are well characterized because they recur frequently, the vast majority of mutations are rare, shared by only a few patients. This rarity limits the scope for functional studies and for identifying potential therapeutic targets. Recent advances in large language models (LLMs) trained on vast collections of biological sequences offer a promising solution by enabling a deeper understanding of the structural and functional consequences of genomic alterations in DNA, RNA, and protein sequences. By leveraging LLMs, we can model the effects of even rare mutations in cancer in far greater detail than recurrence-based approaches allow.
This project aims to train and utilize LLMs to explore the impact of genomic mutations, specifically focusing on protein sequences, RNA sequences, and their interactions. These models will provide novel insights into the consequences of genomic alterations and pave the way for improved understanding and therapeutic targeting in cancer biology.
Objective 1: Building LLMs for Protein Sequences and Protein-Protein Interactions
Mutations in protein sequences can have significant functional consequences, particularly in the context of protein-protein interactions. Most current protein LLMs are trained on single protein sequences, which limits their ability to model these interactions. In this objective, we aim to build a novel protein language model specifically designed to learn and predict protein-protein interactions. This model will be applied to both mouse and patient data, enabling the study of mutations in the context of protein interaction networks.
Objective 2: Building LLMs for RNA Sequences and Protein-RNA Interactions
Mutations in RNA sequences, including those in non-coding regions, can have profound functional effects. In collaboration with RNA-focused research groups, we will develop LLMs to model the interactions between RNA and proteins, as well as the functional consequences of mutations. With an existing RNA model and a wealth of data for fine-tuning and validation, we will explore how these interactions influence gene regulation and cancer progression.
Objective 3: Fine-Tuning LLMs for Mitochondrial Genomes
Mitochondrial mutations are increasingly recognized for their role in cancer and other diseases. Glasgow's ongoing initiative in mitochondrial genomics provides an ideal opportunity to fine-tune LLMs for mitochondrial DNA. These models will enhance our ability to assess the functional consequences of mitochondrial mutations and could potentially inform the design of genome-editing tools targeting the mitochondrial genome.
For questions regarding the application process, PhD programme/studentships at the CRUK Scotland Institute or any other queries, please contact phdstudentships@beatson.gla.ac.uk.
Applications are open to all individuals irrespective of nationality or country of residence.