NextGen Sequencing

Next Generation Sequencing (NextGen or NGS) covers all the high-throughput nucleotide sequencing technologies and applications that are replacing traditional Sanger sequencing. Applications of NGS include genome/exome sequencing or resequencing, digital transcriptomics, metagenomics and various RNA-Seq applications. Sequencing capacity is growing at a rate that far outstrips increases in computational capacity, so NGS has shifted the bottleneck for many of these applications from data generation to data management and processing. This topic is for all CMG members with interests in facilitating, generating or exploiting NGS data.

Local Wiki and Forums (restricted access).

For queries about this topic, contact Richard Edwards.

Projects

A composite likelihood approach to genome-wide data analyses.

Andrew Collins (Investigator), Jane Gibson, Ioannis Politopoulos

We describe a composite likelihood-based analysis of a genome-wide breast cancer case-control sample. Genome regions of fixed size are determined on a linkage disequilibrium map, so that they delimit comparable levels of linkage disequilibrium. The findings call for further validation in more samples from other cohorts, as well as the exploitation of novel, computationally intensive methods such as next-generation sequencing.

Application of RNA-Seq for gene fusion identification in blood cancers

William Tapper (Investigator), Marcin Knut

Gene fusions are often the cause of blood cancers. Accurately identifying them therefore provides information on the underlying cause of a cancer and helps ensure an appropriate choice of treatment. However, owing to shortcomings of the methods currently used for gene fusion identification, some fusions go undetected. We are employing RNA-Seq, a cutting-edge method for sequencing RNA, the messenger of genetic information, to investigate gene fusions.

Centre for Doctoral Training in Next Generation Computational Modelling

Hans Fangohr, Ian Hawke, Peter Horak (Investigators), Susanne Ufermann Fangohr, Thorsten Wittemeier, Kieran Selvon, Alvaro Perez-Diaz, David Lusher, Ashley Setter, Emanuele Zappia, Hossam Ragheb, Ryan Pepper, Stephen Gow, Jan Kamenik, Paul Chambers, Robert Entwistle, Rory Brown, Joshua Greenhalgh, James Harrison, Jonathon Waters, Ioannis Begleris, Craig Rafter

The £10 million Centre for Doctoral Training was launched in November 2013 and is jointly funded by EPSRC, the University of Southampton, and its partners.

The NGCM brings together world-class simulation modelling research activities from across the University of Southampton and hosts a 4-year doctoral training programme that is the first of its kind in the UK.

Characterisation of the Genomic Landscape in Splenic Marginal Zone Lymphoma

Sarah Ennis, Jane Gibson, Jon Strefford (Investigators), Carolina Jaramillo Oquendo, Helen Parker

This project aims to expand the catalogue of mutated genes in splenic marginal zone lymphoma (SMZL) and construct a detailed characterisation of the genetic landscape of this disease. Using next generation sequencing, we aim to identify somatic mutations in over 100 samples, and enrich clinical data with this information to improve patient treatment and prognosis.

Effects of Sample Contamination on Alternate Allele Frequency

Jane Gibson (Investigator), Roshan Sood

Accurate calling of genetic variants relies on the purity of samples; contamination reduces the accuracy of results. Currently, few programs are able to identify contamination in samples, which can mislead a researcher or clinician. To better understand the changes caused by sample contamination, in silico simulations were performed in which a known percentage of DNA sequence reads from a contaminating file was added. Understanding these changes will assist the development of a new method and program to detect sample contamination.
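
The effect of mixing in contaminating reads can be illustrated with a small simulation. The sketch below is not the project's code. It assumes a single biallelic site, diploid genotypes coded as the number of alternate alleles (0, 1 or 2), error-free reads drawn independently, and hypothetical function names.

```python
import random


def expected_alt_fraction(sample_gt, contaminant_gt, contamination):
    """Expected alternate allele fraction at one site.

    Genotypes are alternate allele counts (0, 1 or 2); contamination is the
    fraction of reads that come from the contaminating sample (0.0 to 1.0).
    """
    return ((1 - contamination) * sample_gt / 2
            + contamination * contaminant_gt / 2)


def simulate_site(sample_gt, contaminant_gt, contamination, depth, rng):
    """Draw `depth` reads and return the observed alternate allele fraction."""
    alt_reads = 0
    for _ in range(depth):
        # Each read comes from the contaminant with probability `contamination`.
        if rng.random() < contamination:
            source_gt = contaminant_gt
        else:
            source_gt = sample_gt
        # A read carries the alternate allele with probability genotype / 2.
        if rng.random() < source_gt / 2:
            alt_reads += 1
    return alt_reads / depth


if __name__ == "__main__":
    rng = random.Random(0)
    # Homozygous-reference sample contaminated by a homozygous-alternate one.
    for level in (0.0, 0.05, 0.10, 0.20):
        observed = simulate_site(0, 2, level, depth=1000, rng=rng)
        expected = expected_alt_fraction(0, 2, level)
        print(f"contamination={level:.2f} expected={expected:.3f} observed={observed:.3f}")
```

Under these assumptions, a site that should show no alternate reads acquires an alternate allele fraction roughly equal to the contamination level, which is the kind of shift a detection method could look for.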

Genetic studies to characterise the role of genetic factors in early-onset breast cancer

Andrew Collins (Investigator), Rosanna Upstill-Goddard

Breast cancer is a highly heterogeneous disease, with many distinct subtypes. In the majority of breast cancer cases the causative genetic component is poorly characterised. This study aims to explore both rare and common mutations in early-onset breast cancer patients and the contribution of such variants to disease using a variety of analytic approaches.

Identification of Gene-Modules Associated to a Predisposition of Post-Traumatic Stress Disorder

Christopher Woelk (Investigator), Michael Breen

The genetic factors that predispose to Post-Traumatic Stress Disorder (PTSD) are unknown. Since not all trauma survivors go on to develop PTSD, it has been hypothesized that transcript differences prior to trauma exposure could be associated with risk of, and resilience to, PTSD. We apply a systems-level approach to investigate changes in transcript abundance (gene expression profiles) in whole blood of U.S. Marines sampled prior to deployment to the battlefield and followed throughout a seven-month deployment to obtain disorder-related outcomes.

Identifying factors required for DNA methylation using the imprinting control protein ZFP57

Deborah Mackay (Investigator)

Mutation of ZFP57 in humans is associated with widespread loss of DNA methylation at imprinted genes, and with clinical features including congenital anomalies and developmental delay (Mackay et al., 2008). This indicates that ZFP57 is required for DNA methylation of imprinted genes necessary for normal development.
We propose to identify the DNA sequences targeted by ZFP57, and its protein cofactors. This work will give insight into the biology of imprinting, indicate mechanisms of disease, and identify novel imprinted genes.

Identifying variants in next generation sequencing data of 61 paediatric Inflammatory Bowel Disease patients

Sarah Ennis (Investigator), Gaia Andreoletti

This study aims to characterise mutations in genes known to predispose to inflammatory bowel disease in 61 paediatric patients using next generation sequencing analysis. Our aim is to identify the relative impact of known genes in individual case presentations of disease and to correlate matches with clinical manifestation.

Imprinting Disorders Finding Out Why

Karen Temple (Investigator)

We are conducting a research project to determine the cause and clinical impact of widespread imprinting aberrations in human development. We are recruiting patients with possible or definite imprinting disorders (due to methylation loss or gain at imprinted loci), including Silver-Russell syndrome, transient neonatal diabetes, Beckwith-Wiedemann syndrome, Angelman syndrome, Prader-Willi syndrome, UPD 14 syndromes and pseudohypoparathyroidism.

Mathematical tools for analysis of genome function, linkage disequilibrium structure and disease gene prediction

Mahesan Niranjan, Andrew Collins, Reuben Pengelly (Investigators)

This iPhD project uses a Gaussian Bayesian network framework, applied through machine learning methods, to predict which genes are involved in the development of different diseases.

Mathematical tools for analysis of genome function, linkage disequilibrium structure and disease gene prediction

Andrew Collins, Mahesan Niranjan, Reuben Pengelly (Investigators), Alejandra Vergara Lope

This iPhD project uses a Gaussian Bayesian network framework, applied through machine learning methods, to predict which genes are involved in the development of different diseases.

Mathematical tools for analysis of genome function, linkage disequilibrium structure and disease gene prediction

Mahesan Niranjan, Andrew Collins, Reuben Pengelly (Investigators)

This PhD project uses a Monte Carlo molecular simulation approach to predict which genes are involved in the development of different diseases.

Metagenomics: Understanding the impacts of environmental change on soil biodiversity

Richard Edwards, Gail Taylor (Investigators), Joseph Jenkins

Drought is expected to increase in prevalence by 2050. Similarly, the use of biochar (a charcoal based soil amendment) has been suggested as a method to sequester carbon and fertilize soils without need of mineral fertilizers, and its use is increasing. We are using next generation DNA sequencing technology and bioinformatics to determine bacterial genetic diversity from soil samples which have been subject to drought or biochar amendment, to further our understanding of the impacts of environmental change on microbial communities.

Predicting Relative Protein Abundance via Sequence-Based Information

Gregory Parkes (Investigator), Mahesan Niranjan

Understanding the complex interactions between transcriptome and proteome is essential in uncovering cellular mechanisms both in health and disease contexts. The limited correlations between corresponding transcript and protein abundance suggest that regulatory processes tightly govern information flow surrounding transcription and translation, and beyond.

In this study we adopt an approach which expands the feature scope that models the human proteome: we develop machine learning models that incorporate sequence-derived features (SDFs), sometimes in conjunction with corresponding mRNA levels. We develop a large resource of sequence-derived features which cover a significant proportion of the H. sapiens proteome, demonstrate which of these features are significant in prediction on multiple cell lines, and suggest insights into which biological processes can be explained using these features.

We (a) show that SDFs are significantly better at protein abundance prediction across multiple cell lines in both steady-state and dynamic contexts, (b) show that SDFs can cover the domain of translation with relative efficiency but struggle with cell-line-specific pathways, and (c) provide a resource which can be plugged into many subsequent protein-centric analyses.

Sample tracking in whole-exome sequencing projects

Andrew Collins, Sarah Ennis (Investigators), Reuben Pengelly

Whole-exome sequencing is entering clinical use for genetic investigations, so robust quality control is essential. We therefore designed and validated a tool that unambiguously ties patient data to a patient, so that errors such as the switching of samples during processing can be identified and thus prevented.
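
One way to picture this kind of check is genotype concordance: genotypes called from the exome data are compared with genotypes obtained independently for each patient, and a sample is assigned to the patient whose profile it matches best. The sketch below only illustrates that general idea and is not the validated tool. The SNP identifiers, patient identifiers and function names are hypothetical.

```python
def concordance(genotypes_a, genotypes_b):
    """Fraction of shared, called SNPs at which two profiles agree.

    Each profile maps a SNP identifier to a genotype string such as "AG",
    or to None when no call was made. Returns None if nothing is comparable.
    """
    shared = [
        snp for snp in genotypes_a
        if genotypes_a[snp] is not None and genotypes_b.get(snp) is not None
    ]
    if not shared:
        return None
    matches = sum(genotypes_a[snp] == genotypes_b[snp] for snp in shared)
    return matches / len(shared)


def best_match(exome_profile, reference_profiles):
    """Return (patient_id, concordance) for the best-matching reference profile."""
    scored = {}
    for patient_id, profile in reference_profiles.items():
        score = concordance(exome_profile, profile)
        if score is not None:
            scored[patient_id] = score
    if not scored:
        return None
    return max(scored.items(), key=lambda item: item[1])


if __name__ == "__main__":
    exome = {"snp1": "AA", "snp2": "AG", "snp3": "GG"}
    references = {
        "P1": {"snp1": "AA", "snp2": "AG", "snp3": "GG"},
        "P2": {"snp1": "AG", "snp2": "AG", "snp3": "AA"},
    }
    print(best_match(exome, references))
```

If the best-matching patient is not the one recorded for the sample, that would point to a possible sample switch worth investigating.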

Tag based transcriptome analysis of gene expression in a promising green algae

Richard Edwards (Investigator), Andreas Johansson

We use SuperSAGE in combination with next-generation sequencing to compare gene expression between selected mutants and the wild type of a green alga. The data, in the form of millions of 26 bp tags representing short stretches of expressed genes, will be analysed to find patterns of variation in gene expression under different conditions.
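
As a rough illustration of how tag data of this kind can be compared between conditions, the sketch below counts 26 bp tags, scales the counts to tags per million, and computes a log2 fold change between a mutant and the wild type. It is a minimal sketch under stated assumptions, not the project's pipeline: the function names, the pseudocount and the filtering rule (exactly 26 bases, A/C/G/T only) are all illustrative choices.

```python
import math
from collections import Counter

TAG_LENGTH = 26  # tag length stated in the project description


def count_tags(reads):
    """Count tags, keeping only reads of exactly 26 unambiguous bases."""
    counts = Counter()
    for read in reads:
        tag = read.strip().upper()
        if len(tag) == TAG_LENGTH and set(tag) <= set("ACGT"):
            counts[tag] += 1
    return counts


def tags_per_million(counts):
    """Scale raw counts so that libraries of different size are comparable."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n * 1_000_000 / total for tag, n in counts.items()}


def log2_fold_change(mutant, wild_type, pseudocount=1.0):
    """log2(mutant / wild type) per tag; the pseudocount avoids division by zero."""
    tags = set(mutant) | set(wild_type)
    return {
        tag: math.log2(
            (mutant.get(tag, 0.0) + pseudocount)
            / (wild_type.get(tag, 0.0) + pseudocount)
        )
        for tag in tags
    }
```

A real analysis would also need to map tags to genes and apply a statistical test for differential expression, which this sketch leaves out.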

The application of next-generation sequencing to unresolved familial disease

Andrew Collins, Sarah Ennis (Investigators), Jane Gibson, Reuben Pengelly

Next-generation sequencing (NGS) allows us to sequence individual patients cost-effectively, opening a new era of genomic medicine. The level of genetic detail that we can access through these methods is unprecedented, making NGS suitable for clinical molecular diagnostics.

Uncovering extensive post-translation regulation during human cell cycle progression by integrative multi-’omics analysis

Gregory Parkes (Investigator), Mahesan Niranjan

Analysis of high-throughput multi-’omics interactions across the hierarchy of expression is of wide interest for making inferences about biological function and for biomarker discovery. Expression levels across different scales are determined by robust synthesis, regulation and degradation processes, and hence transcript (mRNA) measurements made by microarray/RNA-Seq show only modest correlation with corresponding protein levels.

In this work we are interested in quantitative modelling of correlation across such gene products. Building on recent work, we develop computational models spanning transcript, translation and protein levels at different stages of the H. sapiens cell cycle. We enhance this analysis by incorporating 25+ sequence-derived features which are likely determinants of cellular protein concentration and quantitatively select for relevant features, producing a vast dataset with thousands of genes. We reveal insights into the complex interplay between expression levels across time, using machine learning methods to highlight outliers with respect to such models as proteins associated with post-translationally regulated modes of action.

We uncover quantitative separation between modified and degraded proteins that have roles in cell cycle regulation, chromatin remodelling and protein catabolism according to Gene Ontology; and highlight the opportunities for providing biological insights in future model systems.
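
The idea of highlighting outliers with respect to a model of protein level can be sketched with a deliberately simple stand-in. The example below swaps in an ordinary least-squares line of protein abundance against mRNA abundance in place of the machine learning models used in the study, and flags genes whose residual is large relative to the spread of all residuals. The function names, the threshold and the input format are assumptions made for illustration.

```python
import statistics


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y against x."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residual_outliers(genes, mrna, protein, threshold=2.0):
    """Genes whose protein level deviates strongly from the fitted line.

    A gene is flagged when its absolute residual exceeds `threshold`
    population standard deviations of the residuals.
    """
    slope, intercept = fit_line(mrna, protein)
    residuals = [p - (slope * m + intercept) for m, p in zip(mrna, protein)]
    spread = statistics.pstdev(residuals)
    if spread == 0:
        return []
    return [g for g, r in zip(genes, residuals) if abs(r) / spread > threshold]


if __name__ == "__main__":
    genes = [f"gene{i}" for i in range(1, 11)]  # hypothetical gene labels
    mrna = [float(i) for i in range(1, 11)]
    protein = list(mrna)
    protein[4] = 25.0  # one protein far above what its mRNA level predicts
    print(residual_outliers(genes, mrna, protein))
```

Genes flagged this way are candidates for regulation that transcript levels do not capture, which is the spirit in which outliers are interpreted above. A single least-squares fit is itself pulled towards extreme points, so a more robust model would be preferable in practice.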