Benchmarking the GOPHER orthologue prediction algorithm.

Started: 1st October 2012
Ended: 1st June 2013
Research Team: Richard Edwards, Shaun Maguire

By comparing DNA or amino acid sequences, the changes that have occurred in genes between species can be seen, and evolutionarily related genes, called homologues, can be identified (Dessimoz et al, 2012). Orthology and paralogy are terms used to distinguish between two classes of gene homology. Orthology arises from speciation, where two species inherit the same gene from a common ancestor. The gene is then subject to different selection pressures in each organism, resulting in gene pairs with a similar function in both species (Dessimoz et al, 2012). Paralogy is the result of gene duplication; a duplication event provides the genome with an extra gene copy that can diverge in function (Gabaldon et al, 2009). Studying orthologues is important because the function of a newly discovered gene can be inferred if its orthologues can be identified (Dessimoz et al, 2012). However, some orthologues and paralogues do not behave as expected, and proteins can form complexes and can have more than one function (Boeckmann et al, 2011). This makes distinguishing between the types of gene homology difficult.

A variety of orthologue prediction methods exist, all using variations of certain algorithms. In a comparison of four prediction methods, Salichos and Rokas (2011) found that simpler prediction algorithms may be better orthology predictors than more complex ones. One such method is Mutual Best Hit (MBH), which relies on BLAST to identify pairwise orthologues between two species (Salichos and Rokas, 2011). BLAST scores are sensitive to sequence composition and to multi-domain proteins. As a result, unrelated genes can be considered orthologues if they both contain similar low complexity regions, which is a flaw of MBH (Edwards, 2006). Generation of Orthologous Proteins from High-throughput Evolutionary Relationships (GOPHER) is a prediction algorithm that tries to overcome the flaws of MBH using phylogenetic analysis (Edwards, 2006). GOPHER has not been formally benchmarked, but it is utilised by Short Linear Motif (SLiM) prediction methods and websites such as SLiMFinder.
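
The Mutual Best Hit idea described above can be illustrated with a minimal Python sketch. It is not the GOPHER implementation, nor the code of any published MBH tool. It assumes that BLAST results have already been parsed into (query, subject, bit score) tuples; the function names and identifiers are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

# Assumed input format: one parsed BLAST hit per tuple.
Hit = Tuple[str, str, float]  # (query_id, subject_id, bit_score)


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Map each query to its single highest-scoring subject.

    Ties are resolved in favour of the hit seen first.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def mutual_best_hits(a_vs_b: Iterable[Hit],
                     b_vs_a: Iterable[Hit]) -> Set[Tuple[str, str]]:
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)  # species A proteins searched against B
    best_ba = best_hits(b_vs_a)  # species B proteins searched against A
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}
```

Because each pair is decided purely by pairwise scores, a high-scoring low complexity region shared by two unrelated proteins would be enough to produce a mutual best hit in this sketch, which is the weakness noted above.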

To evaluate orthologue prediction methods, a reference dataset will be created from collections of species proteomes that contain complete sets of genes. The reference dataset will act as the “known” evolutionary relationship between species, identifying known orthologues. This “known” relationship will then be compared to the evolutionary relationships predicted by each orthologue prediction program. The accuracy, specificity, sensitivity and false discovery rate of each prediction method can then be analysed and compared. Not all databases contain accurate gene annotations (Dessimoz et al, 2012), and most need continuous updating (Boeckmann et al, 2011). Boeckmann et al (2011) found that, of 7 databases, none were in perfect agreement with a known evolutionary relationship. Therefore, another part of this experiment is to understand the limitations of each database and to investigate how databases affect the accuracy of orthologue prediction methods.
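
As a rough illustration of how the four measures named above could be computed, the Python sketch below compares a set of predicted orthologue pairs against a reference set. It uses the standard confusion-matrix definitions and is not the project's actual evaluation code. The function name, the pair representation and the `candidates` argument (the full set of protein pairs considered, needed to count true negatives) are assumptions made for this example.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]  # (protein in species A, protein in species B)


def _ratio(numerator: int, denominator: int) -> float:
    """Divide safely; an undefined ratio is reported as NaN."""
    return numerator / denominator if denominator else float("nan")


def benchmark(predicted: Set[Pair], reference: Set[Pair],
              candidates: Set[Pair]) -> Dict[str, float]:
    """Score predictions against the reference ("known") orthologues.

    Assumes predicted and reference are both subsets of candidates.
    """
    tp = len(predicted & reference)               # correctly predicted pairs
    fp = len(predicted - reference)               # predicted, not in reference
    fn = len(reference - predicted)               # reference pairs missed
    tn = len(candidates - predicted - reference)  # correctly left unpredicted
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "accuracy": _ratio(tp + tn, tp + fp + fn + tn),
        "false_discovery_rate": _ratio(fp, tp + fp),
    }
```

In practice the number of non-orthologous candidate pairs is very large compared with the number of true orthologues, so specificity and accuracy computed this way tend to sit close to 1; sensitivity and false discovery rate are likely to be the more informative figures when comparing methods.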

This experiment aims to benchmark the GOPHER algorithm and investigate whether it is a more accurate orthologue predictor than other prediction methods. By benchmarking GOPHER, its effectiveness can be established, and prediction methods that rely on this algorithm, such as SLiM prediction methods, can be adjusted accordingly.

Computational Modelling Group
