How to Evaluate and Compare Genomes: A Proposal

Introduction

I was asked to defend the proposition that the annotations provided on the NMPDR website add significant value to the annotations provided in RefSeq. In reflecting on how I would structure such an argument, I gradually arrived at the view that I present in this proposal: that we should develop metrics to support comparison of annotations, and use these metrics both to improve our annotations and to compare them against annotations provided by others.

The central issue in formulating effective annotation metrics is that they must accurately measure the quality of the annotations. Any experienced annotator is familiar with cases in which any existing automated annotation or evaluation environment would produce inaccurate results. It is possible to build evaluation schemes based on careful manual evaluation of statistical samples. Such approaches have both obvious merits and significant drawbacks. My first response was to suggest such a sampling-based approach. However, upon reflection, I wish to suggest a metric that requires no manual component. This proposal focuses on what I think is needed to provide a well-defined and useful metric.

What Is Meant by Comparing Annotations?

When we at the NMPDR speak of comparing annotations with other sources, we will be referring to

  • the identification of genes (and their precise locations),
  • the functions assigned to each of these genes,
  • which subsystems are believed to be present in the genome,
  • the connection between the genes and functional roles within the subsystems, and
  • extra tables giving inventories of essential genes and virulence factors.

The scope of what is meant by annotations is clearly somewhat arbitrary. In my view the first two items (the location of the genes and a description of their functions) constitute the most commonly included notions.

To compare annotations from two sources, we will need to implement metrics that allow one to measure both consistency and accuracy for each of the notions. In this document, I will focus on metrics for only the first two notions.

Evaluating Gene Calls and Assignments of Function

The essence of this proposal is as follows:

  1. Formulate a set of protein families such that you are relatively sure that all of the members of a family share the same function. Call this set of families F1. For collections of genomes that are close, F1 should contain 50-90% of the genes in the collection of genomes.
  2. Select a subset of F1 such that the function of the genes in each family is known with relative confidence. Call this set of families F2. This set should still contain at least 25-40% of the genes.
  3. Evaluate accuracy of assignments using F2 and consistency using F1.

It is important that the percentages of genes that remain in the two sets are high, to forestall the natural criticism that we might be biasing the test framework.
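The coverage targets for F1 and F2 can be checked mechanically. The following is a minimal sketch, assuming each family is available as a set of gene IDs and that the full gene list for the genome collection is known; the function name `coverage` and the example IDs are hypothetical.

```python
from typing import Iterable, Set


def coverage(families: Iterable[Set[str]], all_genes: Iterable[str]) -> float:
    """Fraction of genes in the collection that belong to at least one family."""
    all_genes = set(all_genes)
    if not all_genes:
        return 0.0
    # Union of all family members, restricted to genes actually in the collection.
    covered = set().union(*families) & all_genes
    return len(covered) / len(all_genes)


# Hypothetical collection of ten genes, with toy versions of F1 and F2.
genes = [f"g{i}" for i in range(1, 11)]
f1 = [{"g1", "g2", "g3"}, {"g4", "g5"}, {"g6", "g7"}]
f2 = [{"g1", "g2", "g3"}]

f1_coverage = coverage(f1, genes)
f2_coverage = coverage(f2, genes)
```

In this toy case F1 covers 7 of 10 genes and F2 covers 3 of 10, so both would fall within the ranges suggested above.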

The construction of F1 can proceed with any fairly conservative algorithm for projecting correspondence of genes. I believe that we could reasonably use a set of algorithms proposed by different groups and retain only sets that are determined identically by all of the algorithms (i.e., a very conservative subset).
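The "very conservative subset" idea can be sketched as a set intersection: keep a family only if every method reports it with exactly the same membership. This is an illustration under the assumption that each method's output is a list of gene-ID groups; `consensus_families` and the three example methods are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Set

Family = FrozenSet[str]


def consensus_families(clusterings: List[Iterable[Iterable[str]]]) -> Set[Family]:
    """Return the families that every clustering determines identically."""
    if not clusterings:
        return set()
    # Represent each family as a frozenset so member order does not matter.
    family_sets = [{frozenset(fam) for fam in clustering} for clustering in clusterings]
    return set.intersection(*family_sets)


# Hypothetical output of three family-building methods over the same gene IDs.
method_a = [["g1", "g2", "g3"], ["g4", "g5"], ["g6", "g7"]]
method_b = [["g3", "g2", "g1"], ["g4", "g5", "g8"], ["g6", "g7"]]
method_c = [["g1", "g2", "g3"], ["g4", "g5"], ["g7", "g6"]]

consensus = consensus_families([method_a, method_b, method_c])
```

Here the family containing g4 and g5 is dropped because one method adds g8 to it; only the two families on which all three methods agree survive.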

The issue of how best to determine F2 is more difficult. Here the following possibilities should be considered:

  1. You could select a single genome from the set of closely-related genomes that you believed was best annotated and simply assert that the functional assignments made for that genome represent a best estimate of accuracy.
  2. You could take the subset of genes for which Swiss-Prot assignments exist. If you wish to be more conservative, you could pick the subset that connects to research publications (other than “genome” papers).
  3. You could take the genes that have been placed in the manually-curated subset of PIR families.
  4. You could pick those genes that have been placed into subsystems (by the Project to Annotate 1000 Genomes). To be more conservative, you might restrict the set to those genes for which no other candidate for the same function exists.

My point is not to argue about which approach is better, since I believe that they all would probably lead to the same result: RefSeq annotations are sometimes good (when the annotated genome originally deposited into GenBank reflected a serious annotation effort), but the quality over a set of closely related genomes is pretty poor.

Evaluating Consistency of Gene Calls

Consistency of gene calls can be evaluated using F1. I am reasonably confident that we will find substantial inconsistency in start positions and a much lower rate of inconsistency in actual gene calls. However, these do need to be tabulated for each of the pathogen sets curated by the Bioinformatics Resource Centers. We need to construct a utility that takes as input a set of genomes and produces counts of

  1. Missed/Inserted Genes – cases in which a gene is called in one genome of the set but not in another.
  2. Serious overlaps – overlaps of more than 50 bp and of more than 100 bp need to be counted (separately).
  3. Inconsistent starts – we need to compute consensus start positions, and then tabulate deviations from the consensus.

We will implement a program to accumulate these counts.
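The three counts listed above could be accumulated roughly as follows. This is a sketch, not the proposed utility itself: the `Gene` record, its field names, and in particular the `aligned_start` field (a start position expressed in the coordinates of a family alignment, assumed to be precomputed elsewhere) are hypothetical, and the consensus start is taken to be simply the most common start within a family.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Gene:
    gene_id: str
    genome: str
    contig: str
    left: int           # leftmost coordinate, 1-based inclusive
    right: int          # rightmost coordinate, 1-based inclusive
    family: str         # F1 family ID, or "" if the gene is in no family
    aligned_start: int  # ASSUMPTION: start in family-alignment coordinates


def overlap_bp(a: Gene, b: Gene) -> int:
    """Number of bases shared by two genes (0 if on different contigs)."""
    if a.genome != b.genome or a.contig != b.contig:
        return 0
    return max(0, min(a.right, b.right) - max(a.left, b.left) + 1)


def count_overlaps(genes: List[Gene], thresholds: Tuple[int, ...] = (50, 100)) -> Dict[int, int]:
    """Count gene pairs overlapping by more than each threshold.
    All-pairs comparison is quadratic; a real tool would sort by position."""
    counts = {t: 0 for t in thresholds}
    for a, b in combinations(genes, 2):
        bp = overlap_bp(a, b)
        for t in thresholds:
            if bp > t:
                counts[t] += 1
    return counts


def count_missing(genes: List[Gene], genomes: Iterable[str]) -> Dict[str, int]:
    """For each family, count the genomes in which no member was called."""
    present: Dict[str, set] = {}
    for g in genes:
        if g.family:
            present.setdefault(g.family, set()).add(g.genome)
    all_genomes = set(genomes)
    return {fam: len(all_genomes - seen) for fam, seen in present.items()}


def count_start_deviations(genes: List[Gene]) -> Dict[str, int]:
    """For each family, count members whose start differs from the consensus."""
    starts_by_family: Dict[str, List[int]] = {}
    for g in genes:
        if g.family:
            starts_by_family.setdefault(g.family, []).append(g.aligned_start)
    deviations = {}
    for fam, starts in starts_by_family.items():
        consensus, _ = Counter(starts).most_common(1)[0]
        deviations[fam] = sum(1 for s in starts if s != consensus)
    return deviations


# Hypothetical calls for three closely related genomes A, B, and C.
calls = [
    Gene("a1", "A", "contig1", 100, 400, "fam1", 1),
    Gene("a2", "A", "contig1", 340, 900, "fam2", 1),   # overlaps a1 by 61 bp
    Gene("b1", "B", "contig1", 100, 400, "fam1", 1),   # B has no fam2 gene
    Gene("c1", "C", "contig1", 130, 400, "fam1", 11),  # later start than consensus
    Gene("c2", "C", "contig1", 380, 900, "fam2", 1),   # overlaps c1 by 21 bp
]

overlaps = count_overlaps(calls)
missing = count_missing(calls, ["A", "B", "C"])
start_deviations = count_start_deviations(calls)
```

In this toy input, one overlap exceeds 50 bp and none exceeds 100 bp, fam2 is missing from one genome, and one fam1 member deviates from the consensus start.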

Evaluating Consistency and Accuracy of Assigned Functions

There are two levels of consistency for assignment of function: the first treats two annotations as identical if they match word-for-word, and the second if they convey essentially the same meaning. The first is easily implemented. The second can be effectively approximated automatically. Consistency should be evaluated both ways and tabulated separately for all sets of corresponding genes in F1.
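The two levels of consistency can be tabulated side by side. In the sketch below, word-for-word identity is plain string equality, while "essentially the same meaning" is approximated very crudely by normalizing case and punctuation and dropping EC numbers and a few hedging words; that normalization is an assumption made for illustration, not a proposed standard, and the function names and example assignments are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# ASSUMPTION: qualifiers treated as not changing the meaning of an assignment.
HEDGE_WORDS = {"putative", "probable", "predicted"}


def normalize(function: str) -> str:
    """Crude canonical form of a function string, used to approximate meaning."""
    text = function.lower()
    text = re.sub(r"\(ec [\d.\-]+\)", " ", text)  # drop EC numbers such as (EC 2.7.1.2)
    text = re.sub(r"[^a-z0-9]+", " ", text)       # punctuation and spacing become one space
    return " ".join(w for w in text.split() if w not in HEDGE_WORDS)


def family_consistency(assignments: List[str]) -> Tuple[bool, bool]:
    """Return (word-for-word identical, identical after normalization)."""
    exact = len(set(assignments)) == 1
    similar = len({normalize(a) for a in assignments}) == 1
    return exact, similar


def tabulate(families: Dict[str, List[str]]) -> Dict[str, int]:
    """Count families in F1 by level of consistency."""
    counts = {"exact": 0, "similar_only": 0, "inconsistent": 0}
    for assignments in families.values():
        exact, similar = family_consistency(assignments)
        if exact:
            counts["exact"] += 1
        elif similar:
            counts["similar_only"] += 1
        else:
            counts["inconsistent"] += 1
    return counts


# Hypothetical function assignments for the members of three F1 families.
example = {
    "fam1": ["Glucokinase (EC 2.7.1.2)", "Glucokinase (EC 2.7.1.2)"],
    "fam2": ["Glucokinase (EC 2.7.1.2)", "putative glucokinase"],
    "fam3": ["DNA gyrase subunit A", "hypothetical protein"],
}

consistency_counts = tabulate(example)
```

Keeping the "similar_only" count separate from the exact count preserves the distinction between the two levels, so a reader can judge how much of the agreement depends on the normalization rules.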

Accuracy is based on comparison against F2 (for which the functional role corresponding to each set is believed to be known). Just as for consistency, one can count both identical matches and cases in which the functional role of the protein set and the gene are “essentially the same”.
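Accuracy scoring against F2 might look like the following sketch. It assumes F2 is available as a mapping from family ID to its reference functional role, and that each annotated gene carries its family ID and assigned function. The "essentially the same" test is passed in as a function so that it can be replaced; the default used here (ignore case and punctuation) is only a placeholder, and all names and example data are hypothetical.

```python
import re
from typing import Callable, Dict, Tuple


def loosely_equal(a: str, b: str) -> bool:
    """Placeholder for 'essentially the same': ignore case and punctuation."""
    def squash(s: str) -> str:
        return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()
    return squash(a) == squash(b)


def accuracy(
    reference_roles: Dict[str, str],
    calls: Dict[str, Tuple[str, str]],
    same_meaning: Callable[[str, str], bool] = loosely_equal,
) -> Dict[str, int]:
    """Score assigned functions against the reference role of each F2 family."""
    scored = exact = similar = 0
    for gene_id, (family, function) in calls.items():
        if family not in reference_roles:
            continue  # gene is outside F2, so it cannot be scored
        scored += 1
        role = reference_roles[family]
        if function == role:
            exact += 1
        elif same_meaning(function, role):
            similar += 1
    return {
        "scored": scored,
        "exact": exact,
        "similar": similar,
        "wrong": scored - exact - similar,
    }


# Hypothetical F2 reference roles and annotations from one source.
f2_roles = {"fam1": "DNA gyrase subunit A", "fam2": "Glucokinase"}
annotations = {
    "g1": ("fam1", "DNA gyrase subunit A"),
    "g2": ("fam1", "DNA gyrase, subunit A"),
    "g3": ("fam2", "hypothetical protein"),
    "g4": ("fam9", "Glucokinase"),  # fam9 is not in F2, so g4 is skipped
}

accuracy_counts = accuracy(f2_roles, annotations)
```

Running the same function over the annotations from two sources, with the same F2, would give directly comparable counts.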

Steps to be Taken Immediately

The following steps are needed to properly implement these plans:

  1. A program to evaluate gene calls must be constructed. This program would take as input a set of closely-related genomes with called genes, and optionally a large non-redundant protein database. The output would be tabulated counts for the undesirable properties mentioned earlier.
  2. We need to construct initial versions of F1 and F2. This could be a cooperative effort among the Bioinformatics Resource Centers, merging the results produced from different sources. We need to run these for each group of closely-related genomes corresponding to BRC pathogens, tabulate counts for those protein sets, and tabulate potentially inaccurate annotations.

We are proposing that we begin this process now and use the short meeting in August 2006 to evaluate the impact (or lack of it) resulting from the experiment.

Original Author: Ross Overbeek
Display Title: How to Evaluate and Compare Genomes
Original date: 2006-06-08

Topic revision: r2 - 19 Mar 2008 - 19:39:51 - BruceParrello