SOP010: Annotating a Feature

The SEED Environment is designed specifically to facilitate annotation of features in genomes. Although the Rapid Annotation Server annotates genomes automatically, its accuracy depends on the quality of existing annotations made by experts working with the Fellowship for Interpretation of Genomes. This procedure outlines how that manual annotation is performed.

This is intended to be a user-friendly procedure. The formal documentation is available in PDF form here.

Annotation Decision Procedure

Mark genes for which relevant literature exists.

A semi-automatic procedure exists for determining whether relevant literature exists and attaching it to genes. We attach (using the tools provided by NCBI) specific papers to genes from the NMPDR genomes. During curation, our annotators can delete connections they consider inappropriate, or they can add references (often under guidance from the user community). The percentage of genes in NMPDR genomes for which connections to publications exist is low. The more common case is that a paper is connected to one or more clear orthologs of a gene from an NMPDR organism. For each gene maintained in our subsystem collection, we have attached relevant papers to the functional roles maintained in the subsystem, and annotators can access these papers as they curate each gene. When papers are connected directly to a gene, the evidence code dlit (for direct literature) is attached to the annotation. When no direct references are connected, but connections do exist to the functional role of the containing subsystem, a code of ilit (for indirect literature) is attached to the gene.
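The choice between dlit and ilit can be sketched in a few lines of Python. This is only an illustration of the rule described above, not the SEED implementation; the function name and its two arguments are hypothetical, and the paper identifiers in the usage line are placeholders.

```python
from typing import Optional, Sequence


def literature_evidence_code(
    gene_papers: Sequence[str],
    role_papers: Sequence[str],
) -> Optional[str]:
    """Pick the literature evidence code for one gene.

    gene_papers: papers connected directly to the gene (hypothetical input).
    role_papers: papers connected to the gene's functional role in its
                 subsystem (hypothetical input).
    """
    if gene_papers:
        return "dlit"  # direct literature
    if role_papers:
        return "ilit"  # indirect literature, via the functional role
    return None        # no literature evidence code applies


print(literature_evidence_code([], ["paper-A"]))
```

Direct connections take precedence here because ilit is described as applying only when no direct references are connected.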

Use a manually curated subsystems-based annotation if it exists.

If a feature has been manually placed in a subsystem by an annotator, this amounts to an assertion by the annotator that the gene implements one or more functional roles from the subsystem. We attach evidence codes to the gene to reflect important attributes of the connection:

  • A code of icw(n) is used to indicate that the connected feature occurs in a chromosomal cluster containing n other features connected to the same subsystem.
  • A code of isu is used if the feature is the only one within the genome that is connected to the given functional role.
  • A code of idu(n) is used to indicate that the connection between the feature and the functional role is not unique. In particular, there are n other features that are believed to implement the same functional role. This evidence code should be viewed as significantly weakening the confidence of the assignment.
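As a rough sketch, the three subsystem-related codes above could be derived from two counts per feature. The function and parameter names below are hypothetical, and the sketch assumes that icw(n) is only emitted when n is at least 1, which the text does not state explicitly.

```python
from typing import List


def subsystem_evidence_codes(
    cluster_neighbors: int,
    other_features_same_role: int,
) -> List[str]:
    """Build the subsystem evidence codes for one feature.

    cluster_neighbors: number of other features in the same chromosomal
        cluster that are connected to the same subsystem.
    other_features_same_role: number of other features in the genome
        believed to implement the same functional role.
    """
    codes = []
    if cluster_neighbors > 0:  # assumption: no icw code when n is 0
        codes.append(f"icw({cluster_neighbors})")
    if other_features_same_role == 0:
        codes.append("isu")    # only feature connected to this role
    else:
        codes.append(f"idu({other_features_same_role})")  # role is duplicated
    return codes


print(subsystem_evidence_codes(3, 0))  # ['icw(3)', 'isu']
```

Under this reading, isu and idu(n) are mutually exclusive, since a feature is either the only one connected to its role or one of several.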

Process genes not yet included in subsystems.

At the time of this writing, over half of the known features remain outside defined subsystems. These features are processed automatically. For each feature not in a subsystem, the following evidence codes are attached, as applicable.

  • If the feature occurs within a FIGfam (presumably arising through close-strain sets, since the feature is not included in a subsystem), the code ff is attached. In this case, the function assigned is determined by the function associated with the given FIGfam.
  • If the feature is functionally clustered with other genes at a score level of 5 or more, a clustering code is attached to the gene: cwn (clusters with non-hypothetical) if at least one of the other genes has a non-hypothetical functional role, or cwh if all of the clustered features are hypothetical.
  • If the feature is not in a FIGfam, it is assigned a function by examining the assignments made to similar features by other institutions.
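The automatic evidence codes for features outside subsystems might be computed along these lines. This is a minimal sketch under stated assumptions: the function name, the argument shapes, and the representation of clustered genes as (score, is_hypothetical) pairs are all hypothetical; only the codes and the score level of 5 come from the procedure above.

```python
from typing import List, Sequence, Tuple

CLUSTER_SCORE_THRESHOLD = 5  # score level named in the procedure


def automatic_evidence_codes(
    in_figfam: bool,
    clustered_genes: Sequence[Tuple[float, bool]],
) -> List[str]:
    """Evidence codes for a feature that is not in any subsystem.

    clustered_genes: (clustering score, is_hypothetical) for each gene
        functionally clustered with the feature (hypothetical input shape).
    """
    codes = []
    if in_figfam:
        codes.append("ff")  # function comes from the FIGfam
    # Keep only partners clustered at the required score level.
    partners = [
        is_hypothetical
        for score, is_hypothetical in clustered_genes
        if score >= CLUSTER_SCORE_THRESHOLD
    ]
    if partners:
        # cwh only if every qualifying partner is hypothetical.
        codes.append("cwh" if all(partners) else "cwn")
    return codes


print(automatic_evidence_codes(True, [(6, True), (7, False)]))  # ['ff', 'cwn']
```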

Summary

We rank our confidence in assignments based on the attached evidence codes. Our guidelines to users are as follows:

  • Features with codes icw(n) and/or isu are considered most reliable. An additional dlit (or to a lesser extent ilit) increases confidence.
  • Features with idu(n) and/or ff are the next most reliable. Again, dlit and ilit increase confidence.
  • Features with cwn are far less reliable. The functional clustering with a non-hypothetical assignment is viewed as a suggestive clue.
  • Features with just cwh are also considered very unreliable. The clustering should be viewed as a significant clue.
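The ranking above can be expressed as a small lookup, shown here as a sketch rather than an official scoring scheme. The tier numbers (1 is most reliable), the function name, and the rule that a feature takes the best tier among its codes are assumptions made for illustration; the guidelines only give the ordering.

```python
from typing import Iterable, Optional, Tuple

# Tiers in decreasing order of reliability; the numbering is an assumption.
TIERS = [
    ({"icw", "isu"}, 1),
    ({"idu", "ff"}, 2),
    ({"cwn"}, 3),
    ({"cwh"}, 4),
]
LITERATURE_CODES = {"dlit", "ilit"}


def confidence_tier(codes: Iterable[str]) -> Tuple[Optional[int], bool]:
    """Return (tier, has_literature_support) for a feature's evidence codes."""
    # Strip the "(n)" suffix so that "icw(3)" is matched as "icw".
    names = {code.split("(")[0] for code in codes}
    has_literature = bool(names & LITERATURE_CODES)
    for members, tier in TIERS:
        if names & members:
            return tier, has_literature
    return None, has_literature  # no ranked evidence code present


print(confidence_tier(["icw(3)", "isu", "dlit"]))  # (1, True)
```

The literature flag is returned separately because dlit and ilit are described as raising confidence within a tier rather than defining a tier of their own.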
Topic revision: r6 - 30 Jan 2009 - 05:55:16 - Bruce Parrello
 