SOP010: Annotating a Feature
The SEED Environment is designed specifically to facilitate annotation of features in genomes. Although the
Rapid Annotation Server annotates genomes automatically, its accuracy depends on the quality of existing
annotations by experts working with the Fellowship for Interpretation of Genomes. This procedure outlines
how the manual annotation is performed.
This is intended to be a user-friendly procedure. The formal documentation is available in PDF form
here.
Annotation Decision Procedure
Mark genes for which relevant literature exists.
A semi-automatic procedure exists for determining whether relevant literature exists and
attaching it to genes. We attach (using the tools provided by NCBI) specific
papers to genes from the NMPDR genomes. During curation our annotators can
delete connections they consider inappropriate, or they can add references (often
under guidance from the user community). The percentage of genes in NMPDR
genomes for which connections to publications exist is low. The more common case
is a paper connected to one or more clear orthologs of a gene from an
NMPDR organism. For each gene maintained in our subsystem collection, we have
attached relevant papers to the functional roles maintained in the subsystem, and
annotators can access these papers as they curate each gene. When papers are
connected directly to a gene, the evidence code dlit (for direct literature) is attached to
the annotation. When no direct references are connected, but connections do exist to the
functional role of the containing subsystem, a code of ilit (for indirect literature) is
attached to the gene.
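The choice between dlit and ilit described above can be sketched as follows. This is only an illustration: the function name `literature_code` and the representation of papers as sets of identifier strings are assumptions made for the example, not part of the SEED tooling.

```python
from typing import Optional, Set


def literature_code(gene_papers: Set[str], role_papers: Set[str]) -> Optional[str]:
    """Pick the literature evidence code for one gene (hypothetical helper).

    gene_papers: papers connected directly to the gene.
    role_papers: papers connected to the functional role of the containing subsystem.
    """
    if gene_papers:
        # Direct connections take precedence over role-level connections.
        return "dlit"
    if role_papers:
        # No direct references, but the functional role has literature.
        return "ilit"
    # No literature connection of either kind.
    return None
```

For example, a gene with no papers of its own whose functional role has one attached paper would receive ilit under this sketch.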
Use a manually curated subsystems-based annotation if it exists.
If a feature has been manually placed in a subsystem by an annotator, this amounts to an
assertion by the annotator that the gene implements one or more functional roles from
the subsystem. We attach evidence codes to the gene to reflect important attributes of
the connection:
- A code of icw(n) is used to indicate that the connected feature occurs in a chromosomal cluster containing n other features connected to the same subsystem.
- A code of isu is used if the feature is the only one within the genome that is connected to the given functional role.
- A code of idu(n) is used to indicate that the connection between the feature and the functional role is not unique. In particular, there are n other features that are believed to implement the same functional role. This evidence code should be viewed as significantly weakening the confidence of the assignment.
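As a rough illustration of the three codes above, the following sketch derives them for a single feature. The function `subsystem_codes`, its arguments, and the rule that icw(n) is attached only when n is at least 1 are assumptions made for this example; the text does not state a minimum cluster size.

```python
from typing import Dict, List


def subsystem_codes(
    feature: str,
    role: str,
    role_to_features: Dict[str, List[str]],
    clustered_neighbors: int,
) -> List[str]:
    """Evidence codes for a feature placed in a subsystem (hypothetical helper).

    role_to_features: for one genome, the features connected to each functional role.
    clustered_neighbors: number of other features from the same subsystem that
        share a chromosomal cluster with this feature.
    """
    codes = []
    if clustered_neighbors > 0:  # assumption: no icw code for an empty cluster
        codes.append(f"icw({clustered_neighbors})")

    # Other features in the genome connected to the same functional role.
    others = [f for f in role_to_features.get(role, []) if f != feature]
    if others:
        codes.append(f"idu({len(others)})")  # connection is not unique
    else:
        codes.append("isu")  # the only feature with this role in the genome
    return codes
```

Under these assumptions, a feature that is the sole implementer of its role and sits in a cluster with three other subsystem features would carry icw(3) and isu.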
Process genes not yet included in subsystems.
At the time of this writing, over half of the known features remain outside defined subsystems. These features
are processed automatically. For each feature not in a subsystem, the following
evidence codes are attached, as applicable.
- If the feature occurs within a FIGfam (presumably arising through close-strain sets, since the feature is not included in a subsystem), the code ff is attached. In this case, the function assigned is the function associated with the given FIGfam.
- If the feature is functionally clustered with other genes at a score level of 5 or more, and at least one of the other genes has a non-hypothetical functional role, an evidence code of cwn (clusters with non-hypothetical) is attached to the gene; if all of the clustered features are hypothetical, cwh is attached.
- If the feature is not in a FIGfam, it is assigned a function by examining the assignments made to similar features by other institutions.
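The automatic codes in the list above can be sketched as shown here. The names `automatic_codes` and `is_hypothetical` are hypothetical, and the test for a hypothetical function (an empty string or one containing the word "hypothetical") is an assumption; the actual criterion is not given in this procedure. The similarity-based function assignment in the last item is not modelled.

```python
from typing import List, Tuple

CLUSTER_SCORE_THRESHOLD = 5  # "a score level of 5 or more"


def is_hypothetical(function: str) -> bool:
    """Assumed test: empty or containing 'hypothetical' counts as hypothetical."""
    return not function.strip() or "hypothetical" in function.lower()


def automatic_codes(in_figfam: bool, partners: List[Tuple[int, str]]) -> List[str]:
    """Evidence codes for a feature outside all subsystems (hypothetical helper).

    partners: (clustering score, assigned function) for each other gene the
        feature is functionally clustered with.
    """
    codes = []
    if in_figfam:
        codes.append("ff")

    # Keep only partners that meet the clustering score threshold.
    functions = [fn for score, fn in partners if score >= CLUSTER_SCORE_THRESHOLD]
    if functions:
        if any(not is_hypothetical(fn) for fn in functions):
            codes.append("cwn")  # clusters with at least one non-hypothetical gene
        else:
            codes.append("cwh")  # every clustered gene is hypothetical
    return codes
```

Because the codes are attached "as applicable", the sketch allows ff to appear alongside cwn or cwh rather than treating them as mutually exclusive.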
Summary
We rank our confidence in assignments based on the attached
evidence codes. Our guidelines to users are as follows:
- Features with codes icw(n) and/or isu are considered most reliable. An additional dlit (or to a lesser extent ilit) increases confidence.
- Features with idu(n) and/or ff are the next most reliable. Again, dlit and ilit increase confidence.
- Features with cwn have far less reliability. The functional clustering with a non-hypothetical assignment is viewed as a suggestive clue.
- Features with just cwh are also considered very unreliable. The clustering should be viewed as a significant clue.
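One possible way to turn the guidelines above into a lookup is sketched below. The numeric tiers (1 for most reliable through 4 for least) and the function name `confidence` are assumptions introduced for illustration. The guidelines do not say how to rank a feature that carries codes from several tiers, so the sketch simply reports the most reliable tier present; that choice is also an assumption.

```python
import re
from typing import Iterable, Optional, Tuple

# Assumed numeric tiers: 1 is most reliable, 4 is least.
TIER_BY_CODE = {"icw": 1, "isu": 1, "idu": 2, "ff": 2, "cwn": 3, "cwh": 4}


def base_name(code: str) -> str:
    """Strip a trailing count, so 'icw(3)' becomes 'icw'."""
    return re.sub(r"\(\d+\)$", "", code)


def confidence(codes: Iterable[str]) -> Tuple[Optional[int], Optional[str]]:
    """Return (tier, literature support) for a set of evidence codes.

    Literature support is 'dlit', 'ilit', or None; dlit is preferred because
    the guidelines describe ilit as raising confidence to a lesser extent.
    """
    names = {base_name(c) for c in codes}
    tiers = [TIER_BY_CODE[n] for n in names if n in TIER_BY_CODE]
    tier = min(tiers) if tiers else None
    if "dlit" in names:
        literature = "dlit"
    elif "ilit" in names:
        literature = "ilit"
    else:
        literature = None
    return tier, literature
```

A feature carrying icw(3) and dlit would fall in tier 1 with direct literature support, while a feature carrying only cwh would fall in tier 4 with none.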