Reflections on Accurate Annotations: The Basic Cycle and Its Significance
This is a revised update of Reflections on 2007, which was originally written in December of 2007. Like the original article, it discusses the annotation cycle shown in the diagram below.
The Production of Accurate Annotations
The efforts required to establish a framework for high-volume, accurate annotation are substantial. I believe that it is important that we reflect on what we have learned about the factors that determine productivity. So, what have we learned from the Project to Annotate 1000 Genomes?
First, subsystem-based annotation is the key to accuracy. While there are certainly numerous efforts still focusing on annotation of a single genome, it is now generally accepted that comparative analysis is the key to everything, and that focusing on the variations of a single component of cellular machinery as they are manifested over the entire collection of existing genomes is the key to accuracy. Manual creation and maintenance of subsystems is the rate-limiting component of successful annotation efforts, and the factors that constrain this process are at the heart of the matter. We have understood this for some time now.
Components and Costs
I am going to argue a new position in this short essay:
- There are three distinct components that make up our strategy for rapid accurate annotation: subsystems-based annotation, FIGfams as a framework for propagating the subsystems annotations, and RAST as a technology for using FIGfams and subsystems to consistently propagate annotations to newly-sequenced genomes.
- These three components form a cycle (subsystems => FIGfams => RAST technology => subsystems). This cycle creates a feedback that rapidly accelerates the productivity achievable in all three components. Further, failure in any of these components impairs productivity dramatically in the others. Understanding this cycle will be the key to supporting higher productivity in subsystem maintenance and creation.
- To understand the dependencies, we need to consider each of the three components:
- FIGfams: The key to accurate FIGfam creation and maintenance is to couple it directly to subsystem maintenance. Since the initial release of the FIGfams, we have been updating them automatically based on changes in the subsystem collection. Thus, FIGfams are automatically split, merged and added as the subsystem collection is maintained. There remains one area of substantial cost in FIGfam development: creation of the family-dependent decision procedures that are occasionally needed to achieve the required accuracy. At this point we have approximately 10,000 subsystem-based FIGfams, although the overall collection contains over 100,000 families (the majority containing only 2-3 members).
- RAST: RAST has a central dependency on FIGfams for assertion of function to newly-recognized genes. In this sense, the main dependency of RAST is on the FIGfam collection. The more accurate the FIGfams and their associated decision procedures, the more accurate the assignments of function to genes in genomes processed by RAST.
- Subsystems: The central costs of maintaining subsystems include cleaning up errors in existing subsystems (often indicated by multiple genes having the same function) and adding new genomes to existing subsystems. Once a subsystem has reached an acceptable level of accuracy (and many are not there yet), the central cost is integration of new genomes after annotation by RAST. The speed with which new genomes can be added depends on how well RAST assigns gene function (and, secondarily, on how accurately these RAST-based annotations can be used to infer operational variants of subsystems).
- The main costs of increasing the speed and accuracy of annotations can be split into two categories: those relating to maintenance of existing subsystems, and those relating to generation of new subsystems. The maintenance costs are containable if the cycle is established and functions smoothly. Otherwise, I suspect they inevitably grow rapidly.
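The error signal mentioned under Subsystems above (multiple genes in one genome carrying the same function) lends itself to a simple check. What follows is only a minimal sketch in Python, not the SEED implementation: the row layout, the identifiers and the function name `find_duplicate_roles` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (genome_id, gene_id, functional_role).
# The identifiers and roles are invented for illustration only.
assignments = [
    ("genome1", "gene1", "Alanine racemase"),
    ("genome1", "gene2", "Alanine racemase"),
    ("genome1", "gene3", "Thymidylate kinase"),
    ("genome2", "gene1", "Alanine racemase"),
]


def find_duplicate_roles(rows):
    """Return {(genome_id, role): [gene_ids]} for roles assigned to more than one gene."""
    genes_by_role = defaultdict(list)
    for genome_id, gene_id, role in rows:
        genes_by_role[(genome_id, role)].append(gene_id)
    # Keep only the (genome, role) pairs that have two or more genes.
    return {
        key: sorted(genes)
        for key, genes in genes_by_role.items()
        if len(genes) > 1
    }


duplicates = find_duplicate_roles(assignments)
for (genome_id, role), genes in sorted(duplicates.items()):
    print(f"{genome_id}: '{role}' assigned to {', '.join(genes)}")
```

A flagged pair is a candidate for review rather than a confirmed error, since some duplicates will be genuine paralogs.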
I have argued that the cost of achieving rapid, accurate annotations is determined by the rate at which subsystems can be maintained and created. I place maintenance ahead of creation at this stage. As the collection grows (it now contains over 600 subsystems with over 6800 distinct functional roles), costs of maintenance will tend to dominate. The creation of new subsystems will always be a critical activity, but each new subsystem will impact smaller sets of genomes as we "move into the tail of the distribution".
The costs relating to subsystem maintenance, which will quickly dominate, depend critically on how smoothly the cycle I described functions. We have just established the complete cycle.
The two central costs that cannot be avoided will be creation of FIGfam-dependent decision procedures and the creation of new subsystems. The manual work on FIGfams will be necessary to achieve near-100% accuracy on annotation of seriously ambiguous paralogs. However, in the vast majority of cases, this effort will be restricted to specific curators who are willing to spend massive effort to get things perfect. The more central cost relates to manual curation of the subsystems.
More Effective Integration of Existing Annotation Efforts
In the section above, I reflected on the cycle that we shall depend upon for supporting increased volume and accuracy of our own efforts. Other groups are certainly experimenting with their own solutions, and in some cases with clear successes. I have no desire to rate these competing efforts. I sincerely believe that cooperative activity is the key to enhanced achievements by everyone. However, effective cooperation is often elusive. I think that we have put in place an extremely important mechanism for making cooperation much easier, and the benefits more compelling.
Anyone working for one of the main annotation efforts realizes that it is not easy to really benefit from access to the annotation efforts of other groups. The effort required to characterize discrepancies between local annotations and those produced externally often outweighs any benefits that result.
Two events of major importance have occurred:
- Both PIR and our project decided to build correspondences between IDs used by different annotation projects. The PIR effort produced BioThesaurus and the SEED effort produced the Annotation Clearinghouse. The fact that it will become trivial to reconcile IDs between the different annotation efforts will undoubtedly support rapid increases in cross-linking entries. The SEED is working with UniProt to cross-link proteins from all of our complete genomes, and I am sure similar efforts are happening between the other major annotation efforts.
- Within the Annotation Clearinghouse, a project to allow experts to assert that specific annotations are reliable (using whatever IDs they wish) has been initiated. This has led to many tens of thousands of assertions that specific annotations are highly reliable. PIR is preparing a list of assertions that they consider highly reliable, and both institutions are making these lists openly available.
To see the utility of exchanging expert assertions in a framework in which it is easy to compare the results, let me describe how we intend to use these assertions:
- We begin with a 3-column table of reliable annotations containing [ProteinID, AssertedFunction, NameOfExpert].
- We then take our IDs and construct a 2-column table [FIG-function, AssertedFunction]. This table creates a correspondence between each of our functional roles and the functional roles used by the expert making the assertion of reliability.
- We go through this correspondence table (using both tools and manual inspection) and split it into one set in which we believe both columns are essentially identical and a second set that we believe represents errors (either our own or those of the expert asserting reliability). We anticipate that in most cases the expert assertion will be accurate, which is what makes this exercise so beneficial to us.
- We take the table of essentially the same assertions and distribute it as a table of synonyms (which we consider to be a very useful resource).
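The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the protein IDs, expert names, the `fig_functions` lookup and the `normalize` rule are all hypothetical, and the crude string comparison stands in for the combination of tools and manual inspection described above.

```python
import re
from collections import defaultdict

# Hypothetical 3-column table: (ProteinID, AssertedFunction, NameOfExpert).
expert_assertions = [
    ("prot1", "Alanine racemase (EC 5.1.1.1)", "expert_a"),
    ("prot2", "alanine racemase", "expert_a"),
    ("prot3", "Thymidylate kinase", "expert_b"),
    ("prot4", "Hypothetical protein", "expert_b"),
]

# Hypothetical local annotations: ProteinID -> FIG-function.
fig_functions = {
    "prot1": "Alanine racemase (EC 5.1.1.1)",
    "prot2": "Alanine racemase (EC 5.1.1.1)",
    "prot3": "Thymidylate kinase (EC 2.7.4.9)",
    "prot4": "DNA polymerase III alpha subunit (EC 2.7.7.7)",
}


def normalize(function):
    """Assumed comparison rule: ignore case, EC numbers and extra whitespace."""
    text = re.sub(r"\(ec [\d.\-]+\)", "", function.lower())
    return " ".join(text.split())


def build_correspondence(assertions, local):
    """Build the 2-column table [FIG-function, AssertedFunction], keeping the evidence."""
    pairs = defaultdict(list)
    for protein_id, asserted, expert in assertions:
        if protein_id in local:
            pairs[(local[protein_id], asserted)].append((protein_id, expert))
    return pairs


def split_correspondence(pairs):
    """Split the pairs into likely synonyms and pairs that need manual review."""
    synonyms, to_review = [], []
    for fig_function, asserted in sorted(pairs):
        if normalize(fig_function) == normalize(asserted):
            synonyms.append((fig_function, asserted))
        else:
            to_review.append((fig_function, asserted))
    return synonyms, to_review


correspondence = build_correspondence(expert_assertions, fig_functions)
synonyms, to_review = split_correspondence(correspondence)
```

In practice the `to_review` set is where the manual effort goes; each pair there is either an error on one side or a synonym that the automatic rule failed to recognize.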
We are strongly motivated to resolve differences between our annotations and high-reliability assertions made by experts. The production of the table of synonyms both reduces the effort needed to redo such a comparison in the future and is a major asset by itself. I am confident that any serious annotation group that participates will benefit, and I believe that these exchanges will accelerate in 2009 and 2010.
I have tried to express the significance of the cycle depicted above, but I think that I failed to really convey the epiphany, so let me end by expressing it somewhat more emphatically. I believe that there will be a very rapid acceleration in the sequencing of new, complete genomes (although frequently the quality of the sequence will be far from perfect, and I am willing to characterize an occasional genome in 100 contigs as essentially complete). Groups that now try to provide accurate integrations of all (or most) complete genomes will be strained heavily. The tendency will be to go in one of two directions:
- Some will swing to completely automated approaches. This will result in rapid propagation of errors (for those portions of the cellular mechanisms that are not yet accurately characterized—which is quite a bit).
- Others will give up any attempt at comprehensive annotation and focus on accurate annotation of a slowly growing subset.
The problem with the second approach is that accurate annotation of new cellular mechanisms (i.e., the introduction of new subsystems) will increasingly depend on a comprehensive set of genomes. Comparative analysis is central to working out any of the serious difficulties, and the larger the set of accurately annotated genomes, the better the framework for careful correction.
The cycle depicted above is the only viable strategy that I know of to handle the deluge of genomes accurately. I claim that as time goes by, the SEED effort to implement the above cycle will emerge in a continuously strengthening position. Other groups will be forced to rapidly copy it, but it really was not that easy to establish, and I believe the odds are that the SEED effort will be the only group standing in 2-3 years (i.e., it will be the only group claiming both accuracy and comprehensive integration).