The NMPDR Website and SEED Environment databases share a general concept of a piece of DNA. In the databases and in the SEED, this is called a feature. In the NMPDR, it is called a gene, because that term is more familiar to the general public. (The correct term is locus, although CDS is also used in some places.) We will use feature in this document.
Each feature in our databases has a unique FIG ID. There are, however, features we will need to work with that have not been called in our databases. For these, we must rely on external names. These are often called Accession Numbers though in fact most of them are non-numeric. A more precise term that has recently come into vogue is DBXREF, though there are also common names (e.g. dnaK, SCO0132) that don't belong to a database. For the purposes of this document, we'll use the term identifier to refer to something that could be a local identifier (FIG ID), common name, or DBXREF.
The Annotation Clearinghouse contains assertions made about the nature and purpose of features. These assertions are made using an identifier (as opposed to a FIG ID) and come from an organization or person that we call the source. Some sources are considered expert, and their assertions are called expert assertions. Finding expert assertions is a critical task.
Each identifier has one or more alias types that indicate the databases to which it belongs or what type of common name it is. For a DBXREF, the alias type tells us how to convert the identifier to an outbound link. In most cases, the alias type can be determined by parsing the identifier itself. There are, however, several databases that use the same naming schemes in different ways, so the same identifier may mean two different things depending on the type.
Each identifier has a normalized form and a natural form. The normalized form is used internally and usually contains a prefix that indicates the type. For example, LocusTag:CJJ26094_0128 is a locus tag and uni|A3YQ44 is a UniProt ID. The natural form is the way the identifier is used when the type is known. So, for example, CJJ26094_0128 is the natural form of the example locus tag and A3YQ44 is the natural form of the example UniProt ID.
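The relationship between the two forms can be illustrated with a minimal Python sketch. It is not the NMPDR code. The prefix table holds only the two prefixes shown in the examples above, the type labels LocusTag and UniProt are names chosen for illustration, and the function names split_normalized and normalize are hypothetical.

```python
from typing import NamedTuple

# Prefix -> alias type. Only these two prefixes appear in the text above;
# storing them in a lookup table like this is an assumption of the sketch.
PREFIXES = {
    "LocusTag:": "LocusTag",
    "uni|": "UniProt",
}


class Identifier(NamedTuple):
    alias_type: str  # illustrative type label
    natural: str     # the identifier as used when the type is known


def split_normalized(normalized: str) -> Identifier:
    """Split a normalized identifier into its alias type and natural form."""
    for prefix, alias_type in PREFIXES.items():
        if normalized.startswith(prefix):
            return Identifier(alias_type, normalized[len(prefix):])
    # No known prefix: treat the whole string as the natural form.
    return Identifier("miscellaneous", normalized)


def normalize(alias_type: str, natural: str) -> str:
    """Rebuild the normalized form from an alias type and a natural form."""
    for prefix, known_type in PREFIXES.items():
        if known_type == alias_type:
            return prefix + natural
    # Types without a prefix in the table are returned unchanged.
    return natural


if __name__ == "__main__":
    print(split_normalized("LocusTag:CJJ26094_0128"))
    print(split_normalized("uni|A3YQ44"))
    print(normalize("UniProt", "A3YQ44"))
```

Keeping the prefix table in one place means that splitting and rebuilding stay consistent with each other, which matters when the same natural form can belong to more than one type.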
Our goal is to provide a mechanism for getting from an identifier to the appropriate NMPDR page. This has proven to be a very complicated task over the past few years.
Notions of Equivalence
There are three notions of equivalence of interest to this process.
For many features, we will have a group of names from other databases that have been determined by PIR International to represent more or less the same thing. We call each such group a PIR group. Each name in a PIR group has a clearly-identified alias type.
Each feature may belong to one or more synonym groups. A synonym group contains features that represent essentially identical protein or DNA sequences. Unlike a PIR group, the features in a synonym group can be located in different places and even different Genomes.
For each feature, we generally have several Aliases that people wish to use to search for features. Unlike PIR groups and synonym groups, Aliases are not organized into neat little partitions, and it is possible the same alias can have multiple meanings. For example, the common names for features are considered aliases, and these are only guaranteed to be unique within the scope of a single Genome.
Each of these three notions is assigned a confidence grade. The highest confidence grade (A) is given to features from the curated PIR data. Aliases have a confidence grade of B. The lowest confidence grade (C) is given to synonym groups. For each identifier in the system, we keep a list of the features having the highest grade for that identifier.
Synonym groups get the lowest confidence because they indicate an equivalence at the protein level, not the feature level. The grading system ensures that if we are presented with an identifier that is connected to a specific feature, we return that feature; we only return all features with a matching protein sequence if the specific location belonging to the identifier is unknown to us.
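The grading rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual implementation: the input is assumed to be a flat list of (identifier, feature, grade) rows, the function name best_mappings is hypothetical, and the feature names in the example are placeholders rather than real FIG IDs.

```python
# Lower rank means higher confidence: A (PIR) beats B (alias) beats C (synonym group).
GRADE_RANK = {"A": 0, "B": 1, "C": 2}


def best_mappings(rows):
    """Keep, for each identifier, only the features carrying its best grade.

    rows: iterable of (identifier, feature, grade) tuples (assumed layout).
    Returns {identifier: (grade, sorted list of features)}.
    """
    best = {}  # identifier -> (grade, set of features)
    for identifier, feature, grade in rows:
        current = best.get(identifier)
        if current is None or GRADE_RANK[grade] < GRADE_RANK[current[0]]:
            # First mapping seen, or a strictly better grade: start over.
            best[identifier] = (grade, {feature})
        elif grade == current[0]:
            # Same grade as the current best: keep both features.
            current[1].add(feature)
        # A worse grade is ignored.
    return {ident: (grade, sorted(feats)) for ident, (grade, feats) in best.items()}


if __name__ == "__main__":
    rows = [
        ("uni|A3YQ44", "feature-1", "A"),  # specific feature from PIR data
        ("uni|A3YQ44", "feature-2", "C"),  # same protein sequence elsewhere
        ("dnaK", "feature-1", "B"),        # common name used in two genomes
        ("dnaK", "feature-9", "B"),
    ]
    for ident, (grade, feats) in best_mappings(rows).items():
        print(ident, grade, feats)
```

In this example the identifier with a grade A mapping resolves to that single feature and its synonym-group match is dropped, while the ambiguous common name keeps both grade B features.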
Alias Crunching
The Alias Cruncher script (AliasCrunchPl) merges the files containing the three types of identifier mappings and outputs the best mappings of each identifier for use by the NMPDR databases. In the Sprout Database, the alias information appears in several different places. The more streamlined Sapling Database collects all the identifiers into a single data object.
The cruncher uses the AliasAnalysisPm module to process all the incoming identifiers. It currently recognizes 10 different identifier types. Identifiers of unknown type are lumped together under the name miscellaneous. The 10 recognized types are described in the table below.