Basic Concepts
In the NmpdrWebsite and SeedEnvironment databases, there is a general concept of a piece of DNA. In the databases and in the SEED, this is called a feature. In the NMPDR cover pages, it is called a gene, because that term is felt to be more familiar to the general public. We will use feature in this document.
Each feature in our databases has a unique FigId. There are, however, features we will need to work with that have not been called in our databases. For these, we must rely on external names. The general concept of a feature name will be called FeatureName. A FigId is a special case of FeatureName.
The AnnotationClearinghouse contains Assertions made about the nature and purpose of features. These assertions are made using a FeatureName and come from an organization or person that we call the Source. Some sources are considered expert, and their assertions are called ExpertAssertions. Finding ExpertAssertions is a critical task.
Each FeatureName has zero or more NameTypes that indicate the databases to which it belongs. Knowing the NameType, we can convert the FeatureName to an outbound link. In most cases, the NameType can be determined by parsing the FeatureName itself. There are, however, two databases that use the same naming scheme, so a particular FeatureName may belong to one or both.
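As a minimal sketch of the parsing idea, the following Python fragment determines the candidate NameTypes of a FeatureName by pattern matching and builds one link per match. Everything in the table is an assumption made for illustration: the labels `DbA` and `DbB`, the `ab|` prefix, the FigId pattern, and the `example.org` URL templates are placeholders, not the rules actually used by the NMPDR or the SEED.

```python
import re
from typing import NamedTuple


class NameType(NamedTuple):
    """Hypothetical NameType: a label, a pattern, and a link template."""
    label: str
    pattern: re.Pattern
    url_template: str  # "{id}" is replaced by the captured identifier


# Illustrative table only. DbA and DbB deliberately share one naming
# scheme to model two databases whose names cannot be told apart.
NAME_TYPES = [
    NameType("FigId", re.compile(r"^fig\|(\d+\.\d+\.[a-z]+\.\d+)$"),
             "https://example.org/seed?feature={id}"),
    NameType("DbA", re.compile(r"^ab\|(\w+)$"), "https://example.org/a/{id}"),
    NameType("DbB", re.compile(r"^ab\|(\w+)$"), "https://example.org/b/{id}"),
]


def name_types(feature_name: str) -> list[str]:
    """Return the labels of every NameType whose pattern matches."""
    return [nt.label for nt in NAME_TYPES if nt.pattern.match(feature_name)]


def outbound_links(feature_name: str) -> list[str]:
    """Return one link per matching NameType (possibly none, possibly several)."""
    links = []
    for nt in NAME_TYPES:
        match = nt.pattern.match(feature_name)
        if match:
            links.append(nt.url_template.format(id=match.group(1)))
    return links
```

Keeping the patterns and link templates in a single table, rather than in scattered conditionals, is one way to make the zero-or-more relationship between a FeatureName and its NameTypes explicit.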
Our goal is to provide a mechanism for getting from a FeatureName to useful assertions. The assertions may be stored under a different name in the clearinghouse, and we may want to bring in assertions for features that are substantially identical to our original FeatureName.
Notions of Equivalence
There are three notions of equivalence of interest to this process.
- For many features, we will have a group of names from other databases that have been determined by PIR International to represent more or less the same thing. We call each such group a PirGroup. A FigId can only be in one PirGroup. Each additional name in the PirGroup has a clearly identified NameType, and will also only appear in one group.
- Each feature may belong to one or more SynonymGroups. A SynonymGroup contains features that represent essentially identical protein or DNA sequences. Unlike a PirGroup, the features in a SynonymGroup can be located in different places and even different Genomes.
- For each feature, we generally have several Aliases that people wish to use to search for features. Unlike PirGroups and SynonymGroups, Aliases are not organized into neat little partitions, and the same alias can have multiple meanings. For example, the common names for features are considered aliases, and these are only guaranteed to be unique within the scope of a single Genome.
The relationship between a feature and the FeatureNames in its PirGroup is required to compute the feature's assertions.
Currently, PirGroups are not represented in the database, but many of the associations that would be made by PirGroups are stored as Aliases. While aliases are not necessary to the basic task of locating assertions, we would like to take advantage of NameTypes for building links, and we would like to have the alias processor look at the PirGroups in addition to its own data. For this reason, we need to be aware of aliases when we set up the data structures.
The ability to find assertions for substantially identical features is accomplished using SynonymGroups. SynonymGroups already exist in the database.
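The three notions of equivalence differ mainly in their shape, which the following sketch tries to make concrete. A PirGroup behaves like a partition (each name maps to at most one group), SynonymGroup membership is one-to-many, and aliases form an unconstrained many-to-many mapping. The class, its method names, and the group identifiers are hypothetical; this is not how the existing database stores the data.

```python
from collections import defaultdict


class EquivalenceIndex:
    """In-memory stand-in for the three kinds of equivalence (illustrative)."""

    def __init__(self):
        self.pir_group_of = {}                    # name -> its single PirGroup id
        self.pir_members = defaultdict(set)       # PirGroup id -> names
        self.syn_groups_of = defaultdict(set)     # name -> SynonymGroup ids
        self.syn_members = defaultdict(set)       # SynonymGroup id -> names
        self.names_for_alias = defaultdict(set)   # alias -> feature names

    def add_to_pir_group(self, group_id, name):
        # A name may appear in only one PirGroup, so reject a second group.
        existing = self.pir_group_of.get(name)
        if existing is not None and existing != group_id:
            raise ValueError(f"{name} is already in PirGroup {existing}")
        self.pir_group_of[name] = group_id
        self.pir_members[group_id].add(name)

    def add_to_synonym_group(self, group_id, name):
        # A feature may belong to several SynonymGroups.
        self.syn_groups_of[name].add(group_id)
        self.syn_members[group_id].add(name)

    def add_alias(self, alias, name):
        # Aliases are not partitions: one alias may point at many features.
        self.names_for_alias[alias].add(name)

    def pir_equivalents(self, name):
        group = self.pir_group_of.get(name)
        return set() if group is None else self.pir_members[group] - {name}

    def synonym_equivalents(self, name):
        result = set()
        for group_id in self.syn_groups_of.get(name, ()):
            result |= self.syn_members[group_id]
        result.discard(name)
        return result
```

Because an alias lookup can return features from unrelated genomes, a caller would presumably treat alias matches as search candidates rather than as proof of equivalence.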
Critical Tasks
The following tasks have been identified as necessary for this project. Some of these tasks may be implemented using multiple procedure calls, and some tasks merely add complexity to other tasks. The list given here should not be construed as a list of the methods that need to be implemented.
- Given a FeatureName, return a list of equivalent FeatureNames. The equivalence could be via SynonymGroups, PirGroups, or by Alias.
- Given a list of FeatureNames, return a list of their ExpertAssertions. This will require using PirGroups, in case the Assertion is associated with a different name for the same feature.
- Given a FeatureName, return a complete list of its Assertions. In this case, we only return assertions made about the given name. It is possible to find assertions about equivalent features by traveling from the FeatureName to the equivalent names and then applying this procedure to each one. If that needs to be done quickly, we will want to upgrade its importance by giving it a separate entry in this list.
- Given a FeatureName, return a list of the applicable NameTypes.
- Given a FeatureName, return a list of hyperlinks. Most of the time, there is only one hyperlink, but some names will not have any links and a few will have multiple links.
- Given a FeatureName, return an equivalent FeatureName of the specified NameType. In this case, the equivalence can be via PirGroups or by Aliases.
- Given a list of FeatureNames and associated Assertions, return a list of the FeatureNames that have ExpertAssertions.
- Given a FeatureName, return a list of equivalent FeatureNames along with their Assertions and the assertion Sources. Equivalence in this case can be via PirGroup or SynonymGroup.
- Given an Assertion, return a hyperlink to the assertion's source document.
Data Model
The entity-relationship data model represents the underlying truth of the data. Not every entity or relationship is implemented as a database table. For example, in the current system, the notion of NameType is implemented by if statements in HtmlPm, FigPm, MapIdsPm, and ProteinPm, and by an inline hash table in AliasAnalysisPm.
Notes
- Each SynonymGroup has associated with it a DNA or protein sequence. All members of the group will have more or less the same sequence.
- Expertise is an attribute of the Source. An ExpertAssertion is an assertion from an expert source.
- The NameType includes instructions on how to turn the related FeatureNames into hyperlinks.
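The first two notes can be read as constraints on the entities. The sketch below expresses them with Python dataclasses; the class layout and field names are assumptions, and this is not the schema used by the actual databases.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Source:
    name: str
    is_expert: bool = False  # expertise is an attribute of the Source


@dataclass(frozen=True)
class Assertion:
    feature_name: str
    text: str
    source: Source

    @property
    def is_expert(self) -> bool:
        # An ExpertAssertion is simply an assertion from an expert source,
        # so no separate entity or stored flag is needed for it.
        return self.source.is_expert


@dataclass
class SynonymGroup:
    sequence: str                              # the group's DNA or protein sequence
    members: set = field(default_factory=set)  # FeatureNames in the group
```

Deriving `is_expert` from the Source means that reclassifying a source changes the status of all its assertions at once, which seems to match the intent of treating expertise as a property of the source.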