Basic Concepts

The NmpdrWebsite and SeedEnvironment databases share a general concept of a piece of DNA. In the databases and in the SEED, this is called a feature. In the NMPDR cover pages, it is called a gene, because that term is thought to be more familiar to the general public. We will use feature in this document.

Each feature in our databases has a unique FigId. There are, however, features we will need to work with that have not been called in our databases. For these, we must rely on external names. The general concept of a feature name will be called FeatureName. A FigId is a special case of FeatureName.

The AnnotationClearinghouse contains Assertions made about the nature and purpose of features. These assertions are made using a FeatureName and come from an organization or person that we call the Source. Some sources are considered expert, and their assertions are called ExpertAssertions. Finding ExpertAssertions is a critical task.

Each FeatureName has zero or more NameTypes that indicate the databases to which it belongs. Knowing the NameType, we can convert the FeatureName to an outbound link. In most cases, the NameType can be determined by parsing the FeatureName itself. There are, however, two databases that use the same naming scheme, so a particular FeatureName may belong to either or both.
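The parsing step can be sketched as a table of patterns and link templates. This is only an illustration: the type names DbA and DbB, the patterns (including the assumed shape of a FigId), and the example.org URLs are placeholders invented here, not the rules used by the real system.

```python
import re
from urllib.parse import quote

# Hypothetical NameType table. The type names, patterns, and URL templates
# are illustrative placeholders, not the rules used by the real system.
NAME_TYPES = {
    "FigId": (re.compile(r"fig\|\d+\.\d+\.[a-z]+\.\d+"), "https://example.org/seed/{name}"),
    # Two databases sharing one naming scheme: both patterns are identical.
    "DbA": (re.compile(r"[A-Z]{2}_\d+"), "https://example.org/db-a/{name}"),
    "DbB": (re.compile(r"[A-Z]{2}_\d+"), "https://example.org/db-b/{name}"),
}


def name_types(feature_name):
    """Return every NameType whose pattern matches the whole name."""
    return [t for t, (pattern, _) in NAME_TYPES.items() if pattern.fullmatch(feature_name)]


def outbound_links(feature_name):
    """Build one link per matching NameType; unknown names yield no links."""
    escaped = quote(feature_name, safe="")
    return [NAME_TYPES[t][1].format(name=escaped) for t in name_types(feature_name)]
```

A name that matches the shared scheme comes back with both types, which is why the functions return lists rather than a single value.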

Our goal is to provide a mechanism for getting from a FeatureName to useful assertions. The assertions may be stored under a different name in the clearinghouse, and we may want to bring in assertions for features that are substantially identical to our original FeatureName.

Notions of Equivalence

There are three notions of equivalence of interest to this process.

  • For many features, we will have a group of names from other databases that have been determined by PIR International to represent more or less the same thing. We call each such group a PirGroup. A FigId can be in only one PirGroup. Each additional name in the PirGroup has a clearly identified NameType, and will also appear in only one group.
  • Each feature may belong to one or more SynonymGroups. A SynonymGroup contains features that represent essentially identical protein or DNA sequences. Unlike the names in a PirGroup, the features in a SynonymGroup can be located in different places and even in different Genomes.
  • For each feature, we generally have several Aliases that people wish to use to search for features. Unlike PirGroups and SynonymGroups, Aliases are not organized into neat little partitions, and the same alias can have multiple meanings. For example, the common names for features are considered aliases, and these are only guaranteed to be unique within the scope of a single Genome.
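The contrast between partitioned PirGroups and free-form Aliases can be made concrete with a small in-memory sketch. The class names, method names, and identifiers below are hypothetical, and SynonymGroups are left out for brevity.

```python
from collections import defaultdict


class PirGroupIndex:
    """Partition of names: a name may appear in at most one group."""

    def __init__(self):
        self._group_of = {}                # name -> group id
        self._members = defaultdict(set)   # group id -> names

    def add(self, group_id, name):
        current = self._group_of.get(name)
        if current is not None and current != group_id:
            raise ValueError(f"{name} already belongs to group {current}")
        self._group_of[name] = group_id
        self._members[group_id].add(name)

    def equivalents(self, name):
        """Other names in the same group; empty if the name is ungrouped."""
        group_id = self._group_of.get(name)
        if group_id is None:
            return set()
        return self._members[group_id] - {name}


class AliasIndex:
    """Many-to-many lookup: one alias may point at several features."""

    def __init__(self):
        self._features = defaultdict(set)  # alias -> feature names

    def add(self, alias, feature_name):
        self._features[alias].add(feature_name)

    def lookup(self, alias):
        return set(self._features.get(alias, ()))
```

The PirGroup index rejects a name that is already in another group, while the alias index accepts the same alias for features in different genomes, so an alias lookup has to be treated as ambiguous.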

The relationship between a feature and the FeatureNames in its PirGroup is required to compute the feature's assertions.

Currently, PirGroups are not represented in the database, but many of the associations that would be made by PirGroups are stored as Aliases. While aliases are not necessary to the basic task of locating assertions, we would like to take advantage of NameTypes for building links, and we would like to have the alias processor look at the PirGroups in addition to its own data. For this reason, we need to be aware of aliases when we set up the data structures.

Finding assertions for substantially identical features is accomplished using SynonymGroups, which already exist in the database.

Critical Tasks

The following tasks have been identified as necessary for this project. Some of these tasks may be implemented using multiple procedure calls, and some tasks merely add complexity to other tasks. The list given here should not be construed as a list of the methods that need to be implemented.

  • Given a FeatureName, return a list of equivalent FeatureNames. The equivalence could be via SynonymGroups, PirGroups, or Aliases.
  • Given a list of FeatureNames, return a list of their ExpertAssertions. This will require using PirGroups, in case the Assertion is associated with a different name for the same feature.
  • Given a FeatureName, return a complete list of its Assertions. In this case, we only return assertions made using the given name. It is possible to find assertions about equivalent features by traveling from the FeatureName to the equivalent names and then applying this procedure to each one. If that needs to be done quickly, we will want to upgrade its importance by giving it a separate entry in this list.
  • Given a FeatureName, return a list of the applicable NameTypes.
  • Given a FeatureName, return a list of hyperlinks. Most of the time, there is only one hyperlink, but some names will not have any links and a few will have multiple links.
  • Given a FeatureName, return an equivalent FeatureName of the specified NameType. In this case, the equivalence can be via PirGroups or Aliases.
  • Given a list of FeatureNames and associated Assertions, return a list of the FeatureNames that have ExpertAssertions.
  • Given a FeatureName, return a list of equivalent FeatureNames along with their Assertions and the assertion Sources. Equivalence in this case can be via PirGroup or SynonymGroup.
  • Given an Assertion, return a hyperlink to the assertion's source document.
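As a rough sketch of how the first two tasks fit together, the functions below expand a name through its groups and then filter assertions by source. The function names, the tuple layout for assertions, and the plain sets standing in for database tables are all assumptions made for illustration.

```python
def equivalent_names(name, groups):
    """Return the name plus every member of each group that contains it.

    groups is an iterable of sets of names (PirGroups or SynonymGroups).
    """
    names = {name}
    for group in groups:
        if name in group:
            names |= group
    return names


def expert_assertions(names, pir_groups, assertions, expert_sources):
    """Collect expert assertions for the names and their PirGroup equivalents.

    assertions is a list of (feature_name, source, text) tuples;
    expert_sources is the set of sources considered expert.
    """
    wanted = set()
    for name in names:
        wanted |= equivalent_names(name, pir_groups)
    return [a for a in assertions if a[0] in wanted and a[1] in expert_sources]
```

Because the lookup goes through the PirGroup first, an expert assertion stored under an external name is still found when the caller starts from the FigId.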

Data Model

The entity-relationship data model represents the underlying truth of the data. Not every entity or relationship is implemented as a database table. For example, in the current system, the notion of NameType is implemented by if-statements in HtmlPm, FigPm, MapIdsPm, and ProteinPm, and by an inline hash table in AliasAnalysisPm.

[Figure: AliasErModel.png (entity-relationship data model)]

Notes

  • Each SynonymGroup has a DNA or protein sequence associated with it. All members of the group will have more or less the same sequence.
  • Expertise is an attribute of the Source. An ExpertAssertion is an assertion from an expert source.
  • The NameType includes instructions on how to turn the related FeatureNames into hyperlinks.
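The first two notes can be expressed as a minimal sketch of the entities involved. The field names are assumptions, and the real model in the diagram may carry more attributes; the point is only that expertise lives on the Source and that an ExpertAssertion is derived rather than stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str
    is_expert: bool = False  # expertise is an attribute of the Source


@dataclass(frozen=True)
class Assertion:
    feature_name: str
    source: Source
    text: str

    @property
    def is_expert(self):
        # Derived from the Source rather than stored on the assertion.
        return self.source.is_expert


@dataclass(frozen=True)
class SynonymGroup:
    sequence: str        # DNA or protein sequence shared by the members
    members: frozenset   # FeatureNames with more or less this sequence
```

Keeping the expert flag on the Source means that changing the status of one source changes the status of all of its assertions at once.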

SequencingForm
Sequence 000400
Summary Notes on the notion of equivalence and how it relates to features and genes
Topic revision: r4 - 11 Oct 2008 - 23:58:14 - BruceParrello
 
NMPDR is a collaboration among researchers from the Computation Institute of the University of Chicago, the Fellowship for Interpretation of Genomes (FIG), Argonne National Laboratory, and the National Center for Supercomputing Applications (NCSA) at the University of Illinois. NMPDR is funded by the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under Contract HHSN266200400042C. Banner images are copyright © Dennis Kunkel.