The Sprout Database

The Sprout database contains 34.6 billion characters of data and an additional 2.8 gigabytes of search indexes. The main tables are shown in the diagram below. Rectangles represent entities (sometimes also called objects) and diamonds represent relationships. Ovals represent external data servers. Hold the mouse over a shape to see a short description. A thick connecting line indicates a multiply-occurring relationship in that direction; a thin one indicates a singly-occurring relationship in that direction. Clicking on most shapes will take you to more detailed information.


Introduction

The Sprout database contains the genetic data for all complete organisms in the SeedEnvironment. The data that is not in Sprout (attributes, similarities, and couplings) is stored on external servers available to the Sprout software. The Sprout database is reloaded approximately once per month. There is significant redundancy in the Sprout database because it has been optimized for searching. In particular, the Feature table contains an extra copy of the feature's functional role and a list of possible search terms.

Data Types

Type          Indexable  Sort        Pos  Format     Description
boolean       yes        numeric     1    scalar     Boolean value (0 or 1).
char          yes        alphabetic  1    scalar     Single-character value.
counter       yes        numeric     1    scalar     Large, unsigned integer, ranging from 0 to 9 quintillion.
date          yes        numeric     1    scalar     Date and time stamp, in seconds since 1970.
dna           no         alphabetic  5    scalar     Long DNA sequence, compressed.
float         yes        numeric     1    scalar     Floating-point number, approximately 15 significant digits, from 10^-308 to 10^+308.
hash-string   yes        alphabetic  1    scalar     A Base64 Digest MD5 code.
int           yes        numeric     1    scalar     Standard signed integers, ranging from approximately -2 billion to 2 billion.
link          prefix     alphabetic  3    HyperLink  Short character string, with optional associated URL.
long-string   yes        alphabetic  4    scalar     Character string, up to approximately 500 characters long, fully indexable.
semi-boolean  yes        alphabetic  1    scalar     Ternary boolean value: Y, N, or ?.
string        yes        alphabetic  2    scalar     Short character string, from 0 to approximately 250 characters.
text          prefix     alphabetic  4    scalar     Long character string, from 0 to approximately 16 million characters, not generally indexable.
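
As a rough illustration of two of the scalar types above, the sketch below converts a date value (seconds since 1970) to a readable timestamp and builds a hash-string-style key. It assumes that dates are counted in UTC and that "Base64 Digest MD5" means the Base64 encoding of the raw MD5 digest with the trailing padding removed (22 characters). The function names and sample input are hypothetical and are not part of Sprout.

```python
import base64
import hashlib
from datetime import datetime, timezone


def date_to_iso(seconds: int) -> str:
    """Render a date value (seconds since 1970) as an ISO timestamp.

    Assumption: the epoch is interpreted in UTC.
    """
    return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()


def hash_string(text: str) -> str:
    """Return the unpadded Base64 form of the MD5 digest of the text.

    Assumption: the trailing "==" padding is dropped, leaving 22 characters.
    """
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


print(date_to_iso(0))
print(len(hash_string("example key")))
```

What text is actually hashed to form a given key (for example, an SSCell ID) is not stated in this document, so the input here is only a placeholder.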

Entities

Annotation

An annotation contains supplementary information about a feature. The most important type of annotation is the assignment of a functional role; however, other types of annotations are also possible.

Name Type Notes
id string Unique identifier for this Annotation.
annotation text Text of the annotation.
time date Date and time of the annotation.

CDD

A CDD is a protein domain designator. It represents the shape of a molecular unit on a feature's protein. The ID is a six-digit string assigned by the public Conserved Domain Database. A CDD can occur on multiple features and a feature generally has multiple CDDs.

Name Type Notes
id string Unique identifier for this CDD.

CellLocation

A section of the cell in which a protein might be found. This includes the cell wall or membrane, outside the cell, inside the cell, and so forth.

Name Type Notes
id string Unique identifier for this CellLocation.

Compound

A compound is a chemical that participates in a reaction. All compounds have a unique ID and may also have one or more names.

Name Type Notes
id string Unique identifier for this Compound.
label string Name used in reaction display strings. This is the same as the name possessing a priority of 1, but it is placed here to speed up the query used to create the display strings.

CompoundCAS

This entity represents the Chemical Abstract Service ID for a compound. Each Compound has at most one CAS ID.

Name Type Notes
id string Unique identifier for this CompoundCAS.

CompoundName

A compound name is a common name for the chemical represented by a compound.

Name Type Notes
id string Unique identifier for this CompoundName.

Contig

A contig is a contiguous run of residues. The contig's ID consists of the genome ID followed by a name that identifies which contig this is for the parent genome. As is the case with all keys in this database, the individual components are separated by a period. A contig can contain over a million residues. For performance reasons, therefore, the contig is split into multiple pieces called sequences. The sequences contain the characters that represent the residues as well as data on the quality of the residue identification.

Name Type Notes
id string Unique identifier for this Contig.
length counter Number of base pairs in this contig.
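
The splitting of a contig into sequences can be sketched as follows. The chunk size, the 1-based begin index, the sample contig ID, and the function name are all assumptions made for illustration. This document only establishes that key components are separated by periods and that a sequence is keyed by its contig ID followed by its begin point.

```python
from typing import Iterator, Tuple


def split_contig(contig_id: str, residues: str,
                 chunk_size: int = 4000) -> Iterator[Tuple[str, str]]:
    """Yield (sequence_key, residues) pairs for one contig.

    Assumptions: chunk_size is arbitrary and begin points are 1-based.
    """
    for offset in range(0, len(residues), chunk_size):
        key = f"{contig_id}.{offset + 1}"
        yield key, residues[offset:offset + chunk_size]


# Hypothetical contig ID and a tiny chunk size so the output is easy to read.
pieces = list(split_contig("100226.1.contigA", "acgtacgtac", chunk_size=4))
for key, chunk in pieces:
    print(key, chunk)
```

Concatenating the chunks in key order recovers the original contig, which is why an application can work with one piece at a time.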

Diagram

A functional diagram describes a network of chemical reactions, often comprising a single subsystem. A diagram is identified by a short name and contains a longer descriptive name. The actual diagram shows which functional roles guide the reactions along with the inputs and outputs; the database, however, only indicates which roles belong to a particular diagram's map.

Name Type Notes
id string Unique identifier for this Diagram.
name text Descriptive name of this diagram.

ExternalDatabase

An external database identifies a biological database surveyed by PIR International as part of an effort to determine which features are essentially identical between bioinformatics organizations. Each feature in the database will have zero or more corresponding IDs that are captured from the PIR data. Each corresponding ID is represented in a relationship between an external database and the feature itself.

Name Type Notes
id string Unique identifier for this ExternalDatabase.

Feature

A feature (sometimes also called a "gene") is a part of a genome that is of special interest. Features may be spread across multiple contigs of a genome, but never across more than one genome. Features can be assigned to roles via spreadsheet cells, and are the targets of annotation. Each feature in the database has a unique FigId.

Name Type Notes
id string Unique identifier for this Feature.
assignment text Default functional assignment for this feature.
assignment-maker string name of the user who made the functional assignment
assignment-quality char quality of the functional assignment, usually a space, but may be W (indicating weak) or X (indicating experimental)
conserved-neighbors int number of coupled features
feature-type string Code indicating the type of this feature. Among the codes currently supported are "peg" for a protein encoding gene, "bs" for a binding site, "opr" for an operon, and so forth.
in-genbank boolean TRUE if a feature can be found in GenBank, else FALSE
isoelectric-point float pH in the surrounding medium at which the charge on a protein is neutral. If the pH of the medium is lower than this value, the protein will have a net positive charge. If the pH of the medium is higher, then the protein will have a net negative charge.
keywords text This is a list of search keywords for the feature. It includes the functional assignment, subsystem roles, and special properties.
location-string text Location of the feature, expressed as a comma-delimited list of Sprout location strings. This gives us a fast mechanism for extracting the feature location. Otherwise, we have to painstakingly paste together the IsLocatedIn records, which are themselves designed to help look for features in a particular region rather than to find the location of a feature.
locked boolean TRUE if a feature's assignment is locked. A locked feature's functional role cannot be changed by automated programs.
molecular-weight float Molecular weight of this feature's protein, in daltons. A weight of 0 indicates that no protein is created.
sequence-length counter Number of base pairs in this feature.
signal-peptide string The signal peptide location for this feature. This is expressed as start and end numbers with a hyphen for the relevant amino acids. So, "1-22" would indicate a signal peptide at the beginning of the feature's protein and extending through 22 amino acid positions. An empty string means no signal peptide is present.
similar-to-human boolean TRUE if this feature generates a protein that is similar to one found in humans, else FALSE
transmembrane-domain-count int number of sections in the feature's protein that become embedded in the cell membrane
transmembrane-map text A map indicating which sections of a protein will be embedded in a membrane. This is expressed as a comma-separated list of start and end numbers with hyphens for the relevant amino acids. So, "10-12, 40-60" would indicate that there are two sections of the protein that become embedded in a membrane: the 10th through 12th amino acids, and the 40th through the 60th. An empty string means no transmembrane regions are known.
ec string array An EC number associated with this feature.
essential link array A value indicating the essentiality of the feature, coded as HTML. In most cases, this will be a word describing whether the essentiality is confirmed (essential) or potential (potential-essential), hyperlinked to the document from which the essentiality was curated. If a feature is not essential, this field will have no values; otherwise, it may have multiple values.
iedb link array A value indicating whether or not the feature can be found in the Immune Epitope Database. If the feature has not been matched to that database, this field will have no values. Otherwise, it will have an epitope name and/or sequence, hyperlinked to the database.
link text array Web hyperlink for this feature. A feature can have no hyperlinks or it can have many. The links are to other websites that have useful information about the gene that the feature represents, and are coded as raw HTML.
translation text array (optional) A translation of this feature's residues into protein character codes, formed by concatenating the pieces of the feature together. Only protein encoding genes have translations.
upstream-sequence text array Upstream sequence for the feature. This includes residues preceding the feature as well as some of the feature's initial residues.
virulent link array A value indicating the virulence of the feature, coded as HTML. In most cases, this will be a phrase or SA number hyperlinked to the document from which the virulence information was curated. If the feature is not virulent, this field will have no values; otherwise, it may have multiple values.
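
The signal-peptide and transmembrane-map fields share the same start-end notation, so a single parser can handle both. The following is a minimal sketch. The function name is hypothetical, and it assumes positions are 1-based and inclusive, as the "10th through 12th amino acids" wording suggests.

```python
from typing import List, Tuple


def parse_ranges(value: str) -> List[Tuple[int, int]]:
    """Parse "10-12, 40-60" into [(10, 12), (40, 60)].

    An empty string yields an empty list (no regions known).
    """
    ranges = []
    for piece in value.split(","):
        piece = piece.strip()
        if not piece:
            continue
        start, end = piece.split("-")
        ranges.append((int(start), int(end)))
    return ranges


regions = parse_ranges("10-12, 40-60")
print(regions)
# Total number of amino acids covered, counting both endpoints.
print(sum(end - start + 1 for start, end in regions))
```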

FeatureAlias

Alternative names for features. A feature can have many aliases. In general, each alias corresponds to only one feature, but there are many exceptions to this rule.

  • FeatureAlias IsAliasOf Feature (many-to-many).

Name Type Notes
id string Unique identifier for this FeatureAlias.

Genome

A Genome contains the sequence data for a particular individual organism.

Name Type Notes
id float Unique identifier for this Genome.
complete boolean TRUE if the genome is complete, else FALSE
contigs int Number of contigs for this organism.
dna-size counter number of base pairs in the genome
endospore semi-boolean Y/N/? flag indicating whether or not this organism produces endospores.
gc-content float Percentage of the genome that consists of G and C base pairs.
genus string Genus of the relevant organism.
gram-stain semi-boolean Gram stain behavior of organism: positive, negative, or unknown.
habitat string Preferred habitat of organism.
motility semi-boolean Y/N/? flag indicating whether or not this organism can move under its own power.
optimal-temperature-range string Indication of this organism's behavior relating to environmental temperature.
oxygen string Indication of this organism's behavior relating to environmental oxygen.
pathogenic semi-boolean Y/N/? flag indicating whether or not this organism is pathogenic.
pegs int Number of protein encoding genes for this organism
primary-group string The primary NMPDR group for this organism. There is always exactly one NMPDR group per organism (either based on the organism name or a default value for supporting genomes). In general, more data is kept on organisms in NMPDR groups than on supporting organisms.
rnas int Number of RNA features found for this organism.
salinity string Indication of this organism's behavior relating to environmental salinity.
scientific-name string Scientific name of this genome, usually consisting of the genus, species, and unique characterization.
species string Species of the relevant organism.
taxonomy text The taxonomy string contains the full taxonomy of the organism, with individual elements separated by semi-colons (and optional white space), starting with the domain and ending with the disambiguated genus and species (which is the organism's scientific name plus an identifying string).
temperature-max float Maximum optimal temperature for this organism, in degrees Celsius.
temperature-min float Minimum optimal temperature for this organism, in degrees Celsius.
unique-characterization string The unique characterization identifies the particular organism instance from which the genome is taken. It is possible to have in the database more than one genome for a particular species, and every individual organism has variations in its DNA.
version string Version string for this genome, generally consisting of the genome ID followed by a period and a string of digits.
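
The taxonomy field can be broken into its levels with a simple split. This is a sketch only: the function name and the sample taxonomy string are hypothetical, and the only rules taken from the description above are the semicolon separator, the optional white space, and the ordering from domain to organism.

```python
from typing import List


def parse_taxonomy(taxonomy: str) -> List[str]:
    """Split a taxonomy string on semicolons, dropping optional white space."""
    return [part.strip() for part in taxonomy.split(";") if part.strip()]


# Hypothetical taxonomy string for illustration only.
levels = parse_taxonomy(
    "Bacteria; Proteobacteria ;Gammaproteobacteria; Escherichia coli K12"
)
print(levels[0])   # first element: the domain
print(levels[-1])  # last element: the disambiguated genus and species
```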

GenomeSubset

A genome subset is a named collection of genomes that participate in a particular subsystem. The subset names are generally very short, non-unique strings. The ID of the parent subsystem is prefixed to the subset ID in order to make it unique.

Name Type Notes
id string Unique identifier for this GenomeSubset.

Host

A host is a particular type of creature in which an organism has pathogenic behavior. Hosts can be specific (Human) or general (Animal).

Name Type Notes
id string Unique identifier for this Host.

Keyword

A keyword is a word that can be used to search the feature table. This entity contains the keyword's stem, its phonetic form, and the number of features that can be found by searching for the word.

Name Type Notes
id string Unique identifier for this Keyword.
phonex string A phonex is a string that identifies the phonetic characteristics of the word stem. This can be used to find alternative spellings if a matching word is not present.
stem string The stem of a keyword is a normalized form that is independent of parts of speech. The actual keywords stored in the database search index are stems.

Ligand

A Ligand is a chemical of interest in computing docking energies against a PDB. The ID of the ligand is an 8-digit ID number in the ZINC database.

Name Type Notes
id string Unique identifier for this Ligand.
name long-string Chemical name of this ligand.

PDB

A PDB is a protein data bank entry containing information that can be used to determine the shape of the protein and the energies required to dock with it. The ID is the four-character name used on the PDB web site.

Name Type Notes
id string Unique identifier for this PDB.
docking-count int The number of ligands that have been docked against this PDB.

Property

A property is a type of assertion that could be made about the properties of a particular feature. Each property instance is a key/value pair and can be associated with many different features. Conversely, a feature can be associated with many key/value pairs, even some that notionally contradict each other. For example, there can be evidence that a feature is essential to the organism's survival and evidence that it is superfluous.

Name Type Notes
id int Unique identifier for this Property.
property-name string Name of this property.
property-value text Value associated with this property. For each property name, there must be a property record for all of its possible values.

ProteinFamily

A protein family represents a group of proteins with related functions. Some protein families are downloaded from the PFAM database and some are FIGfams. The keys of the PFAM families all begin with the letters PF, and the keys of the FIGfams begin with the letters FIG.

Name Type Notes
id string Unique identifier for this ProteinFamily.
common-name string array Ontological name for the protein family. Not all families have ontological names.

Reaction

A reaction is a chemical process catalyzed by a protein. The reaction ID is generally a small number preceded by a letter.

Name Type Notes
id string Unique identifier for this Reaction.
rev boolean TRUE if this reaction is reversible, else FALSE
url string array HTML string containing a link to a web location that describes the reaction. This field is optional.

Role

A role describes a biological function that may be fulfilled by a feature. One of the main goals of the database is to record the roles of the various features.

Name Type Notes
id string Unique identifier for this Role.

RoleEC

EC code for a role.

Name Type Notes
id string Unique identifier for this RoleEC.

RoleSubset

A role subset is a named collection of roles in a particular subsystem. The subset names are generally very short, non-unique strings. The ID of the parent subsystem is prefixed to the subset ID in order to make it unique.

Name Type Notes
id string Unique identifier for this RoleSubset.

SSCell

Part of the process of subsystem annotation of features is creating a spreadsheet of genomes and roles to which features are assigned. A spreadsheet cell represents one of the positions on the spreadsheet.

Name Type Notes
id hash-string Unique identifier for this SSCell.
column-number int Column number of this cell. This value is put here to improve the performance of an essential query.

Scenario

A scenario is used to verify the validity of subsystem assignments. Each scenario converts input compounds to output compounds using reactions. The scenario may use all of the reactions controlled by a subsystem or only some, and may also incorporate additional reactions.

Name Type Notes
id string Unique identifier for this Scenario.

Sequence

A sequence is a continuous piece of a contig. Contigs are split into sequences so that we don't have to have the entire contig in memory when we are manipulating it. The key of the sequence is the contig ID followed by the index of the begin point.

Name Type Notes
id string Unique identifier for this Sequence.
quality-vector text String describing the quality data for each base pair. Individual values are separated by periods. Each value is the negative base-10 exponent of the probability of error. Thus, for example, a quality of 30 indicates the probability of error is 10^-30. A higher quality number indicates a better chance of a correct match. It is possible that the quality data is not known for a sequence. If that is the case, the quality vector will contain the string "unknown".
sequence dna String consisting of the residues (base pairs). Each residue is described by a single character in the string.
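
A quality vector can be turned into per-residue error probabilities as sketched below. The function name and the sample vector are hypothetical. The conversion follows the definition given in the quality-vector note (a quality of q means an error probability of 10^-q), and the "unknown" marker is mapped to None.

```python
from typing import List, Optional


def error_probabilities(quality_vector: str) -> Optional[List[float]]:
    """Convert a period-separated quality vector to error probabilities.

    Returns None when the vector is the literal string "unknown".
    """
    if quality_vector == "unknown":
        return None
    return [10.0 ** -int(q) for q in quality_vector.split(".")]


probs = error_probabilities("30.20.2")
print(probs)
print(error_probabilities("unknown"))
```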

Source

A source describes a place from which genome data was taken. This can be an organization or a paper citation.

Name Type Notes
id string Unique identifier for this Source.
description text Description of the source. The description can be a street address or a citation.
URL string array URL of the paper cited or of the organization's web site. This field is optional.

SproutUser

A user is a person who can make annotations and view data in the database. The user object is keyed on the user's login name.

Name Type Notes
id string Unique identifier for this SproutUser.
description string Full name or description of this user.

Subsystem

A subsystem is a collection of roles that work together in a cell. Identification of subsystems is an important tool for recognizing parallel genetic features in different organisms.

Name Type Notes
id string Unique identifier for this Subsystem.
curator string Name of the person currently in charge of the subsystem.
description text Description of the subsystem's function in the cell.
notes text Descriptive notes about the subsystem.
version int Version number for the subsystem. This value is incremented each time the subsystem is backed up.
classification string array Classification string, colon-delimited. This string organizes the subsystems into a hierarchy.
hope-curation-notes text array Text description of how the scenarios were curated.

SynonymGroup

A synonym group represents a group of features. Features that represent substantially identical proteins or DNA sequences are mapped to the same synonym group, and this information is used to expand similarities.

Name Type Notes
id string Unique identifier for this SynonymGroup.

Relationships

Catalyzes

  • Each Role relates to multiple Reactions.
  • Each Reaction relates to multiple Roles
  • Converse name is IsCatalyzedBy.
This relationship connects a role to the reactions it catalyzes. The purpose of a role is to create proteins that trigger certain chemical reactions. A single reaction can be triggered by many roles, and a role can trigger many reactions.

Name Type Notes
from-link string id of the source Role.
to-link string id of the target Reaction.

ComesFrom

  • Each Genome relates to multiple Sources.
  • Each Source relates to multiple Genomes
This relationship connects a genome to the sources that mapped it. A genome can come from a single source or from a cooperation among multiple sources.

Name Type Notes
from-link float id of the source Genome.
to-link string id of the target Source.

ConsistsOfGenomes

  • Each GenomeSubset relates to multiple Genomes.
  • Each Genome relates to multiple GenomeSubsets
This relationship connects a subset to the genomes that it covers. A subset is, essentially, a named group of genomes participating in a specific subsystem, and this relationship effects that. Note that while a genome may belong to many subsystems, a subset belongs to only one subsystem, and all genomes in the subset must have that subsystem in common.

Name Type Notes
from-link string id of the source GenomeSubset.
to-link float id of the target Genome.

ConsistsOfRoles

  • Each RoleSubset relates to multiple Roles.
  • Each Role relates to multiple RoleSubsets
This relationship connects a role subset to the roles that it covers. A subset is, essentially, a named group of roles belonging to a specific subsystem, and this relationship effects that. Note that while a role may belong to many subsystems, a subset belongs to only one subsystem, and all roles in the subset must have that subsystem in common.

Name Type Notes
from-link string id of the source RoleSubset.
to-link string id of the target Role.

ContainsFeature

  • Each SSCell relates to multiple Features.
  • Each Feature relates to multiple SSCells
This relationship connects a subsystem's spreadsheet cell to the features assigned to it.

Name Type Notes
cluster-number int ID of this feature's cluster. Clusters represent families of related proteins participating in a subsystem.
from-link hash-string id of the source SSCell.
to-link string id of the target Feature.

DocksWith

  • Each PDB relates to multiple Ligands.
  • Each Ligand relates to multiple PDBs
Indicates that a docking result exists between a PDB and a ligand. The docking result describes the energy required for the ligand to dock with the protein described by the PDB. A lower energy indicates the ligand has a good chance of disabling the protein. At the current time, only the best docking results are kept.

Name Type Notes
electrostatic-energy float Docking energy in kcal/mol that results from the movement of electrons (electrostatic force) between the PDB and the ligand.
from-link string id of the source PDB.
reason string Indication of the reason for determining the docking result. A value of Random indicates the docking was attempted as a part of a random survey used to determine the docking characteristics of the PDB. A value of Rich indicates the docking was attempted because a low-energy docking result was predicted for the ligand with respect to the PDB.
to-link string id of the target Ligand.
tool string Name of the tool used to produce the docking result.
total-energy float Total energy required for the ligand to dock with the PDB protein, in kcal/mol. A negative value means energy is released.
vanderwaals-energy float Docking energy in kcal/mol that results from the geometric fit (Van der Waals force) between the PDB and the ligand.

ExcludesReaction

  • Each Scenario relates to multiple Reactions.
  • Each Reaction relates to multiple Scenarios
This relationship connects a scenario to reactions of the parent subsystem that do not participate in it.

Name Type Notes
from-link string id of the source Scenario.
to-link string id of the target Reaction.

HasCompoundName

  • Each Compound relates to multiple CompoundNames.
  • Each CompoundName relates to multiple Compounds
Connects a compound to its names. A compound generally has several names.

Name Type Notes
from-link string id of the source Compound.
priority int Priority of this name, with 1 being the highest priority, 2 the next highest, and so forth.
to-link string id of the target CompoundName.
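
The Compound entity's label field is described as the name possessing a priority of 1. Selecting that name from a set of HasCompoundName rows might look like the sketch below. The row layout, the sample names, and the function name are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def display_label(names: List[Tuple[int, str]]) -> Optional[str]:
    """Return the name with the highest priority (lowest number), if any.

    Each tuple is (priority, compound_name), mirroring HasCompoundName.
    """
    if not names:
        return None
    return min(names, key=lambda pair: pair[0])[1]


# Hypothetical HasCompoundName rows for one compound.
rows = [(2, "dihydrogen oxide"), (1, "H2O"), (3, "water")]
print(display_label(rows))
```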

HasContig

  • Each Genome relates to multiple Contigs.
This relationship connects a genome to the contigs that contain the actual genetic information.

Name Type Notes
from-link float id of the source Genome.
to-link string id of the target Contig.

HasFeature

  • Each Genome relates to multiple Features.
  • Converse name is IsInGenome.
This relationship connects a genome to all of its features. This relationship is redundant in a sense, because the genome ID is part of the feature ID; however, it makes the creation of certain queries more convenient because you can drag in filtering information for a feature's genome.

Name Type Notes
from-link float id of the source Genome.
to-link string id of the target Feature.
type string Feature type (e.g., peg, rna).
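
Because the genome ID is part of the feature ID, the genome and feature type can also be recovered from the ID alone. The sketch below assumes feature IDs follow the conventional SEED form fig|genome.type.number, for example fig|83333.1.peg.4. This document does not spell the format out, so treat that layout, the class name, and the function name as assumptions.

```python
from typing import NamedTuple


class FigId(NamedTuple):
    genome: str
    feature_type: str
    number: int


def parse_fig_id(fig_id: str) -> FigId:
    """Split an ID such as "fig|83333.1.peg.4" into its components.

    Assumption: the ID has the form fig|<genome>.<type>.<number>,
    where the genome ID itself contains a period.
    """
    prefix, _, rest = fig_id.partition("|")
    if prefix != "fig":
        raise ValueError(f"not a FIG ID: {fig_id!r}")
    # Split from the right so the period inside the genome ID is preserved.
    genome, feature_type, number = rest.rsplit(".", 2)
    return FigId(genome, feature_type, int(number))


parsed = parse_fig_id("fig|83333.1.peg.4")
print(parsed.genome, parsed.feature_type, parsed.number)
```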

HasGenomeSubset

  • Each Subsystem relates to multiple GenomeSubsets.
This relationship connects a subsystem to its constituent genome subsets. Note that some genomes in a subsystem may not belong to a subset, so the relationship between genomes and subsystems cannot be derived from the relationships going through the subset.

Name Type Notes
from-link string id of the source Subsystem.
to-link string id of the target GenomeSubset.

HasProperty

  • Each Feature relates to multiple Properties.
  • Each Property relates to multiple Features
This relationship connects a feature to its known property values. The relationship contains text data that indicates the paper or organization that discovered evidence that the feature possesses the property. So, for example, if two papers presented evidence that a feature is essential, there would be an instance of this relationship for both.

Name Type Notes
evidence text URL or citation of the paper or institution that reported evidence of the relevant feature possessing the specified property value.
from-link string id of the source Feature.
to-link int id of the target Property.

HasRoleInSubsystem

  • Each Feature relates to multiple Subsystems.
  • Each Subsystem relates to multiple Features
This relationship connects a feature to the subsystems in which it participates. This is technically redundant information, but it is used so often that it gets its own table for performance reasons.

Name Type Notes
from-link string id of the source Feature.
genome string ID of the genome containing the feature
to-link string id of the target Subsystem.
type string Feature type (e.g., peg, rna).

HasRoleSubset

  • Each Subsystem relates to multiple RoleSubsets.
This relationship connects a subsystem to its constituent role subsets. Note that some roles in a subsystem may not belong to a subset, so the relationship between roles and subsystems cannot be derived from the relationships going through the subset.

Name Type Notes
from-link string id of the source Subsystem.
to-link string id of the target RoleSubset.

HasSSCell

  • Each Subsystem relates to multiple SSCells.
This relationship connects a subsystem to the spreadsheet cells used to analyze and display it. The cells themselves can be thought of as a grid with Roles on one axis and Genomes on the other. The various features of the subsystem are then assigned to the cells.

Name Type Notes
from-link string id of the source Subsystem.
to-link hash-string id of the target SSCell.

HasScenario

  • Each Subsystem relates to multiple Scenarios.
  • Each Scenario relates to multiple Subsystems
This relationship connects a subsystem to the scenarios used to validate it.

Name Type Notes
from-link string id of the source Subsystem.
to-link string id of the target Scenario.

IncludesReaction

  • Each Scenario relates to multiple Reactions.
  • Each Reaction relates to multiple Scenarios
This relationship connects a scenario to reactions that participate in it but are not part of the parent subsystem.

Name Type Notes
from-link string id of the source Scenario.
to-link string id of the target Reaction.

IsAComponentOf

  • Each Compound relates to multiple Reactions.
  • Each Reaction relates to multiple Compounds
This relationship connects a reaction to the compounds that participate in it.

Name Type Notes
discriminator int A unique ID for this record. The discriminator does not provide any useful data, but it prevents identical records from being collapsed by the SELECT DISTINCT command used by ERDB to retrieve data.
from-link string id of the source Compound.
loc string An optional character string that indicates the relative position of this compound in the reaction's chemical formula. The location affects the way the compounds present as we cross the relationship from the reaction side. The product/substrate flag comes first, then the value of this field, then the main flag. The default value is an empty string; however, the empty string sorts first, so if this field is used, it should probably be used for every compound in the reaction.
main boolean TRUE if this compound is one of the main participants in the reaction, else FALSE. It is permissible for none of the compounds in the reaction to be considered main, in which case this value would be FALSE for all of the relevant compounds.
product boolean TRUE if the compound is a product of the reaction, FALSE if it is a substrate. When a reaction is written on paper in chemical notation, the substrates are left of the arrow and the products are to the right. Sorting on this field will cause the substrates to appear first, followed by the products. If the reaction is reversible, then the notion of substrates and products is not as intuitive; however, a value here of FALSE still puts the compound left of the arrow and a value of TRUE still puts it to the right.
stoichiometry string Number of molecules of the compound that participate in a single instance of the reaction. For example, if a reaction produces two water molecules, the stoichiometry of water for the reaction would be two. When a reaction is written on paper in chemical notation, the stoichiometry is the number next to the chemical formula of the compound.
to-link string id of the target Reaction.
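
The sort order described for the loc field (product flag, then loc, then main flag) is enough to assemble a reaction display string. The sketch below is one possible reading, not the Sprout implementation. The record type, the arrow symbols, the omission of a stoichiometry of "1", and the ascending sort on the main flag are all assumptions.

```python
from typing import List, NamedTuple


class Component(NamedTuple):
    compound: str       # display label of the compound
    product: bool       # False = substrate, True = product
    loc: str            # optional position string, "" by default
    main: bool
    stoichiometry: str


def reaction_string(components: List[Component], reversible: bool) -> str:
    """Build a display string with substrates left of the arrow.

    Components are sorted on (product, loc, main), so substrates come first.
    """
    ordered = sorted(components, key=lambda c: (c.product, c.loc, c.main))

    def side(items: List[Component]) -> str:
        return " + ".join(
            c.compound if c.stoichiometry == "1"
            else f"{c.stoichiometry} {c.compound}"
            for c in items
        )

    left = side([c for c in ordered if not c.product])
    right = side([c for c in ordered if c.product])
    arrow = "<=>" if reversible else "=>"
    return f"{left} {arrow} {right}"


# Hypothetical IsAComponentOf rows; loc is set on every compound.
rows = [
    Component("H2O", True, "a", True, "2"),
    Component("O2", False, "b", True, "1"),
    Component("H2", False, "a", True, "2"),
]
print(reaction_string(rows, reversible=False))
```

Setting loc on every compound, as the loc note recommends, keeps the order within each side of the arrow predictable.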

IsAliasOf

  • Each FeatureAlias relates to multiple Features.
  • Each Feature relates to multiple FeatureAliases
Connects an alias to the feature it represents. Every alias connects to at least 1 feature, and a feature connects to many aliases.

Name Type Notes
from-link string id of the source FeatureAlias.
to-link string id of the target Feature.

IsAlsoFoundIn

  • Each Feature relates to multiple ExternalDatabases.
  • Each ExternalDatabase relates to multiple Features
This relationship connects a feature to external databases that contain essentially identical features. The name used in the external database is stored in the relationship as intersection data.

Name Type Notes
alias string ID of the feature in the specified external database.
from-link string id of the source Feature.
to-link string id of the target ExternalDatabase.

IsFamilyForFeature

  • Each ProteinFamily relates to multiple Features.
  • Each Feature relates to multiple ProteinFamilies.
  • Converse name is IsInFamily.
This relationship connects a feature to its protein families.

Name Type Notes
from-link string id of the source ProteinFamily.
range string Location in the feature of the matching protein.
to-link string id of the target Feature.

IsGenomeOf

  • Each Genome relates to multiple SSCells.
This relationship connects a subsystem's spreadsheet cell to the genome for the spreadsheet row.

Name Type Notes
from-link float id of the source Genome.
to-link hash-string id of the target SSCell.

IsIdentifiedByCAS

  • Each Compound relates to multiple CompoundCASs.
  • Each CompoundCAS relates to multiple Compounds
Relates a compound's CAS ID to the compound itself. Every CAS ID is associated with a compound, and some are associated with two compounds, but not all compounds have CAS IDs.

Name Type Notes
from-link string id of the source Compound.
to-link string id of the target CompoundCAS.

IsIdentifiedByEC

  • Each Role relates to multiple RoleECs.
  • Each RoleEC relates to multiple Roles
Relates a role to its EC number. Every EC number is associated with a role, but not all roles have EC numbers.

Name Type Notes
from-link string id of the source Role.
to-link string id of the target RoleEC.

IsInputFor

  • Each Compound relates to multiple Scenarios.
  • Each Scenario relates to multiple Compounds
This relationship connects a scenario to its input compounds.

Name Type Notes
from-link string id of the source Compound.
to-link string id of the target Scenario.

IsLocatedIn

  • Each Feature relates to multiple Contigs.
  • Each Contig relates to multiple Features
  • Converse name is IsLocusFor.
This relationship connects a feature to the contig segments that work together to effect it. The segments are numbered sequentially starting from 1. The database is required to place an upper limit on the length of each segment. If a segment is longer than the maximum, it can be broken into smaller segments. The upper limit enables applications to locate all features that contain a specific residue. For example, if the upper limit is 100 and we are looking for a feature that contains residue 234 of contig ABC, we can look for features with a begin point between 135 and 333, because a segment that begins outside this range cannot reach residue 234 in either direction. The results can then be filtered by direction and length of the segment.

Name Type Notes
beg int Index (1-based) of the first residue in the contig that belongs to the segment.
dir char Direction of the segment: + if it is forward and - if it is backward.
from-link string id of the source Feature.
len int Number of residues in the segment. A length of 0 identifies a specific point between residues. This is the point before the residue if the direction is forward and the point after the residue if the direction is backward.
locN int Sequence number of this segment.
to-link string id of the target Contig.
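
The residue lookup described above can be sketched in two steps: a coarse range test on beg, followed by an exact filter on direction and length. This is a minimal sketch, not the Sprout implementation. The Segment type, the function names, and the feature IDs are hypothetical, and the code assumes that a backward segment runs from beg down toward lower residue numbers, which is what the 135 to 333 window in the example implies.

```python
from typing import List, NamedTuple, Tuple

class Segment(NamedTuple):
    feature: str   # from-link
    contig: str    # to-link
    beg: int       # 1-based index of the first residue
    dir: str       # "+" forward, "-" backward
    len: int       # number of residues

def begin_range(residue: int, max_len: int) -> Tuple[int, int]:
    """Window of begin points that could belong to a covering segment."""
    return residue - max_len + 1, residue + max_len - 1

def covers(seg: Segment, residue: int) -> bool:
    """Exact test using the direction and length of the segment."""
    if seg.len == 0:
        return False  # a zero-length segment marks a point, not a residue
    if seg.dir == "+":
        return seg.beg <= residue < seg.beg + seg.len
    # Assumption: a backward segment extends from beg toward lower indexes.
    return seg.beg - seg.len < residue <= seg.beg

def features_containing(segments: List[Segment], contig: str,
                        residue: int, max_len: int) -> List[str]:
    lo, hi = begin_range(residue, max_len)
    hits: List[str] = []
    for seg in segments:
        if seg.contig != contig or not (lo <= seg.beg <= hi):
            continue  # coarse filter, the part an index on beg can serve
        if covers(seg, residue) and seg.feature not in hits:
            hits.append(seg.feature)
    return hits

segments = [
    Segment("f1", "ABC", 135, "+", 100),  # covers residues 135..234
    Segment("f2", "ABC", 333, "-", 100),  # covers residues 234..333
    Segment("f3", "ABC", 135, "+", 50),   # in the window, but too short
    Segment("f4", "ABC", 235, "+", 10),   # begins after the residue
    Segment("f5", "XYZ", 200, "+", 100),  # different contig
]
print(begin_range(234, 100))
print(features_containing(segments, "ABC", 234, 100))
```

In a real query the coarse test would be pushed into the database as a range condition on beg, so that only the few rows inside the window need the exact check.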

IsMadeUpOf

  • Each Contig relates to multiple Sequences.
A contig is stored in the database as an ordered set of sequences. By splitting the contig into sequences, we get a performance boost from only needing to keep small portions of a contig in memory at any one time. This relationship connects the contig to its constituent sequences.

Name Type Notes
from-link string id of the source Contig.
len int Length of the sequence.
start-position int Index (1-based) of the point in the contig where this sequence starts.
to-link string id of the target Sequence.
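
As an illustration of why the start-position and len fields are useful, the sketch below finds which sequences must be loaded to read a given region of a contig. It is a sketch under stated assumptions, not Sprout code: the tuple layout, the function name, and the sequence IDs are hypothetical, and the pieces are assumed to be sorted by start-position.

```python
from bisect import bisect_right
from typing import List, Tuple

# Each piece is (start_position, len, sequence_id), sorted by start_position.
Piece = Tuple[int, int, str]

def sequences_for_region(pieces: List[Piece], start: int, length: int) -> List[str]:
    """Return the ids of the sequences that overlap the 1-based region."""
    end = start + length - 1
    starts = [p[0] for p in pieces]
    # Last piece that starts at or before the region start.
    i = max(bisect_right(starts, start) - 1, 0)
    result: List[str] = []
    while i < len(pieces) and pieces[i][0] <= end:
        piece_start, piece_len, seq_id = pieces[i]
        if piece_start + piece_len - 1 >= start:
            result.append(seq_id)
        i += 1
    return result

pieces = [
    (1, 4000, "ABC.1"),
    (4001, 4000, "ABC.2"),
    (8001, 1500, "ABC.3"),
]
print(sequences_for_region(pieces, 3990, 20))
print(sequences_for_region(pieces, 9000, 10))
```

Only the sequences returned need to be held in memory, which is the performance benefit the relationship description refers to.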

IsOnDiagram

  • Each Scenario relates to multiple Diagrams.
  • Each Diagram relates to multiple Scenarios
This relationship connects a scenario to related diagrams.

Name Type Notes
from-link string id of the source Scenario.
to-link string id of the target Diagram.

IsOutputOf

  • Each Compound relates to multiple Scenarios.
  • Each Scenario relates to multiple Compounds
This relationship connects a scenario to its output compounds.

Name Type Notes
auxiliary boolean TRUE if this is an auxiliary output compound, FALSE if it is a main output compound.
from-link string id of the source Compound.
to-link string id of the target Scenario.

IsPathogenicIn

  • Each Genome relates to multiple Hosts.
  • Each Host relates to multiple Genomes
This relationship connects a genome to a host in which it is pathogenic. Many genomes do not have a pathogenic host; some have multiple hosts.

Name Type Notes
from-link float id of the source Genome.
to-link string id of the target Host.

IsPossiblePlaceFor

  • Each CellLocation relates to multiple Features.
  • Each Feature relates to multiple CellLocations
This relationship connects a feature with the various places in a cell that the feature might be found. The confidence factor is included as intersection data.

Name Type Notes
confidence float Confidence that the protein will be found in this location, expressed as a value from 0 to 10.
from-link string id of the source CellLocation.
to-link string id of the target Feature.

IsPresentOnProteinOf

  • Each CDD relates to multiple Features.
  • Each Feature relates to multiple CDDs
This relationship connects a feature to its CDD protein domains. The match score is included as intersection data.

Name Type Notes
from-link string id of the source CDD.
score float This is the match score between the feature and the CDD. A lower score is a better match.
to-link string id of the target Feature.

IsProteinForFeature

  • Each PDB relates to multiple Features.
  • Each Feature relates to multiple PDBs
Relates a PDB to features that produce highly similar proteins.

Name Type Notes
end-location int Ending location within the feature of the matching region.
from-link string id of the source PDB.
score float Similarity score for the comparison between the feature and the PDB protein. A lower score indicates a better match.
start-location int Starting location within the feature of the matching region.
to-link string id of the target Feature.

IsRepresentativeOf

  • Each Genome relates to multiple Genomes.
This relationship connects a genome to its representative. Genomes are partitioned into multiple sets of close strains, each having a single representative. In certain situations, it is desirable to analyze only representative genomes rather than the full suite.

Name Type Notes
from-link float id of the source Genome.
to-link float id of the target Genome.

IsRoleOf

  • Each Role relates to multiple SSCells.
This relationship connects a subsystem's spreadsheet cell to the role for the spreadsheet column.

Name Type Notes
from-link string id of the source Role.
to-link hash-string id of the target SSCell.

IsSynonymGroupFor

  • Each SynonymGroup relates to multiple Features.
  • Each Feature relates to multiple SynonymGroups
This relationship connects a synonym group to the features that make it up.

Name Type Notes
from-link string id of the source SynonymGroup.
to-link string id of the target Feature.

IsTargetOfAnnotation

  • Each Feature relates to multiple Annotations.
This relationship connects a feature to its annotations.

Name Type Notes
from-link string id of the source Feature.
to-link string id of the target Annotation.

IsTrustedBy

  • Each SproutUser relates to multiple SproutUsers.
This relationship identifies the users trusted by each particular user. When viewing functional assignments, the assignment displayed is the most recent one by a user trusted by the current user. The current user implicitly trusts himself. If no trusted users are specified in the database, the user also implicitly trusts the user FIG.

Name Type Notes
from-link string id of the source SproutUser.
to-link string id of the target SproutUser.
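
The trust rules above can be sketched as follows. This is a minimal sketch, not the Sprout implementation. It assumes the IsTrustedBy rows have already been gathered into a dictionary from each user to the set of users they trust, and that each functional assignment carries a timestamp and the name of the user who made it. The function names, the user names other than FIG, and the sample data are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

# (time, user, assignment text); the time field is an assumption.
Assignment = Tuple[int, str, str]

def trusted_users(current_user: str, trusts: Dict[str, Set[str]]) -> Set[str]:
    explicit = set(trusts.get(current_user, ()))
    trusted = explicit | {current_user}  # a user implicitly trusts themselves
    if not explicit:
        trusted.add("FIG")               # fallback when no trusted users are specified
    return trusted

def displayed_assignment(current_user: str,
                         trusts: Dict[str, Set[str]],
                         assignments: List[Assignment]) -> Optional[str]:
    """Most recent assignment made by a user the current user trusts."""
    trusted = trusted_users(current_user, trusts)
    candidates = [a for a in assignments if a[1] in trusted]
    if not candidates:
        return None
    return max(candidates, key=lambda a: a[0])[2]

trusts = {"alice": {"bob"}}
assignments = [
    (100, "FIG", "hypothetical protein"),
    (200, "bob", "DNA polymerase I"),
    (300, "carol", "DNA polymerase III"),
    (150, "alice", "polymerase"),
]
print(displayed_assignment("alice", trusts, assignments))
print(displayed_assignment("dave", trusts, assignments))
```

In this example alice sees bob's assignment, because carol's more recent one comes from a user she does not trust, while dave, who has no trusted users on record, falls back to the assignment made by FIG.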

MadeAnnotation

  • Each SproutUser relates to multiple Annotations.
This relationship connects an annotation to the user who made it.

Name Type Notes
from-link string id of the source SproutUser.
to-link string id of the target Annotation.

OccursInSubsystem

  • Each Role relates to multiple Subsystems.
  • Each Subsystem relates to multiple Roles
  • Converse name is Uses.
This relationship connects roles to the subsystems that implement them.

Name Type Notes
abbr string Abbreviated name for the role, generally non-unique, but useful in column headings for HTML tables.
auxiliary boolean If TRUE, then this role is ancillary to the purpose of the subsystem. If FALSE, it is essential to its metabolic pathway.
column-number int Column number for this role in the specified subsystem's spreadsheet.
from-link string id of the source Role.
hope-reaction-note text A description of the status of a role in relation to the reactions it produces as determined by the scenarios. If present, it indicates whether the role has been determined to be auxiliary, whether it has been examined to verify an automatic assignment, and so forth.
to-link string id of the target Subsystem.

ParticipatesIn

  • Each Genome relates to multiple Subsystems.
  • Each Subsystem relates to multiple Genomes
This relationship connects a subsystem to the genomes that use it. If the subsystem has been curated for the genome, then the subsystem's roles will also be connected to the genome features through the SSCell object.

Name Type Notes
from-link float id of the source Genome.
to-link string id of the target Subsystem.
variant-code string Code indicating the subsystem variant to which this genome belongs. Each subsystem can have multiple variants. A variant code of -1 indicates that the genome does not have a functional variant of the subsystem. A variant code of 0 indicates that the genome's participation is considered uncertain.

RoleOccursIn

  • Each Role relates to multiple Diagrams.
  • Each Diagram relates to multiple Roles
This relationship connects a role to the diagrams on which it appears. A role frequently identifies an enzyme, and can appear in many diagrams. A diagram generally contains many different roles.

Name Type Notes
from-link string id of the source Role.
to-link string id of the target Diagram.

Miscellaneous

BBHs

For each feature, the BBH Server has that feature's bidirectional best hits in other genomes.

Pins

The Pin Server provides information about functional couplings between features.

Sims

The Similarity Server contains a high-performance custom database of similarities between features.

WebServices

HTTP services are used to transmit data between the servers and the NMPDR.
Summary A diagram and description of the main NMPDR database
Topic attachments
Attachment Size Date Who Comment
SproutDBD.xml 64.4 K 13 Jan 2009 - 03:20 Bruce Parrello Sprout database definition
Topic revision: r15 - 16 Jan 2009 - 15:24:01 - Bruce Parrello
 
Notice to NMPDR Users - The NMPDR BRC contract has ended and bacterial data from NMPDR has been transferred to PATRIC (http://www.patricbrc.org), a new consolidated BRC for all NIAID category A-C priority pathogenic bacteria. NMPDR was a collaboration among researchers from the Computation Institute of the University of Chicago, the Fellowship for Interpretation of Genomes (FIG), Argonne National Laboratory, and the National Center for Supercomputing Applications (NCSA) at the University of Illinois. NMPDR is funded by the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under Contract HHSN266200400042C. Banner images are copyright © Dennis Kunkel.