Introduction
The Sprout database contains the genetic data for all complete organisms in the SeedEnvironment. Data that is not in Sprout (attributes, similarities, and couplings) is stored on external servers available to the Sprout software. The Sprout database is reloaded approximately once per month. There is significant redundancy in the Sprout database because it has been optimized for searching. In particular, the Feature table contains an extra copy of the feature's functional role and a list of possible search terms.
Entities
Annotation
An annotation contains supplementary information about a feature. The most important type of annotation is the assignment of a functional role; however, other types of annotations are also possible.
Annotation Table
| Field | Type | Default | Description |
| id | name-string | n/a | Unique identifier for this Annotation. |
| time | date | n/a | Date and time of the annotation. |
| annotation | text | n/a | Text of the annotation. |

| Index | Unique | Fields | Notes |
| idx0 | | time DESC | This index allows the user to find recent annotations. |
| idx1 | yes | id | Primary index for Annotation. |
CDD
A CDD is a protein domain designator. It represents the shape of a molecular unit on a feature's protein. The ID is a six-digit string assigned by the public Conserved Domain Database. A CDD can occur on multiple features and a feature generally has multiple CDDs.
CDD Table
| Field | Type | Default | Description |
| id | key-string | n/a | Unique identifier for this CDD. |

| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for CDD. |
CellLocation
A cell location is a section of the cell in which a protein might be found. Examples include the cell wall or membrane, outside the cell, inside the cell, and so forth.
CellLocation Table
| Field | Type | Default | Description |
| id | key-string | n/a | Unique identifier for this CellLocation. |

| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for CellLocation. |
Compound
A compound is a chemical that participates in a reaction. All compounds have a unique ID and may also have one or more names.
Compound Table
| Field | Type | Default | Description |
| id | name-string | n/a | Unique identifier for this Compound. |
| label | string | n/a | Name used in reaction display strings. This is the same as the name possessing a priority of 1, but it is placed here to speed up the query used to create the display strings. |

| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for Compound. |
CompoundCAS
This entity represents the Chemical Abstract Service ID for a compound. Each Compound has at most one CAS ID.
CompoundCAS Table
| Field | Type | Default | Description |
| id | name-string | n/a | Unique identifier for this CompoundCAS. |

| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for CompoundCAS. |
CompoundName
A compound name is a common name for the chemical represented by a compound.
CompoundName Table
| Field | Type | Default | Description |
| id | string | n/a | Unique identifier for this CompoundName. |

| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for CompoundName. |
Contig
A contig is a contiguous run of residues. The contig's ID consists of the genome ID followed by a name that identifies which contig this is for the parent genome. As is the case with all keys in this database, the individual components are separated by a period. A contig can contain over a million residues. For performance reasons, therefore, the contig is split into multiple pieces called sequences. The sequences contain the characters that represent the residues as well as data on the quality of the residue identification.
Contig Table
| Field | Type | Default | Description |
| id | name-string | n/a | Unique identifier for this Contig. |

| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for Contig. |
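The contig key layout and the splitting of a contig into sequences can be illustrated with a minimal Python sketch. The function names, the example genome ID `83333.1`, the contig name, and the piece length are all assumptions made for illustration; the actual piece size and loader code used by Sprout are not stated here.

```python
from typing import List


def make_contig_id(genome_id: str, contig_name: str) -> str:
    """Join the genome ID and the contig name with a period."""
    return f"{genome_id}.{contig_name}"


def contig_name_of(contig_id: str, genome_id: str) -> str:
    """Recover the contig name from a contig ID, given the parent genome ID."""
    prefix = genome_id + "."
    if not contig_id.startswith(prefix):
        raise ValueError(f"{contig_id!r} does not belong to genome {genome_id!r}")
    return contig_id[len(prefix):]


def split_into_sequences(residues: str, piece_length: int) -> List[str]:
    """Cut a contig's residues into consecutive pieces of at most piece_length."""
    if piece_length <= 0:
        raise ValueError("piece_length must be positive")
    return [residues[i:i + piece_length]
            for i in range(0, len(residues), piece_length)]


# Hypothetical IDs and a tiny piece length, chosen only for the demonstration.
contig_id = make_contig_id("83333.1", "NC_000913")
pieces = split_into_sequences("acgtacgtac", 4)
```

The parent genome ID is passed in explicitly when splitting a key, because the components themselves may contain periods.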
Diagram
A functional diagram describes a network of chemical reactions, often comprising a single subsystem. A diagram is identified by a short name and contains a longer descriptive name. The actual diagram shows which functional roles guide the reactions along with the inputs and outputs; the database, however, only indicates which roles belong to a particular diagram's map.
Diagram Table
| Field | Type | Default | Description |
| id | name-string | n/a | Unique identifier for this Diagram. |
| name | text | n/a | Descriptive name of this diagram. |

| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for Diagram. |
ExternalDatabase
An external database identifies a biological database surveyed by PIR International as part of an effort to determine which features are essentially identical between bioinformatics organizations. Each feature in the database will have zero or more corresponding IDs that are captured from the PIR data. Each corresponding ID is represented in a relationship between an external database and the feature itself.
ExternalDatabase Table
| Field | Type | Default | Description |
| id | key-string | n/a | Unique identifier for this ExternalDatabase. |

| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for ExternalDatabase. |
Feature
A feature (sometimes also called a "gene") is a part of a genome that is of special interest. Features may be spread across multiple contigs of a genome, but never across more than one genome. Features can be assigned to roles via spreadsheet cells, and are the targets of annotation. Each feature in the database has a unique FigId.
Feature Table
| Field | Type | Default | Description |
| id | id-string | n/a | Unique identifier for this Feature. |
| assignment-maker | name-string | n/a | Name of the user who made the functional assignment. |
| assignment-quality | char | n/a | Quality of the functional assignment: usually a space, but may be W (indicating weak) or X (indicating experimental). |
| feature-type | id-string | n/a | Code indicating the type of this feature. Among the codes currently supported are peg for a protein encoding gene, bs for a binding site, opr for an operon, and so forth. |
| in-genbank | boolean | n/a | TRUE if a feature can be found in GenBank, else FALSE. |
| isoelectric-point | float | n/a | pH in the surrounding medium at which the charge on a protein is neutral. If the pH of the medium is lower than this value, the protein will have a net positive charge. If the pH of the medium is higher, then the protein will have a net negative charge. |
| locked | boolean | n/a | TRUE if a feature's assignment is locked. A locked feature's functional role cannot be changed by automated programs. |
| molecular-weight | float | n/a | Molecular weight of this feature's protein, in daltons. A weight of 0 indicates that no protein is created. |
| sequence-length | counter | n/a | Number of base pairs in this feature. |
| signal-peptide | name-string | n/a | The signal peptide location for this feature. This is expressed as start and end numbers with a hyphen for the relevant amino acids. So, "1-22" would indicate a signal peptide at the beginning of the feature's protein and extending through 22 amino acid positions. An empty string means no signal peptide is present. |
| similar-to-human | boolean | n/a | TRUE if this feature generates a protein that is similar to one found in humans, else FALSE. |
| assignment | text | n/a | Default functional assignment for this feature. |
| keywords | text | n/a | This is a list of search keywords for the feature. It includes the functional assignment, subsystem roles, and special properties. |
| location-string | text | n/a | Location of the feature, expressed as a comma-delimited list of Sprout location strings. This gives us a fast mechanism for extracting the feature location. Otherwise, we have to painstakingly paste together the #IsLocatedIn records, which are themselves designed to help look for features in a particular region rather than to find the location of a feature. |
| transmembrane-map | text | n/a | A map indicating which sections of a protein will be embedded in a membrane. This is expressed as a comma-separated list of start and end numbers, joined by hyphens, for the relevant amino acids. So, "10-12, 40-60" would indicate that there are two sections of the protein that become embedded in a membrane: the 10th through 12th amino acids, and the 40th through the 60th. An empty string means no transmembrane regions are known. |

| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for Feature. |
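The signal-peptide and transmembrane-map fields share the same start-end notation, so one small parser can read both. The following Python sketch is an illustration only: the function names are hypothetical and are not part of the Sprout software. It also encodes the isoelectric-point rule from the table as a sign function.

```python
from typing import List, Tuple


def parse_ranges(text: str) -> List[Tuple[int, int]]:
    """Parse a string such as "10-12, 40-60" into [(10, 12), (40, 60)].

    An empty string yields an empty list, matching the convention that an
    empty field means no region is known or present.
    """
    ranges = []
    for piece in text.split(","):
        piece = piece.strip()
        if not piece:
            continue
        start, end = piece.split("-")
        ranges.append((int(start), int(end)))
    return ranges


def net_charge_sign(medium_ph: float, isoelectric_point: float) -> int:
    """Return +1, -1, or 0 for the sign of the protein's net charge."""
    if medium_ph < isoelectric_point:
        return 1    # medium more acidic than the isoelectric point
    if medium_ph > isoelectric_point:
        return -1   # medium more basic than the isoelectric point
    return 0


transmembrane = parse_ranges("10-12, 40-60")
signal_peptide = parse_ranges("1-22")
# Number of amino acids embedded in the membrane (positions are inclusive).
embedded = sum(end - start + 1 for start, end in transmembrane)
```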
FeatureConservation Table
| Field | Type | Default | Description |
| id | id-string | n/a | Unique identifier for this Feature. |
| conservation | float | n/a | (optional) A number between 0 and 1 that indicates the degree to which this feature's DNA is conserved in related genomes. A value of 1 indicates perfect conservation. A value less than 1 is a reflection of the degree to which gap characters interfere in the alignment between the feature and its close relatives. |

| Index | Unique | Fields | Notes |
| idx0 | | id | |
FeatureEssential Table
| Field | Type | Default | Description |
| id | id-string | n/a | Unique identifier for this Feature. |
| essential | text | n/a | A value indicating the essentiality of the feature, coded as HTML. In most cases, this will be a word describing whether the essentiality is confirmed (essential) or potential (potential-essential), hyperlinked to the document from which the essentiality was curated. If a feature is not essential, this field will have no values; otherwise, it may have multiple values. |

| Index | Unique | Fields | Notes |
| idx0 | | id | |
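Because the index on id in this table is not unique, a feature can own zero, one, or several essentiality values. The sketch below models that with an in-memory SQLite table. It is only an approximation under stated assumptions: SQLite stands in for whatever database engine Sprout uses, and the feature IDs and link targets are hypothetical.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a toy FeatureEssential table with a non-unique index on id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE FeatureEssential (id TEXT NOT NULL, essential TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX FeatureEssential_idx0 ON FeatureEssential (id)")
    conn.executemany(
        "INSERT INTO FeatureEssential VALUES (?, ?)",
        [
            # Hypothetical feature ID with two curated values.
            ("fig|83333.1.peg.1",
             '<a href="http://example.org/doc1">essential</a>'),
            ("fig|83333.1.peg.1",
             '<a href="http://example.org/doc2">potential-essential</a>'),
        ],
    )
    return conn


def essential_values(conn: sqlite3.Connection, feature_id: str) -> list:
    """Return every essentiality value for a feature (possibly none)."""
    rows = conn.execute(
        "SELECT essential FROM FeatureEssential WHERE id = ? ORDER BY rowid",
        (feature_id,),
    )
    return [row[0] for row in rows]


conn = build_demo_db()
```

A feature with no rows in the table is simply treated as not essential; no placeholder row is needed.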
FeatureIEDB Table
| Field | Type | Default | Description |
| id | id-string | n/a | Unique identifier for this Feature. |
| iedb | text | n/a | A value indicating whether or not the feature can be found in the Immune Epitope Database. If the feature has not been matched to that database, this field will have no values. Otherwise, it will have an epitope name and/or sequence, hyperlinked to the database. |

| Index | Unique | Fields | Notes |
| idx0 | | id | |
FeatureLink Table
| Field | Type | Default | Description |
| id | id-string | n/a | Unique identifier for this Feature. |
| link | text | n/a | Web hyperlink for this feature. A feature can have no hyperlinks or it can have many. The links are to other websites that have useful information about the gene that the feature represents, and are coded as raw HTML, using <a href="_link_">_text_</a> notation. |

| Index | Unique | Fields | Notes |
| idx0 | | id | |
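Since link values are stored as raw HTML anchors, a client that needs the URL and the display text separately has to parse them. The following is a minimal sketch using Python's standard `html.parser` module; the class name, function name, and example URL are hypothetical.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs from <a href="...">text</a> anchors."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href are ignored.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def extract_links(html: str) -> list:
    """Return all (href, text) pairs found in a link field value."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


links = extract_links('<a href="http://example.org/gene/1">Example gene page</a>')
```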
FeatureTranslation Table
| Field | Type | Default | Description |
| id | id-string | n/a | Unique identifier for this Feature. |
| translation | text | n/a | (optional) A translation of this feature's residues into character codes, formed by concatenating the pieces of the feature together. For a protein encoding gene, the translation contains protein characters. For other types it contains DNA characters. |

| Index | Unique | Fields | Notes |
| idx0 | | id | |
FeatureUpstream Table
| Field | Type | Default | Description |
| id | id-string | n/a | Unique identifier for this Feature. |
| upstream-sequence | text | n/a | Upstream sequence for the feature. This includes residues preceding the feature as well as some of the feature's initial residues. |

| Index | Unique | Fields | Notes |
| idx0 | | id | |
FeatureVirulent Table
| Field | Type | Default | Description |
| id | id-string | n/a | Unique identifier for this Feature. |
| virulent | text | n/a | A value indicating the virulence of the feature, coded as HTML. In most cases, this will be a phrase or SA number hyperlinked to the document from which the virulence information was curated. If the feature is not virulent, this field will have no values; otherwise, it may have multiple values. |

| Index | Unique | Fields | Notes |
| idx0 | | id | |
FeatureAlias
Alternative names for features. A feature can have many aliases. In general, each alias corresponds to only one feature, but there are many exceptions to this rule.
- FeatureAlias IsAliasOf Feature (many-to-many)
FeatureAlias Table
| Field | Type | Default | Description |
| id | medium-string | n/a | Unique identifier for this FeatureAlias. |

| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for FeatureAlias. |
Genome
A Genome contains the sequence data for a particular individual organism.
Genome Table
| Field | Type | Default | Description |
| id | name-string | n/a | Unique identifier for this Genome. |
| complete | boolean | n/a | TRUE if the genome is complete, else FALSE. |
| contigs | int | n/a | Number of contigs for this organism. |
| dna-size | counter | n/a | Number of base pairs in the genome. |
| genus | name-string | n/a | Genus of the relevant organism. |
| pegs | int | n/a | Number of protein encoding genes for this organism. |
| primary-group | name-string | n/a | The primary NMPDR group for this organism. There is always exactly one NMPDR group per organism (either based on the organism name or the default value Supporting). In general, more data is kept on organisms in NMPDR groups than on supporting organisms. |
| rnas | int | n/a | Number of RNA features found for this organism. |
| species | name-string | n/a | Species of the relevant organism. |
| unique-characterization | medium-string | | The unique characterization identifies the particular organism instance from which the genome is taken. It is possible to have in the database more than one genome for a particular species, and every individual organism has variations in its DNA. |
| version | name-string | n/a | Version string for this genome, generally consisting of the genome ID followed by a period and a string of digits. |
| taxonomy | text | n/a | The taxonomy string contains the full taxonomy of the organism, with individual elements separated by semicolons (and optional white space), starting with the domain and ending with the disambiguated genus and species (which is the organism's scientific name plus an identifying string). |

| Index | Unique | Fields | Notes |
| idx0 | | primary-group, genus, species, unique-characterization | This index allows the applications to find all genomes associated with a specific primary (NMPDR) group. |
| idx1 | | genus, species, unique-characterization | This index allows the applications to find all genomes for a particular species. |
| idx2 | yes | id | Primary index for Genome. |
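The taxonomy field's layout (semicolon-separated elements with optional white space, domain first, disambiguated organism name last) can be split with a short Python sketch. The function name and the sample taxonomy string are assumptions for illustration and are not taken from the database.

```python
import re
from typing import List


def parse_taxonomy(taxonomy: str) -> List[str]:
    """Split a taxonomy string on semicolons, dropping surrounding white space."""
    parts = re.split(r"\s*;\s*", taxonomy.strip())
    return [part for part in parts if part]


# Hypothetical taxonomy string with inconsistent spacing around the semicolons.
levels = parse_taxonomy(
    "Bacteria; Proteobacteria;Gammaproteobacteria ; Escherichia coli K12"
)
domain = levels[0]      # first element is the domain
organism = levels[-1]   # last element is the disambiguated genus and species
```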
GenomeSubset
A genome subset is a named collection of genomes that participate in a particular subsystem. The subset names are generally very short, non-unique strings. The ID of the parent subsystem is prefixed to the subset ID in order to make it unique.
GenomeSubset Table
| Field | Type | Default | Description |
| id | string | n/a | Unique identifier for this GenomeSubset. |

| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for GenomeSubset. |
Keyword
A keyword is a word that can be used to search the feature table. This entity contains the keyword's stem, its phonetic form, and the number of features that can be found by searching for the word.
Keyword Table
| Field | Type | Default | Description |
| id | name-string | n/a | Unique identifier for this Keyword. |
| phonex | name-string | n/a | A phonex is a string that identifies the phonetic characteristics of the word stem. This can be used to find alternative spellings if a matching word is not present. |
| stem | name-string | n/a | The stem of a keyword is a normalized form that is independent of parts of speech. The actual keywords stored in the database search index are stems. |

| Index | Unique | Fields | Notes |
| idx0 | | stem | This index allows the user to find words by stem. |
| idx1 | | phonex | This index allows the user to find words by phonex. |
| idx2 | yes | id | Primary index for Keyword. |
Ligand
A Ligand is a chemical of interest in computing docking energies against a PDB. The ID of the ligand is an 8-digit ID number in the ZINC database.
Ligand Table
| Field | Type | Default | Description |
| id | id-string | n/a | Unique identifier for this Ligand. |
| name | long-string | n/a | Chemical name of this ligand. |

| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for Ligand. |
PDB
A PDB is a protein data bank entry containing information that can be used to determine the shape of the protein and the energies required to dock with it. The ID is the four-character name used on the PDB web site.
PDB Table
| Field | Type | Default | Description |
| id | id-string | n/a | Unique identifier for this PDB. |
| docking-count | int | n/a | The number of ligands that have been docked against this PDB. |

| Index | Unique | Fields | Notes |
| idx0 | | docking-count DESC, id | |
| idx1 | yes | id | Primary index for PDB. |
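The idx0 index has no notes, but its field list (docking-count DESC, id) implies an ordering with the most heavily docked PDBs first and ties broken by ID. That this supports "most docked first" listings is an inference, not something stated in the table. The ordering itself can be reproduced in a few lines of Python; the PDB IDs and counts below are hypothetical.

```python
from typing import List, Tuple


def order_by_docking_count(pdbs: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Sort (id, docking_count) pairs by count descending, then id ascending."""
    return sorted(pdbs, key=lambda pdb: (-pdb[1], pdb[0]))


# Hypothetical four-character PDB IDs with made-up docking counts.
ordered = order_by_docking_count([("1abc", 3), ("2xyz", 10), ("1aaa", 3)])
```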
Property
A property is a type of assertion that could be made about the properties of a particular feature. Each property instance is a key/value pair and can be associated with many different features. Conversely, a feature can be associated with many key/value pairs, even some that notionally contradict each other. For example, there can be evidence that a feature is essential to the organism's survival and evidence that it is superfluous.
Property Table
| Field | Type | Default | Description |
| id | int | n/a | Unique identifier for this Property. |
| property-name | name-string | n/a | Name of this property. |
| property-value | string | n/a | Value associated with this property. For each property name, there must be a property record for each of its possible values. |

| Index | Unique | Fields | Notes |
| idx0 | | property-name, property-value | This index enables the application to find all values for a specified property name, or any given name/value pair. |
| idx1 | yes | id | Primary index for Property. |
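The two lookups that idx0 is meant to serve (all values for a name, and one specific name/value pair) can be sketched against an in-memory SQLite table. This is an illustration under assumptions: SQLite stands in for the real database engine, column names use underscores in place of hyphens, and the property names and values are hypothetical.

```python
import sqlite3


def build_property_table() -> sqlite3.Connection:
    """Create a toy Property table with a composite name/value index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Property ("
        "id INTEGER PRIMARY KEY, "
        "property_name TEXT NOT NULL, "
        "property_value TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE INDEX Property_idx0 ON Property (property_name, property_value)"
    )
    # One record per possible value of each property name.
    conn.executemany(
        "INSERT INTO Property VALUES (?, ?, ?)",
        [(1, "essential", "yes"), (2, "essential", "no"), (3, "virulent", "yes")],
    )
    return conn


def values_for(conn: sqlite3.Connection, name: str) -> list:
    """Return every value recorded for a property name."""
    rows = conn.execute(
        "SELECT property_value FROM Property "
        "WHERE property_name = ? ORDER BY property_value",
        (name,),
    )
    return [row[0] for row in rows]


def property_id(conn: sqlite3.Connection, name: str, value: str):
    """Return the ID of a name/value pair, or None if it is not defined."""
    row = conn.execute(
        "SELECT id FROM Property WHERE property_name = ? AND property_value = ?",
        (name, value),
    ).fetchone()
    return row[0] if row else None


conn = build_property_table()
```

A feature with contradictory evidence would simply be linked to both the "yes" and the "no" record for the same property name.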
Reaction
A reaction is a chemical process catalyzed by a protein. The reaction ID is generally a small number preceded by a letter.
Reaction Table
| Field | Type | Default | Description |
| id | key-string | n/a | Unique identifier for this Reaction. |
| rev | boolean | n/a | TRUE if this reaction is reversible, else FALSE. |

| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for Reaction. |
ReactionURL Table
| Field | Type | Default | Description |
| id | key-string | n/a | Unique identifier for this Reaction. |
| url | string | n/a | HTML string containing a link to a web location that describes the reaction. This field is optional. |

| Index | Unique | Fields | Notes |
| idx0 | | id | |
Role
A role describes a biological function that may be fulfilled by a feature. One of the main goals of the database is to record the roles of the various features.
Role Table
| Field | Type | Default | Description |
| id | string | n/a | Unique identifier for this Role. |

| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for Role. |
RoleEC
EC code for a role.
RoleEC Table
| Field | Type | Default | Description |
| id | string | n/a | Unique identifier for this RoleEC. |

| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for RoleEC. |
RoleSubset
A role subset is a named collection of roles in a particular subsystem. The subset names are generally very short, non-unique strings. The ID of the parent subsystem is prefixed to the subset ID in order to make it unique.
RoleSubset Table
| Field | Type | Default | Description |
| id | string | n/a | Unique identifier for this RoleSubset. |

| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for RoleSubset. |
SSCell
Part of the process of subsystem annotation of features is creating a spreadsheet of genomes and roles to which features are assigned. A spreadsheet cell represents one of the positions on the spreadsheet.
SSCell Table
| Field | Type | Default | Description |
| id | hash-string | n/a | Unique identifier for this SSCell. |