Introduction
The Sapling database is a distributable, self-contained copy of the NMPDR data. Unlike Sprout, which is optimized for searching, Sapling is designed to be structurally simple without sacrificing the ability to find information quickly.
Issues
- We may want to do some compression on the "dna" data type.
- Must add back the ability to index a secondary relation. Note that such indexes can only have a single field.
- We probably need some type tables that describe things like Identifier(source) or Family(kind).
- The ERDB documentation needs to be updated to include DisplayInfo, Asides, the "converse" attribute for relationships, and the Shapes section.
Entities
Annotation
An annotation is a comment attached to a feature. Annotations are used to track the history of a feature's functional assignments and any related issues. The key is the feature ID followed by a colon and a complemented eight-digit sequence number. The complemented sequence number causes the annotations to sort with the most recent one first.
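The key scheme described above can be sketched as follows. This is only an illustration: it assumes that "complemented" means subtracting the sequence number from 99999999 and that larger sequence numbers are more recent. The function name `annotation_key` and the sample feature ID are hypothetical.

```python
MAX_SEQ = 99_999_999  # largest eight-digit number (assumed complement base)


def annotation_key(feature_id: str, sequence: int) -> str:
    """Build '<feature ID>:<complemented eight-digit sequence number>'."""
    if not 0 <= sequence <= MAX_SEQ:
        raise ValueError("sequence number must fit in eight digits")
    # Complementing makes larger (more recent) numbers sort first as strings.
    return f"{feature_id}:{MAX_SEQ - sequence:08d}"


# Hypothetical feature ID with three annotations numbered 1 (oldest) to 3 (newest).
keys = [annotation_key("fig|83333.1.peg.4", n) for n in (1, 2, 3)]
newest_first = sorted(keys)
```

Under these assumptions, an ordinary ascending sort on the key puts the annotation with sequence number 3 ahead of 2 and 1, so no descending index is needed.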
Annotation Table
| Field | Type | Default | Description |
| id | string | n/a | Unique identifier for this Annotation. |
| annotation-time | date | n/a | Date and time at which the annotation was made. |
| annotator | string | n/a | Name of the annotator who made the comment. |
| comment | text | n/a | Text of the annotation. |
| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for Annotation. |
Compound
A compound is a chemical that participates in a reaction. All compounds have a unique ID and may also have one or more names. Both ligands and reaction components are treated as compounds.
Compound Table
| Field | Type | Default | Description |
| id | name-string | n/a | Unique identifier for this Compound. |
| label | string | n/a | Primary name of the compound. This is the name used in reaction display strings. |
| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for Compound. |
CompoundCAS Table
| Field | Type | Default | Description |
| id | name-string | n/a | Unique identifier for this Compound. |
| cas-id | string | n/a | The Chemical Abstract Service ID for the compound. A compound may have at most one CAS ID. |
| Index | Unique | Fields | Notes |
| idx0 | | cas-id | This index allows searching for compounds by CAS ID. |
| idx1 | | id | |
CompoundName Table
| Field | Type | Default | Description |
| id | name-string | n/a | Unique identifier for this Compound. |
| name | string | n/a | Alternate name for the compound. A compound may have many alternate names. The primary name should also be one of the alternate names. |
| Index | Unique | Fields | Notes |
| idx0 | | name | This index allows searching for compounds by name. |
| idx1 | | id | |
CompoundZinc Table
| Field | Type | Default | Description |
| id | name-string | n/a | Unique identifier for this Compound. |
| zinc-id | string | n/a | The ZINC database ID for the compound. A compound may have at most one ZINC ID. |
| Index | Unique | Fields | Notes |
| idx0 | | zinc-id | This index allows searching for compounds by ZINC ID. |
| idx1 | | id | |
Diagram
A functional diagram describes a network of chemical reactions, often comprising a single subsystem. A diagram is identified by a short name and contains a longer descriptive name.
Diagram Table
| Field | Type | Default | Description |
| id | name-string | n/a | Unique identifier for this Diagram. |
| name | text | n/a | Descriptive name of this diagram. |
| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for Diagram. |
DiagramContent Table
| Field | Type | Default | Description |
| id | name-string | n/a | Unique identifier for this Diagram. |
| content | image | n/a | The content of the diagram, in PNG format encoded as base 64 MIME. |
| Index | Unique | Fields | Notes |
| idx0 | | id | |
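Since the content field holds a PNG image encoded as base 64 MIME text, a client has to decode it before use. The following is a minimal sketch using only the Python standard library. The function name `decode_diagram` is hypothetical, and the check against the PNG file signature is an added safeguard, not something the database is known to do.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first eight bytes of every PNG file


def decode_diagram(content: str) -> bytes:
    """Decode a base 64 MIME string and verify that it looks like a PNG."""
    # b64decode ignores the line breaks that MIME encoding inserts.
    raw = base64.b64decode(content)
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("decoded content is not a PNG image")
    return raw


# Round trip with stand-in bytes (not a real image).
sample = base64.encodebytes(PNG_SIGNATURE + b"rest-of-image").decode("ascii")
image_bytes = decode_diagram(sample)
```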
DnaSequence
A DNA sequence (sometimes called a "contig") is a contiguous sequence of base pairs belonging to a single genome. The key of the DNA sequence is the genome ID followed by the contig ID.
DnaSequence Table
| Field | Type | Default | Description |
| id | name-string | n/a | Unique identifier for this DnaSequence. |
| length | counter | n/a | Number of base pairs in the DNA sequence. |
| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for DnaSequence. |
DnaSequenceBases Table
| Field | Type | Default | Description |
| id | name-string | n/a | Unique identifier for this DnaSequence. |
| bases | text | n/a | A string of letters representing the nucleotides of the sequence. |
| Index | Unique | Fields | Notes |
| idx0 | | id | |
Family
A family is a group of features united by a particular determination algorithm. The algorithm will frequently, but not always, signify a functional role.
Family Table
| Field | Type | Default | Description |
| id | name-string | n/a | Unique identifier for this Family. |
| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for Family. |
FcEvidenceSet
A functional coupling evidence set indicates evidence for a functional connection between protein sequence pairs. The protein sequences possessing the connection are the ones that participate in the evidence set's pairings.
The pairings for a particular evidence set will contain protein sequences that are significantly similar. In other words, if (A,B) and (X,Y) are both pairings in a single evidence set, then either (A =~ X) and (B =~ Y), or (A =~ Y) and (B =~ X), depending on the value of the "inverted" attribute of the IsDeterminedBy relationship. Essentially, a pairing in its own right is unordered.
If (A,B) is a pair, then so is (B,A). However, the evidence set maintains a correspondence between its pairs that is ordered, because the constituent pairs must match. The direction in which a pair matches others in the set is an attribute of the relationship from the pairs to the sets.
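The ordered correspondence can be illustrated with a small sketch. It assumes that an "inverted" value of true means the stored pair must be reversed to line up with the other pairs in the set; the actual convention is not stated here. The function name `oriented` and the sample data are hypothetical.

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def oriented(pair: Pair, inverted: bool) -> Pair:
    """Return the pair in the order used for matching within its evidence set."""
    first, second = pair
    return (second, first) if inverted else (first, second)


# Hypothetical evidence set: (pairing, inverted flag) for each member.
members: List[Tuple[Pair, bool]] = [(("A", "B"), False), (("Y", "X"), True)]

# After orientation, each column holds proteins that should be mutually similar.
columns = list(zip(*(oriented(pair, inv) for pair, inv in members)))
```

Here the stored pairing (Y,X) is flagged as inverted, so it lines up as (X,Y) against (A,B): the first column pairs A with X and the second pairs B with Y.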
FcEvidenceSet Table
| Field | Type | Default | Description |
| id | int | n/a | Unique identifier for this FcEvidenceSet. |
| score | int | n/a | Score for this evidence set. The score indicates the number of significantly different genomes represented by the pairings. |
| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for FcEvidenceSet. |
Feature
A feature (sometimes also called a gene) is a part of a genome that is of special interest. Features may be spread across multiple DNA sequences (contigs) of a genome, but never across more than one genome. Each feature in the database has a unique FIG ID.
Feature Table
| Field | Type | Default | Description |
| id | id-string | n/a | Unique identifier for this Feature. |
| feature-type | id-string | n/a | Code indicating the type of this feature. Among the codes currently supported are "peg" for a protein encoding gene, "bs" for a binding site, "opr" for an operon, and so forth. |
| sequence-length | counter | n/a | Number of base pairs in this feature. |
| function | text | n/a | Functional assignment for this feature. This will often indicate the feature's functional role or roles, and may also have comments. |
| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for Feature. |
FeatureEssential Table
| Field | Type | Default | Description |
| id | id-string | n/a | Unique identifier for this Feature. |
| essential | text | n/a | A value indicating the essentiality of the feature, coded as HTML. In most cases, this will be a word describing whether the essentiality is confirmed (essential) or potential (potential-essential), hyperlinked to the document from which the essentiality was curated. If a feature is not essential, this field will have no values; otherwise, it may have multiple values. |
| Index | Unique | Fields | Notes |
| idx0 | | id | |
FeatureEvidence Table
| Field | Type | Default | Description |
| id | id-string | n/a | Unique identifier for this Feature. |
| evidence-code | string | n/a | An evidence code describes the possible evidence that exists for deciding a feature's functional assignment. A feature may have no evidence, a single evidence code, or several. |
| Index | Unique | Fields | Notes |
| idx0 | | id | |
FeatureLink Table
| Field | Type | Default | Description |
| id | id-string | n/a | Unique identifier for this Feature. |
| link | text | n/a | Web hyperlink for this feature. A feature can have no hyperlinks or it can have many. The links are to other websites that have useful information about the gene that the feature represents, and are coded as raw HTML, using an anchor href tag. |
| Index | Unique | Fields | Notes |
| idx0 | | id | |
FeatureVirulent Table
| Field | Type | Default | Description |
| id | id-string | n/a | Unique identifier for this Feature. |
| virulent | text | n/a | A value indicating the virulence of the feature, coded as HTML. In most cases, this will be a phrase or SA number hyperlinked to the document from which the virulence information was curated. If the feature is not virulent, this field will have no values; otherwise, it may have multiple values. |
| Index | Unique | Fields | Notes |
| idx0 | | id | |
Genome
A genome represents a specific organism with DNA, or a specific meta-genome. All DNA sequences in the database belong to genomes.
- Genome IsMadeUpOf DnaSequence (one-to-many)
- Genome IsOwnerOf Feature (one-to-many)
- TaxonomicGrouping IsTaxonomyOf Genome (one-to-many)
- Genome Uses MolecularMachine (one-to-many)
Genome Table
| Field | Type | Default | Description |
| id | name-string | n/a | Unique identifier for this Genome. |
| complete | boolean | n/a | TRUE if the genome is complete, else FALSE. |
| contigs | int | n/a | Number of contigs for this organism. |
| dna-size | counter | n/a | Number of base pairs in the genome. |
| domain | name-string | n/a | Domain for this genome or taxonomic classification. The domain is the highest level of the taxonomy tree. |
| full-name | name-string | n/a | Full genus/species/strain name of the genome. |
| pegs | int | n/a | Number of protein encoding genes for this organism. |
| primary-group | name-string | n/a | The primary NMPDR group for this organism. There is always exactly one NMPDR group per organism. An empty string indicates the organism is supporting. In general, more data is kept on organisms in NMPDR groups than on supporting organisms. |
| rnas | int | n/a | Number of RNA features found for this organism. |
| version | name-string | n/a | Version string for this genome, generally consisting of the genome ID followed by a period and a string of digits. |
| Index | Unique | Fields | Notes |
| idx0 | | primary-group, full-name | This index allows the applications to find all genomes associated with a specific primary (NMPDR) group. |
| idx1 | | full-name | This index allows the applications to find all genomes in lexical order by name. |
| idx2 | yes | id | Primary index for Genome. |
Identifier
An identifier is an alternate name for a feature or protein sequence.
Some identifiers name features or protein sequences that do not exist in the database. In this case, the feature or protein sequence is considered external; that is, it belongs to another database.
Identifier Table
| Field | Type | Default | Description |
| id | string | n/a | Unique identifier for this Identifier. |
| source | key-string | n/a | Specific type of the identifier, such as its source database or category. The type can usually be decoded to convert the identifier to a URL. |
| Index | Unique | Fields | Notes |
| idx0 | | source | This index allows all the identifiers of a specified type to be located. |
| idx1 | yes | id | Primary index for Identifier. |
IdentifierSet
The identifier set is a group of identifiers that mean the same thing, usually either a Feature or a Protein Sequence. The identifiers in a set will frequently belong to different genomic databases. Thus, if a specific protein sequence has one name in the NMPDR and another name in RefSeq, both of the names would be in the same identifier set.
IdentifierSet Table
| Field | Type | Default | Description |
| id | name-string | n/a | Unique identifier for this IdentifierSet. |
| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for IdentifierSet. |
MachineRole
A machine role represents a role as it occurs in a molecular machine. The key is the machine key plus the role abbreviation.
The machine role corresponds to a cell on the subsystem spreadsheet. Features in the subsystem are assigned directly to the machine role.
MachineRole Table
| Field | Type | Default | Description |
| id | name-string | n/a | Unique identifier for this MachineRole. |
| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for MachineRole. |
MolecularMachine
A molecular machine is a collection of features that implements a metabolic pathway. Machines are the physical instances of variants. Each machine corresponds to a row in a subsystem spreadsheet. The key is the variant key followed by a colon and the Genome ID.
- Variant IsImplementedBy MolecularMachine (one-to-many)
- MolecularMachine IsMachineOf MachineRole (one-to-many)
- Genome Uses MolecularMachine (one-to-many)
MolecularMachine Table
| Field | Type | Default | Description |
| id | key-string | n/a | Unique identifier for this MolecularMachine. |
| type | key-string | n/a | The machine type indicates how it relates to the parent variant. A type of "vacant" means that the machine does not appear to actually exist in the organism. A type of "incomplete" means that the machine appears to be missing many reactions. In all other cases, the type is "normal". |
| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for MolecularMachine. |
Pairing
A pairing indicates that two protein sequences are found close together on one or more DNA sequences. Not all possible pairings are stored in the database; only those considered significant for annotation purposes are kept. The key of the pairing is the concatenation of the protein sequence keys in alphabetical order.
Because the protein sequence key is a hash of the sequence letters, the key of a pairing between two sequences is computable from the sequences themselves. Theoretically, the pairing is unordered: (A,B) and (B,A) are the same pairing. It is frequently the case, however, that we need to refer to the "first" or "second" protein in the pairing. When this happens, the first one is always the protein with the alphabetically lesser key. The IsInPair relationship automatically shows the proteins in this order.
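Because the key depends only on the two sequences, it can be computed without consulting the database. The sketch below shows the idea under two loudly stated assumptions: the hash is taken to be an MD5 hex digest of the upper-cased sequence letters, and the two keys are joined with a colon. Neither the hash algorithm nor any separator is specified here, and the names `protein_key` and `pairing_key` are hypothetical.

```python
import hashlib


def protein_key(sequence: str) -> str:
    """Hash the sequence letters (assumed here: MD5 hex digest of upper case)."""
    return hashlib.md5(sequence.upper().encode("ascii")).hexdigest()


def pairing_key(seq_a: str, seq_b: str, separator: str = ":") -> str:
    """Concatenate the two protein keys in alphabetical order.

    The separator is an assumption made for readability.
    """
    first, second = sorted((protein_key(seq_a), protein_key(seq_b)))
    return f"{first}{separator}{second}"


# Argument order does not matter: (A,B) and (B,A) yield the same key.
same = pairing_key("MKTAYIAK", "MALWMRLL") == pairing_key("MALWMRLL", "MKTAYIAK")
```

Sorting the two keys before joining them is what makes the pairing unordered while still giving a well-defined "first" and "second" protein.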
Pairing Table
| Field | Type | Default | Description |
| id | name-string | n/a | Unique identifier for this Pairing. |
| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for Pairing. |
ProteinSequence
A protein sequence is a specific sequence of amino acids. Unlike a DNA sequence, a protein sequence does not belong to a genome. Identical proteins generated by different genomes are generally stored as a single ProteinSequence instance. The key is a hash of the protein letter sequence.
- ProteinSequence Catalyzes Role (many-to-many)
- Publication Concerns ProteinSequence (many-to-many)
- ProteinSequence Exposes Structure (many-to-many)
- ProteinSequence IsSequenceFor Identifier (one-to-many)
ProteinSequence Table
| Field | Type | Default | Description |
| id | hash-string | n/a | Unique identifier for this ProteinSequence. |
| isoelectric-point | float | n/a | pH in the surrounding medium at which the charge on a protein is neutral. If the pH of the medium is lower than this value, the protein will have a net positive charge. If the pH of the medium is higher, then the protein will have a net negative charge. |
| molecular-weight | float | n/a | Molecular weight of this feature's protein, in daltons. A weight of 0 indicates that no protein is created. |
| sequence | dna | n/a | The sequence contains the letters corresponding to the protein's amino acids. |
| signal-peptide | name-string | n/a | The signal peptide location for this feature. This is expressed as start and end numbers with a hyphen for the relevant amino acids. So, "1-22" would indicate a signal peptide at the beginning of the feature's protein and extending through 22 amino acid positions. An empty string means no signal peptide is present. |
| similar-to-human | boolean | n/a | TRUE if this feature generates a protein that is similar to one found in humans, else FALSE. |
| transmembrane-map | text | n/a | A map indicating which sections of a protein will be embedded in a membrane. This is expressed as a comma-separated list of start and end numbers with hyphens for the relevant amino acids. So, "10-12, 40-60" would indicate that there are two sections of the protein that become embedded in a membrane: the 10th through 12th amino acids, and the 40th through the 60th. An empty string means no transmembrane regions are known. |
| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for ProteinSequence. |
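The signal-peptide and transmembrane-map fields share the same "start-end" notation, so one small parser can read both. The following is a minimal sketch; the function name `parse_ranges` is hypothetical and the code assumes the values are always well formed.

```python
from typing import List, Tuple


def parse_ranges(value: str) -> List[Tuple[int, int]]:
    """Parse a value such as "10-12, 40-60" into (start, end) tuples.

    An empty string yields an empty list, meaning no region is known.
    """
    ranges = []
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue  # skip the empty piece produced by an empty field
        start, end = part.split("-")
        ranges.append((int(start), int(end)))
    return ranges


transmembrane = parse_ranges("10-12, 40-60")
signal_peptide = parse_ranges("1-22")
nothing_known = parse_ranges("")
```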
ProteinSequenceIEDB Table
| Field | Type | Default | Description |
| id | hash-string | n/a | Unique identifier for this ProteinSequence. |
| iedb | text | n/a | A value indicating whether or not the feature can be found in the Immune Epitope Database. If the feature has not been matched to that database, this field will have no values. Otherwise, it will have an epitope name and/or sequence, hyperlinked to the database. |
| Index | Unique | Fields | Notes |
| idx0 | | id | |
Publication
A publication is an article or citation that may be used as evidence for assertions made in the database. The key is a hash code computed from the URL.
- Publication Concerns ProteinSequence (many-to-many)
Publication Table
| Field | Type | Default | Description |
| id | hash-string | n/a | Unique identifier for this Publication. |
| url | string | n/a | URL of the article or of its citation. |
| citation | text | n/a | Citation string for the article. |
| Index | Unique | Fields | Notes |
| idx0 | | citation | This index allows searching for the article by the author names and title. |
| idx1 | yes | id | Primary index for Publication. |
Reaction
A reaction is a chemical process that converts one set of compounds (substrates) to another set (products). The reaction ID is generally a small number preceded by a letter.
Reaction Table
| Field | Type | Default | Description |
| id | key-string | n/a | Unique identifier for this Reaction. |
| rev | boolean | n/a | TRUE if this reaction is reversible, else FALSE |
| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for Reaction. |
ReactionURL Table
| Field | Type | Default | Description |
| id | key-string | n/a | Unique identifier for this Reaction. |
| url | string | n/a | HTML string containing a link to a web location that describes the reaction. This field is optional. |
| Index | Unique | Fields | Notes |
| idx0 | | id | |
Role
A role describes a biological function that may be fulfilled by a feature. One of the main goals of the database is to assign features to roles. Most roles are effected by the construction of proteins. Some, however, deal with functional regulation and message transmission.
A role represents a single gene function. Many roles are in subsystems, but some are not. If a feature has multiple functions, each is represented as a separate role.
Role Table
| Field | Type | Default | Description |
| id | hash-string | n/a | Unique identifier for this Role. |
| hypothetical | boolean | n/a | TRUE if a role is hypothetical, else FALSE |
| name | string | n/a | English name of this role. The actual role ID is computed from this field. |
| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for Role. |
RoleSet
A role set is a group of roles that work together to stimulate a reaction. Most role sets consist of a single role; however, some reactions require the presence of multiple roles to get them started.
A reaction is usually triggered by a single role, but some reactions are triggered by a boolean combination of roles (e.g. (A and (B or C) and D) or (E and B and F) or G). The boolean expression can be converted into disjunctive normal form, which is a list of alternative sets (e.g. (A and B and D) or (A and C and D) or (E and B and F) or G). Each alternative is then converted into a role set. This allows us to precisely represent the triggering conditions of a reaction in the database.
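The conversion to disjunctive normal form can be sketched in a few lines. This is an illustration only, not the loader's actual code: the nested-tuple encoding of the expression and the function name `role_sets` are assumptions made for the example.

```python
from itertools import product
from typing import FrozenSet, List, Tuple, Union

# A role name, or a tuple ("and" | "or", operand, operand, ...).
Expr = Union[str, Tuple]


def role_sets(expr: Expr) -> List[FrozenSet[str]]:
    """Expand a boolean role expression into its list of alternative role sets."""
    if isinstance(expr, str):
        return [frozenset([expr])]
    op, *operands = expr
    parts = [role_sets(operand) for operand in operands]
    if op == "or":
        # Alternatives of an "or" are the alternatives of every operand.
        result = [alt for part in parts for alt in part]
    elif op == "and":
        # An "and" needs one alternative from each operand, merged together.
        result = [frozenset().union(*combo) for combo in product(*parts)]
    else:
        raise ValueError(f"unknown operator: {op!r}")
    unique: List[FrozenSet[str]] = []
    for alt in result:  # drop duplicate alternatives, keeping order
        if alt not in unique:
            unique.append(alt)
    return unique


# (A and (B or C) and D) or (E and B and F) or G
expr = ("or", ("and", "A", ("or", "B", "C"), "D"), ("and", "E", "B", "F"), "G")
alternatives = role_sets(expr)
```

For the expression in the text, this yields four alternatives: {A, B, D}, {A, C, D}, {E, B, F}, and {G}, each of which would become one role set.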
RoleSet Table
| Field | Type | Default | Description |
| id | int | n/a | Unique identifier for this RoleSet. |
| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for RoleSet. |
Scenario
A scenario is a partial instance of a subsystem with a defined set of reactions. Each scenario converts input compounds to output compounds using reactions. The scenario may use all of the reactions controlled by a subsystem or only some, and may also incorporate additional reactions.
Scenario Table
| Field | Type | Default | Description |
| id | string | n/a | Unique identifier for this Scenario. |
| Index | Unique | Fields | Notes |
| idx0 | yes | id | Primary index for Scenario. |
Structure
A structure is the geometrical representation of a protein sequence. A single protein sequence may have multiple structural representations, either because it is folded in different ways or because there are alternative representation formats. The key field is the representation type (e.g. PDB, SCOPE) followed by the ID, with an intervening vertical bar.
- Structure Attracts Compound (many-to-many)
- ProteinSequence Exposes Structure (many-to-many)
Structure Table
| Field |
Type |
Default |
Description |
| id |
name-string |
|