The Sapling database is a portable database that covers three realms of bioinformatics. It is currently a work in progress and subject to change; however, one of our near-term goals is to make it generally available on our FTP site. In the diagram below, red indicates DNA sequencing data, blue indicates genome annotation data, and green indicates biochemistry data.
The Sapling database is a distributable, self-contained copy of the NMPDR data. Unlike Sprout, which is optimized for searching, Sapling is designed to be structurally simple without sacrificing the ability to find information quickly.
Issues
We may want to do some compression on the "dna" data type.
Must add back the ability to index a secondary relation. Note that such indexes can only have a single field.
We probably need some type tables that describe things like Identifier(source) or Family(kind).
The ERDB documentation needs to be updated to include DisplayInfo, Asides, the "converse" attribute for relationships, and the Shapes section.
Entities
Annotation
An annotation is a comment attached to a feature. Annotations are used to
track the history of a feature's functional assignments and any related issues. The
key is the feature ID followed by a colon and a complemented eight-digit sequence number.
The complemented sequence number causes the annotations to sort with the most recent one
first.
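The key scheme above can be sketched as follows. This is a minimal illustration, not the database's actual loader code: it assumes the complement is taken against 99999999 (the largest eight-digit number), and the feature ID shown is a hypothetical example.

```python
# Assumption: the "complemented" number is 99999999 minus the sequence number.
MAX_SEQ = 99_999_999


def annotation_key(feature_id: str, sequence_number: int) -> str:
    """Build an annotation key: feature ID, colon, complemented 8-digit number."""
    if not 0 <= sequence_number <= MAX_SEQ:
        raise ValueError("sequence number must fit in eight digits")
    complemented = MAX_SEQ - sequence_number
    return f"{feature_id}:{complemented:08d}"


# Hypothetical feature ID, used only for illustration.
feature = "fig|83333.1.peg.4"
keys = [annotation_key(feature, n) for n in (1, 2, 3)]

# A plain ascending sort now lists the most recent annotation first.
newest_first = sorted(keys)
```

Because every key is zero-padded to the same width, an ordinary string sort is enough to produce the most-recent-first ordering.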
Compound
A compound is a chemical that participates in a reaction. All compounds have a unique ID and may also have one or more names. Both ligands and reaction components are treated as compounds.
Primary name of the compound. This is the name used in reaction display strings.
| Index | Unique | Fields | Notes |
|---|---|---|---|
| idx0 | yes | id | Primary index for Compound. |
CompoundCAS Table
| Field | Type | Default | Description |
|---|---|---|---|
| id | name-string | n/a | Unique identifier for this Compound. |
| cas-id | string | n/a | The Chemical Abstract Service ID for the compound. A compound may have at most one CAS ID. |

| Index | Unique | Fields | Notes |
|---|---|---|---|
| idx0 | | cas-id | This index allows searching for compounds by CAS ID. |
| idx1 | | id | |
CompoundName Table
| Field | Type | Default | Description |
|---|---|---|---|
| id | name-string | n/a | Unique identifier for this Compound. |
| name | string | n/a | Alternate name for the compound. A compound may have many alternate names. The primary name should also be one of the alternate names. |

| Index | Unique | Fields | Notes |
|---|---|---|---|
| idx0 | | name | This index allows searching for compounds by name. |
| idx1 | | id | |
CompoundZinc Table
| Field | Type | Default | Description |
|---|---|---|---|
| id | name-string | n/a | Unique identifier for this Compound. |
| zinc-id | string | n/a | The ZINC database ID for the compound. A compound may have at most one ZINC ID. |

| Index | Unique | Fields | Notes |
|---|---|---|---|
| idx0 | | zinc-id | This index allows searching for compounds by ZINC ID. |
| idx1 | | id | |
Diagram
A functional diagram describes a network of chemical reactions, often comprising a single subsystem. A diagram is identified by a short name and contains a longer descriptive name.
The content of the diagram, in PNG format encoded as base 64 MIME.
| Index | Unique | Fields | Notes |
|---|---|---|---|
| idx0 | | id | |
DnaSequence
A DNA sequence (sometimes called a "contig") is a contiguous sequence of base pairs belonging to a single genome. The key of the DNA sequence is the genome ID followed by the contig ID.
A string of letters representing the nucleotides of the sequence.
| Index | Unique | Fields | Notes |
|---|---|---|---|
| idx0 | | id | |
Family
A family is a group of features united by a particular determination algorithm. The algorithm will frequently, but not always, signify a functional role.
FcEvidenceSet
A functional coupling evidence set indicates evidence for a functional connection between protein sequence pairs. The protein sequences possessing the connection are the ones that participate in the evidence set's pairings.
The pairings for a particular evidence set will contain protein sequences that are significantly similar. In other words, if (A,B) and (X,Y) are both pairings in a single evidence set, then (A =~ X) and (B =~ Y) or (A =~ Y) and (B =~ X), depending on the value of the "inverted" attribute of the IsDeterminedBy relationship. Essentially, a pairing in its own right is unordered.
If (A,B) is a pair, then so is (B,A). However, the evidence set maintains a correspondence
between its pairs that is ordered, because the constituent pairs must match. The
direction in which a pair matches others in the set is an attribute of the relationship from the pairs
to the sets.
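One way to picture the role of the "inverted" attribute is the sketch below. It is only an illustration under stated assumptions: the function names and the in-memory representation (a pair plus a boolean flag) are hypothetical, and the protein names are placeholders rather than real keys.

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def orient(pair: Pair, inverted: bool) -> Pair:
    """Return the pair in the evidence set's orientation.

    Assumption: an inverted pair matches the rest of the set with its
    two proteins swapped.
    """
    first, second = pair
    return (second, first) if inverted else (first, second)


def aligned_columns(members: List[Tuple[Pair, bool]]) -> Tuple[List[str], List[str]]:
    """Split the oriented pairs into two columns of mutually similar proteins."""
    oriented = [orient(pair, inverted) for pair, inverted in members]
    firsts = [p[0] for p in oriented]
    seconds = [p[1] for p in oriented]
    return firsts, seconds


# (A,B) is stored as-is; (Y,X) is stored inverted, so it matches as (X,Y).
members = [(("A", "B"), False), (("Y", "X"), True)]
firsts, seconds = aligned_columns(members)
```

After orientation, the proteins in the first column (A and X) are the ones expected to be similar to each other, and likewise for the second column (B and Y).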
Score for this evidence set. The score indicates the number of significantly different genomes represented by the pairings.
| Index | Unique | Fields | Notes |
|---|---|---|---|
| idx0 | yes | id | Primary index for FcEvidenceSet. |
Feature
A feature (sometimes also called a gene) is a part of a genome that is of special interest. Features may be spread across multiple DNA sequences (contigs) of a genome, but never across more than one genome. Each feature in the database has a unique FIG ID.
Code indicating the type of this feature. Among the codes currently supported are "peg" for a protein encoding gene, "bs" for a binding site, "opr" for an operon, and so forth.
| Field | Type | Default | Description |
|---|---|---|---|
| sequence-length | counter | n/a | Number of base pairs in this feature. |
| function | text | n/a | Functional assignment for this feature. This will often indicate the feature's functional role or roles, and may also have comments. |

| Index | Unique | Fields | Notes |
|---|---|---|---|
| idx0 | yes | id | Primary index for Feature. |
FeatureEssential Table
| Field | Type | Default | Description |
|---|---|---|---|
| id | id-string | n/a | Unique identifier for this Feature. |
| essential | text | n/a | A value indicating the essentiality of the feature, coded as HTML. In most cases, this will be a word describing whether the essentiality is confirmed (essential) or potential (potential-essential), hyperlinked to the document from which the essentiality was curated. If a feature is not essential, this field will have no values; otherwise, it may have multiple values. |

| Index | Unique | Fields | Notes |
|---|---|---|---|
| idx0 | | id | |
FeatureEvidence Table
| Field | Type | Default | Description |
|---|---|---|---|
| id | id-string | n/a | Unique identifier for this Feature. |
| evidence-code | string | n/a | An evidence code describes the possible evidence that exists for deciding a feature's functional assignment. A feature may have no evidence, a single evidence code, or several. |

| Index | Unique | Fields | Notes |
|---|---|---|---|
| idx0 | | id | |
FeatureLink Table
| Field | Type | Default | Description |
|---|---|---|---|
| id | id-string | n/a | Unique identifier for this Feature. |
| link | text | n/a | Web hyperlink for this feature. A feature can have no hyperlinks or it can have many. The links are to other websites that have useful information about the gene that the feature represents, and are coded as raw HTML, using an anchor href tag. |

| Index | Unique | Fields | Notes |
|---|---|---|---|
| idx0 | | id | |
FeatureVirulent Table
| Field | Type | Default | Description |
|---|---|---|---|
| id | id-string | n/a | Unique identifier for this Feature. |
| virulent | text | n/a | A value indicating the virulence of the feature, coded as HTML. In most cases, this will be a phrase or SA number hyperlinked to the document from which the virulence information was curated. If the feature is not virulent, this field will have no values; otherwise, it may have multiple values. |

| Index | Unique | Fields | Notes |
|---|---|---|---|
| idx0 | | id | |
Genome
A genome represents a specific organism with DNA, or a specific meta-genome. All DNA
sequences in the database belong to genomes.
Domain for this genome or taxonomic classification. The domain is the highest level of the taxonomy tree.
| Field | Type | Default | Description |
|---|---|---|---|
| full-name | name-string | n/a | Full genus/species/strain name of the genome. |
| pegs | int | n/a | Number of protein encoding genes for this organism. |
| primary-group | name-string | n/a | The primary NMPDR group for this organism. There is always exactly one NMPDR group per organism. An empty string indicates the organism is supporting. In general, more data is kept on organisms in NMPDR groups than on supporting organisms. |
| rnas | int | n/a | Number of RNA features found for this organism. |
| version | name-string | n/a | Version string for this genome, generally consisting of the genome ID followed by a period and a string of digits. |

| Index | Unique | Fields | Notes |
|---|---|---|---|
| idx0 | | primary-group, full-name | This index allows the applications to find all genomes associated with a specific primary (NMPDR) group. |
| idx1 | | full-name | This index allows the applications to find all genomes in lexical order by name. |
| idx2 | yes | id | Primary index for Genome. |
Identifier
An identifier is an alternate name for a feature or protein sequence.
Some identifiers name features or protein sequences that do not exist in the database. In this case, the feature or protein sequence is considered external; that is, it belongs to another database.
Specific type of the identifier, such as its source database or category. The type can usually be decoded to convert the identifier to a URL.
| Index | Unique | Fields | Notes |
|---|---|---|---|
| idx0 | | source | This index allows all the identifiers of a specified type to be located. |
| idx1 | yes | id | Primary index for Identifier. |
IdentifierSet
The identifier set is a group of identifiers that mean the same thing, usually either a Feature or a Protein Sequence. The identifiers in a set will frequently belong to different genomic databases. Thus, if a specific protein sequence has one name in the NMPDR and another name in RefSeq, both of the names would be in the same identifier set.
A machine role represents a role as it occurs in a molecular machine. The key is the machine key plus the role abbreviation.
The machine role corresponds to a cell on the subsystem spreadsheet. Features in the subsystem are assigned directly to the machine role.
MolecularMachine
A molecular machine is a collection of features that implements a metabolic pathway. Machines are the physical instances of variants. Each machine corresponds to a row in a subsystem spreadsheet. The key is the variant key followed by a colon and the Genome ID.
The machine type indicates how it relates to the parent variant. A type of "vacant" means that the machine does not appear to actually exist in the organism. A type of "incomplete" means that the machine appears to be missing many reactions. In all other cases, the type is "normal".
| Index | Unique | Fields | Notes |
|---|---|---|---|
| idx0 | yes | id | Primary index for MolecularMachine. |
Pairing
A pairing indicates that two protein sequences are found close together on one or more DNA sequences. Not all possible pairings are stored in the database; only those that are considered for some reason to be significant for annotation purposes. The key of the pairing is the concatenation of the protein sequence keys in alphabetical order.
Because the protein sequence key is a hash of the sequence letters, the key of a pairing between two sequences is computable from the sequences themselves. Theoretically, the pairing is unordered: (A,B) and (B,A) are the same pairing. It is frequently the case, however, that we need to refer to the "first" or "second" protein in the pairing. When this happens, the first one is always the protein with the alphabetically lesser key. The IsInPair relationship automatically shows the proteins in this order.
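The key computation described above can be sketched as follows. The text does not say which hash function is used for protein sequence keys, so MD5 here is purely an assumption chosen for illustration, as is joining the two keys with no separator.

```python
import hashlib
from typing import Tuple


def protein_key(sequence: str) -> str:
    """Hash the sequence letters into a key.

    Assumption: MD5 hex digest; the real hash function is not stated here.
    """
    return hashlib.md5(sequence.encode("ascii")).hexdigest()


def ordered_members(key_a: str, key_b: str) -> Tuple[str, str]:
    """Return (first, second): the alphabetically lesser key comes first."""
    first, second = sorted((key_a, key_b))
    return first, second


def pairing_key(key_a: str, key_b: str) -> str:
    """Concatenate the two protein keys in alphabetical order."""
    first, second = ordered_members(key_a, key_b)
    return first + second


# Placeholder amino-acid strings, not real proteins.
key_1 = protein_key("MKTAYIAKQR")
key_2 = protein_key("MSDNELLQRA")

# The pairing is unordered, so both argument orders give the same key.
same_key = pairing_key(key_1, key_2) == pairing_key(key_2, key_1)
```

Sorting before concatenating is what makes (A,B) and (B,A) resolve to a single stored pairing while still giving a well-defined "first" and "second" protein.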
ProteinSequence
A protein sequence is a specific sequence of amino acids. Unlike a DNA sequence, a protein sequence does not belong to a genome. Identical proteins generated by different genomes are generally stored as a single ProteinSequence instance. The key is a hash of the protein letter sequence.
pH in the surrounding medium at which the charge on a protein is neutral. If the pH of the medium is lower than this value, the protein will have a net positive charge. If the pH of the medium is higher, then the protein will have a net negative charge.
| Field | Type | Default | Description |
|---|---|---|---|
| molecular-weight | float | n/a | Molecular weight of this feature's protein, in daltons. A weight of 0 indicates that no protein is created. |
| sequence | dna | n/a | The sequence contains the letters corresponding to the protein's amino acids. |
| signal-peptide | name-string | n/a | The signal peptide location for this feature. This is expressed as start and end numbers with a hyphen for the relevant amino acids. So, "1-22" would indicate a signal peptide at the beginning of the feature's protein and extending through 22 amino acid positions. An empty string means no signal peptide is present. |
| similar-to-human | boolean | n/a | TRUE if this feature generates a protein that is similar to one found in humans, else FALSE. |
| transmembrane-map | text | n/a | A map indicating which sections of a protein will be embedded in a membrane. This is expressed as a comma-separated list of start and end numbers with hyphens for the relevant amino acids. So, "10-12, 40-60" would indicate that there are two sections of the protein that become embedded in a membrane: the 10th through 12th amino acids, and the 40th through the 60th. An empty string means no transmembrane regions are known. |

| Index | Unique | Fields | Notes |
|---|---|---|---|
| idx0 | yes | id | Primary index for ProteinSequence. |
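The signal-peptide and transmembrane-map fields share a simple "start-end" notation. The sketch below shows one way a client might decode them; the function names are hypothetical and are not part of the database API.

```python
from typing import List, Optional, Tuple

Range = Tuple[int, int]


def parse_range(text: str) -> Range:
    """Parse one "start-end" item, such as "1-22", into a pair of integers."""
    start, end = text.strip().split("-")
    return int(start), int(end)


def parse_signal_peptide(value: str) -> Optional[Range]:
    """Return the signal peptide range, or None when the field is empty."""
    return parse_range(value) if value.strip() else None


def parse_transmembrane_map(value: str) -> List[Range]:
    """Return every membrane-embedded range; an empty string gives an empty list."""
    return [parse_range(part) for part in value.split(",") if part.strip()]


signal = parse_signal_peptide("1-22")
regions = parse_transmembrane_map("10-12, 40-60")
```

Both ends of each range are amino acid positions, so the second region in the example covers the 40th through the 60th amino acids.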
ProteinSequenceIEDB Table
| Field | Type | Default | Description |
|---|---|---|---|
| id | hash-string | n/a | Unique identifier for this ProteinSequence. |
| iedb | text | n/a | A value indicating whether or not the feature can be found in the Immune Epitope Database. If the feature has not been matched to that database, this field will have no values. Otherwise, it will have an epitope name and/or sequence, hyperlinked to the database. |

| Index | Unique | Fields | Notes |
|---|---|---|---|
| idx0 | | id | |
Publication
A publication is an article or citation that may be used as evidence for assertions made in the database. The key is a hash code computed from the URL.
This index allows searching for the article by the author names and title.
idx1
yes
id
Primary index for Publication.
Reaction
A reaction is a chemical process that converts one set of compounds (substrates) to another set (products). The reaction ID is generally a small number preceded by a letter.
HTML string containing a link to a web location that describes the reaction. This field is optional.
| Index | Unique | Fields | Notes |
|---|---|---|---|
| idx0 | | id | |
Role
A role describes a biological function that may be fulfilled by a feature. One of the main goals of the database is to assign features to roles. Most roles are effected by the construction of proteins. Some, however, deal with functional regulation and message transmission.
A role represents a single gene function. Many roles are in
subsystems, but some are not. If a feature has multiple functions, each
is represented as a separate role.
English name of this role. The actual role ID is computed from this field.
| Index | Unique | Fields | Notes |
|---|---|---|---|
| idx0 | yes | id | Primary index for Role. |
RoleSet
A role set is a group of roles that work together to stimulate a reaction. Most role sets consist of a single
role; however, some reactions require the presence of multiple roles to get them started.
A reaction is usually triggered by a single role, but some reactions are triggered
by a boolean combination of roles (e.g. (A and (B or C) and D) or (E and B and F) or G). The boolean
expression can be converted into disjunctive normal form, which is a list of alternative sets (e.g. (A and B and D) or (A and C and D) or (E and B and F) or G). Each alternative is then converted
into a role set. This allows us to precisely represent the triggering conditions of a reaction in the database.
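The conversion to disjunctive normal form can be sketched as follows. The nested-tuple representation of the boolean expression is an assumption made for this example; the text does not describe how the expressions are actually stored or parsed.

```python
from itertools import product
from typing import FrozenSet, List, Union

# Assumed representation: a role name, or a tuple ("and" | "or", operand, ...).
Expr = Union[str, tuple]


def to_role_sets(expr: Expr) -> List[FrozenSet[str]]:
    """Expand a boolean role expression into its alternative role sets (DNF)."""
    if isinstance(expr, str):
        return [frozenset([expr])]
    op, *operands = expr
    operand_sets = [to_role_sets(operand) for operand in operands]
    if op == "or":
        # Any alternative of any operand is an alternative of the whole.
        result = [s for sets in operand_sets for s in sets]
    elif op == "and":
        # Pick one alternative from every operand and merge the roles.
        result = [frozenset().union(*combo) for combo in product(*operand_sets)]
    else:
        raise ValueError(f"unknown operator: {op!r}")
    unique: List[FrozenSet[str]] = []
    for role_set in result:
        if role_set not in unique:  # drop duplicate alternatives, keep order
            unique.append(role_set)
    return unique


# (A and (B or C) and D) or (E and B and F) or G
expr = ("or", ("and", "A", ("or", "B", "C"), "D"), ("and", "E", "B", "F"), "G")
role_sets = to_role_sets(expr)
```

For the example expression this yields four role sets: {A, B, D}, {A, C, D}, {E, B, F}, and {G}. The reaction is considered triggered when every role in at least one of those sets is present.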