The Sapling database is a portable database that covers three realms of bioinformatics. It is currently a work in progress and subject to change; however, one of our near-term goals is to make it generally available on our FTP site. In the diagram below, red indicates DNA sequencing data, blue indicates genome annotation data, and green indicates biochemistry data.

Introduction

The Sapling database is a distributable, self-contained copy of the NMPDR data. Unlike Sprout, which is optimized for searching, Sapling is designed to be structurally simple without sacrificing the ability to find information quickly.

Issues

  • We may want to do some compression on the "dna" data type.
  • Must add back the ability to index a secondary relation. Note that such indexes can only have a single field.
  • We probably need some type tables that describe things like Identifier(source) or Family(kind).
  • The ERDB documentation needs to be updated to include DisplayInfo, Asides, the "converse" attribute for relationships, and the Shapes section.

Entities

Annotation

An annotation is a comment attached to a feature. Annotations are used to track the history of a feature's functional assignments and any related issues. The key is the feature ID followed by a colon and a complemented eight-digit sequence number.

The complemented sequence number causes the annotations to sort with the most recent one first.
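
The sorting behavior described above can be sketched as follows. This is only an illustration: it assumes that "complemented" means subtracting the sequence number from 99999999, and the function name and the sample feature ID are hypothetical.

```python
# Assumption: the complement is taken against the largest eight-digit number.
MAX_SEQ = 99_999_999


def annotation_key(feature_id: str, sequence_number: int) -> str:
    """Build an annotation key: feature ID, a colon, and the complemented,
    zero-padded eight-digit sequence number."""
    if not 0 <= sequence_number <= MAX_SEQ:
        raise ValueError("sequence number must fit in eight digits")
    complemented = MAX_SEQ - sequence_number
    return f"{feature_id}:{complemented:08d}"


# Hypothetical feature ID; sequence numbers 1..3 stand for oldest..newest.
keys = [annotation_key("fig|83333.1.peg.4", n) for n in (1, 2, 3)]

# An ordinary ascending string sort now puts the newest annotation first.
newest_first = sorted(keys)
```

Because the complemented number shrinks as the sequence number grows, a plain ascending sort on the key returns the most recent annotation first without any descending-order index.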

Annotation Table

Field Type Default Description
id string n/a Unique identifier for this Annotation.
annotation-time date n/a Date and time at which the annotation was made.
annotator string n/a Name of the annotator who made the comment.
comment text n/a Text of the annotation.

Index Unique Fields Notes
idx0 yes id Primary index for Annotation.

Compound

A compound is a chemical that participates in a reaction. All compounds have a unique ID and may also have one or more names. Both ligands and reaction components are treated as compounds.

  • Structure Attracts Compound (many-to-many)
  • Reaction Involves Compound (many-to-many)
  • Compound IsTerminusFor Scenario (many-to-many)
  • Diagram Shows Compound (many-to-many)

Compound Table

Field Type Default Description
id name-string n/a Unique identifier for this Compound.
label string n/a Primary name of the compound. This is the name used in reaction display strings.

Index Unique Fields Notes
idx0 yes id Primary index for Compound.

CompoundCAS Table

Field Type Default Description
id name-string n/a Unique identifier for this Compound.
cas-id string n/a The Chemical Abstract Service ID for the compound. A compound may have at most one CAS ID.

Index Unique Fields Notes
idx0   cas-id This index allows searching for compounds by CAS ID.
idx1   id  

CompoundName Table

Field Type Default Description
id name-string n/a Unique identifier for this Compound.
name string n/a Alternate name for the compound. A compound may have many alternate names. The primary name should also be one of the alternate names.

Index Unique Fields Notes
idx0   name This index allows searching for compounds by name.
idx1   id  
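
The Compound tables above split multi-valued fields such as the alternate names into a separate table keyed by the compound ID. A minimal sketch of a lookup by name, using an in-memory SQLite database, might look like the following. The SQL table layout, the index names, and the sample rows are assumptions made for illustration; this document does not describe how ERDB actually names its SQL objects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Compound (id TEXT PRIMARY KEY, label TEXT);
    CREATE TABLE CompoundName (id TEXT, name TEXT);
    -- Mirrors idx0 (by name) and idx1 (by id) from the CompoundName table.
    CREATE INDEX CompoundName_idx0 ON CompoundName (name);
    CREATE INDEX CompoundName_idx1 ON CompoundName (id);
    """
)

# Made-up sample data: one compound with two alternate names.
conn.execute("INSERT INTO Compound VALUES (?, ?)", ("cpd-1", "H2O"))
conn.executemany(
    "INSERT INTO CompoundName VALUES (?, ?)",
    [("cpd-1", "H2O"), ("cpd-1", "Water")],
)


def find_compounds_by_name(connection, name):
    """Return (id, label) rows for every compound carrying the given name."""
    cursor = connection.execute(
        "SELECT c.id, c.label FROM CompoundName n "
        "JOIN Compound c ON c.id = n.id WHERE n.name = ?",
        (name,),
    )
    return cursor.fetchall()
```

Note that the primary name is also stored as an alternate name, so a single query against the name table finds a compound by either.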

CompoundZinc Table

Field Type Default Description
id name-string n/a Unique identifier for this Compound.
zinc-id string n/a The ZINC database ID for the compound. A compound may have at most one ZINC ID.

Index Unique Fields Notes
idx0   zinc-id This index allows searching for compounds by ZINC ID.
idx1   id  

Diagram

A functional diagram describes a network of chemical reactions, often comprising a single subsystem. A diagram is identified by a short name and contains a longer descriptive name.

Diagram Table

Field Type Default Description
id name-string n/a Unique identifier for this Diagram.
name text n/a Descriptive name of this diagram.

Index Unique Fields Notes
idx0 yes id Primary index for Diagram.

DiagramContent Table

Field Type Default Description
id name-string n/a Unique identifier for this Diagram.
content image n/a The content of the diagram, in PNG format encoded as base 64 MIME.

Index Unique Fields Notes
idx0   id  

DnaSequence

A DNA sequence (sometimes called a "contig") is a contiguous sequence of base pairs belonging to a single genome. The key of the DNA sequence is the genome ID followed by the contig ID.

DnaSequence Table

Field Type Default Description
id name-string n/a Unique identifier for this DnaSequence.
length counter n/a Number of base pairs in the DNA sequence.

Index Unique Fields Notes
idx0 yes id Primary index for DnaSequence.

DnaSequenceBases Table

Field Type Default Description
id name-string n/a Unique identifier for this DnaSequence.
bases text n/a A string of letters representing the nucleotides of the sequence.

Index Unique Fields Notes
idx0   id  

Family

A family is a group of features united by a particular determination algorithm. The algorithm will frequently, but not always, signify a functional role.

Family Table

Field Type Default Description
id name-string n/a Unique identifier for this Family.

Index Unique Fields Notes
idx0 yes id Primary index for Family.

FcEvidenceSet

A functional coupling evidence set indicates evidence for a functional connection between protein sequence pairs. The protein sequences possessing the connection are the ones that participate in the evidence set's pairings.

The pairings for a particular evidence set will contain protein sequences that are significantly similar. In other words, if (A,B) and (X,Y) are both pairings in a single evidence set, then either (A =~ X and B =~ Y) or (A =~ Y and B =~ X), depending on the value of the "inverted" attribute of the IsDeterminedBy relationship. A pairing in its own right is unordered: if (A,B) is a pair, then so is (B,A). However, the evidence set maintains an ordered correspondence between its pairs, because the constituent pairs must match. The direction in which a pair matches the others in the set is an attribute of the relationship from the pairs to the sets.
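
The matching rule can be sketched as a small function. This is a sketch under stated assumptions: the similarity test is supplied by the caller (how Sapling decides that two sequences are significantly similar is not described here), and treating the crosswise match as inverted = True is an assumed convention.

```python
from typing import Callable, Optional, Tuple

Pair = Tuple[str, str]


def match_direction(
    pair: Pair,
    reference: Pair,
    similar: Callable[[str, str], bool],
) -> Optional[bool]:
    """Return False if the pair matches the reference directly, True if it
    matches crosswise (inverted), or None if it matches neither way."""
    a, b = pair
    x, y = reference
    if similar(a, x) and similar(b, y):
        return False  # direct match: A =~ X and B =~ Y
    if similar(a, y) and similar(b, x):
        return True   # inverted match: A =~ Y and B =~ X
    return None       # the pair does not belong with this reference


def same_family(p: str, q: str) -> bool:
    """Toy similarity test: two keys are 'similar' if they share a first letter."""
    return p[0] == q[0]
```

With this toy test, `match_direction(("b1", "a1"), ("a2", "b2"), same_family)` reports an inverted match, while `("a1", "b1")` against the same reference matches directly.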

FcEvidenceSet Table

Field Type Default Description
id int n/a Unique identifier for this FcEvidenceSet.
score int n/a Score for this evidence set. The score indicates the number of significantly different genomes represented by the pairings.

Index Unique Fields Notes
idx0 yes id Primary index for FcEvidenceSet.

Feature

A feature (sometimes also called a gene) is a part of a genome that is of special interest. Features may be spread across multiple DNA sequences (contigs) of a genome, but never across more than one genome. Each feature in the database has a unique FIG ID.

Feature Table

Field Type Default Description
id id-string n/a Unique identifier for this Feature.
feature-type id-string n/a Code indicating the type of this feature. Among the codes currently supported are "peg" for a protein encoding gene, "bs" for a binding site, "opr" for an operon, and so forth.
sequence-length counter n/a Number of base pairs in this feature.
function text n/a Functional assignment for this feature. This will often indicate the feature's functional role or roles, and may also have comments.

Index Unique Fields Notes
idx0 yes id Primary index for Feature.

FeatureEssential Table

Field Type Default Description
id id-string n/a Unique identifier for this Feature.
essential text n/a A value indicating the essentiality of the feature, coded as HTML. In most cases, this will be a word describing whether the essentiality is confirmed (essential) or potential (potential-essential), hyperlinked to the document from which the essentiality was curated. If a feature is not essential, this field will have no values; otherwise, it may have multiple values.

Index Unique Fields Notes
idx0   id  

FeatureEvidence Table

Field Type Default Description
id id-string n/a Unique identifier for this Feature.
evidence-code string n/a An evidence code describes the possible evidence that exists for deciding a feature's functional assignment. A feature may have no evidence, a single evidence code, or several.

Index Unique Fields Notes
idx0   id  

FeatureLink Table

Field Type Default Description
id id-string n/a Unique identifier for this Feature.
link text n/a Web hyperlink for this feature. A feature can have no hyperlinks or it can have many. The links are to other websites that have useful information about the gene that the feature represents, and are coded as raw HTML, using an anchor href tag.

Index Unique Fields Notes
idx0   id  

FeatureVirulent Table

Field Type Default Description
id id-string n/a Unique identifier for this Feature.
virulent text n/a A value indicating the virulence of the feature, coded as HTML. In most cases, this will be a phrase or SA number hyperlinked to the document from which the virulence information was curated. If the feature is not virulent, this field will have no values; otherwise, it may have multiple values.

Index Unique Fields Notes
idx0   id  

Genome

A genome represents a specific organism with DNA, or a specific meta-genome. All DNA sequences in the database belong to genomes.

  • Genome IsMadeUpOf DnaSequence (one-to-many)
  • Genome IsOwnerOf Feature (one-to-many)
  • TaxonomicGrouping IsTaxonomyOf Genome (one-to-many)
  • Genome Uses MolecularMachine (one-to-many)

Genome Table

Field Type Default Description
id name-string n/a Unique identifier for this Genome.
complete boolean n/a TRUE if the genome is complete, else FALSE
contigs int n/a Number of contigs for this organism.
dna-size counter n/a Number of base pairs in the genome.
domain name-string n/a Domain for this genome or taxonomic classification. The domain is the highest level of the taxonomy tree.
full-name name-string n/a Full genus/species/strain name of the genome.
pegs int n/a Number of protein encoding genes for this organism
primary-group name-string n/a The primary NMPDR group for this organism. There is always exactly one NMPDR group per organism. An empty string indicates the organism is supporting. In general, more data is kept on organisms in NMPDR groups than on supporting organisms.
rnas int n/a Number of RNA features found for this organism.
version name-string n/a Version string for this genome, generally consisting of the genome ID followed by a period and a string of digits.

Index Unique Fields Notes
idx0   primary-group, full-name This index allows the applications to find all genomes associated with a specific primary (NMPDR) group.
idx1   full-name This index allows the applications to find all genomes in lexical order by name.
idx2 yes id Primary index for Genome.

Identifier

An identifier is an alternate name for a feature or protein sequence.

Some identifiers name features or protein sequences that do not exist in the database. In this case, the feature or protein sequence is considered external; that is, it belongs to another database.

Identifier Table

Field Type Default Description
id string n/a Unique identifier for this Identifier.
source key-string n/a Specific type of the identifier, such as its source database or category. The type can usually be decoded to convert the identifier to a URL.

Index Unique Fields Notes
idx0   source This index allows all the identifiers of a specified type to be located.
idx1 yes id Primary index for Identifier.

IdentifierSet

The identifier set is a group of identifiers that mean the same thing, usually either a Feature or a Protein Sequence. The identifiers in a set will frequently belong to different genomic databases. Thus, if a specific protein sequence has one name in the NMPDR and another name in RefSeq, both of the names would be in the same identifier set.

IdentifierSet Table

Field Type Default Description
id name-string n/a Unique identifier for this IdentifierSet.

Index Unique Fields Notes
idx0 yes id Primary index for IdentifierSet.

MachineRole

A machine role represents a role as it occurs in a molecular machine. The key is the machine key plus the role abbreviation.

The machine role corresponds to a cell on the subsystem spreadsheet. Features in the subsystem are assigned directly to the machine role.

MachineRole Table

Field Type Default Description
id name-string n/a Unique identifier for this MachineRole.

Index Unique Fields Notes
idx0 yes id Primary index for MachineRole.

MolecularMachine

A molecular machine is a collection of features that implements a metabolic pathway. Machines are the physical instances of variants. Each machine corresponds to a row in a subsystem spreadsheet. The key is the variant key followed by a colon and the Genome ID.

  • Variant IsImplementedBy MolecularMachine (one-to-many)
  • MolecularMachine IsMachineOf MachineRole (one-to-many)
  • Genome Uses MolecularMachine (one-to-many)
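
As an illustration of the key layout described above, the following sketch builds and splits a machine key. The function names and the sample values are hypothetical, and splitting at the last colon assumes that the genome ID itself never contains a colon (the variant key might).

```python
from typing import Tuple


def machine_key(variant_key: str, genome_id: str) -> str:
    """Join a variant key and a genome ID with a colon."""
    return f"{variant_key}:{genome_id}"


def split_machine_key(key: str) -> Tuple[str, str]:
    """Split a machine key back into (variant key, genome ID).

    The split is made at the LAST colon, so a variant key that itself
    contains colons is preserved intact.
    """
    variant_key, _, genome_id = key.rpartition(":")
    return variant_key, genome_id


# Hypothetical values for illustration only.
key = machine_key("SomeSubsystem:1", "83333.1")
```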

MolecularMachine Table

Field Type Default Description
id key-string n/a Unique identifier for this MolecularMachine.
type key-string n/a The machine type indicates how it relates to the parent variant. A type of "vacant" means that the machine does not appear to actually exist in the organism. A type of "incomplete" means that the machine appears to be missing many reactions. In all other cases, the type is "normal".

Index Unique Fields Notes
idx0 yes id Primary index for MolecularMachine.

Pairing

A pairing indicates that two protein sequences are found close together on one or more DNA sequences. Not all possible pairings are stored in the database; only those that are considered for some reason to be significant for annotation purposes. The key of the pairing is the concatenation of the protein sequence keys in alphabetical order.

Because the protein sequence key is a hash of the sequence letters, the key of a pairing between two sequences is computable from the sequences themselves. Theoretically, the pairing is unordered: (A,B) and (B,A) are the same pairing. It is frequently the case, however, that we need to refer to the "first" or "second" protein in the pairing. When this happens, the first one is always the protein with the alphabetically lesser key. The IsInPair relationship automatically shows the proteins in this order.
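
A minimal sketch of computing a pairing key from two sequences follows. The hash function is a stand-in (this document says only that the key is "a hash of the sequence letters", so SHA-256 here is an assumption), and the optional separator parameter is hypothetical; the default is plain concatenation as described above.

```python
import hashlib
from typing import Tuple


def protein_key(sequence: str) -> str:
    """Stand-in protein key: a hex digest of the sequence letters.
    The actual hash algorithm is an assumption."""
    return hashlib.sha256(sequence.encode("ascii")).hexdigest()


def ordered_pair(key_a: str, key_b: str) -> Tuple[str, str]:
    """Return the two keys as (first, second): the alphabetically lesser
    key is always first."""
    first, second = sorted((key_a, key_b))
    return first, second


def pairing_key(key_a: str, key_b: str, separator: str = "") -> str:
    """Concatenate the two protein keys in alphabetical order."""
    return separator.join(ordered_pair(key_a, key_b))


# Made-up amino acid strings for illustration.
k1 = protein_key("MKTAYIAKQR")
k2 = protein_key("MSDNELLKQV")
```

Sorting before concatenating is what makes the pairing unordered: `pairing_key(k1, k2)` and `pairing_key(k2, k1)` produce the same key.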

Pairing Table

Field Type Default Description
id name-string n/a Unique identifier for this Pairing.

Index Unique Fields Notes
idx0 yes id Primary index for Pairing.

ProteinSequence

A protein sequence is a specific sequence of amino acids. Unlike a DNA sequence, a protein sequence does not belong to a genome. Identical proteins generated by different genomes are generally stored as a single ProteinSequence instance. The key is a hash of the protein letter sequence.

  • ProteinSequence Catalyzes Role (many-to-many)
  • Publication Concerns ProteinSequence (many-to-many)
  • ProteinSequence Exposes Structure (many-to-many)
  • ProteinSequence IsSequenceFor Identifier (one-to-many)

ProteinSequence Table

Field Type Default Description
id hash-string n/a Unique identifier for this ProteinSequence.
isoelectric-point float n/a pH in the surrounding medium at which the charge on a protein is neutral. If the pH of the medium is lower than this value, the protein will have a net positive charge. If the pH of the medium is higher, then the protein will have a net negative charge.
molecular-weight float n/a Molecular weight of this feature's protein, in daltons. A weight of 0 indicates that no protein is created.
sequence dna n/a The sequence contains the letters corresponding to the protein's amino acids.
signal-peptide name-string n/a The signal peptide location for this feature. This is expressed as start and end numbers with a hyphen for the relevant amino acids. So, "1-22" would indicate a signal peptide at the beginning of the feature's protein and extending through 22 amino acid positions. An empty string means no signal peptide is present.
similar-to-human boolean n/a TRUE if this feature generates a protein that is similar to one found in humans, else FALSE
transmembrane-map text n/a A map indicating which sections of a protein will be embedded in a membrane. This is expressed as a comma-separated list of start and end numbers with hyphens for the relevant amino acids. So, "10-12, 40-60" would indicate that there are two sections of the protein that become embedded in a membrane: the 10th through 12th amino acids, and the 40th through the 60th. An empty string means no transmembrane regions are known.

Index Unique Fields Notes
idx0 yes id Primary index for ProteinSequence.
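
The signal-peptide and transmembrane-map fields in the ProteinSequence table share the same "start-end" notation. A short sketch of a parser for that notation is shown below; the function name is hypothetical and no validation beyond integer conversion is attempted.

```python
from typing import List, Tuple


def parse_ranges(text: str) -> List[Tuple[int, int]]:
    """Parse a comma-separated list of 'start-end' amino acid ranges.

    An empty string yields an empty list, meaning no regions are known.
    """
    ranges = []
    for piece in text.split(","):
        piece = piece.strip()
        if not piece:
            continue  # skip empty input or stray commas
        start, end = piece.split("-")
        ranges.append((int(start), int(end)))
    return ranges


transmembrane = parse_ranges("10-12, 40-60")  # [(10, 12), (40, 60)]
signal = parse_ranges("1-22")                 # [(1, 22)]
nothing = parse_ranges("")                    # []
```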

ProteinSequenceIEDB Table

Field Type Default Description
id hash-string n/a Unique identifier for this ProteinSequence.
iedb text n/a A value indicating whether or not the feature can be found in the Immune Epitope Database. If the feature has not been matched to that database, this field will have no values. Otherwise, it will have an epitope name and/or sequence, hyperlinked to the database.

Index Unique Fields Notes
idx0   id  

Publication

A publication is an article or citation that may be used as evidence for assertions made in the database. The key is a hash code computed from the URL.

  • Publication Concerns ProteinSequence (many-to-many)

Publication Table

Field Type Default Description
id hash-string n/a Unique identifier for this Publication.
url string n/a URL of the article or of its citation.
citation text n/a Citation string for the article.

Index Unique Fields Notes
idx0   citation This index allows searching for the article by the author names and title.
idx1 yes id Primary index for Publication.

Reaction

A reaction is a chemical process that converts one set of compounds (substrates) to another set (products). The reaction ID is generally a small number preceded by a letter.

Reaction Table

Field Type Default Description
id key-string n/a Unique identifier for this Reaction.
rev boolean n/a TRUE if this reaction is reversible, else FALSE

Index Unique Fields Notes
idx0 yes id Primary index for Reaction.

ReactionURL Table

Field Type Default Description
id key-string n/a Unique identifier for this Reaction.
url string n/a HTML string containing a link to a web location that describes the reaction. This field is optional.

Index Unique Fields Notes
idx0   id  

Role

A role describes a biological function that may be fulfilled by a feature. One of the main goals of the database is to assign features to roles. Most roles are effected by the construction of proteins. Some, however, deal with functional regulation and message transmission.

A role represents a single gene function. Many roles are in subsystems, but some are not. If a feature has multiple functions, each is represented as a separate role.

Role Table

Field Type Default Description
id hash-string n/a Unique identifier for this Role.
hypothetical boolean n/a TRUE if a role is hypothetical, else FALSE
name string n/a English name of this role. The actual role ID is computed from this field.

Index Unique Fields Notes
idx0 yes id Primary index for Role.

RoleSet

A role set is a group of roles that work together to stimulate a reaction. Most role sets consist of a single role; however, some reactions require the presence of multiple roles to get them started.

A reaction is usually triggered by a single role, but some reactions are triggered by a boolean combination of roles (e.g. (A and (B or C) and D) or (E and B and F) or G). The boolean expression can be converted into disjunctive normal form, which is a list of alternative sets (e.g. (A and B and D) or (A and C and D) or (E and B and F) or G). Each alternative is then converted into a role set. This allows us to precisely represent the triggering conditions of a reaction in the database.
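
The conversion to disjunctive normal form can be sketched as follows. This is an illustration only, not the loader's actual code: the nested-tuple representation of the boolean expression and the function name are assumptions, and no simplification (such as dropping redundant alternatives) is performed.

```python
from itertools import product


def to_role_sets(expr):
    """Convert a boolean role expression into a list of role sets.

    An expression is either a role name (a string) or a tuple of the form
    ("and", operand, ...) or ("or", operand, ...). Each returned frozenset
    is one alternative: all of its roles must be present together.
    """
    if isinstance(expr, str):
        return [frozenset([expr])]
    op, *operands = expr
    parts = [to_role_sets(operand) for operand in operands]
    if op == "or":
        # Alternatives from every operand are simply collected.
        return [role_set for part in parts for role_set in part]
    if op == "and":
        # Pick one alternative from each operand and merge them.
        return [frozenset().union(*combo) for combo in product(*parts)]
    raise ValueError(f"unknown operator: {op!r}")


# (A and (B or C) and D) or (E and B and F) or G
expression = (
    "or",
    ("and", "A", ("or", "B", "C"), "D"),
    ("and", "E", "B", "F"),
    "G",
)
role_sets = to_role_sets(expression)
```

For the expression above, `role_sets` contains four alternatives, {A, B, D}, {A, C, D}, {E, B, F}, and {G}, matching the example in the text.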

RoleSet Table

Field Type