The Entity-Relationship Database System (ERDB)
ERDB is used to implement three NMPDR databases.
- Sprout: our current storehouse for genetic data
- CustomAttributes: a database of general information collected from other databases and journal papers
- Sapling: the planned successor to Sprout
For each database, ERDB uses a single XML file to determine the structure and format of the data, generate the documentation, and draw an interactive database diagram. While most access to the database is controlled by custom PERL applications, there are enough basic access methods that a great deal of work can be performed using just the ERDB system itself. ERDB uses an extensible type system so that specialized data such as DNA sequences and graphic images can be stored compactly in the database and treated as PERL objects by the database clients. Finally, an abstraction layer is placed between ERDB and the actual SQL database so that it can be ported more easily to different database back-ends.
The documentation engine of the native ERDB system is specifically designed to create pages for the NMPDR web site; however, it uses a PERL object to control how font styles, headings, tables, and links are rendered. A drop-in replacement for this PERL object can be used to output the documentation as POD, HTML, or any other markup format. The database diagrams are rendered using Adobe Flash. ERDB allows the database designer to add shapes to the diagram that aren't related to a specific entity or relationship in the database. It is possible to define a database with no data or tables, only shapes, and in fact several NMPDR web pages (such as
Genome Pipeline and
NMPDR Website) use diagrams generated by ERDB.
A complete description of the XML definition file can be found
here.
Querying in ERDB
As you can tell from the name, ERDB uses an
entity-relationship model of the database. An entity translates to a group of one or more SQL tables. Relationships are always binary (between two entities) and are represented by a single table. Entities and relationships are collectively referred to in the documentation as
objects.
To query the database, you need to specify the list of objects involved in the query, an SQL-style
filter string, and a list of parameters to be plugged into the filter. The filter will contain one or more database field names, and these must be in
standard field name format.
Standard Field Name Format
The standard field name format is the name of the entity or relationship followed by the field name in parentheses. For example,
would specify the species name field for the
Feature entity.
In some cases, the object name may not be the actual name of an object in the database. It could be an alias assigned by a query, or the converse name of a relationship. The
converse name is an alternative name for a relationship which expresses what happens when you cross the relationship in the reverse direction. So, for example, in the
Sprout Database, the
IsFamilyForFeature relationship connects a
ProteinFamily to a
Feature, and this looks sensible when you type it.
ProteinFamily IsFamilyForFeature Feature
If you're going in the other direction, however, it makes more sense to use the converse name, which is
IsInFamily.
Feature IsInFamily ProteinFamily
There's nothing special about the converse name: it's simply provided to make queries look nice. For that reason, they're completely optional, and many relationships don't have converse names.
An alias name is an alternative name for a particular object, formed by putting a number after the real name. So,
Feature2 is an alias for the
Feature table. Aliases are discussed in more detail below.
Filter Clause
The filter clause is an
SQL WHERE clause (without the WHERE) to be used to filter and sort the query. The WHERE clause can be parameterized with question marks (
?). When the query is executed, the question marks are replaced by values from the parameter list. The fields in the filter clause must be specified in Standard Field Name Format. For example, the following filter clause would ask for all genomes in a particular genus.
The genus of interest (e.g.
Streptococcus) would be specified as the first and only entry in the parameter list.
The filter clause can also specify a sort order. To do this, simply follow the filter string with an ORDER BY clause. For example, the following filter string gets all genomes for a particular genus and sorts them by species name.
Genome(genus) = ? ORDER BY Genome(species)
Note that the case is important. Only an uppercase "ORDER BY" with a single space will be processed. The idea is to make it less likely to find the verb by accident.
Finally, you can limit the number of rows returned by adding a LIMIT clause. The LIMIT must be the last thing in the filter clause, and it contains only the word "LIMIT" followed by a positive number. So, for example
Genome(genus) = ? ORDER BY Genome(species) LIMIT 10
will only return the first ten genomes for the specified genus. The ORDER BY clause is not required. For example, to just get the first 10 genomes in the
Genome table, you could use
as your filter clause.
Unlike a normal SQL query, the filter clause does not need to contain any information about joins or relationships between tables. This is all handled automatically by the Object Name List.
Parameter List
The parameter list contains the parameter values to be plugged in to the question marks of the filter string. The parameter values are substituted for the question marks in the filter clause in strict left-to-right order.
In the parameter list for a filter clause, you must be aware of the proper data types. In general, queries will work if you are filtering on numbers and strings; they will fail if you want to filter on something like an image or a particular DNA sequence. The query console parameters are just lines of text, and it's a general tool, meant to be unaware of the special needs of the individual databases, and this limits its capabilities.
Object Name List
The
object name list specifies the names of the entities and relationships that participate in the query. This includes every object used to filter the query as well as every object from which data is expected. You can specify the same object multiple times in the list, either insisting that it be the same thing or allowing it to be different, and you can use the keyword
AND to separate different join paths.
For example, the following query would find all genes (features) in the
FIGfam FIG000171.
| Object Name List |
ProteinFamily IsFamilyForFeature Feature |
| Filter String |
ProteinFamily(id) = ? |
| Parameter List |
FIG000171 |
We're using two entities—
ProteinFamily and
Feature—and they're connected by the
IsFamilyForFeature relationship. The filter string specifies a single parameter, which happens to be the ID of the
FIGfam in which we're interested.
Now, consider a more complicated query: we want to find all features that are in the same
FIGfam as
fig|360108.3.peg.1596 which is the multimodular transpeptidase-transglycosylase enzyme for
Campylobacter jejuni subsp. jejuni. This requires using the
Feature table twice: once for the feature of interest (fig|360108.3.peg.1596) and once for the features in the family. The following query does the job.
| Object Name List |
Feature IsInFamily ProteinFamily IsFamilyForFeature Feature2 |
| Filter String |
Feature(id) = ? |
| Parameter List |
fig|360108.3.peg.1596 |
Here the second instance of the
Feature table is given a numbered suffix to distinguish it from the first. Were we to use
| Object Name List |
Feature IsInFamily ProteinFamily IsFamilyForFeature Feature |
| Filter String |
Feature(id) = ? |
| Parameter List |
fig|360108.3.peg.1596 |
Then the query would only return the single feature. Note that when using the query console, if you specify
Feature(id) Feature(assignment)
for the fields to be displayed in the results, you will get
| id |
assignment |
| fig|360108.3.peg.1596 |
Multimodular transpeptidase-transglycosylase (EC 2.4.1.129) (EC 3.4.-.-) |
| fig|360108.3.peg.1596 |
Multimodular transpeptidase-transglycosylase (EC 2.4.1.129) (EC 3.4.-.-) |
| fig|360108.3.peg.1596 |
Multimodular transpeptidase-transglycosylase (EC 2.4.1.129) (EC 3.4.-.-) |
| fig|360108.3.peg.1596 |
Multimodular transpeptidase-transglycosylase (EC 2.4.1.129) (EC 3.4.-.-) |
| fig|360108.3.peg.1596 |
Multimodular transpeptidase-transglycosylase (EC 2.4.1.129) (EC 3.4.-.-) |
| fig|360108.3.peg.1596 |
Multimodular transpeptidase-transglycosylase (EC 2.4.1.129) (EC 3.4.-.-) |
| fig|360108.3.peg.1596 |
Multimodular transpeptidase-transglycosylase (EC 2.4.1.129) (EC 3.4.-.-) |
and so forth. This is because the filter string fixes
Feature by its ID. To get the desired result, you need to use
Feature2.
Feature2(id) Feature2(assignment)
Most queries involve a single continuous path through the database, but sometimes it is necessary to have a branching path. Let us say that in addition to the ID and assignment of the features found we also want to know which subsystems they are in and the scientific names of the genomes. To get the genome, we use the
HasFeature relationship. To get the subsystem, we use the
HasRoleInSubsystem relationship; however, both of these relationships start from the same place:
Feature2. We use the
AND keyword to get access to
Feature2 a second time.
| Object Name List |
Feature IsInFamily ProteinFamily IsFamilyForFeature Feature2 HasFeature Genome AND Feature2 HasRoleInSubsystem |
| Filter String |
Feature(id) = ? |
| Parameter List |
fig|360108.3.peg.1596 |
| Field List |
Feature2(id) Feature2(assignment) Genome(scientific-name) HasRoleInSubsystem(to-link) |
The Controversial Aspect of from-link and to-link
Many relationships have two names. For example, the
HasFeature relationship that leads from features to genomes has an alternate name
IsInGenome. So, the following object name paths are the same.
| Normal |
Genome HasFeature Feature |
| Converse |
Feature IsInGenome Genome |
The
IsInGenome relationship is called the
converse of
HasFeature. Both relationships are implemented using a single table; however, if you're using the converse, the meaning of
from-link and
to-link is different, as shown below.
HasFeature(from-link) |
ID of the source Genome |
HasFeature(to-link) |
ID of the target Feature |
IsInGenome(from-link) |
ID of the source Feature |
IsInGenome(to-link) |
ID of the target Genome |
After extensive research, we've determined that no matter how we set this up, people are going to be confused. Thankfully, in most queries the link fields are not necessary, because they are the same as the ID fields in the entities connected by the relationship. So, you can get the same result as
HasFeature(from-link) by asking for
Genome(id).
The ERDB Query Console
The ERDB Query Console allows you to make general queries against our active ERDB databases. We don't recommend using it to search our database: most people will find the various
NMPDR Search pages easier to use. Rather, the Console is provided as an aid to people interested ERDB's capabilities and how it works. The query results will be displayed in the embedded frame below the console. All queries are artificially limited to a maximum of 500 results.
Show Help Hide Help
Select a
Database name and click
RUN to run a query. Click on one of the links next to the
RUN button to see a database's help page in the embedded frame or to see a general description of the parameters to an ERDB query. The form parameters are as follows.
- Object Names
- List the entities and relationships involved in the query, in order, space-delimited.
- Filter String
- Enter a WHERE clause for the query (without the WHERE). Field names are indicated by an entity or relationship name followed by the actual field name in parentheses. So, for example,
Feature(assignment) would indicate the assignment field of the Feature entity. You can also include an ORDER BY clause. Use question marks (C<?>) to indicate points in the filter string into which parameter values should be inserted.
- Parameters
- A list of parameter values, one per line. Each parameter will be substituted for a question mark in the query. Parameters that are string values must have new-lines and tabs escaped (e.g.
Set function to\nhypothetical protein\nmaster).
- Fields
- A list of field names to display, space-delimited. Field name syntax is the same as it is in the filter string, except that if the field belongs to the first entity or relationship in the object name list, the field name can be specified by itself.

PERL API Basics
A complete description of the ERDB methods and commands is given on the pages listed under
code documentation; however, the following tables contain an overview of the most important PERL API calls.
| Method |
Operates On |
Parameters |
Description |
| Constructors |
GetDatabase |
static |
name |
Connect to the named database. This is a static constructor for any ERDB database. Each database has its own PERL object that should be aware of the details of how to access the database and what security procedures need to be followed. This method finds that object and calls its constructor. |
new |
static |
database connection, XML file name |
Connect to an ERDB database. This is the constructor used by the PERL database objects mentioned above. |
| Query Methods |
Get |
ErdbPm |
object names, filter clause, parameter list |
Execute a query. This method returns an ERDBQueryPm? object which can be used to iterate through the data returned. |
Fetch |
ERDBQueryPm |
|
Fetch the next record of data. This method returns an ERDBObjectPm? which contains actual data from the database. |
Value |
ERDBObjectPm |
field name |
Return the value or values of the specified field. This method always returns a list, even if a field has only one value. |
GetAll |
ErdbPm |
object names, filter clause, parameter list, field names. |
Execute a query and return all results. This method performs all the functions of Get, Fetch, and Value in a single call. |
GetCount |
ErdbPm |
object names, filter clause, parameter list |
Count records. Return the number of database records that match the specified criteria. |
| Update Methods |
Delete |
ErdbPm |
entity name, entity ID |
Delete an entity instance. This method removes all traces of a particular entity instance from the database, including connecting relationships, subordinate entity occurrences, and everything in between. |
DeleteLike |
ErdbPm |
relationship name, filter string, parameter list |
Disconnect relationships. All relationship instances in the given relationship that match the filter criteria will be deleted. |
InsertObject |
ErdbPm |
object name, field hash |
Insert an object. Insert a new entity or relationship instance into the database. |
UpdateEntity |
ErdbPm |
entity name, entity ID, field hash |
Update an entity instance. Change the values of the specified fields in the specified entity instance. |
Code Documentation
The ERDB code documentation is available in the
Sprout Project.