The Entity-Relationship Database System (ERDB)

ERDB is used to implement three NMPDR databases.

  • Sprout: our current storehouse for genetic data
  • CustomAttributes: a database of general information collected from other databases and journal papers
  • Sapling: the planned successor to Sprout

ERDBFlow.png
For each database, ERDB uses a single XML file to determine the structure and format of the data, generate the documentation, and draw an interactive database diagram. While most access to the database is controlled by custom PERL applications, there are enough basic access methods that a great deal of work can be performed using just the ERDB system itself. ERDB uses an extensible type system so that specialized data such as DNA sequences and graphic images can be stored compactly in the database and treated as PERL objects by the database clients. Finally, an abstraction layer is placed between ERDB and the actual SQL database so that it can be ported more easily to different database back-ends.

The documentation engine of the native ERDB system is specifically designed to create pages for the NMPDR web site; however, it uses a PERL object to control how font styles, headings, tables, and links are rendered. A drop-in replacement for this PERL object can be used to output the documentation as POD, HTML, or any other markup format. The database diagrams are rendered using Adobe Flash. ERDB allows the database designer to add shapes to the diagram that aren't related to a specific entity or relationship in the database. It is possible to define a database with no data or tables, only shapes, and in fact several NMPDR web pages (such as Genome Pipeline and NMPDR Website) use diagrams generated by ERDB.

A complete description of the XML definition file can be found here.

Querying in ERDB

As you can tell from the name, ERDB uses an entity-relationship model of the database. An entity translates to a group of one or more SQL tables. Relationships are always binary (between two entities) and are represented by a single table. Entities and relationships are collectively referred to in the documentation as objects.

To query the database, you need to specify the list of objects involved in the query, an SQL-style filter string, and a list of parameters to be plugged into the filter. The filter will contain one or more database field names, and these must be in standard field name format.

Standard Field Name Format

The standard field name format is the name of the entity or relationship followed by the field name in parentheses. For example,

Feature(species-name)

would specify the species name field for the Feature entity.

In some cases, the object name may not be the actual name of an object in the database. It could be an alias assigned by a query, or the converse name of a relationship. The converse name is an alternative name for a relationship which expresses what happens when you cross the relationship in the reverse direction. So, for example, in the Sprout Database, the IsFamilyForFeature relationship connects a ProteinFamily to a Feature, and this looks sensible when you type it.

ProteinFamily IsFamilyForFeature Feature

If you're going in the other direction, however, it makes more sense to use the converse name, which is IsInFamily.

Feature IsInFamily ProteinFamily

There's nothing special about the converse name: it's simply provided to make queries look nice. For that reason, they're completely optional, and many relationships don't have converse names.

An alias name is an alternative name for a particular object, formed by putting a number after the real name. So, Feature2 is an alias for the Feature table. Aliases are discussed in more detail below.

Filter Clause

The filter clause is an SQL WHERE clause (without the WHERE) to be used to filter and sort the query. The WHERE clause can be parameterized with question marks (?). When the query is executed, the question marks are replaced by values from the parameter list. The fields in the filter clause must be specified in Standard Field Name Format. For example, the following filter clause would ask for all genomes in a particular genus.

Genome(genus) = ?

The genus of interest (e.g. Streptococcus) would be specified as the first and only entry in the parameter list.

The filter clause can also specify a sort order. To do this, simply follow the filter string with an ORDER BY clause. For example, the following filter string gets all genomes for a particular genus and sorts them by species name.

Genome(genus) = ? ORDER BY Genome(species)

Note that the case is important. Only an uppercase "ORDER BY" with a single space will be processed. The idea is to make it less likely to find the verb by accident.

Finally, you can limit the number of rows returned by adding a LIMIT clause. The LIMIT must be the last thing in the filter clause, and it contains only the word "LIMIT" followed by a positive number. So, for example

Genome(genus) = ? ORDER BY Genome(species) LIMIT 10

will only return the first ten genomes for the specified genus. The ORDER BY clause is not required. For example, to just get the first 10 genomes in the Genome table, you could use

LIMIT 10

as your filter clause.

Unlike a normal SQL query, the filter clause does not need to contain any information about joins or relationships between tables. This is all handled automatically by the Object Name List.

Parameter List

The parameter list contains the parameter values to be plugged in to the question marks of the filter string. The parameter values are substituted for the question marks in the filter clause in strict left-to-right order.

In the parameter list for a filter clause, you must be aware of the proper data types. In general, queries will work if you are filtering on numbers and strings; they will fail if you want to filter on something like an image or a particular DNA sequence. The query console parameters are just lines of text, and it's a general tool, meant to be unaware of the special needs of the individual databases, and this limits its capabilities.

Object Name List

The object name list specifies the names of the entities and relationships that participate in the query. This includes every object used to filter the query as well as every object from which data is expected. You can specify the same object multiple times in the list, either insisting that it be the same thing or allowing it to be different, and you can use the keyword AND to separate different join paths.

For example, the following query would find all genes (features) in the FIGfam FIG000171.

Object Name List ProteinFamily IsFamilyForFeature Feature
Filter String ProteinFamily(id) = ?
Parameter List FIG000171

We're using two entities—ProteinFamily and Feature—and they're connected by the IsFamilyForFeature relationship. The filter string specifies a single parameter, which happens to be the ID of the FIGfam in which we're interested.

Now, consider a more complicated query: we want to find all features that are in the same FIGfam as fig|360108.3.peg.1596 which is the multimodular transpeptidase-transglycosylase enzyme for Campylobacter jejuni subsp. jejuni. This requires using the Feature table twice: once for the feature of interest (fig|360108.3.peg.1596) and once for the features in the family. The following query does the job.

Object Name List Feature IsInFamily ProteinFamily IsFamilyForFeature Feature2
Filter String Feature(id) = ?
Parameter List fig|360108.3.peg.1596

Here the second instance of the Feature table is given a numbered suffix to distinguish it from the first. Were we to use

Object Name List Feature IsInFamily ProteinFamily IsFamilyForFeature Feature
Filter String Feature(id) = ?
Parameter List fig|360108.3.peg.1596

Then the query would only return the single feature. Note that when using the query console, if you specify

Feature(id) Feature(assignment)

for the fields to be displayed in the results, you will get

id assignment
fig|360108.3.peg.1596 Multimodular transpeptidase-transglycosylase (EC 2.4.1.129) (EC 3.4.-.-)
fig|360108.3.peg.1596 Multimodular transpeptidase-transglycosylase (EC 2.4.1.129) (EC 3.4.-.-)
fig|360108.3.peg.1596 Multimodular transpeptidase-transglycosylase (EC 2.4.1.129) (EC 3.4.-.-)
fig|360108.3.peg.1596 Multimodular transpeptidase-transglycosylase (EC 2.4.1.129) (EC 3.4.-.-)
fig|360108.3.peg.1596 Multimodular transpeptidase-transglycosylase (EC 2.4.1.129) (EC 3.4.-.-)
fig|360108.3.peg.1596 Multimodular transpeptidase-transglycosylase (EC 2.4.1.129) (EC 3.4.-.-)
fig|360108.3.peg.1596 Multimodular transpeptidase-transglycosylase (EC 2.4.1.129) (EC 3.4.-.-)

and so forth. This is because the filter string fixes Feature by its ID. To get the desired result, you need to use Feature2.

Feature2(id) Feature2(assignment)

Most queries involve a single continuous path through the database, but sometimes it is necessary to have a branching path. Let us say that in addition to the ID and assignment of the features found we also want to know which subsystems they are in and the scientific names of the genomes. To get the genome, we use the HasFeature relationship. To get the subsystem, we use the HasRoleInSubsystem relationship; however, both of these relationships start from the same place: Feature2. We use the AND keyword to get access to Feature2 a second time.

Object Name List Feature IsInFamily ProteinFamily IsFamilyForFeature Feature2 HasFeature Genome AND Feature2 HasRoleInSubsystem
Filter String Feature(id) = ?
Parameter List fig|360108.3.peg.1596
Field List Feature2(id) Feature2(assignment) Genome(scientific-name) HasRoleInSubsystem(to-link)

The Controversial Aspect of from-link and to-link

Many relationships have two names. For example, the HasFeature relationship that leads from features to genomes has an alternate name IsInGenome. So, the following object name paths are the same.

Normal Genome HasFeature Feature
Converse Feature IsInGenome Genome

The IsInGenome relationship is called the converse of HasFeature. Both relationships are implemented using a single table; however, if you're using the converse, the meaning of from-link and to-link is different, as shown below.

HasFeature(from-link) ID of the source Genome
HasFeature(to-link) ID of the target Feature
IsInGenome(from-link) ID of the source Feature
IsInGenome(to-link) ID of the target Genome

After extensive research, we've determined that no matter how we set this up, people are going to be confused. Thankfully, in most queries the link fields are not necessary, because they are the same as the ID fields in the entities connected by the relationship. So, you can get the same result as HasFeature(from-link) by asking for Genome(id).

The ERDB Query Console

The ERDB Query Console allows you to make general queries against our active ERDB databases. We don't recommend using it to search our database: most people will find the various NMPDR Search pages easier to use. Rather, the Console is provided as an aid to people interested ERDB's capabilities and how it works. The query results will be displayed in the embedded frame below the console. All queries are artificially limited to a maximum of 500 results.

Database  
Object Names
Filter String
Parameters
(one per line)
Fields
Sprout Help - Sapling Help - CustomAttributes Help - ERDB Queries

Show Help Hide Help

Select a Database name and click RUN to run a query. Click on one of the links next to the RUN button to see a database's help page in the embedded frame or to see a general description of the parameters to an ERDB query. The form parameters are as follows.

Object Names
List the entities and relationships involved in the query, in order, space-delimited.
Filter String
Enter a WHERE clause for the query (without the WHERE). Field names are indicated by an entity or relationship name followed by the actual field name in parentheses. So, for example, Feature(assignment) would indicate the assignment field of the Feature entity. You can also include an ORDER BY clause. Use question marks (C<?>) to indicate points in the filter string into which parameter values should be inserted.
Parameters
A list of parameter values, one per line. Each parameter will be substituted for a question mark in the query. Parameters that are string values must have new-lines and tabs escaped (e.g. Set function to\nhypothetical protein\nmaster).
Fields
A list of field names to display, space-delimited. Field name syntax is the same as it is in the filter string, except that if the field belongs to the first entity or relationship in the object name list, the field name can be specified by itself.

arrowdownarrowup

PERL API Basics

A complete description of the ERDB methods and commands is given on the pages listed under code documentation; however, the following tables contain an overview of the most important PERL API calls.

Method Operates On Parameters Description
Constructors
GetDatabase static name Connect to the named database. This is a static constructor for any ERDB database. Each database has its own PERL object that should be aware of the details of how to access the database and what security procedures need to be followed. This method finds that object and calls its constructor.
new static database connection, XML file name Connect to an ERDB database. This is the constructor used by the PERL database objects mentioned above.
Query Methods
Get ErdbPm object names, filter clause, parameter list Execute a query. This method returns an ERDBQueryPm? object which can be used to iterate through the data returned.
Fetch ERDBQueryPm   Fetch the next record of data. This method returns an ERDBObjectPm? which contains actual data from the database.
Value ERDBObjectPm field name Return the value or values of the specified field. This method always returns a list, even if a field has only one value.
GetAll ErdbPm object names, filter clause, parameter list, field names. Execute a query and return all results. This method performs all the functions of Get, Fetch, and Value in a single call.
GetCount ErdbPm object names, filter clause, parameter list Count records. Return the number of database records that match the specified criteria.
Update Methods
Delete ErdbPm entity name, entity ID Delete an entity instance. This method removes all traces of a particular entity instance from the database, including connecting relationships, subordinate entity occurrences, and everything in between.
DeleteLike ErdbPm relationship name, filter string, parameter list Disconnect relationships. All relationship instances in the given relationship that match the filter criteria will be deleted.
InsertObject ErdbPm object name, field hash Insert an object. Insert a new entity or relationship instance into the database.
UpdateEntity ErdbPm entity name, entity ID, field hash Update an entity instance. Change the values of the specified fields in the specified entity instance.

Code Documentation

The ERDB code documentation is available in the Sprout Project.

SequencingForm
Sequence 0002--
Summary Entity-Relationship Database System
Topic revision: r10 - 05 Feb 2009 - 08:01:03 - Bruce Parrello
 
Notice to NMPDR Users - The NMPDR BRC contract has ended and bacterial data from NMPDR has been transferred to PATRIC (http://www.patricbrc.org), a new consolidated BRC for all NIAID category A-C priority pathogenic bacteria. NMPDR was a collaboration among researchers from the Computation Institute of the University of Chicago, the Fellowship for Interpretation of Genomes (FIG), Argonne National Laboratory, and the National Center for Supercomputing Applications (NCSA) at the University of Illinois. NMPDR is funded by the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under Contract HHSN266200400042C. Banner images are copyright © Dennis Kunkel.