Start with a keyword search for the name of a gene or protein.
Start with the nucleotide sequence of your gene or the amino acid sequence of your protein and BLAST it against any complete genome. Use one or more query sequences to search one or more genomes.
Start from the annotation status table of a focus organism page, for example, the Staphylococcus page, which provides quick access to proteins about which much is known (named genes in subsystems), little is known (named genes not in subsystems and hypothetical genes in subsystems), or nothing is known (hypothetical genes not in subsystems).
Start from the subsystems tree to view the phylogenetic distribution of an interesting biological process.
Start from the essential genes page to view essential genes in model organisms and to project essentiality to closely or distantly related organisms.
Start from a virtual structural proteome to investigate proteins about which structural information is available in PDB.
Yes. To BLAST more than one sequence at a time, click the BLAST link on the home page and paste all your FASTA sequences into the box. Empty lines between sequences make no difference, as long as each sequence begins on a new line with a FASTA header.
Then, in the genome selection part of the search form, you may select one or more genomes. Make sure you set the tool to blastp if your query sequences are proteins, or to blastx if they are DNA. If you are blasting more than a handful of sequences, you may want to increase the number of search results per page from the default of 50. Now click the blast button.
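The input rules above (every record starts on its own line with a FASTA header, and blank lines do not matter) can be sketched in Python. The names `parse_fasta` and `guess_program` are hypothetical and are not part of NMPDR. The alphabet check is only an illustrative heuristic for choosing between blastp and blastx, and the sequences are made up.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from pasted FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines between records are ignored
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first FASTA header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def guess_program(sequence):
    """Suggest blastx for DNA-looking queries, blastp otherwise (heuristic)."""
    dna_letters = set("ACGTUN")
    return "blastx" if set(sequence.upper()) <= dna_letters else "blastp"


# Made-up queries: one protein split over two lines, one DNA fragment.
pasted = """>query1 hypothetical protein
MKTAYIAKQR
QISFVKSHFS

>query2 hypothetical gene fragment
ATGAAAACCGCT
"""
for header, seq in parse_fasta(pasted):
    print(header, len(seq), guess_program(seq))
```

Because one search form submission uses a single tool setting, a mixed paste like this one would in practice be split into a protein batch and a DNA batch.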
Results are returned in order of BLAST score, which depends on protein length, so the results of many sequences blasted against many genomes may be a bit messy. The more work there is to do, the longer the search will take.
To save the table of results, use one of the download options at the top of the results page. You may download the text of the results table as a tab-delimited file, which you can then open in Excel and re-sort by organism or functional name. You may also download the nucleotide or amino acid sequences of the matching genes with one-click buttons at the top of the results page.
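If you prefer a script to a spreadsheet, the same re-sorting can be sketched with Python's standard `csv` module. The column names `Organism` and `Function` are assumptions for illustration, as are the rows; check the header line of your own download and adjust `keys` to match.

```python
import csv
import io


def sort_results(tsv_text, keys=("Organism", "Function")):
    """Sort the rows of a tab-delimited table by the given column names."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = sorted(reader, key=lambda row: tuple(row[k] for k in keys))
    return reader.fieldnames, rows


# Made-up rows standing in for a downloaded results table.
example = (
    "Query\tOrganism\tFunction\tScore\n"
    "q1\tVibrio cholerae\ttransporter\t210\n"
    "q2\tListeria innocua\tkinase\t350\n"
    "q1\tListeria innocua\ttransporter\t190\n"
)
fieldnames, rows = sort_results(example)
for row in rows:
    print("\t".join(row[name] for name in fieldnames))
```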
A feature is anything that can be mapped onto a strand of DNA, and is defined by its start and stop location. A gene is a feature. A protein coding sequence (CDS or PEG) is a feature that, in bacteria, shares the same location on the DNA as its gene, but is represented as an amino acid sequence rather than a nucleotide sequence. Eukaryotic genes also have intron and exon features defined as subsets of the gene feature. Short regulatory elements or functional motifs may also be defined as features. Pathogenicity islands are features that include many genes. Operons are not presently annotated as features in NMPDR.
Complete genomes are the primary data. As such, most chromosomes are one contiguous length of DNA sequence data, or one "contig." Some genomes that are fragmented into several "contigs" are considered to be essentially complete. Genome data include DNA sequences, protein sequences, and the associated annotations. Annotations include both the accurate determination of gene boundaries and the assignment of a functional name to the encoded proteins. NMPDR curators use bioinformatics tools to correct errors in the start or stop codons of genes, and to change incorrect or ambiguous names in the annotations of protein encoding genes, which are called "pegs" in the NMPDR. A peg is equivalent to a CDS.
Populated subsystems are a data type unique to the NMPDR and its underlying annotation environment, the SEED. Subsystems are sets of functional roles grouped according to any biologically useful organizing principle. A subsystem may describe a metabolic pathway, but subsystems are not limited to pathways. For example, there are subsystems that include the ribosomal proteins, or cell division proteins, or pathogen-specific virulence factors. A subsystem may comprise very few or very many proteins that are related in some functional or structural way. Each protein included in a subsystem plays a "functional role" which may be enzymatic, signaling, regulatory, structural, or other. A subsystem may exist in all genomes or be present in only a few closely related genomes. A populated subsystem is a two-dimensional integration of biological functions with genome sequences. It is presented as a spreadsheet with columns of functional roles, rows of genomes, and cells populated by the genes responsible for each function.
Functional clusters are another data type unique to NMPDR and the SEED. The functions of two proximal genes are more likely to be related when they are similarly clustered together in a large number of organisms distributed over a wide phylogenetic space, which is represented as a high functional clustering score. The score is approximately equal to the number of different species (not strains) in which the two genes are co-localized. Functional clustering provides insight into the specific roles played by proteins that may initially be assigned an ambiguous functional role, like "transporter." Functional clusters are presented graphically and by scores for every peg in the database. Additionally, if the peg of interest to you does not share conserved proximity with others, it is possible to discover whether orthologs of your protein are clustered in other genomes. The CL button will display a table of orthologs that might be clustered with other proteins. Clicking on the functional clustering score will also provide a list of orthologs paired in other genomes.
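As a rough illustration of why the score counts species rather than strains, the sketch below takes the genomes in which a gene pair is co-localized and counts each species once. The function name and the observations are made up, and the sketch only mirrors the approximation stated above. It is not the scoring code used by NMPDR or the SEED.

```python
def clustering_score(colocalized_genomes):
    """Count distinct species among genomes where two genes are co-localized.

    Each genome is a (species, strain) pair; strains of one species count once.
    """
    return len({species for species, _strain in colocalized_genomes})


# Made-up observations for one gene pair.
observations = [
    ("Staphylococcus aureus", "COL"),
    ("Staphylococcus aureus", "N315"),
    ("Staphylococcus epidermidis", "RP62A"),
    ("Listeria monocytogenes", "EGD-e"),
]
print(clustering_score(observations))  # 3 species, although 4 genomes
```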
BLAST hits are pre-computed in a reciprocal analysis for all proteins in the database. The results are presented in a table of bidirectional best hits, or BBH. A comprehensive list of orthologs is thus provided for every protein in every genome. You may select orthologs from this list and generate a ClustalW alignment with one click.
A subsystem has two components. First is a list of functional roles that are united by any common process or biologically meaningful organizing principle. Second is a spreadsheet, called a populated subsystem, which is a two-dimensional integration of biological functions with genome sequences. In the populated subsystem, functional roles are represented in columns, genomes are represented in rows, and cells of the spreadsheet are populated by the genes responsible for each function. Genes that are clustered on the chromosome share the same background color in the spreadsheet. Gene identification numbers are linked to NMPDR protein context pages. If multiple genes play the same functional role, the variants are named in the table of functional roles. The row number from that table is then appended to the gene number in the spreadsheet to identify which variant is used. Have a look at the Adhesins in Staphylococcus as an example.
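The spreadsheet layout can be pictured as a small data structure: roles are the columns, genomes are the rows, and each cell holds the genes assigned to that role. The sketch below is only an illustration. The role names, genome names, and gene identifiers are all made up, and the helper functions are hypothetical rather than part of NMPDR or the SEED.

```python
roles = ["Role A", "Role B", "Role C"]  # columns (hypothetical role names)

# Rows: genome -> {role -> list of gene identifiers}; all identifiers are made up.
spreadsheet = {
    "Genome 1": {"Role A": ["g1.peg.10"], "Role B": ["g1.peg.11"], "Role C": ["g1.peg.12"]},
    "Genome 2": {"Role A": ["g2.peg.7"], "Role C": ["g2.peg.40", "g2.peg.41"]},
}


def missing_roles(spreadsheet, roles):
    """Return, per genome, the functional roles with no gene assigned."""
    return {
        genome: [role for role in roles if not cells.get(role)]
        for genome, cells in spreadsheet.items()
    }


def print_table(spreadsheet, roles):
    """Print the subsystem as tab-separated text, one genome per row."""
    print("\t".join(["Genome"] + roles))
    for genome, cells in spreadsheet.items():
        row = [", ".join(cells.get(role, [])) or "-" for role in roles]
        print("\t".join([genome] + row))


print_table(spreadsheet, roles)
print(missing_roles(spreadsheet, roles))
```

An empty cell, such as Role B in Genome 2 here, is where a curator would look for a missing or misannotated gene.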
Subsystems may be accessed from the context page of a member protein by clicking the link in the biological context section of the page. There is also a list of subsystems organized in a tree view on this page. After clicking on show subsystems, a metabolic reconstruction, or comprehensive list of subsystems and proteins that perform functional roles, is returned for the chosen genome. Subsystem headers link to populated subsystem (spreadsheet) displays, and proteins link to their respective context pages.
An investigator can learn much by establishing a subsystem of functions in genomes that are known to contain all the required genes, then using the computer to extend the subsystem to genomes about which less is known. NMPDR is used to browse subsystems established by our curators. The SEED may be used by investigators to create their own subsystems.
All genomes may be searched from the home page. To limit your search to one of the NMPDR core organism groups, you may start your search from one of the organism summary pages:
On each page you will find a search box where you can search for specific genes or proteins by text. To search by gene or protein sequence, choose the "BLAST or Scan" option. To compose an advanced query with selected genomes or subsystems, choose the "Genes" option. Search results are returned in a table that links to two viewing environments:
To understand the full functionality available in the two environments, you will need to take some time to experiment. Feel free to contact us with any questions you have. We are adding help text and smoothing out the interfaces as quickly as possible, largely in response to specific requests and suggestions, so please do take the time to formulate them.
The NMPDR contains two classes of genomes -- those pathogens we are being funded to annotate, which we call "core genomes," and "supporting genomes," which include all publicly available genomes for comparative analysis. The table below lists core genomes, with strain designation and serotype when known, and closely related supporting genomes.
| Campylobacter | Listeria | Staphylococcus | Streptococcus | Vibrio |
|---|---|---|---|---|
| jejuni RM1221 | monocytogenes 1/2a EGD-e | aureus subsp. aureus COL | pneumoniae R6 unencapsulated | cholerae O1 ElTor str. N16961 |
| jejuni subsp. jejuni NCTC 11168 | monocytogenes 1/2a F6854 | aureus subsp. aureus MRSA252 | pneumoniae TIGR4 type 4 | cholerae O1 classical str. O395 |
| jejuni subsp. jejuni 81-176 | monocytogenes 1/2a F6900 | aureus subsp. aureus MSSA476 | pyogenes M1 GAS SF370 | cholerae O139 str. MO10 |
| jejuni subsp. jejuni 260.94 | monocytogenes 1/2a J0161 | aureus subsp. aureus MW2 | pyogenes M1 MGAS 5005 | cholerae non-O1 str. NRT36s |
| jejuni subsp. jejuni 84-25 | monocytogenes 1/2a J2818 | aureus subsp. aureus Mu50 | pyogenes M2 MGAS 10270 | parahaemolyticus RIMD 2210633 |
| jejuni subsp. jejuni CF93-6 | monocytogenes 1/2a 10403S | aureus subsp. aureus N315 | pyogenes M3 SSI-1 | vulnificus CMCP6 |
| jejuni subsp. jejuni HB93-13 | monocytogenes 1/2a1 FSL N3-165 | aureus subsp. aureus NCTC 8325 | pyogenes M3 MGAS 315 | vulnificus YJ016 |
| coli RM2228 | monocytogenes 1/2b FSL J1-194 | aureus subsp. aureus JH1 | pyogenes M4 MGAS 10750 | fischeri ES114 |
| fetus subsp. fetus 82-40 | monocytogenes 1/2b FSL R2-503 | aureus subsp. aureus JH9 | pyogenes M5 Manfredo | splendidus 12B01 |
| lari RM2100 | monocytogenes 4b Aureli 1997 HPB2262 | aureus subsp. aureus USA300 | pyogenes M6 MGAS 10394 | sp. MED222 |
| upsaliensis RM3195 | monocytogenes 4b F2365 | aureus RF122 | pyogenes M12 MGAS 2096 | sp. Ex25 O62 |
| | monocytogenes 4b FSL N1-017 | epidermidis RP62A | pyogenes M12 MGAS 9429 | |
| | monocytogenes 4b H7858 | epidermidis ATCC12228 | pyogenes M18 MGAS 8232 | |
| | monocytogenes 4c FSL J2-071 | haemolyticus JCSC1435 | pyogenes M28 MGAS 6180 | |
| | innocua 6a Clip11262 | saprophyticus ATCC 15305 | agalactiae serotype V 2603V/R | |
| | welshimeri 6b SLCC5334 | | agalactiae A909 | |
| | | | agalactiae NEM316 | |
In the NMPDR environment you can visualize a gene, the protein it encodes, the context of the gene on its contig, and a wealth of specific information relating to that gene. The NMPDR protein page presents a table of information related to the target gene and those genes found up- and down-stream on the chromosome. The feature identification number for NMPDR, fid, is listed first. Next are listed the start and stop nucleotide coordinates, length, size of gap (or overlap) between genes, and the orientation on the + or - strand. The functional name assigned to each gene in the NMPDR annotation is listed under "function," while names or numbers assigned to the gene by other sources are listed as aliases with links to external resources such as UniProt, GenBank, and KEGG. A graphic representation of gene context shows the target gene in green, functionally related genes in blue, and unrelated neighbors in red.
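To make the length, gap, and strand columns concrete, here is a minimal sketch of how they could be derived from start and stop coordinates. It assumes that a gene on the - strand is recorded with its start coordinate greater than its stop coordinate. That convention, the `Gene` class, and all identifiers and coordinates are assumptions for illustration, not a description of how the page is generated.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    fid: str    # feature identifier (made-up values below)
    start: int  # coordinate of the first base of the gene
    stop: int   # coordinate of the last base of the gene

    @property
    def strand(self):
        # Assumption: start > stop means the gene lies on the - strand.
        return "+" if self.start <= self.stop else "-"

    @property
    def left(self):
        return min(self.start, self.stop)

    @property
    def right(self):
        return max(self.start, self.stop)

    @property
    def length(self):
        return self.right - self.left + 1


def gaps(genes):
    """Yield (fid, next_fid, gap) for neighbors; a negative gap is an overlap."""
    ordered = sorted(genes, key=lambda g: g.left)
    for a, b in zip(ordered, ordered[1:]):
        yield a.fid, b.fid, b.left - a.right - 1


genes = [
    Gene("peg.1", 100, 399),
    Gene("peg.2", 396, 700),   # overlaps peg.1 by 4 bases
    Gene("peg.3", 1000, 751),  # - strand under the assumed convention
]
for gene in genes:
    print(gene.fid, gene.strand, gene.length)
for fid, next_fid, gap in gaps(genes):
    print(fid, next_fid, gap)
```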
Two powerful tools for comparative analysis of functional clustering are linked as buttons on the NMPDR protein page. The "CL" button will open a table of homologs to the target gene that appear in functional clusters in the genomes of the other organisms. The table is ordered by the size of the cluster, that is, by how many genes are functionally linked. The "Pins" button will present a graphic of gene clusters, similar to the context graphic, but this will include homologous regions from many organisms, ordered phylogenetically. The target gene is numbered 1, and clustered homologs share a numerical label.
At the bottom of the NMPDR protein page, links are provided to other sites and to tools that help to analyze the encoded protein. If you have a tool that you would like linked from our site, or a tool that you would like to link into our site, please contact us.
The term functional coupling has been used to indicate that two genes appear to have related functional roles (e.g., the encoded proteins both participate in the same metabolic pathway or they both are components in a single complex). One exciting challenge to bioinformatics is to predict functional coupling. Perhaps the most effective technique for doing so relates to analysis of proximity on the chromosome; when two genes tend to occur fairly close to one another in numerous genomes, it amounts to solid evidence that the roles of the gene products are closely related. For details see "The use of gene clusters to infer functional coupling," Proc Natl Acad Sci U S A. 1999 Mar 16; 96(6):2896-2901.
Because comparative analysis of gene clusters has begun to play a much larger role in determination of gene function (due to the rapid increase in the number of available genomes), we have computed instances in which genes appear to be functionally coupled and make the inferences accessible from the NMPDR environment. When you are on the NMPDR protein page, which shows a table of the genes that occur in the region around the gene you are focused on, you will see a column labeled "fc-sc." If a number occurs in this column, there is evidence based on clustering that the genes with numbers are functionally coupled to the gene of focus in that number of genomes. The number is actually a link to a table displaying co-occurrences of the two genes.
NMPDR contains bi-directional best hits (BBH) precomputed using BLASTP. Two sequences S1 and S2 from genomes G1 and G2 are bi-directional best hits if:
The clear disadvantage of BBH is that duplicates or paralogs within the same genome will not be listed in the BBH table. The table will display the annotations, E-values (the number of hits with an equal or better score that would be expected by chance), and links to similar proteins in other organisms.
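A minimal sketch of the reciprocal test, assuming the commonly used definition: the best BLASTP hit of S1 in G2 is S2, and the best hit of S2 in G1 is S1. The function name and the best-hit tables are made up for illustration. The example also shows how a paralog drops out of the result.

```python
def bidirectional_best_hits(best_1_to_2, best_2_to_1):
    """Return pairs (s1, s2) that are each other's best hit.

    best_1_to_2 maps each protein of genome G1 to its best hit in G2;
    best_2_to_1 maps each protein of genome G2 to its best hit in G1.
    """
    return sorted(
        (s1, s2)
        for s1, s2 in best_1_to_2.items()
        if best_2_to_1.get(s2) == s1
    )


# Made-up best hits; a1 and a2 are paralogs in G1 that both hit b1 best.
g1_to_g2 = {"a1": "b1", "a2": "b1", "a3": "b3"}
g2_to_g1 = {"b1": "a1", "b3": "a3"}
print(bidirectional_best_hits(g1_to_g2, g2_to_g1))
# a2 is not reported: the best hit of b1 is a1, so the paralog drops out.
```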
When we load data into the NMPDR we collect links for each feature. Some of these links take you to corresponding entries in databases maintained by numerous groups worldwide. Other links take you to tools that will aid in your analysis of the proteins. In either case, you need to ensure that the database or tool is appropriate for the question that you are trying to answer. Not all tools are appropriate for all questions. If you know of a resource that we should be linking features to, please feel free to point this out to us.
The motivation for the Signature Genes Tool is to try to locate genes related to a phenotype that is associated with one set of organisms (call this set1) but not with another (call this set2).
The search goes through the genes in one organism from set1, selected as the reference genome. For each gene in the reference genome, the tool evaluates the bidirectional best hits of the genes that occur in genomes from set1 and set2. It tabulates these and constructs a score from 0 to 1. A score of 1 means that the gene has a bidirectional best hit in every genome from set1 and no bidirectional best hits against any genome in set2.
The scores are tabulated, and the best candidate genes are presented to you as a list of genes to explore. The main shortcoming of the tool relates to our use of bidirectional best hits. If there are paralogs to the gene within genomes, a bidirectional best hit may not exist in a genome that contains several clear homologs. This means that we may miss genes with paralogs, and we may include genes that do not discriminate as well as their scores suggest. You should therefore treat each reported gene as a candidate to explore, and nothing more. There is now the option of running the tool using precomputed similarities rather than bidirectional best hits.
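The exact scoring formula is not given here, so the sketch below uses an assumed one that merely satisfies the stated endpoint: the fraction of set1 genomes with a bidirectional best hit, multiplied by one minus the fraction of set2 genomes with one. The function name, genome labels, and candidate genes are all hypothetical.

```python
def signature_score(genomes_with_bbh, set1, set2):
    """Score in [0, 1]; 1 means a BBH in every set1 genome and in no set2 genome.

    ASSUMED formula (not taken from the tool):
    (fraction of set1 genomes hit) * (1 - fraction of set2 genomes hit).
    """
    if not set1 or not set2:
        raise ValueError("set1 and set2 must each contain at least one genome")
    in_set1 = sum(g in genomes_with_bbh for g in set1) / len(set1)
    in_set2 = sum(g in genomes_with_bbh for g in set2) / len(set2)
    return in_set1 * (1.0 - in_set2)


# Made-up example: genomes in which each reference gene has a BBH.
set1 = ["A", "B", "C"]
set2 = ["X", "Y"]
candidates = {
    "gene1": {"A", "B", "C"},       # in all of set1, none of set2
    "gene2": {"A", "B", "C", "X"},  # also found in one set2 genome
    "gene3": {"A"},                 # found in only one set1 genome
}
ranked = sorted(
    candidates,
    key=lambda gene: signature_score(candidates[gene], set1, set2),
    reverse=True,
)
for gene in ranked:
    print(gene, round(signature_score(candidates[gene], set1, set2), 3))
```

Under this assumed formula a gene with paralogs can score low simply because its bidirectional best hits are missing, which is the shortcoming described above.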
Comparative analysis of genomes reveals multiple gaps in our knowledge of basic biochemical and cellular processes. Accurate mapping of the revealed open problems within a framework of specific subsystems and groups of organisms sets the stage for generating hypotheses amenable to experimental testing. In a growing number of cases, predictions of novel genes and pathways revealed by comparative genomics techniques have been successfully verified. Based on this vision, the scope of the HOPS Database is to build and maintain a public repository of:
It is important to emphasize that we aim to restrict the breadth of open problems to those that are:
Likewise, we aim to accumulate predictions that provide a precisely defined and testable functional role, transformation, or interaction. For this reason, "general class" functional predictions (e.g., putative kinase) are not the focus of our effort. By launching this site, we commit to populating it with problems and conjectures emerging from our effort to encode subsystems in the SEED environment, capturing many aspects of the Central Machinery of Life. Our goal is to share this information with the broad scientific community in order to encourage further computational and experimental analysis. Most importantly, we solicit community contributions to all three aspects of the HOPS Database, which is meant to become a joint effort of bioinformaticians and experimentalists. It is important to emphasize that an experimental verification of a single gene, carefully propagated via subsystems-based annotations, will often impact a significant number of genes in a variety of species.