Frequently Asked Questions

  1. How do I use NMPDR to find a degenerate peptide motif in selected organisms?

  2. Where do I start?

  3. Can I blast multiple sequences at once against more than one genome?

  4. How do I save or download data?

  5. What is a feature?

  6. What data are in NMPDR?

  7. What is a subsystem?

  8. How do I find what I'm looking for?

  9. What genomes are supported in the NMPDR?

  10. What can I do in the NMPDR environment?

  11. What is meant by "functional coupling"?

  12. What similarities are shown?

  13. What about links to other tools?

  14. How do I use the signature genes tool to search for genes that discriminate between two sets of organisms?

  15. What practical use can a bench scientist make of comparative genomics?

 

  1. How do I use NMPDR to find a degenerate peptide motif in selected organisms?
    1. Select the BLAST or Scan search option.
    2. Select the protScan tool from the drop-down list.
    3. Type the motif of interest in the sequence box.
      • For example, use a collagen-binding motif implicated in acute rheumatic fever (ARF) following streptococcal infection, AXYLXXLN (J Biol Chem 282:18686).
    4. Select the genomes to search in the genome list.
      • For example, select all strains of Streptococcus pyogenes from the genomes list quickly by typing "pyogenes" in the text box and clicking "Select genomes containing."
    5. Now click the Scan button.
    6. Matching sequences are presented in a table with links to respective NMPDR protein pages and a new Context Viewer.
      • ProtScan finds this matching sequence, AEYLKGLN, in the M protein of two M3 strains.
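
    The matching rule behind such a scan can be sketched in a few lines of Python. This is only an illustration of how a degenerate motif behaves, assuming "X" stands for any single residue; it is not the protScan implementation, and the function names and toy sequence are hypothetical.

```python
import re

def motif_to_regex(motif: str) -> str:
    """Turn a degenerate motif into a regex string; 'X' matches any one residue."""
    return "".join("[A-Z]" if ch == "X" else re.escape(ch) for ch in motif.upper())

def scan(motif: str, protein: str) -> list[tuple[int, str]]:
    """Return (1-based position, matched text) for every hit, overlaps included."""
    # A lookahead lets the search report overlapping occurrences as well.
    pattern = re.compile("(?=(" + motif_to_regex(motif) + "))")
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(protein.upper())]

# Hypothetical toy sequence; only the embedded AEYLKGLN comes from the example above.
toy_protein = "MKTAEYLKGLNDQ"
hits = scan("AXYLXXLN", toy_protein)
for position, text in hits:
    print(position, text)
```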


  2. Where do I start?
    Start with a keyword search for the name of a gene or protein.

    Start with the nucleotide sequence of your gene or amino acid sequence of your protein and blast against any complete genome. Use one or more query sequences to search one or more genomes.

    Start from the annotation status table of a focus organism page, for example, the Staphylococcus page, which provides quick access to proteins about which much is known (named genes in subsystems), little is known (named genes not in subsystems and hypothetical genes in subsystems), or nothing is known (hypothetical genes not in subsystems).

    Start from the subsystems tree to view the phylogenetic distribution of an interesting biological process.

    Start from the essential genes page to view essential genes in model organisms and to project essentiality to closely or distantly related organisms.

    Start from a virtual structural proteome to investigate proteins about which structural information is available in PDB.


  3. Can I blast multiple sequences at once against more than one genome?
    Yes. To BLAST more than one sequence at a time, click the blast link from the home page and paste all your FASTA sequences into the box. It makes no difference whether there is an empty line between the different sequences, as long as each sequence begins on a new line with a FASTA header.
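
    To check a multi-sequence query before pasting it, a small parser along these lines may help. It is a minimal sketch that treats every line starting with ">" as a header and ignores empty lines, as described above; the function name and sample sequences are hypothetical.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Split multi-FASTA text into (header, sequence) pairs, skipping blank lines."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # empty lines between records are harmless
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Hypothetical queries, once with and once without an empty separator line.
with_blank = ">seq1 first query\nMKT\nAEY\n\n>seq2 second query\nLKGLN\n"
without_blank = ">seq1 first query\nMKT\nAEY\n>seq2 second query\nLKGLN\n"
records = parse_fasta(with_blank)
print(len(records), "sequences found")
```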

    Then, in the genome selection part of the search form, you may select one or more genomes. Make sure you set the tool to blastp if your query sequences are proteins, or blastx, if the query sequences are DNA. If you are blasting more than a handful of sequences, you may want to increase the number of search results per page from the default of 50. Now click the blast button.

    Results are returned in order of blast score, which is dependent on protein length, so the result of a lot of sequences blasted against a lot of genomes may be a bit messy. The more work there is to do, the longer the search will take.

    To save the table of results, use one of the download options at the top of the results page. You may download the text of the results table as a tab-delimited file, which you can then open in Excel and re-sort by organism or functional name. You may also download the nucleotide or amino acid sequences of the matching genes with one-click buttons at the top of the results page.
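
    The same re-sorting can be done outside a spreadsheet. The sketch below assumes the downloaded file has a header row; the column names "Gene ID", "Organism" and "Function" and the sample rows are hypothetical, so substitute the headers that appear in your own download.

```python
import csv
import io

def sort_results(tsv_text: str, keys: list[str]) -> list[dict]:
    """Read tab-delimited text with a header row and sort rows by the given columns."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    return sorted(rows, key=lambda row: tuple(row[k] for k in keys))

# Hypothetical results table; real column names may differ.
sample = (
    "Gene ID\tOrganism\tFunction\n"
    "gene_3\tStreptococcus pyogenes\tM protein\n"
    "gene_1\tStaphylococcus aureus\tAdhesin\n"
    "gene_2\tStaphylococcus aureus\tABC transporter\n"
)
ordered = sort_results(sample, ["Organism", "Function"])
for row in ordered:
    print(row["Organism"], row["Function"], row["Gene ID"], sep="\t")
```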


  4. How do I save or download data?
    • To save the table of search results as a tab-delimited text file that may be opened as a spreadsheet, simply click on the download button. This will save all results, not just the first 50 displayed. You may also download amino acid or nucleotide sequences of the search results with one-click buttons.
    • To save individual protein or gene sequences from a protein page, copy the displayed FASTA-formatted text and paste it into a local file.
    • To save the protein context graphic, just select or point to it and save it as an image. This is true also for the compare regions and pins displays.
    • To download whole annotated genomes for each of the focus organisms in modified GFF3 format, see the NMPDRDownloads page. The formatted GFF3 files contain rows of records, each with nine tab-delimited fields: seqid, source, type, start, end, score, strand, phase, and attributes. The "score" and "phase" fields are not in use, so in each row, those fields contain the "." character. Each row describes a feature, which is a region on the DNA located between start and end nucleotide coordinates. To describe a protein-encoding gene, two rows are used to record two features at the same location: gene and CDS. FASTA formatted gene and protein sequences follow the tab-delimited table of feature annotations.
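
    The nine-field layout described in the last item can be read with a short script. This is a sketch under stated assumptions: it stops at the first line that begins with ">" or "##FASTA" (taken here to mark the start of the sequence section), keeps the attributes field as raw text because the details of the modified format are not given here, and uses hypothetical sample rows.

```python
from typing import NamedTuple

class Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: str       # "." when unused
    strand: str
    phase: str       # "." when unused
    attributes: str  # kept as raw text

def parse_feature_rows(text: str) -> list[Feature]:
    """Collect nine-field feature rows until the FASTA section begins."""
    features = []
    for line in text.splitlines():
        if line.startswith(">") or line.startswith("##FASTA"):
            break  # assumed start of the sequence section
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a feature row
        seqid, source, ftype, start, end, score, strand, phase, attrs = fields
        features.append(Feature(seqid, source, ftype, int(start), int(end),
                                score, strand, phase, attrs))
    return features

# Hypothetical rows: one protein-encoding gene recorded as a gene and a CDS.
sample = (
    "contig_1\tNMPDR\tgene\t100\t400\t.\t+\t.\tID=example_gene\n"
    "contig_1\tNMPDR\tCDS\t100\t400\t.\t+\t.\tID=example_cds\n"
    ">example_gene\n"
    "ATGAAA\n"
)
features = parse_feature_rows(sample)
for f in features:
    print(f.type, f.start, f.end, f.strand)
```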


  5. What is a feature?
    A feature is anything that can be mapped onto a strand of DNA, and is defined by its start and stop location. A gene is a feature. A protein coding sequence (CDS or PEG) is a feature that, in bacteria, shares the same location on the DNA as its gene, but is represented as an amino acid sequence rather than a nucleotide sequence. Eukaryotic genes also have intron and exon features defined as subsets of the gene feature. Short regulatory elements or functional motifs may also be defined as features. Pathogenicity islands are features that include many genes. Operons are not presently annotated as features in NMPDR.


  6. What data are in NMPDR?
    Complete genomes are the primary data. As such, most chromosomes are one contiguous length of DNA sequence data, or one "contig." Some genomes that are fragmented into several "contigs" are considered to be essentially complete. Genome data include DNA sequences, protein sequences, and the associated annotations. Annotations include both the accurate determination of gene boundaries and the assignment of a functional name to the encoded proteins. NMPDR curators use bioinformatics tools to correct errors in the start or stop codons of genes, and to change incorrect or ambiguous names in the annotations of protein encoding genes, which are called "pegs" in the NMPDR. A peg is equivalent to a CDS.

    Populated subsystems are a data type unique to the NMPDR and its underlying annotation environment, the SEED. Subsystems are sets of functional roles grouped according to any biologically useful organizing principle. A subsystem may describe a metabolic pathway, but subsystems are not limited to pathways. For example, there are subsystems that include the ribosomal proteins, or cell division proteins, or pathogen-specific virulence factors. A subsystem may comprise very few or very many proteins that are related in some functional or structural way. Each protein included in a subsystem plays a "functional role," which may be enzymatic, signaling, regulatory, structural, or other. A subsystem may exist in all genomes or be present in only a few closely related genomes. A populated subsystem is a two-dimensional integration of biological functions with genome sequences. It is presented as a spreadsheet with columns of functional roles, rows of genomes, and cells populated by the genes responsible for each function.
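
    One way to picture a populated subsystem is as a nested table keyed by genome and then by functional role. The sketch below is purely illustrative: the role names, genome names and gene identifiers are hypothetical, and the helper simply lists cells in which no gene has been assigned.

```python
# Columns of the spreadsheet: functional roles (hypothetical names).
roles = ["Role A", "Role B", "Role C"]

# Rows are genomes; each cell holds the genes that perform the role.
spreadsheet = {
    "Genome 1": {"Role A": ["peg.10"], "Role B": ["peg.11"], "Role C": ["peg.12"]},
    "Genome 2": {"Role A": ["peg.7"], "Role B": [], "Role C": ["peg.9", "peg.30"]},
}

def empty_cells(table: dict, columns: list[str]) -> list[tuple[str, str]]:
    """Return (genome, role) pairs for which no gene is recorded."""
    return [(genome, role)
            for genome, row in table.items()
            for role in columns
            if not row.get(role)]

gaps = empty_cells(spreadsheet, roles)
for genome, role in gaps:
    print(f"{genome}: no gene assigned to {role}")
```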

    Functional clusters are another data type unique to NMPDR and the SEED. The functions of two proximal genes are more likely to be related when they are similarly clustered together in a large number of organisms distributed over a wide phylogenetic space, which is represented as a high functional clustering score. The score is approximately equal to the number of different species (not strains) in which the two genes are co-localized. Functional clustering provides insight to the specific roles played by proteins that may initially be assigned an ambiguous functional role, like "transporter." Functional clusters are presented graphically and by scores for every peg in the database. Additionally, if the peg of interest to you does not share conserved proximity with others, it is possible to discover whether orthologs of your protein are clustered in other genomes. The CL button will display a table of orthologs that might be clustered with other proteins. Clicking on the functional clustering score will also provide a list of orthologs paired in other genomes.
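
    As a rough illustration of the scoring idea, the sketch below counts the distinct species, not strains, in which a pair of genes is found close together. The input list is hypothetical, and the real score is only described above as approximately equal to this count.

```python
def clustering_score(colocalized_in: list[tuple[str, str]]) -> int:
    """Count distinct species among (species, strain) genomes where the pair co-occurs."""
    return len({species for species, _strain in colocalized_in})

# Hypothetical genomes in which two genes of interest are neighbors.
observations = [
    ("Streptococcus pyogenes", "strain_1"),
    ("Streptococcus pyogenes", "strain_2"),   # same species, counted once
    ("Streptococcus agalactiae", "A909"),
    ("Staphylococcus aureus", "strain_x"),
]
score = clustering_score(observations)
print("approximate functional clustering score:", score)
```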

    BLAST hits are pre-computed in a reciprocal analysis for all proteins in the database. The results are presented in a table of bidirectional best hits, or BBH. A comprehensive list of orthologs is thus provided for every protein in every genome. You may select orthologs from this list and generate a ClustalW alignment with one click.


  7. What is a subsystem?
    A subsystem has two components. First is a list of functional roles that are united by any common process or biologically meaningful organizing principle. Second is a spreadsheet, called a populated subsystem, which is a two-dimensional integration of biological functions with genome sequences. In the populated subsystem, functional roles are represented in columns, genomes are represented in rows, and cells of the spreadsheet are populated by the genes responsible for each function. Genes that are clustered on the chromosome share the same background color in the spreadsheet. Gene identification numbers are linked to NMPDR protein context pages. If multiple genes play the same functional role, the variants are named in the table of functional roles. The row number from that table is then appended to the gene number in the spreadsheet to identify which variant is used. Have a look at the Adhesins in Staphylococcus as an example.

    Subsystems may be accessed from the context page of a member protein by clicking the link in the biological context section of the page. There is also a list of subsystems organized in a tree view on this page. After clicking on show subsystems, a metabolic reconstruction, or comprehensive list of subsystems and proteins that perform functional roles, is returned for the chosen genome. Subsystem headers link to populated subsystem (spreadsheet) displays, and proteins link to their respective context pages.

    An investigator can learn much by establishing a subsystem of functions in genomes that are known to contain all the required genes, then using the computer to extend the subsystem to genomes about which less is known. NMPDR is used to browse subsystems established by our curators. The SEED may be used by investigators to create their own subsystems.


  8. How do I find what I'm looking for?
    All genomes may be searched from the home page. To limit your search to one of the NMPDR core organism groups, you may start your search from one of the organism summary pages:

    On each page you will find a search box where you can search for specific genes or proteins by text. To search by gene or protein sequence, choose the "BLAST or Scan" option. To compose an advanced query with selected genomes or subsystems, choose the "Genes" option. Search results are returned in a table that links to two viewing environments:

    • By clicking on the NMPDR option, you will get to a page that focuses on the specific protein. You will see a table that lists other proteins in the neighborhood of the target gene, with the target highlighted in green. A graphic of the genetic neighborhood is also presented, with the target gene in green, functionally coupled neighbors of the target in blue, and unrelated neighboring genes in red. Additionally, access to annotations, sequence, subsystems, and comparisons with genes in other genomes is offered.

    To understand the full functionality available in the two environments, you will need to take some time to experiment. Feel free to contact us with any questions. We are adding help text and smoothing out the interfaces as quickly as possible, largely in response to specific requests and suggestions, so please do take the time to formulate them.


  9. What genomes are in the NMPDR?
    The NMPDR contains two classes of genomes: the pathogens we are funded to annotate, which we call "core genomes," and "supporting genomes," which include all publicly available genomes for comparative analysis. The table below lists core pathogens, with strain designation and serotype when known, and closely related supporting genomes.

    Campylobacter: coli RM2228; fetus subsp. fetus 82-40; lari RM2100; upsaliensis RM3195
    Listeria: innocua 6a Clip11262; welshimeri 6b SLCC5334
    Staphylococcus: epidermidis RP62A; epidermidis ATCC12228; haemolyticus JCSC1435; saprophyticus ATCC 15305
    Streptococcus: agalactiae serotype V 2603V/R; agalactiae A909; agalactiae NEM316
    Vibrio: fischeri ES114; splendidus 12B01; sp. MED222; sp. Ex25 O62

     


  10. What can I do in the NMPDR environment?
    In the NMPDR environment you can visualize a gene, the protein it encodes, the context of the gene on its contig, and a wealth of specific information relating to that gene. The NMPDR protein page presents a table of information related to the target gene and those genes found up- and down-stream on the chromosome. The feature identification number for NMPDR, fid, is listed first. Next are listed the start and stop nucleotide coordinates, length, size of gap (or overlap) between genes, and the orientation on the + or - strand. The functional name assigned to each gene in the NMPDR annotation is listed under "function," while names or numbers assigned to the gene by other sources are listed as aliases with links to external resources such as UniProt, GenBank, and KEGG. A graphic representation of gene context shows the target gene in green, functionally related genes in blue, and unrelated neighbors in red.
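
    The gap (or overlap) column can be reproduced from the start and stop coordinates alone. The sketch below assumes 1-based inclusive coordinates and that a gene on the - strand is listed with its start greater than its stop; both are assumptions for illustration, and the gene names and coordinates are hypothetical.

```python
def neighbor_gaps(genes: list[tuple[str, int, int]]) -> list[tuple[str, str, int]]:
    """For (fid, start, stop) genes, return (left fid, right fid, gap) per adjacent pair.

    A positive gap is the number of bases between two genes; a negative value
    means the genes overlap by that many bases.
    """
    # Normalize each gene to (leftmost, rightmost) regardless of strand.
    spans = sorted((min(start, stop), max(start, stop), fid) for fid, start, stop in genes)
    return [(a[2], b[2], b[0] - a[1] - 1) for a, b in zip(spans, spans[1:])]

# Hypothetical neighborhood: gene_b is listed stop-before-start (assumed - strand).
region = [("gene_a", 100, 400), ("gene_b", 900, 410), ("gene_c", 890, 1200)]
gaps = neighbor_gaps(region)
for left, right, gap in gaps:
    label = "gap" if gap >= 0 else "overlap"
    print(left, right, label, abs(gap))
```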

    Two powerful tools for comparative analysis of functional clustering are linked as buttons on the NMPDR protein page. The "CL" button will open a table of homologs to the target gene that appear in functional clusters in the genomes of the other organisms. The table is ordered by the size of the cluster, that is, by how many genes are functionally linked. The "Pins" button will present a graphic of gene clusters, similar to the context graphic, but this will include homologous regions from many organisms, ordered phylogenetically. The target gene is numbered 1, and clustered homologs share a numerical label.

    At the bottom of the NMPDR protein page, links are provided to other sites and to tools that help to analyze the encoded protein. If you have a tool that you would like linked from our site, or a tool that you would like to link into our site, please contact us.


  11. What is meant by "functional coupling"?
    The term functional coupling has been used to indicate that two genes appear to have related functional roles (e.g., the encoded proteins both participate in the same metabolic pathway or they both are components in a single complex). One exciting challenge for bioinformatics is to predict functional coupling. Perhaps the most effective technique for doing so relates to analysis of proximity on the chromosome; when two genes tend to occur fairly close to one another in numerous genomes, it amounts to solid evidence that the roles of the gene products are closely related. For details see "The use of gene clusters to infer functional coupling," Proc Natl Acad Sci U S A. 1999 Mar 16; 96(6):2896-2901.

    Because comparative analysis of gene clusters has begun to play a much larger role in determination of gene function (due to the rapid increase in the number of available genomes), we have computed instances in which genes appear to be functionally coupled and make the inferences accessible from the NMPDR environment. When you are on the NMPDR protein page, which shows a table of the genes that occur in the region around the gene you are focused on, you will see a column labeled "fc-sc." If a number occurs in this column, there is evidence based on clustering that the genes with numbers are functionally coupled to the gene of focus in that number of genomes. The number is actually a link to a table displaying co-occurrences of the two genes.


  12. What similarities are shown?
    NMPDR contains bi-directional best hits (BBH) precomputed using BLASTP. Two sequences S1 and S2 from genomes G1 and G2 are bi-directional best hits if:

    • S1 and S2 are from different genomes (G1 ≠ G2),
    • of all the sequences in genome G1, S1 is the most similar to S2, and
    • of all the sequences in genome G2, S2 is the most similar to S1.

    The clear disadvantage of BBH is that duplicates or paralogs within the same genome will not be listed in the BBH table. The table will display the annotations, E-values (the number of hits with an equal or better score expected by chance), and links to similar proteins in other organisms.
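
    The three conditions can be written out as a short check. This is a sketch of the definition only, not the pipeline that precomputes the NMPDR table; the similarity scores, sequence names and genome labels are hypothetical, and ties are broken arbitrarily.

```python
def best_hit(query, target_genome, scores, genome_of):
    """Return the sequence in target_genome with the highest score against query."""
    candidates = {subject: score
                  for (q, subject), score in scores.items()
                  if q == query and genome_of[subject] == target_genome}
    return max(candidates, key=candidates.get) if candidates else None

def is_bbh(s1, s2, scores, genome_of):
    """True when s1 and s2 come from different genomes and are each other's best hit."""
    g1, g2 = genome_of[s1], genome_of[s2]
    if g1 == g2:
        return False
    return (best_hit(s1, g2, scores, genome_of) == s2
            and best_hit(s2, g1, scores, genome_of) == s1)

# Hypothetical data: a1 and a2 are paralogs in genome G1.
genome_of = {"a1": "G1", "a2": "G1", "b1": "G2", "b2": "G2"}
scores = {
    ("a1", "b1"): 200, ("a1", "b2"): 150,
    ("a2", "b1"): 180, ("a2", "b2"): 90,
    ("b1", "a1"): 200, ("b1", "a2"): 180,
    ("b2", "a1"): 150, ("b2", "a2"): 90,
}
print(is_bbh("a1", "b1", scores, genome_of))
print(is_bbh("a2", "b1", scores, genome_of))
```

    In this toy data the paralog a2 finds b1 as its best hit, but b1 prefers a1, so a2 would not appear in the table. This is the disadvantage noted above.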


  13. What about links to other tools?
    When we load data into the NMPDR we collect links for each feature. Some of these links take you to corresponding entries in databases maintained by numerous groups worldwide. Other links take you to tools that will aid in your analysis of the proteins. In either case, you need to ensure that the database or tool is appropriate for the question that you are trying to answer. Not all tools are appropriate for all questions. If you know of a resource that we should be linking features to, please feel free to point this out to us.


  14. How does the signature genes tool work?
    Searching for genes that discriminate between two sets of organisms

    The motivation for the Signature Genes Tool is to try to locate genes related to a phenotype that is associated with one set of organisms (call this set1) but not with another (call this set2).

    The search goes through the genes in one organism from set1, selected as the reference genome. For each gene in the reference genome, the tool evaluates the bidirectional best hits of the genes that occur in genomes from set1 and set2. It tabulates these and constructs a score from 0 to 1. A score of 1 means that the gene has a bidirectional best hit in every genome from set1 and no bidirectional best hits against any genome in set2.

    The scores are tabulated. The best candidate genes are then presented to you as a list of genes to explore. The main shortcoming of the tool relates to our use of bidirectional best hits. If there are paralogs to the gene within genomes, a bidirectional best hit may not exist in a genome that contains several clear homologs. This means that we may miss genes with paralogs, and we may include genes that do not discriminate as well as the score suggests. You should therefore treat each gene as a candidate and nothing more. There is now the option of running the tool using precomputed similarities rather than bidirectional best hits.
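
    The exact scoring formula is not given here, so the sketch below uses an assumed one that merely satisfies the stated properties: it ranges from 0 to 1 and equals 1 only when the gene has a bidirectional best hit in every set1 genome and in no set2 genome. The function name and genome labels are hypothetical.

```python
def signature_score(genomes_with_bbh: set, set1: set, set2: set) -> float:
    """Assumed formula: (fraction of set1 hit) * (1 - fraction of set2 hit)."""
    in_set1 = len(genomes_with_bbh & set1) / len(set1)
    in_set2 = len(genomes_with_bbh & set2) / len(set2)
    return in_set1 * (1.0 - in_set2)

# Hypothetical genome sets; both must be non-empty.
set1 = {"pathogen_A", "pathogen_B"}
set2 = {"relative_C", "relative_D"}

perfect = signature_score({"pathogen_A", "pathogen_B"}, set1, set2)
leaky = signature_score({"pathogen_A", "pathogen_B", "relative_C"}, set1, set2)
partial = signature_score({"pathogen_A"}, set1, set2)
print(perfect, leaky, partial)
```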


  15. What practical use can a bench scientist make of comparative genomics?
    HOPS: public depository of Hypotheses and Open Problems identified by Subsystem analysis

    Comparative analysis of genomes reveals multiple gaps in our knowledge of basic biochemical and cellular processes. Accurate mapping of the revealed open problems within a framework of specific subsystems and groups of organisms sets the stage for generating hypotheses amenable to experimental testing. In a growing number of cases, predictions of novel genes and pathways revealed by comparative genomics techniques have been successfully verified. Based on this vision, the scope of the HOPS Database is to build and maintain a public repository of:

    1. Well-defined open problems (knowledge gaps) revealed by comparative genome analysis of various subsystems. The major types of such problems, "missing genes" or "functionally coupled hypotheticals," are listed at Help on How to Pick problem Types.

    2. Hypotheses, testable predictions pertaining to these problems.

    3. Records of experimental follow-up and comments on any of the suggested hypotheses, in a range from "intend to test" to "proven right/wrong."

    It is important to emphasize that we aim to restrict the breadth of open problems to those that are:

    • In a specific functional context. For example, in general, we avoid recording questions like: "what is the function of this hypothetical protein?" On the other hand, "missing gene" questions of the form: "what gene encodes this enzyme in an otherwise complete pathway?" are highly valued.

    • Within the realm of comparative genome analysis. We realize that many interesting problems of biology do not fall in this category.

    • Tractable. This requirement may filter out many problems (e.g., related to complex regulatory systems, etc.) that may not be addressed by conjectures amenable to straightforward experimental verification.

    Likewise, we aim to accumulate predictions that provide a precisely defined and testable functional role, transformation or interaction. For this reason, "general class" functional predictions (e.g., putative kinase) are not the focus of our effort. By launching this site, we commit to populating it with problems and conjectures emerging from our effort to encode subsystems in the SEED environment, capturing many aspects of the Central Machinery of Life. Our goal is to share this information with the broad scientific community in order to encourage further computational and experimental analysis. Most importantly, we solicit community contributions to all three aspects of the HOPS Database, which is meant to become a joint effort of bioinformaticians and experimentalists. It is important to emphasize that experimental verification of a single gene, carefully propagated via subsystems-based annotations, will often impact a significant number of genes in a variety of species.

Topic revision: r8 - 24 Aug 2008 - 11:19:50 - BruceParrello
 
NMPDR is a collaboration among researchers from the Computation Institute of the University of Chicago, the Fellowship for Interpretation of Genomes (FIG), Argonne National Laboratory, and the National Center for Supercomputing Applications (NCSA) at the University of Illinois. NMPDR is funded by the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under Contract HHSN266200400042C. Banner images are copyright © Dennis Kunkel.