Frequently Asked Questions
NMPDR questions about genome analysis
How do I start?
- Use the protein name or gene name in a keyword search. A search for a protein name will return all instances of the name found in all organisms in the database. A search for a gene name (e.g. lacZ) may return fewer results because we do not curate gene names. Add the name of an organism to your keywords to focus the search. Explore the genomic and biological context of your protein using the viewer and subsystem links in the search results. Detailed instructions are found in Tutorials under the Help menu.
- Use the sequence of your gene or protein to blast against any complete genome. Use one or more query sequences to search one or more genomes.
- Use the subsystems tree to view the phylogenetic distribution of a metabolic pathway or biological process.
- Use the essential genes page to view sets of essential genes in model organisms and to project essentiality to closely or distantly related organisms.
What data are in NMPDR?
Genomes are the primary data. A genome is the complete complement of DNA contained in a single organism. One genome may consist of more than one replicating molecule (replicon) such as chromosomes and plasmids. In a finished genome sequence, each replicon is one contiguous length of DNA sequence data, or one "contig." Genomes that are fragmented into several "contigs" are considered to be essentially complete when it is a statistical likelihood that greater than 99.9% of the total nucleotides are represented in the data, and at least 70% of the nucleotides are in contigs at least 20 kbp in length. Secondary data include translated protein sequences and the associated annotations. Annotations include both the accurate determination of gene boundaries and the assignment of a functional name to the encoded proteins. NMPDR curators use bioinformatics tools to correct errors in the start or stop codons of genes, and to change incorrect or ambiguous names in the annotations of protein-encoding genes, which are called "pegs" in the NMPDR. A peg is the most common type of genomic feature.
Subsystems are a data type unique to the NMPDR and its underlying annotation environment, the SEED. Subsystems are sets of functional roles grouped according to any biologically useful organizing principle. A subsystem may describe a metabolic pathway, but subsystems are not limited to pathways. For example, there are subsystems that include the ribosomal proteins, or cell division proteins, or pathogen-specific virulence factors. A subsystem may comprise very few or very many proteins that are related in some functional or structural way. Each protein included in a subsystem plays a "functional role" which may be enzymatic, signaling, regulatory, structural, or other. A subsystem may exist in all genomes or be present in only a few closely related genomes. A populated subsystem is a two-dimensional integration of biological functions with genome sequences. It is presented as a spreadsheet with columns of functional roles, rows of genomes, and cells populated by the genes responsible for each function.
Functional couples and clusters are another data type unique to NMPDR and the SEED. The functions of two proximal genes are more likely to be related when they are located close together in a large number of organisms distributed over a wide phylogenetic space, which is represented as a high functional coupling score. The score is approximately equal to the number of different species (not strains) in which the two genes are co-localized. Functional coupling provides insight into the specific roles played by proteins that may initially be assigned an ambiguous functional role, like "transporter." One focus protein may be functionally coupled with more than one other protein. These functional clusters are presented graphically as gray background shading in the compare regions graphic on the annotation overview page. In the tabular view of the compare regions graphic, proteins functionally coupled with the focus protein will have a score in the column labeled "FC". Clicking on the functional clustering score will provide a list of genomes in which the focus and scored proteins are co-localized. Additionally, if the focus protein of interest to you does not share conserved proximity with others, it is possible to discover whether homologs of your protein are clustered in other genomes. The "Find Clusters" function will display a table of orthologs that are clustered in other organisms. For details about the computation, see "The use of gene clusters to infer functional coupling," Proc Natl Acad Sci U S A. 1999 Mar 16; 96(6):2896-2901.
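As a rough illustration of the scoring idea described above, the following Python sketch counts the distinct species in which a gene pair is co-localized. It is not the SEED computation itself (see the cited paper for that). The function name, the input layout, the 5000 bp distance threshold, and the toy observations are all assumptions made for this example.

```python
def coupling_score(observations, max_distance=5000):
    """Count distinct species in which two genes lie close together.

    observations: iterable of (species, strain, distance_bp) tuples, where
    distance_bp is the gap between homologs of the two genes in that strain,
    or None when they are not on the same contig.
    max_distance is a hypothetical threshold, not an NMPDR parameter.
    """
    species_with_pair = set()
    for species, _strain, distance_bp in observations:
        if distance_bp is not None and distance_bp <= max_distance:
            # Several strains of one species contribute only once.
            species_with_pair.add(species)
    return len(species_with_pair)


# Invented observations for one gene pair.
observations = [
    ("Species A", "strain 1", 120),
    ("Species A", "strain 2", 95),      # same species, counted once
    ("Species B", "strain 1", 400),
    ("Species C", "strain 1", None),    # genes on different contigs
    ("Species D", "strain 1", 250000),  # same contig but far apart
]

score = coupling_score(observations)
print(score)
```

Counting species rather than strains keeps a heavily sequenced species from inflating the score.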
Orthologs and paralogs are pre-computed by BLASTP for all proteins in the database. Results of unidirectional BLASTP analyses are provided as a table of similarities on the feature evidence page for every protein in the database. The results of reciprocal BLASTP analyses, called bidirectional best hits or BBH, provide a comprehensive list of orthologs for every protein in every genome.
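The bidirectional best hit idea can be sketched in a few lines of Python. The example assumes the best BLASTP hit for each protein has already been determined in both directions. The function name and the protein identifiers are invented for illustration and do not reflect how NMPDR stores its results.

```python
def bidirectional_best_hits(best_a_to_b, best_b_to_a):
    """Return (a, b) pairs where a's best hit is b and b's best hit is a.

    best_a_to_b: dict mapping each protein in genome A to its best hit in B.
    best_b_to_a: dict mapping each protein in genome B to its best hit in A.
    """
    return sorted(
        (a, b)
        for a, b in best_a_to_b.items()
        if best_b_to_a.get(b) == a
    )


# Invented best-hit tables for two small genomes.
a_to_b = {"a1": "b1", "a2": "b2", "a3": "b2"}
b_to_a = {"b1": "a1", "b2": "a3", "b3": "a2"}

pairs = bidirectional_best_hits(a_to_b, b_to_a)
print(pairs)
```

Here a2 has b2 as its best hit, but b2 prefers a3, so only the reciprocal pairs (a1, b1) and (a3, b2) are reported.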
What is a feature?
A feature is anything that can be mapped onto a strand of DNA, and is defined by its start and stop location. A gene is a feature. A protein coding sequence (CDS or PEG) is a feature that, in bacteria, shares the same location on the DNA as its gene, and may be represented as an amino acid sequence translated from the nucleotide sequence. Eukaryotic genes also have intron and exon features defined as subsets of the gene feature. Short regulatory elements or functional motifs may also be defined as features. Pathogenicity islands are features that include many genes.
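One way to picture a feature is as a small record holding a type, a contig, and start and stop coordinates. The Python sketch below is only an illustration. The class and field names are invented, and the convention that start is greater than stop for a feature on the reverse strand is an assumption of this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    feature_type: str  # e.g. "gene", "peg", "pathogenicity island"
    contig: str
    start: int
    stop: int

    @property
    def span(self):
        """Leftmost and rightmost coordinates, regardless of strand."""
        return (min(self.start, self.stop), max(self.start, self.stop))

    def contains(self, other):
        """True if `other` lies entirely within this feature on the same contig."""
        left, right = self.span
        other_left, other_right = other.span
        return (
            self.contig == other.contig
            and left <= other_left
            and other_right <= right
        )


# Invented coordinates: an island that includes a reverse-strand peg.
island = Feature("pathogenicity island", "contig1", 1000, 9000)
peg = Feature("peg", "contig1", 4200, 3100)  # start > stop: reverse strand (assumed)

print(island.contains(peg))
```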
What is a subsystem?
A subsystem has two components. First is a list of functional roles that are united by any common process or biologically meaningful organizing principle. Second is a spreadsheet, called a populated subsystem, which is a two-dimensional integration of biological functions with genome sequences. In the populated subsystem, functional roles are represented in columns, genomes are represented in rows, and cells of the spreadsheet are populated by the genes responsible for each function. Genes that are clustered on the chromosome share the same background color in the spreadsheet. Gene identification numbers are linked to NMPDR annotation overview pages. If multiple genes play the same functional role, the variants are named in the table of functional roles. The row number from that table is then appended to the gene number in the spreadsheet to identify which variant is used. Have a look at the Adhesins in Staphylococcus as an example.
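The two components can be modeled as a list of roles plus a nested mapping of genomes to roles to genes. The Python sketch below uses invented genome names, role names, and gene identifiers, and shows how empty cells point to roles for which no gene has been found in a genome.

```python
# Component 1: the list of functional roles (spreadsheet columns).
roles = ["Role A", "Role B", "Role C"]

# Component 2: the populated subsystem (rows are genomes, cells hold genes).
spreadsheet = {
    "Genome 1": {"Role A": ["peg.12"], "Role B": ["peg.13", "peg.250"], "Role C": []},
    "Genome 2": {"Role A": ["peg.7"], "Role B": [], "Role C": ["peg.91"]},
}


def missing_roles(spreadsheet, roles):
    """For each genome, list the roles whose cell is empty."""
    return {
        genome: [role for role in roles if not cells.get(role)]
        for genome, cells in spreadsheet.items()
    }


def render(spreadsheet, roles):
    """Return the spreadsheet as tab-delimited text, one genome per row."""
    lines = ["\t".join(["Genome"] + roles)]
    for genome, cells in spreadsheet.items():
        row = [", ".join(cells.get(role, [])) for role in roles]
        lines.append("\t".join([genome] + row))
    return "\n".join(lines)


gaps = missing_roles(spreadsheet, roles)
print(render(spreadsheet, roles))
print(gaps)
```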
Subsystems may be accessed from the annotation overview page of a member protein by clicking its link. Access to subsystems is also available from the table of features in subsystems shown on every organism overview page. You can access subsystems directly by starting from the subsystems tree.
An investigator can learn much by establishing a subsystem of functions in genomes that are known to contain all the required genes, then using the computer to extend the subsystem to genomes about which less is known. NMPDR is used to browse subsystems established by our curators. The SEED may be used by investigators to create their own subsystems.
How do I save or download data?
- Save the table of search results by clicking the download button at the top of the page. The download is a tab-delimited text file that may be opened as a spreadsheet, and it contains all results, not just the first 50 displayed. You may also download all sequences of the search results as amino acid or nucleotide FASTA files with one-click buttons.
- Save an individual sequence from an annotation overview page by clicking the "sequence" link, then copy the FASTA-formatted text shown and paste it into a local file. Use the radio buttons to select amino acids or DNA with the desired number of flanking nucleotides.
- Save the Compare Regions graphic by pointing to white space within the graphic, then right-clicking to save it as an image.
- Download whole annotated genomes for each of the focus organisms in modified GFF3 format from the NMPDR Downloads page. The formatted GFF3 files contain rows of records, each with nine tab-delimited fields: seqid, source, type, start, end, score, strand, phase, and attributes. The "score" and "phase" fields are not in use, so in each row, those fields contain the "." character. Each row describes a feature, which is a region on the DNA located between start and end nucleotide coordinates. To describe a protein-encoding gene, two rows are used to record two features at the same location: gene and CDS. FASTA formatted gene and protein sequences follow the tab-delimited table of feature annotations.
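To work with a downloaded genome file programmatically, the nine-field layout described above can be read with a short Python sketch like the one below. It assumes the feature table ends at the first FASTA header line (or at a "##FASTA" directive, as in standard GFF3). The sample rows, the source value, and the IDs are invented for illustration.

```python
GFF3_FIELDS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff3_features(lines):
    """Parse tab-delimited feature rows, stopping where the FASTA section begins."""
    features = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA") or line.startswith(">"):
            break  # sequences follow the feature table
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/directive lines
        fields = line.split("\t")
        if len(fields) != len(GFF3_FIELDS):
            continue  # not a nine-field feature row
        record = dict(zip(GFF3_FIELDS, fields))
        record["start"] = int(record["start"])
        record["end"] = int(record["end"])
        features.append(record)
    return features


# Invented sample: one protein-encoding gene recorded as a gene row and a CDS row.
sample = [
    "##gff-version 3",
    "contig1\tNMPDR\tgene\t100\t400\t.\t+\t.\tID=example.peg.1",
    "contig1\tNMPDR\tCDS\t100\t400\t.\t+\t.\tID=example.peg.1",
    "##FASTA",
    ">example.peg.1",
    "ATGAAA",
]

features = parse_gff3_features(sample)
for feature in features:
    print(feature["type"], feature["start"], feature["end"])
```

For a real file, pass an open file handle instead of the sample list.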
How do I BLAST multiple sequences at once against one or more genomes?
- Select the BLAST or Scan search option.
- Select the appropriate BLAST tool from the drop down list.
- Select blastp if your query sequences are proteins. Select blastx if the query sequences are DNA and you are looking for orthologous pegs (CDS). Select blastn if you are looking for a very close nucleotide match.
- Paste all your FASTA format sequences into the box. It makes no difference whether there is an empty line between the different sequences, just as long as each sequence begins on a new line with a FASTA header.
- Select one or more genomes from the scrolling menu by using control-click. You can quickly select all strains of one species by typing either the species or genus name into the text box and clicking the button "Select genomes containing."
- Click the BLAST button.
- Results are returned in order of blast score, which is dependent on protein length, so the result of a lot of sequences blasted against a lot of genomes may be a bit messy. The more work there is to do, the longer the search will take.
- To save the table of results, use one of the download options at the top of the results page. You may download the text of the results table as a tab-delimited file, which you can then open in Excel and resort by organism or functional name. You may also download the nucleotide or amino acid sequences of the matching genes with one-click buttons at the top of the results page.
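Before pasting many queries into the box, it can help to check that each sequence starts with its own FASTA header. The Python sketch below splits multi-sequence FASTA text into records and ignores blank lines, as the steps above describe. The function name and the query sequences are invented for illustration.

```python
def split_fasta(text):
    """Split FASTA text into (header, sequence) pairs, ignoring blank lines."""
    records = []
    header, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # empty lines between sequences make no difference
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
        else:
            raise ValueError("sequence data found before the first FASTA header")
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Invented queries: one blank line between records, one sequence wrapped over two lines.
queries = """>query1
MKTAYIAK

>query2
MSLNQ
LLE
"""

records = split_fasta(queries)
print(records)
```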
How do I find a degenerate peptide motif in selected organisms?
- Select the BLAST or Scan search option.
- Select the protScan tool from the drop down list.
- Type the motif of interest in the sequence box.
- For example, use a collagen-binding motif implicated in acute rheumatic fever (ARF) following streptococcal infection, AXYLXXLN (J Biol Chem 282:18686).
- Select the genomes to search in the genome list.
- For example, select all strains of Streptococcus pyogenes from the genomes list quickly by typing "pyogenes" in the text box and clicking the button, "Select genomes containing."
- Now click the Scan button.
- Matching sequences are presented in a table with links to respective annotation overview pages and a new Context viewer.
- ProtScan finds this matching sequence, AEYLKGLN, in the M protein of two M3 strains. Does the motif appear in the ARF-associated M18 strain MGAS8232? (answer: PMID 12813026) The tool also finds AAYLDDLN in the SatD protein of strains with emm-types M1, M2, M4, M12, and M28. What is the role of SatD in strep? (answer: PMID 11274114)
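For sequences you already have on your own computer, a degenerate motif such as AXYLXXLN can be approximated with a regular expression in which each X matches any single residue. The Python sketch below is not the protScan implementation. The function names are invented, and the toy sequences simply embed the two peptides mentioned above.

```python
import re


def motif_to_pattern(motif):
    """Translate a motif with X wildcards into a compiled regular expression."""
    body = "".join(
        "." if residue == "X" else re.escape(residue)
        for residue in motif.upper()
    )
    # The lookahead lets overlapping occurrences be reported as well.
    return re.compile("(?=(" + body + "))")


def scan(motif, sequence):
    """Return (1-based position, matched peptide) for every occurrence."""
    pattern = motif_to_pattern(motif)
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(sequence.upper())]


# Invented toy sequences, not real database entries.
toy_sequences = {
    "toy_protein_1": "MKAEYLKGLNQ",
    "toy_protein_2": "MTAAYLDDLNRK",
    "toy_protein_3": "MKAEYIKGLNQ",  # I instead of L at the fixed position: no match
}

hits = {name: scan("AXYLXXLN", seq) for name, seq in toy_sequences.items()}
print(hits)
```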
RAST and MG-RAST questions about genome annotation servers
Which server should I use?
- RAST is designed to annotate complete, or essentially complete, prokaryotic genomes--not viruses, not independent plasmids, not eukaryotes (even small ones). By essentially complete we mean about 99% of the total nucleotides are represented in a set of assembled sequence data in which 70% of the total nucleotides are in contigs longer than 20 kbp. This is typically equivalent to 5X coverage by the Sanger method and 10X coverage by 454.
- MG-RAST is designed to annotate a large set of short nucleotide sequences--not a complete genome and not amino acid sequences.
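The contig-length part of the "essentially complete" guideline for RAST can be checked from a list of contig lengths before uploading. The Python sketch below computes the fraction of nucleotides in contigs longer than 20 kbp. The function name and the contig lengths are invented, and the sketch does not address the other criterion (the share of the genome represented in the data), which cannot be judged from the assembly alone.

```python
def fraction_in_long_contigs(contig_lengths, min_length=20_000):
    """Fraction of assembled nucleotides found in contigs longer than min_length."""
    total = sum(contig_lengths)
    if total == 0:
        raise ValueError("no sequence data")
    in_long_contigs = sum(n for n in contig_lengths if n > min_length)
    return in_long_contigs / total


# Invented assembly: contig lengths in base pairs.
contig_lengths = [1_500_000, 400_000, 60_000, 25_000, 10_000, 5_000]

fraction = fraction_in_long_contigs(contig_lengths)
meets_guideline = fraction >= 0.70
print(round(fraction, 4), meets_guideline)
```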
Which sequences should I upload?
- RAST will accept sequence data in FASTA format (.fna), and GenBank (.gbk) format, uploaded as plain text files with no special characters. FASTA sequences must be DNA, and all contigs for one genome should be together in one file. GenBank files must include DNA sequence and may also include protein sequences.
- MG-RAST will accept 454 output files directly if compressed together in one archive using tar or gzip. All reads for one metagenome should be in one file. Several metagenomes that constitute one project (e.g. a time series) may be uploaded at once by compressing all metagenome files together in one archive using tar or gzip. All sequence reads must be DNA.
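If you prefer to build the upload archive with a script, the Python standard library can produce a gzip-compressed tar file. The sketch below bundles several metagenome files from one project into a single archive. The function name, file names, and file contents are invented for the example, and the demonstration writes only to a temporary directory.

```python
import os
import tarfile
import tempfile


def bundle_reads(read_files, archive_path):
    """Write the given files into one gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in read_files:
            # Store each file under its base name, without local directories.
            archive.add(path, arcname=os.path.basename(path))
    return archive_path


# Demonstration with two tiny invented read files in a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    read_files = []
    for name in ("timepoint1.fna", "timepoint2.fna"):
        path = os.path.join(workdir, name)
        with open(path, "w") as handle:
            handle.write(">read1\nACGTACGT\n")
        read_files.append(path)

    archive_path = bundle_reads(read_files, os.path.join(workdir, "project.tar.gz"))
    with tarfile.open(archive_path, "r:gz") as archive:
        names = sorted(archive.getnames())

print(names)
```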
When will I get my results?
- RAST results will typically take no longer than a day or two. You will receive an automated email upon completion or if there are any problems that require your intervention.
- MG-RAST results will typically take a week or two. Progress is dependent on the size of your data set and the number and size of jobs in the queue. You will receive an automated email upon completion or if there are any problems that require your intervention.
What can I do with my results?
Please see the tutorials on how to view and analyze annotation results in RAST.
Where can I find more answers?
More specific questions and answers are listed in the RAST FAQ and the MG-RAST FAQ.