Using the MG-RAST Metagenome Annotation Server

MG-RAST is designed to rapidly call and annotate the genes from a large set of short DNA sequence reads (PMID: 18803844). MG-RAST is based on a modified version of the RAST (Rapid Annotations based on Subsystem Technology) server, which was designed specifically for complete genome sequences. MG-RAST produces automated functional assignments of sequences in the metagenome by comparison with both protein and nucleotide databases. Phylogenetic and functional summaries of the metagenomes are generated, and tools for comparative metagenomics are available. MG-RAST returns an analysis of the genes and subsystems in your data set, as supported by comparative and other forms of evidence.

Users may browse the analysis of public data sets without registering or logging in. To use the server to process your own metagenome data set, please register for a free account so that access to your data stays under your control. You may elect to share your private data with others of your choosing, or you may elect to make your data public. Private data will be kept on the server for a minimum of 120 days after completion of processing. Processing time depends largely on the number of jobs in the queue and the size of the data set.

The tour of the MG-RAST server will follow the workflow listed below. For short answers to specific questions, see the MG-RAST FAQ.

Upload and manage your job

Sequence format and upload steps

Log in and select "Upload New Job" from the arrow icon on the home page.

  • Step 1: Browse for the file of nucleotide sequences, which must be a single, plain-text file in FASTA format. Do NOT upload a Microsoft Word document: save the sequences as a plain text (*.txt) file with the default Windows text encoding, and do NOT insert line breaks or allow character substitution.
    • All reads for one metagenome data set should be together in one file.
    • If your data file is larger than 30 MB, please use zip, gzip, or tar to compress it. A compressed file name should end in .tgz; an uncompressed FASTA file name should end in .fa, .fasta, .fas, .fsa or .fna.
    • If your project has resulted in more than one distinct metagenome data set, include all sequence files in one compressed archive file.
    • Optionally, you can include a quality file along with the sequence file. To do this, compress both files into a single archive and upload the archive. In this case the sequence file name should end in .fa, .fasta, .fas, .fsa or .fna; the quality file name should end in .qual; and the archive name should end in .tgz. Files encoded as html, pdf, rtf, doc, docx, embl, gff3, or gtf will be rejected.
    • Click the button to "Upload and go to step 2."
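
The file requirements in Step 1 can be checked locally before uploading. The following is a minimal sketch, not part of MG-RAST: the function names `looks_like_fasta` and `package_for_upload` are hypothetical, and the check assumes that a valid upload is ASCII text whose first non-blank line is a ">" header. It uses only the Python standard library to verify the file name endings listed above and to bundle a sequence file, with an optional quality file, into a .tgz archive.

```python
import tarfile
from pathlib import Path

# File name endings taken from the upload instructions above.
FASTA_SUFFIXES = {".fa", ".fasta", ".fas", ".fsa", ".fna"}


def looks_like_fasta(path):
    """Rough plain-text FASTA check (an assumption, not the server's test).

    Returns True only if the file decodes as ASCII, its first non-blank
    line is a '>' header, and at least one sequence line follows a header.
    """
    saw_header = False
    saw_sequence = False
    try:
        with open(path, "r", encoding="ascii") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                if line.startswith(">"):
                    saw_header = True
                elif not saw_header:
                    return False  # sequence text before any header
                else:
                    saw_sequence = True
    except UnicodeDecodeError:
        return False  # binary or non-ASCII content, e.g. a word-processor file
    return saw_header and saw_sequence


def package_for_upload(fasta_path, archive_path, qual_path=None):
    """Write a gzip-compressed tar archive (.tgz) holding the given files."""
    fasta_path = Path(fasta_path)
    archive_path = Path(archive_path)
    if fasta_path.suffix.lower() not in FASTA_SUFFIXES:
        raise ValueError(f"unexpected sequence file ending: {fasta_path.name}")
    if not looks_like_fasta(fasta_path):
        raise ValueError(f"{fasta_path.name} does not look like plain-text FASTA")
    if archive_path.suffix.lower() != ".tgz":
        raise ValueError("archive name should end in .tgz")

    members = [fasta_path]
    if qual_path is not None:
        qual_path = Path(qual_path)
        if qual_path.suffix.lower() != ".qual":
            raise ValueError("quality file name should end in .qual")
        members.append(qual_path)

    with tarfile.open(archive_path, "w:gz") as archive:
        for member in members:
            # Store files flat, without their directory path.
            archive.add(member, arcname=member.name)
    return archive_path
```

For example, `package_for_upload("reads.fna", "job.tgz", qual_path="reads.qual")` would produce an archive containing both files, where the file names are placeholders for your own data.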


  • Step 2:
    • Enter a project name, and for each sequence file uploaded, enter a name for the metagenome set and a short description of it. Descriptive terms may be selected from standard ontologies.
    • A separate pair of boxes will be provided for naming and describing each sequence file detected during the upload.
    • Click the "Upload Summary" tab to see what the server detected during upload.
    • Click the button to "Use this data and go on to step 3."


  • Step 3: Provide information about the sequence data and select options.
    • This step will soon be converted to a form for submitting a description that complies with the Minimum Information about a Genome Sequence (MIGS) specification. At present, MIGS information is required only if you choose to make your data public.
    • Options include removal of exactly duplicate sequences and the choice to make your data public. Data is kept private by default.
    • Look at the information in the "Upload summary" tab to confirm that the system detected the sequence data you intended to upload.
    • Click the button to "Finish Upload."
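
To make the "removal of exactly duplicate sequences" option in Step 3 concrete, here is a minimal sketch of what exact-duplicate removal could look like. It is not the server's implementation. The helper names `read_fasta` and `drop_exact_duplicates` are hypothetical, and two choices are assumptions of this sketch: sequences are compared case-insensitively, and reverse complements are not treated as duplicates.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_exact_duplicates(records):
    """Keep the first record for each distinct sequence, preserving order.

    Assumption: comparison ignores letter case; nothing else is normalized.
    """
    seen = set()
    kept = []
    for header, sequence in records:
        key = sequence.upper()
        if key not in seen:
            seen.add(key)
            kept.append((header, sequence))
    return kept
```

Called as `drop_exact_duplicates(read_fasta(open("reads.fna")))`, with a placeholder file name, this returns the records that remain after later copies of an identical sequence are dropped.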

Manage your job

From the upload summary, select the "status page" link to track the progress of the job. If you have logged out, click the puzzle pieces icon to "Manage uploaded data" after logging back in. The time required to process your job depends on the number and size of other jobs in the queue as well as on the size of your job.

  • Track progress: All of your jobs are displayed in a table with active headers.
    • An overview of the progress is shown in the table as a series of colored boxes. Select the link to view details of one job.


  • Access completed job: There are two points to access your private data.
    • From the MG-RAST home page, your completed jobs will be displayed in a list behind the public data in the green tab labeled "Private Metagenomes." This tab is not displayed if you are not logged in.
    • From the job details page (accessed via the jobs table when you choose to manage your uploaded data), you may view or download the annotated data.
    • The link to view the annotated data is available only upon completion of processing. How to navigate the genome viewer is discussed below.
    • Download format is GenBank.


  • Share your annotated genome with one or more other users.
    • You can share this job with others by clicking the link and adding the email addresses (one at a time) of registered users to whom you would like to grant access to your otherwise private data.
    • If you would like to share with many people, e.g. a class, request a new group by emailing mg-rast @ mcs.anl.gov. Group memberships may be viewed from the account management page, which is accessed by clicking on the pair of people at the far right of the green menu bar. This is also where you can change your password, if needed.


  • Delete job: First you must click on the "view details" link in the jobs table. Then, the green menu bar in the header of the page will provide an option labeled with your job number. The only action to choose from this menu is "Delete this job." An intermediate screen will appear to confirm whether you are sure you want to delete. Click the button to do so.

View your annotated metagenome results

Metagenome Overview page

  • One way to access your private data is to start from the jobs table that is accessed by clicking the puzzle icon. Click on "view details" and then "Browse annotated metagenome in SEED Viewer" (illustrated above). Alternatively, a list of your completed jobs will be available in the green tab "Private Metagenomes" upon logging in. Click that tab to bring it to the front and select one from the list of your completed jobs. For the purpose of this tutorial, please select the Obese Mouse metagenome from the public data.

  • The Metagenome overview page opens with a table that shows your input description and lists how many sequences along with maximum, minimum, and average sequence lengths.
  • There is a paragraph of automatically generated text that describes your data set and provides some important statistics such as the number and percentage of reads that could be matched to protein or RNA sequences in various databases.
  • A broad phylogenetic overview table lists how many sequences were classified as Archaea, Bacteria, Eukaryota, or Other on the basis of protein and rRNA sequence analysis.
  • The lengths and %G+C content of the sequences in your metagenome are displayed in histograms.
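
The statistics reported on the overview page (sequence count; maximum, minimum, and average length; %G+C) are simple to reproduce for your own reads. The sketch below is an illustration only: the function names are hypothetical, the 10% bin width is arbitrary, and computing %G+C over unambiguous bases (A, C, G, T) only is an assumption of this sketch, since the page does not say how ambiguous bases are handled.

```python
from collections import Counter


def length_summary(sequences):
    """Return count, minimum, maximum, and mean length of the sequences."""
    lengths = [len(s) for s in sequences]
    if not lengths:
        raise ValueError("no sequences given")
    return {
        "count": len(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": sum(lengths) / len(lengths),
    }


def gc_percent(sequence):
    """Percent G+C among unambiguous bases (assumption: N etc. are ignored)."""
    sequence = sequence.upper()
    unambiguous = sum(sequence.count(base) for base in "ACGT")
    if unambiguous == 0:
        return 0.0
    gc = sequence.count("G") + sequence.count("C")
    return 100.0 * gc / unambiguous


def gc_histogram(sequences, bin_width=10):
    """Count sequences per %G+C bin; each key is the lower edge of a bin.

    A sequence with exactly 100% G+C falls in its own bin keyed 100.
    """
    bins = Counter()
    for sequence in sequences:
        lower_edge = int(gc_percent(sequence) // bin_width) * bin_width
        bins[lower_edge] += 1
    return bins
```

The same binning approach applies to a length histogram by replacing `gc_percent(sequence)` with `len(sequence)` and choosing a suitable bin width.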

Sequence Profile - metabolic (what are they doing?)

  • A metabolic reconstruction of the sequence data is available from the "Sequence Profile" item in the "Metagenome" menu, and also from the "Metabolic Analysis" tab of the hint box on the overview page. Both phylogenetic (see below) and metabolic profiles are available, with the metabolic profile shown first by default.

  • The metabolic profile sorts reads into subsystems. Subsystems are sorted into a 3-level hierarchy. Results are shown in expandable pie charts as well as a table with active headers. Click on any category link below the first pie chart to expand it to a second chart displaying the proportion and number of sequences that match proteins in curated subsystems. The same category labels are shown as menus in the header of the table for use as a filter.

  • Details about individual sequence reads are accessed by clicking any link in the table. Details include the alignment length, the FIG ID of the hit in the database, and the region of alignment in both the hit and query sequence. The table may be sorted by alignment length, query ID and hit ID. Click on the query ID to open a page showing its nucleotide sequence and alignments. Click on the hit ID to open that protein's annotation page in NMPDR. Using the check boxes you may select reads that match the same hit or the same functional role and align them by clicking a button below the table.

  • Parameters of the analysis are changeable; users can change e-value, p-value, percent identity, and minimum alignment length. This will allow you to refine the analysis to suit the sequence characteristics of your sample. The default settings are the most permissive. For your changes to take effect, you must click the button to re-compute the results.
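
The way the cutoffs interact with the three-level subsystem hierarchy can be sketched in a few lines. This is an illustration of the idea, not MG-RAST code: the `Hit` record, the function names, the default values, and the category labels in any example data are all hypothetical. The p-value parameter is left out because the page does not define it.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One read-to-database match (hypothetical record layout)."""
    query_id: str
    hit_id: str
    evalue: float
    percent_identity: float
    alignment_length: int
    subsystem: tuple  # (level 1, level 2, level 3) of the hierarchy


def filter_hits(hits, max_evalue=10.0, min_identity=0.0, min_length=0):
    """Keep hits that pass every cutoff.

    The defaults here are permissive placeholders, not the server's values.
    """
    return [
        h for h in hits
        if h.evalue <= max_evalue
        and h.percent_identity >= min_identity
        and h.alignment_length >= min_length
    ]


def subsystem_profile(hits, depth=1):
    """Count hits per subsystem category at the given hierarchy depth (1-3)."""
    if depth not in (1, 2, 3):
        raise ValueError("depth must be 1, 2, or 3")
    return Counter(h.subsystem[:depth] for h in hits)
```

Raising `min_length` or lowering `max_evalue` shrinks the list returned by `filter_hits`, and the counts from `subsystem_profile` change accordingly, which mirrors what happens when you re-compute the results on the server with stricter settings.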

Sequence Profile - phylogenetic (who's there?)

  • A phylogenetic profile of the sequence data is available from the "Sequence Profile" item in the "Metagenome" menu, and also from the "Phylogenetic Analysis" tab of the hint box on the overview page. The metabolic profile is shown first by default, so you will have to select the phylogenetic profile with a radio button. Parameters of the analysis such as the comparison database, e-value, p-value, percent identity, and minimum alignment length should be set by the user. This will allow you to refine the analysis to suit the sequence characteristics of your sample. The default settings are the most permissive. We recommend you use a minimum alignment length of 50 bp with all RNA databases. For your changes to take effect, you must click the button to re-compute the results.

  • The phylogenetic profile sorts reads into taxonomic groups. Results are shown in expandable pie charts as well as a table with active headers. Click on any category link below the first (Domain) pie chart to expand it to a second chart displaying the proportion and number of sequences classified into Phyla based on alignments with sequences in the selected reference database (Phyla expand to Orders, and so on through Genus). Taxonomic category labels are shown as menus in the header of the table for use as a filter.
  • A recruitment table is displayed by clicking any link in the tabular view of the sequence profile. The recruitment table shows the alignment length and taxonomic assignment of the hit in the reference database. The table may be sorted by alignment length, query ID and hit ID. Click on the query ID to open a page showing its sequence and alignments. Using the check boxes you may select reads that match the same taxonomic assignment and align them by clicking a button below the table.
  • It is important for the success of your phylogenetic analysis that you thoughtfully re-set parameters of the analysis. One of the major conclusions of the published analysis of the obese mouse metagenome is that it is enriched in Firmicutes relative to Bacteroidetes, while the opposite is true for the metagenome of the lean mouse. The phylogenetic profile of the obese mouse metagenome with default settings results in 57% Bacteroidetes and 14% Firmicutes. Increasing the stringency of analysis by decreasing the maximum e-value by only one order of magnitude inverts those proportions to the expected 13% Bacteroidetes and 58% Firmicutes. At the same time, it decreases the number of classified sequences from 387 to 52. The majority of sequences excluded by increasing the stringency have an alignment length of only 20 bp.
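
The sensitivity of a phylogenetic profile to the e-value cutoff can be illustrated with a toy calculation. The sketch below is hypothetical throughout: the `TaxonHit` record and `taxon_profile` function are invented for illustration, and the hits listed in the usage note are made-up values, not data from the obese mouse metagenome. It shows only the mechanism: when many short, weak alignments favor one phylum, removing them can invert the proportions.

```python
from collections import Counter, namedtuple

# Hypothetical record: taxonomic assignment, e-value, alignment length (bp).
TaxonHit = namedtuple("TaxonHit", ["phylum", "evalue", "alignment_length"])


def taxon_profile(hits, max_evalue, min_length=0):
    """Return {phylum: (count, percent)} for hits that pass both cutoffs.

    Returns an empty dict when no hit passes.
    """
    kept = [
        h for h in hits
        if h.evalue <= max_evalue and h.alignment_length >= min_length
    ]
    counts = Counter(h.phylum for h in kept)
    total = sum(counts.values())
    return {
        phylum: (n, 100.0 * n / total)
        for phylum, n in counts.items()
    }
```

With six invented hits, four assigned to Bacteroidetes (three of them with an e-value of 1e-2 and alignment lengths near 20 bp) and two assigned to Firmicutes with e-values below 1e-5, `taxon_profile(hits, max_evalue=1e-2)` is dominated by Bacteroidetes, while `taxon_profile(hits, max_evalue=1e-3)` is dominated by Firmicutes because the three weak hits are dropped. Setting `min_length=50`, the minimum recommended above for RNA databases, removes the same short alignments.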

-- Leslie Mc Neil - 19 Dec 2008

Topic revision: r14 - 23 Jun 2009 - 18:07:18 - Leslie Mc Neil
 