Frequently Asked Questions About RAST

What is RAST?

What does RAST mean?

  • RAST stands for "Rapid Annotation using Subsystem Technology."

What is RAST for?

  • RAST is designed to rapidly call and annotate the genes of a complete or essentially complete prokaryotic genome using a "Highest Confidence First" assignment propagation strategy, and to return an analysis of the genes and subsystems that comparative and other evidence supports as being present in the genome.

I/O (input/output)

What input formats does RAST accept?

  • RAST accepts sequence data in multi-sequence FASTA format (.fna) and multi-accession GenBank format (.gbk), uploaded as plain text files containing no special characters. One file should contain the sequences of all contigs and/or replicons. RAST does not yet accept other upload formats such as EMBL, GFF3, or GTF (although it can generate output in these formats), and it will reject any file that is not plain text, e.g., genomes encoded as HTML, PDF, RTF, or Microsoft Word.

Will RAST assemble my reads into contigs for me?

  • No. You will need to assemble your reads into contigs yourself, using some other tool.

What if I just want my genome re-annotated, not re-called?

  • If you upload your data in GenBank format and select the "Keep existing gene calls" option, RAST will re-annotate the functions of the existing calls and perform a subsystem analysis, without re-calling the genes of your genome. RAST cannot keep existing gene calls for uploaded FASTA contig data, because the FASTA format cannot specify gene locations.

What is frameshift correction?

  • Frameshift correction takes a gene that appears to be fragmented and constructs a "conceptual translation" by attempting to join the fragments, using single insertions and deletions to remove the apparent frameshift errors. It sometimes makes mistakes. However, fragmented genes are bad for subsystem analysis, so the FIG philosophy is that it is better to risk a few "false positive" repairs than to accept a large number of frameshift errors. Hence, frameshift repair is the default for SEED inclusion.

When uploading sequences, the option to fix frameshifts is the only one that is NOT automatically selected. Why not?

  1. It is still somewhat experimental.
  2. NCBI does not like it, and some users may not like it either.

When should I elect to fix frameshifts?

  • NCBI will reject our repair attempts, so frameshift repair should not be selected for genomes that will go to NCBI.

What do you mean by an "essentially complete" genome?

  • We consider a genome to be "essentially complete" at about 99% coverage, because beyond that point, the expected number of missing or truncated genes due to sequencing gaps has become less than or comparable to the expected number of "false negatives" from the gene-caller. Therefore, from a Subsystem Analysis standpoint, once a genome assembly passes roughly 99% completeness, the sequencing project has passed the point of diminishing returns.

What does "essentially complete" correspond to in terms of sequencing redundancy?

  • Experience suggests that a genome project is "essentially complete" once it reaches at least 5x coverage using the Sanger method, or at least 10x coverage using 454 pyrosequencing.
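
As a rough cross-check only (this is not part of RAST, and it is not how the figures above were obtained), the idealized Lander-Waterman model estimates the fraction of a genome touched by at least one read as 1 - exp(-c) at c-fold coverage. The function name below is hypothetical, and the model assumes uniformly random reads, which real sequencing platforms only approximate.

```python
import math


def expected_fraction_covered(coverage):
    """Idealized Lander-Waterman estimate of the fraction of bases
    covered by at least one read at the given fold coverage.

    Assumes reads land uniformly at random, which is an idealization.
    """
    if coverage < 0:
        raise ValueError("coverage must be non-negative")
    return 1.0 - math.exp(-coverage)
```

Under this model, 5x coverage gives an expected covered fraction slightly above 0.99, which is in line with the "about 99%" rule of thumb; the higher redundancy quoted for 454 data is an empirical observation that the model does not explain.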

What does "essentially complete" correspond to in terms of contig length?

  • Experience suggests that a genome is "essentially complete" once at least 70% of the assembled sequence data for it are in contigs longer than 20 kbp.
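
The contig-length rule of thumb above can be checked before uploading. The sketch below is only an illustration, not RAST code; the function names are hypothetical, and the 70% and 20 kbp defaults are taken from the answer above.

```python
def fraction_in_long_contigs(contig_lengths, min_length=20_000):
    """Fraction of all assembled bases that sit in contigs longer than min_length."""
    total = sum(contig_lengths)
    if total == 0:
        return 0.0
    long_total = sum(n for n in contig_lengths if n > min_length)
    return long_total / total


def looks_essentially_complete(contig_lengths, min_fraction=0.70, min_length=20_000):
    """Apply the rule of thumb: at least min_fraction of bases in long contigs."""
    return fraction_in_long_contigs(contig_lengths, min_length) >= min_fraction
```

For example, an assembly with contigs of 50, 30, 10, and 10 kbp has 80% of its bases in contigs longer than 20 kbp and would pass this check.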

What is the poorest quality of data that RAST can handle?

  • We recommend that your median contig length be at least 2 kbp; if your assembled sequence data are of poorer quality than this, most of the contigs will not contain even one complete gene, and RAST will most likely abort with errors. We also recommend that your sequence data contain at most 1% ambiguity characters. It is possible that the metagenomic version of RAST, MG-RAST, may be able to do something with extremely low-quality assemblies, since it is designed to handle large amounts of low-quality 454 data containing many frameshift errors. However, "Your Mileage May Vary," since MG-RAST is designed to analyze metagenomes rather than single genomes, and lacks many of the features of RAST.
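
Both recommendations can be measured on an assembly before submission. The following is a minimal sketch, not RAST code: the function names are hypothetical, and it assumes that an "ambiguity character" means any character other than A, C, G, T, or U.

```python
from statistics import median

# Assumption: everything outside this set counts as an ambiguity character.
UNAMBIGUOUS = set("ACGTU")


def quality_report(contigs):
    """contigs: dict mapping contig ID -> sequence string."""
    lengths = [len(seq) for seq in contigs.values()]
    total = sum(lengths)
    ambiguous = sum(
        1 for seq in contigs.values() for ch in seq.upper() if ch not in UNAMBIGUOUS
    )
    return {
        "median_length": median(lengths) if lengths else 0,
        "ambiguity_fraction": ambiguous / total if total else 0.0,
    }


def meets_recommendations(contigs, min_median=2_000, max_ambiguity=0.01):
    """True if the median contig length and ambiguity fraction are both within the recommended limits."""
    report = quality_report(contigs)
    return (
        report["median_length"] >= min_median
        and report["ambiguity_fraction"] <= max_ambiguity
    )
```

Passing both checks does not guarantee that a job will succeed; failing either one suggests the assembly is below the quality described above.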

What if I don't have (or am not interested in) the entire genome?

  • Again, RAST is designed for and performs best on complete or near-complete genomes. Conversely, RAST's performance degrades substantially when presented with only a small fragment of a genome. We therefore strongly recommend that you upload as much of your genome as you can get your hands on, even if you are only interested in a few genes in a small region, and that at a bare minimum you upload at least 100 kbp of contig data. The probability that RAST will abort with errors increases rapidly below the 100 kbp threshold, and is well in excess of 50% below 40 kbp.
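
The size guidance above amounts to two thresholds on the total amount of contig data. The sketch below simply encodes them; the function name and the returned labels are hypothetical and are not messages produced by RAST.

```python
def classify_submission_size(total_bp, recommended_bp=100_000, high_risk_bp=40_000):
    """Label a submission by total contig length, using the thresholds quoted above."""
    if total_bp >= recommended_bp:
        return "ok"
    if total_bp >= high_risk_bp:
        return "risky"  # below the recommended 100 kbp minimum
    return "very risky"  # below 40 kbp, where failure is more likely than not
```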

What about just a plasmid?

  • RAST is designed to handle complete or near-complete genomes, not plasmids or small fragments. We recommend that you upload the entire genome, even if you intend to ignore everything except your plasmid. We are developing a modification of the RAST algorithm that may perform adequately on plasmids, but it is not yet ready for production release.

What about Eukaryotes?

  • RAST does not currently handle eukaryotes, period, not even small ones! Currently, the RAST upload page requires that you specify whether your genome is a bacterium or an archaeon. If you try to submit a eukaryote by misleading RAST about what domain it is from, RAST will most likely abort with errors. We are working on a proposal to apply the RAST methodology to annotate eukaryotic genomes whose CDSs have already been called, but so far no code has been written.

What about eukaryotic organelles, such as mitochondria or chloroplasts?

  • Again, RAST does not currently handle eukaryotes, period, not even eukaryotic organelles!

What about viral and phage genomes?

  • Once again, RAST is designed for prokaryotic genomes, not viruses or phages. While a viral variant of the RAST Strategy is probably feasible, it will require a major development effort, since viral phylogeny is poorly understood, and viral genes are not yet well-catalogued by the SEED's Subsystems and FIGfams.

What about ESTs?

  • RAST is not designed to analyze ESTs, and will most likely abort with errors. You can try submitting EST data to the metagenomic version of RAST, but again, it is not really designed for them, so YMMV.

What about protein sequences?

  • RAST is designed to annotate complete or nearly-complete genomes, not protein sequences.

What about metagenomes?

  • Use MG-RAST for metagenomes; it is designed specifically to analyze the sort of massive, low-quality datasets typically generated by metagenomics projects.

How do I look at the RAST analysis of my genome?

  • You can browse your results and graphically compare them to other genomes (see the RAST tutorial), or download the analysis of your genome in GenBank, EMBL, GFF3, or GTF formats; you can also download the results (including precomputed similarities) as a SEED genome directory tarfile.

The RAST Strategy

How does RAST work?

  • RAST applies FIG's "Subsystem Approach" for High-Throughput Comparative Analysis to rapidly call and annotate the genes of an essentially complete genome using a "Highest Reliability First" strategy based on FIG's collection of manually curated Subsystems and subsystem-derived Protein Families. RAST's subsystem-based approach automatically ensures a high degree of annotation consistency, and delivers its data in a format designed to support high-throughput genome annotation projects. The RAST strategy proceeds as follows:
    • Find RNAs. (Currently, rRNAs and tRNAs; eventually, RFAM RNAs and other small RNAs.)
    • Find gene candidates for "Special Proteins." (Currently, selenoproteins and pyrrolysoproteins; eventually, candidates for genes with programmed frameshifts, etc.)
    • Find gene candidates for membership in the "Universal" Protein Families (tRNA Synthetases, etc.); as a side-effect, estimate phylogenetic neighborhood of genome.
    • Find gene candidates for membership in the FIGfams already seen in the set of neighboring genomes.
    • Find gene candidates for membership in FIGfams other than those found in the neighboring genomes.
    • Repair gene candidates showing evidence of frameshift errors.
    • Find gene candidates showing similarity to genes in neighboring genomes that are not in FIGfams.
    • Promote any remaining gene candidates to genes.
    • Examine suspiciously long gaps for possible "missing" genes previously found in neighboring genomes (AKA "Backfill Gaps").
  • At each stage, the genes so far found during all previous stages become the "training set" that will be used to find the candidates to be examined during the current stage. Gene candidates selected during each stage are only retained if they do not too severely overlap a gene called during a previous stage.
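
The overlap rule in the last point can be illustrated with a toy model. This is only a sketch of the general idea and not RAST's implementation: it ignores strands, scoring, and the training-set step entirely, the function names are hypothetical, and the 30 bp overlap tolerance is an arbitrary assumption, since the text does not say how much overlap counts as "too severe."

```python
def overlap(a, b):
    """Number of positions shared by two inclusive (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def call_genes_in_stages(stages, max_overlap=30):
    """stages: list of candidate lists, ordered from most to least reliable.

    A candidate is kept only if it does not overlap any gene accepted
    in an earlier stage by more than max_overlap positions.
    """
    accepted = []
    for candidates in stages:
        previous = list(accepted)  # genes called during earlier stages only
        kept = [
            cand
            for cand in candidates
            if all(overlap(cand, gene) <= max_overlap for gene in previous)
        ]
        accepted.extend(kept)
    return accepted
```

Because earlier stages are the most reliable, a later candidate never displaces an earlier call; it can only fill space that earlier stages left open.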

What is a "Subsystem"?

  • A "subsystem" is a set of abstract functional roles that an annotator has decided should be thought of as related in some way. A "subsystem" can represent the collection of functional roles that make up a metabolic pathway, a multi-subunit complex (e.g., the ribosome), a specific class of proteins (e.g., signal transduction), or any other set of functional roles that the annotator believes are related in some biologically meaningful way. The "subsystem approach" to high-throughput genome annotation allows an annotator who is an expert on a specific pathway or class of genes to concentrate on their specific domain of expertise as they annotate all detected instances and observed variants of that pathway or class of roles simultaneously, in the entire set of installed SEED genomes, in an automatically consistent fashion.

Common Problems using RAST

Why does RAST complain that it can't find the "phylogenetic neighborhood" of my submission?

  • Usually, this is because the sequence data submitted are too small, e.g., because you have only submitted a plasmid or small fragment of a genome. RAST is designed for complete or near-complete genomes, and it estimates the phylogenetic neighborhood of a submission and its initial training set for gene calls by first looking for members of the "Universal" protein families, or, failing that, members of other large, highly conserved protein families. Experience suggests that RAST needs at least 40 kbp of sequence data to find enough highly conserved genes to reliably place a submission's phylogenetic neighborhood and develop an initial training set. For submissions smaller than 40 kbp, RAST's performance degrades rapidly; we therefore strongly recommend that RAST submissions should be at least 100 kbp for safety.

RAST is complaining about "Duplicate contig IDs," but all my contig IDs appear unique to me. What's going on?

  • Usually, this is because your contig IDs contain "whitespace" characters. The FASTA standard specifies that the header must not contain "whitespace" between the ">" symbol and the contig ID, and that everything after the first "whitespace" character is a "comment" and not part of the identifier. Thus, the first FASTA header below is invalid (no ID, just a comment), while the following two will be interpreted as a pair of "duplicate IDs" that are both named "B.":
    > E. coli main chromosome
    >B. subtilis main chromosome
    >B. subtilis plasmid
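
The header rule just described is easy to test locally before uploading. The sketch below is an illustration rather than RAST's own parser, and its function names are hypothetical; it takes the ID to be everything between ">" and the first whitespace character.

```python
from collections import Counter


def contig_id(header_line):
    """Return the ID from a FASTA header line, or "" if the ID is missing."""
    body = header_line[1:]
    if not body or body[0].isspace():
        return ""  # whitespace right after ">" means there is no ID
    return body.split()[0]


def check_fasta_ids(lines):
    """Return (number of headers with no ID, sorted list of duplicated IDs)."""
    ids = [contig_id(line) for line in lines if line.startswith(">")]
    counts = Counter(ids)
    empty = counts.pop("", 0)
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    return empty, duplicates
```

Run on the three example headers above, this reports one header with no ID and one duplicated ID, "B.". Replacing the spaces in each header with underscores would make all three IDs unique.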

Why does RAST complain about "invalid characters" in my input file?

  • Most likely one of two reasons:
    1. Your contig sequences contain characters other than the standard IUPAC ambiguity characters [ACGTUMRWSYKBDHVN] or the "vector masking" character "X."
    2. Your contig file uses nonstandard line terminators, is missing line terminators before or after a record header, or is otherwise malformed in some way.
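
To locate the offending characters, a file can be scanned against the allowed alphabet listed above. This is a minimal sketch with a hypothetical function name, not RAST's validator; it assumes that lowercase letters are acceptable and that lines end with a bare newline, so stray carriage returns from other line-terminator conventions are reported as invalid characters.

```python
# IUPAC nucleotide codes from the list above, plus the vector-masking "X".
ALLOWED = set("ACGTUMRWSYKBDHVNX")


def find_invalid_characters(fasta_text):
    """Return (line number, column, character) for each disallowed character
    found on a sequence line. Header lines are skipped."""
    problems = []
    for line_number, line in enumerate(fasta_text.split("\n"), start=1):
        if line.startswith(">"):
            continue
        for column, ch in enumerate(line, start=1):
            if ch.upper() not in ALLOWED:
                problems.append((line_number, column, ch))
    return problems
```

An empty result means every sequence character is in the allowed set; it does not rule out the other kinds of malformed input mentioned in the second point.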

I selected "Keep existing gene calls" and uploaded a GenBank file, but RAST failed with the cryptic error "Zero-size or non-existent FASTA file." What does this mean?

  • Most likely it means that your GenBank file contains either:
    • "Gene" entries but no "CDS" entries, or
    • "CDS" entries lacking a "/translation=" field.
  • Because the GenBank format tolerates many exception conditions that make unambiguous translation of a gene into protein sequence difficult (such as "fuzzy" gene boundaries, frameshifts at unspecified locations, and the notion of "conceptual translations"), and because the GenBank format does not clearly indicate when embedded STOP codons should be translated as selenocysteine or pyrrolysine, the SEED GenBank parser used by RAST expects genes to be specified by "CDS" entries complete with explicit "/translation=" fields. The unqualified "gene" entry, with only location and name information, but no sequence information, leaves RAST uncertain about how the gene should be translated into protein sequence.
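
Before choosing "Keep existing gene calls," it can help to count how many "CDS" entries in a GenBank file actually carry a "/translation=" field. The sketch below is a simple line-based scan with a hypothetical function name, not the SEED GenBank parser; it assumes the standard flat-file layout, in which feature keys start in column 6 and qualifiers are indented further.

```python
import re

# A feature key line: exactly five spaces, the key, then the location.
FEATURE_KEY = re.compile(r"^ {5}(\S+) +\S")


def summarize_cds(genbank_text):
    """Return (number of CDS features, number of CDS features lacking /translation=)."""
    cds = []  # one flag per CDS: True once a /translation= qualifier is seen
    in_features = False
    in_cds = False
    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            in_features, in_cds = True, False
            continue
        if in_features and line and not line.startswith(" "):
            # ORIGIN, "//", or another top-level keyword ends the feature table.
            in_features, in_cds = False, False
            continue
        if not in_features:
            continue
        match = FEATURE_KEY.match(line)
        if match:
            in_cds = match.group(1) == "CDS"
            if in_cds:
                cds.append(False)
        elif in_cds and line.strip().startswith("/translation="):
            cds[-1] = True
    return len(cds), cds.count(False)
```

A result of (0, 0) corresponds to the first case above (no "CDS" entries at all), and a nonzero second value corresponds to the second case.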

Contact Information

Who should I contact regarding questions about or problems using RAST?

  • All questions, comments, or problems regarding RAST should be directed to rast@mcs.anl.gov. Likewise, all questions, comments, or problems regarding MG-RAST should be directed to mg-rast@mcs.anl.gov.

-- Leslie Mc Neil - 08 Jan 2009

Topic revision: r4 - 02 Mar 2009 - 17:18:14 - Leslie Mc Neil
 