GenBank format (file.gbk)

GenBank format is a flat-file format for annotated sequence data, used here for complete bacterial genomes. By convention, GenBank format files have the extension .gbk.

Files in this format may be uploaded to the RAST server for genome annotation or re-annotation. Completed annotation jobs may be downloaded from RAST in this format as well.

GBK files are plain text files that are best viewed in a fixed-width font. They are easily parsed by computer programs because the fields containing different types of information are well labeled. The header of the file contains information describing the sequence, such as its type, shape (linear or circular), length, and source. Features of the genome sequence follow the header and include protein translations. The DNA sequence is the last element of the file, which must end with a double slash (//).
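
The three-part layout described above (header, features, sequence, then the closing double slash) can be illustrated with a minimal Python sketch. It is not the parser used by RAST or NCBI; the function name split_genbank_record and the toy record are made up for illustration, and the sketch assumes a single record whose feature table starts at a FEATURES line and whose sequence starts at an ORIGIN line.

```python
def split_genbank_record(text):
    """Split one GenBank record into header lines, feature lines, and sequence.

    Hypothetical helper: assumes a single record with FEATURES and ORIGIN lines.
    """
    header, features, seq_parts = [], [], []
    section = "header"
    terminated = False
    for line in text.splitlines():
        if line.startswith("//"):
            # The double slash closes the record.
            terminated = True
            break
        if line.startswith("FEATURES"):
            section = "features"
            continue
        if line.startswith("ORIGIN"):
            section = "sequence"
            continue
        if section == "header":
            header.append(line)
        elif section == "features":
            features.append(line)
        else:
            # Sequence lines: a position number, then blocks of bases.
            seq_parts.extend(line.split()[1:])
    if not terminated:
        raise ValueError("record does not end with '//'")
    return header, features, "".join(seq_parts)


# Toy record, invented for this example only.
record = """LOCUS       TOY                       20 bp    DNA     linear   BCT 01-JAN-2000
DEFINITION  Toy record.
FEATURES             Location/Qualifiers
     source          1..20
ORIGIN
        1 taagttatta tttagttaat
//
"""
header, features, sequence = split_genbank_record(record)
```

For real work, an established library is a safer choice than a hand-written splitter, since actual records contain multi-line qualifiers and other details this sketch ignores.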

A thorough description of all the fields and feature types included in a GenBank flat file is presented in a sample record at NCBI. Below is an abbreviated example of a complete bacterial genome sequence in GenBank format, showing the header information, the first and last features, and the first and last 180 or so nucleotides:

LOCUS       NC_000908             580076 bp    DNA     circular BCT 12-NOV-2008
DEFINITION  Mycoplasma genitalium G37, complete genome.
ACCESSION   NC_000908
VERSION     NC_000908.2  GI:108885074
KEYWORDS    .
SOURCE      Mycoplasma genitalium G37
  ORGANISM  Mycoplasma genitalium G37
            Bacteria; Tenericutes; Mollicutes; Mycoplasmataceae; Mycoplasma.
REFERENCE   1  (bases 1 to 580076)
  AUTHORS   Glass,J.I., Assad-Garcia,N., Alperovich,N., Yooseph,S., Lewis,M.R.,
            Maruf,M., Hutchison,C.A., Smith,H.O. and Venter,J.C.
  TITLE     Essential genes of a minimal bacterium
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 103 (2), 425-430 (2006)
   PUBMED   16407165
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence was derived from L43967.
            On Jun 13, 2006 this sequence version replaced gi:12044850.
            COMPLETENESS: full length.
FEATURES             Location/Qualifiers
     source          1..580076
                     /organism="Mycoplasma genitalium G37"
                     /mol_type="genomic DNA"
                     /strain="G37"
                     /db_xref="taxon:243273"
     gene            686..1828
                     /gene="dnaN"
                     /locus_tag="MG_001"
                     /db_xref="GeneID:875454"
     CDS             686..1828
                     /gene="dnaN"
                     /locus_tag="MG_001"
                     /EC_number="2.7.7.7"
                     /note="identified by sequence similarity; putative"
                     /codon_start=1
                     /transl_table=4
                     /product="DNA polymerase III, beta subunit"
                     /protein_id="NP_072661.2"
                     /db_xref="GI:108885075"
                     /db_xref="GeneID:875454"
                     /translation="MKILINKSELNKILKKMNNVIISNNKIKPHHSYFLIEAKEKEIN
                     FYANNEYFSVKCNLNKNIDILEQGSLIVKGKIFNDLINGIKEEIITIQEKDQTLLVKT
                     KKTSINLNTINVNEFPRIRFNEKNDLSEFNQFKINYSLLVKGIKKIFHSVSNNREISS
                     KFNGVNFNGSNGKEIFLEASDTYKLSVFEIKQETEPFDFILESNLLSFINSFNPEEDK
                     SIVFYYRKDNKDSFSTEMLISMDNFMISYTSVNEKFPEVNYFFEFEPETKIVVQKNEL
                     KDALQRIQTLAQNERTFLCDMQINSSELKIRAIVNNIGNSLEEISCLKFEGYKLNISF
                     NPSSLLDHIESFESNEINFDFQGNSKYFLITSKSEPELKQILVPSR"

                                        ...

     gene            complement(579224..580033)
                     /gene="soj"
                     /locus_tag="MG_470"
                     /db_xref="GeneID:875585"
     CDS             complement(579224..580033)
                     /gene="soj"
                     /locus_tag="MG_470"
                     /note="identified by sequence similarity; putative"
                     /codon_start=1
                     /transl_table=4
                     /product="CobQ/CobB/MinD/ParA nucleotide binding
                     domain-containing protein"
                     /protein_id="NP_073141.1"
                     /db_xref="GI:12045330"
                     /db_xref="GeneID:875585"
                     /translation="MIISFVNNKGGVLKTTMATNVAGSLVKLCPERRKVILDLDGQGN
                     VSASFGQNPERLNNTLIDILLKVPKFSGSNNFIEIDDCLLSVYEGLDILPCNFELNFA
                     DIDISRKKYKASDIAEIVKQLAKRYEFVLLDTPPNMATLVSTAMSLSDVIVIPFEPDQ
                     YSMLGLMRIVETIDTFKEKNTNLKTILVPTKVNVRTRLHNEVIDLAKTKAKKNNVAFS
                     KNFVSLTSKSSAAVGYEKLPISLVSSPSKKYLNEYLEITKEILNLANYNVH"
ORIGIN      
        1 taagttatta tttagttaat acttttaaca atattattaa ggtatttaaa aaatactatt
       61 atagtattta acatagttaa ataccttcct taatactgtt aaattatatt caatcaatac
      121 atatataata ttattaaaat acttgataag tattatttag atattagaca aatactaatt

                                        ...

   579901 cattcccctg cccgtcaaga tcaagaatga cttttcgcct ttctggacaa agtttaacca
   579961 atgatcctgc aacattagtt gccattgtag tttttaatac gccgccttta ttatttacaa
   580021 aagaaatgat catatattta aatgattata atatttcttt aatactaaaa aaatac
//
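
The LOCUS line at the top of the record above holds the sequence name, length, molecule type, shape, division code, and date. The Python sketch below pulls those fields apart. It assumes the line splits into exactly eight whitespace-separated tokens, as in the example; LOCUS lines in other records may be laid out differently, so treat this as an illustration only. The function name parse_locus_line and the dictionary keys are hypothetical.

```python
def parse_locus_line(line):
    """Parse a LOCUS line laid out like the one in the example record.

    Hypothetical helper: assumes eight whitespace-separated tokens.
    """
    fields = line.split()
    if len(fields) != 8 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError("unexpected LOCUS line layout: %r" % line)
    _, name, length, _, mol_type, shape, division, date = fields
    return {
        "name": name,
        "length": int(length),   # sequence length in base pairs
        "mol_type": mol_type,    # e.g. DNA
        "shape": shape,          # circular or linear
        "division": division,    # e.g. BCT
        "date": date,
    }


locus = parse_locus_line(
    "LOCUS       NC_000908             580076 bp    DNA     circular BCT 12-NOV-2008"
)
```

The parsed length can be checked against the number of bases read from the sequence section as a basic sanity test on a file.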

Complete genomes in this format are available at the NCBI FTP site.

Topic revision: r4 - 16 Jan 2009 - 15:07:35 - Bruce Parrello
 