GenBank format (file.gbk)
GenBank format is a flat-file format for annotated sequence data, used here for complete bacterial genomes. By convention, GenBank format files have the extension .gbk.
Files in this format may be uploaded to the RAST server for genome annotation or re-annotation. Completed annotation jobs may be downloaded from RAST in this format as well.
GBK files are plain text files that are best viewed in a fixed-width font. They are easily parsed by computer programs because the fields containing different types of information are clearly labeled. The header of the file contains information describing the sequence, such as its molecule type, topology (linear or circular), length, and source.
The features of the genome sequence follow the header and include protein translations. The DNA sequence is the last element of the record, which must end with a double slash (//) on a line of its own.
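The layout described above (header, features, sequence, closing double slash) can be sketched in a few lines of Python. This is a minimal illustration, not a full parser: the function names split_genbank_record and parse_locus and the toy record TOY_0001 are hypothetical, and the sketch assumes a single record whose LOCUS line carries all the usual fields.

```python
def split_genbank_record(text):
    """Split one GenBank record into header, feature, and sequence lines.

    Assumes a single record that uses the FEATURES and ORIGIN keywords.
    """
    lines = text.strip().splitlines()
    if not lines or lines[-1].strip() != "//":
        raise ValueError("record does not end with a double slash (//)")
    sections = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in lines[:-1]:  # the closing // is not part of any section
        if line.startswith("FEATURES"):
            current = "features"  # the keyword line itself is skipped
            continue
        if line.startswith("ORIGIN"):
            current = "sequence"
            continue
        sections[current].append(line)
    return sections


def parse_locus(line):
    """Read name, length, molecule type, and topology from a LOCUS line.

    Assumes whitespace-separated fields in the order:
    LOCUS, name, length, 'bp', molecule type, topology, division, date.
    """
    fields = line.split()
    return {
        "name": fields[1],
        "length": int(fields[2]),
        "mol_type": fields[4],
        "topology": fields[5],
    }


# Hypothetical toy record, for illustration only.
sample = """LOCUS       TOY_0001                  12 bp    DNA     circular BCT 01-JAN-2000
DEFINITION  Hypothetical toy record.
FEATURES             Location/Qualifiers
     source          1..12
ORIGIN
        1 taagttatta tt
//
"""

sections = split_genbank_record(sample)
locus = parse_locus(sections["header"][0])
# Drop position numbers and spaces, keeping only the sequence letters.
sequence = "".join(c for line in sections["sequence"] for c in line if c.isalpha())
print(locus)
print(sequence, len(sequence) == locus["length"])
```

Comparing the length declared on the LOCUS line with the number of bases actually read is a cheap way to detect a truncated file. For production work, an established parser is a safer choice than hand-written splitting.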
A thorough description of all the fields and feature types included in a GenBank flat file is presented in a sample record at NCBI. Below is an abbreviated example of a complete bacterial genome sequence in GenBank format, showing the header information, the first and last features, and the first and last 180 or so nucleotides:
LOCUS       NC_000908             580076 bp    DNA     circular BCT 12-NOV-2008
DEFINITION  Mycoplasma genitalium G37, complete genome.
ACCESSION   NC_000908
VERSION     NC_000908.2  GI:108885074
KEYWORDS    .
SOURCE      Mycoplasma genitalium G37
  ORGANISM  Mycoplasma genitalium G37
            Bacteria; Tenericutes; Mollicutes; Mycoplasmataceae; Mycoplasma.
REFERENCE   1  (bases 1 to 580076)
  AUTHORS   Glass,J.I., Assad-Garcia,N., Alperovich,N., Yooseph,S., Lewis,M.R.,
            Maruf,M., Hutchison,C.A., Smith,H.O. and Venter,J.C.
  TITLE     Essential genes of a minimal bacterium
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 103 (2), 425-430 (2006)
   PUBMED   16407165
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence was derived from L43967.
            On Jun 13, 2006 this sequence version replaced gi:12044850.
            COMPLETENESS: full length.
FEATURES             Location/Qualifiers
     source          1..580076
                     /organism="Mycoplasma genitalium G37"
                     /mol_type="genomic DNA"
                     /strain="G37"
                     /db_xref="taxon:243273"
     gene            686..1828
                     /gene="dnaN"
                     /locus_tag="MG_001"
                     /db_xref="GeneID:875454"
     CDS             686..1828
                     /gene="dnaN"
                     /locus_tag="MG_001"
                     /EC_number="2.7.7.7"
                     /note="identified by sequence similarity; putative"
                     /codon_start=1
                     /transl_table=4
                     /product="DNA polymerase III, beta subunit"
                     /protein_id="NP_072661.2"
                     /db_xref="GI:108885075"
                     /db_xref="GeneID:875454"
                     /translation="MKILINKSELNKILKKMNNVIISNNKIKPHHSYFLIEAKEKEIN
                     FYANNEYFSVKCNLNKNIDILEQGSLIVKGKIFNDLINGIKEEIITIQEKDQTLLVKT
                     KKTSINLNTINVNEFPRIRFNEKNDLSEFNQFKINYSLLVKGIKKIFHSVSNNREISS
                     KFNGVNFNGSNGKEIFLEASDTYKLSVFEIKQETEPFDFILESNLLSFINSFNPEEDK
                     SIVFYYRKDNKDSFSTEMLISMDNFMISYTSVNEKFPEVNYFFEFEPETKIVVQKNEL
                     KDALQRIQTLAQNERTFLCDMQINSSELKIRAIVNNIGNSLEEISCLKFEGYKLNISF
                     NPSSLLDHIESFESNEINFDFQGNSKYFLITSKSEPELKQILVPSR"
...
     gene            complement(579224..580033)
                     /gene="soj"
                     /locus_tag="MG_470"
                     /db_xref="GeneID:875585"
     CDS             complement(579224..580033)
                     /gene="soj"
                     /locus_tag="MG_470"
                     /note="identified by sequence similarity; putative"
                     /codon_start=1
                     /transl_table=4
                     /product="CobQ/CobB/MinD/ParA nucleotide binding
                     domain-containing protein"
                     /protein_id="NP_073141.1"
                     /db_xref="GI:12045330"
                     /db_xref="GeneID:875585"
                     /translation="MIISFVNNKGGVLKTTMATNVAGSLVKLCPERRKVILDLDGQGN
                     VSASFGQNPERLNNTLIDILLKVPKFSGSNNFIEIDDCLLSVYEGLDILPCNFELNFA
                     DIDISRKKYKASDIAEIVKQLAKRYEFVLLDTPPNMATLVSTAMSLSDVIVIPFEPDQ
                     YSMLGLMRIVETIDTFKEKNTNLKTILVPTKVNVRTRLHNEVIDLAKTKAKKNNVAFS
                     KNFVSLTSKSSAAVGYEKLPISLVSSPSKKYLNEYLEITKEILNLANYNVH"
ORIGIN
        1 taagttatta tttagttaat acttttaaca atattattaa ggtatttaaa aaatactatt
       61 atagtattta acatagttaa ataccttcct taatactgtt aaattatatt caatcaatac
      121 atatataata ttattaaaat acttgataag tattatttag atattagaca aatactaatt
...
   579901 cattcccctg cccgtcaaga tcaagaatga cttttcgcct ttctggacaa agtttaacca
   579961 atgatcctgc aacattagtt gccattgtag tttttaatac gccgccttta ttatttacaa
   580021 aagaaatgat catatattta aatgattata atatttcttt aatactaaaa aaatac
//
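The coordinates in the example record can be cross-checked with a little arithmetic. The Python sketch below is illustrative only: the function names parse_location, expected_protein_length, and origin_line_end are hypothetical, only the two simple location forms that appear in the example are handled, and the protein length calculation assumes /codon_start=1 and a CDS span that includes the stop codon.

```python
import re


def parse_location(location):
    """Return (start, end, strand) for a simple feature location.

    Only 'start..end' and 'complement(start..end)' are handled; join(...),
    partial (< or >), and other location forms are out of scope here.
    """
    strand = 1
    if location.startswith("complement(") and location.endswith(")"):
        strand = -1
        location = location[len("complement("):-1]
    match = re.fullmatch(r"(\d+)\.\.(\d+)", location)
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    return int(match.group(1)), int(match.group(2)), strand


def expected_protein_length(location):
    """Residue count implied by a CDS location.

    Assumes /codon_start=1 and that the span includes the stop codon.
    """
    start, end, _ = parse_location(location)
    nucleotides = end - start + 1  # coordinates are 1-based and inclusive
    if nucleotides % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    return nucleotides // 3 - 1  # the stop codon is not translated


def origin_line_end(line):
    """Position of the last base on one numbered ORIGIN line."""
    fields = line.split()
    first_position = int(fields[0])
    bases = "".join(fields[1:])
    return first_position + len(bases) - 1


# Locations and the final ORIGIN line, copied from the example record.
dnaN = parse_location("686..1828")
soj = parse_location("complement(579224..580033)")
dnaN_length = expected_protein_length("686..1828")
soj_length = expected_protein_length("complement(579224..580033)")
last_base = origin_line_end(
    "580021 aagaaatgat catatattta aatgattata atatttcttt aatactaaaa aaatac"
)
print(dnaN, dnaN_length)
print(soj, soj_length)
print(last_base)
```

Under these assumptions the two CDS features imply proteins of 380 and 269 residues, which agrees with the lengths of the /translation values in the example, and the final ORIGIN line ends at position 580076, the length given on the LOCUS line.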
Complete genomes in this format are available at the NCBI FTP site.