GenBank

GenBank (the Genetic Sequence Data Bank) is an open-access, annotated collection of all publicly available nucleotide sequences and their protein translations.^[1]^[2] The database is produced by the National Center for Biotechnology Information (NCBI).

Statistics

GenBank Flat File Release Statistics
Release	Date	Loci	Bases	Sequences¹	Size (GB)²	Size (GB)³
236.0	2020-02-15	n/a	399,376,854,872	216,214,215	n/a	1117
232.0	2019-06-15	213,383,758	329,835,282,370	213,383,758	n/a	1006
230.0	2019-02-15	212,260,377	303,709,510,632	212,260,377	n/a	n/a
225.0	2018-04-15	208,452,303	260,189,141,631	208,452,303	n/a	885
221.0	2017-08-15	203,180,606	240,343,378,258	203,180,606	n/a	841
213.0	2016-04-15	193,739,511	211,423,912,047	193,739,511	n/a	771
211.0	2015-12-15	189,232,925	203,939,111,071	189,232,925	n/a	749
209.0	2015-08-15	187,066,846	199,823,644,287	187,066,846	n/a	735
204.0	2014-10-15	178,322,253	181,563,676,918	178,322,253	n/a	680
203.0	2014-08-15	174,108,750	165,722,980,375	174,108,750	n/a	652
200.0	2014-02-15	171,123,749	157,943,793,171	171,123,749	n/a	625
198.0	2013-10-15	168,335,396	155,176,494,699	168,335,396	n/a	613
193.0	2012-12-15	161,140,325	148,390,863,904	161,140,325	579	624
192.0	2012-10-15	157,889,737	145,430,961,262	157,889,737	569	612
190.0	2012-06-15	154,130,210	141,343,240,755	154,130,210	553	595
189.0	2012-04-15	151,824,421	139,266,481,398	151,824,421	545	586
188.0	2012-02-15	149,819,246	137,384,889,783	149,819,246	539	580
187.0	2011-12-15	146,413,798	135,117,731,375	146,413,798	528	568
177.0	2010-04-15	119,112,251	114,348,888,771	119,112,251	439	471
174.0	2009-10-15	110,946,879	108,560,236,506	110,946,879	416	445
171.0	2009-04-15	103,335,421	102,980,268,709	103,335,421	395	422
170.0	2009-02-15	101,815,678	101,467,270,308	101,815,678	390	417
169.0	2008-12-15	98,868,465	99,116,431,942	98,868,465	381	407
168.0	2008-10-15	96,400,790	97,381,682,336	96,400,790	371	396
166.0	2008-06-15	88,554,578	92,008,611,867	88,554,578	343	366
164.0	2008-02-15	82,853,685	85,759,586,764	82,853,685	321	342
161.0	2007-08-15	76,146,236	79,525,559,650	76,146,236	299	319
158.0	2007-02-15	67,218,344	71,292,211,453	67,218,344	251	263

Source: gbrel.txt^[3]

¹ Reported sequences
² Uncompressed flatfiles, sequence files only
³ Uncompressed flatfiles, including the 'short directory', 'index', and the *.txt files
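
As a rough illustration of how the table above can be used, the following Python sketch parses three of its tab-separated rows and derives the mean number of bases per sequence and the overall growth in bases. The names (Release, to_int, parse_row) are hypothetical helpers introduced for this example, and the column layout is assumed to match the table header exactly.

```python
from typing import NamedTuple, Optional


class Release(NamedTuple):
    release: str
    date: str
    bases: int
    sequences: int
    size_gb: Optional[int]  # last column (footnote 3); None when "n/a"


# Three rows copied from the table above (tab-separated).
ROWS = """\
236.0\t2020-02-15\tn/a\t399,376,854,872\t216,214,215\tn/a\t1117
193.0\t2012-12-15\t161,140,325\t148,390,863,904\t161,140,325\t579\t624
158.0\t2007-02-15\t67,218,344\t71,292,211,453\t67,218,344\t251\t263
"""


def to_int(field: str) -> Optional[int]:
    """Convert '1,234' to 1234; return None for 'n/a'."""
    field = field.strip()
    return None if field == "n/a" else int(field.replace(",", ""))


def parse_row(line: str) -> Release:
    """Split one table row into its seven columns and keep the ones used here."""
    release, date, _loci, bases, sequences, _size_seq, size_all = line.split("\t")
    return Release(release, date, to_int(bases), to_int(sequences), to_int(size_all))


releases = [parse_row(line) for line in ROWS.splitlines()]
newest, oldest = releases[0], releases[-1]

# Growth factor in total bases between the oldest and newest listed release.
growth = newest.bases / oldest.bases

for r in releases:
    print(f"{r.release}: {r.bases / r.sequences:.0f} bases per sequence")
print(f"Bases grew {growth:.1f}-fold between {oldest.date} and {newest.date}")
```

The "n/a" cells are mapped to None rather than zero so that missing values cannot be mistaken for measurements in later arithmetic.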

Note: You can find the current release number by issuing either of the following commands:

$ curl -s ftp://ftp.ncbi.nih.gov/genbank/GB_Release_Number
$ lynx --dump ftp://ftp.ncbi.nih.gov/genbank/GB_Release_Number
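
The same check can be scripted. The sketch below uses Python's standard urllib, which can open ftp:// URLs, and assumes that the GB_Release_Number file contains little more than the release number itself. Both function names are hypothetical, and the sample file body in the last line is made up for an offline demonstration.

```python
import re
from urllib.request import urlopen

# URL taken from the commands above.
RELEASE_URL = "ftp://ftp.ncbi.nih.gov/genbank/GB_Release_Number"


def parse_release_number(text: str) -> str:
    """Return the first number (e.g. '236' or '236.0') found in text."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no release number found in {text!r}")
    return match.group(0)


def fetch_release_number(url: str = RELEASE_URL, timeout: float = 30.0) -> str:
    """Download the release-number file and extract the number (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        body = response.read().decode("ascii", errors="replace")
    return parse_release_number(body)


# Offline demonstration with an assumed file body; no network call is made here.
print(parse_release_number("236\n"))
```

Calling fetch_release_number() performs the actual download; the parsing step is kept separate so it can be tested without network access.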

Selected genomes

See: IDS for a list of files containing the IDs of completed genomes.

Prokaryotes

See: summary.txt for a daily updated list of completed bacterial genomes.
See: plasmids.ids for a list of completed plasmid genomes.

Eukaryotes

Note: The following are not part of the main NCBI GenBank database.

Fungi
- Saccharomyces cerevisiae (Baker's Yeast)
- Schizosaccharomyces pombe (Fission Yeast)
Plants
- Arabidopsis thaliana
Vertebrates
- Canis familiaris (Dog)
- Gallus gallus (Chicken)
- Homo sapiens (Human)
- Mus musculus (Mouse)
- Rattus norvegicus (Rat)
Invertebrates
- Apis mellifera (Honey bee)
- Caenorhabditis elegans (Nematode)
- Drosophila melanogaster (Fruit fly)
Other
- Encephalitozoon cuniculi (an intracellular parasite)

GenBank entries in the eukaryotic database

For details, refer to the NCBI genome FTP site at ftp://ftp.ncbi.nih.gov/genomes/ and to the list of completed eukaryotic genomes (NCBI).

See the complete list here: contig list (73,867 entries; 4.5 MB).

Flat file features

The following documents describe in detail the features of various flat files:

EMBL Features and Qualifiers
User Manual — by UniProt Knowledgebase (release 10.4; 2007-05-01)
Protein naming guidelines — by UniProt - Swiss-Prot Protein Knowledgebase (release 52.4; 2007-05-01)

Index files

Each index file is built on one kind of key (accession numbers, keywords, authors, journals, or gene symbols), and its keys are sorted alphabetically. Each index key is followed by the identifiers of the sequence entries that contain it (LOCUS name, division abbreviation, and primary accession number). The division abbreviations are:

PRI - primate sequences
ROD - rodent sequences
MAM - other mammalian sequences
VRT - other vertebrate sequences
INV - invertebrate sequences
PLN - plant, fungal, and algal sequences
BCT - bacterial sequences
VRL - viral sequences
PHG - bacteriophage sequences
SYN - synthetic sequences
UNA - unannotated sequences
EST - EST sequences (expressed sequence tags)
PAT - patent sequences
STS - STS sequences (sequence tagged sites)
GSS - GSS sequences (genome survey sequences)
HTG - HTGS sequences (high throughput genomic sequences)
HTC - HTC sequences (high throughput cDNA sequences)
ENV - environmental sampling sequences
CON - constructed sequences
WGS - whole genome shotgun sequencing projects
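
For illustration, the division abbreviations above can be kept in a lookup table and used to group index entries. The sketch assumes each entry is already available as a (LOCUS name, division abbreviation, primary accession number) triple, as described above. It does not parse the on-disk layout of the index files, the function name group_by_division is hypothetical, and the sample entries are invented rather than real GenBank records.

```python
from collections import defaultdict

# Division abbreviations as listed above.
DIVISIONS = {
    "PRI": "primate sequences",
    "ROD": "rodent sequences",
    "MAM": "other mammalian sequences",
    "VRT": "other vertebrate sequences",
    "INV": "invertebrate sequences",
    "PLN": "plant, fungal, and algal sequences",
    "BCT": "bacterial sequences",
    "VRL": "viral sequences",
    "PHG": "bacteriophage sequences",
    "SYN": "synthetic sequences",
    "UNA": "unannotated sequences",
    "EST": "EST sequences (expressed sequence tags)",
    "PAT": "patent sequences",
    "STS": "STS sequences (sequence tagged sites)",
    "GSS": "GSS sequences (genome survey sequences)",
    "HTG": "HTGS sequences (high throughput genomic sequences)",
    "HTC": "HTC sequences (high throughput cDNA sequences)",
    "ENV": "environmental sampling sequences",
    "CON": "constructed sequences",
    "WGS": "whole genome shotgun sequencing projects",
}


def group_by_division(entries):
    """Map each division code to the sorted accession numbers filed under it."""
    grouped = defaultdict(list)
    for locus, division, accession in entries:
        if division not in DIVISIONS:
            raise KeyError(f"unknown division {division!r} for {locus}")
        grouped[division].append(accession)
    return {div: sorted(accs) for div, accs in grouped.items()}


# Invented sample entries (not real GenBank records).
sample = [
    ("LOCUS_A", "BCT", "X00002"),
    ("LOCUS_B", "PRI", "X00003"),
    ("LOCUS_C", "BCT", "X00001"),
]

for division, accessions in sorted(group_by_division(sample).items()):
    print(f"{division} ({DIVISIONS[division]}): {', '.join(accessions)}")
```

Rejecting unknown division codes early makes a malformed entry visible at once instead of silently creating a new group.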

References

[1] Benton D (1990). "Recent changes in the GenBank On-line Service". Nucleic Acids Research, 18(6):1517–1520.
[2] Benson DA et al. (2006). "GenBank". Nucleic Acids Research, 34(Database):D16–D20.
[3] NCBI-GenBank Flat File - Distribution Release Notes ('gbrel.txt').

External links

GenBank (overview)
NCBI Resource Guide
FTP directory containing full GenBank flat file releases (NCBI)
Genomes (NCBI)
List of completed eukaryotic genomes (NCBI)
The DDBJ/EMBL/GenBank Feature Table: Definition — version 6.6, 2006-10.
Trace Archive
