Difference between revisions of "GenBank"

From Christoph's Personal Wiki
(Flat file features)
(Statistics)
 
(36 intermediate revisions by the same user not shown)
Line 2: Line 2:
  
 
 ==Statistics==
-*GenBank Flat File Release '''161.0''' (2007-08-15)
-**'''76,146,236''' loci, '''79,525,559,650''' bases, from '''76,146,236''' reported sequences.<ref name="gbrel">[ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt NCBI-GenBank Flat File Release 158.0 - Distribution Release Notes] ('<code>gbrel.txt</code>') &mdash; 2007-02-15.</ref>
-**Uncompressed, the Release 161.0 flatfiles require roughly '''299 GB''' (sequence files only) or '''319 GB''' (including the '<code>short directory</code>', '<code>index</code>', and the <code>*.txt</code> files).
-*GenBank Flat File Release '''158.0''' (2007-02-15)
-**'''67,218,344''' loci, '''71,292,211,453''' bases, from '''67,218,344''' reported sequences.<ref name="gbrel"/>
-**Uncompressed, the Release 158.0 flatfiles require roughly '''251 GB''' (sequence files only) or '''263 GB''' (including the '<code>short directory</code>', '<code>index</code>', and the <code>*.txt</code> files).
+<div style="float:left; margin:0px 20px 20px 0px;">
+{| align="center" style="border: 1px solid #999; background-color:#FFFFFF"
+|-
+! colspan="7" bgcolor="#EFEFEF" | '''GenBank Flat File Release Statistics'''
+|-align="center" bgcolor="#1188ee"
+!Release
+!Date
+!Loci
+!Bases
+!Sequences<sup>1</sup>
+!Size (GB)<sup>2</sup>
+!Size (GB)<sup>3</sup>
+|- align="right"
+|236.0 || 2020-02-15 || n/a || 399,376,854,872 || 216,214,215 || n/a || 1117
+|--bgcolor="#eeeeee" align="right"
+|232.0 || 2019-06-15 || 213,383,758 || 329,835,282,370 || 213,383,758 || n/a || 1006
+|- align="right"
+|230.0 || 2019-02-15 || 212,260,377 || 303,709,510,632 || 212,260,377 || n/a || n/a
+|--bgcolor="#eeeeee" align="right"
+|225.0 || 2018-04-15 || 208,452,303 || 260,189,141,631 || 208,452,303 || n/a || 885
+|- align="right"
+|221.0 || 2017-08-15 || 203,180,606 || 240,343,378,258 || 203,180,606 || n/a || 841
+|--bgcolor="#eeeeee" align="right"
+|213.0 || 2016-04-15 || 193,739,511 || 211,423,912,047 || 193,739,511 || n/a || 771
+|- align="right"
+|211.0 || 2015-12-15 || 189,232,925 || 203,939,111,071 || 189,232,925 || n/a || 749
+|--bgcolor="#eeeeee" align="right"
+|209.0 || 2015-08-15 || 187,066,846 || 199,823,644,287 || 187,066,846 || n/a || 735
+|- align="right"
+|204.0 || 2014-10-15 || 178,322,253 || 181,563,676,918 || 178,322,253 || n/a || 680
+|--bgcolor="#eeeeee" align="right"
+|203.0 || 2014-08-15 || 174,108,750 || 165,722,980,375 || 174,108,750 || n/a || 652
+|- align="right"
+|200.0 || 2014-02-15 || 171,123,749 || 157,943,793,171 || 171,123,749 || n/a || 625
+|--bgcolor="#eeeeee" align="right"
+|198.0 || 2013-10-15 || 168,335,396 || 155,176,494,699 || 168,335,396 || n/a || 613
+|- align="right"
+|193.0 || 2012-12-15 || 161,140,325 || 148,390,863,904 || 161,140,325 || 579 || 624
+|--bgcolor="#eeeeee" align="right"
+|192.0 || 2012-10-15 || 157,889,737 || 145,430,961,262 || 157,889,737 || 569 || 612
+|- align="right"
+|190.0 || 2012-06-15 || 154,130,210 || 141,343,240,755 || 154,130,210 || 553 || 595
+|--bgcolor="#eeeeee" align="right"
+|189.0 || 2012-04-15 || 151,824,421 || 139,266,481,398 || 151,824,421 || 545 || 586
+|- align="right"
+|188.0 || 2012-02-15 || 149,819,246 || 137,384,889,783 || 149,819,246 || 539 || 580
+|--bgcolor="#eeeeee" align="right"
+|187.0 || 2011-12-15 || 146,413,798 || 135,117,731,375 || 146,413,798 || 528 || 568
+|- align="right"
+|177.0 || 2010-04-15 || 119,112,251 || 114,348,888,771 || 119,112,251 || 439 || 471
+|--bgcolor="#eeeeee" align="right"
+|174.0 || 2009-10-15 || 110,946,879 || 108,560,236,506 || 110,946,879 || 416 || 445
+|- align="right"
+|171.0 || 2009-04-15 || 103,335,421 || 102,980,268,709 || 103,335,421 || 395 || 422
+|--bgcolor="#eeeeee" align="right"
+|170.0 || 2009-02-15 || 101,815,678 || 101,467,270,308 || 101,815,678 || 390 || 417
+|- align="right"
+|169.0 || 2008-12-15 || 98,868,465 || 99,116,431,942 || 98,868,465 || 381 || 407
+|--bgcolor="#eeeeee" align="right"
+|168.0 || 2008-10-15 || 96,400,790 || 97,381,682,336 || 96,400,790 || 371 || 396
+|- align="right"
+|166.0 || 2008-06-15 || 88,554,578 || 92,008,611,867 || 88,554,578 || 343 || 366
+|--bgcolor="#eeeeee" align="right"
+|164.0 || 2008-02-15 || 82,853,685 || 85,759,586,764 || 82,853,685 || 321 || 342
+|- align="right"
+|161.0 || 2007-08-15 || 76,146,236 || 79,525,559,650 || 76,146,236 || 299 || 319
+|--bgcolor="#eeeeee" align="right"
+|158.0 || 2007-02-15 || 67,218,344 || 71,292,211,453 || 67,218,344 || 251 || 263
+|}
+<div align="left">''Source: <code>gbrel.txt</code><ref name="gbrel">[ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt NCBI-GenBank Flat File - Distribution Release Notes] ('<code>gbrel.txt</code>').</ref>''<br/>
+<sup>1</sup> reported sequences<br/>
+<sup>2</sup> Uncompressed flatfiles, sequence files only<br/>
+<sup>3</sup> Uncompressed flatfiles, including the '<code>short directory</code>', '<code>index</code>', and the <code>*.txt</code> files
+</div>
+</div>
+<br clear="all"/>
+Note: You can find the current release number by issuing either of the following commands:
+ $ curl -s <nowiki>ftp://ftp.ncbi.nih.gov/genbank/GB_Release_Number</nowiki>
+ $ lynx --dump <nowiki>ftp://ftp.ncbi.nih.gov/genbank/GB_Release_Number</nowiki>

-Note: You can find the current release number by issuing the following command:
-  lynx --dump ftp://ftp.ncbi.nih.gov/genbank/GB_Release_Number
+==Selected genomes==
+see: [ftp://ftp.ncbi.nih.gov/genomes/IDS/ IDS] for a list of files containing the IDs of completed genomes.
+===Prokaryotes===
+see: [ftp://ftp.ncbi.nih.gov/genomes/Bacteria/summary.txt summary.txt] for a daily updated list of completed bacterial genomes.
+see: [ftp://ftp.ncbi.nih.gov/genomes/Plasmids/Plasmids.ids plasmids.ids] for a list of completed plasmid genomes.

-==Selected Eukaryotic genomes==
+===Eukaryotes===
 ''Note: The following are not part of the main NCBI GenBank database.''
 *Fungi
Line 46: Line 122:
 
 The index keys (accession numbers, keywords, authors, journals, and gene symbols) of an index are sorted alphabetically. Following each index key, the identifiers of the sequence entries containing that key are listed (LOCUS name, division abbreviation, and primary accession
 number). The division abbreviations are:
-#PRI - primate sequences
-#ROD - rodent sequences
-#MAM - other mammalian sequences
-#VRT - other vertebrate sequences
-#INV - invertebrate sequences
-#PLN - plant, fungal, and algal sequences
-#BCT - bacterial sequences
-#VRL - viral sequences
-#PHG - bacteriophage sequences
-#SYN - synthetic sequences
-#UNA - unannotated sequences
-#EST - EST sequences (expressed sequence tags)
-#PAT - patent sequences
-#STS - STS sequences (sequence tagged sites)
-#GSS - GSS sequences (genome survey sequences)
-#HTG - HTGS sequences (high throughput genomic sequences)
-#HTC - HTC sequences (high throughput cDNA sequences)
-#ENV - Environmental sampling sequences
-#CON - Constructed sequences
+PRI - primate sequences
+ROD - rodent sequences
+MAM - other mammalian sequences
+VRT - other vertebrate sequences
+INV - invertebrate sequences
+PLN - plant, fungal, and algal sequences
+BCT - bacterial sequences
+VRL - viral sequences
+PHG - bacteriophage sequences
+SYN - synthetic sequences
+UNA - unannotated sequences
+EST - EST sequences (expressed sequence tags)
+PAT - patent sequences
+STS - STS sequences (sequence tagged sites)
+GSS - GSS sequences (genome survey sequences)
+HTG - HTGS sequences (high throughput genomic sequences)
+HTC - HTC sequences (high throughput cDNA sequences)
+ENV - Environmental sampling sequences
+CON - Constructed sequences
+WGS - Whole Genome Shotgun sequencing projects

 ==See also==
 *[[Genome projects]]
+*[http://www.sanger.ac.uk/Software/formats/GFF/ GFF file format] &mdash; format for describing genes and other features associated with DNA, RNA and Protein sequences.
 *[[TAB file format]] (aka "<code>gb2tab</code>")
 *[ftp://ftp.ncbi.nih.gov/genbank/tools/build_gbff_cu.pl build_gbff_cu.pl] &mdash; Build a non-redundant cumulative GenBank flatfile from a set of GenBank Incremental Update (GIU) flatfiles provided by the NCBI. Documentation can be found [ftp://ftp.ncbi.nih.gov/genbank/tools/doc.build_gbff_cu.html here].
 *[ftp://ftp.ncbi.nih.gov/genbank/tools/ffidx.pl ffidx.pl] &mdash; Generate an index file containing the sequence identifier and byte-offset of each record in a flatfile which contains biological sequence data.
+*[http://www.ebi.ac.uk/embl/Documentation/Release_notes/current/relnotes.html EMBL Nucleotide Sequence Database: Release Notes]

 ==References==
Line 76: Line 155:
 
 ==External links==
 *[http://www.ncbi.nlm.nih.gov/Genbank/index.html GenBank] (overview)
-*[ftp://ftp.ncbi.nih.gov/genbank Directory containing full GenBank flat file releases] (NCBI)
+*[http://www.ncbi.nlm.nih.gov/Sitemap/ResourceGuide.html NCBI Resource Guide]
+*[ftp://ftp.ncbi.nih.gov/genbank/ FTP directory containing full GenBank flat file releases] (NCBI)
 *[ftp://ftp.ncbi.nih.gov/genomes/ Genomes] (NCBI)
 *[http://www.ncbi.nlm.nih.gov/genomes/static/euk_g.html List of completed eukaryotic genomes] (NCBI)
 *[ftp://ftp.ncbi.nih.gov/genbank/docs/FTv6_6.html The DDBJ/EMBL/GenBank Feature Table: Definition] &mdash; version 6.6, 2006-10.
+*[http://www.ncbi.nlm.nih.gov/Traces/trace.cgi? Trace Archive]

 [[Category:Bioinformatics]]

Latest revision as of 18:10, 1 April 2020

GenBank (also known as the Genetic Sequence Data Bank) is an open-access, annotated collection of all publicly available nucleotide sequences and their protein translations.[1][2] The database is produced at the National Center for Biotechnology Information (NCBI).

Statistics

GenBank Flat File Release Statistics
Release  Date  Loci  Bases  Sequences (1)  Size in GB (2)  Size in GB (3)
236.0 2020-02-15 n/a 399,376,854,872 216,214,215 n/a 1117
232.0 2019-06-15 213,383,758 329,835,282,370 213,383,758 n/a 1006
230.0 2019-02-15 212,260,377 303,709,510,632 212,260,377 n/a n/a
225.0 2018-04-15 208,452,303 260,189,141,631 208,452,303 n/a 885
221.0 2017-08-15 203,180,606 240,343,378,258 203,180,606 n/a 841
213.0 2016-04-15 193,739,511 211,423,912,047 193,739,511 n/a 771
211.0 2015-12-15 189,232,925 203,939,111,071 189,232,925 n/a 749
209.0 2015-08-15 187,066,846 199,823,644,287 187,066,846 n/a 735
204.0 2014-10-15 178,322,253 181,563,676,918 178,322,253 n/a 680
203.0 2014-08-15 174,108,750 165,722,980,375 174,108,750 n/a 652
200.0 2014-02-15 171,123,749 157,943,793,171 171,123,749 n/a 625
198.0 2013-10-15 168,335,396 155,176,494,699 168,335,396 n/a 613
193.0 2012-12-15 161,140,325 148,390,863,904 161,140,325 579 624
192.0 2012-10-15 157,889,737 145,430,961,262 157,889,737 569 612
190.0 2012-06-15 154,130,210 141,343,240,755 154,130,210 553 595
189.0 2012-04-15 151,824,421 139,266,481,398 151,824,421 545 586
188.0 2012-02-15 149,819,246 137,384,889,783 149,819,246 539 580
187.0 2011-12-15 146,413,798 135,117,731,375 146,413,798 528 568
177.0 2010-04-15 119,112,251 114,348,888,771 119,112,251 439 471
174.0 2009-10-15 110,946,879 108,560,236,506 110,946,879 416 445
171.0 2009-04-15 103,335,421 102,980,268,709 103,335,421 395 422
170.0 2009-02-15 101,815,678 101,467,270,308 101,815,678 390 417
169.0 2008-12-15 98,868,465 99,116,431,942 98,868,465 381 407
168.0 2008-10-15 96,400,790 97,381,682,336 96,400,790 371 396
166.0 2008-06-15 88,554,578 92,008,611,867 88,554,578 343 366
164.0 2008-02-15 82,853,685 85,759,586,764 82,853,685 321 342
161.0 2007-08-15 76,146,236 79,525,559,650 76,146,236 299 319
158.0 2007-02-15 67,218,344 71,292,211,453 67,218,344 251 263
Source: gbrel.txt[3]

(1) Reported sequences
(2) Uncompressed flatfiles, sequence files only
(3) Uncompressed flatfiles, including the 'short directory', 'index', and the *.txt files
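
The table invites a rough growth estimate. The following is a minimal sketch in Python that uses only the base counts of the oldest (158.0) and newest (236.0) rows above. The function names are illustrative, and the doubling time assumes constant exponential growth between the two releases, which the table itself does not claim.

```python
from datetime import date
from math import log

# (release, release date, bases) copied from the last and first table rows.
OLDEST = ("158.0", date(2007, 2, 15), 71_292_211_453)
NEWEST = ("236.0", date(2020, 2, 15), 399_376_854_872)


def growth_factor(old, new):
    """Ratio of base counts between two releases."""
    return new[2] / old[2]


def doubling_time_years(old, new):
    """Doubling time in years, ASSUMING constant exponential growth."""
    elapsed_years = (new[1] - old[1]).days / 365.25
    return elapsed_years * log(2) / log(growth_factor(old, new))
```

Calling `growth_factor(OLDEST, NEWEST)` gives the overall increase in bases, and `doubling_time_years(OLDEST, NEWEST)` gives an averaged doubling time. Intermediate rows can be substituted to compare periods.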


Note: You can find the current release number by issuing either of the following commands:

$ curl -s ftp://ftp.ncbi.nih.gov/genbank/GB_Release_Number
$ lynx --dump ftp://ftp.ncbi.nih.gov/genbank/GB_Release_Number
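
The same lookup can be scripted. This is a minimal sketch in Python, assuming the `GB_Release_Number` file holds a single number (for example `236` or `236.0`) with optional surrounding whitespace; that file layout and the helper names are assumptions, not something documented on this page.

```python
from urllib.request import urlopen

RELEASE_URL = "ftp://ftp.ncbi.nih.gov/genbank/GB_Release_Number"


def parse_release_number(text):
    """Return the major release number as an int.

    Assumes the text is a single number such as "236" or "236.0".
    """
    token = text.strip().split()[0]
    return int(token.split(".")[0])


def fetch_release_number(url=RELEASE_URL, timeout=30):
    """Download the release-number file and parse it (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return parse_release_number(response.read().decode("ascii"))
```

`fetch_release_number()` performs the download; `parse_release_number` is kept separate so it can be checked without a network connection.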

Selected genomes

See: IDS (ftp://ftp.ncbi.nih.gov/genomes/IDS/) for a list of files containing the IDs of completed genomes.

Prokaryotes

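See: summary.txt (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/summary.txt) for a daily updated list of completed bacterial genomes.
See: plasmids.ids (ftp://ftp.ncbi.nih.gov/genomes/Plasmids/Plasmids.ids) for a list of completed plasmid genomes.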

Eukaryotes

Note: The following are not part of the main NCBI GenBank database.

  • Fungi
    • Saccharomyces cerevisiae (Baker's Yeast)
    • Schizosaccharomyces pombe (Fission Yeast)
  • Plants
    • Arabidopsis thaliana
  • Vertebrates
    • Canis familiaris (Dog)
    • Gallus gallus (Chicken)
    • Homo sapiens (Human)
    • Mus musculus (Mouse)
    • Rattus norvegicus (Rat)
  • Invertebrates
    • Apis mellifera (Honey bee)
    • Caenorhabditis elegans (Nematode)
    • Drosophila melanogaster (Fruit fly)
  • Other
    • Encephalitozoon cuniculi (an intracellular parasite)

GenBank entries in the eukaryotic database

For details please refer to the NCBI genome FTP site at: ftp://ftp.ncbi.nih.gov/genomes/ and the list of completed eukaryotic genomes (NCBI).

See the complete list here: contig list (73,867 entries; 4.5 MB).

Flat file features

The following documents describe in detail the features of various flat files:

Index files

The index keys (accession numbers, keywords, authors, journals, and gene symbols) of an index are sorted alphabetically. Following each index key, the identifiers of the sequence entries containing that key are listed (LOCUS name, division abbreviation, and primary accession number). The division abbreviations are:

PRI - primate sequences
ROD - rodent sequences
MAM - other mammalian sequences
VRT - other vertebrate sequences
INV - invertebrate sequences
PLN - plant, fungal, and algal sequences
BCT - bacterial sequences
VRL - viral sequences
PHG - bacteriophage sequences
SYN - synthetic sequences
UNA - unannotated sequences
EST - EST sequences (expressed sequence tags)
PAT - patent sequences
STS - STS sequences (sequence tagged sites)
GSS - GSS sequences (genome survey sequences)
HTG - HTGS sequences (high throughput genomic sequences)
HTC - HTC sequences (high throughput cDNA sequences)
ENV - Environmental sampling sequences
CON - Constructed sequences
WGS - Whole Genome Shotgun sequencing projects
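
When processing index entries in a script, the division abbreviations above can be kept in a lookup table. The following is a minimal sketch in Python; the dictionary contents are copied from the list, while the names `DIVISIONS` and `describe_division` are illustrative only.

```python
# Division abbreviations as listed above.
DIVISIONS = {
    "PRI": "primate sequences",
    "ROD": "rodent sequences",
    "MAM": "other mammalian sequences",
    "VRT": "other vertebrate sequences",
    "INV": "invertebrate sequences",
    "PLN": "plant, fungal, and algal sequences",
    "BCT": "bacterial sequences",
    "VRL": "viral sequences",
    "PHG": "bacteriophage sequences",
    "SYN": "synthetic sequences",
    "UNA": "unannotated sequences",
    "EST": "EST sequences (expressed sequence tags)",
    "PAT": "patent sequences",
    "STS": "STS sequences (sequence tagged sites)",
    "GSS": "GSS sequences (genome survey sequences)",
    "HTG": "HTGS sequences (high throughput genomic sequences)",
    "HTC": "HTC sequences (high throughput cDNA sequences)",
    "ENV": "Environmental sampling sequences",
    "CON": "Constructed sequences",
    "WGS": "Whole Genome Shotgun sequencing projects",
}


def describe_division(code):
    """Return the description for a division code (case-insensitive).

    Raises KeyError for codes that are not in the list above.
    """
    return DIVISIONS[code.strip().upper()]
```

For example, `describe_division("bct")` returns the description of the bacterial division, and an unknown code raises `KeyError` rather than being silently accepted.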

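See also

  • Genome projects
  • GFF file format (http://www.sanger.ac.uk/Software/formats/GFF/): format for describing genes and other features associated with DNA, RNA and protein sequences.
  • TAB file format (aka "gb2tab")
  • build_gbff_cu.pl (ftp://ftp.ncbi.nih.gov/genbank/tools/build_gbff_cu.pl): builds a non-redundant cumulative GenBank flatfile from a set of GenBank Incremental Update (GIU) flatfiles provided by the NCBI. Documentation: ftp://ftp.ncbi.nih.gov/genbank/tools/doc.build_gbff_cu.html
  • ffidx.pl (ftp://ftp.ncbi.nih.gov/genbank/tools/ffidx.pl): generates an index file containing the sequence identifier and byte offset of each record in a flatfile that contains biological sequence data.
  • EMBL Nucleotide Sequence Database: Release Notes (http://www.ebi.ac.uk/embl/Documentation/Release_notes/current/relnotes.html)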

References

  1. Benton D (1990). "Recent changes in the GenBank On-line Service". Nucleic Acids Research, 18(6):1517–1520.
  2. Benson DA et al. (2006). "GenBank". Nucleic Acids Research, 34(Database):D16–D20.
  3. NCBI-GenBank Flat File - Distribution Release Notes ('gbrel.txt').

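External links

  • GenBank (overview): http://www.ncbi.nlm.nih.gov/Genbank/index.html
  • NCBI Resource Guide: http://www.ncbi.nlm.nih.gov/Sitemap/ResourceGuide.html
  • FTP directory containing full GenBank flat file releases (NCBI): ftp://ftp.ncbi.nih.gov/genbank/
  • Genomes (NCBI): ftp://ftp.ncbi.nih.gov/genomes/
  • List of completed eukaryotic genomes (NCBI): http://www.ncbi.nlm.nih.gov/genomes/static/euk_g.html
  • The DDBJ/EMBL/GenBank Feature Table: Definition, version 6.6, 2006-10: ftp://ftp.ncbi.nih.gov/genbank/docs/FTv6_6.html
  • Trace Archive: http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?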