Difference between revisions of "GenBank"
(→See also) |
(→Statistics) |
||
(43 intermediate revisions by the same user not shown) | |||
Line 2: | Line 2: | ||
==Statistics== | ==Statistics== | ||
− | + | <div style="float:left; margin:0px 20px 20px 0px;"> | |
− | + | {| align="center" style="border: 1px solid #999; background-color:#FFFFFF" | |
− | + | |- | |
+ | ! colspan="7" bgcolor="#EFEFEF" | '''GenBank Flat File Release Statistics''' | ||
+ | |-align="center" bgcolor="#1188ee" | ||
+ | !Release | ||
+ | !Date | ||
+ | !Loci | ||
+ | !Bases | ||
+ | !Sequences<sup>1</sup> | ||
+ | !Size (GB)<sup>2</sup> | ||
+ | !Size (GB)<sup>3</sup> | ||
+ | |- align="right" | ||
+ | |236.0 || 2020-02-15 || n/a || 399,376,854,872 || 216,214,215 || n/a || 1117 | ||
+ | |--bgcolor="#eeeeee" align="right" | ||
+ | |232.0 || 2019-06-15 || 213,383,758 || 329,835,282,370 || 213,383,758 || n/a || 1006 | ||
+ | |- align="right" | ||
+ | |230.0 || 2019-02-15 || 212,260,377 || 303,709,510,632 || 212,260,377 || n/a || n/a | ||
+ | |--bgcolor="#eeeeee" align="right" | ||
+ | |225.0 || 2018-04-15 || 208,452,303 || 260,189,141,631 || 208,452,303 || n/a || 885 | ||
+ | |- align="right" | ||
+ | |221.0 || 2017-08-15 || 203,180,606 || 240,343,378,258 || 203,180,606 || n/a || 841 | ||
+ | |--bgcolor="#eeeeee" align="right" | ||
+ | |213.0 || 2016-04-15 || 193,739,511 || 211,423,912,047 || 193,739,511 || n/a || 771 | ||
+ | |- align="right" | ||
+ | |211.0 || 2015-12-15 || 189,232,925 || 203,939,111,071 || 189,232,925 || n/a || 749 | ||
+ | |--bgcolor="#eeeeee" align="right" | ||
+ | |209.0 || 2015-08-15 || 187,066,846 || 199,823,644,287 || 187,066,846 || n/a || 735 | ||
+ | |- align="right" | ||
+ | |204.0 || 2014-10-15 || 178,322,253 || 181,563,676,918 || 178,322,253 || n/a || 680 | ||
+ | |--bgcolor="#eeeeee" align="right" | ||
+ | |203.0 || 2014-08-15 || 174,108,750 || 165,722,980,375 || 174,108,750 || n/a || 652 | ||
+ | |- align="right" | ||
+ | |200.0 || 2014-02-15 || 171,123,749 || 157,943,793,171 || 171,123,749 || n/a || 625 | ||
+ | |--bgcolor="#eeeeee" align="right" | ||
+ | |198.0 || 2013-10-15 || 168,335,396 || 155,176,494,699 || 168,335,396 || n/a || 613 | ||
+ | |- align="right" | ||
+ | |193.0 || 2012-12-15 || 161,140,325 || 148,390,863,904 || 161,140,325 || 579 || 624 | ||
+ | |--bgcolor="#eeeeee" align="right" | ||
+ | |192.0 || 2012-10-15 || 157,889,737 || 145,430,961,262 || 157,889,737 || 569 || 612 | ||
+ | |- align="right" | ||
+ | |190.0 || 2012-06-15 || 154,130,210 || 141,343,240,755 || 154,130,210 || 553 || 595 | ||
+ | |--bgcolor="#eeeeee" align="right" | ||
+ | |189.0 || 2012-04-15 || 151,824,421 || 139,266,481,398 || 151,824,421 || 545 || 586 | ||
+ | |- align="right" | ||
+ | |188.0 || 2012-02-15 || 149,819,246 || 137,384,889,783 || 149,819,246 || 539 || 580 | ||
+ | |--bgcolor="#eeeeee" align="right" | ||
+ | |187.0 || 2011-12-15 || 146,413,798 || 135,117,731,375 || 146,413,798 || 528 || 568 | ||
+ | |- align="right" | ||
+ | |177.0 || 2010-04-15 || 119,112,251 || 114,348,888,771 || 119,112,251 || 439 || 471 | ||
+ | |--bgcolor="#eeeeee" align="right" | ||
+ | |174.0 || 2009-10-15 || 110,946,879 || 108,560,236,506 || 110,946,879 || 416 || 445 | ||
+ | |- align="right" | ||
+ | |171.0 || 2009-04-15 || 103,335,421 || 102,980,268,709 || 103,335,421 || 395 || 422 | ||
+ | |--bgcolor="#eeeeee" align="right" | ||
+ | |170.0 || 2009-02-15 || 101,815,678 || 101,467,270,308 || 101,815,678 || 390 || 417 | ||
+ | |- align="right" | ||
+ | |169.0 || 2008-12-15 || 98,868,465 || 99,116,431,942 || 98,868,465 || 381 || 407 | ||
+ | |--bgcolor="#eeeeee" align="right" | ||
+ | |168.0 || 2008-10-15 || 96,400,790 || 97,381,682,336 || 96,400,790 || 371 || 396 | ||
+ | |- align="right" | ||
+ | |166.0 || 2008-06-15 || 88,554,578 || 92,008,611,867 || 88,554,578 || 343 || 366 | ||
+ | |--bgcolor="#eeeeee" align="right" | ||
+ | |164.0 || 2008-02-15 || 82,853,685 || 85,759,586,764 || 82,853,685 || 321 || 342 | ||
+ | |- align="right" | ||
+ | |161.0 || 2007-08-15 || 76,146,236 || 79,525,559,650 || 76,146,236 || 299 || 319 | ||
+ | |--bgcolor="#eeeeee" align="right" | ||
+ | |158.0 || 2007-02-15 || 67,218,344 || 71,292,211,453 || 67,218,344 || 251 || 263 | ||
+ | |} | ||
+ | <div align="left">''Source: <code>gbrel.txt</code><ref name="gbrel">[ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt NCBI-GenBank Flat File - Distribution Release Notes] ('<code>gbrel.txt</code>').</ref>''<br/> | ||
+ | <sup>1</sup> reported sequences<br/> | ||
+ | <sup>2</sup> Uncompressed flatfiles, sequence files only<br/> | ||
+ | <sup>3</sup> Uncompressed flatfiles, including the '<code>short directory</code>', '<code>index</code>', and the <code>*.txt</code> files | ||
+ | </div> | ||
+ | </div> | ||
+ | <br clear="all"/> | ||
+ | Note: You can find the current release number by issuing either of the following commands: | ||
+ | $ curl -s <nowiki>ftp://ftp.ncbi.nih.gov/genbank/GB_Release_Number</nowiki> | ||
+ | $ lynx --dump <nowiki>ftp://ftp.ncbi.nih.gov/genbank/GB_Release_Number</nowiki> | ||
− | + | ==Selected genomes== | |
− | + | see: [ftp://ftp.ncbi.nih.gov/genomes/IDS/ IDS] for a list of files containing the IDs of completed genomes. | |
+ | ===Prokaryotes=== | ||
+ | see: [ftp://ftp.ncbi.nih.gov/genomes/Bacteria/summary.txt summary.txt] for a daily updated list of completed bacterial genomes. | ||
+ | see: [ftp://ftp.ncbi.nih.gov/genomes/Plasmids/Plasmids.ids plasmids.ids] for a list of completed plasmid genomes. | ||
− | == | + | ===Eukaryotes=== |
''Note: The following are not part of the main NCBI GenBank database.'' | ''Note: The following are not part of the main NCBI GenBank database.'' | ||
*Fungi | *Fungi | ||
Line 27: | Line 106: | ||
**''Drosophila melanogaster'' (Fruit fly) | **''Drosophila melanogaster'' (Fruit fly) | ||
*Other | *Other | ||
− | **''Encephalitozoon cuniculi'' ( | + | **''Encephalitozoon cuniculi'' (an intracellular parasite) |
===GenBank entries in the eukaryotic database=== | ===GenBank entries in the eukaryotic database=== | ||
For details please refer to the NCBI genome FTP site at: ftp://ftp.ncbi.nih.gov/genomes/ and the [http://www.ncbi.nlm.nih.gov/genomes/static/euk_g.html list of completed eukaryotic genomes] (NCBI). | For details please refer to the NCBI genome FTP site at: ftp://ftp.ncbi.nih.gov/genomes/ and the [http://www.ncbi.nlm.nih.gov/genomes/static/euk_g.html list of completed eukaryotic genomes] (NCBI). | ||
− | See the complete list here: [http://www.cbs.dtu.dk/services/FeatureExtract/contig_sum.txt contig list] (73,867 entries; 4. | + | See the complete list here: [http://www.cbs.dtu.dk/services/FeatureExtract/contig_sum.txt contig list] (73,867 entries; 4.5 MB). |
+ | |||
+ | ==Flat file features== | ||
+ | The following documents describe in detail the features of various flat files: | ||
+ | *[http://www3.ebi.ac.uk/Services/WebFeat/ EMBL Features and Qualifiers] | ||
+ | *[http://www.expasy.org/sprot/userman.html User Manual] — by UniProt Knowledgebase (release 10.4; 2007-05-01) | ||
+ | *[http://www.expasy.org/cgi-bin/lists?nameprot.txt Protein naming guidelines] — by UniProt - Swiss-Prot Protein Knowledgebase (release 52.4; 2007-05-01) | ||
+ | |||
+ | ===Index files=== | ||
+ | The index keys (accession numbers, keywords, authors, journals, and gene symbols) of an index are sorted alphabetically. Following each index key, the identifiers of the sequence entries containing that key are listed (LOCUS name, division abbreviation, and primary accession | ||
+ | number). The division abbreviations are: | ||
+ | PRI - primate sequences | ||
+ | ROD - rodent sequences | ||
+ | MAM - other mammalian sequences | ||
+ | VRT - other vertebrate sequences | ||
+ | INV - invertebrate sequences | ||
+ | PLN - plant, fungal, and algal sequences | ||
+ | BCT - bacterial sequences | ||
+ | VRL - viral sequences | ||
+ | PHG - bacteriophage sequences | ||
+ | SYN - synthetic sequences | ||
+ | UNA - unannotated sequences | ||
+ | EST - EST sequences (expressed sequence tags) | ||
+ | PAT - patent sequences | ||
+ | STS - STS sequences (sequence tagged sites) | ||
+ | GSS - GSS sequences (genome survey sequences) | ||
+ | HTG - HTGS sequences (high throughput genomic sequences) | ||
+ | HTC - HTC sequences (high throughput cDNA sequences) | ||
+ | ENV - Environmental sampling sequences | ||
+ | CON - Constructed sequences | ||
+ | WGS - Whole Genome Shotgun sequencing projects | ||
==See also== | ==See also== | ||
− | *[[ | + | *[[Genome projects]] |
+ | *[http://www.sanger.ac.uk/Software/formats/GFF/ GFF file format] — format for describing genes and other features associated with DNA, RNA and Protein sequences. | ||
+ | *[[TAB file format]] (aka "<code>gb2tab</code>") | ||
*[ftp://ftp.ncbi.nih.gov/genbank/tools/build_gbff_cu.pl build_gbff_cu.pl] — Build a non-redundant cumulative GenBank flatfile from a set of GenBank Incremental Update (GIU) flatfiles provided by the NCBI. Documentation can be found [ftp://ftp.ncbi.nih.gov/genbank/tools/doc.build_gbff_cu.html here]. | *[ftp://ftp.ncbi.nih.gov/genbank/tools/build_gbff_cu.pl build_gbff_cu.pl] — Build a non-redundant cumulative GenBank flatfile from a set of GenBank Incremental Update (GIU) flatfiles provided by the NCBI. Documentation can be found [ftp://ftp.ncbi.nih.gov/genbank/tools/doc.build_gbff_cu.html here]. | ||
*[ftp://ftp.ncbi.nih.gov/genbank/tools/ffidx.pl ffidx.pl] — Generate an index file containing the sequence identifier and byte-offset of each record in a flatfile which contains biological sequence data. | *[ftp://ftp.ncbi.nih.gov/genbank/tools/ffidx.pl ffidx.pl] — Generate an index file containing the sequence identifier and byte-offset of each record in a flatfile which contains biological sequence data. | ||
+ | *[http://www.ebi.ac.uk/embl/Documentation/Release_notes/current/relnotes.html EMBL Nucleotide Sequence Database: Release Notes] | ||
==References== | ==References== | ||
Line 43: | Line 155: | ||
==External links== | ==External links== | ||
*[http://www.ncbi.nlm.nih.gov/Genbank/index.html GenBank] (overview) | *[http://www.ncbi.nlm.nih.gov/Genbank/index.html GenBank] (overview) | ||
− | *[ftp://ftp.ncbi.nih.gov/genbank | + | *[http://www.ncbi.nlm.nih.gov/Sitemap/ResourceGuide.html NCBI Resource Guide] |
+ | *[ftp://ftp.ncbi.nih.gov/genbank/ FTP directory containing full GenBank flat file releases] (NCBI) | ||
*[ftp://ftp.ncbi.nih.gov/genomes/ Genomes] (NCBI) | *[ftp://ftp.ncbi.nih.gov/genomes/ Genomes] (NCBI) | ||
*[http://www.ncbi.nlm.nih.gov/genomes/static/euk_g.html List of completed eukaryotic genomes] (NCBI) | *[http://www.ncbi.nlm.nih.gov/genomes/static/euk_g.html List of completed eukaryotic genomes] (NCBI) | ||
*[ftp://ftp.ncbi.nih.gov/genbank/docs/FTv6_6.html The DDBJ/EMBL/GenBank Feature Table: Definition] — version 6.6, 2006-10. | *[ftp://ftp.ncbi.nih.gov/genbank/docs/FTv6_6.html The DDBJ/EMBL/GenBank Feature Table: Definition] — version 6.6, 2006-10. | ||
+ | *[http://www.ncbi.nlm.nih.gov/Traces/trace.cgi? Trace Archive] | ||
[[Category:Bioinformatics]] | [[Category:Bioinformatics]] |
Latest revision as of 18:10, 1 April 2020
The GenBank (aka Genetic Sequence Data Bank) sequence database is an open-access, annotated collection of all publicly available nucleotide sequences and their protein translations.[1][2] The database is produced at the National Center for Biotechnology Information (NCBI).
Statistics
GenBank Flat File Release Statistics

Release | Date | Loci | Bases | Sequences¹ | Size (GB)² | Size (GB)³ |
---|---|---|---|---|---|---|
236.0 | 2020-02-15 | n/a | 399,376,854,872 | 216,214,215 | n/a | 1117 |
232.0 | 2019-06-15 | 213,383,758 | 329,835,282,370 | 213,383,758 | n/a | 1006 |
230.0 | 2019-02-15 | 212,260,377 | 303,709,510,632 | 212,260,377 | n/a | n/a |
225.0 | 2018-04-15 | 208,452,303 | 260,189,141,631 | 208,452,303 | n/a | 885 |
221.0 | 2017-08-15 | 203,180,606 | 240,343,378,258 | 203,180,606 | n/a | 841 |
213.0 | 2016-04-15 | 193,739,511 | 211,423,912,047 | 193,739,511 | n/a | 771 |
211.0 | 2015-12-15 | 189,232,925 | 203,939,111,071 | 189,232,925 | n/a | 749 |
209.0 | 2015-08-15 | 187,066,846 | 199,823,644,287 | 187,066,846 | n/a | 735 |
204.0 | 2014-10-15 | 178,322,253 | 181,563,676,918 | 178,322,253 | n/a | 680 |
203.0 | 2014-08-15 | 174,108,750 | 165,722,980,375 | 174,108,750 | n/a | 652 |
200.0 | 2014-02-15 | 171,123,749 | 157,943,793,171 | 171,123,749 | n/a | 625 |
198.0 | 2013-10-15 | 168,335,396 | 155,176,494,699 | 168,335,396 | n/a | 613 |
193.0 | 2012-12-15 | 161,140,325 | 148,390,863,904 | 161,140,325 | 579 | 624 |
192.0 | 2012-10-15 | 157,889,737 | 145,430,961,262 | 157,889,737 | 569 | 612 |
190.0 | 2012-06-15 | 154,130,210 | 141,343,240,755 | 154,130,210 | 553 | 595 |
189.0 | 2012-04-15 | 151,824,421 | 139,266,481,398 | 151,824,421 | 545 | 586 |
188.0 | 2012-02-15 | 149,819,246 | 137,384,889,783 | 149,819,246 | 539 | 580 |
187.0 | 2011-12-15 | 146,413,798 | 135,117,731,375 | 146,413,798 | 528 | 568 |
177.0 | 2010-04-15 | 119,112,251 | 114,348,888,771 | 119,112,251 | 439 | 471 |
174.0 | 2009-10-15 | 110,946,879 | 108,560,236,506 | 110,946,879 | 416 | 445 |
171.0 | 2009-04-15 | 103,335,421 | 102,980,268,709 | 103,335,421 | 395 | 422 |
170.0 | 2009-02-15 | 101,815,678 | 101,467,270,308 | 101,815,678 | 390 | 417 |
169.0 | 2008-12-15 | 98,868,465 | 99,116,431,942 | 98,868,465 | 381 | 407 |
168.0 | 2008-10-15 | 96,400,790 | 97,381,682,336 | 96,400,790 | 371 | 396 |
166.0 | 2008-06-15 | 88,554,578 | 92,008,611,867 | 88,554,578 | 343 | 366 |
164.0 | 2008-02-15 | 82,853,685 | 85,759,586,764 | 82,853,685 | 321 | 342 |
161.0 | 2007-08-15 | 76,146,236 | 79,525,559,650 | 76,146,236 | 299 | 319 |
158.0 | 2007-02-15 | 67,218,344 | 71,292,211,453 | 67,218,344 | 251 | 263 |
Source: gbrel.txt [3]
¹ Reported sequences
² Uncompressed flatfiles, sequence files only
³ Uncompressed flatfiles, including the 'short directory', 'index', and the *.txt files
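The growth implied by the table can be made concrete with a little arithmetic. The following is a minimal sketch that uses only the oldest (158.0) and newest (236.0) rows of the table; the function names are illustrative, and the doubling-time figure assumes steady exponential growth between the two releases, which the table itself does not claim.

```python
from datetime import date
from math import log

# Two rows copied from the release statistics table:
# (release date, bases, reported sequences).
RELEASE_158 = (date(2007, 2, 15), 71_292_211_453, 67_218_344)
RELEASE_236 = (date(2020, 2, 15), 399_376_854_872, 216_214_215)


def mean_length(bases: int, sequences: int) -> float:
    """Average number of bases per reported sequence."""
    return bases / sequences


def growth_factor(old_bases: int, new_bases: int) -> float:
    """How many times larger the newer release is, measured in bases."""
    return new_bases / old_bases


def doubling_time_years(old, new) -> float:
    """Doubling time in years, ASSUMING steady exponential growth
    between the two releases (a simplification for illustration)."""
    years = (new[0] - old[0]).days / 365.25
    return years * log(2) / log(growth_factor(old[1], new[1]))


if __name__ == "__main__":
    print(f"growth in bases: x{growth_factor(RELEASE_158[1], RELEASE_236[1]):.2f}")
    print(f"mean length, release 158.0: {mean_length(*RELEASE_158[1:]):.0f} bases")
    print(f"mean length, release 236.0: {mean_length(*RELEASE_236[1:]):.0f} bases")
    print(f"doubling time: {doubling_time_years(RELEASE_158, RELEASE_236):.1f} years")
```

Any other pair of rows can be substituted to compare growth over a shorter window.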
Note: You can find the current release number by issuing either of the following commands:
    $ curl -s ftp://ftp.ncbi.nih.gov/genbank/GB_Release_Number
    $ lynx --dump ftp://ftp.ncbi.nih.gov/genbank/GB_Release_Number
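The same lookup can be scripted. The sketch below uses Python's standard `urllib.request` with the URL from the commands above; it assumes that the `GB_Release_Number` file contains a single integer followed by whitespace, which is not stated here and should be checked against the actual file. The function names are illustrative.

```python
from urllib.request import urlopen

# URL taken from the commands above.
RELEASE_URL = "ftp://ftp.ncbi.nih.gov/genbank/GB_Release_Number"


def parse_release_number(text: str) -> int:
    """Extract the release number.

    ASSUMPTION: the file holds one integer plus surrounding whitespace.
    int() raises ValueError if the content looks different.
    """
    return int(text.strip())


def fetch_release_number(url: str = RELEASE_URL, timeout: float = 30.0) -> int:
    """Download the file and parse it (needs network access and a reachable server)."""
    with urlopen(url, timeout=timeout) as response:
        return parse_release_number(response.read().decode("ascii"))
```

Calling `fetch_release_number()` performs the download; `parse_release_number` is kept separate so the parsing step can be checked without network access.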
Selected genomes
See IDS for a list of files containing the IDs of completed genomes.
Prokaryotes
See summary.txt for a daily updated list of completed bacterial genomes.
See plasmids.ids for a list of completed plasmid genomes.
Eukaryotes
Note: The following are not part of the main NCBI GenBank database.
- Fungi
  - Saccharomyces cerevisiae (Baker's Yeast)
  - Schizosaccharomyces pombe (Fission Yeast)
- Plants
  - Arabidopsis thaliana
- Vertebrates
  - Canis familiaris (Dog)
  - Gallus gallus (Chicken)
  - Homo sapiens (Human)
  - Mus musculus (Mouse)
  - Rattus norvegicus (Rat)
- Invertebrates
  - Apis mellifera (Honey bee)
  - Caenorhabditis elegans (Nematode)
  - Drosophila melanogaster (Fruit fly)
- Other
  - Encephalitozoon cuniculi (an intracellular parasite)
GenBank entries in the eukaryotic database
For details please refer to the NCBI genome FTP site at: ftp://ftp.ncbi.nih.gov/genomes/ and the list of completed eukaryotic genomes (NCBI).
See the complete list here: contig list (73,867 entries; 4.5 MB).
Flat file features
The following documents describe in detail the features of various flat files:
- EMBL Features and Qualifiers
- User Manual — by UniProt Knowledgebase (release 10.4; 2007-05-01)
- Protein naming guidelines — by UniProt - Swiss-Prot Protein Knowledgebase (release 52.4; 2007-05-01)
Index files
Each index is sorted alphabetically by its index key (accession number, keyword, author, journal, or gene symbol). Following each index key, the identifiers of the sequence entries containing that key are listed (LOCUS name, division abbreviation, and primary accession number). The division abbreviations are:
    PRI - primate sequences
    ROD - rodent sequences
    MAM - other mammalian sequences
    VRT - other vertebrate sequences
    INV - invertebrate sequences
    PLN - plant, fungal, and algal sequences
    BCT - bacterial sequences
    VRL - viral sequences
    PHG - bacteriophage sequences
    SYN - synthetic sequences
    UNA - unannotated sequences
    EST - EST sequences (expressed sequence tags)
    PAT - patent sequences
    STS - STS sequences (sequence tagged sites)
    GSS - GSS sequences (genome survey sequences)
    HTG - HTGS sequences (high throughput genomic sequences)
    HTC - HTC sequences (high throughput cDNA sequences)
    ENV - Environmental sampling sequences
    CON - Constructed sequences
    WGS - Whole Genome Shotgun sequencing projects
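The index structure described above (alphabetically sorted keys, each followed by LOCUS name, division abbreviation, and primary accession number) can be sketched in memory as follows. This is only an illustration of the data structure: it does not parse the real index files, whose on-disk layout is not described here, and the sample keys, LOCUS names, and accession numbers in the usage example are made up.

```python
from collections import defaultdict

# Division abbreviations as listed above.
DIVISIONS = {
    "PRI": "primate sequences",
    "ROD": "rodent sequences",
    "MAM": "other mammalian sequences",
    "VRT": "other vertebrate sequences",
    "INV": "invertebrate sequences",
    "PLN": "plant, fungal, and algal sequences",
    "BCT": "bacterial sequences",
    "VRL": "viral sequences",
    "PHG": "bacteriophage sequences",
    "SYN": "synthetic sequences",
    "UNA": "unannotated sequences",
    "EST": "EST sequences (expressed sequence tags)",
    "PAT": "patent sequences",
    "STS": "STS sequences (sequence tagged sites)",
    "GSS": "GSS sequences (genome survey sequences)",
    "HTG": "HTGS sequences (high throughput genomic sequences)",
    "HTC": "HTC sequences (high throughput cDNA sequences)",
    "ENV": "Environmental sampling sequences",
    "CON": "Constructed sequences",
    "WGS": "Whole Genome Shotgun sequencing projects",
}


def build_index(records):
    """Group (key, locus, division, accession) records by index key.

    Returns a list of (key, identifiers) pairs with keys in sorted order;
    each identifier is a (locus, division, accession) tuple.
    Plain string ordering is used here as a stand-in for "alphabetical".
    """
    index = defaultdict(list)
    for key, locus, division, accession in records:
        if division not in DIVISIONS:
            raise ValueError(f"unknown division abbreviation: {division}")
        index[key].append((locus, division, accession))
    return [(key, sorted(index[key])) for key in sorted(index)]


if __name__ == "__main__":
    # Hypothetical records, invented for this example only.
    sample = [
        ("KINASE", "LOCB", "ROD", "B00002"),
        ("ACTIN", "LOCA", "PRI", "A00001"),
        ("KINASE", "LOCA", "PRI", "A00001"),
    ]
    for key, identifiers in build_index(sample):
        print(key)
        for locus, division, accession in identifiers:
            print(f"    {locus} {division} {accession}  ({DIVISIONS[division]})")
```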
See also
- Genome projects
- GFF file format: a format for describing genes and other features associated with DNA, RNA, and protein sequences.
- TAB file format (aka "gb2tab")
- build_gbff_cu.pl: builds a non-redundant cumulative GenBank flatfile from a set of GenBank Incremental Update (GIU) flatfiles provided by the NCBI. Documentation: ftp://ftp.ncbi.nih.gov/genbank/tools/doc.build_gbff_cu.html
- ffidx.pl: generates an index file containing the sequence identifier and byte offset of each record in a flatfile that contains biological sequence data.
- EMBL Nucleotide Sequence Database: Release Notes
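The kind of byte-offset index that ffidx.pl is described as producing can be sketched in a few lines. This is not the implementation of ffidx.pl; it is a minimal illustration that assumes each record begins with a `LOCUS` line at the start of a line and uses the LOCUS name (the second whitespace-separated field) as the sequence identifier. The record contents in the usage example are invented.

```python
import io


def index_flatfile(stream):
    """Map each LOCUS name to the byte offset of its record's first line.

    `stream` must be a binary file object, so that offsets are counted
    in bytes and can be passed to seek() later.
    """
    offsets = {}
    position = 0
    for line in stream:
        if line.startswith(b"LOCUS "):
            fields = line.split()
            if len(fields) > 1:
                offsets[fields[1].decode("ascii")] = position
        position += len(line)  # lines keep their newline, so this stays exact
    return offsets


if __name__ == "__main__":
    # Two tiny made-up records, for illustration only.
    data = (
        b"LOCUS       HYP0001   10 bp    DNA\nORIGIN\n//\n"
        b"LOCUS       HYP0002   12 bp    DNA\nORIGIN\n//\n"
    )
    handle = io.BytesIO(data)
    index = index_flatfile(handle)
    handle.seek(index["HYP0002"])  # jump straight to the second record
    print(handle.readline().decode("ascii").rstrip())
```

With such an index, a single record can be read from a very large flatfile without scanning it from the beginning.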
References
1. Benton D (1990). "Recent changes in the GenBank On-line Service". Nucleic Acids Research, 18(6):1517–1520.
2. Benson DA et al. (2006). "GenBank". Nucleic Acids Research, 34(Database issue):D16–D20.
3. NCBI-GenBank Flat File - Distribution Release Notes ('gbrel.txt'). ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
External links
- GenBank (overview)
- NCBI Resource Guide
- FTP directory containing full GenBank flat file releases (NCBI)
- Genomes (NCBI)
- List of completed eukaryotic genomes (NCBI)
- The DDBJ/EMBL/GenBank Feature Table: Definition — version 6.6, 2006-10.
- Trace Archive