GenBank

From Christoph's Personal Wiki

Revision as of 00:54, 29 June 2008

The GenBank (aka Genetic Sequence Data Bank) sequence database is an open-access, annotated collection of all publicly available nucleotide sequences and their protein translations.[1][2] The database is produced at the National Center for Biotechnology Information (NCBI).

Statistics

GenBank Flat File Release Statistics

Release | Date       | Loci       | Bases          | Sequences¹ | Size (GB)² | Size (GB)³
158.0   | 2007-02-15 | 67,218,344 | 71,292,211,453 | 67,218,344 | 251        | 263
161.0   | 2007-08-15 | 76,146,236 | 79,525,559,650 | 76,146,236 | 299        | 319
164.0   | 2008-02-15 | 82,853,685 | 85,759,586,764 | 82,853,685 | 321        | 342
166.0   | 2008-06-15 | 88,554,578 | 92,008,611,867 | 88,554,578 | 343        | 366

Source: [3]

¹ Reported sequences
² Uncompressed flatfiles, sequence files only
³ Uncompressed flatfiles, including the 'short directory', 'index', and the *.txt files
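
The release figures above lend themselves to a quick arithmetic check. The following is a minimal Python sketch that derives the mean number of bases per locus and the growth between releases; the function names are illustrative and not part of any NCBI tool.

```python
# Release figures copied from the statistics table: (release, loci, bases).
RELEASES = [
    ("158.0", 67_218_344, 71_292_211_453),
    ("161.0", 76_146_236, 79_525_559_650),
    ("164.0", 82_853_685, 85_759_586_764),
    ("166.0", 88_554_578, 92_008_611_867),
]


def mean_bases_per_locus(loci, bases):
    """Average sequence length, in bases, across all loci of a release."""
    return bases / loci


def growth_factor(old, new):
    """Ratio of a newer figure to an older one (1.0 means no change)."""
    return new / old


# Per-release average sequence length.
for release, loci, bases in RELEASES:
    print(f"{release}: {mean_bases_per_locus(loci, bases):.1f} bases per locus")

# Growth in loci from the first to the last release listed.
first, last = RELEASES[0], RELEASES[-1]
print(f"loci grew by a factor of {growth_factor(first[1], last[1]):.3f} "
      f"between releases {first[0]} and {last[0]}")
```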

Note: You can find the current release number by issuing the following command:

lynx --dump ftp://ftp.ncbi.nih.gov/genbank/GB_Release_Number
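
The same lookup can be scripted. The sketch below uses only the Python standard library and the URL from the lynx command above. It assumes that the file contains little more than the release number itself, which is not confirmed here, so the parser simply takes the first number it finds. The function names are illustrative.

```python
import re
import urllib.request

# URL taken from the lynx command; whether the host still serves it is not checked here.
RELEASE_URL = "ftp://ftp.ncbi.nih.gov/genbank/GB_Release_Number"


def parse_release_number(text):
    """Return the first number found in text (e.g. '166' or '166.0'), or None."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return match.group(0) if match else None


def fetch_release_number(url=RELEASE_URL, timeout=30):
    """Download the release-number file and parse it (requires network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        text = response.read().decode("ascii", errors="replace")
    return parse_release_number(text)
```

Calling `print(fetch_release_number())` would then print the current release number, provided the server is reachable.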

Selected Eukaryotic genomes

Note: The following are not part of the main NCBI GenBank database.

  • Fungi
    • Saccharomyces cerevisiae (Baker's Yeast)
    • Schizosaccharomyces pombe (Fission Yeast)
  • Plants
    • Arabidopsis thaliana
  • Vertebrates
    • Canis familiaris (Dog)
    • Gallus gallus (Chicken)
    • Homo sapiens (Human)
    • Mus musculus (Mouse)
    • Rattus norvegicus (Rat)
  • Invertebrates
    • Apis mellifera (Honey bee)
    • Caenorhabditis elegans (Nematode)
    • Drosophila melanogaster (Fruit fly)
  • Other
    • Encephalitozoon cuniculi (an intracellular parasite)

GenBank entries in the eukaryotic database

For details please refer to the NCBI genome FTP site at: ftp://ftp.ncbi.nih.gov/genomes/ and the list of completed eukaryotic genomes (NCBI).

See the complete list here: contig list (73,867 entries; 4.5 MB).

Flat file features

The following documents describe in detail the features of various flat files:

Index files

Each index is sorted alphabetically by its keys (accession numbers, keywords, authors, journals, or gene symbols). Each index key is followed by the identifiers of the sequence entries that contain it (LOCUS name, division abbreviation, and primary accession number). The division abbreviations are:

  1. PRI - primate sequences
  2. ROD - rodent sequences
  3. MAM - other mammalian sequences
  4. VRT - other vertebrate sequences
  5. INV - invertebrate sequences
  6. PLN - plant, fungal, and algal sequences
  7. BCT - bacterial sequences
  8. VRL - viral sequences
  9. PHG - bacteriophage sequences
  10. SYN - synthetic sequences
  11. UNA - unannotated sequences
  12. EST - EST sequences (expressed sequence tags)
  13. PAT - patent sequences
  14. STS - STS sequences (sequence tagged sites)
  15. GSS - GSS sequences (genome survey sequences)
  16. HTG - HTGS sequences (high throughput genomic sequences)
  17. HTC - HTC sequences (high throughput cDNA sequences)
  18. ENV - Environmental sampling sequences
  19. CON - Constructed sequences
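
When processing index entries in a script, the division list above can be kept as a simple lookup table. This is a minimal Python sketch; the names `DIVISIONS` and `describe_division` are illustrative, and the descriptions are copied from the list.

```python
# Division abbreviations and descriptions, in the order listed above.
DIVISIONS = {
    "PRI": "primate sequences",
    "ROD": "rodent sequences",
    "MAM": "other mammalian sequences",
    "VRT": "other vertebrate sequences",
    "INV": "invertebrate sequences",
    "PLN": "plant, fungal, and algal sequences",
    "BCT": "bacterial sequences",
    "VRL": "viral sequences",
    "PHG": "bacteriophage sequences",
    "SYN": "synthetic sequences",
    "UNA": "unannotated sequences",
    "EST": "EST sequences (expressed sequence tags)",
    "PAT": "patent sequences",
    "STS": "STS sequences (sequence tagged sites)",
    "GSS": "GSS sequences (genome survey sequences)",
    "HTG": "HTGS sequences (high throughput genomic sequences)",
    "HTC": "HTC sequences (high throughput cDNA sequences)",
    "ENV": "Environmental sampling sequences",
    "CON": "Constructed sequences",
}


def describe_division(code):
    """Return the description for a division abbreviation, ignoring case and padding."""
    return DIVISIONS.get(code.strip().upper(), "unknown division")


print(describe_division("pln"))
```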

See also

  • Genome projects
  • TAB file format (aka "gb2tab")
  • build_gbff_cu.pl — Build a non-redundant cumulative GenBank flatfile from a set of GenBank Incremental Update (GIU) flatfiles provided by the NCBI. Documentation can be found here.
  • ffidx.pl — Generate an index file containing the sequence identifier and byte-offset of each record in a flatfile which contains biological sequence data.
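
The idea behind a byte-offset index can be sketched in a few lines of Python. This is not the ffidx.pl implementation, which is not shown here. It assumes GenBank-style records whose first line starts with LOCUS, and it uses the LOCUS name as the sequence identifier. The sample records are made up for illustration.

```python
import io


def index_flatfile(handle):
    """Yield (locus_name, byte_offset) for each record in a binary file handle.

    The handle must be opened in binary mode so that offsets count bytes.
    """
    offset = 0
    for line in handle:
        if line.startswith(b"LOCUS"):
            fields = line.split()
            name = fields[1].decode("ascii", errors="replace") if len(fields) > 1 else ""
            yield name, offset
        offset += len(line)


# Two made-up, heavily abbreviated records, for demonstration only.
sample = (
    b"LOCUS       AB000001     12 bp    DNA     linear   PLN 01-JAN-2000\n"
    b"ORIGIN\n"
    b"        1 acgtacgtac gt\n"
    b"//\n"
    b"LOCUS       AB000002     8 bp    DNA     linear   PLN 01-JAN-2000\n"
    b"ORIGIN\n"
    b"        1 acgtacgt\n"
    b"//\n"
)

entries = list(index_flatfile(io.BytesIO(sample)))
for name, offset in entries:
    print(name, offset)
```

With such an index, a record can be read directly by seeking to its offset instead of scanning the whole flatfile.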

References

  1. Benton D (1990). "Recent changes in the GenBank On-line Service". Nucleic Acids Research, 18(6):1517–1520.
  2. Benson DA et al. (2006). "GenBank". Nucleic Acids Research, 34(Database):D16–D20.
  3. NCBI-GenBank Flat File - Distribution Release Notes ('gbrel.txt'), 2007-02-15. ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt

External links