Formatdb

From Christoph's Personal Wiki

formatdb must be used to format protein or nucleotide source databases before they can be searched by blastall, blastpgp, or MegaBLAST.[1] The source database may be in either FASTA or ASN.1 format. Although FASTA is the format most often used as input to formatdb, ASN.1 is advantageous for those who already use ASN.1 as the common source for other formats, such as the GenBank report. Once a source database file has been formatted by formatdb, BLAST no longer needs it. Please note that formatdb does not create non-redundant BLAST databases.

If you are having problems formatting a BLAST database, please scroll down to the "Formatdb Notes/Troubleshooting" section below, or contact the BLAST Desk at blast-help@ncbi.nlm.nih.gov

Command Line Options

A list of the command line options and the current version of formatdb may be obtained by executing formatdb with a single dash and no other options, as in:

formatdb -

The formatdb options are summarized below:

formatdb 2.2.5 arguments:

-t  Title for database file [String]  
       Optional
-i  Input file(s) for formatting (this parameter must be set)
       [File In]
-l  Logfile name: [File Out]  
       Optional
       default = formatdb.log
-p  Type of file
       T - protein
       F - nucleotide [T/F]  Optional
       default = T
-o  Parse options
       T - True: Parse SeqId and create indexes.
       F - False: Do not parse SeqId. Do not create indexes.
       [T/F]  Optional default = F

If the "-o" option is TRUE (and the source database is in FASTA format), then the database identifiers in the FASTA definition line must follow the convention of the FASTA Defline Format. Please see section "F Note on creating custom databases" below.
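Because "-o T" builds indexes from the sequence identifiers, it is worth checking the input before formatting. The sketch below lists identifiers that occur more than once in a FASTA file. It rests on the assumption that repeated identifiers cause indexing problems; the file name sample.fasta and its contents are made up for illustration, and the check does not validate the full FASTA Defline Format.

```shell
# Hypothetical sample input; lcl|hmm271 appears twice on purpose.
cat > sample.fasta <<'EOF'
>lcl|hmm271 first copy
MKVLAAGIVG
>lcl|hmm273
MKKLLPTAAA
>lcl|hmm271 second copy
MKVLAAGIVG
EOF

# Take the first token of every defline (without the leading '>')
# and list the identifiers that occur more than once.
dup_ids=$(grep '^>' sample.fasta | awk '{ print substr($1, 2) }' | sort | uniq -d)
echo "duplicate ids: $dup_ids"
```

An empty result means no identifier is repeated.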

-a  Input file is database in ASN.1 format (otherwise FASTA is expected)
       T - True,
       F - False.
       [T/F]  Optional default = F
-b  ASN.1 database in binary mode
       T - binary,
       F - text mode.
       [T/F]  Optional default = F

A source ASN.1 database may be represented in two formats: ASCII text and binary. The "-b" option, if TRUE, specifies that the input ASN.1 database is in binary format. The option is ignored for a FASTA input database.

-e  Input is a Seq-entry [T/F]  
       Optional
       default = F

A source ASN.1 database (either text ASCII or binary) may contain a Bioseq-set or just one Bioseq. In the latter case the "-e" switch should be set to TRUE.
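The three ASN.1 switches combine as sketched below. The helper asn_flags and the file name mydb.asn are hypothetical; only -a, -b and -e come from the option list above. The script prints the resulting command rather than running it.

```shell
# Hypothetical helper: map a description of an ASN.1 input to formatdb switches.
#   $1 = encoding: "text" or "binary"
#   $2 = content:  "set" (a Bioseq-set) or "single" (just one Bioseq)
asn_flags() {
    flags="-a T"                                  # input is ASN.1, not FASTA
    [ "$1" = "binary" ] && flags="$flags -b T"    # binary instead of ASCII text
    [ "$2" = "single" ] && flags="$flags -e T"    # one Bioseq, not a Bioseq-set
    echo "$flags"
}

# Show the command for a binary file holding a single Bioseq.
echo "formatdb -i mydb.asn $(asn_flags binary single)"
```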

-n  Base name for BLAST files [String]  
       Optional

This option allows one to produce BLAST databases with a different name than that of the original FASTA file. For instance, one could have a file named 'ecoli.nuc.txt' and format it as 'ecoli':

formatdb -i ecoli.nuc.txt -p F -o T -n ecoli
uncompress -c nr.z | formatdb -i stdin -o T -n nr

This is useful when the original FASTA file is needed only by formatdb, for example when disk space is tight: in the second command the compressed file is piped to formatdb on stdin, so an uncompressed copy is never written to disk.

-v  Database volume size in millions of letters [Integer] Optional
       default = 0
       range from 0 to <NULL>

This option breaks up large FASTA files into 'volumes' (each with a maximum size of 2 billion letters). As part of the creation of a volume formatdb writes a new type of BLAST database file, called an alias file, with the extension 'nal' or 'pal'.
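To get a feel for how a given -v value splits a database, the sketch below counts the letters in a FASTA file and estimates the number of volumes. The file name sample.fasta and the value v=1 are illustrative. The estimate assumes each volume is simply filled up to the limit, which may not match formatdb's exact splitting behaviour.

```shell
# Hypothetical sample input with 20 letters in total.
cat > sample.fasta <<'EOF'
>lcl|seq1
ACGTACGTAC
>lcl|seq2
GGGTTTAAAC
EOF

# Count sequence letters: drop deflines and newlines, then count characters.
letters=$(grep -v '^>' sample.fasta | tr -d '\n' | wc -c | tr -d ' ')

# -v is given in millions of letters.
v=1
volume_size=$((v * 1000000))

# Ceiling division gives the estimated number of volumes.
volumes=$(( (letters + volume_size - 1) / volume_size ))
echo "letters=$letters estimated_volumes=$volumes"
```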

-s  Create indexes limited only to accessions - sparse [T/F]  
       Optional
       default = F

This option limits the indices for the string identifiers (used by formatdb) to accessions (i.e., no locus names). This is especially useful for sequence sets such as ESTs, where the accession and locus names are identical. Formatdb runs faster and produces smaller temporary files if this option is used. It is strongly recommended for ESTs, STSs, GSSs, and HTGSs.

-L  Create an alias file with this name
       use the gifile arg (below) if set to calculate db size
       use the BLAST db specified with -i (above) [File Out]  Optional

This option produces a BLAST database alias file using a specified database, but limiting the sequences searched to those in the GI list given by the -F argument. See the section "Note on creating an alias file for a GI list" for more information.

-F  Gifile (file containing list of gi's) [File In]  Optional

This option can be used to specify the GI list for the alias file construction (-L option above) or to produce a binary GI list if the -B option (below) is set.

-B  Binary Gifile produced from the Gifile specified above [File Out]  Optional

This option specifies the name of a binary GI list file. This option should be used with the -F option. A text GI list may be specified with the -F option and the -B option will produce that GI list in binary format. The binary file is smaller and BLAST does not need to convert it, so it can be read faster.
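A sketch of how -L, -F and -B might be combined. The names mygis.txt, mygis.bin, subset and the protein database mydb are placeholders, and the text GI list is assumed to hold one GI per line. The exact invocations are inferred from the option descriptions above and are not verified against a particular formatdb version; in particular, whether -i must also be given together with -B is not stated here. The formatdb calls run only if the program and a formatted mydb are present.

```shell
# Text GI list, assumed format: one GI number per line.
printf '%s\n' 129295 129296 68738 > mygis.txt

# Sanity check: count lines that are not a plain integer.
bad=$(grep -c -v -E '^[0-9]+$' mygis.txt || true)
echo "non-numeric lines: $bad"

if [ "$bad" -eq 0 ] && [ -f mydb.pin ] && command -v formatdb >/dev/null 2>&1; then
    # Convert the text GI list to its binary form.
    formatdb -F mygis.txt -B mygis.bin
    # Create the alias 'subset', restricting mydb to the listed GIs.
    formatdb -i mydb -p T -L subset -F mygis.txt
fi
```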

-T  Taxid file to set the taxonomy ids in ASN.1 deflines [File In]  Optional

This option specifies a text file containing pairs of a Seq-id string and a numeric taxonomy id, separated by a single whitespace character (space or tab), one pair per line. Gi numbers can also be used in place of Seq-id strings. Examples:

% cat seqid-taxid.txt
  lcl|hmm271 4                                                               
  lcl|hmm273 6                                                               
  lcl|hmm276 9                                                               
% cat gi-taxid.txt
  129295 9031                                                         
  129296 9031
  68738 9031
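Before passing a file to -T, it can help to verify that every line really is an identifier followed by a numeric taxonomy id. The sketch below does this for the first sample file above. The awk check is an illustration and not part of formatdb; seqs.fasta is a hypothetical protein FASTA file, and using "-o T" together with -T is an assumption.

```shell
cat > seqid-taxid.txt <<'EOF'
lcl|hmm271 4
lcl|hmm273 6
lcl|hmm276 9
EOF

# Count lines that do not consist of exactly two fields
# with a purely numeric second field.
bad_lines=$(awk 'NF != 2 || $2 !~ /^[0-9]+$/ { n++ } END { print n + 0 }' seqid-taxid.txt)
echo "malformed lines: $bad_lines"

# Run formatdb only if the list is clean, the (hypothetical) input exists,
# and formatdb is installed.
if [ "$bad_lines" -eq 0 ] && [ -f seqs.fasta ] && command -v formatdb >/dev/null 2>&1; then
    formatdb -i seqs.fasta -p T -o T -T seqid-taxid.txt
fi
```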

Examples

  • Simple examples:
formatdb -i input_db -p F -o T   # for nucleotide
formatdb -i input_db -p T -o T   # for protein

Type the following at the command prompt:

formatdb -i databasefile -p F -o -n basename

The output files will be:

  • basename.nhr
  • basename.nin
  • basename.nsq
  • formatdb.log
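A quick way to confirm that formatting succeeded is to check for the expected files. The function name check_blast_db is hypothetical, and it looks only for the three nucleotide files listed above.

```shell
# Hypothetical helper: report which of the three nucleotide database
# files are missing for the base name given as $1.
check_blast_db() {
    missing=""
    for ext in nhr nin nsq; do
        [ -f "$1.$ext" ] || missing="$missing $1.$ext"
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "ok: $1"
}

# Usage after running formatdb with -n basename:
check_blast_db basename || echo "database 'basename' looks incomplete"
```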

References

  1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990). "Basic local alignment search tool". J Mol Biol 215(3):403-410. PMID:2231712.
