Formatdb
formatdb must be used to format protein or nucleotide source databases before they can be searched by blastall, blastpgp, or MegaBLAST.[1] The source database may be in either FASTA or ASN.1 format. Although the FASTA format is most often used as input to formatdb, ASN.1 is advantageous for those who use it as the common source for other formats such as the GenBank report. Once a source database file has been formatted by formatdb, BLAST no longer needs it. Please note that formatdb does not create non-redundant BLAST databases.
If you are having problems formatting a BLAST database, please scroll down to the "Formatdb Notes/Troubleshooting" section below, or contact the BLAST Desk at blast-help@ncbi.nlm.nih.gov.
Command Line Options
A list of the command line options and the current version for formatdb may be obtained by executing formatdb without options, as in:
formatdb -
The formatdb options are summarized below:
formatdb 2.2.5 arguments:
-t Title for database file [String] Optional
-i Input file(s) for formatting (this parameter must be set) [File In]
-l Logfile name: [File Out] Optional default = formatdb.log
-p Type of file T - protein F - nucleotide [T/F] Optional default = T
-o Parse options T - True: Parse SeqId and create indexes. F - False: Do not parse SeqId. Do not create indexes. [T/F] Optional default = F
If the "-o" option is TRUE (and the source database is in FASTA format), then the database identifiers in the FASTA definition line must follow the convention of the FASTA Defline Format. Please see section "F Note on creating custom databases" below.
-a Input file is database in ASN.1 format (otherwise FASTA is expected) T - True, F - False. [T/F] Optional default = F
-b ASN.1 database in binary mode T - binary, F - text mode. [T/F] Optional default = F
A source ASN.1 database may be represented in two formats: ASCII text and binary. The "-b" option, if TRUE, specifies that the input ASN.1 database is in binary format. The option is ignored if the input database is in FASTA format.
-e Input is a Seq-entry [T/F] Optional default = F
A source ASN.1 database (either text ASCII or binary) may contain a Bioseq-set or just one Bioseq. In the latter case the "-e" switch should be set to TRUE.
-n Base name for BLAST files [String] Optional
This option allows one to produce BLAST databases with a different name than that of the original FASTA file. For instance, one could have a file named 'ecoli.nuc.txt' and format it as 'ecoli':
formatdb -i ecoli.nuc.txt -p F -o T -n ecoli
uncompress -c nr.z | formatdb -i stdin -o T -n nr
This can be used in situations where the original FASTA file is not required other than by formatdb. This can help in a situation where disk-space is tight.
-v Database volume size in millions of letters [Integer] Optional default = 0 range from 0 to <NULL>
This option breaks up large FASTA files into 'volumes' (each with a maximum size of 2 billion letters). As part of the creation of a volume formatdb writes a new type of BLAST database file, called an alias file, with the extension 'nal' or 'pal'.
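To get a feel for how a -v value maps to volumes, the following sketch counts the letters in a FASTA file and estimates how many volumes would result. The function names count_letters and estimate_volumes are illustrative and not part of formatdb. The estimate is only a rough guide: it assumes the volume limit is simply the -v value times one million letters, and formatdb's actual splitting may differ.

```shell
#!/bin/sh
# Illustrative helpers (not part of formatdb).

# Count sequence letters in a FASTA file.
# Definition lines (starting with ">") are skipped; blanks, tabs and
# carriage returns inside sequence lines are not counted.
count_letters() {
    awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) } END { print n + 0 }' "$1"
}

# Estimate the number of volumes for a given letter count and a -v value
# expressed in millions of letters (assumption: limit = v * 1,000,000).
estimate_volumes() {
    letters=$1
    v_millions=$2
    per_volume=$((v_millions * 1000000))
    # Integer ceiling division.
    echo $(( (letters + per_volume - 1) / per_volume ))
}
```

For a hypothetical file big.fasta and -v 500, the estimate would be obtained with: estimate_volumes "$(count_letters big.fasta)" 500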
-s Create indexes limited only to accessions - sparse [T/F] Optional default = F
This option limits the indices for the string identifiers (used by formatdb) to accessions (i.e., no locus names). This is especially useful for sequence sets like the ESTs, where the accession and locus names are identical. Formatdb runs faster and produces smaller temporary files if this option is used. It is strongly recommended for ESTs, STSs, GSSs, and HTGSs.
-L Create an alias file with this name; use the gifile arg (below) if set to calculate db size; use the BLAST db specified with -i (above) [File Out] Optional
This option produces a BLAST database alias file using a specified database, but limiting the sequences searched to those in the GI list given by the -F argument. See the section "Note on creating an alias file for a GI list" for more information.
-F Gifile (file containing list of gi's) [File In] Optional
This option can be used to specify the GI list for the alias file construction (-L option above) or to produce a binary GI list if the -B option (below) is set.
-B Binary Gifile produced from the Gifile specified above [File Out] Optional
This option specifies the name of a binary GI list file. This option should be used with the -F option. A text GI list may be specified with the -F option and the -B option will produce that GI list in binary format. The binary file is smaller and BLAST does not need to convert it, so it can be read faster.
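A text GI list is a plain file with one GI number per line. As a sketch, such a list can be pulled from FASTA definition lines that begin with gi|<number>|. The function name extract_gis and the file names mentioned below are hypothetical, and deflines in any other form are simply skipped.

```shell
#!/bin/sh
# Illustrative helper (not part of formatdb).

# Print the GI number from every definition line of the form
# ">gi|<number>|...", one per line, sorted numerically without duplicates.
extract_gis() {
    sed -n 's/^>gi|\([0-9][0-9]*\)|.*/\1/p' "$1" | sort -n -u
}
```

For example, extract_gis subset.fasta > subset.gi writes a text GI list that can then be named with -F, with -B naming the binary file to produce, as described above. Check the usage message of your formatdb version for any other arguments it requires in that mode.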
-T Taxid file to set the taxonomy ids in ASN.1 deflines [File In] Optional
This option specifies a text file containing Seq-id string/numeric taxonomy id pairs, separated by a single whitespace character (space or tab), one pair per line. Gi numbers can also be used in place of Seq-id strings. Examples:
% cat seqid-taxid.txt
lcl|hmm271 4
lcl|hmm273 6
lcl|hmm276 9
% cat gi-taxid.txt
129295 9031
129296 9031
68738 9031
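Because the taxid file format is strict (one identifier, a single space or tab, then a numeric taxonomy id), it can be worth checking a file before passing it to -T. The following is a minimal sketch of such a check; check_taxid_file is a hypothetical helper, not something shipped with formatdb.

```shell
#!/bin/sh
# Illustrative helper (not part of formatdb).

# Report every line that is not "<identifier><single space or tab><digits>".
# Returns 0 if all lines are well formed, 1 otherwise.
check_taxid_file() {
    awk '
        !/^[^ \t]+[ \t][0-9]+$/ {
            printf "line %d is malformed: %s\n", NR, $0
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

Running check_taxid_file seqid-taxid.txt on the first example file above would print nothing and return success, since every line is an identifier, one space, and a number.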
Examples
- Simple examples:
formatdb -i input_db -p F -o T #for nucleotide
formatdb -i input_db -p T -o T #for protein
Type the following at the command prompt:
formatdb -i databasefile -p F -o F -n basename
The output files will be:
basename.nhr
basename.nin
basename.nsq
formatdb.log
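After a run, a quick check that the expected files exist can catch a failed or partial format. The sketch below uses hypothetical helper names (expected_files, check_outputs). The nucleotide extensions come from the list above; the protein extensions (phr, pin, psq) are not listed in this article and are included on the assumption that they follow the usual naming. Additional index files written with -o T and multi-volume databases created with -v are not checked.

```shell
#!/bin/sh
# Illustrative helpers (not part of formatdb).

# Print the three core database file names for a base name.
# Second argument mirrors the -p option: F = nucleotide, T = protein.
expected_files() {
    base=$1
    case $2 in
        F) prefix=n ;;
        T) prefix=p ;;   # assumption: protein files use the "p" prefix
        *) echo "type must be T or F" >&2; return 1 ;;
    esac
    printf '%s\n' "$base.${prefix}hr" "$base.${prefix}in" "$base.${prefix}sq"
}

# Return 0 if all core files exist, otherwise list the missing ones and return 1.
check_outputs() {
    files=$(expected_files "$1" "$2") || return 1
    missing=$(printf '%s\n' "$files" | while IFS= read -r f; do
        [ -f "$f" ] || printf '%s\n' "$f"
    done)
    if [ -n "$missing" ]; then
        printf 'missing files:\n%s\n' "$missing" >&2
        return 1
    fi
    return 0
}
```

For the command above, check_outputs basename F would succeed only if basename.nhr, basename.nin, and basename.nsq are all present in the current directory.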
References
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990). "Basic local alignment search tool". J Mol Biol 215(3):403-410. PMID:2231712.
External links
- download formatdb (and other related software)
Documentation
- official documentation
- Using the Advanced Features of Formatdb — by NCBI
- Formatdb README
- Program Parameters for formatdb and fastacmd
Other
- Sciences/Biology packages — found on Mandriva Club