Difference between revisions of "Formatdb"

From Christoph's Personal Wiki
(Started article)
 
(External links)
 
(6 intermediate revisions by the same user not shown)
Line 1: Line 1:
'''Formatdb''' must be used in order to format protein or nucleotide source databases before these databases can be searched by blastall, blastpgp, or MegaBLAST. The source database may be in either FASTA or ASN.1 format. Although the FASTA format is most often used as input to formatdb, the use of ASN.1 is advantageous for those who are using ASN.1 as the common source for other formats such as the GenBank report. Once a source database file has been formatted by formatdb it is not needed by BLAST. Please note that formatdb does not create non-redundant blast databases.
+
'''Formatdb''' must be used in order to format protein or nucleotide source databases before these databases can be searched by [[blastall]], <tt>blastpgp</tt>, or <tt>MegaBLAST</tt>. <tt>formatdb</tt> must be used in order to format protein or nucleotide source databases before these databases can be searched by [[BLAST]].<ref name="Altschul1990">Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990). "Basic local alignment search tool". ''J Mol Biol'' '''215'''(3):403-410. PMID:2231712.</ref> The source database may be in either FASTA or ASN.1 format. Although the [[FASTA format]] is most often used as input to <tt>formatdb</tt>, the use of ASN.1 is advantageous for those who are using ASN.1 as the common source for other formats such as the GenBank report. Once a source database file has been formatted by <tt>formatdb</tt> it is not needed by BLAST. Please note that <tt>formatdb</tt> does not create non-redundant blast databases.
  
 
If you are having problems formatting a BLAST databases please scroll down to the "Formatdb Notes/Troubleshooting" section below. Or contact the BLAST Desk at blast-help@ncbi.nlm.nih.gov
 
If you are having problems formatting a BLAST databases please scroll down to the "Formatdb Notes/Troubleshooting" section below. Or contact the BLAST Desk at blast-help@ncbi.nlm.nih.gov
  
== Command Line Options ==
+
==Command Line Options==
 +
A list of the command line options and the current version for <tt>formatdb</tt> may be obtained by executing <tt>formatdb</tt> without options, as in:
  
A list of the command line options and the current version for formatdb may be obtained by executing formatdb without options, as in:
+
formatdb -
  
    formatdb -
+
The <tt>formatdb</tt> options are summarized below:
  
The formatdb options are summarized below:
+
<tt>formatdb</tt> 2.2.5 arguments:
  
formatdb 2.2.5 arguments:
+
-t  Title for database file [String]   
 
+
    -t  Title for database file [String]   
+
 
         Optional
 
         Optional
    -i  Input file(s) for formatting (this parameter must be set)
+
-i  Input file(s) for formatting (this parameter must be set)
 
         [File In]
 
         [File In]
    -l  Logfile name: [File Out]   
+
-l  Logfile name: [File Out]   
 
         Optional
 
         Optional
            default = formatdb.log
+
        default = formatdb.log
    -p  Type of file
+
-p  Type of file
 
         T - protein
 
         T - protein
 
         F - nucleotide [T/F]  Optional
 
         F - nucleotide [T/F]  Optional
 
         default = T
 
         default = T
  
    -o  Parse options
+
-o  Parse options
 
         T - True: Parse SeqId and create indexes.
 
         T - True: Parse SeqId and create indexes.
 
         F - False: Do not parse SeqId. Do not create indexes.
 
         F - False: Do not parse SeqId. Do not create indexes.
 
         [T/F]  Optional default = F
 
         [T/F]  Optional default = F
  
If the "-o" option is TRUE (and the source database is in FASTA format), then the database identifiers in the FASTA definition line must follow the convention of the FASTA Defline Format. Please see section "F Note on creating custom databases" below.
+
If the "<code>-o</code>" option is TRUE (and the source database is in FASTA format), then the database identifiers in the FASTA definition line must follow the convention of the FASTA Defline Format. Please see section "F Note on creating custom databases" below.
  
    -a  Input file is database in ASN.1 format (otherwise FASTA is expected)
+
-a  Input file is database in ASN.1 format (otherwise FASTA is expected)
 
         T - True,
 
         T - True,
 
         F - False.
 
         F - False.
 
         [T/F]  Optional default = F
 
         [T/F]  Optional default = F
  
    -b  ASN.1 database in binary mode
+
-b  ASN.1 database in binary mode
 
         T - binary,
 
         T - binary,
 
         F - text mode.
 
         F - text mode.
 
         [T/F]  Optional default = F
 
         [T/F]  Optional default = F
  
A source ASN.1 database may be represented in two formats: ascii text and binary. The "-b" option, if TRUE, specifies that input ASN.1 database is in binary format. The option is ignored in case of FASTA input database.
+
A source ASN.1 database may be represented in two formats: ASCII text and binary. The "<code>-b</code>" option, if <code>TRUE</code>, specifies that input ASN.1 database is in binary format. The option is ignored in case of FASTA input database.
  
    -e  Input is a Seq-entry [T/F]   
+
-e  Input is a Seq-entry [T/F]   
 
         Optional
 
         Optional
 
         default = F
 
         default = F
  
A source ASN.1 database (either text ascii or binary) may contain a Bioseq-set or just one Bioseq. In the latter case the "-e" switch should be set to TRUE.
+
A source ASN.1 database (either text ASCII or binary) may contain a Bioseq-set or just one Bioseq. In the latter case the "<code>-e</code>" switch should be set to <code>TRUE</code>.
  
    -n  Base name for BLAST files [String]   
+
-n  Base name for BLAST files [String]   
 
         Optional
 
         Optional
  
This options allows one to produce BLAST databases with a different name than that of the original FASTA file. For instance, one could have a file named 'ecoli.nuc.txt' and and format it as 'ecoli':
+
This options allows one to produce BLAST databases with a different name than that of the original FASTA file. For instance, one could have a file named '<code>ecoli.nuc.txt</code>' and and format it as '<code>ecoli</code>':
  
        formatdb -i ecoli.nuc.txt -p F -o T -n ecoli
+
formatdb -i ecoli.nuc.txt -p F -o T -n ecoli
 +
uncompress -c nr.z | formatdb -i stdin -o T -n nr
  
        uncompress -c nr.z | formatdb -i stdin -o T -n nr
+
This can be used in situations where the original FASTA file is not required other than by <tt>formatdb</tt>. This can help in a situation where disk-space is tight.
  
This can be used in situations where the original FASTA file is not required other than by formatdb. This can help in a situation where disk-space is tight.
+
-v  Database volume size in millions of letters [Integer] Optional
 
+
    -v  Database volume size in millions of letters [Integer] Optional
+
 
         default = 0
 
         default = 0
 
         range from 0 to <NULL>
 
         range from 0 to <NULL>
  
This option breaks up large FASTA files into 'volumes' (each with a maximum  size of 2 billion letters). As part of the creation of a volume formatdb writes a new type of BLAST database file, called an alias file, with the extension 'nal' or 'pal'.
+
This option breaks up large FASTA files into 'volumes' (each with a maximum  size of 2 billion letters). As part of the creation of a volume <tt>formatdb</tt> writes a new type of BLAST database file, called an alias file, with the extension 'nal' or 'pal'.
  
    -s  Create indexes limited only to accessions - sparse [T/F]   
+
-s  Create indexes limited only to accessions - sparse [T/F]   
 
         Optional
 
         Optional
 
         default = F
 
         default = F
  
This option limits the indices for the string identifiers (used by formatdb) to accessions (i.e., no locus names). This is especially useful for sequences sets like the EST's where the accession and locus names are identical. Formatdb runs faster and produces smaller temporary files if this option is used. It is strongly recommended for EST's, STS's, GSS's, and HTGS's.
+
This option limits the indices for the string identifiers (used by <tt>formatdb</tt>) to accessions (i.e., no locus names). This is especially useful for sequences sets like the EST's where the accession and locus names are identical. <tt>Formatdb</tt> runs faster and produces smaller temporary files if this option is used. It is strongly recommended for EST's, STS's, GSS's, and HTGS's.
  
    -L  Create an alias file with this name
+
-L  Create an alias file with this name
 
         use the gifile arg (below) if set to calculate db size
 
         use the gifile arg (below) if set to calculate db size
 
         use the BLAST db specified with -i (above) [File Out]  Optional
 
         use the BLAST db specified with -i (above) [File Out]  Optional
  
This option produces a BLAST database alias file using a specified database, but limiting the sequences searched to those in the GI list given by the -F argument. See the section "Note on creating an alias file for a GI list" for more information.
+
This option produces a BLAST database alias file using a specified database, but limiting the sequences searched to those in the GI list given by the <code>-F</code> argument. See the section "Note on creating an alias file for a GI list" for more information.
 +
 
 +
-F  Gifile (file containing list of gi's) [File In]  Optional
 +
 
 +
This option can be used to specify the GI list for the alias file construction (<code>-L</code> option above) or to produce a binary GI list if the <code>-B</code> option (below) is set.
 +
 
 +
-B  Binary Gifile produced from the Gifile specified above [File Out]  Optional
 +
 
 +
This option specifies the name of a binary GI list file. This option should be used with the <code>-F</code> option. A text GI list may be specified with the <code>-F</code> option and the <code>-B</code> option will produce that GI list in binary format. The binary file is smaller and BLAST does not need to convert it, so it can be read faster.
 +
 
 +
-T  Taxid file to set the taxonomy ids in ASN.1 deflines [File In]  Optional
 +
 
 +
This file specifies a text file containing Seq-id string/numeric taxonomy id pairs, separated by a single white space character (or tab), one per line. Gi numbers can also be used in place of Seq-id strings. Examples:
 +
 
 +
% cat seqid-taxid.txt
 +
  lcl|hmm271 4                                                             
 +
  lcl|hmm273 6                                                             
 +
  lcl|hmm276 9                                                             
 +
% cat gi-taxid.txt
 +
  129295 9031                                                       
 +
  129296 9031
 +
  68738 9031
  
    -F Gifile (file containing list of gi's) [File In] Optional
+
==Examples==
 +
*Simple examples:
 +
  formatdb –i input_db –p F –o T  #for nucleotide
 +
  formatdb –i input_db –p T –o T  #for protein
  
This option can be used to specify the GI list for the alias file construction (-L option above) or to produce a binary GI list if the -B option (below) is set.
+
Type the following at the command prompt:
 +
formatdb -i databasefile -p F -o -n basename
  
    -B  Binary Gifile produced from the Gifile specified above [File Out]  Optional
+
The output files will be:
 +
*<code>basename.nhr</code>
 +
*<code>basename.nin</code>
 +
*<code>basename.nsq</code>
 +
*<code>formatdb.log</code>
  
This option specifies the name of a binary GI list file. This option should be used with the -F option. A text GI list may be specified with the -F option and the -B option will produce that GI list in binary format. The binary file is smaller and BLAST does not need to convert it, so it can be read faster.
+
==See also==
 +
*[[BLAST]]
 +
*[[FASTA format]]
 +
*[[Blastall]]
  
== External links ==
+
==References==
* [[http://www.ncbi.nlm.nih.gov/Web/Newsltr/Summer03/blast.html Using the Advanced Features of Formatdb]] &mdash; by NCBI
+
<references/>
 +
==External links==
 +
*[ftp://ftp.ncbi.nlm.nih.gov/blast/ download formatdb] (and other related software)
 +
===Documentation===
 +
*[ftp://ftp.ncbi.nlm.nih.gov/blast/documents/formatdb.html official documentation]
 +
*[http://www.ncbi.nlm.nih.gov/Web/Newsltr/Summer03/blast.html Using the Advanced Features of Formatdb] &mdash; by NCBI
 +
*[http://biowulf.nih.gov/apps/blast/doc/formatdb.html Formatdb README]
 +
*[http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/formatdb_fastacmd.html Program Parameters for formatdb and fastacmd]
 +
===Other===
 +
*[http://club.mandriva.com/xwiki/bin/view/rpms/Category/Sciences/Biology Sciences/Biology packages] &mdash; found on Mandriva Club
  
[[Category:Academic Research]]
 
 
[[Category:Bioinformatics]]
 
[[Category:Bioinformatics]]

Latest revision as of 01:29, 24 July 2008

formatdb must be used to format protein or nucleotide source databases before they can be searched by BLAST[1] programs such as blastall, blastpgp, or MegaBLAST. The source database may be in either FASTA or ASN.1 format. Although the FASTA format is most often used as input to formatdb, ASN.1 is advantageous for those who use ASN.1 as the common source for other formats such as the GenBank report. Once a source database file has been formatted by formatdb, BLAST no longer needs the source file. Please note that formatdb does not create non-redundant BLAST databases.

If you are having problems formatting a BLAST database, please scroll down to the "Formatdb Notes/Troubleshooting" section below, or contact the BLAST Desk at blast-help@ncbi.nlm.nih.gov

Command Line Options

A list of the command line options and the current version of formatdb may be obtained by executing formatdb with a single dash and no other options, as in:

formatdb -

The formatdb options are summarized below:

formatdb 2.2.5 arguments:

-t  Title for database file [String]  
       Optional
-i  Input file(s) for formatting (this parameter must be set)
       [File In]
-l  Logfile name: [File Out]  
       Optional
       default = formatdb.log
-p  Type of file
       T - protein
       F - nucleotide [T/F]  Optional
       default = T
-o  Parse options
       T - True: Parse SeqId and create indexes.
       F - False: Do not parse SeqId. Do not create indexes.
       [T/F]  Optional default = F

If the "-o" option is TRUE (and the source database is in FASTA format), then the database identifiers in the FASTA definition line must follow the convention of the FASTA Defline Format. Please see section "F Note on creating custom databases" below.

-a  Input file is database in ASN.1 format (otherwise FASTA is expected)
       T - True,
       F - False.
       [T/F]  Optional default = F
-b  ASN.1 database in binary mode
       T - binary,
       F - text mode.
       [T/F]  Optional default = F

A source ASN.1 database may be represented in two formats: ASCII text and binary. The "-b" option, if TRUE, specifies that the input ASN.1 database is in binary format. The option is ignored for FASTA input.

-e  Input is a Seq-entry [T/F]  
       Optional
       default = F

A source ASN.1 database (either ASCII text or binary) may contain a Bioseq-set or just one Bioseq. In the latter case the "-e" switch should be set to TRUE.

-n  Base name for BLAST files [String]  
       Optional

This option allows one to produce BLAST databases with a different name than that of the original FASTA file. For instance, one could have a file named 'ecoli.nuc.txt' and format it as 'ecoli', or read the FASTA data from stdin and name the result 'nr':

formatdb -i ecoli.nuc.txt -p F -o T -n ecoli
uncompress -c nr.z | formatdb -i stdin -o T -n nr

Reading from stdin, as in the second command, can be used in situations where the original FASTA file is not required other than by formatdb, for example when disk space is tight.
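
The stdin form can be wrapped in a small shell function. This is only a sketch under stated assumptions, not part of the formatdb distribution: the function name format_from_compressed is hypothetical, and gzip -dc stands in for uncompress -c on the assumption that the input is a compressed file gzip can read. It uses only the -i, -p, -o, and -n options described above.

```shell
# Hypothetical helper: stream a compressed FASTA file into formatdb
# without writing the uncompressed file to disk.
# Usage: format_from_compressed <compressed_fasta> <base_name> <T|F>
#   The third argument is the -p value: T = protein, F = nucleotide.
format_from_compressed() {
    src=$1
    base=$2
    is_protein=$3

    case $is_protein in
        T|F) ;;
        *) echo "third argument must be T or F" >&2; return 2 ;;
    esac

    # gzip -dc writes the decompressed data to stdout; formatdb reads it
    # through "-i stdin", so no uncompressed copy is kept on disk.
    gzip -dc "$src" | formatdb -i stdin -p "$is_protein" -o T -n "$base"
}
```

A call such as format_from_compressed nr.gz nr T would then produce a protein database named nr (the file name nr.gz is illustrative).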

-v  Database volume size in millions of letters [Integer] Optional
       default = 0
       range from 0 to <NULL>

This option breaks up large FASTA files into 'volumes' (each with a maximum size of 2 billion letters). As part of the creation of a volume, formatdb writes a new type of BLAST database file, called an alias file, with the extension 'nal' (nucleotide) or 'pal' (protein).
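
As a rough illustration of the arithmetic, the number of volumes can be estimated by dividing the total number of letters by the -v value (given in millions of letters) and rounding up. The function name estimate_volumes is hypothetical, and the sketch assumes every volume is filled to the limit; formatdb presumably splits at sequence boundaries, so the real count may be higher. The sketch also treats -v 0 (the default) as "no split requested", which is an assumption rather than documented behavior.

```shell
# Hypothetical helper: estimate how many volumes "-v <size>" would produce.
# Usage: estimate_volumes <total_letters> <v_value_in_millions>
estimate_volumes() {
    total=$1
    v=$2
    if [ "$v" -le 0 ]; then
        # Assumption: -v 0 (the default) means no explicit splitting.
        echo 1
        return 0
    fi
    per_volume=$((v * 1000000))
    # Ceiling division, so that a partial last volume is counted.
    echo $(( (total + per_volume - 1) / per_volume ))
}
```

For example, estimate_volumes 4500000000 2000 estimates the volume count for 4.5 billion letters split with -v 2000.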

-s  Create indexes limited only to accessions - sparse [T/F]  
       Optional
       default = F

This option limits the indices for the string identifiers (used by formatdb) to accessions (i.e., no locus names). This is especially useful for sequence sets such as ESTs, where the accession and locus names are identical. formatdb runs faster and produces smaller temporary files if this option is used. It is strongly recommended for ESTs, STSs, GSSs, and HTGSs.
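
A sparse-index run might be wrapped as follows. This is a sketch only: the function name format_sparse_nucleotide is hypothetical, and it assumes -s is combined with -o T, since indexes are only created when SeqIds are parsed.

```shell
# Hypothetical helper: format a nucleotide FASTA file (for example an EST
# set) with SeqId parsing on and indexes limited to accessions.
# Usage: format_sparse_nucleotide <fasta_file> <base_name>
format_sparse_nucleotide() {
    fasta=$1
    base=$2
    if [ ! -r "$fasta" ]; then
        echo "cannot read input file: $fasta" >&2
        return 1
    fi
    # -p F: nucleotide; -o T: parse SeqIds and create indexes;
    # -s T: limit those indexes to accessions.
    formatdb -i "$fasta" -p F -o T -s T -n "$base"
}
```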

-L  Create an alias file with this name
       use the gifile arg (below) if set to calculate db size
       use the BLAST db specified with -i (above) [File Out]  Optional

This option produces a BLAST database alias file using a specified database, but limiting the sequences searched to those in the GI list given by the -F argument. See the section "Note on creating an alias file for a GI list" for more information.

-F  Gifile (file containing list of gi's) [File In]  Optional

This option can be used to specify the GI list for the alias file construction (-L option above) or to produce a binary GI list if the -B option (below) is set.

-B  Binary Gifile produced from the Gifile specified above [File Out]  Optional

This option specifies the name of a binary GI list file. This option should be used with the -F option. A text GI list may be specified with the -F option and the -B option will produce that GI list in binary format. The binary file is smaller and BLAST does not need to convert it, so it can be read faster.

-T  Taxid file to set the taxonomy ids in ASN.1 deflines [File In]  Optional

This option specifies a text file containing pairs of a Seq-id string and a numeric taxonomy id, separated by a single whitespace character (space or tab), one pair per line. GI numbers can also be used in place of Seq-id strings. Examples:

% cat seqid-taxid.txt
  lcl|hmm271 4
  lcl|hmm273 6
  lcl|hmm276 9
% cat gi-taxid.txt
  129295 9031
  129296 9031
  68738 9031
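
Because the file format is so simple, it can be checked before it is passed to -T. The following sketch is not part of formatdb; the function name check_taxid_file is hypothetical. It flags any line that does not consist of exactly two whitespace-separated fields with a numeric second field, which means blank lines are flagged as well.

```shell
# Hypothetical helper: check that every line of a taxid file has exactly
# two whitespace-separated fields and that the second one is a number.
# Prints the offending lines (prefixed with their line number) and
# returns non-zero if any are found.
check_taxid_file() {
    awk '
        NF != 2 || $2 !~ /^[0-9]+$/ {
            printf "line %d: %s\n", NR, $0
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

Running check_taxid_file seqid-taxid.txt on the first example file above should print nothing and return success.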

Examples

  • Simple examples:
formatdb -i input_db -p F -o T   #for nucleotide
formatdb -i input_db -p T -o T   #for protein

Type the following at the command prompt:

formatdb -i databasefile -p F -o T -n basename

The output files will be:

  • basename.nhr
  • basename.nin
  • basename.nsq
  • formatdb.log
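
The naming pattern of the output files can be sketched as a small shell function. The function name expected_outputs is hypothetical. The protein extensions (.phr, .pin, .psq) are not listed above; they are filled in here by analogy with the 'nal'/'pal' alias extensions and the usual BLAST naming convention. The sketch mirrors the list above and does not cover formatdb.log or any additional index or alias files.

```shell
# Hypothetical helper: list the database files a single-volume formatdb
# run is expected to write for a given base name and -p value.
# Usage: expected_outputs <base_name> <T|F>
expected_outputs() {
    base=$1
    case $2 in
        T) prefix=p ;;   # protein database: .phr .pin .psq
        F) prefix=n ;;   # nucleotide database: .nhr .nin .nsq
        *) echo "second argument must be T or F" >&2; return 2 ;;
    esac
    for suffix in hr in sq; do
        echo "${base}.${prefix}${suffix}"
    done
}
```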

See also

  • BLAST
  • FASTA format
  • Blastall

References

  1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990). "Basic local alignment search tool". J Mol Biol 215(3):403-410. PMID:2231712.

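External links

  • download formatdb (and other related software): ftp://ftp.ncbi.nlm.nih.gov/blast/

Documentation

  • official documentation: ftp://ftp.ncbi.nlm.nih.gov/blast/documents/formatdb.html
  • Using the Advanced Features of Formatdb, by NCBI: http://www.ncbi.nlm.nih.gov/Web/Newsltr/Summer03/blast.html
  • Formatdb README: http://biowulf.nih.gov/apps/blast/doc/formatdb.html
  • Program Parameters for formatdb and fastacmd: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/formatdb_fastacmd.html

Other

  • Sciences/Biology packages, found on Mandriva Club: http://club.mandriva.com/xwiki/bin/view/rpms/Category/Sciences/Biology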