Difference between revisions of "Formatdb"

From Christoph's Personal Wiki
Revision as of 08:35, 29 December 2006

Formatdb must be used to format protein or nucleotide source databases before they can be searched by blastall, blastpgp, or MegaBLAST. The source database may be in either FASTA or ASN.1 format. Although FASTA is the most common input to formatdb, ASN.1 is advantageous for those who already use ASN.1 as the common source for other formats, such as the GenBank report. Once a source database file has been formatted by formatdb, BLAST no longer needs the source file. Please note that formatdb does not create non-redundant BLAST databases.

If you are having problems formatting a BLAST database, please see the "Formatdb Notes/Troubleshooting" section below, or contact the BLAST Desk at blast-help@ncbi.nlm.nih.gov.

Command Line Options

A list of the command line options and the current version for formatdb may be obtained by executing formatdb without options, as in:

formatdb -

The formatdb options are summarized below:

formatdb 2.2.5 arguments:

-t  Title for database file [String]  
       Optional
-i  Input file(s) for formatting (this parameter must be set)
       [File In]
-l  Logfile name: [File Out]  
       Optional
       default = formatdb.log
-p  Type of file
       T - protein
       F - nucleotide [T/F]  Optional
       default = T
-o  Parse options
       T - True: Parse SeqId and create indexes.
       F - False: Do not parse SeqId. Do not create indexes.
       [T/F]  Optional default = F

If the "-o" option is TRUE (and the source database is in FASTA format), then the database identifiers in the FASTA definition line must follow the convention of the FASTA Defline Format. Please see section "F Note on creating custom databases" below.
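
As a rough illustration of the requirement above, the following sketch lists definition lines that do not begin with a database tag followed by a bar. The function name and the list of tags are assumptions made for illustration; the tag list is not exhaustive, and the FASTA Defline Format documentation remains authoritative.

```shell
# Hypothetical helper: print FASTA definition lines that do not start
# with one of a few common database tags (gi|, gb|, lcl|, ...).
# The tag list below is an illustrative assumption, not a complete list.
#   $1 = FASTA file to inspect
nonstandard_deflines() {
    grep '^>' "$1" |
        grep -Ev '^>(gi|gb|emb|dbj|pir|prf|sp|pdb|pat|bbs|ref|gnl|lcl)[|]' || true
}
```

Running this on a FASTA file before formatting with "-o T" gives a quick list of entries whose identifiers may need attention.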

-a  Input file is database in ASN.1 format (otherwise FASTA is expected)
       T - True,
       F - False.
       [T/F]  Optional default = F
-b  ASN.1 database in binary mode
       T - binary,
       F - text mode.
       [T/F]  Optional default = F

A source ASN.1 database may be represented in two formats: ASCII text and binary. The "-b" option, if TRUE, specifies that the input ASN.1 database is in binary format. The option is ignored for FASTA input.

-e  Input is a Seq-entry [T/F]  
       Optional
       default = F

A source ASN.1 database (either ASCII text or binary) may contain a Bioseq-set or just one Bioseq. In the latter case, the "-e" switch should be set to TRUE.

-n  Base name for BLAST files [String]  
       Optional

This option allows one to produce BLAST databases with a different name than that of the original FASTA file. For instance, one could have a file named 'ecoli.nuc.txt' and format it as 'ecoli':

formatdb -i ecoli.nuc.txt -p F -o T -n ecoli
uncompress -c nr.z | formatdb -i stdin -o T -n nr

Reading the input from stdin, as in the second command, is useful when the original FASTA file is not required other than by formatdb, for example when disk space is tight.
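
The stdin pattern above can be wrapped in a small shell function. This is only a sketch: the function name is hypothetical, and "gzip -dc" is assumed in place of "uncompress -c". The formatdb options are the ones described in this section.

```shell
# Hypothetical wrapper: format a compressed FASTA file without writing
# the uncompressed data to disk.
#   $1 = compressed FASTA file
#   $2 = T (protein) or F (nucleotide)
#   $3 = base name for the BLAST database files
format_from_gzip() {
    gzip -dc "$1" | formatdb -i stdin -p "$2" -o T -n "$3"
}
```

Because the data comes from a pipe, the "-n" option supplies the base name that would otherwise be taken from the input file name.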

-v  Database volume size in millions of letters [Integer] Optional
       default = 0
       range from 0 to <NULL>

This option breaks up large FASTA files into 'volumes' (each with a maximum size of 2 billion letters). When it creates volumes, formatdb also writes an additional type of BLAST database file, called an alias file, with the extension 'nal' or 'pal'.
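
To get a feel for the "-v" value, the number of volumes can be estimated with simple arithmetic. The helper below is a hypothetical sketch that assumes plain ceiling division and treats the default of 0 as "no splitting"; real volumes end on sequence boundaries, so the actual count may differ.

```shell
# Hypothetical helper: estimate how many volumes "-v" would produce.
#   $1 = total number of letters in the source database
#   $2 = value passed to -v (millions of letters per volume)
# Assumes 64-bit shell arithmetic and that -v 0 means a single volume.
volume_count() {
    if [ "$2" -eq 0 ]; then
        echo 1
        return
    fi
    per_volume=$(( $2 * 1000000 ))
    # Ceiling division: round up when the last volume is only partly full.
    echo $(( ($1 + per_volume - 1) / per_volume ))
}
```

For example, a database of 5,000,000,000 letters formatted with "-v 2000" would be estimated at three volumes.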

-s  Create indexes limited only to accessions - sparse [T/F]  
       Optional
       default = F

This option limits the indices for the string identifiers (used by formatdb) to accessions (i.e., no locus names). This is especially useful for sequence sets such as ESTs, where the accession and locus names are identical. Formatdb runs faster and produces smaller temporary files if this option is used. It is strongly recommended for ESTs, STSs, GSSs, and HTGSs.

-L  Create an alias file with this name
       use the gifile arg (below) if set to calculate db size
       use the BLAST db specified with -i (above) [File Out]  Optional

This option produces a BLAST database alias file using a specified database, but limiting the sequences searched to those in the GI list given by the -F argument. See the section "Note on creating an alias file for a GI list" for more information.

-F  Gifile (file containing list of gi's) [File In]  Optional

This option can be used to specify the GI list for the alias file construction (-L option above) or to produce a binary GI list if the -B option (below) is set.

-B  Binary Gifile produced from the Gifile specified above [File Out]  Optional

This option specifies the name of a binary GI list file. This option should be used with the -F option. A text GI list may be specified with the -F option and the -B option will produce that GI list in binary format. The binary file is smaller and BLAST does not need to convert it, so it can be read faster.
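
The three GI-list options can be combined as in the sketch below. The function names are hypothetical, and the option combinations are inferred from the descriptions above rather than taken from a verified command reference.

```shell
# Hypothetical wrappers around the -F, -B and -L options.

# Convert a text GI list ($1) into a binary GI list ($2).
# Whether -i must also be given in this mode is not stated above.
make_binary_gilist() {
    formatdb -F "$1" -B "$2"
}

# Create an alias file ($3) for an existing BLAST database ($1),
# restricted to the GIs listed in $2.
#   $4 = T (protein) or F (nucleotide)
make_gi_alias() {
    formatdb -i "$1" -p "$4" -L "$3" -F "$2"
}
```

A call such as make_gi_alias nt gi.txt nt_subset F would then request an alias named nt_subset over the nt database, limited to the GIs in gi.txt.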

Example

Type the following at the command prompt:

formatdb -i databasefile -p F -o T -n basename

The output files will be:

  • basename.nhr
  • basename.nin
  • basename.nsq
  • formatdb.log
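
The file list above follows a simple naming pattern, which the hypothetical helper below reproduces. The protein extensions (.phr, .pin, .psq) do not appear in the example and are included here as an assumption; additional index files may also be present when SeqIds are parsed.

```shell
# Hypothetical helper: print the files the example is expected to create.
#   $1 = base name given to -n
#   $2 = T (protein) or F (nucleotide)
# The protein prefix "p" is an assumption not shown in the example above.
expected_db_files() {
    case "$2" in
        T) prefix=p ;;
        *) prefix=n ;;
    esac
    for ext in hr in sq; do
        echo "$1.${prefix}${ext}"
    done
    echo "formatdb.log"
}
```

The output can be fed to a loop that tests each name with "[ -e ... ]" to confirm that a formatting run completed.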

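See also

  • BLAST
  • FASTA format
  • Blastall

External links

  • Using the Advanced Features of Formatdb (http://www.ncbi.nlm.nih.gov/Web/Newsltr/Summer03/blast.html), by NCBI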