Difference between revisions of "Blastall"

From Christoph's Personal Wiki
Revision as of 08:27, 29 December 2006

Blastall allows the use of all BLAST programs (blastn, blastp, blastx, tblastx, and tblastn).

Introduction

Blastall may be used to perform all five flavours of blast comparison. One may obtain the blastall options by executing "blastall -" (note the dash). A typical use of blastall, here a blastn search (nucleotide vs. nucleotide) of a file called QUERY, would be:

blastall -p blastn -d nr -i QUERY -o out.QUERY

The output is placed into the output file out.QUERY and the search is performed against the "nr" database. If a protein vs. protein search is desired, then "blastn" should be replaced with "blastp", etc.
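
As a rough sketch of swapping the program name, the hypothetical helper below (blast_cmd is not part of BLAST) only prints the command line it would run. It assumes that the database and query passed in are of the right type for the chosen program.

```shell
# Hypothetical helper: print (not run) a blastall command line.
# Arguments: program name, database, query file.
blast_cmd() {
    printf 'blastall -p %s -d %s -i %s -o out.%s\n' "$1" "$2" "$3" "$3"
}

blast_cmd blastn nr QUERY   # nucleotide vs. nucleotide
blast_cmd blastp nr QUERY   # protein vs. protein
```

To actually run a search, paste the printed line into the shell.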

Blastall arguments / options

Some of the most commonly used blastall options are:

-p  Program Name [String]
        Input should be one of "blastp", "blastn", "blastx", "tblastn", or "tblastx".
-d  Database [String]
        default = nr

The database specified must first be formatted with formatdb. Multiple database names (bracketed by quotations) will be accepted.

An example would be

-d "nr est"

which will search both the nr and est databases, presenting the results as if one 'virtual' database consisting of all the entries from both were searched. The statistics are based on the 'virtual' database of nr and est.
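
The quotation marks matter because the shell must hand both names to -d as a single argument. The following sketch does not run BLAST; it only shows how the shell splits such a command line (the variable name dbs is an illustrative choice).

```shell
# Keep both database names in ONE shell word so that -d receives "nr est".
dbs="nr est"
set -- blastall -p blastn -d "$dbs" -i QUERY -o out.QUERY

# Show how the shell split the command line: one argument per line.
printf '[%s]\n' "$@"
```

Without the quotes around "$dbs", est would become a separate argument rather than part of the -d value.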

-i  Query File [File In]
        default = stdin

The query should be in FASTA format. If multiple FASTA entries are in the input file, all queries will be searched.
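
To see how many queries a file will produce, one can count its header lines. This is a minimal sketch: the file name multi_query.fa and both sequences are made up for illustration.

```shell
# Write a small two-entry FASTA file (made-up sequences, for illustration only).
cat > multi_query.fa <<'EOF'
>query1
AGCTTTTCATTCTGACTGCA
>query2
TTCTGAACTGGTTACCTGCC
EOF

# Every FASTA entry starts with a '>' header line, so counting those lines
# gives the number of queries that blastall will search.
n_queries=$(grep -c '^>' multi_query.fa)
echo "queries in file: $n_queries"
```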

-e  Expectation value (E) [Real]
        default = 10.0
-o  BLAST report Output File [File Out]  Optional
        default = stdout
-F  Filter query sequence (DUST with blastn, SEG with others) [String]
        default = T

BLAST 2.0 and 2.1 use the dust low-complexity filter for blastn and seg for the other programs. Both 'dust' and 'seg' are integral parts of the NCBI toolkit and are accessed automatically.

If one uses "-F T" then normal filtering by seg or dust (for blastn) occurs (likewise "-F F" means no filtering whatsoever).

This option also takes a string as an argument. One may use such a string to change the specific parameters of seg or invoke other filters. Please see the "Filtering Strings" section (below) for details.

-S  Query strands to search against database (for blast[nx], and tblastx).
        3 is both, 1 is top, 2 is bottom [Integer]
        default = 3
-T  Produce HTML output [T/F]
        default = F
-l  Restrict search of database to list of GI's [String]  Optional

This option specifies that only a subset of the database should be searched, determined by the list of gi's (i.e., NCBI identifiers) in a file. One can obtain a list of gi's for a given Entrez query from http://www.ncbi.nlm.nih.gov/Entrez/batch.html. This file should be in the same directory as the database, or in the directory that BLAST is called from.
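
A sketch of preparing such a file, assuming the plain-text layout of one gi per line. The file name my.gilist and the gi numbers are invented for illustration, and the blastall command is only printed, not run.

```shell
# Made-up gi numbers, one per line (assumed plain-text list layout).
printf '%s\n' 12345 67890 > my.gilist

# Print the command that would restrict the search to those entries.
echo "blastall -p blastp -d nr -i QUERY -l my.gilist -o out.QUERY"
```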

-U  Use lower case filtering of FASTA sequence [T/F]  Optional
        default = F

This option specifies that any lower-case letters in the input FASTA file should be masked.
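
Before relying on -U T, it can help to check how much of a query is actually in lower case. The sketch below uses a made-up file (masked_query.fa) and only prints the blastall command rather than running it.

```shell
# A made-up query whose lower-case stretch marks the region to be masked.
cat > masked_query.fa <<'EOF'
>query1 (made-up sequence; lower-case region marks what should be masked)
AGCTTTTCATtctgactgcaACGGGCAATA
EOF

# Count the lower-case residues in the sequence lines (header lines skipped);
# these are the positions that -U T would treat as masked.
n_masked=$(grep -v '^>' masked_query.fa | tr -cd 'a-z' | wc -c | tr -d ' ')
echo "lower-case residues: $n_masked"

echo "blastall -p blastn -d nr -i masked_query.fa -U T -o out.masked_query"
```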

Enhancements

A new option has been added to search multiple queries at once for the blastn and tblastn program options of blastall.

-B  Number of concatenated queries, for blastn and tblastn [Integer]
        Optional
        default = 0

This new feature is similar in principle to, but different in implementation from, the support for multiple queries already existing in megablast. The combination of ungapped search (-g F) and multiple queries (-B N) is not supported. The argument to the -B option must be equal to the number of sequences in the FASTA input file.

Processing multiple query sequences in one run can be much faster than processing them with separate runs because the database is scanned only once for the entire set of queries. When the -B option is used, the results may differ from the ones produced with individual queries. Usually results will be at least as good as, or better than (in terms of score/evalue), the results of corresponding individual queries; exceptions occur due to the heuristic nature of BLAST. Additional alignments may appear. It is guaranteed that matching sequences will appear in the same order when they are tied in evalue and are part of the output both with and without -B. When the -B option is used, the summary statistics at the bottom of the output are for the combined set of queries; at present, the summary statistics are not tabulated for the individual queries in a multiple-query input.
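
Because the -B value must match the number of sequences in the input, it is safer to compute it than to type it by hand. The helper below (batch_cmd, a hypothetical name) counts the FASTA headers and prints the resulting command without running it. The file batch.fa and its sequences are made up.

```shell
# Hypothetical helper: print a blastall command whose -B value matches the
# number of FASTA entries (header lines starting with '>') in the query file.
# Arguments: query file, database.
batch_cmd() {
    query=$1
    db=$2
    n=$(grep -c '^>' "$query")
    printf 'blastall -p blastn -d %s -i %s -B %s -o %s.out\n' "$db" "$query" "$n" "$query"
}

# Three made-up entries, for illustration only.
printf '>q1\nACGT\n>q2\nTTGA\n>q3\nGGCC\n' > batch.fa
batch_cmd batch.fa ecoli.nt
```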

Blastall Tutorial

In the BLAST database FTP directory you will find the downloadable BLAST database files. For your first search, it is recommended to download something relatively small like ecoli.nt.gz (1349 Kb). This is a compressed, FASTA-formatted file of nucleotide sequences. Once it is uncompressed, you will need to format the database using the formatdb program which comes with your Standalone BLAST executable. The list of arguments for this program and all other BLAST programs is located at the end of the README in the Standalone BLAST FTP directory. Alternatively, you can get these arguments by running each of the BLAST programs (formatdb, blastall, etc.) with a single hyphen as the argument (Example: formatdb -). This document only shows the basic commands for formatting the database and running your first search.

First, download ecoli.nt.gz. For an example, issue the following commands:

wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/ecoli.nt.gz
gunzip ecoli.nt.gz

To format the ecoli.nt database run the following from the command line:

formatdb -i ecoli.nt -p F -o T

This will create the following seven index files that Standalone BLAST needs to perform the searches and produce results (as well as a log file):

  • ecoli.nt.nhr
  • ecoli.nt.nin
  • ecoli.nt.nnd
  • ecoli.nt.nni
  • ecoli.nt.nsd
  • ecoli.nt.nsi
  • ecoli.nt.nsq
  • formatdb.log
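
A quick way to confirm that formatting succeeded is to check for the seven index files listed above. The helper name missing_index_files is hypothetical; it prints every expected file that is absent and prints nothing if all are present.

```shell
# Hypothetical helper: report which of the expected formatdb index files
# for a nucleotide database are missing. Argument: database name.
missing_index_files() {
    db=$1
    for ext in nhr nin nnd nni nsd nsi nsq; do
        [ -f "$db.$ext" ] || echo "$db.$ext"
    done
}

missing_index_files ecoli.nt
```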

The ecoli.nt file is not needed once formatdb has finished, so you can delete it.

Next create a test nucleotide file to run against the new database. It may be easier to 'cheat' here and just extract a portion of a nucleotide sequence you know is in the downloaded ecoli.nt database.

Make a text file called test.txt with the following sequence:

>Test
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG
AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT

To run the first search enter the following command from the UNIX command line in your BLAST directory:

blastall -p blastn -d ecoli.nt -i test.txt -o test.out

This should generate a results file called test.out in the Standalone BLAST directory.
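
To check the report without reading it in full, one can look for the line that introduces the hit list. This sketch assumes the default text report format, in which that line reads "Sequences producing significant alignments". The helper name has_hits is hypothetical.

```shell
# Hypothetical helper: succeed if a BLAST text report lists any hits.
# Argument: report file.
has_hits() {
    grep -q 'Sequences producing significant alignments' "$1"
}

if [ -f test.out ] && has_hits test.out; then
    echo "hits found"
else
    echo "no hits (or test.out missing)"
fi
```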

Now you are ready to create your own databases and run BLAST searches.
