Difference between revisions of "BLAST+"
(→Example usage) |
|||
(One intermediate revision by the same user not shown) | |||
Line 1: | Line 1: | ||
In bioinformatics, '''''B'''asic '''L'''ocal '''A'''lignment '''S'''earch '''T'''ool'' (or '''BLAST'''), is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences. | In bioinformatics, '''''B'''asic '''L'''ocal '''A'''lignment '''S'''earch '''T'''ool'' (or '''BLAST'''), is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences. | ||
− | This article focuses on the NCBI "new" BLAST, or '''blast+''' (and starting from version 2.2.26+, released on 3 March 2012). | + | ==The "old" or "legacy" BLAST vs. the "new" BLAST+== |
+ | This article focuses on the NCBI "new" BLAST, or '''blast+''' (and starting from version 2.2.26+, released on 3 March 2012). The NCBI "new" BLAST+ was released in 2009. This replaces the old [[BLAST|NCBI "legacy" BLAST]] package, starting with version 2.2.18+. | ||
+ | |||
+ | NCBI "legacy" BLAST included command line tools <code>blastall</code>, <code>blastpgp</code>, and <code>rpsblast</code>. This was the most widely used standalone BLAST tool up until its replacement BLAST+ was released by the NCBI. The "new" BLAST+ tools like <code>blastn</code>, <code>blastp</code>, <code>blastx</code>, <code>tblastn</code>, <code>tblastx</code> (which all used to be handled by <code>blastall</code>), <code>psiblast</code> (replacing <code>blastpgp</code>) and <code>rpsblast</code> and <code>rpstblastn</code> (which replace the old <code>rpsblast</code>). Also, <code>makeblastdb</code> replaces the "legacy" <code>formatdb</code> package. | ||
The latest stable version is: '''2.2.26+''' (2012-03-03) | The latest stable version is: '''2.2.26+''' (2012-03-03) | ||
Line 70: | Line 73: | ||
For more details, refer to the ''BLAST Command Line Applications User Manual'' section titled [http://www.ncbi.nlm.nih.gov/books/NBK1763/#CmdLineAppsManual.I43_Backwards_compatib Backwards compatibility script]. | For more details, refer to the ''BLAST Command Line Applications User Manual'' section titled [http://www.ncbi.nlm.nih.gov/books/NBK1763/#CmdLineAppsManual.I43_Backwards_compatib Backwards compatibility script]. | ||
− | ===Extract all human sequences from the | + | ===A simple BLASTX run=== |
+ | As an example, taking a FASTA file of gene nucleotide sequences, one might wish to run a BLASTX (translation) search against the non-redundant (NR) protein database. Assuming the NR database has been downloaded and installed, one might run: | ||
+ | |||
+ | blastx -query foo.fasta -db nr -out foo.xml -evalue 0.001 -outfmt 5 | ||
+ | |||
+ | This should run BLASTX against the NR database, using an expectation cut-off value of 0.001, and produce XML output to the specified file (which we can then parse). On my fairly new laptop, this takes about six minutes - a good reason to save the output to a file so you and repeat any analysis as needed. | ||
+ | |||
+ | ===Extract all human sequences from the NR database=== | ||
Although one cannot select GIs by taxonomy from a database, a combination of Linux command line tools will accomplish this: | Although one cannot select GIs by taxonomy from a database, a combination of Linux command line tools will accomplish this: | ||
Line 78: | Line 88: | ||
The first <code>blastdbcmd</code> invocation produces 2 entries per sequence (GI and taxonomy ID), the [[awk]] command selects from the output of that command those sequences which have a taxonomy ID of '9606' (i.e., human) and prints its GIs, and finally the second <code>blastdbcmd</code> invocation uses those GIs to print the sequence data for the human sequences in the <code>nr</code> database. | The first <code>blastdbcmd</code> invocation produces 2 entries per sequence (GI and taxonomy ID), the [[awk]] command selects from the output of that command those sequences which have a taxonomy ID of '9606' (i.e., human) and prints its GIs, and finally the second <code>blastdbcmd</code> invocation uses those GIs to print the sequence data for the human sequences in the <code>nr</code> database. | ||
+ | |||
+ | ===Custom data extraction and formatting from a BLAST database=== | ||
+ | |||
+ | The following examples show how to extract selected information from a BLAST database and how to format it: | ||
+ | |||
+ | Extract the accession, sequence length, and masked locations for GI 71022837: | ||
+ | $ blastdbcmd -entry 71022837 -db Test/mask-data-db -outfmt "%a %l %m" | ||
+ | XP_761648.1 1292 119-139;140-144;147-152;154-160;161-216; | ||
+ | |||
+ | |||
+ | Extract the masked FASTA for GI 71022837: | ||
+ | $ blastdbcmd -entry 71022837 -db Test/mask-data-db \ | ||
+ | -mask_sequence_with 20 -target_only | ||
+ | >gi|71022837|ref|XP_761648.1| hypothetical protein UM05501.1 | ||
+ | MPPSARHSAHPSHHPHAGGRDLHHAAGGPPPQGGPGMPPGPGNGPMHHPHSSYAQSMPPPPGLPPHAMNGINGPP | ||
+ | PSTHGGPPPRMVMADGPGGAGGPPPPPPPHIPRSSSAQSRIMEAaggpagpppagppastspavqklslaNEaaw | ||
+ | vsiGsaaetmedydralsayeaalrhnpysvpalsaiagvhrtldnfekavdyfqrvlnivpengdtWGSMGHCY | ||
+ | LMMDDLQRAYTAYQQALYHLPNPKEPKLWYGIGILYDRYGSLEHAEEAFASVVRMDPNYEKANEIYFRLGIIYKQ | ||
+ | QNKFPASLECFRYILDNPPRPLTEIDIWFQIGHVYEQQKEFNAAKEAYERVLAENPNHAKVLQQLGWLYHLSNAG | ||
+ | FNNQERAIQFLTKSLESDPNDAQSWYLLGRAYMAGQNYNKAYEAYQQAVYRDGKNPTFWCSIGVLYYQINQYRDA | ||
+ | LDAYSRAIRLNPYISEVWFDLGSLYEACNNQISDAIHAYERAADLDPDNPQIQQRLQLLRNAEAKGGELPEAPVP | ||
+ | QDVHPTAYANNNGMAPGPPTQIGGGPGPSYPPPLVGPQLAGNGGGRGDLSDRDLPGPGHLGSSHSPPPFRGPPGT | ||
+ | DDRGARGPPHGALAPMVGGPGGPEPLGRGGFSHSRGPSPGPPRMDPYGRRLGSPPRRSPPPPLRSDVHDGHGAPP | ||
+ | HVHGQGHGQGHGQGHGQGHGQGHGQSHGHSHGGEFRGPPPLAAAGPGGPPPPLDHYGRPMGGPMSEREREMEWER | ||
+ | EREREREREQAARGYPASGRITPKNEPGYARSQHGGSNAPSPAFGRPPVYGRDEGRDYYNNSHPGSGPGGPRGGY | ||
+ | ERGPGAPHAPAPGMRHDERGPPPAPFEHERGPPPPHQAGDLRYDSYSDGRDGPFRGPPPGLGRPTPDWERTRAGE | ||
+ | YGPPSLHDGAEGRNAGGSASKSRRGPKAKDELEAAPAPPSPVPSSAGKKGKTTSSRAGSPWSAKGGVAAPGKNGK | ||
+ | ASTPFGTGVGAPVAAAGVGGGVGSKKGAAISLRPQEDQPDSRPGSPQSRRDASPASSDGSNEPLAARAPSSRMVD | ||
+ | EDYDEGAADALMGLAGAASASSASVATAAPAPVSPVATSDRASSAEKRAESSLGKRPYAEEERAVDEPEDSYKRA | ||
+ | KSGSAAEIEADATSGGRLNGVSVSAKPEATAAEGTEQPKETRTETPPLAVAQATSPEAINGKAESESAVQPMDVD | ||
+ | GREPSKAPSESATAMKDSPSTANPVVAAKASEPSPTAAPPATSMATSEAQPAKADSCEKNNNDEDEREEEEGQIH | ||
+ | EDPIDAPAKRADEDGAK | ||
+ | |||
+ | ===Example of custom output format=== | ||
+ | |||
+ | The following example shows how to display the results of a BLAST search using a custom output format. The tabular output format with comments is used, but only the query accession, subject accession, evalue, query start, query stop, subject start, and subject stop are requested. For brevity, only the first 10 lines of output are shown: | ||
+ | |||
+ | $ echo 1786181 | ./blastn -db ecoli -outfmt "7 qacc sacc evalue qstart qend sstart send" | ||
+ | # BLASTN 2.2.18+ | ||
+ | # Query: gi|1786181|gb|AE000111.1|AE000111 | ||
+ | # Database: ecoli | ||
+ | # Fields: query acc., subject acc., evalue, q. start, q. end, s. start, s. end | ||
+ | # 85 hits found | ||
+ | AE000111 AE000111 0.0 1 10596 1 10596 | ||
+ | AE000111 AE000174 8e-30 5565 5671 6928 6821 | ||
+ | AE000111 AE000394 1e-27 5587 5671 135 219 | ||
+ | AE000111 AE000425 6e-26 5587 5671 8552 8468 | ||
+ | AE000111 AE000171 3e-24 5587 5671 2214 2130 | ||
+ | |||
+ | ===Query a BLAST database with a GI, but exclude that GI from the results=== | ||
+ | |||
+ | Extract a GI from the ecoli database: | ||
+ | $ blastdbcmd -entry all -db ecoli -dbtype nucl -outfmt %g | head -1 | \ | ||
+ | tee exclude_me | ||
+ | 1786181 | ||
+ | |||
+ | Run the restricted database search, which shows there are no self-hits: | ||
+ | $ blastn -db ecoli -negative_gilist exclude_me -show_gis -num_alignments 0 \ | ||
+ | -query exclude_me | grep `cat exclude_me` | ||
+ | Query= gi|1786181|gb|AE000111.1|AE000111 | ||
+ | $ | ||
+ | |||
+ | ==Exit codes== | ||
+ | |||
+ | All BLAST+ applications have consistent exit codes to signify the exit status of the application. The possible exit codes along with their meaning are detailed in the table below: | ||
+ | <div style="float:left; margin:0px 20px 0px 0px;"> | ||
+ | {| align="center" style="border: 1px solid #999; background-color:#FFFFFF" | ||
+ | |- | ||
+ | ! colspan="4" bgcolor="#EFEFEF" | '''BLAST+ Exit Codes''' | ||
+ | |-align="center" bgcolor="#1188ee" | ||
+ | !Exit Code | ||
+ | !Meaning | ||
+ | |- | ||
+ | | align="right" |0 || Success | ||
+ | |--bgcolor="#eeeeee" | ||
+ | | align="right" | 1 || Error in query sequence(s) or BLAST options | ||
+ | |- | ||
+ | | align="right" |2 || Error in BLAST database | ||
+ | |--bgcolor="#eeeeee" | ||
+ | | align="right" |3 || Error in BLAST engine | ||
+ | |- | ||
+ | | align="right" |4 || Out of memory | ||
+ | |--bgcolor="#eeeeee" | ||
+ | | align="right" |255 || Unknown error | ||
+ | |} | ||
+ | <div align="center">''Source: [http://www.ncbi.nlm.nih.gov/books/NBK1763/ BLAST Command Line Applications User Manual]''</div> | ||
+ | </div> | ||
+ | {{Clear}} | ||
+ | In the case of BLAST+ database applications, the possible exit codes are 0 (indicating success) and 1 (indicating failure). | ||
==See also== | ==See also== |
Latest revision as of 00:05, 10 July 2012
In bioinformatics, Basic Local Alignment Search Tool (or BLAST), is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences.
Contents
The "old" or "legacy" BLAST vs. the "new" BLAST+
This article focuses on the NCBI "new" BLAST, or blast+ (and starting from version 2.2.26+, released on 3 March 2012). The NCBI "new" BLAST+ was released in 2009. This replaces the old NCBI "legacy" BLAST package, starting with version 2.2.18+.
NCBI "legacy" BLAST included command line tools blastall
, blastpgp
, and rpsblast
. This was the most widely used standalone BLAST tool up until its replacement BLAST+ was released by the NCBI. The "new" BLAST+ tools like blastn
, blastp
, blastx
, tblastn
, tblastx
(which all used to be handled by blastall
), psiblast
(replacing blastpgp
) and rpsblast
and rpstblastn
(which replace the old rpsblast
). Also, makeblastdb
replaces the "legacy" formatdb
package.
The latest stable version is: 2.2.26+ (2012-03-03)
see: BLAST for legacy ("old") versions.
Utilities
- Programs contained in blast+ package:
- blastdbcheck
- Checks database integrity
- blastdbcmd
- Retrieves sequences or other information from a BLAST database
- blastdb_aliastool
- Creates database alias
- Blastn
- Searches a nucleotide query against a nucleotide database
- blastp
- Searches a protein query against a protein database
- blastx
- Searches a nucleotide query, dynamically translated in all six frames, against a protein database
- blast_formatter
- Formats a web blast result using its assigned request ID (RID)
- convert2blastmask
- Converts lowercase masking into makeblastdb readable data
- dustmasker
- Masks the low complexity regions in the input nucleotide sequences
- legacy_blast.pl
- Converts a legacy blast search command line into blast+ counterpart and execute it
- makeblastdb
- Formats input FASTA file(s) into a BLAST database
- makembindex
- Indexes an existing nucleotide database for use with megablast
- psiblast
- Finds members of a protein family, identifies proteins distantly related to the query, or builds position specific scoring matrix for the query
- rpsblast
- Searches a protein against a conserved domain database (CDD) to identify functional domains present in the query
- rpstblastn
- Searches a nucleotide query, by dynamically translated it in all six-frames first, against a conserved domain database (CDD)
- segmasker
- Masks the low complexity regions in input protein sequences
- tblastn
- Searches a protein query against a nucleotide database dynamically translated in all six frames
- tblastx
- Searches a nucleotide query, dynamically translated in all six frames, against a nucleotide database similarly translated
- update_blastdb.pl
- Downloads preformatted blast databases from NCBI
- windowmasker
- Masks repeats found in input nucleotide sequences
Legacy utilities
- Programs contained in the legacy blast package:
- bl2seq [1]
- Directly comparing two FASTA sequences
- blastall [1]
- legacy blast containing the subfunction of blastn, blastp, blastx, tblastn, and tblastx
- blastclust [2]
- Clusters input FASTA sequences into related groups
- blastpgp [1]
- Standalone PSI-BLAST for search of distantly related protein sequences and generate position-specific matrices
- copymat [2]
- Copies blastpgp output for input to makemat
- fastacmd [1]
- Retrieves specific sequence or dumps the sequences from a formatted blast database
- formatdb [1]
- Convert FASTA formatted seqeucne file into BLAST database
- formatrpsdb [2]
- Format scoremat files into an RPSBLAST database
- impala [2]
- protein profile search program, mostly replaced by rpsblast
- makemat [2]
- Convert the copymat files into scoremat format, no loger needed by new blastpgp output
- megablast [1]
- Faster batch blastn program that uses greedy-algorithm. Works in contiguous or more sensitive discontiguous mode
- rpsblast [1]
- reverse PSI-BLAST program for searching against conserved domain database
- seedtop [2]
- Pattern search program
Note:
- Those programs are re-organized into blastn, blastp, blastx, tblastn, tblastx, rpsblast, rpsblastx, psiblast, blastdbcmd and makeblastdb
- Those programs have no blast+ counterpart at this time.
The commands for legacy blast, comparable to those given for blast+ in section 6, are:
blastall - fastacmd -d refseq_rna -s nm_000249 -o test_query.fa blastall -p blastn -i test_query.fa -d refseq_rna -F F -m 9 -b 2 -v 2
Example usage
For users of NCBI C Toolkit BLAST
The easiest way to get started using these command line applications is by means of the legacy_blast.pl PERL script which is bundled along with the BLAST+ applications. To utilize this script, simply prefix it to the invocation of the C toolkit BLAST command line application and append the --path option pointing to the installation directory of the BLAST+ applications.
For example, instead of using:
blastall -i query -d nr -o blast.out
use
legacy_blast.pl blastall -i query -d nr -o blast.out --path /opt/blast/bin
For more details, refer to the BLAST Command Line Applications User Manual section titled Backwards compatibility script.
A simple BLASTX run
As an example, taking a FASTA file of gene nucleotide sequences, one might wish to run a BLASTX (translation) search against the non-redundant (NR) protein database. Assuming the NR database has been downloaded and installed, one might run:
blastx -query foo.fasta -db nr -out foo.xml -evalue 0.001 -outfmt 5
This should run BLASTX against the NR database, using an expectation cut-off value of 0.001, and produce XML output to the specified file (which we can then parse). On my fairly new laptop, this takes about six minutes - a good reason to save the output to a file so you and repeat any analysis as needed.
Extract all human sequences from the NR database
Although one cannot select GIs by taxonomy from a database, a combination of Linux command line tools will accomplish this:
$ blastdbcmd -db nr -entry all -outfmt "%g %T" | \ awk ' { if ($2 == 9606) { print $1 } } ' | \ blastdbcmd -db nr -entry_batch - -out human_sequences.txt
The first blastdbcmd
invocation produces 2 entries per sequence (GI and taxonomy ID), the awk command selects from the output of that command those sequences which have a taxonomy ID of '9606' (i.e., human) and prints its GIs, and finally the second blastdbcmd
invocation uses those GIs to print the sequence data for the human sequences in the nr
database.
Custom data extraction and formatting from a BLAST database
The following examples show how to extract selected information from a BLAST database and how to format it:
Extract the accession, sequence length, and masked locations for GI 71022837:
$ blastdbcmd -entry 71022837 -db Test/mask-data-db -outfmt "%a %l %m" XP_761648.1 1292 119-139;140-144;147-152;154-160;161-216;
Extract the masked FASTA for GI 71022837:
$ blastdbcmd -entry 71022837 -db Test/mask-data-db \ -mask_sequence_with 20 -target_only >gi|71022837|ref|XP_761648.1| hypothetical protein UM05501.1 MPPSARHSAHPSHHPHAGGRDLHHAAGGPPPQGGPGMPPGPGNGPMHHPHSSYAQSMPPPPGLPPHAMNGINGPP PSTHGGPPPRMVMADGPGGAGGPPPPPPPHIPRSSSAQSRIMEAaggpagpppagppastspavqklslaNEaaw vsiGsaaetmedydralsayeaalrhnpysvpalsaiagvhrtldnfekavdyfqrvlnivpengdtWGSMGHCY LMMDDLQRAYTAYQQALYHLPNPKEPKLWYGIGILYDRYGSLEHAEEAFASVVRMDPNYEKANEIYFRLGIIYKQ QNKFPASLECFRYILDNPPRPLTEIDIWFQIGHVYEQQKEFNAAKEAYERVLAENPNHAKVLQQLGWLYHLSNAG FNNQERAIQFLTKSLESDPNDAQSWYLLGRAYMAGQNYNKAYEAYQQAVYRDGKNPTFWCSIGVLYYQINQYRDA LDAYSRAIRLNPYISEVWFDLGSLYEACNNQISDAIHAYERAADLDPDNPQIQQRLQLLRNAEAKGGELPEAPVP QDVHPTAYANNNGMAPGPPTQIGGGPGPSYPPPLVGPQLAGNGGGRGDLSDRDLPGPGHLGSSHSPPPFRGPPGT DDRGARGPPHGALAPMVGGPGGPEPLGRGGFSHSRGPSPGPPRMDPYGRRLGSPPRRSPPPPLRSDVHDGHGAPP HVHGQGHGQGHGQGHGQGHGQGHGQSHGHSHGGEFRGPPPLAAAGPGGPPPPLDHYGRPMGGPMSEREREMEWER EREREREREQAARGYPASGRITPKNEPGYARSQHGGSNAPSPAFGRPPVYGRDEGRDYYNNSHPGSGPGGPRGGY ERGPGAPHAPAPGMRHDERGPPPAPFEHERGPPPPHQAGDLRYDSYSDGRDGPFRGPPPGLGRPTPDWERTRAGE YGPPSLHDGAEGRNAGGSASKSRRGPKAKDELEAAPAPPSPVPSSAGKKGKTTSSRAGSPWSAKGGVAAPGKNGK ASTPFGTGVGAPVAAAGVGGGVGSKKGAAISLRPQEDQPDSRPGSPQSRRDASPASSDGSNEPLAARAPSSRMVD EDYDEGAADALMGLAGAASASSASVATAAPAPVSPVATSDRASSAEKRAESSLGKRPYAEEERAVDEPEDSYKRA KSGSAAEIEADATSGGRLNGVSVSAKPEATAAEGTEQPKETRTETPPLAVAQATSPEAINGKAESESAVQPMDVD GREPSKAPSESATAMKDSPSTANPVVAAKASEPSPTAAPPATSMATSEAQPAKADSCEKNNNDEDEREEEEGQIH EDPIDAPAKRADEDGAK
Example of custom output format
The following example shows how to display the results of a BLAST search using a custom output format. The tabular output format with comments is used, but only the query accession, subject accession, evalue, query start, query stop, subject start, and subject stop are requested. For brevity, only the first 10 lines of output are shown:
$ echo 1786181 | ./blastn -db ecoli -outfmt "7 qacc sacc evalue qstart qend sstart send" # BLASTN 2.2.18+ # Query: gi|1786181|gb|AE000111.1|AE000111 # Database: ecoli # Fields: query acc., subject acc., evalue, q. start, q. end, s. start, s. end # 85 hits found AE000111 AE000111 0.0 1 10596 1 10596 AE000111 AE000174 8e-30 5565 5671 6928 6821 AE000111 AE000394 1e-27 5587 5671 135 219 AE000111 AE000425 6e-26 5587 5671 8552 8468 AE000111 AE000171 3e-24 5587 5671 2214 2130
Query a BLAST database with a GI, but exclude that GI from the results
Extract a GI from the ecoli database:
$ blastdbcmd -entry all -db ecoli -dbtype nucl -outfmt %g | head -1 | \ tee exclude_me 1786181
Run the restricted database search, which shows there are no self-hits:
$ blastn -db ecoli -negative_gilist exclude_me -show_gis -num_alignments 0 \ -query exclude_me | grep `cat exclude_me` Query= gi|1786181|gb|AE000111.1|AE000111 $
Exit codes
All BLAST+ applications have consistent exit codes to signify the exit status of the application. The possible exit codes along with their meaning are detailed in the table below:
BLAST+ Exit Codes | |||
---|---|---|---|
Exit Code | Meaning | ||
0 | Success | ||
1 | Error in query sequence(s) or BLAST options | ||
2 | Error in BLAST database | ||
3 | Error in BLAST engine | ||
4 | Out of memory | ||
255 | Unknown error |
In the case of BLAST+ database applications, the possible exit codes are 0 (indicating success) and 1 (indicating failure).
See also
- BioPython — makes extensive use of blast+
External links
- Official website
- BLAST executables — free source downloads
- Standalone BLAST Setup for Unix