BLAST+
In bioinformatics, the Basic Local Alignment Search Tool (BLAST) is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences.
This article focuses on the NCBI "new" BLAST, or blast+, starting from version 2.2.26+ (released on 3 March 2012).
The latest stable version is 2.2.26+ (2012-03-03).
See BLAST for legacy ("old") versions.
Utilities
- Programs contained in blast+ package:
- blastdbcheck
- Checks database integrity
- blastdbcmd
- Retrieves sequences or other information from a BLAST database
- blastdb_aliastool
- Creates database alias
- blastn
- Searches a nucleotide query against a nucleotide database
- blastp
- Searches a protein query against a protein database
- blastx
- Searches a nucleotide query, dynamically translated in all six frames, against a protein database
- blast_formatter
- Formats a web blast result using its assigned request ID (RID)
- convert2blastmask
- Converts lowercase masking into makeblastdb readable data
- dustmasker
- Masks the low complexity regions in the input nucleotide sequences
- legacy_blast.pl
- Converts a legacy blast search command line into its blast+ counterpart and executes it
- makeblastdb
- Formats input FASTA file(s) into a BLAST database
- makembindex
- Indexes an existing nucleotide database for use with megablast
- psiblast
- Finds members of a protein family, identifies proteins distantly related to the query, or builds position specific scoring matrix for the query
- rpsblast
- Searches a protein against a conserved domain database (CDD) to identify functional domains present in the query
- rpstblastn
- Searches a nucleotide query, dynamically translated in all six frames, against a conserved domain database (CDD)
- segmasker
- Masks the low complexity regions in input protein sequences
- tblastn
- Searches a protein query against a nucleotide database dynamically translated in all six frames
- tblastx
- Searches a nucleotide query, dynamically translated in all six frames, against a nucleotide database similarly translated
- update_blastdb.pl
- Downloads preformatted blast databases from NCBI
- windowmasker
- Masks repeats found in input nucleotide sequences
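As a rough illustration of how two of the utilities listed above fit together, the sketch below formats a small FASTA file with makeblastdb and searches it with blastn. The file names, the toy sequences, and the temporary directory are made up for this example, and the search step is skipped when blast+ is not on the PATH.

```shell
#!/bin/sh
# Sketch: build a tiny nucleotide database and search it with blastn.
# All file names and sequences here are hypothetical.
workdir="$(mktemp -d)"

# Two toy database sequences in FASTA format.
cat > "$workdir/db.fa" <<'EOF'
>seq1
ACGTACGTTAGCTAGCTAGGATCCGATCGATCGTAGCTAGCTAGCATCGATCG
>seq2
TTGACCGATAGGCTAACGGTACCGTTAGGCATCGGATTACGCTAGGCTTAACG
EOF

# Query: a copy of seq1, intended to match that database entry.
cat > "$workdir/query.fa" <<'EOF'
>query1
ACGTACGTTAGCTAGCTAGGATCCGATCGATCGTAGCTAGCTAGCATCGATCG
EOF

if command -v makeblastdb >/dev/null 2>&1 && command -v blastn >/dev/null 2>&1; then
    # Format the FASTA file into a nucleotide BLAST database.
    makeblastdb -in "$workdir/db.fa" -dbtype nucl -out "$workdir/demo_db"
    # Search the query against it and print tabular output.
    blastn -query "$workdir/query.fa" -db "$workdir/demo_db" -outfmt 6
else
    echo "blast+ not found on PATH; skipping the search step" >&2
fi
```

For a protein database, the same pattern would use -dbtype prot with makeblastdb and blastp for the search.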
Legacy utilities
- Programs contained in the legacy blast package:
- bl2seq [1]
- Directly compares two FASTA sequences
- blastall [1]
- Legacy blast program combining the blastn, blastp, blastx, tblastn, and tblastx subfunctions
- blastclust [2]
- Clusters input FASTA sequences into related groups
- blastpgp [1]
- Standalone PSI-BLAST for searching distantly related protein sequences and generating position-specific matrices
- copymat [2]
- Copies blastpgp output for input to makemat
- fastacmd [1]
- Retrieves specific sequence or dumps the sequences from a formatted blast database
- formatdb [1]
- Converts a FASTA-formatted sequence file into a BLAST database
- formatrpsdb [2]
- Formats scoremat files into an RPSBLAST database
- impala [2]
- Protein profile search program, mostly replaced by rpsblast
- makemat [2]
- Converts the copymat files into scoremat format; no longer needed with the new blastpgp output
- megablast [1]
- Faster batch blastn program that uses a greedy algorithm. Works in contiguous mode or in the more sensitive discontiguous mode
- rpsblast [1]
- Reverse PSI-BLAST program for searching against a conserved domain database
- seedtop [2]
- Pattern search program
Notes:
- [1] These programs have been reorganized into blastn, blastp, blastx, tblastn, tblastx, rpsblast, rpstblastn, psiblast, blastdbcmd and makeblastdb.
- [2] These programs have no blast+ counterpart at this time.
The commands for legacy blast, comparable to those given for blast+ in section 6, are:
fastacmd -d refseq_rna -s nm_000249 -o test_query.fa
blastall -p blastn -i test_query.fa -d refseq_rna -F F -m 9 -b 2 -v 2
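For comparison, the blast+ counterparts of these two legacy commands can be sketched as follows. The option mapping (-F F to -dust no, -m 9 to -outfmt 7, -b and -v to -num_alignments and -num_descriptions, -p blastn to -task blastn) follows the NCBI user manual as best understood here and may differ between blast+ releases. The run helper and the DRY_RUN variable are hypothetical names added for this sketch: by default the commands are only printed, so nothing requires a local refseq_rna database.

```shell
#!/bin/sh
# Sketch: print (or run) the blast+ counterparts of the legacy commands.
# DRY_RUN=1 (the default) only echoes each command; DRY_RUN=0 executes it,
# which assumes blast+ on PATH and a local refseq_rna database.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

blastplus_example() {
    # fastacmd -d DB -s ID -o FILE  ->  blastdbcmd -db DB -entry ID -out FILE
    run blastdbcmd -db refseq_rna -entry nm_000249 -out test_query.fa
    # blastall -p blastn ... -F F -m 9 -b 2 -v 2  ->  blastn with long options
    run blastn -query test_query.fa -db refseq_rna -task blastn -dust no \
        -outfmt 7 -num_alignments 2 -num_descriptions 2
}

blastplus_example
```

The -task blastn option is included because blast+ blastn otherwise defaults to megablast, whereas blastall -p blastn ran the traditional algorithm.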
Example usage
For users of NCBI C Toolkit BLAST
The easiest way to get started with these command-line applications is the legacy_blast.pl Perl script, which is bundled with the BLAST+ applications. To use this script, prefix it to the invocation of the C Toolkit BLAST command-line application and append the --path option pointing to the installation directory of the BLAST+ applications.
For example, instead of using:
blastall -i query -d nr -o blast.out
use
legacy_blast.pl blastall -i query -d nr -o blast.out --path /opt/blast/bin
For more details, refer to the BLAST Command Line Applications User Manual section titled Backwards compatibility script.
Extract all human sequences from the nr database
Although one cannot select GIs by taxonomy from a database, a combination of Linux command line tools will accomplish this:
$ blastdbcmd -db nr -entry all -outfmt "%g %T" | \
  awk ' { if ($2 == 9606) { print $1 } } ' | \
  blastdbcmd -db nr -entry_batch - -out human_sequences.txt
The first blastdbcmd invocation prints two fields per sequence (GI and taxonomy ID). The awk command selects from that output the sequences whose taxonomy ID is 9606 (i.e., human) and prints their GIs. Finally, the second blastdbcmd invocation uses those GIs to print the sequence data for the human sequences in the nr database.
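The filtering stage of this pipeline can be tried on its own without a BLAST database. The sketch below wraps the awk step in a function, filter_gis_by_taxid (a hypothetical name), and feeds it made-up "GI taxonomy-ID" lines standing in for the output of the first blastdbcmd call.

```shell
#!/bin/sh
# Keep only the GIs (first field) whose taxonomy ID (second field)
# equals the value passed as the first argument.
filter_gis_by_taxid() {
    awk -v taxid="$1" '$2 == taxid { print $1 }'
}

# Made-up "GI taxid" lines in the shape produced by -outfmt "%g %T".
printf '%s\n' '1001 9606' '1002 10090' '1003 9606' | filter_gis_by_taxid 9606
# prints:
# 1001
# 1003
```

Passing the taxonomy ID as a parameter rather than hard-coding 9606 lets the same filter be reused for other organisms, for example 10090 for mouse.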
See also
- BioPython — makes extensive use of blast+
External links
- Official website
- BLAST executables — free source downloads
- Standalone BLAST Setup for Unix