BLAST+

From Christoph's Personal Wiki
Jump to: navigation, search

In bioinformatics, Basic Local Alignment Search Tool (or BLAST), is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences.

The "old" or "legacy" BLAST vs. the "new" BLAST+

This article focuses on the NCBI "new" BLAST, or blast+ (and starting from version 2.2.26+, released on 3 March 2012). The NCBI "new" BLAST+ was released in 2009. This replaces the old NCBI "legacy" BLAST package, starting with version 2.2.18+.

NCBI "legacy" BLAST included command line tools blastall, blastpgp, and rpsblast. This was the most widely used standalone BLAST tool up until its replacement BLAST+ was released by the NCBI. The "new" BLAST+ tools like blastn, blastp, blastx, tblastn, tblastx (which all used to be handled by blastall), psiblast (replacing blastpgp) and rpsblast and rpstblastn (which replace the old rpsblast). Also, makeblastdb replaces the "legacy" formatdb package.

The latest stable version is: 2.2.26+ (2012-03-03)

see: BLAST for legacy ("old") versions.

Utilities

  • Programs contained in blast+ package:
blastdbcheck 
Checks database integrity
blastdbcmd 
Retrieves sequences or other information from a BLAST database
blastdb_aliastool 
Creates database alias
Blastn 
Searches a nucleotide query against a nucleotide database
blastp 
Searches a protein query against a protein database
blastx 
Searches a nucleotide query, dynamically translated in all six frames, against a protein database
blast_formatter 
Formats a web blast result using its assigned request ID (RID)
convert2blastmask 
Converts lowercase masking into makeblastdb readable data
dustmasker 
Masks the low complexity regions in the input nucleotide sequences
legacy_blast.pl 
Converts a legacy blast search command line into blast+ counterpart and execute it
makeblastdb 
Formats input FASTA file(s) into a BLAST database
makembindex 
Indexes an existing nucleotide database for use with megablast
psiblast 
Finds members of a protein family, identifies proteins distantly related to the query, or builds position specific scoring matrix for the query
rpsblast 
Searches a protein against a conserved domain database (CDD) to identify functional domains present in the query
rpstblastn 
Searches a nucleotide query, by dynamically translated it in all six-frames first, against a conserved domain database (CDD)
segmasker 
Masks the low complexity regions in input protein sequences
tblastn 
Searches a protein query against a nucleotide database dynamically translated in all six frames
tblastx 
Searches a nucleotide query, dynamically translated in all six frames, against a nucleotide database similarly translated
update_blastdb.pl 
Downloads preformatted blast databases from NCBI
windowmasker 
Masks repeats found in input nucleotide sequences

Legacy utilities

  • Programs contained in the legacy blast package:
bl2seq [1] 
Directly comparing two FASTA sequences
blastall [1] 
legacy blast containing the subfunction of blastn, blastp, blastx, tblastn, and tblastx
blastclust [2] 
Clusters input FASTA sequences into related groups
blastpgp [1] 
Standalone PSI-BLAST for search of distantly related protein sequences and generate position-specific matrices
copymat [2] 
Copies blastpgp output for input to makemat
fastacmd [1] 
Retrieves specific sequence or dumps the sequences from a formatted blast database
formatdb [1] 
Convert FASTA formatted seqeucne file into BLAST database
formatrpsdb [2] 
Format scoremat files into an RPSBLAST database
impala [2] 
protein profile search program, mostly replaced by rpsblast
makemat [2] 
Convert the copymat files into scoremat format, no loger needed by new blastpgp output
megablast [1] 
Faster batch blastn program that uses greedy-algorithm. Works in contiguous or more sensitive discontiguous mode
rpsblast [1] 
reverse PSI-BLAST program for searching against conserved domain database
seedtop [2] 
Pattern search program

Note:

  1. Those programs are re-organized into blastn, blastp, blastx, tblastn, tblastx, rpsblast, rpsblastx, psiblast, blastdbcmd and makeblastdb
  2. Those programs have no blast+ counterpart at this time.

The commands for legacy blast, comparable to those given for blast+ in section 6, are:

blastall -
fastacmd -d refseq_rna -s nm_000249 -o test_query.fa
blastall -p blastn -i test_query.fa -d refseq_rna -F F -m 9 -b 2 -v 2

Example usage

For users of NCBI C Toolkit BLAST

The easiest way to get started using these command line applications is by means of the legacy_blast.pl PERL script which is bundled along with the BLAST+ applications. To utilize this script, simply prefix it to the invocation of the C toolkit BLAST command line application and append the --path option pointing to the installation directory of the BLAST+ applications.

For example, instead of using:

blastall -i query -d nr -o blast.out 

use

legacy_blast.pl blastall -i query -d nr -o blast.out --path /opt/blast/bin 

For more details, refer to the BLAST Command Line Applications User Manual section titled Backwards compatibility script.

A simple BLASTX run

As an example, taking a FASTA file of gene nucleotide sequences, one might wish to run a BLASTX (translation) search against the non-redundant (NR) protein database. Assuming the NR database has been downloaded and installed, one might run:

blastx -query foo.fasta -db nr -out foo.xml -evalue 0.001 -outfmt 5

This should run BLASTX against the NR database, using an expectation cut-off value of 0.001, and produce XML output to the specified file (which we can then parse). On my fairly new laptop, this takes about six minutes - a good reason to save the output to a file so you and repeat any analysis as needed.

Extract all human sequences from the NR database

Although one cannot select GIs by taxonomy from a database, a combination of Linux command line tools will accomplish this:

$ blastdbcmd -db nr -entry all -outfmt "%g %T" | \
   awk ' { if ($2 == 9606) { print $1 } } ' | \
   blastdbcmd -db nr -entry_batch - -out human_sequences.txt

The first blastdbcmd invocation produces 2 entries per sequence (GI and taxonomy ID), the awk command selects from the output of that command those sequences which have a taxonomy ID of '9606' (i.e., human) and prints its GIs, and finally the second blastdbcmd invocation uses those GIs to print the sequence data for the human sequences in the nr database.

Custom data extraction and formatting from a BLAST database

The following examples show how to extract selected information from a BLAST database and how to format it:

Extract the accession, sequence length, and masked locations for GI 71022837:

$ blastdbcmd -entry 71022837 -db Test/mask-data-db  -outfmt "%a %l %m"
XP_761648.1 1292 119-139;140-144;147-152;154-160;161-216;


Extract the masked FASTA for GI 71022837:

$ blastdbcmd -entry 71022837 -db Test/mask-data-db \
-mask_sequence_with 20 -target_only
>gi|71022837|ref|XP_761648.1| hypothetical protein UM05501.1
MPPSARHSAHPSHHPHAGGRDLHHAAGGPPPQGGPGMPPGPGNGPMHHPHSSYAQSMPPPPGLPPHAMNGINGPP
PSTHGGPPPRMVMADGPGGAGGPPPPPPPHIPRSSSAQSRIMEAaggpagpppagppastspavqklslaNEaaw
vsiGsaaetmedydralsayeaalrhnpysvpalsaiagvhrtldnfekavdyfqrvlnivpengdtWGSMGHCY
LMMDDLQRAYTAYQQALYHLPNPKEPKLWYGIGILYDRYGSLEHAEEAFASVVRMDPNYEKANEIYFRLGIIYKQ
QNKFPASLECFRYILDNPPRPLTEIDIWFQIGHVYEQQKEFNAAKEAYERVLAENPNHAKVLQQLGWLYHLSNAG
FNNQERAIQFLTKSLESDPNDAQSWYLLGRAYMAGQNYNKAYEAYQQAVYRDGKNPTFWCSIGVLYYQINQYRDA
LDAYSRAIRLNPYISEVWFDLGSLYEACNNQISDAIHAYERAADLDPDNPQIQQRLQLLRNAEAKGGELPEAPVP
QDVHPTAYANNNGMAPGPPTQIGGGPGPSYPPPLVGPQLAGNGGGRGDLSDRDLPGPGHLGSSHSPPPFRGPPGT
DDRGARGPPHGALAPMVGGPGGPEPLGRGGFSHSRGPSPGPPRMDPYGRRLGSPPRRSPPPPLRSDVHDGHGAPP
HVHGQGHGQGHGQGHGQGHGQGHGQSHGHSHGGEFRGPPPLAAAGPGGPPPPLDHYGRPMGGPMSEREREMEWER
EREREREREQAARGYPASGRITPKNEPGYARSQHGGSNAPSPAFGRPPVYGRDEGRDYYNNSHPGSGPGGPRGGY
ERGPGAPHAPAPGMRHDERGPPPAPFEHERGPPPPHQAGDLRYDSYSDGRDGPFRGPPPGLGRPTPDWERTRAGE
YGPPSLHDGAEGRNAGGSASKSRRGPKAKDELEAAPAPPSPVPSSAGKKGKTTSSRAGSPWSAKGGVAAPGKNGK
ASTPFGTGVGAPVAAAGVGGGVGSKKGAAISLRPQEDQPDSRPGSPQSRRDASPASSDGSNEPLAARAPSSRMVD
EDYDEGAADALMGLAGAASASSASVATAAPAPVSPVATSDRASSAEKRAESSLGKRPYAEEERAVDEPEDSYKRA
KSGSAAEIEADATSGGRLNGVSVSAKPEATAAEGTEQPKETRTETPPLAVAQATSPEAINGKAESESAVQPMDVD
GREPSKAPSESATAMKDSPSTANPVVAAKASEPSPTAAPPATSMATSEAQPAKADSCEKNNNDEDEREEEEGQIH
EDPIDAPAKRADEDGAK

Example of custom output format

The following example shows how to display the results of a BLAST search using a custom output format. The tabular output format with comments is used, but only the query accession, subject accession, evalue, query start, query stop, subject start, and subject stop are requested. For brevity, only the first 10 lines of output are shown:

$ echo 1786181 | ./blastn -db ecoli -outfmt "7 qacc sacc evalue qstart qend sstart send" 
# BLASTN 2.2.18+
# Query: gi|1786181|gb|AE000111.1|AE000111 
# Database: ecoli
# Fields: query acc., subject acc., evalue, q. start, q. end, s. start, s. end
# 85 hits found
AE000111        AE000111        0.0     1       10596   1       10596
AE000111        AE000174        8e-30   5565    5671    6928    6821
AE000111        AE000394        1e-27   5587    5671    135     219
AE000111        AE000425        6e-26   5587    5671    8552    8468
AE000111        AE000171        3e-24   5587    5671    2214    2130

Query a BLAST database with a GI, but exclude that GI from the results

Extract a GI from the ecoli database:

$ blastdbcmd -entry all -db ecoli -dbtype nucl -outfmt %g | head -1 | \
  tee exclude_me 
1786181

Run the restricted database search, which shows there are no self-hits:

$ blastn -db ecoli -negative_gilist exclude_me -show_gis -num_alignments 0 \
  -query exclude_me | grep `cat exclude_me`
Query= gi|1786181|gb|AE000111.1|AE000111 
$

Exit codes

All BLAST+ applications have consistent exit codes to signify the exit status of the application. The possible exit codes along with their meaning are detailed in the table below:

BLAST+ Exit Codes
Exit Code Meaning
0 Success
1 Error in query sequence(s) or BLAST options
2 Error in BLAST database
3 Error in BLAST engine
4 Out of memory
255 Unknown error
Source: BLAST Command Line Applications User Manual

In the case of BLAST+ database applications, the possible exit codes are 0 (indicating success) and 1 (indicating failure).

See also

External links