BLAST+

From Christoph's Personal Wiki
Revision as of 22:30, 9 July 2012 by Christoph (Talk | contribs)


In bioinformatics, the Basic Local Alignment Search Tool (BLAST) is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences.

This article focuses on the NCBI "new" BLAST, or blast+, starting from version 2.2.26+ (released on 3 March 2012).

The latest stable version is: 2.2.26+ (2012-03-03)

See BLAST for legacy ("old") versions.

Utilities

  • Programs contained in the blast+ package:
blastdbcheck 
Checks database integrity
blastdbcmd 
Retrieves sequences or other information from a BLAST database
blastdb_aliastool 
Creates database alias
blastn
Searches a nucleotide query against a nucleotide database
blastp 
Searches a protein query against a protein database
blastx 
Searches a nucleotide query, dynamically translated in all six frames, against a protein database
blast_formatter 
Formats a web blast result using its assigned request ID (RID)
convert2blastmask 
Converts lowercase masking into makeblastdb readable data
dustmasker 
Masks the low complexity regions in the input nucleotide sequences
legacy_blast.pl 
Converts a legacy blast search command line into its blast+ counterpart and executes it
makeblastdb 
Formats input FASTA file(s) into a BLAST database
makembindex 
Indexes an existing nucleotide database for use with megablast
psiblast 
Finds members of a protein family, identifies proteins distantly related to the query, or builds position specific scoring matrix for the query
rpsblast 
Searches a protein against a conserved domain database (CDD) to identify functional domains present in the query
rpstblastn 
Searches a nucleotide query, dynamically translated in all six frames, against a conserved domain database (CDD)
segmasker 
Masks the low complexity regions in input protein sequences
tblastn 
Searches a protein query against a nucleotide database dynamically translated in all six frames
tblastx 
Searches a nucleotide query, dynamically translated in all six frames, against a nucleotide database similarly translated
update_blastdb.pl 
Downloads preformatted blast databases from NCBI
windowmasker 
Masks repeats found in input nucleotide sequences
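
As a rough illustration of how two of these programs fit together, the sketch below formats a small FASTA file with makeblastdb and then searches it with blastn. The file names, the database name demo_db and the two sequences are made up for this example; the blast+ calls only run if the programs are found on the PATH.

```shell
#!/bin/sh
# Sketch: build a nucleotide database from a FASTA file and search it.
# File and database names are placeholders; the sequences are made up.
workdir=$(mktemp -d)
cat > "$workdir/seqs.fa" <<'EOF'
>seq1 made-up example
ACGTACGTACGTACGTACGTACGTACGTACGT
>seq2 made-up example
TTGACCATGGCATTAGCCATGGTTAACCGGTA
EOF

if command -v makeblastdb >/dev/null 2>&1 && command -v blastn >/dev/null 2>&1; then
    # Format the FASTA file into a nucleotide BLAST database.
    makeblastdb -in "$workdir/seqs.fa" -dbtype nucl -out "$workdir/demo_db"
    # Search the file against its own database; tabular output (format 6).
    blastn -query "$workdir/seqs.fa" -db "$workdir/demo_db" -outfmt 6
else
    echo "blast+ not found on PATH; skipping makeblastdb/blastn" >&2
fi
```

For a protein FASTA file, the same pattern would use -dbtype prot with makeblastdb and blastp for the search.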

Legacy utilities

  • Programs contained in the legacy blast package:
bl2seq [1] 
Directly compares two FASTA sequences
blastall [1] 
Legacy blast program providing the blastn, blastp, blastx, tblastn, and tblastx subfunctions
blastclust [2] 
Clusters input FASTA sequences into related groups
blastpgp [1] 
Standalone PSI-BLAST for searching distantly related protein sequences and generating position-specific matrices
copymat [2] 
Copies blastpgp output for input to makemat
fastacmd [1] 
Retrieves specific sequence or dumps the sequences from a formatted blast database
formatdb [1] 
Converts a FASTA-formatted sequence file into a BLAST database
formatrpsdb [2] 
Formats scoremat files into an RPSBLAST database
impala [2] 
Protein profile search program, mostly replaced by rpsblast
makemat [2] 
Converts the copymat files into scoremat format; no longer needed with new blastpgp output
megablast [1] 
Faster batch blastn program that uses a greedy algorithm. Works in contiguous or more sensitive discontiguous mode
rpsblast [1] 
Reverse PSI-BLAST program for searching against a conserved domain database
seedtop [2] 
Pattern search program

Notes:

  1. These programs have been reorganized into blastn, blastp, blastx, tblastn, tblastx, rpsblast, rpstblastn, psiblast, blastdbcmd and makeblastdb.
  2. These programs have no blast+ counterpart at this time.

The commands for legacy blast, comparable to those given for blast+ in section 6, are:

blastall -
fastacmd -d refseq_rna -s nm_000249 -o test_query.fa
blastall -p blastn -i test_query.fa -d refseq_rna -F F -m 9 -b 2 -v 2
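
For comparison, the two complete legacy calls above could be written with blast+ roughly as follows. This is a sketch under stated assumptions, not a verified translation: the flag mapping (-d to -db, -s to -entry, -i to -query, -p blastn to -task blastn, -F F to -dust no, -m 9 to -outfmt 7) follows the usual blast+ option names, and -max_target_seqs 2 is an assumed stand-in for -b 2 -v 2 when tabular output is requested. The run helper is hypothetical; it only prints each command unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch: blast+ counterparts of the legacy commands above.
# Hypothetical helper: print the command by default, execute it when RUN=1.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Legacy: fastacmd -d refseq_rna -s nm_000249 -o test_query.fa
run blastdbcmd -db refseq_rna -entry nm_000249 -out test_query.fa

# Legacy: blastall -p blastn -i test_query.fa -d refseq_rna -F F -m 9 -b 2 -v 2
# -max_target_seqs 2 is an assumption standing in for -b 2 -v 2.
run blastn -task blastn -query test_query.fa -db refseq_rna \
    -dust no -outfmt 7 -max_target_seqs 2
```

Running the script with RUN=1 requires blast+ on the PATH and a local refseq_rna database.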

Example usage

Extract all human sequences from the nr database

Although one cannot select GIs by taxonomy from a database, a combination of Linux command line tools will accomplish this:

$ blastdbcmd -db nr -entry all -outfmt "%g %T" | \
   awk ' { if ($2 == 9606) { print $1 } } ' | \
   blastdbcmd -db nr -entry_batch - -out human_sequences.txt

The first blastdbcmd invocation prints two fields per sequence (GI and taxonomy ID). The awk command selects the sequences whose taxonomy ID is 9606 (i.e., human) and prints their GIs. Finally, the second blastdbcmd invocation reads those GIs and writes the sequence data for the human sequences in the nr database to human_sequences.txt.
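
The filtering step can be tried without a BLAST database. The sketch below feeds a few made-up "GI taxid" lines (illustrative values, not real nr records) through the same awk test; the function name select_gis_by_taxid is hypothetical and simply makes the taxonomy ID a parameter.

```shell
#!/bin/sh
# Hypothetical helper: read "GI taxid" pairs on stdin and print the GIs
# whose taxonomy ID equals the one given as the first argument.
select_gis_by_taxid() {
    awk -v taxid="$1" '$2 == taxid { print $1 }'
}

# Made-up sample in the "%g %T" layout (GI, then taxonomy ID).
sample='1001 9606
1002 10090
1003 9606'

# Keep only the GIs annotated with taxonomy ID 9606 (human).
printf '%s\n' "$sample" | select_gis_by_taxid 9606
```

In the real pipeline, the first blastdbcmd call would replace the printf line and the second blastdbcmd call would consume the function's output through -entry_batch -.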

See also

External links