BLAST

From Christoph's Personal Wiki

The Basic Local Alignment Search Tool (or BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.[1]

It is a service of the National Center for Biotechnology Information (NCBI). A nucleotide or protein sequence sent to the BLAST server is compared against databases at the NCBI and a summary of matches is returned to the user.

The latest stable version is: 2.2.25 (2011-03-31)

see also: BLAST+, formatdb, BLAST/matrices

Installation

Note: The following is based on the "README for stand-alone BLAST" document. It covers only the process for Linux systems and assumes a 32-bit OS.

Download

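Download the latest version of BLAST for Linux from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ (the archive will be called something like ncbi-blast-2.2.25+-ia32-linux.tar.gz for 32-bit versions). Note that this directory holds the BLAST+ release (see BLAST+); the formatdb and blastall programs used in the rest of this section belong to the older, legacy BLAST package.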

Configuration file

In order for Standalone BLAST to operate, you will need a .ncbirc file that contains the following lines:

[NCBI] 
Data="path/data/"

where "path/data/" is the path to the location of the Standalone BLAST "data" subdirectory. For example:

Data=/home/blast/data

The data subdirectory should automatically appear in the directory where the downloaded file was extracted. Please note that in many cases it may be necessary to give the entire path, including the machine name and/or the network you are located on. Your systems administrator can help you if you do not know the entire path to the data subdirectory.

Make sure that your .ncbirc file is either in the directory that you call the Standalone BLAST program from or in your home directory.
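The configuration file described above can be generated from the command line. The following is a minimal sketch: write_ncbirc is a hypothetical helper name (not part of BLAST), /home/blast/data is the placeholder path from the example above, and the file is written to the current directory unless a second argument names another one.

```shell
#!/bin/sh
# write_ncbirc DATA_DIR [TARGET_DIR]
# Hypothetical helper: writes a two-line .ncbirc that points BLAST at DATA_DIR.
# TARGET_DIR defaults to the current directory (an assumption for this sketch).
write_ncbirc() {
    data_dir=$1
    target_dir=${2:-.}
    printf '[NCBI]\nData=%s\n' "$data_dir" > "$target_dir/.ncbirc"
}

# Example call (placeholder path, adjust to your own installation):
# write_ncbirc /home/blast/data
```

Running cat .ncbirc afterwards is a quick way to confirm that the Data line holds the path you intended.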

Format your BLAST database files

The main advantage of Standalone BLAST is the ability to create your own BLAST databases. This can be done with any file of FASTA-formatted protein or nucleotide sequences. If you are interested in creating your own database files you should refer to the sections "Non-redundant defline syntax" and "Appendix 1: Sequence Identifier Syntax" of the README in the BLAST database directory (ftp://ftp.ncbi.nih.gov/blast/db/). You can also refer to the FASTA description available from the BLAST search pages (http://www.ncbi.nlm.nih.gov/BLAST/fasta.html).

However, for testing purposes you should download one of the NCBI databases and run a search against it.

In the BLAST database FTP directory (ftp://ftp.ncbi.nih.gov/blast/db/) you will find the downloadable BLAST database files. For your first search we recommend downloading something relatively small like FASTA/ecoli.nt.gz (1349 Kb). This is a compressed, FASTA-formatted file of nucleotide sequences. Once it is uncompressed, you will need to format the database using the formatdb program, which comes with your Standalone BLAST executables. The list of arguments for this program and all other BLAST programs is located at the end of the README in the Standalone BLAST FTP directory (ftp://ftp.ncbi.nih.gov/blast/executables/). Alternatively, you can list these arguments by running any of the BLAST programs (formatdb, blastall, etc.) with a single hyphen as the argument (example: formatdb -). This article shows only the basic commands for formatting the database and running your first search.
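Uncompressing the downloaded file can be sketched as follows. unpack_fasta is a hypothetical helper name; it assumes gunzip is installed and that the file name ends in .gz. As a sanity check it also prints the number of FASTA records (lines starting with ">") in the uncompressed file.

```shell
#!/bin/sh
# unpack_fasta FILE.gz
# Hypothetical helper: uncompresses FILE.gz to FILE (keeping the .gz)
# and prints how many FASTA records the uncompressed file contains.
unpack_fasta() {
    out=${1%.gz}                 # strip the .gz suffix to get the output name
    gunzip -c "$1" > "$out"      # -c writes to stdout, so the archive is kept
    grep -c '^>' "$out"          # each record starts with a ">" header line
}

# Example call, assuming ecoli.nt.gz was downloaded into the current directory:
# unpack_fasta ecoli.nt.gz
```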

To format the ecoli.nt database run the following from the command line:

formatdb -i ecoli.nt -p F -o T

This will create seven index files that Standalone BLAST needs to perform the searches and produce results. The ecoli.nt file is no longer needed once formatdb has finished and can be deleted.
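In the formatdb command above, -i names the input FASTA file, -p states whether the input is protein (T) or nucleotide (F), and -o T asks formatdb to parse the sequence identifiers and build indexes. The sketch below only assembles the command line for either sequence type; formatdb_cmd and the type names nucl and prot are hypothetical conventions chosen for this example.

```shell
#!/bin/sh
# formatdb_cmd FASTA_FILE TYPE
# Hypothetical helper: prints (does not run) the formatdb command line.
# TYPE is "nucl" or "prot"; these labels are an assumption of this sketch.
formatdb_cmd() {
    case $2 in
        nucl) p=F ;;   # -p F: input is nucleotide
        prot) p=T ;;   # -p T: input is protein
        *) echo "type must be nucl or prot" >&2; return 1 ;;
    esac
    echo "formatdb -i $1 -p $p -o T"
}

# Example: print the command for the E. coli nucleotide file.
# formatdb_cmd ecoli.nt nucl
```

Printing the command instead of running it makes it easy to check the flags before formatting a large database.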

Next create a test nucleotide file to run against the new database. It may be easier to 'cheat' here and just extract a portion of a nucleotide sequence you know is in the downloaded ecoli.nt database.

Make a text file called test.txt with the following sequence:

>Test
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG
AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT
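A query like the one above can also be cut directly from a FASTA file. The following sketch uses awk to copy the first few sequence lines of the first record under a new ">Test" header; first_fasta_lines is a hypothetical helper name, and it should be run on the uncompressed ecoli.nt before that file is deleted.

```shell
#!/bin/sh
# first_fasta_lines FASTA_FILE N
# Hypothetical helper: prints a ">Test" header followed by the first N
# sequence lines of the first record in FASTA_FILE.
first_fasta_lines() {
    awk -v n="$2" '
        /^>/ { if (seen++) exit; print ">Test"; next }   # stop at the 2nd record
        seen { if (count++ < n) print; else exit }        # copy at most n lines
    ' "$1"
}

# Example: build an 8-line query from the downloaded database file.
# first_fasta_lines ecoli.nt 8 > test.txt
```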

To run your first (test) search, enter the following command from the Linux command line in your BLAST directory:

blastall -p blastn -d ecoli.nt -i test.txt -o test.out

This should generate a results file called test.out in the Standalone BLAST directory.
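A quick way to check the outcome without reading the whole report is to look for the hit-list heading. The sketch below assumes the default pairwise report of blastall, in which a search with matches contains the line "Sequences producing significant alignments"; has_hits is a hypothetical helper name, and other output formats would need a different check.

```shell
#!/bin/sh
# has_hits RESULT_FILE
# Hypothetical helper: succeeds (exit status 0) if the report lists any hits.
# Assumes the default pairwise text report format.
has_hits() {
    grep -q 'Sequences producing significant alignments' "$1"
}

# Example:
# if has_hits test.out; then echo "hits found"; else echo "no hits"; fi
```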

Now you are ready to create your own databases and run BLAST searches. For more information, you should refer to the Standalone BLAST README (ftp://ftp.ncbi.nih.gov/blast/executables/) and the BLAST literature. These will give you an idea of all the programs BLAST supports and of the parameters that increase or decrease the stringency of your results.

Standalone BLAST tools and utilities

bl2seq
performs a comparison between two sequences using either the blastn or blastp algorithm. Both sequences must be either nucleotides or proteins.
blastall
may be used to perform all five flavors of blast comparison.
blastclust
automatically and systematically clusters protein or DNA sequences based on pairwise matches, which are found using the BLAST algorithm for proteins or the MegaBLAST algorithm for DNA.
blastpgp
performs gapped blastp searches and can be used to perform iterative searches in psi-blast and phi-blast mode.
copymat
copy matrices
fastacmd
retrieves FASTA-formatted sequences from a BLAST database, as long as it was successfully formatted using the -o option.
formatdb
must be used in order to format protein or nucleotide source databases before these databases can be searched by blastall, blastpgp, or MegaBLAST.
formatrpsdb
a utility that converts a collection of input sequences into a database suitable for use with Reverse Position Specific (RPS) Blast.
impala
IMPALA: Integrating Matrix Profiles And Local Alignments
makemat
make matrices
megablast
uses a greedy algorithm for nucleotide sequence alignment search and concatenates many queries to save time spent scanning the database.
rpsblast
(Reverse PSI-BLAST) searches a query sequence against a database of profiles.
seedtop

BLAST Assembled Genomes

Note: The following are examples of assembled genomes that can be used as BLAST databases. See here for complete list.

  • Human
  • Mouse
  • Rat
  • Arabidopsis thaliana
  • Oryza sativa
  • Bos taurus
  • Danio rerio
  • Drosophila melanogaster
  • Gallus gallus
  • Pan troglodytes
  • Microbes
  • Apis mellifera

Basic BLAST

nucleotide blast 
Search a nucleotide database using a nucleotide query (Algorithms: blastn, megablast, discontiguous megablast)
protein blast 
Search a protein database using a protein query (Algorithms: blastp, psi-blast, phi-blast)
blastx 
Search a protein database using a translated nucleotide query
tblastn 
Search a translated nucleotide database using a protein query
tblastx 
Search a translated nucleotide database using a translated nucleotide query

Blast Family of Programs

The BLAST family of programs allows all combinations of DNA or protein query sequences with searches against DNA or protein databases:

blastp
compares an amino acid query sequence against a protein sequence database.
blastn
compares a nucleotide query sequence against a nucleotide sequence database.
blastx
compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.
tblastn
compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).
tblastx
compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

Note: The default matrix for all protein-protein comparisons is BLOSUM62.
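The choice among the five programs follows directly from the query and database types. The sketch below encodes that lookup; blast_program, the labels nucl and prot, and the optional third argument are hypothetical conventions for this example. The printed name is the value that would be passed to blastall with -p.

```shell
#!/bin/sh
# blast_program QUERY_TYPE DB_TYPE [TRANSLATE_BOTH]
# Hypothetical helper: prints the BLAST program name for a query/database pair.
# QUERY_TYPE and DB_TYPE are "nucl" or "prot"; TRANSLATE_BOTH is "yes" to
# compare six-frame translations of a nucleotide query and nucleotide database.
blast_program() {
    query=$1 db=$2 translate_both=${3:-no}
    case "$query:$db" in
        prot:prot) echo blastp ;;
        nucl:prot) echo blastx ;;    # query is translated
        prot:nucl) echo tblastn ;;   # database is translated
        nucl:nucl)
            if [ "$translate_both" = yes ]; then
                echo tblastx         # query and database are translated
            else
                echo blastn
            fi ;;
        *) echo "types must be nucl or prot" >&2; return 1 ;;
    esac
}

# Example: which program compares a protein query with a nucleotide database?
# blast_program prot nucl
```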

Databases available for BLAST search

see: http://www.ncbi.nlm.nih.gov/blast/blast_databases.shtml

Overview of database search programs

Query            Protein database                                     DNA database
---------------  ---------------------------------------------------  -------------------------------------------------------
Protein          ssearch                                              tfasta (translates database)
                 fasta                                                tfastx3 (translates database, allowing frameshift gaps)
                 blastp                                               tblastn (translates database)
                 PSI-blast (simplified profile search)
DNA              fastx3 (translates query, allowing frameshift gaps)  ssearch
                 blastx (translates query)                            fasta
                                                                      blastn
                                                                      tblastx (translates query and database; no gaps)


Database updates

  Date       Database             Release             #Entries       #Residues
--------  -------------   ------------------------  ------------  ----------------
07/09/09  nr-nt                  07-09-08 (Sep 07)    76,464,199    80,213,348,830
07/09/09  nr-aa                  07-09-09 (Sep 07)     4,854,027     1,625,198,666
07/09/09  genbank                   160.0 (Jun 07)    73,078,143    77,248,690,945
07/09/09  genbank-upd        160.0+/09-08 (Sep 07)     4,733,502     7,144,085,480
07/09/09  refseq                 07-09-09 (Sep 07)     7,008,479    10,227,394,557
07/09/09  refnuc                 07-09-09 (Sep 07)     2,561,839     8,654,142,841
07/09/09  refpep                 07-09-09 (Sep 07)     4,446,640     1,573,251,716
07/09/09  embl                       91.0 (Jun 07)    97,361,640   170,766,876,848
07/09/09  embl-upd            91.0+/09-04 (Sep 07)     4,568,591     7,011,811,575
07/09/09  dbest                  07-09-09 (Sep 07)    45,549,863    24,954,650,044
07/09/09  dbgss                  07-09-07 (Sep 07)    20,944,229    13,583,822,738
07/09/09  dbsts                  07-09-02 (Sep 07)       930,348       522,183,185
07/09/09  htgs                   07-09-09 (Sep 07)       112,896    18,826,003,094
07/08/22  swissprot                  54.1 (Aug 07)       277,883       101,975,253
07/08/23  trembl                     37.1 (Aug 07)     4,754,787     1,543,116,088
05/06/30  pir                       80.00 (Dec 04)       283,416        96,212,201
07/08/06  prf                       116.0 (Jul 07)       832,123       281,306,426
07/09/09  genpept                   160.0 (Jun 07)     4,442,636     1,366,176,208
07/09/09  genpept-upd        160.0+/09-08 (Sep 07)       520,706       168,812,218
07/09/09  pdb                    07-09-03 (Sep 07)        45,506        25,428,104
07/09/09  pdbstr                 07-09-03 (Sep 07)       108,380        24,873,882
07/01/27  epd                        89.0 (Jan 07)         4,806        76,896,000
06/11/27  prosite                    20.0 (Oct 06)         2,006                  
06/11/27  prosdoc                    20.0 (Oct 06)         1,449                  
07/05/31  blocks                     14.3 (Apr 07)        29,068                  
06/03/20  prints                     43.0 (Mar 06)         1,900                  
04/01/27  prodom                   2003.1 (Jan 04)       391,935                  
07/07/25  pfam                       22.0 (Jul 07)         9,318                  
07/03/31  pmd                      Mar-07 (Mar 07)        45,239                  
06/08/14  aaindex                     9.1 (Aug 06)           685                  
07/07/20  expression             07-07-20 (Jul 07)           499                  
07/01/31  litdb                     32-22 (Dec 06)       509,670                  
07/09/08  omim                   07-09-09 (Sep 07)        18,876                  
07/09/09  pathway             43.0+/09-09 (Sep 07)        56,987                  
07/09/09  brite               43.0+/09-09 (Sep 07)         8,069                  
07/09/09  orthology           43.0+/09-09 (Sep 07)        10,236                  
07/09/07  genome          07-07-26+/09-07 (Sep 07)           666                  
07/09/09  genes               43.0+/09-09 (Sep 07)     2,616,500     3,654,863,702
07/09/05  dgenes              43.0+/09-05 (Sep 07)       286,440       463,517,479
06/09/07  egenes              43.0+/09-07 (Sep 06)       448,730       382,364,821
07/07/30  vgenes              43.0+/07-30 (Jul 07)        52,529        61,419,585
07/07/30  ogenes              43.0+/07-30 (Jul 07)        58,672        36,917,463
07/09/09  compound            43.0+/09-09 (Sep 07)        14,886                  
07/09/09  drug                43.0+/09-09 (Sep 07)         6,614                  
07/09/09  glycan              43.0+/09-09 (Sep 07)        10,972                  
07/09/09  reaction            43.0+/09-09 (Sep 07)         7,226                  
07/09/09  rpair               43.0+/09-09 (Sep 07)         7,295                  
07/09/09  enzyme              43.0+/09-09 (Sep 07)         4,928                  
07/09/08  linkdb                 07-09-08 (Sep 07)   300,844,213

References

  1. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997). "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs". Nucleic Acids Res, 25:3389-3402.
