Difference between revisions of "BLAST"
Revision as of 15:47, 5 July 2012
The Basic Local Alignment Search Tool (or BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.[1]
It is a service of the National Center for Biotechnology Information (NCBI). A nucleotide or protein sequence sent to the BLAST server is compared against databases at the NCBI and a summary of matches is returned to the user.
The latest stable version is: 2.2.25 (2011-03-31)
see also: formatdb, BLAST/matrices
Installation
Note: The following is based on the "README for stand-alone BLAST" document. It will only cover the process for Linux systems (and assume 32-bit OS).
Download
Download the latest version of BLAST for Linux from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ (the archive will be called something like ncbi-blast-2.2.25+-ia32-linux.tar.gz for 32-bit versions).
Configuration file
In order for Standalone BLAST to operate, you will need a .ncbirc file that contains the following lines:
[NCBI]
Data="path/data/"
where "path/data/" is the path to the location of the Standalone BLAST "data" subdirectory. For example:
Data=/home/blast/data
The data subdirectory should automatically appear in the directory where the downloaded file was extracted. Please note that in many cases it may be necessary to specify the entire path, including the machine name and/or the network you are located on. Your systems administrator can help you if you do not know the entire path to the data subdirectory.
Make sure that your .ncbirc file is either in the directory from which you call the Standalone BLAST program or in your home directory.
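As a minimal sketch, the configuration file can be written from the shell. The helper name `write_ncbirc` is hypothetical (it is not part of the BLAST distribution), and the path in the example call is only the sample path used above; substitute the directory where you extracted the archive.

```shell
# Hypothetical helper: write a two-line .ncbirc pointing at the BLAST data directory.
# $1 = path to the "data" subdirectory
# $2 = file to write (defaults to ./.ncbirc in the current directory)
write_ncbirc() {
    data_dir=$1
    target=${2:-./.ncbirc}
    printf '[NCBI]\nData=%s\n' "$data_dir" > "$target"
}

# Example call (the path is an assumption; use your own extraction directory):
# write_ncbirc /home/blast/data ./.ncbirc
```

Writing the file with `printf` rather than an editor keeps the section header and the `Data=` line on separate lines, which is what the format above expects.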
Format your BLAST database files
The main advantage of Standalone BLAST is to be able to create your own BLAST databases. This can be done with any file of FASTA formatted protein or nucleotide sequences. If you are interested in creating your own database files you should refer to the sections "Non-redundant defline syntax" and "Appendix 1: Sequence Identifier Syntax" of the README in the BLAST database directory (ftp://ftp.ncbi.nih.gov/blast/db/). You can also refer to the FASTA description available from the BLAST search pages (http://www.ncbi.nlm.nih.gov/BLAST/fasta.html).
However, for testing purposes you should download one of the NCBI databases and run a search against it.
In the BLAST database FTP directory (ftp://ftp.ncbi.nih.gov/blast/db/) you will find the downloadable BLAST database files. For your first search we recommend downloading something relatively small like FASTA/ecoli.nt.gz (1349 Kb). This is a compressed, FASTA-formatted file of nucleotide sequences. Once it is uncompressed, you will need to format the database using the formatdb program, which comes with your Standalone BLAST executable. The list of arguments for this program and all other BLAST programs is located at the end of the README in the Standalone BLAST FTP directory (ftp://ftp.ncbi.nih.gov/blast/executable/). Alternatively, you can get these arguments by running each of the BLAST programs (formatdb, blastall, etc.) with a single hyphen as the argument (example: formatdb -). For this article, we are just going to show you the basic commands for formatting the database and running your first search.
To format the ecoli.nt database, run the following from the command line:
formatdb -i ecoli.nt -p F -o T
This will create seven index files that Standalone BLAST needs to perform the searches and produce results. The ecoli.nt file is not needed once formatdb has finished, and you can delete it.
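Before deleting the source file, it may be worth confirming that the index files were actually written. The sketch below assumes the seven files carry the extensions typically produced for a nucleotide database formatted with `-o T` (`nhr`, `nin`, `nsq`, `nsd`, `nsi`, `nnd`, `nni`); that list and the helper name `missing_index_files` are assumptions, so check them against the files you actually see.

```shell
# Hypothetical helper: print the name of every expected index file that is missing.
# $1 = database name as passed to formatdb -i (for example, ecoli.nt)
# The extension list is an assumption for a nucleotide database built with -o T.
missing_index_files() {
    db=$1
    for ext in nhr nin nsq nsd nsi nnd nni; do
        [ -e "$db.$ext" ] || echo "$db.$ext"
    done
}

# Example call: prints nothing when all seven files are present.
# missing_index_files ecoli.nt
```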
Next, create a test nucleotide file to run against the new database. It may be easier to 'cheat' here and just extract a portion of a nucleotide sequence you know is in the downloaded ecoli.nt database.
Make a text file called test.txt with the following sequence:
>Test
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG
AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT
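The 'cheat' of taking a test query from the database itself could be scripted roughly as follows. This is only a sketch: the helper name `make_test_query` is hypothetical, and it assumes a plain multi-line FASTA file in which each record starts with a `>` header line.

```shell
# Hypothetical helper: build a FASTA record named "Test" from the first
# N sequence lines of the first record in a FASTA file.
# $1 = FASTA file (for example, the uncompressed ecoli.nt)
# $2 = number of sequence lines to keep (default 8)
make_test_query() {
    awk -v n="${2:-8}" '
        /^>/ { if (++records > 1) exit; print ">Test"; next }   # stop at the second header
        records == 1 { if (kept >= n) exit; print; kept++ }      # copy up to n sequence lines
    ' "$1"
}

# Example call (run before ecoli.nt is deleted):
# make_test_query ecoli.nt 8 > test.txt
```

Because the query is copied verbatim from the database, a correctly configured search should report a match covering the full query, which makes this a convenient sanity check.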
To run your first (test) search, enter the following command from the Linux command line in your BLAST directory:
blastall -p blastn -d ecoli.nt -i test.txt -o test.out
This should generate a results file called test.out in the Standalone BLAST directory.
Now you are ready to create your own databases and run BLAST searches. For more information, you should refer to the Standalone BLAST README (ftp://ftp.ncbi.nih.gov/blast/executable/) and the BLAST literature. This will give you some idea of all the programs BLAST supports and the use of different parameters for increasing or decreasing the stringency of your results.
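One common way to adjust stringency is the expectation-value cutoff: blastall accepts `-e` for the E-value threshold and `-m 8` for tab-separated output. The sketch below post-filters such tabular output; it assumes the usual 12-column layout with the E-value in column 11, and the helper name `filter_hits_by_evalue` and the file name `test.tab` are hypothetical.

```shell
# Hypothetical helper: keep only tabular hits at or below an E-value cutoff.
# $1 = tabular BLAST output (assumed: 12 tab-separated columns, E-value in column 11)
# $2 = cutoff (default 1e-10)
filter_hits_by_evalue() {
    awk -F '\t' -v max="${2:-1e-10}" '$11 + 0 <= max + 0' "$1"
}

# Example (commented out; requires blastall and the formatted ecoli.nt database):
# blastall -p blastn -d ecoli.nt -i test.txt -e 1e-5 -m 8 -o test.tab
# filter_hits_by_evalue test.tab 1e-20
```

Lowering the cutoff makes the search more stringent (fewer, stronger hits); raising it does the opposite.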
Standalone BLAST tools and utilities
- bl2seq
- performs a comparison between two sequences using either the blastn or blastp algorithm. Both sequences must be either nucleotides or proteins.
- blastall
- may be used to perform all five flavors of blast comparison.
- blastclust
- automatically and systematically clusters protein or DNA sequences based on pairwise matches found using the BLAST algorithm for proteins or the MegaBLAST algorithm for DNA.
- blastpgp
- performs gapped blastp searches and can be used to perform iterative searches in psi-blast and phi-blast mode.
- copymat
- copy matrices
- fastacmd
- retrieves FASTA formatted sequences from a blast database, as long as it was successfully formatted using the -o option.
- formatdb
- must be used in order to format protein or nucleotide source databases before these databases can be searched by blastall, blastpgp, or MegaBLAST.
- formatrpsdb
- a utility that converts a collection of input sequences into a database suitable for use with Reverse Position Specific (RPS) Blast.
- impala
- IMPALA: Integrating Matrix Profiles And Local Alignments
- makemat
- make matrices
- megablast
- uses a greedy algorithm for nucleotide sequence alignment search and concatenates many queries to save time spent scanning the database.
- rpsblast
- (Reverse PSI-BLAST) searches a query sequence against a database of profiles.
- seedtop
BLAST Assembled Genomes
Note: The following are examples of assembled genomes that can be used as BLAST databases. See here for complete list.
- Human
- Mouse
- Rat
- Arabidopsis thaliana
- Oryza sativa
- Bos taurus
- Danio rerio
- Drosophila melanogaster
- Gallus gallus
- Pan troglodytes
- Microbes
- Apis mellifera
Basic BLAST
- nucleotide blast
- Search a nucleotide database using a nucleotide query (Algorithms: blastn, megablast, discontiguous megablast)
- protein blast
- Search protein database using a protein query (Algorithms: blastp, psi-blast, phi-blast)
- blastx
- Search protein database using a translated nucleotide query
- tblastn
- Search translated nucleotide database using a protein query
- tblastx
- Search translated nucleotide database using a translated nucleotide query
Blast Family of Programs
The BLAST family of programs allows all combinations of DNA or protein query sequences with searches against DNA or protein databases:
- blastp
- compares an amino acid query sequence against a protein sequence database.
- blastn
- compares a nucleotide query sequence against a nucleotide sequence database.
- blastx
- compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.
- tblastn
- compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).
- tblastx
- compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
The default matrix for all protein-protein comparisons is BLOSUM62.
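The six reading frames mentioned for blastx, tblastn and tblastx can be illustrated with a small sketch: three frames start at offsets 0, 1 and 2 of the sequence, and three more at the same offsets of its reverse complement. The helper names and the `+1`..`-3` labels are illustrative assumptions, and the sketch only prints the nucleotide string for each frame; the actual translation to amino acids (which needs a codon table) is left out.

```shell
# Reverse-complement a DNA sequence (uppercase A/C/G/T; other symbols pass through unchanged).
revcomp() {
    printf '%s\n' "$1" |
        awk '{ out = ""; for (i = length($0); i > 0; i--) out = out substr($0, i, 1); print out }' |
        tr 'ACGT' 'TGCA'
}

# Print the six reading frames as nucleotide strings, one per line,
# labelled +1..+3 (forward strand) and -1..-3 (reverse complement).
six_frames() {
    seq=$1
    rc=$(revcomp "$seq")
    for offset in 1 2 3; do
        printf '%s%d %s\n' '+' "$offset" "$(printf '%s\n' "$seq" | cut -c "$offset"-)"
    done
    for offset in 1 2 3; do
        printf '%s%d %s\n' '-' "$offset" "$(printf '%s\n' "$rc" | cut -c "$offset"-)"
    done
}

# Example call:
# six_frames ATGCCG
```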
Databases available for BLAST search
see: http://www.ncbi.nlm.nih.gov/blast/blast_databases.shtml
Overview of database search programs
Query | Protein database | DNA database
---|---|---
Protein | ssearch | tfasta (translates database)
Protein | fasta | tfastx3 (translates database, allowing frame-shift gaps)
Protein | blastp | tblastn (translates database)
Protein | PSI-blast (simplified profile search) |
DNA | fastx3 (translates query, allowing frameshift gaps) | ssearch
DNA | blastx (translates query) | fasta
DNA | | blastn
DNA | | tblastx (translates query and database; no gaps)
Database updates
Date | Database | Release | #Entries | #Residues
---|---|---|---|---
07/09/09 | nr-nt | 07-09-08 (Sep 07) | 76,464,199 | 80,213,348,830
07/09/09 | nr-aa | 07-09-09 (Sep 07) | 4,854,027 | 1,625,198,666
07/09/09 | genbank | 160.0 (Jun 07) | 73,078,143 | 77,248,690,945
07/09/09 | genbank-upd | 160.0+/09-08 (Sep 07) | 4,733,502 | 7,144,085,480
07/09/09 | refseq | 07-09-09 (Sep 07) | 7,008,479 | 10,227,394,557
07/09/09 | refnuc | 07-09-09 (Sep 07) | 2,561,839 | 8,654,142,841
07/09/09 | refpep | 07-09-09 (Sep 07) | 4,446,640 | 1,573,251,716
07/09/09 | embl | 91.0 (Jun 07) | 97,361,640 | 170,766,876,848
07/09/09 | embl-upd | 91.0+/09-04 (Sep 07) | 4,568,591 | 7,011,811,575
07/09/09 | dbest | 07-09-09 (Sep 07) | 45,549,863 | 24,954,650,044
07/09/09 | dbgss | 07-09-07 (Sep 07) | 20,944,229 | 13,583,822,738
07/09/09 | dbsts | 07-09-02 (Sep 07) | 930,348 | 522,183,185
07/09/09 | htgs | 07-09-09 (Sep 07) | 112,896 | 18,826,003,094
07/08/22 | swissprot | 54.1 (Aug 07) | 277,883 | 101,975,253
07/08/23 | trembl | 37.1 (Aug 07) | 4,754,787 | 1,543,116,088
05/06/30 | pir | 80.00 (Dec 04) | 283,416 | 96,212,201
07/08/06 | prf | 116.0 (Jul 07) | 832,123 | 281,306,426
07/09/09 | genpept | 160.0 (Jun 07) | 4,442,636 | 1,366,176,208
07/09/09 | genpept-upd | 160.0+/09-08 (Sep 07) | 520,706 | 168,812,218
07/09/09 | pdb | 07-09-03 (Sep 07) | 45,506 | 25,428,104
07/09/09 | pdbstr | 07-09-03 (Sep 07) | 108,380 | 24,873,882
07/01/27 | epd | 89.0 (Jan 07) | 4,806 | 76,896,000
06/11/27 | prosite | 20.0 (Oct 06) | 2,006 |
06/11/27 | prosdoc | 20.0 (Oct 06) | 1,449 |
07/05/31 | blocks | 14.3 (Apr 07) | 29,068 |
06/03/20 | prints | 43.0 (Mar 06) | 1,900 |
04/01/27 | prodom | 2003.1 (Jan 04) | 391,935 |
07/07/25 | pfam | 22.0 (Jul 07) | 9,318 |
07/03/31 | pmd | Mar-07 (Mar 07) | 45,239 |
06/08/14 | aaindex | 9.1 (Aug 06) | 685 |
07/07/20 | expression | 07-07-20 (Jul 07) | 499 |
07/01/31 | litdb | 32-22 (Dec 06) | 509,670 |
07/09/08 | omim | 07-09-09 (Sep 07) | 18,876 |
07/09/09 | pathway | 43.0+/09-09 (Sep 07) | 56,987 |
07/09/09 | brite | 43.0+/09-09 (Sep 07) | 8,069 |
07/09/09 | orthology | 43.0+/09-09 (Sep 07) | 10,236 |
07/09/07 | genome | 07-07-26+/09-07 (Sep 07) | 666 |
07/09/09 | genes | 43.0+/09-09 (Sep 07) | 2,616,500 | 3,654,863,702
07/09/05 | dgenes | 43.0+/09-05 (Sep 07) | 286,440 | 463,517,479
06/09/07 | egenes | 43.0+/09-07 (Sep 06) | 448,730 | 382,364,821
07/07/30 | vgenes | 43.0+/07-30 (Jul 07) | 52,529 | 61,419,585
07/07/30 | ogenes | 43.0+/07-30 (Jul 07) | 58,672 | 36,917,463
07/09/09 | compound | 43.0+/09-09 (Sep 07) | 14,886 |
07/09/09 | drug | 43.0+/09-09 (Sep 07) | 6,614 |
07/09/09 | glycan | 43.0+/09-09 (Sep 07) | 10,972 |
07/09/09 | reaction | 43.0+/09-09 (Sep 07) | 7,226 |
07/09/09 | rpair | 43.0+/09-09 (Sep 07) | 7,295 |
07/09/09 | enzyme | 43.0+/09-09 (Sep 07) | 4,928 |
07/09/08 | linkdb | 07-09-08 (Sep 07) | 300,844,213 |
See also
References
- [1] Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997). "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs". Nucleic Acids Res, 25:3389-3402.
Further reading
External links
- official BLAST website
- Stand-alone BLAST binaries
- DBGET - release info
- O'Reilly BLAST book on Google Books
Tutorials
- NCBI BLAST tutorial
- BLAST tutorial — by openwetware.org
- Using the Basic Local Alignment Search Tool (BLAST) — by Cold Spring Harbor