Difference between revisions of "BLAST"

From Christoph's Personal Wiki

Revision as of 03:08, 9 September 2007

BLAST (the Basic Local Alignment Search Tool) is a service of the National Center for Biotechnology Information (NCBI). A nucleotide or protein sequence sent to the BLAST server is compared against databases at the NCBI and a summary of matches is returned to the user.

The latest stable version is: 2.2.16 (2007-03-25)

Installation

Note: The following is based on the "README for stand-alone BLAST" document. It will only cover the process for Linux systems.

Download

Download the latest version of BLAST for Linux from ftp://ftp.ncbi.nih.gov/blast/ (the archive will be called something like blast-2.2.16-ia32-linux.tar.gz).

Configuration file

For Standalone BLAST to operate, you will need a .ncbirc file that contains the following lines:

[NCBI] 
Data="path/data/"

where "path/data/" is the path to the Standalone BLAST "data" subdirectory. For example:

Data=/home/blast/data

The data subdirectory should appear automatically in the directory where the downloaded archive was extracted. Note that in many cases it may be necessary to give the entire path, including the machine name and/or the network you are located on. Your systems administrator can help if you do not know the full path to the data subdirectory.

Make sure that your .ncbirc file is either in the directory from which you call the Standalone BLAST program or in your home directory (the README calls this your "root" directory).
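The configuration file is small enough to generate from a script. The following is a minimal sketch, not part of the BLAST distribution: the helper name write_ncbirc and its arguments are hypothetical, and the only assumption about the file is the two-line layout shown above.

```python
from pathlib import Path


def write_ncbirc(data_dir, target_dir):
    """Write a minimal .ncbirc that points at the BLAST "data" subdirectory.

    data_dir   -- full path to the data subdirectory (e.g. /home/blast/data)
    target_dir -- directory in which the .ncbirc file should be created
    """
    target = Path(target_dir) / ".ncbirc"
    # Two lines, as described above: the [NCBI] section and the Data path.
    target.write_text(f"[NCBI]\nData={data_dir}\n")
    return target
```

For example, write_ncbirc("/home/blast/data", Path.home()) would place the file in the home directory of the current user.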

Format your BLAST database files

The main advantage of Standalone BLAST is the ability to create your own BLAST databases. This can be done with any file of FASTA-formatted protein or nucleotide sequences. If you are interested in creating your own database files, refer to the sections "Non-redundant defline syntax" and "Appendix 1: Sequence Identifier Syntax" of the README in the BLAST database directory (ftp://ftp.ncbi.nih.gov/blast/db/). You can also refer to the FASTA description available from the BLAST search pages (http://www.ncbi.nlm.nih.gov/BLAST/fasta.html).

However, for testing purposes you should first download one of the NCBI databases and run a search against it.

In the BLAST database FTP directory (ftp://ftp.ncbi.nih.gov/blast/db/) you will find the downloadable BLAST database files. For your first search we recommend downloading something relatively small, like FASTA/ecoli.nt.gz (1349 Kb). This is a compressed, FASTA-formatted file of nucleotide sequences. Once it is uncompressed, you will need to format the database using the formatdb program, which comes with your Standalone BLAST executable. The list of arguments for this program and all other BLAST programs is located at the end of the README in the Standalone BLAST FTP directory (ftp://ftp.ncbi.nih.gov/blast/executable/). Alternatively, you can get these arguments by running any of the BLAST programs (formatdb, blastall, etc.) with a single hyphen as the argument (example: formatdb -). This article only shows the basic commands for formatting the database and running your first search.
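The downloaded file has to be uncompressed before formatdb can read it. On Linux this is normally done with gunzip; the sketch below does the same thing with the Python standard library, in case the step needs to be scripted. The function name and file names are illustrative only.

```python
import gzip
import shutil


def gunzip(src, dst):
    """Decompress the gzip file src into the plain file dst."""
    with gzip.open(src, "rb") as compressed, open(dst, "wb") as plain:
        # Stream the data so that large databases are not loaded into memory.
        shutil.copyfileobj(compressed, plain)
    return dst
```

A call such as gunzip("ecoli.nt.gz", "ecoli.nt") leaves the compressed file in place and writes the FASTA file next to it.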

To format the ecoli.nt database run the following from the command line:

formatdb -i ecoli.nt -p F -o T

This will create seven index files that Standalone BLAST needs to perform searches and produce results. The ecoli.nt file itself is no longer needed once formatdb has finished, so you can delete it.
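The -p argument tells formatdb whether the input is protein (T) or nucleotide (F), so it has to match the contents of the FASTA file. The sketch below guesses the right value by counting how many sequence letters are nucleotide codes. The 90% threshold and both function names are assumptions made for illustration; this is a heuristic, not something formatdb does itself.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def formatdb_type_flag(path, threshold=0.9):
    """Return "F" (nucleotide) or "T" (protein) for the formatdb -p argument."""
    letters = "".join(seq for _, seq in read_fasta(path)).upper()
    if not letters:
        raise ValueError("no sequence data found in " + str(path))
    # Count the letters that are plausible nucleotide codes.
    nucleotide = sum(letters.count(base) for base in "ACGTUN")
    return "F" if nucleotide / len(letters) >= threshold else "T"
```

For the ecoli.nt file used here, the function would be expected to return "F", which matches the -p F argument in the command above.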

Next create a test nucleotide file to run against the new database. It may be easier to 'cheat' here and just extract a portion of a nucleotide sequence you know is in the downloaded ecoli.nt database.
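The "cheat" can be scripted: take the first record of the downloaded FASTA file and write a slice of it as the query. This is a minimal sketch under the assumption that the file is plain FASTA; the function name, the default slice length, and the 70-column line width are illustrative choices.

```python
def make_query(db_fasta, out_path, start=0, length=200, name="Test"):
    """Write part of the first sequence in db_fasta to out_path as a FASTA query."""
    parts = []
    seen_header = False
    with open(db_fasta) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if seen_header:
                    break  # stop when the second record begins
                seen_header = True
            elif seen_header and line:
                parts.append(line)
    fragment = "".join(parts)[start:start + length]
    if not fragment:
        raise ValueError("no sequence found in the requested range")
    with open(out_path, "w") as out:
        out.write(">" + name + "\n")
        # Wrap the sequence at 70 characters per line.
        for i in range(0, len(fragment), 70):
            out.write(fragment[i:i + 70] + "\n")
    return fragment
```

For example, make_query("ecoli.nt", "test.txt") would write the first 200 bases of the first database record to test.txt under the header >Test.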

Make a text file called test.txt with the following sequence:

>Test
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG
AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT

To run the first search enter the following command from the Linux command line in your BLAST directory:

blastall -p blastn -d ecoli.nt -i test.txt -o test.out

This should generate a results file called test.out in the Standalone BLAST directory.
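If the search is to be run from a script rather than typed by hand, the same command line can be assembled and executed with the standard subprocess module. The sketch below uses only the arguments shown above (-p, -d, -i, -o); the function names are hypothetical, and run_blastall assumes that the blastall executable is on the PATH.

```python
import shutil
import subprocess


def blastall_command(program, database, query, output):
    """Build the blastall argument list used in this article."""
    return ["blastall", "-p", program, "-d", database, "-i", query, "-o", output]


def run_blastall(program, database, query, output):
    """Run blastall and return the path of the results file."""
    if shutil.which("blastall") is None:
        raise FileNotFoundError("blastall was not found on the PATH")
    # check=True raises CalledProcessError if blastall exits with an error.
    subprocess.run(blastall_command(program, database, query, output), check=True)
    return output
```

Calling run_blastall("blastn", "ecoli.nt", "test.txt", "test.out") is then equivalent to the command shown above.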

Now you are ready to create your own databases and run BLAST searches. For more information you should refer to the Standalone BLAST README (ftp://ftp.ncbi.nih.gov/blast/executable/) and the BLAST literature. This will give you some idea of all the programs BLAST supports and the use of different parameters for increasing or decreasing the stringency of your results.

BLAST Assembled Genomes

Note: The following are examples of assembled genomes that can be used as BLAST databases. See here for complete list.

  • Human
  • Mouse
  • Rat
  • Arabidopsis thaliana
  • Oryza sativa
  • Bos taurus
  • Danio rerio
  • Drosophila melanogaster
  • Gallus gallus
  • Pan troglodytes
  • Microbes
  • Apis mellifera

Basic BLAST

nucleotide blast
Search a nucleotide database using a nucleotide query (Algorithms: blastn, megablast, discontiguous megablast)
protein blast
Search a protein database using a protein query (Algorithms: blastp, psi-blast, phi-blast)
blastx
Search a protein database using a translated nucleotide query
tblastn
Search a translated nucleotide database using a protein query
tblastx
Search a translated nucleotide database using a translated nucleotide query
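The list above amounts to a small decision table keyed on the type of the query and the type of the database. A minimal sketch of that table follows; the function and its parameter names are hypothetical, and it covers only the five basic programs, not variants such as megablast or psi-blast.

```python
def choose_program(query_type, database_type, translated=False):
    """Pick a BLAST program from the query and database types.

    query_type and database_type are "dna" or "protein". The translated
    flag only matters for DNA against DNA, where it selects tblastx.
    """
    if query_type not in ("dna", "protein") or database_type not in ("dna", "protein"):
        raise ValueError("types must be 'dna' or 'protein'")
    if query_type == "protein" and database_type == "protein":
        return "blastp"
    if query_type == "dna" and database_type == "protein":
        return "blastx"   # the query is translated
    if query_type == "protein" and database_type == "dna":
        return "tblastn"  # the database is translated
    return "tblastx" if translated else "blastn"
```

The returned name is the value that would be passed to blastall with the -p argument.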

Blast Family of Programs

The BLAST family of programs allows all combinations of DNA or protein query sequences with searches against DNA or protein databases:

blastp
compares an amino acid query sequence against a protein sequence database.
blastn
compares a nucleotide query sequence against a nucleotide sequence database.
blastx
compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.
tblastn
compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).
tblastx
compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
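The "six-frame conceptual translation" used by blastx, tblastn and tblastx can be illustrated in a few lines: three reading frames on the given strand and three on its reverse complement. The sketch below assumes the standard genetic code and marks codons containing ambiguous bases with X; it shows the concept only and is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; '*' is a stop, 'X' an unknown codon."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Map frame numbers (+1..+3, -1..-3) to protein translations."""
    reverse = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames
```

For the short sequence ATGGCC, frame +1 reads ATG GCC and gives MA, while frame -1 reads the reverse complement GGC CAT and gives GH.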

The default matrix for all protein-protein comparisons is BLOSUM62.

Databases available for BLAST search

see: http://www.ncbi.nlm.nih.gov/blast/blast_databases.shtml

Overview of database search programs

  • Protein query, protein database: ssearch; fasta; blastp; PSI-blast (simplified profile search)
  • Protein query, DNA database: tfasta (translates database); tfastx3 (translates database, allowing frameshift gaps); tblastn (translates database)
  • DNA query, protein database: fastx3 (translates query, allowing frameshift gaps); blastx (translates query)
  • DNA query, DNA database: ssearch; fasta; blastn; tblastx (translates query and database; no gaps)

References

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.