Difference between revisions of "ClustalW"
Latest revision as of 20:40, 6 November 2008
ClustalW is a multiple sequence alignment program for Unix and Linux (I am ignoring other operating systems in this article/tutorial). ClustalW is a major update and rewrite of the ClustalV program, which was described in Higgins et al., 1992.[1]
The main new features are a greatly improved (more sensitive) multiple alignment procedure for proteins and improved support for different file formats. This software was described in Thompson et al., 1994.[2]
The usage of ClustalW is largely the same as for ClustalV. Details of the new alignment algorithms are described in the manuscript by Thompson et al., 1994.[2]
Note: The latest version is 2.0.10 (2008-10-14)
Usage
Installation
- Create a directory in which to store the ClustalW executables and files, then extract the downloaded archive into it. For example (under your "home" directory):
mkdir Clustal
cd Clustal
mkdir ClustalW
- Compile source files
make
- Add binary path to environment. For example, in a bash console, edit the .bashrc file and add the following:
export PATH=/path/to/clustalw/binary:$PATH
Installation complete!
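A quick way to confirm that the PATH change took effect is to ask the system where the binary is. The following is only a minimal sketch in Python; the executable name `clustalw` is an assumption taken from the examples in this article, and your build may have installed it under a different name.

```python
import shutil


def find_executable(name="clustalw"):
    """Return the full path of `name` if it is found on PATH, otherwise None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_executable()
    if path:
        print("Found:", path)
    else:
        print("clustalw was not found on PATH; re-check the export line in .bashrc")
```

Remember that edits to .bashrc only apply to new shells (or after running `source ~/.bashrc`).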
File input (sequences to be aligned)
The sequences must all be in one file (or two files for a "profile alignment") in ONE of the following formats:
- FASTA (Pearson)
- NBRF/PIR
- EMBL/Swiss Prot
- GDE
- CLUSTAL
- GCG/MSF
- GCG9/RSF
The program tries to "guess" which format is being used and whether the sequences are nucleic acid (DNA/RNA) or amino acid (proteins). The format is recognised by the first characters in the file. This is crude, but it works most of the time, and it is difficult to do reliably any other way.
First characters in a file recognised by ClustalW

| Format | First non-blank word or character in the file |
|---|---|
| FASTA | > |
| NBRF | >P1; or >D1; |
| EMBL/SWISS | ID |
| GDE protein | % |
| GDE nucleotide | # |
| CLUSTAL | CLUSTAL (blocked multiple alignments) |
| GCG/MSF | PILEUP or !!AA_MULTIPLE_ALIGNMENT or !!NA_MULTIPLE_ALIGNMENT or MSF on the first line, and '..' at the end of line |
| GCG9/RSF | !!RICH_SEQUENCE |
DNA vs. Protein: the program counts the number of A, C, G, T, U, and N characters. If 85% or more of the characters in a sequence are among these, then DNA/RNA is assumed; otherwise the sequence is treated as protein.
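The guessing rules above can be sketched in a few lines of Python. This is not ClustalW's own code, just an illustration of the table and the 85% rule. The function names are hypothetical, and two details are assumptions on my part: the MSF test accepts any line ending in '..', and only letters are counted for the DNA/protein decision (gaps and digits are skipped).

```python
def guess_format(text):
    """Guess the sequence file format from its first non-blank word or line."""
    stripped = text.lstrip()
    if not stripped:
        return None
    first_word = stripped.split()[0]
    if first_word.startswith((">P1;", ">D1;")):
        return "NBRF/PIR"
    if first_word.startswith(">"):
        return "FASTA"
    if first_word == "ID":
        return "EMBL/Swiss Prot"
    if first_word.startswith("%"):
        return "GDE protein"
    if first_word.startswith("#"):
        return "GDE nucleotide"
    if first_word.startswith("CLUSTAL"):
        return "CLUSTAL"
    if first_word.startswith("!!RICH_SEQUENCE"):
        return "GCG9/RSF"
    lines = stripped.splitlines()
    keywords = ("PILEUP", "!!AA_MULTIPLE_ALIGNMENT", "!!NA_MULTIPLE_ALIGNMENT", "MSF")
    has_keyword = any(k in lines[0] for k in keywords)
    # Assumption: "'..' at the end of line" means some line ends with '..'
    has_dots = any(line.rstrip().endswith("..") for line in lines)
    if has_keyword and has_dots:
        return "GCG/MSF"
    return None


def guess_type(sequence, threshold=0.85):
    """Return 'DNA' if at least `threshold` of the letters are A, C, G, T, U or N."""
    residues = [c for c in sequence.upper() if c.isalpha()]
    if not residues:
        return None
    nucleotide_like = sum(c in "ACGTUN" for c in residues)
    return "DNA" if nucleotide_like / len(residues) >= threshold else "PROTEIN"
```

Note that a short protein rich in alanine, cysteine, glycine and threonine would be misclassified by this rule, which is why the `-TYPE=` option exists to set the type explicitly.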
File output
The alignments
The alignment output format can be set to CLUSTAL (a self-explanatory blocked alignment). Other output formats exist; however, I am only interested in the "CLUSTAL" output here.
The trees
The Alignment Algorithms
Terminal gaps
Speed of the initial (pairwise) alignments (fast approximate/slow accurate)
Delaying alignment of distant sequences
Iterative realignment/Reset gaps between alignments
Profile alignment
Notes on "Bootstrapping"
When you use the BOOTSTRAP option in ClustalW to estimate the reliability of parts of a tree, many of the uncorrected distances may exceed the arbitrary cutoff of 0.93 (sequences only 7% identical) if the sequences are distantly related. This happens at random: even if no pair of sequences is less than 7% identical, the bootstrap samples may contain pairs of sequences that do exceed this cutoff. If this happens, you will be warned and told how often the problem occurred. In practice, this can happen with many data sets, and it is not a serious problem if it happens rarely. If it happens often, you should consider removing the most distantly related sequences and/or using the PHYLIP package instead.
A further problem arises in almost exactly the opposite situation: when you bootstrap a data set which contains 3 or more sequences that are identical or almost identical. Here, the sets of identical sequences should be shown as a multifurcation (several sequences joining at the same part of the tree). Because the Neighbor-Joining method only gives strictly dichotomous trees (never more than 2 sequences join at one time), this cannot be represented exactly. In practice, this is NOT a problem for display, as there will be some internal branches of zero length separating the sequences; if you display the tree with all branch lengths, you will still see a multifurcation. However, when you bootstrap the tree, only the branching orders are stored and counted. In the case of multifurcations, the exact branching order is arbitrary, but the program will always get the same branching order, depending only on the input order of the sequences. In practice, this is only a problem when all of the sequences in a set are VERY similar. In this case, you can find very high support for some groupings which will disappear if you run the analysis with a different input order. Again, the PHYLIP package deals with this by offering a JUMBLE option to shuffle the input order of your sequences between each bootstrap sample.
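To see why the 0.93 cutoff can be exceeded "at random", it helps to look at what a bootstrap sample is: the alignment columns resampled with replacement. The sketch below is my own illustration in Python, not ClustalW's implementation. The function names are hypothetical, and skipping gapped columns when computing the uncorrected distance is an assumption.

```python
import random


def p_distance(a, b):
    """Uncorrected distance: fraction of differing positions (gapped columns skipped)."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0  # assumption: nothing to compare counts as distance 0
    return sum(x != y for x, y in pairs) / len(pairs)


def bootstrap_columns(seqs, rng):
    """Resample alignment columns with replacement (one bootstrap sample)."""
    n = len(seqs[0])
    cols = [rng.randrange(n) for _ in range(n)]
    return ["".join(s[c] for c in cols) for s in seqs]


def count_over_cutoff(seqs, n_boot=1000, cutoff=0.93, seed=99):
    """Count sequence pairs whose distance exceeds `cutoff` over all samples."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_boot):
        sample = bootstrap_columns(seqs, rng)
        for i in range(len(sample)):
            for j in range(i + 1, len(sample)):
                if p_distance(sample[i], sample[j]) > cutoff:
                    hits += 1
    return hits
```

For a pair of sequences that is, say, 10% identical, the original alignment is under the cutoff, but a resampled set of columns can easily contain fewer than 7% identical positions, which is what triggers the warning.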
Summary of the command line usage (very brief)
ClustalW is designed to be run interactively. However, all parameters can be set and run from the command line by giving options after the clustalw command. Options should be preceded by '-'.
If anything is put on the command line, the program will (attempt to) carry out whatever is requested and will then exit. If you wish to use the command line to set some parameters and then go into interactive mode, use the command line switch -interactive. For example:
clustalw -quicktree -interactive
This will set the default initial alignment mode to fast/approximate and will then go to the main menu.
To see a list of all the command line parameters, type:
clustalw -options
This will return a list with no explanation.
To get (very brief) help on command line usage, use the -help or -check options. Otherwise, the command line usage is self-explanatory or is explained in clustalv.doc. The defaults for all parameters are set in the file param.h, which can be changed easily (remember to recompile the program afterwards).
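If you drive ClustalW from a script, the "-name" and "-name=value" convention described above is easy to generate. Here is a minimal Python sketch; `build_command` is a hypothetical helper of mine, not part of ClustalW, and it only uses option names that appear in this article.

```python
def build_command(executable="clustalw", **options):
    """Turn keyword options into a ClustalW-style argument list.

    True becomes a bare switch (-align); False or None is skipped;
    anything else becomes -name=value.
    """
    args = [executable]
    for name, value in options.items():
        if value is True:
            args.append(f"-{name}")
        elif value is False or value is None:
            continue
        else:
            args.append(f"-{name}={value}")
    return args


if __name__ == "__main__":
    cmd = build_command(align=True, type="dna", infile="foo.fasta")
    print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` if the binary is installed; building a list rather than a single string avoids shell quoting problems with file names.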
Command Line Interface
Note: The following is taken directly from the "clustalw.doc" (v.1.83) file. It will be edited later.
Data (sequences)
-INFILE=file.ext
- input sequences.
-PROFILE1=file.ext and -PROFILE2=file.ext
- profiles (old alignment).
Verbs (do things)
-OPTIONS
- list the command line parameters
-HELP or -CHECK
- outline the command line params.
-ALIGN
- do full multiple alignment
-TREE
- calculate NJ tree.
-BOOTSTRAP(=n)
- bootstrap a NJ tree (n= number of bootstraps; def. = 1000).
-CONVERT
- output the input sequences in a different file format.
Parameters (set things)
General settings
-INTERACTIVE
- read command line, then enter normal interactive menus
-QUICKTREE
- use FAST algorithm for the alignment guide tree
-TYPE=
- PROTEIN or DNA sequences
-NEGATIVE
- protein alignment with negative values in matrix
-OUTFILE=
- sequence alignment file name
-OUTPUT=
- GCG, GDE, PHYLIP, PIR, or NEXUS
-OUTORDER=
- INPUT or ALIGNED
-CASE
- LOWER or UPPER (for GDE output only)
-SEQNOS=
- OFF or ON (for Clustal output only)
-SEQNO_RANGE=
- OFF or ON (NEW: for all output formats)
-RANGE=m,n
- sequence range to write starting m to m+n.
Fast Pairwise Alignments
-KTUPLE=n
- word size
-TOPDIAGS=n
- number of best diags.
-WINDOW=n
- window around best diags.
-PAIRGAP=n
- gap penalty
-SCORE
- PERCENT or ABSOLUTE
Slow Pairwise Alignments
-PWMATRIX=
- Protein weight matrix=BLOSUM, PAM, GONNET, ID or filename
-PWDNAMATRIX=
- DNA weight matrix=IUB, CLUSTALW or filename
-PWGAPOPEN=f
- gap opening penalty
-PWGAPEXT=f
- gap extension penalty
Multiple Alignments
-NEWTREE=
- file for new guide tree
-USETREE=
- file for old guide tree
-MATRIX=
- Protein weight matrix=BLOSUM, PAM, GONNET, ID or filename
-DNAMATRIX=
- DNA weight matrix=IUB, CLUSTALW or filename
-GAPOPEN=f
- gap opening penalty
-GAPEXT=f
- gap extension penalty
-ENDGAPS
- no end gap separation pen.
-GAPDIST=n
- gap separation pen. range
-NOPGAP
- residue-specific gaps off
-NOHGAP
- hydrophilic gaps off
-HGAPRESIDUES=
- list hydrophilic res.
-MAXDIV=n
- % ident. for delay
-TYPE=
- PROTEIN or DNA
-TRANSWEIGHT=f
- transitions weighting
Profile Alignments
-PROFILE
- Merge two alignments by profile alignment
-NEWTREE1=
- file for new guide tree for profile1
-NEWTREE2=
- file for new guide tree for profile2
-USETREE1=
- file for old guide tree for profile1
-USETREE2=
- file for old guide tree for profile2
Sequence to Profile Alignments
-SEQUENCES
- Sequentially add profile2 sequences to profile1 alignment
-NEWTREE=
- file for new guide tree
-USETREE=
- file for old guide tree
Structure Alignments
-NOSECSTR1
- do not use secondary structure-gap penalty mask for profile 1
-NOSECSTR2
- do not use secondary structure-gap penalty mask for profile 2
-SECSTROUT=STRUCTURE or MASK or BOTH or NONE
- output in alignment file
-HELIXGAP=n
- gap penalty for helix core residues
-STRANDGAP=n
- gap penalty for strand core residues
-LOOPGAP=n
- gap penalty for loop regions
-TERMINALGAP=n
- gap penalty for structure termini
-HELIXENDIN=n
- number of residues inside helix to be treated as terminal
-HELIXENDOUT=n
- number of residues outside helix to be treated as terminal
-STRANDENDIN=n
- number of residues inside strand to be treated as terminal
-STRANDENDOUT=n
- number of residues outside strand to be treated as terminal
Trees
-OUTPUTTREE=nj OR phylip OR dist OR nexus
-SEED=n
- seed number for bootstraps.
-KIMURA
- use Kimura's correction.
-TOSSGAPS
- ignore positions with gaps.
-BOOTLABELS=node OR branch
- position of bootstrap values in tree display
Examples
- Generate a DNA alignment file
$ clustalw -align -type=dna -infile=foo.fasta > /dev/null
$ cat foo.aln
CLUSTAL W (1.8) multiple sequence alignment

foobar1         AGGTTGTATACTATC
foobar3         AGGTTGT--ACTATC
foobar2         AGGTTGTTTACTATC
foobar4         CGGTTGT--ACTATC
                 ******  ******
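The CLUSTAL output is simple enough to read back with a few lines of Python. The following is a rough sketch under my own assumptions (hypothetical function names; it relies on the conservation line starting with whitespace and on each sequence line being "name, then residues"), not a full parser.

```python
EXAMPLE = "\n".join([
    "CLUSTAL W (1.8) multiple sequence alignment",
    "",
    "",
    "foobar1         AGGTTGTATACTATC",
    "foobar3         AGGTTGT--ACTATC",
    "foobar2         AGGTTGTTTACTATC",
    "foobar4         CGGTTGT--ACTATC",
    "                 ******  ******",
])


def parse_clustal(text):
    """Return {name: aligned sequence}, joining blocks if the alignment is wrapped."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("not a CLUSTAL alignment")
    seqs = {}
    for line in lines[1:]:
        # Skip blank lines and the conservation line (which starts with spaces).
        if not line.strip() or line[0].isspace():
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        name, chunk = parts[0], parts[1]
        seqs[name] = seqs.get(name, "") + chunk
    return seqs


def conserved_columns(seqs):
    """Mark columns where every sequence has the same (non-gap) residue with '*'."""
    marks = []
    for col in zip(*seqs.values()):
        marks.append("*" if len(set(col)) == 1 and "-" not in col else " ")
    return "".join(marks)
```

Running `conserved_columns(parse_clustal(EXAMPLE))` reproduces the row of asterisks under the alignment above: the first column differs (A vs. C in foobar4) and the two gapped columns are unmarked.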
- Generate a Newick tree file
$ clustalw -tree -infile=foo.aln > /dev/null
$ cat foo.ph
(
(
foobar1:0.01667,
foobar3:-0.01667)
:0.01667,
foobar2:0.01667,
foobar4:0.06026);
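Note the negative branch length for foobar3 in the tree above; Neighbor-Joining can produce these. A small Python sketch like the one below can pull the leaf names and branch lengths out of a .ph file so that such cases are easy to spot. The function names are hypothetical, and the regular expression is a simplification: it assumes plain decimal numbers and unquoted names that start with a letter or underscore.

```python
import re

EXAMPLE_TREE = """(
(
foobar1:0.01667,
foobar3:-0.01667)
:0.01667,
foobar2:0.01667,
foobar4:0.06026);
"""

# A leaf is "name:length"; internal branches have no name before the colon.
LEAF = re.compile(r"([A-Za-z_][\w.]*):(-?\d+(?:\.\d+)?)")


def leaf_lengths(newick):
    """Return {leaf name: branch length} for every named leaf in the tree text."""
    return {name: float(length) for name, length in LEAF.findall(newick)}


def negative_leaves(newick):
    """List the leaves whose branch length is below zero."""
    return [name for name, length in leaf_lengths(newick).items() if length < 0]
```

For anything beyond a quick check, a real tree library is a better choice than a regular expression.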
ClustalV
Note: These examples use the VAX/VMS $ prompt; otherwise, command-line usage is the same on all machines except the Macintosh.
- Read whatever sequences are in the file "proteins.seq" and do a full multiple alignment; output will go to the files: "proteins.dnd" (dendrogram) and "proteins.aln" (alignment):
$ clustalv proteins.seq
~OR~
$ clustalv /infile=proteins.seq
- Same as last example but use K-Tuple size of 2; use a PAM 100 protein weight matrix; write the alignment out in NBRF/PIR format (goes to a file called "proteins.pir"):
$ clustalv proteins.seq/ktup=2/matrix=pam100/output=pir
- Take the alignment in "proteins.seq" and align it with "more.seq" using default values for everything except the fixed gap penalty, which is set to 11. The sequence type is explicitly set to PROTEIN:
$ clustalv /profile1=proteins.seq/profile2=more.seq/type=p/fixed=11
- Take the sequences in proteins.pir (they MUST BE ALIGNED ALREADY) and calculate a phylogenetic tree using Kimura's correction for distances:
$ clustalv proteins.pir/tree/kimura
- Same as the previous example, EXCEPT THAT AN ALIGNMENT IS DONE FIRST:
$ clustalv proteins.pir/align/tree/kimura
- Take the sequences in proteins.seq; they are explicitly set to be protein; align them; bootstrap a tree using 500 samples and a seed number of 99:
$ clustalv proteins.seq/align/boot=500/seed=99/tossgaps/type=p
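The VMS-style examples above all follow one pattern: an optional input file, then switches separated by "/", each either bare or in "name=value" form. As an illustration only, here is a small Python sketch that splits such a command tail into its parts. `parse_switches` is a hypothetical helper, and it would not work with Unix paths that contain "/".

```python
def parse_switches(command_tail):
    """Split 'file/switch/name=value' into (input file or None, {switch: value})."""
    first, *rest = command_tail.strip().split("/")
    infile = first or None
    switches = {}
    for item in rest:
        if not item:
            continue
        name, sep, value = item.partition("=")
        # Bare switches (no '=') are stored as True.
        switches[name.lower()] = value if sep else True
    return infile, switches


if __name__ == "__main__":
    print(parse_switches("proteins.seq/ktup=2/matrix=pam100/output=pir"))
```

This makes it easier to compare a ClustalV command with the dash-prefixed ClustalW options listed earlier, although the option names are not guaranteed to be the same in both programs.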
See also
- T-Coffee
- MUSCLE
References
1. Higgins DG, Bleasby AJ, Fuchs R (1992). CLUSTAL V: improved software for multiple sequence alignment. Comput Appl Biosci, 8(2):189-91. pmid:1591615.
2. Thompson JD, Higgins DG, Gibson TJ (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 22(22):4673-80. pmid:7984417.
Further reading
- Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG (2007). "Clustal W and Clustal X version 2.0". Bioinformatics 23:2947-2948.
- Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG, Thompson JD (2003). "Multiple sequence alignment with the Clustal series of programs". Nucleic Acids Res 31(13):3497-500. pmid:12824352.
- Jeanmougin F, Thompson JD, Gouy M, Higgins DG, Gibson TJ (1998). "Multiple sequence alignment with Clustal X". Trends Biochem Sci 23(10):403-5. pmid:9810230.
- Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG (1997). "The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools". Nucleic Acids Res 25:4876-4882.
- Higgins DG, Thompson JD, Gibson TJ (1996). "Using CLUSTAL for multiple sequence alignments". Methods Enzymol 266:383-402.
- Thompson JD, Higgins DG, Gibson TJ (1994). "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice". Nucleic Acids Res 22:4673-4680.
- Higgins DG (1994). "CLUSTAL V: multiple alignment of DNA and protein sequences". Methods Mol Biol 25:307-318.
- Higgins DG, Bleasby AJ, Fuchs R (1992). "CLUSTAL V: improved software for multiple sequence alignment". Comput Appl Biosci 8:189-191.
- Higgins DG, Sharp PM (1989). "Fast and sensitive multiple sequence alignments on a microcomputer". Comput Appl Biosci 5:151-153.
- Higgins DG, Sharp PM (1988). "CLUSTAL: a package for performing multiple sequence alignment on a microcomputer". Gene 73(1):237-44. pmid:3243435.
External links
- Clustal homepage: http://www.clustal.org/
- clustalw documentation: http://www.clustal.org/download/clustalw_help.txt