ClustalW

From Christoph's Personal Wiki
Revision as of 01:12, 22 December 2006 by Christoph (Talk | contribs) (References)

ClustalW is a multiple sequence alignment program for Unix and Linux (I am ignoring other operating systems in this article/tutorial). ClustalW is a major update and rewrite of the ClustalV program, which was described in Higgins et al., 1992.

The main new features are a greatly improved (more sensitive) multiple alignment procedure for proteins and improved support for different file formats. This software was described in Thompson et al., 1994.

The usage of ClustalW is largely the same as for ClustalV. Details of the new alignment algorithms are described in the manuscript by Thompson et al., 1994.

Usage

Installation

  • Create a directory in which to store the ClustalW executables and files, and extract the files from the downloaded archive into it. For example (under your "home" directory):
mkdir Clustal
cd Clustal
mkdir ClustalW
  • Compile the source files:
make
  • Add the path to the binary to your environment. For example, in a bash console, edit the .bashrc file and add the following:
export PATH=/path/to/clustalw/binary:$PATH

Installation complete!
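
To confirm that the PATH change took effect, something like the following could be used. This is only a minimal sketch: it assumes the compiled binary is called clustalw, which may differ on your system, and the function name find_clustalw is made up for this illustration.

```python
import shutil


def find_clustalw(name="clustalw"):
    """Return the full path of the executable if it is on PATH, else None."""
    # 'clustalw' is an assumed binary name; adjust it to match your build.
    return shutil.which(name)


if __name__ == "__main__":
    path = find_clustalw()
    print(path if path else "clustalw not found on PATH")
```

Remember to open a new shell (or source the .bashrc file) before checking, otherwise the old PATH is still in use.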

File input (sequences to be aligned)

The sequences must all be in one file (or two files for a "profile alignment") in ONE of the following formats:

  • FASTA (Pearson)
  • NBRF/PIR
  • EMBL/Swiss Prot
  • GDE
  • CLUSTAL
  • GCG/MSF
  • GCG9/RSF

The program tries to "guess" which format is being used and whether the sequences are nucleic acid (DNA/RNA) or amino acid (proteins). The format is recognised by the first characters in the file. This is crude, but it works most of the time and it is difficult to do reliably any other way.

First characters in a file recognised by ClustalW

Format            First non-blank word or character in the file
FASTA             >
NBRF              >P1; or >D1;
EMBL/SWISS        ID
GDE protein       %
GDE nucleotide    #
CLUSTAL           CLUSTAL (blocked multiple alignments)
GCG/MSF           PILEUP or !!AA_MULTIPLE_ALIGNMENT or !!NA_MULTIPLE_ALIGNMENT or MSF on the first line, and '..' at the end of line
GCG9/RSF          !!RICH_SEQUENCE

source: ClustalW documentation by EMBL
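
The recognition rules listed above can be sketched roughly as follows. This is my own illustration of the idea, not ClustalW's actual code: the function name guess_format is hypothetical, and the GCG/MSF rule is a simplified reading (either one of the three keywords as the first word, or "MSF" somewhere on the first line together with '..' at its end).

```python
def guess_format(text):
    """Guess a sequence file format from its first non-blank word or character."""
    stripped = text.lstrip()
    if not stripped:
        return None
    first_line = stripped.splitlines()[0]
    first_word = first_line.split()[0]

    # NBRF also starts with '>', so it has to be tested before FASTA.
    if first_word.startswith((">P1;", ">D1;")):
        return "NBRF/PIR"
    if first_word.startswith(">"):
        return "FASTA"
    if first_word == "ID":
        return "EMBL/Swiss-Prot"
    if first_word.startswith("%"):
        return "GDE protein"
    if first_word.startswith("#"):
        return "GDE nucleotide"
    if first_word == "CLUSTAL":
        return "CLUSTAL"
    if first_word.startswith("!!RICH_SEQUENCE"):
        return "GCG9/RSF"
    if first_word in ("PILEUP", "!!AA_MULTIPLE_ALIGNMENT", "!!NA_MULTIPLE_ALIGNMENT"):
        return "GCG/MSF"
    # Simplified assumption: "MSF" on the first line and '..' at the end of it.
    if "MSF" in first_line and first_line.rstrip().endswith(".."):
        return "GCG/MSF"
    return None  # format not recognised


if __name__ == "__main__":
    print(guess_format(">seq1 some description\nACGTACGT\n"))
    print(guess_format(">P1;CRAB_ANAPL\n"))
```

The order of the tests matters: a file starting with ">P1;" would otherwise be taken for FASTA, which shows why this kind of guessing is fragile.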


DNA vs. Protein: the program will count the number of A, C, G, T, U, and N characters. If 85% or more of the characters in a sequence are among these, then DNA/RNA is assumed; otherwise the sequence is treated as protein.
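
The 85% rule can be written down in a few lines. This is a sketch under my own assumptions, not the ClustalW source: upper and lower case are treated alike, and anything that is not a letter (gaps, digits, whitespace) is ignored before counting. The name guess_sequence_type is hypothetical.

```python
NUCLEOTIDE_CODES = set("ACGTUN")


def guess_sequence_type(sequence, threshold=0.85):
    """Return "DNA/RNA" if at least `threshold` of the letters are A, C, G, T, U or N."""
    # Assumption: only letters are counted; gaps and other symbols are skipped.
    residues = [c for c in sequence.upper() if c.isalpha()]
    if not residues:
        return None
    fraction = sum(c in NUCLEOTIDE_CODES for c in residues) / len(residues)
    return "DNA/RNA" if fraction >= threshold else "protein"


if __name__ == "__main__":
    print(guess_sequence_type("ACGTTGCAACGTNNAC"))
    print(guess_sequence_type("MKVLAAGIVGLCAQ"))
```

A protein made up almost entirely of alanine, cysteine, glycine and threonine would be misclassified by this rule, which is the kind of "VERY strange composition" where setting the type by hand is needed.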

File output

The alignments

The alignment output format can be set to CLUSTAL (a self-explanatory blocked alignment). There are other output formats; however, I am only interested in the "CLUSTAL" output.

The trees

The Alignment Algorithms

Terminal gaps

Speed of the initial (pairwise) alignments (fast approximate/slow accurate)

Delaying alignment of distant sequences

Iterative realignment/Reset gaps between alignments

Profile alignment

Notes on "Bootstrapping"

When you use the BOOTSTRAP option in ClustalW to estimate the reliability of parts of a tree, many of the uncorrected distances may randomly exceed the arbitrary cut-off of 0.93 (sequences only 7% identical) if the sequences are distantly related. This happens by chance: even if none of the pairs of sequences are less than 7% identical, the bootstrap samples may contain pairs of sequences that do exceed this cut-off. If this happens, you will be warned and told how often the problem occurred. In practice, this can happen with many data sets, and it is not a serious problem if it happens rarely. If it happens often, you should consider removing the most distantly related sequences and/or using the PHYLIP package instead.
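
To see why a pair of sequences below the cut-off can still exceed it in a bootstrap sample, here is a small simulation. It is a sketch under stated assumptions rather than what ClustalW does internally: the uncorrected distance is taken to be the plain fraction of differing positions (columns with a gap are skipped), and the function names and the default seed are my own choices.

```python
import random


def p_distance(seq_a, seq_b):
    """Fraction of differing positions, skipping columns that contain a gap."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


def fraction_over_cutoff(seq_a, seq_b, n_boot=1000, cutoff=0.93, seed=99):
    """Share of bootstrap samples whose distance exceeds the cut-off."""
    rng = random.Random(seed)
    length = len(seq_a)
    over = 0
    for _ in range(n_boot):
        # Resample alignment columns with replacement.
        columns = [rng.randrange(length) for _ in range(length)]
        sample_a = "".join(seq_a[i] for i in columns)
        sample_b = "".join(seq_b[i] for i in columns)
        if p_distance(sample_a, sample_b) > cutoff:
            over += 1
    return over / n_boot


if __name__ == "__main__":
    a = "A" * 20
    b = "C" * 18 + "A" * 2  # 18 of 20 positions differ: distance 0.90
    print(p_distance(a, b), fraction_over_cutoff(a, b))
```

In this toy pair the true distance is 0.90, below the cut-off, yet a sample only needs to draw 19 or 20 mismatching columns out of 20 to go over it, so a noticeable share of the samples do.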

A further problem arises in almost exactly the opposite situation: when you bootstrap a data set which contains 3 or more sequences that are identical or almost identical. Here, the sets of identical sequences should be shown as a multifurcation (several sequences joining at the same part of the tree). Because the Neighbor-Joining method only gives strictly dichotomous trees (never more than 2 sequences join at one time), this cannot be exactly represented. In practice, this is NOT a problem as there will be some internal branches of zero length separating the sequences. If you display the tree with all branch lengths, you will still see a multifurcation. However, when you bootstrap the tree, only the branching orders are stored and counted. In the case of multifurcations, the exact branching order is arbitrary, but the program will always get the same branching order, depending only on the input order of the sequences. In practice, this is only a problem in situations where you have a set of sequences where all of them are VERY similar. In this case, you can find very high support for some groupings which will disappear if you run the analysis with a different input order. Again, the PHYLIP package deals with this by offering a JUMBLE option to shuffle the input order of your sequences between each bootstrap sample.
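
If you want to check by hand whether a grouping depends on the input order, you can rerun the analysis on reshuffled copies of the input. The following only sketches the shuffling step; it is not PHYLIP's JUMBLE implementation, and jumbled_orders is a made-up helper name.

```python
import random


def jumbled_orders(names, n_runs, seed=0):
    """Return n_runs random input orders of the given sequence names."""
    rng = random.Random(seed)  # fixed seed so the orders can be reproduced
    orders = []
    for _ in range(n_runs):
        order = list(names)  # copy, so the caller's list is left untouched
        rng.shuffle(order)
        orders.append(order)
    return orders


if __name__ == "__main__":
    for order in jumbled_orders(["seqA", "seqB", "seqC", "seqD"], n_runs=3):
        print(order)
```

Each order would then be used to write the sequences to a new input file before bootstrapping; groupings whose support changes a lot between orders should not be trusted.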

Summary of the command line usage (very brief)

ClustalW is designed to be run interactively. However, all parameters can be set and run from the command line by giving options after the clustalw command. Options should be preceded by '-'.

If anything is put on the command line, the program will (attempt to) carry out whatever is requested and will exit. If you wish to use the command line to set some parameters and then go into interactive mode, use the command line switch -interactive. For example:

clustalw -quicktree -interactive

This will set the default initial alignment mode to fast/approximate and will then go to the main menu.

To see a list of all the command line parameters, type:

clustalw -options

This will return a list with no explanation.

To get (very brief) help on command line usage, use the -help or -check options. Otherwise, the command line usage is self-explanatory or is explained in clustalv.doc. The defaults for all parameters are set in the file param.h, which can be changed easily (remember to recompile the program afterwards).
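
When calling ClustalW from a script, it helps to build the argument list in one place. The helper below is a hypothetical sketch (clustalw_command is my own name, and the binary is assumed to be called clustalw); it only puts a '-' in front of each option, as described above, and does not run anything. The infile setting in the second example is an assumption based on the /INFILE switch listed in clustalv.doc.

```python
def clustalw_command(*switches, program="clustalw", **settings):
    """Build an argument list such as ['clustalw', '-quicktree', '-interactive']."""
    args = [program]
    # Plain switches, e.g. "quicktree" becomes "-quicktree".
    args += ["-" + switch for switch in switches]
    # Switches that take a value, e.g. infile="x.seq" becomes "-infile=x.seq".
    args += ["-{}={}".format(key, value) for key, value in settings.items()]
    return args


if __name__ == "__main__":
    print(clustalw_command("quicktree", "interactive"))
    print(clustalw_command(infile="proteins.seq"))
    # The list could then be handed to subprocess.run(...) to start the program.
```

Returning a list rather than a single string avoids quoting problems with file names that contain spaces.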

Command Line Interface

Note: The following is taken directly from the "clustalv.doc" file. It will be edited later.

You can do almost everything that can be done from the menus, using 
a command line interface. In this mode, the program will take all of 
its instructions as "switches" when you activate it; no questions 
will be asked; if there are no errors, the program just does an 
analysis and stops.   It does not work so well on the MAC but is 
still possible.  To get you started we will show you the 2 simplest 
uses of the command line as it looks on VAX/VMS.  On all other 
machines (except the MAC) it works in the same way.

$ clustalv /help           **OR**   $ clustalv /check

Both of the above switches give you a one page summary of the 
command line on the screen and then the program stops. 


$ clustalv proteins.seq    **OR**   $ clustalv /infile=proteins.seq    

This will read the sequences from the file 'proteins.seq' and do a 
complete multiple alignment.  Default parameters will be used, the 
program will try to tell whether or not the sequences are DNA or 
protein and the output will go to a file called 'proteins.aln' . A 
dendrogram file called 'proteins.dnd' will also be created.  Thus 
the default action for the program, when it successfully reads in an 
input file is to do a full multiple alignment.  Some further 
examples of command line usage will be given later.

Command line switches can be abbreviated but MAKE SURE YOU DO NOT 
MAKE THEM AMBIGUOUS.  No attempt will be made to detect ambiguity.  
Use enough characters to distinguish each switch uniquely.

The full list of allowed switches is given below:


                DATA (sequences)

/INFILE=file.ext    :input sequences.  If you give an input file and 
				nothing else as a switch, the default action is 
				to do a complete multiple alignment.  The input 
				file can also be specified by giving it as the 
				first command line parameter with no "/" in 	
				front of it e.g $ clustalv file.ext  .

/PROFILE1=file.ext	:You use these two switches to give the names of  
/PROFILE2=file.ext	two profiles.  The default action is to align 
			the two. You must give the names of both profile 
				files. 



                VERBS (do things)

/HELP  		:list the command line parameters on the screen.
/CHECK           
                
/ALIGN        	:do full multiple alignment.  This is the default 	
			action if no other switches except for input files 
			are given.

/TREE      	:calculate NJ tree.  If this is the only action 	
			specified (e.g. $ clustalv proteins.seq/tree ) it IS 
			ASSUMED THAT THE SEQUENCES ARE ALREADY ALIGNED.  If 
			the sequences are not already aligned, you should 	
			also give the /ALIGN switch.  This will align the 	
			sequences first, output an alignment file and 	
			calculate the tree in memory. 

/BOOTSTRAP(=n)	:bootstrap a NJ tree (n= number of bootstraps; 	
			default = 1000).  If this is the only action 		
			specified (e.g. $ clustalv proteins.seq/bootstrap ) 
			it IS ASSUMED THAT THE SEQUENCES ARE ALREADY ALIGNED.  
			If the sequences are not already aligned, you should 
			also give the /ALIGN switch.  This will align the 	
			sequences first, output an alignment file and 	
			calculate the bootstraps in memory.  You can set the 
			number of bootstrap trials here (e.g./bootstrap=500).  
			You can set the seed number for the random number 	
			generator with /seed=n.



                PARAMETERS (set things)

***Pairwise alignments:***

/KTUP=n      	:word size              
    
/TOPDIAGS=n  	:number of best diagonals

/WINDOW=n    	:window around best diagonals 
 
/PAIRGAP=n   	:gap penalty



***Multiple alignments:***

/FIXEDGAP=n  	:fixed length gap pen.  
    
/FLOATGAP=n  	:variable length gap pen.

/MATRIX=     	:PAM100 or ID or file name. The default weight matrix 
			for proteins is PAM 250.

/TYPE=p or d 	:type is protein or DNA.   This allows you to 	
			explicitly override the program's attempt at guessing 
			the type of the sequence.  It is only useful if you 
			are using sequences with a VERY strange composition.

/OUTPUT=     	:GCG or PHYLIP or PIR.  The default output is 	
			Clustal format.
    
/TRANSIT     	:transitions not weighted.  The default is to weight 
			transitions as more favourable than other mismatches 
			in DNA alignments.  This switch makes all nucleotide 
			mismatches equally weighted.


***Trees:***                             

/KIMURA      	:use Kimura's correction on distances.   

/TOSSGAPS    	:ignore positions with a gap in ANY sequence.

/SEED=n      	:seed number for bootstraps.
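
Since the program makes no attempt to detect ambiguous abbreviations, it can be worth checking them yourself before use. The sketch below assumes that an abbreviation is simply a leading part of the switch name, which is how the abbreviated switches in clustalv.doc look (e.g. boot, fixed); the function names are my own and the list only covers the switches shown above.

```python
SWITCHES = [
    "INFILE", "PROFILE1", "PROFILE2", "HELP", "CHECK", "ALIGN", "TREE",
    "BOOTSTRAP", "KTUP", "TOPDIAGS", "WINDOW", "PAIRGAP", "FIXEDGAP",
    "FLOATGAP", "MATRIX", "TYPE", "OUTPUT", "TRANSIT", "KIMURA",
    "TOSSGAPS", "SEED",
]


def matching_switches(abbreviation, switches=SWITCHES):
    """Return every switch that starts with the given abbreviation."""
    prefix = abbreviation.upper()
    return [s for s in switches if s.startswith(prefix)]


def is_unambiguous(abbreviation, switches=SWITCHES):
    """True if exactly one switch matches the abbreviation."""
    return len(matching_switches(abbreviation, switches)) == 1


if __name__ == "__main__":
    for abbreviation in ("boot", "fixed", "t", "k"):
        print(abbreviation, matching_switches(abbreviation))
```

For example, "t" matches five switches and "k" matches both KTUP and KIMURA, so neither is safe to use on its own.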


EXAMPLES:

These examples use the VAX/VMS $ prompt; otherwise, command-line 
usage is the same on all machines except the Macintosh.

 
$ clustalv proteins.seq      OR     $ clustalv /infile=proteins.seq

Read whatever sequences are in the file "proteins.seq" and do a full 
multiple alignment; output will go to the files: "proteins.dnd" 
(dendrogram) and "proteins.aln" (alignment).


$ clustalv proteins.seq/ktup=2/matrix=pam100/output=pir

Same as last example but use K-Tuple size of 2; use a PAM 100 
protein weight matrix; write the alignment out in NBRF/PIR format 
(goes to a file called "proteins.pir").


$ clustalv /profile1=proteins.seq/profile2=more.seq/type=p/fixed=11

Take the alignment in "proteins.seq" and align it with "more.seq" 
using default values for everything except the fixed gap penalty 
which is set to 11.  The sequence type is explicitly set to 
PROTEIN.


$ clustalv proteins.pir/tree/kimura

Take the sequences in proteins.pir (they MUST BE ALIGNED ALREADY) 
and calculate a phylogenetic tree using Kimura's correction for 
distances.  


$ clustalv proteins.pir/align/tree/kimura

Same as the previous example, EXCEPT THAT AN ALIGNMENT IS DONE 
FIRST.


$ clustalv proteins.seq/align/boot=500/seed=99/tossgaps/type=p

Take the sequences in proteins.seq; they are explicitly set to be 
protein; align them; bootstrap a tree using 500 samples and a seed 
number of 99.

References

  • Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG, Thompson JD (2003). Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res, 31(13):3497-500. pmid:12824352.
  • Jeanmougin F, Thompson JD, Gouy M, Higgins DG, Gibson TJ (1998). Multiple sequence alignment with Clustal X. Trends Biochem Sci, 23(10):403-5. pmid:9810230.
  • Thompson JD, Higgins DG, Gibson TJ (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 22(22):4673-80. pmid:7984417.
  • Higgins DG, Bleasby AJ, Fuchs R (1992). CLUSTAL V: improved software for multiple sequence alignment. Comput Appl Biosci, 8(2):189-91. pmid:1591615.
  • Higgins DG, Sharp PM (1988). CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73(1):237-44. pmid:3243435.