Difference between revisions of "ClustalW"

From Christoph's Personal Wiki

Revision as of 22:34, 28 December 2005

ClustalW is a multiple sequence alignment program for Unix and Linux (I am ignoring other operating systems in this article/tutorial). It is a major update and rewrite of the Clustal V program, which was described in Higgins et al., 1992.

The main new features are a greatly improved (more sensitive) multiple alignment procedure for proteins and improved support for different file formats. This software was described in Thompson et al., 1994.

ClustalW is used in much the same way as Clustal V. The new alignment algorithms are described in detail in Thompson et al., 1994.

Usage

Installation

  • Create a directory in which to store the ClustalW executables and files, then extract the downloaded archive into it. For example (under your home directory):
mkdir Clustal
cd Clustal
mkdir ClustalW
  • Compile the source files:
make
  • Add the directory containing the binary to your PATH. For example, in bash, edit your .bashrc file and add the following:
export PATH=/path/to/clustalw/binary:$PATH

Installation complete!

File input (sequences to be aligned)

The sequences must all be in one file (or two files for a "profile alignment") in ONE of the following formats:

  • FASTA (Pearson)
  • NBRF/PIR
  • EMBL/Swiss Prot
  • GDE
  • CLUSTAL
  • GCG/MSF
  • GCG9/RSF

The program tries to "guess" which format is being used and whether the sequences are nucleic acid (DNA/RNA) or amino acid (protein). The format is recognised from the first characters in the file. This is a crude approach, but it works most of the time, and it is difficult to do reliably any other way.

First characters in a file recognised by ClustalW

Format           First non-blank word or character in the file
FASTA            >
NBRF             >P1; or >D1;
EMBL/SWISS       ID
GDE protein      %
GDE nucleotide   #
CLUSTAL          CLUSTAL (blocked multiple alignments)
GCG/MSF          PILEUP or !!AA_MULTIPLE_ALIGNMENT or !!NA_MULTIPLE_ALIGNMENT or MSF on the first line, and '..' at the end of the line
GCG9/RSF         !!RICH_SEQUENCE

Source: ClustalW documentation by EMBL
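
The recognition rules in the table can be sketched in a few lines of Python. This is only an illustration of the idea, not ClustalW's actual code: the function name guess_format is made up for this example, and the MSF rule is read here as "the first line contains MSF and ends with '..'", which is an assumption about how the table entry is meant.

```python
def guess_format(text):
    """Guess a sequence file format from the first non-blank line of `text`.

    Illustrative sketch of the table above; returns None if nothing matches.
    """
    # Find the first line that contains something other than whitespace.
    first_line = next((ln.strip() for ln in text.splitlines() if ln.strip()), "")
    if not first_line:
        return None
    first_word = first_line.split()[0]

    # NBRF must be tested before FASTA, because both start with '>'.
    if first_line.startswith((">P1;", ">D1;")):
        return "NBRF"
    if first_line.startswith(">"):
        return "FASTA"
    if first_word == "ID":
        return "EMBL/SWISS"
    if first_line.startswith("%"):
        return "GDE protein"
    if first_line.startswith("#"):
        return "GDE nucleotide"
    if first_word == "CLUSTAL":
        return "CLUSTAL"
    if first_line.startswith("!!RICH_SEQUENCE"):
        return "GCG9/RSF"
    if first_word in ("PILEUP", "!!AA_MULTIPLE_ALIGNMENT", "!!NA_MULTIPLE_ALIGNMENT"):
        return "GCG/MSF"
    # Assumed reading of the MSF rule: 'MSF' on the line and '..' at its end.
    if "MSF" in first_line and first_line.endswith(".."):
        return "GCG/MSF"
    return None


if __name__ == "__main__":
    print(guess_format(">seq1\nACGTACGT\n"))
    print(guess_format(">P1;CRAB_ANAPL\n"))
```

Note the ordering: because an NBRF file also begins with ">", the more specific prefixes have to be checked before the plain FASTA test.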


DNA vs. Protein: the program counts the A, C, G, T, U, and N characters in each sequence. If 85% or more of the characters in a sequence are among these, DNA/RNA is assumed; otherwise the sequence is treated as protein.
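
A minimal Python sketch of this 85% rule is given here for illustration. It is not taken from the ClustalW source: the function name, the case-insensitive comparison, the decision to ignore non-letter characters such as gaps, and the handling of an empty sequence are all assumptions made for this example.

```python
NUCLEOTIDE_CHARS = set("ACGTUN")


def looks_like_nucleic_acid(sequence, threshold=0.85):
    """Return True if at least `threshold` of the letters are A, C, G, T, U or N."""
    # Assumption: only letters count as residues; gaps and whitespace are ignored.
    residues = [c.upper() for c in sequence if c.isalpha()]
    if not residues:
        # Assumption: an empty sequence is not classified as DNA/RNA.
        return False
    nucleotide_count = sum(1 for c in residues if c in NUCLEOTIDE_CHARS)
    return nucleotide_count / len(residues) >= threshold


if __name__ == "__main__":
    for seq in ("ACGTTGCANNACGT", "MKVLAAGIVGLCA"):
        kind = "DNA/RNA" if looks_like_nucleic_acid(seq) else "protein"
        print(seq, "->", kind)
```

One consequence of such a rule is that a short protein made up mostly of alanine, cysteine, glycine and threonine would be misclassified as DNA, which is why the guess only works "most of the time".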

References

  • Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Research, 22(22):4673-4680.
  • Higgins, D.G., Bleasby, A.J. and Fuchs, R. (1992) CLUSTAL V: improved software for multiple sequence alignment. Computer Applications in the Biosciences (CABIOS), 8(2):189-191.