Difference between revisions of "Superpose"

From Christoph's Personal Wiki
Jump to: navigation, search
(Background)
Line 14: Line 14:
 
Several approaches to protein structure alignment have been explored over the past decade. The techniques used
 
Several approaches to protein structure alignment have been explored over the past decade. The techniques used
 
include:
 
include:
*comparison of distance matrices (DALI; Holm & Sander, 1993);
+
*comparison of distance matrices (''DALI'')<ref name="Holm1993">Holm L, Sander C (1993). ''J Mol Biol, 233:123-138''.</ref>;
*analysis of differences in vector distance plots (Orengo & Taylor, 1996);
+
*analysis of differences in vector distance plots<ref name="Orengo1996">Orengo CA, Taylor WR (1996). ''Methods Enzymol, 266:617-635''.</ref>;
*minimization of the soap-bubble surface area between two protein backbones (Falicov & Cohen, 1996);
+
*minimization of the soap-bubble surface area between two protein backbones<ref name="Falicov1996">Falicov A, Cohen FE (1996). ''J Mol Biol, 258:871-892''.</ref>;
*dynamic programming on pairwise distances between the proteins' residues (Subbiah et al., 1993; Gerstein & Levitt, 1996, 1998);
+
*dynamic programming on pairwise distances between the proteins' residues<ref name="Subbiah1993">Subbiah S, Laurents DV, Levitt M (1993). ''Curr Biol, 3:141-148''.</ref><ref name="Gerstein1998">Gerstein M, Levitt M (1998). ''Protein Sci, 7:445-456''.</ref><ref name="Gerstein1996">Gerstein M, Levitt M (1996). Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology, pp. 59-67. Menlo Park, California: AAAI Press.</ref>;
*secondary-structure elements (SSEs) (Singh & Brutlag, 1997);
+
*secondary-structure elements (SSEs)<ref name="Singh1997">Singh AP, Brutlag DL (1997). Proceedings of the International Conference on Intelligent Systems for Molecular Biology ISMB-97, pp. 284-293. Halkidiki, Greece: AAAI Press.</ref>;
*three-dimensional clustering (Vriend & Sander, 1991; Mizuguchi & Go, 1995);
+
*three-dimensional clustering<ref name="Vriend1991">Vriend G, Sander C (1991). ''Proteins, 11:52-58''.</ref><ref name="Mizuguchi1995">Mizuguchi K, Go N (1995). ''Protein Eng, 8:353-362''.</ref>;
*graph theory (Mitchell et al., 1990; Alexandrov, 1996; Grindley et al., 1993);
+
*graph theory<ref name="Mitchell1990">Mitchell EM, Artymiuk PJ, Rice DW, Willett PJ (1990). ''Mol Biol, 212:151-166''.</ref><ref name="Alexandrov1996">Alexandrov NN (1996). ''Protein Eng, pp.727-732''.</ref><ref name="Grindley1993">Grindley HM, Artymiuk PJ, Rice DW, Willett PJ (1993). ''Mol Biol, 229:707-721''.</ref>;
*combinatorial extension of alignment path (CE; Shindyalov & Bourne, 1998);
+
*combinatorial extension of alignment path (''CE'')<ref name="Shindyalov1998">Shindyalov IN, Bourne PE (1998). ''Protein Eng, 11:739-747''.</ref>;
*vector alignment of SSEs (VAST; Gibrat et al., 1996);
+
*vector alignment of SSEs (''VAST'')<ref name="Gibrat1996">Gibrat J-F, Madej T, Bryant SH (1996). ''Curr Opin Struct Biol, 6:377-385''.</ref>;
*depth-first recursive search on SSE (DEJAVU; Kleywegt & Jones, 1997); and
+
*depth-first recursive search on SSE (''DEJAVU'')<ref name="Kleywegt1997">Kleywegt GJ, Jones TA (1997). ''Methods Enzymol, 277:525-545''.</ref>; and
*many others (Zuker & Somorjai, 1989; Taylor & Orengo, 1989; Godzik & Skolnick, 1994; Russell & Barton, 1992; Sali & Blundell, 1990; Barakat & Dean, 1991; Leluk et al., 2003; Jung & Lee, 2000; Kato & Takahashi, 2001).
+
*many others.<ref name="Zuker1989">Zuker M, Somorjai RL (1989). ''Bull Math Biol, 51:55-78''.</ref><ref name="Taylor1989">Taylor W, Orengo CJ (1989). ''Mol Biol, 208:1-22''.</ref><ref name="Godzik1994">Godzik A, Skolnick J (1994). ''Comput Appl Biosci, 10:587-596''.</ref><ref name="Russell1992">Russell RB, Barton GJ (1992). ''Proteins, 14:309-323''.</ref><ref name="Sali1990">Sali A, Blundell TJ (1990). ''Mol Biol, 212:403-428''.</ref><ref name="Barakat1991">Barakat DW, Dean PMJ (1991). ''Comput Aided Mol Des, 5:107-117''.</ref><ref name="Leluk2003">Leluk J, Konieczny L, Roterman I (2003). ''Bioinformatics, 19:117-124''.</ref><ref name="Jung2000">Jung J, Lee B (2000). ''Protein Eng, 13:535-543''.</ref><ref name="Kato2001">Kato H, Takahashi YJ (2001). ''Chem Softw, 7:161-170''.</ref>
  
 
Most details of protein fold may be expressed in terms of just two types of SSEs, namely helices (including what type of helix) and strands.
 
Most details of protein fold may be expressed in terms of just two types of SSEs, namely helices (including what type of helix) and strands.
Line 33: Line 33:
 
*"Strict" connectivity: Matched SSEs follow the same order along their protein chains and may be separated only by an equal number of matched or unmatched SSEs in both structures.
 
*"Strict" connectivity: Matched SSEs follow the same order along their protein chains and may be separated only by an equal number of matched or unmatched SSEs in both structures.
  
The decrease in structure similarity is seen as an exponential-like increase in RMSD, which has also been found
+
The decrease in structure similarity is seen as an exponential-like increase in RMSD, which has also been found in other studies.<ref name="Chotia1986">Chotia C, Lesk AM (1986). ''EMBO J, 5:823-826''.</ref><ref name="Hubbard1987">Hubbard TJP, Blundell TL (1987). ''Protein Eng, 1:159-171''.</ref><ref name="Flores1993">Flores TP, Orengo CA, Moss DC, Thornton JM (1993). ''Protein Sci, 2:1811-1826''.</ref><ref name="Russell1994">Russell RB, Barton GJ (1994). ''J Mol Biol, 244:332-350''.</ref>
in other studies (Chotia & Lesk, 1986; Hubbard & Blundell, 1987; Flores et al., 1993; Russell & Barton, 1994; Russell et al., 1997).
+
 
 +
*Rouvray et al., address the problems of structure comparison and recognition by the graph-theoretical approach.<ref name="Rouvray1979">Rouvray DH, Balaban AT, Wilson RJ, Beineke LW (1979). Editors. Applications of Graph Theory, pp. 177-221. NewYork: Academic Press.</ref>
 +
 
 +
==Scores==
 +
===Q score===
 +
Q = Nalign**2 / {[1+(RMSD/R0)**2]*N1*N2}
 +
where, Nalign is the number of residues that align with each other, N1 and N2 are the input structures, and R0 is an empirical parameter (chose at 3 A) that measure the relative significance of RMSD and Nalign.
 +
 
 +
Q reaches 1 only for identical structures (Nalign = N1 = N2 and RMSD = 0), and decreases to zero with decreasing similarity (increasing RMSD or/and decreasing Nalign. Despite the fact that the Q score represents a very basic measure that does not take into account many factors related to the quality of alignment (the number of gaps and their size, sequence identity, etc.), we found that maximization of the Q score produces good results.
 +
 
 +
The higher the Q score the "better", in general, the alignment.
 +
 
 +
===Sequence Identity (SI)===
 +
Nm = Nalign / min(N1,N2)
 +
where, Nm is the normalized alignment length.
 +
 
 +
The sequence identity is defined as a fraction of identical residues in the total number of (structurally) aligned residues:
 +
SI = Nident / Nalign
 +
 
 +
SI<20% is a solid indication of low structural similarity.
  
 
==Keywords==
 
==Keywords==
secondary-structure elements (SSEs), singular value decomposition (SVM; of the correlation matrix, following the method described by Lesk, 1986), [[RMSD]]
+
secondary-structure elements (SSEs), singular value decomposition (SVM; of the correlation matrix, following the method described by Lesk<ref name="Lesk">Lesk AM (1986). ''Acta Cryst A42, 110-113''.</ref>), [[RMSD]]
  
 
==See also==
 
==See also==
 
===Web servers===
 
===Web servers===
*''DALI'' (Holm & Sander, 1993)
+
*''DALI''<ref name="Holm1993">Holm L, Sander C (1993). ''J Mol Biol, 233:123-138''.</ref>
*''VAST'' (Gibrat et al., 1996)
+
*''VAST''<ref name="Gibrat1996">Gibrat J-F, Madej T, Bryant SH (1996). ''Curr Opin Struct Biol, 6:377-385''.</ref>
*combinatorial extension (CE) (Shindyalov & Bourne, 1998)
+
*combinatorial extension (CE)<ref name="Shindyalov1998">Shindyalov IN, Bourne PE (1998). ''Protein Eng, 11:739-747''.</ref>
*''DEJAVU''
+
*''DEJAVU''<ref name="Kleywegt1997">Kleywegt GJ, Jones TA (1997). ''Methods Enzymol, 277:525-545''.</ref>
  
 
===Related===
 
===Related===
Line 52: Line 71:
  
 
==References==
 
==References==
<references/>
+
<small><references/></small>
===Further reading===
+
*Rouvray et al., 1979, and references therein &mdash; addresses the problems of structure comparison and recognition by the graph-theoretical approach.
+
 
+
 
==External links==
 
==External links==
 
*[http://www.ebi.ac.uk/msd-srv/ssm Secondary Structure Matching (SSM) web server]
 
*[http://www.ebi.ac.uk/msd-srv/ssm Secondary Structure Matching (SSM) web server]

Revision as of 02:36, 1 August 2007

superpose - structural alignment based on secondary structure matching and is based on the Secondary Structure Matching (SSM) advanced graph-matching algorithm. It is part of the CCP4 package and was written by Eugene Krissinel of the European Bioinformatics Institute, Cambridge, UK.

Background

"While high sequence similarity almost always implies structural similarity, the opposite is not true. It is therefore expected that three-dimensional alignment will provide more significant clues to protein function and properties than sequence alignment alone".[1]

Most similarity measures are based on the evaluation of the size of common substructures, for example the length of alignment (the longer, the better), and a measure of the distance between them, such as r.m.s.d. (the lower, the better).

The graph-theoretical approach typically includes three major steps:

  1. graph representation of the objects in question;
  2. matching the graphs representing the objects; and
  3. evaluating the common subgraphs found in order to form conclusions about similarity.

Several approaches to protein structure alignment have been explored over the past decade. The techniques used include:

  • comparison of distance matrices (DALI)[2];
  • analysis of differences in vector distance plots[3];
  • minimization of the soap-bubble surface area between two protein backbones[4];
  • dynamic programming on pairwise distances between the proteins' residues[5][6][7];
  • secondary-structure elements (SSEs)[8];
  • three-dimensional clustering[9][10];
  • graph theory[11][12][13];
  • combinatorial extension of alignment path (CE)[14];
  • vector alignment of SSEs (VAST)[15];
  • depth-first recursive search on SSE (DEJAVU)[16]; and
  • many others.[17][18][19][20][21][22][23][24][25]

Most details of protein fold may be expressed in terms of just two types of SSEs, namely helices (including what type of helix) and strands.

Usually the connectivity of SSEs is significant; however, there are situations where it may or should be neglected (e.g. comparison of mutated or engineered proteins, or geometry of active sites). This is the case I am interested in. That is, one can have three-dimensional SSE graphs that are geometrically identical yet have a difference in connectivity between the SSEs. Flexible connectivity is handled in the following ways:

  • Connectivity of SSEs is neglected;
  • "Soft" connectivity: The general order of matched SSEs along their protein chains is the same in both structures, but any number of missing or unmatched SSEs between the matched ones is allowed; and
  • "Strict" connectivity: Matched SSEs follow the same order along their protein chains and may be separated only by an equal number of matched or unmatched SSEs in both structures.

The decrease in structure similarity is seen as an exponential-like increase in RMSD, which has also been found in other studies.[26][27][28][29]

  • Rouvray et al., address the problems of structure comparison and recognition by the graph-theoretical approach.[30]

Scores

Q score

Q = Nalign**2 / {[1+(RMSD/R0)**2]*N1*N2}

where, Nalign is the number of residues that align with each other, N1 and N2 are the input structures, and R0 is an empirical parameter (chose at 3 A) that measure the relative significance of RMSD and Nalign.

Q reaches 1 only for identical structures (Nalign = N1 = N2 and RMSD = 0), and decreases to zero with decreasing similarity (increasing RMSD or/and decreasing Nalign. Despite the fact that the Q score represents a very basic measure that does not take into account many factors related to the quality of alignment (the number of gaps and their size, sequence identity, etc.), we found that maximization of the Q score produces good results.

The higher the Q score the "better", in general, the alignment.

Sequence Identity (SI)

Nm = Nalign / min(N1,N2)

where, Nm is the normalized alignment length.

The sequence identity is defined as a fraction of identical residues in the total number of (structurally) aligned residues:

SI = Nident / Nalign

SI<20% is a solid indication of low structural similarity.

Keywords

secondary-structure elements (SSEs), singular value decomposition (SVM; of the correlation matrix, following the method described by Lesk[31]), RMSD

See also

Web servers

Related

  • PROMOTIF algorithm (Hutchinson & Thornton, 1996) — aids in calculating SSEs
  • CONTACT bricking algorithm (e.g., Tadeusz Skarzynski in CCP4 suite) — computes various types of contacts in protein structures.
  • NCONT — analyses contacts between subsets of atoms in a PDB file.

References

  1. Krissinel E, Henrick K (2004). "Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions". Acta Cryst, D60:2256-2268.
  2. 2.0 2.1 Holm L, Sander C (1993). J Mol Biol, 233:123-138.
  3. Orengo CA, Taylor WR (1996). Methods Enzymol, 266:617-635.
  4. Falicov A, Cohen FE (1996). J Mol Biol, 258:871-892.
  5. Subbiah S, Laurents DV, Levitt M (1993). Curr Biol, 3:141-148.
  6. Gerstein M, Levitt M (1998). Protein Sci, 7:445-456.
  7. Gerstein M, Levitt M (1996). Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology, pp. 59-67. Menlo Park, California: AAAI Press.
  8. Singh AP, Brutlag DL (1997). Proceedings of the International Conference on Intelligent Systems for Molecular Biology ISMB-97, pp. 284-293. Halkidiki, Greece: AAAI Press.
  9. Vriend G, Sander C (1991). Proteins, 11:52-58.
  10. Mizuguchi K, Go N (1995). Protein Eng, 8:353-362.
  11. Mitchell EM, Artymiuk PJ, Rice DW, Willett PJ (1990). Mol Biol, 212:151-166.
  12. Alexandrov NN (1996). Protein Eng, pp.727-732.
  13. Grindley HM, Artymiuk PJ, Rice DW, Willett PJ (1993). Mol Biol, 229:707-721.
  14. 14.0 14.1 Shindyalov IN, Bourne PE (1998). Protein Eng, 11:739-747.
  15. 15.0 15.1 Gibrat J-F, Madej T, Bryant SH (1996). Curr Opin Struct Biol, 6:377-385.
  16. 16.0 16.1 Kleywegt GJ, Jones TA (1997). Methods Enzymol, 277:525-545.
  17. Zuker M, Somorjai RL (1989). Bull Math Biol, 51:55-78.
  18. Taylor W, Orengo CJ (1989). Mol Biol, 208:1-22.
  19. Godzik A, Skolnick J (1994). Comput Appl Biosci, 10:587-596.
  20. Russell RB, Barton GJ (1992). Proteins, 14:309-323.
  21. Sali A, Blundell TJ (1990). Mol Biol, 212:403-428.
  22. Barakat DW, Dean PMJ (1991). Comput Aided Mol Des, 5:107-117.
  23. Leluk J, Konieczny L, Roterman I (2003). Bioinformatics, 19:117-124.
  24. Jung J, Lee B (2000). Protein Eng, 13:535-543.
  25. Kato H, Takahashi YJ (2001). Chem Softw, 7:161-170.
  26. Chotia C, Lesk AM (1986). EMBO J, 5:823-826.
  27. Hubbard TJP, Blundell TL (1987). Protein Eng, 1:159-171.
  28. Flores TP, Orengo CA, Moss DC, Thornton JM (1993). Protein Sci, 2:1811-1826.
  29. Russell RB, Barton GJ (1994). J Mol Biol, 244:332-350.
  30. Rouvray DH, Balaban AT, Wilson RJ, Beineke LW (1979). Editors. Applications of Graph Theory, pp. 177-221. NewYork: Academic Press.
  31. Lesk AM (1986). Acta Cryst A42, 110-113.

External links