Superpose

From Christoph's Personal Wiki

superpose is a structural alignment program based on the Secondary Structure Matching (SSM) graph-matching algorithm. It is part of the CCP4 package and was written by Eugene Krissinel of the European Bioinformatics Institute, Cambridge, UK.

Background

"While high sequence similarity almost always implies structural similarity, the opposite is not true. It is therefore expected that three-dimensional alignment will provide more significant clues to protein function and properties than sequence alignment alone".[1]

Most similarity measures are based on the evaluation of the size of common substructures, for example the length of alignment (the longer, the better), and a measure of the distance between them, such as r.m.s.d. (the lower, the better).
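As a minimal illustration of the distance measure mentioned above, the following Python sketch computes the r.m.s.d. between two sets of already paired and already superposed points. The function name and the plain (x, y, z) tuple representation are assumptions made for this example; they are not part of superpose.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z) points.

    Assumes point k in coords_a is aligned with point k in coords_b and that
    the two sets have already been superposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared Euclidean distance between the paired points.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

The length of the coordinate lists plays the role of the alignment length, so the two quantities named above (size of the common substructure and distance) both come out of the same paired list.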

The graph-theoretical approach typically includes three major steps:

  1. graph representation of the objects in question;
  2. matching the graphs representing the objects; and
  3. evaluating the common subgraphs found in order to form conclusions about similarity.
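The three steps above can be sketched on a toy scale. The Python example below is a deliberately simplified illustration and not the SSM algorithm itself: it assumes each vertex carries only an SSE type label ('H' or 'E'), each edge carries only a single distance, and matching is done by brute-force backtracking, which is practical only for very small graphs. All names and the tolerance value are hypothetical.

```python
def common_subgraph(types_a, dist_a, types_b, dist_b, tol=1.0):
    """Find the largest common subgraph of two small, fully connected labelled graphs.

    Step 1 (representation): types_x[i] is the label of vertex i and
    dist_x[i][j] is the label (a distance) of the edge between i and j.
    Step 2 (matching): backtracking over vertex pairs whose labels agree.
    Returns a list of (i, j) pairs: vertex i of graph A matched to vertex j of B.
    """
    best = []

    def compatible(mapping, i, j):
        # Vertex labels must be equal; every edge to an already matched
        # vertex must agree within the tolerance.
        if types_a[i] != types_b[j]:
            return False
        return all(abs(dist_a[i][p] - dist_b[j][q]) <= tol for p, q in mapping)

    def extend(mapping, start):
        nonlocal best
        if len(mapping) > len(best):
            best = list(mapping)
        for i in range(start, len(types_a)):
            for j in range(len(types_b)):
                unused = all(j != q for _, q in mapping)
                if unused and compatible(mapping, i, j):
                    mapping.append((i, j))
                    extend(mapping, i + 1)
                    mapping.pop()

    extend([], 0)
    return best
```

Step 3 (evaluation) then works on the returned pairs, for example by taking their count as the size of the common substructure.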

Several approaches to protein structure alignment have been explored over the past decade. The techniques used include:

  • comparison of distance matrices (DALI)[2];
  • analysis of differences in vector distance plots[3];
  • minimization of the soap-bubble surface area between two protein backbones[4];
  • dynamic programming on pairwise distances between the proteins' residues[5][6][7];
  • dynamic programming on pairwise distances between secondary-structure elements (SSEs)[8];
  • three-dimensional clustering[9][10];
  • graph theory[11][12][13];
  • combinatorial extension of alignment path (CE)[14];
  • vector alignment of SSEs (VAST)[15];
  • depth-first recursive search on SSE (DEJAVU)[16]; and
  • many others.[17][18][19][20][21][22][23][24][25]

Most details of a protein fold may be expressed in terms of just two types of SSEs, namely helices (including the type of helix) and strands.

Usually the connectivity of SSEs is significant; however, there are situations where it may or should be neglected (e.g. comparison of mutated or engineered proteins, or geometry of active sites). This is the case I am interested in. That is, one can have three-dimensional SSE graphs that are geometrically identical yet have a difference in connectivity between the SSEs. Flexible connectivity is handled in the following ways:

  • Connectivity of SSEs is neglected;
  • "Soft" connectivity: The general order of matched SSEs along their protein chains is the same in both structures, but any number of missing or unmatched SSEs between the matched ones is allowed; and
  • "Strict" connectivity: Matched SSEs follow the same order along their protein chains and may be separated only by an equal number of matched or unmatched SSEs in both structures.
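The three connectivity options above can be expressed as a simple check on a set of matched SSEs. The sketch below is one possible reading of the definitions, not code from superpose: it assumes each match is a pair (i, j) of SSE positions counted along the two chains, that each SSE is matched at most once, and the mode labels 'none', 'soft' and 'strict' are names chosen for this example.

```python
def connectivity_ok(pairs, mode):
    """Check matched SSE index pairs against a connectivity mode.

    pairs: list of (i, j); i is the position of an SSE along chain 1,
           j the position of its matched SSE along chain 2.
    mode:  'none', 'soft' or 'strict'.
    """
    if mode == "none":
        # Connectivity is neglected: any set of matches is acceptable.
        return True
    ordered = sorted(pairs)  # order the matches along chain 1
    steps = [(i2 - i1, j2 - j1)
             for (i1, j1), (i2, j2) in zip(ordered, ordered[1:])]
    if mode == "soft":
        # Same general order in both chains; gaps of any size are allowed.
        return all(dj > 0 for _, dj in steps)
    if mode == "strict":
        # Same order, and the same number of SSEs between consecutive matches.
        return all(dj > 0 and di == dj for di, dj in steps)
    raise ValueError(f"unknown mode: {mode}")
```

For example, the matches [(0, 0), (1, 2), (3, 3)] keep the same order in both chains, so they pass the soft check, but the gaps differ between the two chains, so they fail the strict check.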

The decrease in structure similarity is seen as an exponential-like increase in RMSD, which has also been found in other studies.[26][27][28][29]

  • Rouvray et al. address the problems of structure comparison and recognition by the graph-theoretical approach.[30]

Scores

Q score

Q = Nalign^2 / ((1 + (RMSD/R0)^2) * N1 * N2)

where Nalign is the number of residues that align with each other, N1 and N2 are the numbers of residues in the two input structures, and R0 is an empirical parameter (chosen as 3 Å) that measures the relative significance of RMSD and Nalign.

Q reaches 1 only for identical structures (Nalign = N1 = N2 and RMSD = 0), and decreases to zero with decreasing similarity (increasing RMSD and/or decreasing Nalign). Despite the fact that the Q score represents a very basic measure that does not take into account many factors related to the quality of alignment (the number of gaps and their size, sequence identity, etc.), we found that maximization of the Q score produces good results.

In general, the higher the Q score, the better the alignment.
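A short Python sketch of the Q score may help to see how RMSD and Nalign trade off. It assumes the form Q = Nalign^2 / ((1 + (RMSD/R0)^2) * N1 * N2) given by Krissinel & Henrick[1], with N1 and N2 taken as the residue counts of the two structures; the function and argument names are chosen for this example only.

```python
def q_score(n_align, rmsd, n1, n2, r0=3.0):
    """Q score of a structural alignment.

    n_align: number of aligned residues
    rmsd:    r.m.s.d. of the aligned residues, in Angstrom
    n1, n2:  number of residues in each input structure
    r0:      empirical parameter weighting RMSD against n_align (3 Angstrom)
    """
    if n1 <= 0 or n2 <= 0:
        raise ValueError("n1 and n2 must be positive")
    return n_align ** 2 / ((1.0 + (rmsd / r0) ** 2) * n1 * n2)
```

With these definitions, q_score(100, 0.0, 100, 100) gives 1.0 (identical structures), while aligning half of each 100-residue structure at an RMSD equal to R0, q_score(50, 3.0, 100, 100), gives 0.125.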

Sequence Identity (SI)

Nm = Nalign / min(N1,N2)

where Nm is the normalized alignment length.

The sequence identity is defined as a fraction of identical residues in the total number of (structurally) aligned residues:

SI = Nident / Nalign

SI <20% is a solid indication of low structural similarity.
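The two ratios defined above are straightforward to compute. In the following Python sketch, the representation of the alignment as a list of one-letter residue code pairs is an assumption made for the example, as are the function names.

```python
def normalized_alignment_length(n_align, n1, n2):
    """Nm = Nalign / min(N1, N2)."""
    return n_align / min(n1, n2)


def sequence_identity(aligned_pairs):
    """SI = Nident / Nalign.

    aligned_pairs: list of (residue_1, residue_2) one-letter codes, one pair
    per structurally aligned position.
    """
    if not aligned_pairs:
        raise ValueError("the alignment is empty")
    # Count the aligned positions that carry the same residue type.
    n_ident = sum(1 for a, b in aligned_pairs if a == b)
    return n_ident / len(aligned_pairs)
```

For instance, 80 aligned residues between structures of 100 and 160 residues give Nm = 0.8, and an alignment in which two of four aligned positions are identical gives SI = 0.5.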

Usage

superpose foo_1st.pdb [-s CID1] foo_2nd.pdb [-s CID2] [foo_out.pdb]

where [-s CID1] and [-s CID2] are optional selection strings in MMDB convention, and [foo_out.pdb] is an optional output file.

  • Simple example:
superpose unbound.pdb bound.pdb fitted.pdb
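When superpose is called from a script, the argument order in the usage line above matters: each selection follows the file it applies to. The helper below is a hypothetical convenience function, not part of CCP4; it only assembles the argument list according to that usage line and does not run the program. The selection strings are passed through unchanged, so they must already be valid in MMDB convention.

```python
def superpose_command(first_pdb, second_pdb, cid1=None, cid2=None, out_pdb=None):
    """Build the argument list for a superpose call.

    Follows: superpose foo_1st.pdb [-s CID1] foo_2nd.pdb [-s CID2] [foo_out.pdb]
    """
    cmd = ["superpose", first_pdb]
    if cid1 is not None:
        cmd += ["-s", cid1]      # selection applied to the first structure
    cmd.append(second_pdb)
    if cid2 is not None:
        cmd += ["-s", cid2]      # selection applied to the second structure
    if out_pdb is not None:
        cmd.append(out_pdb)      # optional output file
    return cmd
```

The returned list can be handed to subprocess.run on a machine where the CCP4 superpose executable is on the PATH.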

Keywords

secondary-structure elements (SSEs), singular value decomposition (SVD; of the correlation matrix, following the method described by Lesk[31]), RMSD

See also

Topics

Web servers

Related

  • PROMOTIF algorithm (Hutchinson & Thornton, 1996) — aids in calculating SSEs
  • CONTACT bricking algorithm (e.g., Tadeusz Skarzynski in CCP4 suite) — computes various types of contacts in protein structures.
  • NCONT — analyses contacts between subsets of atoms in a PDB file.
  • Global Distance Test (GDT) — A different structure comparison measure
  • Template Modeling Score (TM-Score) — A different structure comparison measure
  • Longest Continuous Segment (LCS) — A different structure comparison measure

References

  1. Krissinel E, Henrick K (2004). "Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions". Acta Cryst, D60:2256-2268.
  2. Holm L, Sander C (1993). J Mol Biol, 233:123-138.
  3. Orengo CA, Taylor WR (1996). Methods Enzymol, 266:617-635.
  4. Falicov A, Cohen FE (1996). J Mol Biol, 258:871-892.
  5. Subbiah S, Laurents DV, Levitt M (1993). Curr Biol, 3:141-148.
  6. Gerstein M, Levitt M (1998). Protein Sci, 7:445-456.
  7. Gerstein M, Levitt M (1996). Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology, pp. 59-67. Menlo Park, California: AAAI Press.
  8. Singh AP, Brutlag DL (1997). Proceedings of the International Conference on Intelligent Systems for Molecular Biology ISMB-97, pp. 284-293. Halkidiki, Greece: AAAI Press.
  9. Vriend G, Sander C (1991). Proteins, 11:52-58.
  10. Mizuguchi K, Go N (1995). Protein Eng, 8:353-362.
  11. Mitchell EM, Artymiuk PJ, Rice DW, Willett P (1990). J Mol Biol, 212:151-166.
  12. Alexandrov NN (1996). Protein Eng, pp. 727-732.
  13. Grindley HM, Artymiuk PJ, Rice DW, Willett P (1993). J Mol Biol, 229:707-721.
  14. Shindyalov IN, Bourne PE (1998). Protein Eng, 11:739-747.
  15. Gibrat J-F, Madej T, Bryant SH (1996). Curr Opin Struct Biol, 6:377-385.
  16. Kleywegt GJ, Jones TA (1997). Methods Enzymol, 277:525-545.
  17. Zuker M, Somorjai RL (1989). Bull Math Biol, 51:55-78.
  18. Taylor WR, Orengo CA (1989). J Mol Biol, 208:1-22.
  19. Godzik A, Skolnick J (1994). Comput Appl Biosci, 10:587-596.
  20. Russell RB, Barton GJ (1992). Proteins, 14:309-323.
  21. Sali A, Blundell TL (1990). J Mol Biol, 212:403-428.
  22. Barakat DW, Dean PM (1991). J Comput Aided Mol Des, 5:107-117.
  23. Leluk J, Konieczny L, Roterman I (2003). Bioinformatics, 19:117-124.
  24. Jung J, Lee B (2000). Protein Eng, 13:535-543.
  25. Kato H, Takahashi Y (2001). J Chem Softw, 7:161-170.
  26. Chothia C, Lesk AM (1986). EMBO J, 5:823-826.
  27. Hubbard TJP, Blundell TL (1987). Protein Eng, 1:159-171.
  28. Flores TP, Orengo CA, Moss DC, Thornton JM (1993). Protein Sci, 2:1811-1826.
  29. Russell RB, Barton GJ (1994). J Mol Biol, 244:332-350.
  30. Rouvray DH, Balaban AT, Wilson RJ, Beineke LW (1979). Editors. Applications of Graph Theory, pp. 177-221. New York: Academic Press.
  31. Lesk AM (1986). Acta Cryst, A42:110-113.
  32. Kabsch W (1976). "A solution for the best rotation to relate two sets of vectors". Acta Cryst, A32:922-923.
  33. Holm L, Kaariainen S, Rosenstrom P, Schenkel A (2008). "Searching protein structure databases with DaliLite v.3". Bioinformatics, 24:2780-2781.
  34. Jia Y, Dewey TG, Shindyalov IN, Bourne PE (2004). "A new scoring function and associated statistical significance for structure alignment by CE". J Comput Biol, 11(5):787-799. PMID: 15700402.
  35. Pekurovsky D, Shindyalov IN, Bourne PE (2004). "A case study of high-throughput biological data processing on parallel platforms". Bioinformatics, 20(12):1940-1947. PMID: 15044237.
  36. Shindyalov IN, Bourne PE (2000). "An alternative view of protein fold space". Proteins, 38(3):247-60. PMID: 10713986.

Further reading

  • Armougom F, Moretti S, Keduas V, Notredame C (2006). "The iRMSD: a local measure of sequence alignment accuracy using structural information." Bioinformatics, 22(14):e35-9.
  • Damm KL, Carlson HA (2006). "Gaussian-weighted RMSD superposition of proteins: a structural comparison for flexible proteins and predicted protein structures." Biophys J, 90(12):4558-73.
  • Kneller GR (2005). "Comment on 'Using quaternions to calculate RMSD' [J. Comp. Chem. 25, 1849 (2004)]." J Comput Chem, 26(15):1660-2.
  • Theobald DL (2005). "Rapid calculation of RMSDs using a quaternion-based characteristic polynomial." Acta Crystallogr A, 61(Pt 4):478-80.
  • Maiorov VN, Crippen GM (1994). "Significance of root-mean-square deviation in comparing three-dimensional structures of globular proteins." J Mol Biol, 235(2):625-34.

External links