Difference between revisions of "BLOSUM"

From Christoph's Personal Wiki
* [http://helix.biology.mcmaster.ca/721/distance/node10.html page on BLOSUM matrices]
* [http://www.nature.com/nbt/journal/v22/n8/full/nbt0804-1035.html Nature Biotechnology Primer on BLOSUM62]

Latest revision as of 01:49, 17 October 2006

BLOSUM (BLOcks SUbstitution Matrix) is a substitution matrix used for sequence alignment of proteins. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences. They were first introduced in a paper by Henikoff and Henikoff.[1] The authors scanned the BLOCKS database for highly conserved regions of protein families (regions whose sequence alignment contains no gaps) and counted the relative frequencies of the amino acids and of the aligned amino acid pairs. From these counts they calculated a log-odds score for each of the 210 possible pairs of the 20 standard amino acids (190 distinct substitutions plus 20 identities). BLOSUM62 is the matrix calculated after clustering sequences that share 62% or more identity and treating each cluster as a single sequence, so the substitutions it counts are those observed between sequences that are less than 62% identical.
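The log-odds calculation described above can be sketched in a few lines of Python. This is only a minimal illustration, not the original procedure: the pair counts are invented toy numbers over a two-letter alphabet, the function name `log_odds_scores` is hypothetical, and the clustering step is omitted entirely. The scale factor of 2 is assumed because the matrix below is stated to be in 1/2 bit units.

```python
import math
from itertools import combinations_with_replacement

# The 20 standard amino acids give 20 * 21 / 2 = 210 unordered pairs.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
N_PAIRS = len(list(combinations_with_replacement(AMINO_ACIDS, 2)))  # 210


def log_odds_scores(pair_counts, scale=2.0):
    """Turn unordered pair counts into rounded log-odds scores.

    pair_counts maps a tuple (a, b) to the number of times residues a and b
    were seen aligned in the same column. Every count must be positive,
    because the logarithm of zero is undefined.
    """
    total = sum(pair_counts.values())
    # Observed frequency q_ab of each pair.
    q = {pair: n / total for pair, n in pair_counts.items()}
    # Background frequency p_a of each residue: an identical pair counts
    # fully for its residue, a mixed pair counts half for each residue.
    p = {}
    for (a, b), freq in q.items():
        if a == b:
            p[a] = p.get(a, 0.0) + freq
        else:
            p[a] = p.get(a, 0.0) + freq / 2
            p[b] = p.get(b, 0.0) + freq / 2
    scores = {}
    for (a, b), freq in q.items():
        # Frequency expected by chance if residues paired independently.
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(scale * math.log2(freq / expected))
    return scores


# Invented toy counts, for illustration only.
toy_counts = {("A", "A"): 60, ("A", "S"): 20, ("S", "S"): 20}
print(log_odds_scores(toy_counts))
# {('A', 'A'): 1, ('A', 'S'): -2, ('S', 'S'): 2}
```

A positive score means the pair was observed more often than chance would predict, a negative score means it was observed less often.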

BLOSUM62 matrix

#  Matrix made by matblas from blosum62.iij
#  * column uses minimum score
#  BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
#  Cluster Percentage: >= 62
#  Entropy =   0.6979, Expected =  -0.5209
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4 
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4 
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4 
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4 
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4 
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4 
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4 
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4 
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4 
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4 
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4 
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4 
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4 
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4 
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4 
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4 
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4 
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4 
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4 
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4 
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4 
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4 
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4 
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1
--source: NCBI
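
As a rough illustration of how the matrix above is used, the following Python sketch parses the table text and sums the scores of an ungapped alignment of two equal-length sequences. The function names `parse_matrix` and `score_ungapped` are hypothetical helpers written for this example, and gap penalties are deliberately left out.

```python
BLOSUM62_TEXT = """\
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1
"""


def parse_matrix(text):
    """Parse the whitespace-separated table into a {(row, col): score} dict."""
    # Skip blank lines and '#' comment lines such as the matblas header.
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.startswith("#")]
    columns = lines[0].split()
    matrix = {}
    for line in lines[1:]:
        fields = line.split()
        row, values = fields[0], fields[1:]
        for col, value in zip(columns, values):
            matrix[(row, col)] = int(value)
    return matrix


def score_ungapped(seq_a, seq_b, matrix):
    """Sum the substitution scores of two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(matrix[(a, b)] for a, b in zip(seq_a, seq_b))


blosum62 = parse_matrix(BLOSUM62_TEXT)
print(score_ungapped("WYV", "FYI", blosum62))  # 1 + 7 + 3 = 11
```

Because the scores are in half-bit units, a total of 11 corresponds to 5.5 bits.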

Notes

For proteins, the legal alphabet is:

  • ACDEFGHIKLMNPQRSTVWY (20) for amino acids
  • X for any amino acid
  • B for N or D
  • Z for Q or E
  • O for creating a free-insertion module (FIM)

For nucleic acids, the legal alphabet is:

  • ACGTU for nucleotides (with T and U considered equivalent)
  • Y for C or T
  • R for A or G
  • N for any nucleotide
  • O for creating a free-insertion module (FIM)
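
The two alphabets listed above can be written down as sets and used to check a sequence before scoring it. The sketch below is one possible way to do this. The names `PROTEIN_ALPHABET`, `NUCLEOTIDE_ALPHABET`, `AMBIGUITY` and `illegal_symbols` are hypothetical, and folding input to upper case is an assumption of this example, not something the notes specify.

```python
# Alphabets as listed in the notes: residues plus ambiguity and FIM symbols.
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY") | set("XBZO")
NUCLEOTIDE_ALPHABET = set("ACGTU") | set("YRNO")

# Two-way ambiguity codes from the notes, expanded to the symbols they cover.
AMBIGUITY = {
    "B": "ND",  # protein: N or D
    "Z": "QE",  # protein: Q or E
    "Y": "CT",  # nucleic acid: C or T
    "R": "AG",  # nucleic acid: A or G
}


def illegal_symbols(sequence, alphabet):
    """Return, sorted, the symbols of sequence that are not in alphabet.

    Lower-case input is folded to upper case first (an assumption made here).
    """
    return sorted(set(sequence.upper()) - alphabet)


print(illegal_symbols("ACDJ", PROTEIN_ALPHABET))     # ['J']
print(illegal_symbols("acgu", NUCLEOTIDE_ALPHABET))  # []
```

Note that Y and R are read differently in the two alphabets: as tyrosine and arginine for proteins, and as ambiguity codes for nucleic acids.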

References

  1. Henikoff S, Henikoff JG (1992). Amino acid substitution matrices from protein blocks. PNAS 89:10915-10919.

See also