Difference between revisions of "Bayesian Phylogenetic Analysis"

From Christoph's Personal Wiki
Jump to: navigation, search
(Probability distributions over other parameters: format)
(Examine marginal and joint distributions)
 
(14 intermediate revisions by the same user not shown)
Line 1: Line 1:
 +
==Introduction to Bayesian phylogenetic analysis==
 +
 +
As was the case for likelihood methods, Bayesian analysis is founded on having a probabilistic model of how the observed data is produced. (This means that, for a given set of parameter values, you can compute the probability of any possible data set.) You will recall that in '''[[Bayesian inference|Bayesian statistics]]''' the goal is to obtain a full probability distribution over all possible parameter values. To find this so-called '''[[Posterior probability|posterior probability distribution]]''' requires combining the '''likelihood''' and the '''[[Prior probability|prior probability distribution]]'''.
 +
 +
The '''prior probability distribution''' shows your beliefs about the parameters before seeing any data, while the '''likelihood''' shows what the data is telling about the parameters. Specifically, the likelihood of a parameter value is the probability of the observed data given that parameter value. (This is the measure we have previously used to find the '''[[Maximum likelihood|maximum likelihood]]''' estimate.) If the prior probability distribution is flat (i.e., if all possible parameter values have the same prior probability) then the posterior distribution is simply proportional to the likelihood distribution, and the parameter value with the maximum likelihood then also has the '''[[Maximum a posteriori|maximum posterior probability]]'''. However, even in this case, using a Bayesian approach still allows one to interpret the posterior as a probability distribution. If the prior is NOT flat, then it may have a substantial impact on the posterior although this effect will diminish with increasing amounts of data. A prior may be derived from the results of previous experiments. For instance, one can use the posterior of one analysis as the prior in a new, independent analysis.
 +
 +
In [http://www.eeb.uconn.edu/Courses/EEB372/class22.pdf Bayesian phylogeny] the parameters are of the same kind as in maximum likelihood phylogeny. Thus, typical parameters include tree topology, branch lengths, nucleotide frequencies, and substitution model parameters such as for instance the transition/transversion ratio or the gamma shape parameter. (The difference is that while we want to find the best point estimates of parameter values in maximum likelihood, the goal in Bayesian phylogeny is instead to find a full probability distribution over all possible parameter values.) The observed data is again usually taken to be the alignment, although it would of course be more reasonable to say that the sequences are what have been observed (and the alignment should then be inferred along with the phylogeny).
 +
 
==Overview==
 
==Overview==
  
Today's exercise will focus on phylogenetic analysis using Bayesian methods. We already discussed Bayesian statistics and Bayesian phylogeny in class, but if you want to refresh your memory, then you can read a brief introduction at the end of this page.
+
This article will focus on phylogenetic analysis using Bayesian methods.
  
In this exercise we will explore how one can determine and use posterior probability distributions over trees, over clades, and over substitution parameters. We will also touch upon the difference between marginal and joint probability distributions.
+
see also: [[Bayesian statistics]], [[Bayesian phylogeny]]
 +
 
 +
We will explore how one can determine and use '''posterior probability distributions''' over ''trees'', over ''clades'', and over ''substitution parameters''. We will also touch upon the difference between '''[[Marginal distribution|marginal-]]''' and '''[[Joint distribution|joint-probability distributions]]'''.
  
 
==Getting started==
 
==Getting started==
  
* '''NEXUS files needed:'''
+
===NEXUS files===
 +
 
 +
The [[NEXUS file format|NEXUS files needed]] in this tutorial are the following:
  
 
<pre>
 
<pre>
primatemitDNA.nexus
+
foo_mitDNA.nexus
neanderthalsmall.nexus
+
fooSmall.nexus
 
hcvsmall.nexus
 
hcvsmall.nexus
 
</pre>
 
</pre>
  
You have analyzed (versions of) all these data files previously in this course. We will now use Bayesian phylogenetic analysis to complement what we learned in those analyses.
+
We will now use Bayesian phylogenetic analysis.
  
* '''Start long run for later analysis:'''
+
===MrBayes===
  
Before we really start with today's exercise, you have to start a run that will take quite long time to finish. The results will be analyzed later today, but since we do not want to sit around waiting for it, we will start it now and discuss the results later. For now you should just issue the commands given below without wondering what they mean. I will explain them later in the exercise.
+
We will use the programme, '''[[MrBayes]]'''. For now we will just test the programme by issuing ''sample'' commands given below without wondering what they mean. They will be explained later.
  
 
<pre>mb</pre>
 
<pre>mb</pre>
 
 
This starts the program, giving you a prompt ("'''<tt>MrBayes> </tt>'''") where you can enter commands. The next set of commands will load the data file, specify outgroup, request that the program should not prompt you during the run, specify the model, and start the run:
+
This starts the programme, giving you a prompt ("'''<tt>MrBayes> </tt>'''") where you can enter commands. The next set of commands will load the data file, specify <tt>outgroup</tt>, request that the program should not prompt you during the run (<tt>autoclose</tt>), specify the model, and start the run:
  
<pre>execute neanderthalsmall.nexus
+
<pre>execute fooSmall.nexus
 
outgroup PAN6030
 
outgroup PAN6030
 
set autoclose=yes
 
set autoclose=yes
Line 32: Line 44:
 
</pre>
 
</pre>
  
The analysis will run for about 35 minutes. Once you have started the run, you should make sure to minimize the window so it won't get in your way. We will return and analyze the results later.
+
The analysis will run for about 35 minutes. We will return and analyze the results later.
  
==Posterior probability of trees==
+
===Sample output===
  
In today's exercise we will be using the program "MrBayes" to perform Bayesian phylogenetic analysis. MrBayes is a program that, like PAUP*, can be controlled by giving commands at a command line prompt. In fact, there is a substantial overlap between the commands used to control MrBayes and the PAUP command language. This should be a help when you are trying to understand how to use the program.
+
*'''<tt>fooSmall.nexus.p</tt>''' (the parameter sampling file):
 +
<pre>
 +
[ID: 9261943684]
 +
Gen    LnL    TL      kappa  pi(A)  pi(C)  pi(G)  pi(T)  alpha 
 +
1      -5378.260      1.700  1.000000        0.250000        0.250000        0.250000        0.250000        0.500000       
 +
100    -5026.561      1.746  2.687982        0.263367        0.288204        0.217714        0.230715        0.382365       
 +
200    -4807.226      1.955  2.208355        0.362426        0.264404        0.144240        0.228930        0.389880       
 +
300    -4802.165      1.990  2.204721        0.362426        0.264404        0.144240        0.228930        0.394108
 +
 
 +
...
 +
</pre>
 +
 
 +
*'''<tt>fooSmall.nexus.t</tt>''' (the file containing sampled trees):
 +
<pre>
 +
#NEXUS
 +
[ID: 9261943684]
 +
begin trees;
 +
  translate
 +
      1 Cow,
 +
      2 Carp,
 +
      3 Chicken,
 +
      4 Human,
 +
      5 Loach,
 +
      6 Mouse,
 +
      7 Rat,
 +
      8 Seal,
 +
      9 Whale,
 +
      10 Frog;
 +
  tree rep.1 = (((((9:0.100000,6:0.100000):0.100000,(8:0.100000,((4:0.100000,5:0.100000),2:0.100000):0.100000,1:0.100000); ...
 +
  tree rep.100 = (3:0.194139,((((6:0.071748,(4:0.141885,(5:0.206749,2:0.131233):0.225710):0.016019):0.050057,1:0.111190); ...
 +
  tree rep.200 = (((6:0.063875,7:0.121337):0.059020,((10:0.198564,(2:0.114438,5:0.150473):0.113267):0.062015,1:0.113690); ...
 +
  tree rep.300 = (9:0.096183,((((6:0.092165,7:0.084863):0.094932,((10:0.160747,(2:0.114438,1:0.124806); ...
 +
 
 +
...
 +
</pre>
 +
 
 +
==[[Posterior probability|Posterior probability]] of trees==
 +
 
 +
We will continue to use the programme "'''MrBayes'''" to perform '''Bayesian phylogenetic analysis'''. MrBayes is a programme that, like PAUP*, can be controlled by giving commands at a command line prompt. In fact, there is a substantial overlap between the commands used to control MrBayes and the PAUP command language. This should be a help when you are trying to understand how to use the program.
  
 
However, you may notice that the command editor in MrBayes has fewer features than that in PAUP. For instance, you cannot recall previously issued commands by using the arrow-up button, and you can also not use the arrow-left button to move into a command and edit it. Notice that the command "'''<tt>help</tt>'''" will give you a list of all available commands. Issuing "'''<tt>help ''command''</tt>'''" will give you a more detailed description of the specified command along with current option values. This is similar to how "'''<tt>''command'' ?</tt>'''" works in  PAUP.
 
However, you may notice that the command editor in MrBayes has fewer features than that in PAUP. For instance, you cannot recall previously issued commands by using the arrow-up button, and you can also not use the arrow-left button to move into a command and edit it. Notice that the command "'''<tt>help</tt>'''" will give you a list of all available commands. Issuing "'''<tt>help ''command''</tt>'''" will give you a more detailed description of the specified command along with current option values. This is similar to how "'''<tt>''command'' ?</tt>'''" works in  PAUP.
  
Before you start the exercise, you should make sure that you are now in one of the un-used shell windows, and that you have cd'ed to today's directory! We will now explore how to find and use the posterior probability distribution over trees.
+
We will now explore how to find and use the posterior probability distribution over trees.
  
* '''Start program:'''
+
* '''Start programme:'''
  
 
<pre>mb</pre>
 
<pre>mb</pre>
Line 52: Line 102:
 
* '''Load your sequences:'''
 
* '''Load your sequences:'''
  
<pre>execute primatemitDNA.nexus</pre>
+
<pre>execute foo_mtDNA.nexus</pre>
  
This file contains mitochondrial DNA sequences from 5 different primates. Note that MrBayes accepts input in nexus format, and that this is the same command that was used to load sequences in PAUP*. In general, you can use many of the PAUP commands in
+
This file contains [[mitochondrial DNA]] (mtDNA) sequences from 5 different primates. Note that MrBayes accepts input in [[NEXUS file format|NEXUS format]], and that this is the same command that was used to load sequences in PAUP*. In general, you can use many of the PAUP commands in MrBayes also.
MrBayes also.
+
  
 
* '''Inspect data set:'''
 
* '''Inspect data set:'''
Line 73: Line 122:
 
<pre>lset nst=2 rates=gamma</pre>
 
<pre>lset nst=2 rates=gamma</pre>
  
This command is again very much like the corresponding one in PAUP. You are specifying that you want to use a model with two substitution types (<tt>nst=2</tt>), and this is automatically taken to mean that you want to distinguish between transitions and transversions. Furthermore, <tt>rates=gamma</tt> means that you want the model to use a gamma distribution to account for different rates at different sites in the sequence.
+
This command is, again, very much like the corresponding one in PAUP. You are specifying that you want to use a model with two substitution types (<tt>nst=2</tt>), and this is automatically taken to mean that you want to distinguish between '''[[Transition (genetics)|transitions]]''' and '''[[Transversion (genetics)|transversions]]'''. Furthermore, <tt>rates=gamma</tt> means that you want the model to use a gamma distribution to account for different rates at different sites in the sequence.
 +
 
 +
===[[Markov chain Monte Carlo]] sampling===
 +
 
 +
see also: [[Markov chain]], [[Monte Carlo method]]
  
 
* '''Start Markov chain Monte Carlo sampling:'''
 
* '''Start Markov chain Monte Carlo sampling:'''
Line 79: Line 132:
 
<pre>mcmc ngen=300000 samplefreq=100 nchains=4 savebrlens=yes</pre>
 
<pre>mcmc ngen=300000 samplefreq=100 nchains=4 savebrlens=yes</pre>
  
What you are doing here is to use the method known as MCMCMC ("Metropolis-coupled Markov chain Monte Carlo") to empirically determine the posterior probability distribution of trees, branch lengths and substitution parameters. Recall that in the
+
What you are doing here is to use the method known as '''MCMCMC''' ("[[Metropolis-coupled Markov chain Monte Carlo]]" or '''(MC)<sup>3</sup>''') to empirically determine the '''posterior probability distribution of trees''', branch lengths and substitution parameters. Recall that in the Bayesian framework this is how we learn about parameter values: instead of finding the best point estimates, we typically want to quantify the probability of the entire range of possible values. The run will take about 5 minutes to finish (estimated time to finish is shown in the last column of output).
Bayesian framework this is how we learn about parameter values: instead of finding the best point estimates, we typically want to quantify the probability of the entire range of possible values. The run will take about 5 minutes to finish (estimated time to finish is shown in the last column of output).
+
 
 +
Let us examine the command in detail. First, <tt>nchain=4</tt> means that the MCMCMC sampling uses 4 parallel chains: one "cold" from which sampling takes place, and three "heated" that move around in the parameter space more quickly to find additional peaks in the probability distribution. The options <tt>ngen=300000 samplefreq=100</tt> lets the search run for 300,000 steps ("'''generations'''") and saves a tree once every 100 rounds (meaning that a total of 3,000 trees will be saved).
  
Let us examine the command in detail. First, <tt>nchain=4</tt> means that the MCMCMC sampling uses 4 parallel chains: one "cold" from which sampling takes place, and three "heated" that move around in the parameter space more quickly to find additional peaks in the probability distribution. The options <tt>ngen=300000 samplefreq=100</tt> lets the search run for 300,000 steps ("generations") and saves a tree once every 100 rounds (meaning that a total of 3,000 trees will be saved).
+
Usually, you would want to sample from more generations (at least 1,000,000 or so, but the more the better). As you might have guessed, <tt>savebrlens=yes</tt> means we want the branch length information saved along with the sampled trees.
Usually you would probably want to sample from more generations (at least 1,000,000 or so, but the more the better). As you might have guessed, <tt>savebrlens=yes</tt> means we want the branch length information saved along with the sampled trees.
+
  
When the run starts you will see reports about the progress of the four chains. Each line of output lists the generation number and the log likelihoods of the current tree/parameter combination for each of the four chains. The cold chain is the one
+
When the run starts you will see reports about the progress of the four chains. Each line of output lists the '''generation number''' and the '''[[Likelihood-ratio test|log likelihoods]]''' of the current tree/parameter combination for each of the four chains. The '''cold chain''' is the one enclosed in ''brackets'' [], while the '''[[Heated chains|heated chains]]''' are enclosed in ''parentheses'' (). Occasionally the chains will swap so one of the heated chains now becomes cold (and sampling then takes place from this chain).
enclosed in brackets [], while the heated chains are enclosed in parentheses (). Occasionally the chains will swap so one of the heated chains now becomes cold (and sampling then takes place from this chain).
+
  
 
* '''Have a look at the resulting sample files:'''
 
* '''Have a look at the resulting sample files:'''
  
Wait until the run has finished (the output stops and the prompt returns), and then switch to your third shell-window (the one where you are not running MrBayes). Open the parameter sampling file in an nedit window:
+
Wait until the run has finished (the output stops and the prompt returns), and then switch to your third shell-window (the one where you are not running MrBayes). Open the parameter sampling file (<tt>.p</tt> extension):
  
<pre>nedit primatemitDNA.nexus.p</pre>
+
<pre>vi foo_mtDNA.nexus.p</pre>
  
This file contains one line for each sampled point (you may want to turn off line-wrapping in nedit under the preferences menu). Each ''row'' corresponds to a certain sample time (or generation). Each ''column'' contains the sampled values of one
+
This file contains one line for each sampled point. Each ''row'' corresponds to a certain sample time (or generation). Each ''column'' contains the sampled values of one specific parameter. (The first line contains headings telling what the different columns are.) Note how the values of most parameters change a lot during the initial "'''[[Burnin|burnin]]'''" period, before they settle near their most probable values. Now, have a look at the file containing sampled trees (<tt>.t</tt> extension):
specific parameter. (The first line contains headings telling what the different columns are.) Note how the values of most parameters change a lot during the initial "burnin" period, before they settle near their most probable values. Now, close the nedit window and have a look at the file containing sampled trees:
+
  
<pre>nedit primatemitDNA.nexus.t</pre>
+
<pre>vi foo_mtDNA.nexus.t</pre>
  
Tree topology is also a parameter in our model, and exactly like for the other parameters we also get samples from tree-space. One tree is printed per line in the parenthetical format used by most (all?) phylogeny software. Since there are 5 taxa in the
+
Tree topology is also a parameter in our model, and exactly like the other parameters we also get samples from tree-space. One tree is printed per line in the parenthetical format used by most (all?) [[Phylogeny software|phylogeny software]]. Since there are 5 taxa in the present data set, the tree-space consists of only 15 different possible trees. Since we have taken 3000 sample points, there must be many lines containing the same tree topology.
present data set, the tree-space consists of only 15 different possible trees. Since we have taken 3000 sample points, there must be many lines containing the same tree topology.
+
  
* '''Determine "burn-in":'''
+
===Determine "[[Burnin|burn-in]]"===
  
We now have to determine the "burnin", i.e., the number of sample points we have to exclude from our further analysis. Here we will do that by inspecting the likelihood values of the trees saved by MrBayes.
+
We now have to determine the "'''burnin'''" (i.e., the number of sample points we have to exclude from our further analysis). Here we will do that by inspecting the likelihood values of the trees saved by MrBayes.
  
Recall, that the idea in MCMCMC sampling is to move around in parameter space in such a way that the points will be visited according to their posterior probability (i.e., a region with very high posterior probability will be visited frequently). As
+
Recall, that the idea in '''MCMCMC sampling''' is to move around in parameter space in such a way that the points will be visited according to their '''posterior probability''' (i.e., a region with ''very high posterior probability'' will be ''visited frequently''). It takes a while after starting such a process before the sampling gets "out of the foot-hills and into the mountains" of the distribution, and only after this "burnin" period can the sample points be trusted. Here we will try to establish <tt>burnin</tt> by looking at the likelihoods of each sample point as recorded in the file <tt>foo_mtDNA.nexus.p</tt>. At first the likelihoods increase rapidly. Eventually the increase in likelihood declines, and the likelihoods converge on a steady value (a "plateau"). The burnin is the initial part before this plateau is reached. Now, plot the <tt>lnL</tt> value for the entire run file:
mentioned in the lecture it takes a while after starting such a process before the sampling gets "out of the foot-hills and into the mountains" of the distribution, and only after this "burnin" period can the sample points be trusted. Here we will try to establish burnin by looking at the likelihoods of each sample point as recorded in the file <tt>primatemitDNA.nexus.p</tt>. At first the likelihoods increase rapidly. Eventually the increase in likelihood declines, and the likelihoods converge on a steady value (a "plateau"). The burnin is the initial part before this plateau is reached. Now, plot the <tt>lnL</tt> value for the entire run file:
+
  
<pre>mbplot.pl primatemitDNA.nexus.p</pre>
+
<pre>mb-plot.pl foo_mtDNA.nexus.p</pre>
  
(mbplot.pl is a small script that extracts the relevant columns from the <tt>.p</tt> file and uses gnuplot to produce a plot of how the value changes with generation number.) The plot shows you how the likelihood value (y-axis) increases through the generations (x-axis), to finally reach a plateau. If the lnL increases very rapidly, then you may want to zoom in on the first part of the plot. This can be done by first closing the plot window (pressing the "ENTER" button in the shell window should work), and then running <tt>mbplot.pl</pl> with a new set of limits:
+
(<tt>mb-plot.pl</tt> is a small Perl script that extracts the relevant columns from the <tt>.p</tt> file and uses [[Gnuplot]] to produce a plot of how the value changes with generation number.) The plot shows you how the '''likelihood value''' (''y''-axis) increases through the '''generations''' (''x''-axis), to finally reach a plateau. If the <tt>lnL</tt> increases very rapidly, then you may want to zoom in on the first part of the plot. This can be done by first closing the plot window, and then running <tt>mbplot.pl</tt> with a new set of limits:
  
<pre>mbplot.pl primatemitDNA.nexus.p 0 50000</pre>
+
<pre>mb-plot.pl primate_mtDNA.nexus.p 0 50000</pre>
  
This will show you the <tt>lnL</tt> values from generation 0 to generation 50000. Feel free to use other limits if you want to zoom further in or out. Now, look at the x-axis to determine the number of generations required before the plateau is reached. Note down this number and close the window.
+
This will show you the <tt>LnL</tt> values from generation 0 to generation 50000. Feel free to use other limits if you want to zoom further in or out. Now, look at the ''x''-axis to determine the number of generations required before the plateau is reached. Note down this number and close the window.
  
* '''Determine number of sample points to discard:'''
+
===Determine number of sample points to discard===
  
For the analysis that we will perform below, you have to tell MrBayes how many '''sample points''' to discard, '''not''' how many generations were required to reach the plateau.You get the number of sample points by dividing the number of generations by 100 (recall that we only sampled once every 100 generations). For instance, a burnin of 6000 generations means that you want to discard the first 60 sample points.
+
For the analysis that we will perform below, you have to tell MrBayes how many '''sample points''' to discard, ''not'' how many generations were required to reach the plateau. You get the number of sample points by dividing the number of generations by 100 (recall that we only sampled once every 100 generations). For instance, a burnin of 6000 generations means that you want to discard the first 60 sample points.
  
* '''Investigate posterior probability distribution over trees:'''
+
===Investigate posterior probability distribution over trees===
  
Return to the shell window where you have MrBayes running. In the command below you should replace XX with the number of sample points you want to discard (60 for instance, but this depends on the plot you made above).
+
Return to the shell window where you have MrBayes running. In the command below you should replace <tt>XX</tt> with the number of sample points you want to discard (60 for instance, but this depends on the plot you made above).
  
 
<pre>sumt contype=halfcompat showtreeprobs=yes burnin=XX</pre>
 
<pre>sumt contype=halfcompat showtreeprobs=yes burnin=XX</pre>
  
(Scroll back so you can see the top of the output when the command is done.) This command gives you a summary of the trees that are in the file you  examined manually above. As mentioned, the option <tt>burnin=XX</tt> tells MrBayes how many sample points to discard before analyzing the data. The option <tt>contype=halfcompat</tt> requests that a majority rule consensus tree is calculated from the set of trees that are left after discarding the burnin. This consensus is the first tree plotted to the screen. Below the consensus cladogram, a consensus phylogram is plotted. The branch lengths in this have been averaged
+
(Scroll back so you can see the top of the output when the command is done.) This command gives you a summary of the trees that are in the file you  examined manually above. As mentioned, the option <tt>burnin=XX</tt> tells MrBayes how many sample points to discard before analyzing the data. The option <tt>contype=halfcompat</tt> requests that a '''majority rule consensus tree''' is calculated from the set of trees that are left after discarding the burnin. This consensus is the first tree plotted to the screen. Below the '''consensus cladogram''', a '''consensus phylogram''' is plotted. The branch lengths in this have been averaged over the trees in which that branch was present (a particular branch corresponds to a bi-partition of the data, and will typically not be present in every sampled tree). The tree also has "'''clade credibility'''" values. We will return to the meaning of these later.
over the trees in which that branch was present (a particular branch corresponds to a bi-partition of the data, and will typically not be present in every sampled tree). The tree also has "clade credibility" values. We will return to the meaning of these later in today's exercise.
+
  
What most interests us right now is the list of trees that is printed after the phylogram. These trees are labeled "Tree 1", "Tree 2", etc, and are sorted according to their posterior probability which is indicated by a lower-case p after the tree
+
What most interests us right now is the list of trees that is printed after the phylogram. These trees are labeled "Tree 1", "Tree 2", etc, and are sorted according to their '''posterior probability''' which is indicated by a ''lower-case p'' after the tree number. (The ''upper-case P'' gives the '''[[Probability distribution|cumulated probability]]''' of trees shown so far, and is useful for constructing a credible set). This list highlights how '''Bayesian phylogenetic analysis''' is different from '''maximum likelihood''': Instead of finding the ''best'' tree(s), we now get a full list of how probable ''any'' possible tree is. Now, draw a sketch of the tree with the highest posterior probability on a piece of paper so you are sure to remember which one it was. How did the less probable trees differ from the most likely one?
number. (The upper-case P gives the cumulated probability of trees shown so far, and is useful for constructing a credible set). This list highlights how Bayesian phylogenetic analysis is different from maximum likelihood: Instead of finding the
+
best tree(s), we now get a full list of how probable ''any'' possible tree is. Now, draw a sketch of the tree with the highest posterior probability on a piece of paper so you are sure to remember which one it was. How did the less probable trees differ
+
from the most likely one?
+
  
The list of trees and probabilities was printed because of the option <tt>showtreeprobs=yes</tt>. Note that you probably do not want to issue that command if you have much more than 5 taxa! In that case you could instead inspect the file named <tt>primatemitDNA.nexus.trprobs</tt> which is now present in the same directory as your other files (this file is automatically produced by the <tt>sumt</tt> command).
+
The list of trees and probabilities was printed because of the option <tt>showtreeprobs=yes</tt>. Note that you probably do not want to issue that command if you have much more than 5 taxa! In that case you could instead inspect the file named <tt>foo_mtDNA.nexus.trprobs</tt> which is now present in the same directory as your other files (this file is automatically produced by the <tt>sumt</tt> command).
  
 
==Posterior probability of clades==
 
==Posterior probability of clades==
  
By now the run you started using Neanderthal sequences at the beginning of today's exercise should be finished. Find the shell window and have a look.
+
By now the run you started using Neanderthal sequences at the start of this tutorial should be finished. Find the shell window and have a look.
  
* '''Determine burnin:'''
+
===Determine burnin===
  
 
Again use the procedure outlined above to determine burnin for this data set. (Move to free shell window first):
 
Again use the procedure outlined above to determine burnin for this data set. (Move to free shell window first):
  
<pre>mbplot.pl neanderthalsmall.nexus.p</pre>
+
<pre>mb-plot.pl neanderthalsmall.nexus.p</pre>
  
 
Remember that the number of sample points to discard is found by dividing the number of generations by 100).
 
Remember that the number of sample points to discard is found by dividing the number of generations by 100).
  
* '''Find posterior probability of clades:'''
+
===Find posterior probability of clades===
  
 
<pre>sumt contype=halfcompat showtreeprobs=no burnin=XX</pre>
 
<pre>sumt contype=halfcompat showtreeprobs=no burnin=XX</pre>
  
(Again remember to put your actual number of samples to discard instead of the XX in the command line above.) Now examine the consensus tree that is plotted to screen: On the branches that are resolved, you will notice that numbers have been plotted. These are clade-credibility values, and are in fact the posterior probability that the clade is real (based on the present data set). These numbers are different from the bootstrap values that we briefly investigated previously and that we will return to later in the course: unlike bootstrap support (which have no clear statistical meaning) these are actual probabilities. Furthermore, they have been found using a full probabilistic model, instead of neighbor joining, and have still finished in a reasonable amount of time. These features make Bayesian phylogeny very useful for assessing hypotheses about monophyly.
+
(Again remember to put your actual number of samples to discard instead of the <tt>XX</tt> in the command line above.) Now examine the consensus tree that is plotted to screen: On the branches that are resolved, you will notice that numbers have been plotted. These are '''clade-credibility values''', and are in fact the '''posterior probability''' that the ''clade is real'' (based on the present data set). These numbers are different from the ''[[Bootstrapping|bootstrap values]]'' that we briefly investigated previously and that we will return to later: unlike bootstrap support (which have no clear statistical meaning) these are actual probabilities. Furthermore, they have been found using a full probabilistic model, instead of neighbour joining, and have still finished in a reasonable amount of time. These features make Bayesian phylogeny very useful for assessing hypotheses about '''[[monophyly]]'''.
  
Based on your values, how much support does the present data set give to the out-of-Africa hypothesis?
+
Based on your values, how much support does the present data set give to the "out-of-Africa hypothesis"?
  
 
==Probability distributions over other parameters==
 
==Probability distributions over other parameters==
  
As the last thing today, we will now turn away from the tree topology, and instead examine the other parameters that also form part of the probabilistic model. We will do this using a reduced version of the Hepatitis C virus data set that we have also examined previously. Stay in the shell window where you just performed the analysis of Neanderthal sequences.
+
We will now turn away from the tree topology, and instead examine the other parameters that also form part of the probabilistic model. We will do this using a reduced version of the Hepatitis C virus data set that we have also examined previously. Stay in the shell window where you just performed the analysis of Neanderthal sequences.
  
 
* '''Load data set:'''
 
* '''Load data set:'''
Line 176: Line 221:
 
</pre>
 
</pre>
  
This is an alternative way of specifying that different sites have different rates. Instead of using a gamma distribution and learning which sites have what rates from the data, we are instead using our prior knowledge about the structure of the genetic code to specify that all 1st positions have the same rate, all 2nd positions have the same rate, and all 3rd positions have the same rate. Specifically, <tt>charset 1stpos=1-.\3</tt> means that we define a character set named "1stpos" which includes site 1 in the alignment followed by every third site ("<tt>\3</tt>", meaning it includes sites 1, 4, 7, 11, ...) until the end of the alignment (here denoted ".").
+
This is an alternative way of specifying that different sites have different rates. Instead of using a '''[[Gamma distribution|gamma distribution]]''' and learning which sites have what rates from the data, we are instead using our prior knowledge about the structure of the genetic code to specify that all 1st positions have the same rate, all 2nd positions have the same rate, and all 3rd positions have the same rate. Specifically, <tt>charset 1stpos=1-.\3</tt> means that we define a character set named "<tt>1stpos</tt>" which includes site 1 in the alignment followed by every third site ("<tt>\3</tt>", meaning it includes sites 1, 4, 7, 11, ...) until the end of the alignment (here denoted ".").
  
* '''Specify model:'''
+
===Specify model===
  
 
<pre>lset nst=6</pre>
 
<pre>lset nst=6</pre>
  
This specifies that we want to use a model of the General Time Reversible (GTR) type, where all 6 substitution types have separate rate ratios.
+
This specifies that we want to use a model of the '''[[Substitution_model#GTR:_Generalised_time_reversible|General Time Reversible]]''' ('''GTR''') type, where all 6 substitution types have separate rate ratios.
  
When the <tt>lset</tt> command was discussed previously, a few issues were glossed over. Importantly, and unlike PAUP, the lset command in MrBayes gives no information about whether nucleotide frequencies are equal or not, and whether they should be  
+
When the <tt>lset</tt> command was discussed previously, a few issues were glossed over. Importantly, and unlike PAUP, the <tt>lset</tt> command in MrBayes gives no information about whether nucleotide frequencies are equal or not, and whether they should be estimated from the data or not. In MrBayes this is instead controlled by defining the prior probability of the nucleotide frequencies (the command <tt>prset</tt> can again be used for this). For instance, a model with equal nucleotide frequencies correspond to having prior probability 1 (one) for the frequency vector (A=0.25, C=0.25, G=0.25, T=0.25), and zero prior probability for the infinitely many other possible vectors. As you will see below, the default prior is not this limited, and the program will therefore estimate the frequencies from the data.
estimated from the data or not. In MrBayes this is instead controlled by defining the prior probability of the nucleotide frequencies (the command <tt>prset</tt> can again be used for this). For instance, a model with equal nucleotide frequencies correspond to having prior probability 1 (one) for the frequency vector (A=0.25, C=0.25, G=0.25, T=0.25), and zero prior probability for the infinitely many other possible vectors. As you will see below, the default prior is not this limited, and the program will therefore estimate the frequencies from the data.
+
  
 
* '''Inspect model details:'''
 
* '''Inspect model details:'''
Line 191: Line 235:
 
<pre>showmodel</pre>
 
<pre>showmodel</pre>
  
This command gives you a summary of the current model settings. You will also get a summary of how the prior probabilities of all model parameters are set. You will for instance notice that the nucleotide frequencies (parameter labeled "<tt>Statefreq</tt>") have a "Dirichlet" prior. We will not go into the grisly  details of what exactly the Dirichlet distribution looks like, but merely note that it is a distribution over many variables, and that depending on the exact parameters the distribution can be more or less flat. The Dirichlet distribution is a handy way of specifying the prior probability distribution of nucleotide (or amino acid) frequency vectors. The default statefreq prior in MrBayes is the flat or un-informative prior <tt>dirichlet(1,1,1,1)</tt>.
+
This command gives you a summary of the current model settings. You will also get a summary of how the prior probabilities of all model parameters are set. You will for instance notice that the nucleotide frequencies (parameter labeled "<tt>Statefreq</tt>") have a "Dirichlet" prior. We will not go into the grisly  details of what exactly the '''[[Dirichlet distribution|Dirichlet distribution]]''' looks like, but merely note that it is a distribution over many variables, and that depending on the exact parameters the distribution can be more or less flat. The Dirichlet distribution is a handy way of specifying the prior probability distribution of [[Nucleotide|nucleotide]] (or [[Amino acid|amino acid]]) frequency vectors. The default <tt>statefreq</tt> prior in MrBayes is the flat or un-informative prior <tt>dirichlet(1,1,1,1)</tt>.
  
 
We will not go into the priors for the remaining parameters in any detail, but you may notice that by default all topologies are taken to be equally likely (a flat prior on trees).
 
We will not go into the priors for the remaining parameters in any detail, but you may notice that by default all topologies are taken to be equally likely (a flat prior on trees).
 +
 +
===[[MCMC]] sampling===
  
 
* '''Start MCMC sampling:'''
 
* '''Start MCMC sampling:'''
Line 205: Line 251:
 
Again use the procedure outlined above to determine burnin for this data set. (Move to free shell window first):
 
Again use the procedure outlined above to determine burnin for this data set. (Move to free shell window first):
  
<pre>mbplot.pl hcvsmall.nexus.p</pre>
+
<pre>mb-plot.pl hcvsmall.nexus.p</pre>
  
 
Remember that the number of sample points to discard is found by dividing the number of generations by 100).
 
Remember that the number of sample points to discard is found by dividing the number of generations by 100).
Line 213: Line 259:
 
<pre>sump burnin=XX</pre>
 
<pre>sump burnin=XX</pre>
  
The <tt>sump</tt> command works much like the <tt>sumt</tt> command for the non-tree parameters. First, you get a plot of the <tt>lnL</tt> (not as nice as when using <tt>mbplot.pl</tt> though). Secondly, the posterior probability distribution of each parameter is summarized by giving the mean, variance, median, and 95% credible interval.
+
The <tt>sump</tt> command works much like the <tt>sumt</tt> command for the non-tree parameters. First, you get a plot of the <tt>LnL</tt> (not as nice as when using <tt>mb-plot.pl</tt> though). Secondly, the posterior probability distribution of each parameter is summarized by giving the '''mean''', '''variance''', '''median''', and '''95% credible interval'''.
  
Based on the reported posterior means, does it seem that <tt>r(C<->G)</tt> is different from <tt>r(A<->C)</tt>?. Here <tt>r(C<->G)</tt> is the relative substitution rate between C and G. (Note that all substitution parameters are given as relative values, with the GT rate arbitrarily set at a value of 1.00).
+
Based on the reported posterior means, does it seem that <tt>r(C<->G)</tt> is different from <tt>r(A<->C)</tt>?. Here <tt>r(C<->G)</tt> is the relative substitution rate between C and G. (Note that all substitution parameters are given as relative values, with the '''GT rate''' arbitrarily set at a value of 1.00).
  
* '''Marginal vs. joint distributions:'''
+
===Marginal vs. joint distributions===
  
Strictly speaking the comparison above was not entirely appropriate. We first found the overall distribution of the <tt>r(C<->G)</tt> parameter and then compared its mean to the mean of the overall distribution of the <tt>r(A<->C)</tt> parameter. By doing things this way, we are ignoring the possibility that the two parameters might be associated in some way. For instance, one parameter might always be larger than the other in any individual sample, even  though the total distributions have a substantial overlap. We should instead be looking at the distribution over both parameters simultaneously. A probability distribution over several parameters simultaneously is called a "joint distribution" over the parameters.
+
see also: [[Marginal distribution]], [[Joint distribution]]
  
By looking at one parameter at a time, we are summing its probability over all values of the other parameters. This is called the marginal distribution. The figure below illustrates a joint distribution over two parameters, and shows how one can think of the marginal distribution as what you get when you view the joint distribution from the side, so to speak (hence the name "marginal").
+
Strictly speaking the comparison above was not entirely appropriate. We first found the overall distribution of the <tt>r(C<->G)</tt> parameter and then compared its mean to the mean of the overall distribution of the <tt>r(A<->C)</tt> parameter. By doing things this way, we are ignoring the possibility that the two parameters might be associated in some way. For instance, one parameter might always be larger than the other in any individual sample, even though the total distributions have a substantial overlap. We should instead be looking at the distribution over both parameters simultaneously. A probability distribution over several parameters simultaneously is called a "'''joint distribution'''" over the parameters.
 +
 
 +
By looking at one parameter at a time, we are summing its probability over all values of the other parameters. This is called the '''marginal distribution'''. The figure below illustrates a joint distribution over two parameters, and shows how one can think of the marginal distribution as what you get when you view the joint distribution from the side, so to speak (hence the name "marginal").
  
 
<center>  
 
<center>  
<IMG SRC="jointmarginalsmall.jpg"><BR>
+
<img><br />
 +
'''Joint vs. marginal distribution.'''<br />
 +
(''Figure from Berry, "Statistics&mdash;a Bayesian perspective".'')
 
</center>
 
</center>
  
<H3>Joint vs. marginal distribution.</H3> (Figure from Berry, "Statistics - a Bayesian perspective".)
+
===Examine marginal and joint distributions===
</CENTER>
+
  
<BR>
+
Again find a shell window where MrBayes is not running. In the command below I have arbitrarily set the burnin to 50 in order to save you the hassle of entering the value.
  
<LI><P><B>Examine marginal and joint distributions:</B></P>
+
'''grep''' -v '^Gen' hcvsmall.nexus.p | '''gawk''' '{if (NR > 50) print $6}' > rCG.data
  
<P>Again find a shell window where MrBayes is not running. In the command below I have
+
This command takes the parameter file, removes the header line, and for all lines that were sampled after the burnin period it prints the <tt>r(C<->G)</tt> parameter to the file named "<tt>rCG.data</tt>". Now extract a few extra interesting columns in a similar manner.
arbitrarily set the burnin to 50 in order to save you the hassle of entering the value.</P>
+
  
<BLOCKQUOTE>
+
'''grep''' -v '^Gen' hcvsmall.nexus.p | '''gawk''' '{if (NR > 50) print $9}'  > rAC.data
<pre>cat hcvsmall.nexus.p | grep -v '^Gen' | gawk '{if (NR > 50) print $6}' > rCG.data
+
'''grep''' -v '^Gen' hcvsmall.nexus.p | '''gawk''' '{if (NR > 50) print $14}' > rm1.data
</pre>
+
'''grep''' -v '^Gen' hcvsmall.nexus.p | '''gawk''' '{if (NR > 50) print $15}' > rm2.data
</BLOCKQUOTE>
+
'''grep''' -v '^Gen' hcvsmall.nexus.p | '''gawk''' '{if (NR > 50) print $16}' > rm3.data
  
<P>This command takes the parameter file, removes the header line, and for all lines that
+
We can now plot some interesting distributions:
were sampled after the burnin period it prints the <TT>r(C<->G)</TT> parameter to the  file
+
named "rCG.data". Now extract a few extra interesting columns in a similar manner.</P>
+
  
<BLOCKQUOTE>
+
'''histo''' -w 2 -L"CG":"AC" -m ll rCG.data rAC.data
<pre>cat hcvsmall.nexus.p | grep -v '^Gen' | gawk '{if (NR > 50) print $9}' > rAC.data
+
cat hcvsmall.nexus.p | grep -v '^Gen' | gawk '{if (NR > 50) print $14}' > rm1.data
+
cat hcvsmall.nexus.p | grep -v '^Gen' | gawk '{if (NR > 50) print $15}' > rm2.data
+
cat hcvsmall.nexus.p | grep -v '^Gen' | gawk '{if (NR > 50) print $16}' > rm3.data
+
</pre>
+
</BLOCKQUOTE>
+
  
<P>We can now plot some interesting distributions:</P>
+
We here use the <tt>histo</tt> programme to plot the marginal distributions of rCG and rAC (issue the command <TT>man histo</TT> if you want to learn how to use the <tt>histo</tt> programme). Note that while rAC tends to be larger than rCG, there is still a substantial overlap between the distributions. We therefore need to examine the joint distribution to find the real probability that rAC > rCG.
  
<BLOCKQUOTE>
+
'''wc''' -l rCG.data
<pre>histo -w 2 -L"CG":"AC" -m ll rCG.data rAC.data</pre>
+
</BLOCKQUOTE>
+
  
<P>We here use the histo program to plot the marginal distributions of rCG and rAC (issue
+
This gives us a count of how many points remain in the sample files after having discarded the burnin. Write down the number.
the command <TT>man histo</TT> if you want to learn how to use the histo program). Note
+
that while rAC tends to be larger than rCG, there is still a substantial overlap between
+
the distributions. We  therefore need to examine the joint distribution to find the real
+
probability that rAC > rCG.</P>
+
  
<BLOCKQUOTE>
+
'''paste''' rAC.data rCG.data | '''gawk''' '{if ($1>$2) print $0}' | '''wc''' -l
<pre>wc -l rCG.data</pre>
+
</BLOCKQUOTE>
+
  
<P>This gives us a count of how many points remain in the sample files after having
+
This counts the number of sample points where rAC > rCG. You can now find the probability that rAC > rCG by dividing the last number by the first. Note how examining the joint distribution provides you with information that you could not get from simply comparing the marginal distributions. This very simple procedure can also be performed using spread-sheet programs, and can obviously be used to answer many different questions.
discarded the burnin. Write down the number</P>
+
  
<BLOCKQUOTE>
+
We will finish off this tutorial by making one final plot:
<pre>paste rAC.data rCG.data | gawk '{if ($1>$2) print $0}' | wc -l</pre>
+
</BLOCKQUOTE>
+
  
<P>This counts the number of sample points where rAC > rCG. You can now find the
+
  '''histo''' -w 0.01 -L"1st":"2nd":"3rd" -m lll rm1.data rm2.data rm3.data
probability that rAC > rCG by dividing the last number by the first. Note how  examining
+
the joint distribution provides you with information that you could not get  from simply
+
comparing the marginal distributions. This very simple procedure can also be performed
+
using spread-sheet programs, and can obviously be used to answer many
+
different questions.</P>
+
  
<P>We will finish off  today's exercise by making one final plot:</P>
+
This command plots the relative mutation rates at the first, second, and third codon positions. How does this fit with your knowledge of the structure of the genetic code?
  
<BLOCKQUOTE>
+
==External links==
<pre>histo -w 0.01 -L"1st":"2nd":"3rd" -m lll rm1.data rm2.data rm3.data </pre>
+
</BLOCKQUOTE>
+
 
+
<P>This command plots the relative mutation rates at the first, second, and third codon
+
positions. How does this fit with your knowledge of the structure of the genetic code?</P>
+
 
+
<HR>
+
 
+
<H3>Resources</H3>
+
 
+
<UL>
+
<LI><A TARGET="_BLANK" HREF="http://morphbank.ebc.uu.se/mrbayes/">MrBayes</A>:
+
Bayesian Inference of Phylogeny. 
+
+
</UL>
+
  
 +
* [http://morphbank.ebc.uu.se/mrbayes/ MrBayes]: Bayesian Inference of Phylogeny. 
  
<H3><A NAME="bayesintro">Introduction to Bayesian phylogenetic analysis</A></H3>
+
==Appendix==
  
<P>As was the case for likelihood methods, Bayesian analysis is founded on having a
+
* The '''<tt>mb-plot.pl</tt>''' Perl script:
probabilistic model of how the observed data is produced. (This means that, for a  given
+
set of parameter values, you can compute the probability of any possible data set). You
+
will  recall from today's lecture that in Bayesian statistics the goal is to obtain a full
+
probability distribution over all possible parameter values. To find this so-called
+
posterior probability distribution requires combining the likelihood and the prior
+
probability distribution.</P>
+
  
<P>The prior probability distribution shows your beliefs about the parameters before
+
<span style="color:purple;">#! /usr/bin/perl</span>
seeing any data, while the likelihood shows what the data is telling about the
+
parameters. Specifically, the likelihood of a parameter value is the probability of
+
<span style="color:blue;"># Script that plots ".p" file output by MrBayes (v3.0B4). Default is to plot
the observed data given that parameter value. (This is the measure we have previously
+
# LnL in order to figure out how large the "burnin" is.
used to find the maximum likelihood estimate). If the prior probability distribution
+
# If one parameter is given, it is assumed to be the column to plot
is flat (i.e., if  all possible parameter values have the same prior probability) then
+
# If two extra parameters are given , they are assumed to be from-to plotrange.
the posterior distribution is simply proportional to the likelihood distribution, and
+
# If THREE parameters are given, they are assumed to be column no.
the parameter value with the maximum likelihood then also has the maximum posterior
+
# followed by from-to.
probability. However, even in this case, using a Bayesian approach still allows one to
+
#
interpret the posterior as a probability distribution. If the prior is NOT flat, then
+
# A.G. Pedersen, 2003</span>
it may have a substantial impact on the posterior although this effect will diminish
+
with increasing amounts of data. A prior may be derived from the results of previous
+
<span style="color:blue;"># Ugly commandline parsing</span>
experiments. For instance one can use the posterior of one analysis as the prior in a
+
if($#ARGV>3 || $#ARGV<0) {
new, independent analysis. </P>
+
        print "Usage: mb-plot.pl <parameterfile> [from to | column | column from to]\n";
 +
        exit;
 +
}
 +
 +
$range="";
 +
$column=2;
 +
 +
<span style="color:blue;"># Otherwise assume everything is just fine...</span>
 +
$parameterfile=$ARGV[0];
 +
if ($#ARGV==1) {
 +
        $column = $ARGV[1];
 +
  }
 +
 +
if ($#ARGV==2) {
 +
        $from = $ARGV[1];
 +
        $to = $ARGV[2];
 +
        $range = sprintf("[%d:%d]",$from,$to);
 +
  }
 +
 +
if ($#ARGV==3) {
 +
        $column = $ARGV[1];
 +
        $from = $ARGV[2];
 +
        $to = $ARGV[3];
 +
        $range = sprintf("[%d:%d]",$from,$to);
 +
}
 +
 +
<span style="color:blue;"># Print column heading so user knows that she is doing the right thing...</span>
 +
  printf "Plotting column ";
 +
$command = sprintf("head -2 %s | tail -1 | gawk  \'{print \$%d}\'",$parameterfile, $column);
 +
system ($command);
 +
printf "Press ENTER to exit\n";
 +
 +
<span style="color:blue;"># Plot</span>
 +
system ("cat $parameterfile | grep -v 'ID:'| sed \'s/^Gen/#Gen/\' > tempmb.data");
 +
$command = sprintf("plot %s\'tempmb.data\' u %s:%s w li\n",$range,1,$column);
 +
system ("echo \"$command\" > tempmb.gnu");
 +
system ("echo \"pause -1\" >> tempmb.gnu");
 +
system ("gnuplot tempmb.gnu");
 +
system ("rm tempmb.gnu tempmb.data");
  
<P>In Bayesian phylogeny the parameters are of the same kind as in maximum likelihood
+
{{Phylogenetics}}
phylogeny. Thus, typical parameters include tree topology, branch lengths, nucleotide
+
[[Category:Phylogenetics]]
frequencies, and  substitution model parameters such as for instance the
+
transition/transversion ratio or the gamma shape parameter. (The difference is that
+
while we want to find the best point estimates of parameter values in maximum
+
likelihood,  the goal in Bayesian phylogeny is instead to find a full probability
+
distribution over all possible  parameter values). The observed data is again usually
+
taken to be the alignment, although it would of course be more reasonable to say that
+
the sequences are what have been observed (and the alignment should then be inferred
+
along with the phylogeny).</P>
+

Latest revision as of 03:28, 14 April 2008

Introduction to Bayesian phylogenetic analysis

As was the case for likelihood methods, Bayesian analysis is founded on having a probabilistic model of how the observed data is produced. (This means that, for a given set of parameter values, you can compute the probability of any possible data set.) You will recall that in Bayesian statistics the goal is to obtain a full probability distribution over all possible parameter values. To find this so-called posterior probability distribution requires combining the likelihood and the prior probability distribution.

The prior probability distribution shows your beliefs about the parameters before seeing any data, while the likelihood shows what the data is telling about the parameters. Specifically, the likelihood of a parameter value is the probability of the observed data given that parameter value. (This is the measure we have previously used to find the maximum likelihood estimate.) If the prior probability distribution is flat (i.e., if all possible parameter values have the same prior probability) then the posterior distribution is simply proportional to the likelihood distribution, and the parameter value with the maximum likelihood then also has the maximum posterior probability. However, even in this case, using a Bayesian approach still allows one to interpret the posterior as a probability distribution. If the prior is NOT flat, then it may have a substantial impact on the posterior although this effect will diminish with increasing amounts of data. A prior may be derived from the results of previous experiments. For instance, one can use the posterior of one analysis as the prior in a new, independent analysis.

In Bayesian phylogeny the parameters are of the same kind as in maximum likelihood phylogeny. Thus, typical parameters include tree topology, branch lengths, nucleotide frequencies, and substitution model parameters such as for instance the transition/transversion ratio or the gamma shape parameter. (The difference is that while we want to find the best point estimates of parameter values in maximum likelihood, the goal in Bayesian phylogeny is instead to find a full probability distribution over all possible parameter values.) The observed data is again usually taken to be the alignment, although it would of course be more reasonable to say that the sequences are what have been observed (and the alignment should then be inferred along with the phylogeny).

Overview

This article will focus on phylogenetic analysis using Bayesian methods.

see also: Bayesian statistics, Bayesian phylogeny

We will explore how one can determine and use posterior probability distributions over trees, over clades, and over substitution parameters. We will also touch upon the difference between marginal- and joint-probability distributions.

Getting started

NEXUS files

The NEXUS files needed in this tutorial are the following:

foo_mitDNA.nexus
fooSmall.nexus
hcvsmall.nexus

We will now use Bayesian phylogenetic analysis.

MrBayes

We will use the programme, MrBayes. For now we will just test the programme by issuing sample commands given below without wondering what they mean. They will be explained later.

mb

This starts the programme, giving you a prompt ("MrBayes> ") where you can enter commands. The next set of commands will load the data file, specify outgroup, request that the program should not prompt you during the run (autoclose), specify the model, and start the run:

execute fooSmall.nexus
outgroup PAN6030
set autoclose=yes
lset nst=2 rates=gamma
mcmc ngen=300000 savebrlens=yes

The analysis will run for about 35 minutes. We will return and analyze the results later.

Sample output

  • fooSmall.nexus.p (the parameter sampling file):
[ID: 9261943684]
Gen     LnL     TL      kappa   pi(A)   pi(C)   pi(G)   pi(T)   alpha   
1       -5378.260       1.700   1.000000        0.250000        0.250000        0.250000        0.250000        0.500000        
100     -5026.561       1.746   2.687982        0.263367        0.288204        0.217714        0.230715        0.382365        
200     -4807.226       1.955   2.208355        0.362426        0.264404        0.144240        0.228930        0.389880        
300     -4802.165       1.990   2.204721        0.362426        0.264404        0.144240        0.228930        0.394108

...
  • fooSmall.nexus.t (the file containing sampled trees):
#NEXUS
[ID: 9261943684]
begin trees;
   translate
       1 Cow,
       2 Carp,
       3 Chicken,
       4 Human,
       5 Loach,
       6 Mouse,
       7 Rat,
       8 Seal,
       9 Whale,
      10 Frog;
   tree rep.1 = (((((9:0.100000,6:0.100000):0.100000,(8:0.100000,((4:0.100000,5:0.100000),2:0.100000):0.100000,1:0.100000); ...
   tree rep.100 = (3:0.194139,((((6:0.071748,(4:0.141885,(5:0.206749,2:0.131233):0.225710):0.016019):0.050057,1:0.111190); ...
   tree rep.200 = (((6:0.063875,7:0.121337):0.059020,((10:0.198564,(2:0.114438,5:0.150473):0.113267):0.062015,1:0.113690); ...
   tree rep.300 = (9:0.096183,((((6:0.092165,7:0.084863):0.094932,((10:0.160747,(2:0.114438,1:0.124806); ...

...

Posterior probability of trees

We will continue to use the programme "MrBayes" to perform Bayesian phylogenetic analysis. MrBayes is a programme that, like PAUP*, can be controlled by giving commands at a command line prompt. In fact, there is a substantial overlap between the commands used to control MrBayes and the PAUP command language. This should be a help when you are trying to understand how to use the program.

However, you may notice that the command editor in MrBayes has fewer features than that in PAUP. For instance, you cannot recall previously issued commands by using the arrow-up button, and you can also not use the arrow-left button to move into a command and edit it. Notice that the command "help" will give you a list of all available commands. Issuing "help command" will give you a more detailed description of the specified command along with current option values. This is similar to how "command ?" works in PAUP.

We will now explore how to find and use the posterior probability distribution over trees.

  • Start programme:
mb
  • Get a quick overview of available commands:
help
  • Load your sequences:
execute foo_mtDNA.nexus

This file contains mitochondrial DNA (mtDNA) sequences from 5 different primates. Note that MrBayes accepts input in NEXUS format, and that this is the same command that was used to load sequences in PAUP*. In general, you can use many of the PAUP commands in MrBayes also.

  • Inspect data set:
showmatrix
  • Define outgroup:
outgroup Gibbon
  • Specify that program should not prompt you during the run:
set autoclose=yes
  • Specify your model of sequence evolution:
lset nst=2 rates=gamma

This command is, again, very much like the corresponding one in PAUP. You are specifying that you want to use a model with two substitution types (nst=2), and this is automatically taken to mean that you want to distinguish between transitions and transversions. Furthermore, rates=gamma means that you want the model to use a gamma distribution to account for different rates at different sites in the sequence.

Markov chain Monte Carlo sampling

see also: Markov chain, Monte Carlo method
  • Start Markov chain Monte Carlo sampling:
mcmc ngen=300000 samplefreq=100 nchains=4 savebrlens=yes

What you are doing here is to use the method known as MCMCMC ("Metropolis-coupled Markov chain Monte Carlo" or (MC)3) to empirically determine the posterior probability distribution of trees, branch lengths and substitution parameters. Recall that in the Bayesian framework this is how we learn about parameter values: instead of finding the best point estimates, we typically want to quantify the probability of the entire range of possible values. The run will take about 5 minutes to finish (estimated time to finish is shown in the last column of output).

Let us examine the command in detail. First, nchain=4 means that the MCMCMC sampling uses 4 parallel chains: one "cold" from which sampling takes place, and three "heated" that move around in the parameter space more quickly to find additional peaks in the probability distribution. The options ngen=300000 samplefreq=100 lets the search run for 300,000 steps ("generations") and saves a tree once every 100 rounds (meaning that a total of 3,000 trees will be saved).

Usually, you would want to sample from more generations (at least 1,000,000 or so, but the more the better). As you might have guessed, savebrlens=yes means we want the branch length information saved along with the sampled trees.

When the run starts you will see reports about the progress of the four chains. Each line of output lists the generation number and the log likelihoods of the current tree/parameter combination for each of the four chains. The cold chain is the one enclosed in brackets [], while the heated chains are enclosed in parentheses (). Occasionally the chains will swap so one of the heated chains now becomes cold (and sampling then takes place from this chain).

  • Have a look at the resulting sample files:

Wait until the run has finished (the output stops and the prompt returns), and then switch to your third shell-window (the one where you are not running MrBayes). Open the parameter sampling file (.p extension):

vi foo_mtDNA.nexus.p

This file contains one line for each sampled point. Each row corresponds to a certain sample time (or generation). Each column contains the sampled values of one specific parameter. (The first line contains headings telling what the different columns are.) Note how the values of most parameters change a lot during the initial "burnin" period, before they settle near their most probable values. Now, have a look at the file containing sampled trees (.t extension):

vi foo_mtDNA.nexus.t

Tree topology is also a parameter in our model, and exactly like the other parameters we also get samples from tree-space. One tree is printed per line in the parenthetical format used by most (all?) phylogeny software. Since there are 5 taxa in the present data set, the tree-space consists of only 15 different possible trees. Since we have taken 3000 sample points, there must be many lines containing the same tree topology.

Determine "burn-in"

We now have to determine the "burnin" (i.e., the number of sample points we have to exclude from our further analysis). Here we will do that by inspecting the likelihood values of the trees saved by MrBayes.

Recall, that the idea in MCMCMC sampling is to move around in parameter space in such a way that the points will be visited according to their posterior probability (i.e., a region with very high posterior probability will be visited frequently). It takes a while after starting such a process before the sampling gets "out of the foot-hills and into the mountains" of the distribution, and only after this "burnin" period can the sample points be trusted. Here we will try to establish burnin by looking at the likelihoods of each sample point as recorded in the file foo_mtDNA.nexus.p. At first the likelihoods increase rapidly. Eventually the increase in likelihood declines, and the likelihoods converge on a steady value (a "plateau"). The burnin is the initial part before this plateau is reached. Now, plot the lnL value for the entire run file:

mb-plot.pl foo_mtDNA.nexus.p

(mb-plot.pl is a small Perl script that extracts the relevant columns from the .p file and uses Gnuplot to produce a plot of how the value changes with generation number.) The plot shows you how the likelihood value (y-axis) increases through the generations (x-axis), to finally reach a plateau. If the lnL increases very rapidly, then you may want to zoom in on the first part of the plot. This can be done by first closing the plot window, and then running mbplot.pl with a new set of limits:

mb-plot.pl primate_mtDNA.nexus.p 0 50000

This will show you the LnL values from generation 0 to generation 50000. Feel free to use other limits if you want to zoom further in or out. Now, look at the x-axis to determine the number of generations required before the plateau is reached. Note down this number and close the window.

Determine number of sample points to discard

For the analysis that we will perform below, you have to tell MrBayes how many sample points to discard, not how many generations were required to reach the plateau. You get the number of sample points by dividing the number of generations by 100 (recall that we only sampled once every 100 generations). For instance, a burnin of 6000 generations means that you want to discard the first 60 sample points.

Investigate posterior probability distribution over trees

Return to the shell window where you have MrBayes running. In the command below you should replace XX with the number of sample points you want to discard (60 for instance, but this depends on the plot you made above).

sumt contype=halfcompat showtreeprobs=yes burnin=XX

(Scroll back so you can see the top of the output when the command is done.) This command gives you a summary of the trees that are in the file you examined manually above. As mentioned, the option burnin=XX tells MrBayes how many sample points to discard before analyzing the data. The option contype=halfcompat requests that a majority rule consensus tree is calculated from the set of trees that are left after discarding the burnin. This consensus is the first tree plotted to the screen. Below the consensus cladogram, a consensus phylogram is plotted. The branch lengths in this have been averaged over the trees in which that branch was present (a particular branch corresponds to a bi-partition of the data, and will typically not be present in every sampled tree). The tree also has "clade credibility" values. We will return to the meaning of these later.

What most interests us right now is the list of trees that is printed after the phylogram. These trees are labeled "Tree 1", "Tree 2", etc, and are sorted according to their posterior probability which is indicated by a lower-case p after the tree number. (The upper-case P gives the cumulated probability of trees shown so far, and is useful for constructing a credible set). This list highlights how Bayesian phylogenetic analysis is different from maximum likelihood: Instead of finding the best tree(s), we now get a full list of how probable any possible tree is. Now, draw a sketch of the tree with the highest posterior probability on a piece of paper so you are sure to remember which one it was. How did the less probable trees differ from the most likely one?

The list of trees and probabilities was printed because of the option showtreeprobs=yes. Note that you probably do not want to issue that command if you have much more than 5 taxa! In that case you could instead inspect the file named foo_mtDNA.nexus.trprobs which is now present in the same directory as your other files (this file is automatically produced by the sumt command).

Posterior probability of clades

By now the run you started using Neanderthal sequences at the start of this tutorial should be finished. Find the shell window and have a look.

Determine burnin

Again use the procedure outlined above to determine burnin for this data set. (Move to free shell window first):

mb-plot.pl neanderthalsmall.nexus.p

Remember that the number of sample points to discard is found by dividing the number of generations by 100).

Find posterior probability of clades

sumt contype=halfcompat showtreeprobs=no burnin=XX

(Again remember to put your actual number of samples to discard instead of the XX in the command line above.) Now examine the consensus tree that is plotted to screen: On the branches that are resolved, you will notice that numbers have been plotted. These are clade-credibility values, and are in fact the posterior probability that the clade is real (based on the present data set). These numbers are different from the bootstrap values that we briefly investigated previously and that we will return to later: unlike bootstrap support (which have no clear statistical meaning) these are actual probabilities. Furthermore, they have been found using a full probabilistic model, instead of neighbour joining, and have still finished in a reasonable amount of time. These features make Bayesian phylogeny very useful for assessing hypotheses about monophyly.

Based on your values, how much support does the present data set give to the "out-of-Africa hypothesis"?

Probability distributions over other parameters

We will now turn away from the tree topology, and instead examine the other parameters that also form part of the probabilistic model. We will do this using a reduced version of the Hepatitis C virus data set that we have also examined previously. Stay in the shell window where you just performed the analysis of Neanderthal sequences.

  • Load data set:
execute hcvsmall.nexus
  • Define site partition:
charset 1stpos=1-.\3
charset 2ndpos=2-.\3
charset 3rdpos=3-.\3
partition bycodon = 3:1stpos,2ndpos,3rdpos
set partition=bycodon
prset ratepr=variable

This is an alternative way of specifying that different sites have different rates. Instead of using a gamma distribution and learning which sites have what rates from the data, we are instead using our prior knowledge about the structure of the genetic code to specify that all 1st positions have the same rate, all 2nd positions have the same rate, and all 3rd positions have the same rate. Specifically, charset 1stpos=1-.\3 means that we define a character set named "1stpos" which includes site 1 in the alignment followed by every third site ("\3", meaning it includes sites 1, 4, 7, 11, ...) until the end of the alignment (here denoted ".").

Specify model

lset nst=6

This specifies that we want to use a model of the General Time Reversible (GTR) type, where all 6 substitution types have separate rate ratios.

When the lset command was discussed previously, a few issues were glossed over. Importantly, and unlike PAUP, the lset command in MrBayes gives no information about whether nucleotide frequencies are equal or not, and whether they should be estimated from the data or not. In MrBayes this is instead controlled by defining the prior probability of the nucleotide frequencies (the command prset can again be used for this). For instance, a model with equal nucleotide frequencies correspond to having prior probability 1 (one) for the frequency vector (A=0.25, C=0.25, G=0.25, T=0.25), and zero prior probability for the infinitely many other possible vectors. As you will see below, the default prior is not this limited, and the program will therefore estimate the frequencies from the data.

  • Inspect model details:
showmodel

This command gives you a summary of the current model settings. You will also get a summary of how the prior probabilities of all model parameters are set. You will for instance notice that the nucleotide frequencies (parameter labeled "Statefreq") have a "Dirichlet" prior. We will not go into the grisly details of what exactly the Dirichlet distribution looks like, but merely note that it is a distribution over many variables, and that depending on the exact parameters the distribution can be more or less flat. The Dirichlet distribution is a handy way of specifying the prior probability distribution of nucleotide (or amino acid) frequency vectors. The default statefreq prior in MrBayes is the flat or un-informative prior dirichlet(1,1,1,1).

We will not go into the priors for the remaining parameters in any detail, but you may notice that by default all topologies are taken to be equally likely (a flat prior on trees).

MCMC sampling

  • Start MCMC sampling:
mcmc ngen=300000 samplefreq=100 nchains=4 savebrlens=yes

The run will take about 5 minutes to finish.

  • Determine burnin:

Again use the procedure outlined above to determine burnin for this data set. (Move to free shell window first):

mb-plot.pl hcvsmall.nexus.p

Remember that the number of sample points to discard is found by dividing the number of generations by 100).

  • Compute summary of parameter values:
sump burnin=XX

The sump command works much like the sumt command for the non-tree parameters. First, you get a plot of the LnL (not as nice as when using mb-plot.pl though). Secondly, the posterior probability distribution of each parameter is summarized by giving the mean, variance, median, and 95% credible interval.

Based on the reported posterior means, does it seem that r(C<->G) is different from r(A<->C)?. Here r(C<->G) is the relative substitution rate between C and G. (Note that all substitution parameters are given as relative values, with the GT rate arbitrarily set at a value of 1.00).

Marginal vs. joint distributions

see also: Marginal distribution, Joint distribution

Strictly speaking the comparison above was not entirely appropriate. We first found the overall distribution of the r(C<->G) parameter and then compared its mean to the mean of the overall distribution of the r(A<->C) parameter. By doing things this way, we are ignoring the possibility that the two parameters might be associated in some way. For instance, one parameter might always be larger than the other in any individual sample, even though the total distributions have a substantial overlap. We should instead be looking at the distribution over both parameters simultaneously. A probability distribution over several parameters simultaneously is called a "joint distribution" over the parameters.

By looking at one parameter at a time, we are summing its probability over all values of the other parameters. This is called the marginal distribution. The figure below illustrates a joint distribution over two parameters, and shows how one can think of the marginal distribution as what you get when you view the joint distribution from the side, so to speak (hence the name "marginal").

<img>
Joint vs. marginal distribution.
(Figure from Berry, "Statistics—a Bayesian perspective".)

Examine marginal and joint distributions

Again find a shell window where MrBayes is not running. In the command below I have arbitrarily set the burnin to 50 in order to save you the hassle of entering the value.

grep -v '^Gen' hcvsmall.nexus.p | gawk '{if (NR > 50) print $6}' > rCG.data

This command takes the parameter file, removes the header line, and for all lines that were sampled after the burnin period it prints the r(C<->G) parameter to the file named "rCG.data". Now extract a few extra interesting columns in a similar manner.

grep -v '^Gen' hcvsmall.nexus.p | gawk '{if (NR > 50) print $9}'  > rAC.data
grep -v '^Gen' hcvsmall.nexus.p | gawk '{if (NR > 50) print $14}' > rm1.data
grep -v '^Gen' hcvsmall.nexus.p | gawk '{if (NR > 50) print $15}' > rm2.data
grep -v '^Gen' hcvsmall.nexus.p | gawk '{if (NR > 50) print $16}' > rm3.data

We can now plot some interesting distributions:

histo -w 2 -L"CG":"AC" -m ll rCG.data rAC.data

We here use the histo programme to plot the marginal distributions of rCG and rAC (issue the command man histo if you want to learn how to use the histo programme). Note that while rAC tends to be larger than rCG, there is still a substantial overlap between the distributions. We therefore need to examine the joint distribution to find the real probability that rAC > rCG.

wc -l rCG.data

This gives us a count of how many points remain in the sample files after having discarded the burnin. Write down the number.

paste rAC.data rCG.data | gawk '{if ($1>$2) print $0}' | wc -l

This counts the number of sample points where rAC > rCG. You can now find the probability that rAC > rCG by dividing the last number by the first. Note how examining the joint distribution provides you with information that you could not get from simply comparing the marginal distributions. This very simple procedure can also be performed using spread-sheet programs, and can obviously be used to answer many different questions.

We will finish off this tutorial by making one final plot:

histo -w 0.01 -L"1st":"2nd":"3rd" -m lll rm1.data rm2.data rm3.data

This command plots the relative mutation rates at the first, second, and third codon positions. How does this fit with your knowledge of the structure of the genetic code?

External links

  • MrBayes: Bayesian Inference of Phylogeny.

Appendix

  • The mb-plot.pl Perl script:
#! /usr/bin/perl

# Script that plots ".p" file output by MrBayes (v3.0B4). Default is to plot
# LnL in order to figure out how large the "burnin" is.
# If one parameter is given, it is assumed to be the column to plot
# If two extra parameters are given , they are assumed to be from-to plotrange.
# If THREE parameters are given, they are assumed to be column no.
# followed by from-to.
#
# A.G. Pedersen, 2003

# Ugly commandline parsing
if($#ARGV>3 || $#ARGV<0) {
        print "Usage: mb-plot.pl <parameterfile> [from to | column | column from to]\n";
        exit;
}

$range="";
$column=2;

# Otherwise assume everything is just fine...
$parameterfile=$ARGV[0];
if ($#ARGV==1) {
        $column = $ARGV[1];
} 

if ($#ARGV==2) {
        $from = $ARGV[1];
        $to = $ARGV[2];
        $range = sprintf("[%d:%d]",$from,$to);
} 

if ($#ARGV==3) {
        $column = $ARGV[1];
        $from = $ARGV[2];
        $to = $ARGV[3];
        $range = sprintf("[%d:%d]",$from,$to);
}

# Print column heading so user knows that she is doing the right thing...
printf "Plotting column ";
$command = sprintf("head -2 %s | tail -1 | gawk  \'{print \$%d}\'",$parameterfile, $column);
system ($command);
printf "Press ENTER to exit\n";

# Plot
system ("cat $parameterfile | grep -v 'ID:'| sed \'s/^Gen/#Gen/\' > tempmb.data");
$command = sprintf("plot %s\'tempmb.data\' u %s:%s w li\n",$range,1,$column);
system ("echo \"$command\" > tempmb.gnu");
system ("echo \"pause -1\" >> tempmb.gnu");
system ("gnuplot tempmb.gnu");
system ("rm tempmb.gnu tempmb.data");
Topics in phylogenetics
Relevant fields: phylogenetics | computational phylogenetics | molecular phylogeny | cladistics
Basic concepts: synapomorphy | phylogenetic tree | phylogenetic network | long branch attraction
Phylogeny inference methods: maximum parsimony | maximum likelihood | neighbour joining | UPGMA