# Bootstrapping

Bootstrapping, when applied to phylogenetics, tests whether the dataset as a whole supports a phylogenetic tree, or whether that tree is only a marginal winner among many nearly equal alternatives.

"[Bootstrapping is accomplished] by taking random subsamples of the dataset, building trees from each of these and calculating the frequency with which the various parts of your tree are reproduced in each of these random subsamples. If group X is found in every subsample tree, then its bootstrap support is 100%; if it is found in only two-thirds of the subsample trees, its bootstrap support is 67%. Each of the subsamples is the same size as the original, which is accomplished by allowing repeat sampling of sites; that is, random sampling with replacement. It is a simple test, but bootstrap analyses of known phylogenies (viral populations evolved in the laboratory) show that it is a generally dependable measure of phylogenetic accuracy, and that values of 70% or higher are likely to indicate reliable groupings." (Baldauf, 2003).

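The procedure in the quotation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the alignment is assumed to be a dictionary mapping taxon names to equal-length sequences, and `build_clades` is a hypothetical caller-supplied function that infers a tree from an alignment and returns its groups as sets of taxon names. Both function names are invented for this example.

```python
import random
from collections import Counter


def resample_columns(alignment, rng):
    """Return a pseudo-alignment of the same length as the original,
    built by sampling sites (columns) at random with replacement."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same column indices are used for every taxon, so sites stay aligned.
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


def bootstrap_support(alignment, build_clades, n_replicates=100, seed=0):
    """Percentage of bootstrap replicates in which each group appears.

    build_clades is a hypothetical, caller-supplied function: it takes an
    alignment and returns an iterable of frozensets of taxon names, one per
    group in the tree it infers.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_replicates):
        replicate = resample_columns(alignment, rng)
        # set() ensures a group is counted at most once per replicate tree.
        counts.update(set(build_clades(replicate)))
    return {clade: 100.0 * n / n_replicates for clade, n in counts.items()}
```

In practice `build_clades` would wrap whichever inference method produced the original tree (for example neighbour joining or maximum parsimony), so that the replicate trees are comparable with it.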
## Potential problems with the bootstrap

• Sites may not evolve independently.
• Sites may not come from a common distribution (although they can be considered as sampled from a mixture of possible distributions).
• If the branch of interest is not known at the outset, a "multiple-tests" problem arises and the P values are overstated.
• P values are biased (too conservative).
• Bootstrapping does not correct biases in the phylogeny inference method itself.

## Statistics

In statistics, bootstrapping is a method for estimating the sampling distribution of an estimator by resampling with replacement from the original sample. It is distinct from the jackknife procedure, which recomputes the estimate with observations left out in turn (used to estimate bias and variance, and to detect outliers), and from cross-validation, whose purpose is to check that results are repeatable. More elaborate bootstraps exist for sampling without replacement, two-sample problems, regression, time series, hierarchical sampling, and other statistical problems.
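
As a rough illustration of the basic (non-parametric) bootstrap described above, the following sketch resamples a list of observations with replacement and recomputes an estimator on each resample. The function names and default parameters are assumptions made for this example only.

```python
import random
import statistics


def bootstrap_distribution(sample, estimator, n_resamples=1000, seed=0):
    """Approximate the sampling distribution of `estimator` by recomputing it
    on resamples drawn with replacement, each the same size as the original."""
    rng = random.Random(seed)
    n = len(sample)
    return [estimator(rng.choices(sample, k=n)) for _ in range(n_resamples)]


def bootstrap_standard_error(sample, estimator, n_resamples=1000, seed=0):
    """Standard deviation of the bootstrap distribution, a common estimate
    of the estimator's standard error."""
    return statistics.stdev(
        bootstrap_distribution(sample, estimator, n_resamples, seed)
    )
```

For example, `bootstrap_standard_error(data, statistics.median)` gives an approximate standard error for the median of `data`, a quantity with no simple closed-form formula.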

See also particle filter for the general theory of Sequential Monte Carlo methods, as well as details on some common implementations.

## Conventions

Bootstrap values should be displayed as percentages, not raw values. This makes the tree easier to read and to compare with other trees (Baldauf, 2003).

By convention, only bootstrap values of 50% or higher are reported; lower values mean that the node in question was found in fewer than half of the bootstrap replicates (Baldauf, 2003).
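
These two conventions can be applied mechanically when labelling a tree. The sketch below assumes the raw data are counts of how many replicates recovered each node; the function name and the dictionary layout are hypothetical.

```python
def support_labels(clade_counts, n_replicates, threshold=50.0):
    """Convert raw replicate counts to percentage labels, keeping only nodes
    whose support is at or above `threshold` percent."""
    labels = {}
    for clade, count in clade_counts.items():
        percent = 100.0 * count / n_replicates
        # The threshold is checked before rounding, so 49.6% is not reported.
        if percent >= threshold:
            labels[clade] = f"{percent:.0f}"
    return labels
```

With three replicates, a node found in all three is labelled `100`, a node found in two is labelled `67`, and a node found in only one is left unlabelled.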

## References

### Phylogenetics

• Baldauf SL (2003). Phylogeny for the faint of heart: a tutorial. Trends in Genetics 19(6):345-351.
• Efron B, Halloran E, and Holmes S (1996). Bootstrap confidence levels for phylogenetic trees. Proc Natl Acad Sci USA 93:13429-13434.
• Farris JS, Albert VA, Källersjö M, Lipscomb D, and Kluge AG (1996). Parsimony jackknifing outperforms neighbor-joining. Cladistics 12:99-124.
• Sanderson MJ (1995). Objections to bootstrapping phylogenies: a critique. Systematic Biology 44:299-320.
• Harshman J (1994). The effect of irrelevant characters on bootstrap values. Systematic Biology 43:419-424.
• Efron B and Tibshirani RJ (1993). An Introduction to the Bootstrap. Chapman and Hall (New York).
• Hillis DM and Bull JJ (1993). An empirical test of bootstrapping as a method for assessing confidence in phylogenetic analyses. Systematic Biology 42:182-192.
• Felsenstein J and Kishino H (1993). Is there something wrong with the bootstrap on phylogenies? A reply to Hillis and Bull. Systematic Biology 42:193-200.
• Goldman N (1993). Statistical tests of models of DNA substitution. J Mol Evol 36:182-198.
• Zharkikh A and Li WH (1992). Statistical properties of bootstrap estimation of phylogenetic variability from nucleotide sequences. I. Four taxa with a molecular clock. Molecular Biology and Evolution 9:1119-1147.
• Künsch HR (1989). The jackknife and the bootstrap for general stationary observations. Annals of Statistics 17:1217-1241.
• Wu CFJ (1986). Jackknife, bootstrap and other resampling plans in regression analysis. Annals of Statistics 14:1261-1295.
• Efron B (1985). Bootstrap confidence intervals for a class of parametric problems. Biometrika 72:45-58.
• Felsenstein J (1985). Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783-791.
• Margush T and McMorris FR (1981). Consensus n-trees. Bulletin of Mathematical Biology 43:239-244.
• Efron B (1979). Bootstrap methods: another look at the jackknife. Annals of Statistics 7:1-26.
