Maximum likelihood

From Christoph's Personal Wiki

Maximum likelihood estimation (MLE) is a popular statistical method for estimating the parameters of the probability distribution assumed to underlie a given data set.

The method was pioneered by geneticist and statistician Sir Ronald A. Fisher between 1912 and 1922 (see external resources below for more information on the history of MLE).


The following discussion assumes that the reader is familiar with basic notions in probability theory such as probability distributions, probability density functions, random variables and expectation. It also assumes familiarity with standard techniques for maximising continuous real-valued functions, such as using differentiation to find a function's maxima.

The philosophy of MLE

Consider a probability distribution <math>D</math> with distributional parameter <math>\theta</math>, described by either a known probability density function (continuous distribution) or a known probability mass function (discrete distribution), denoted <math>f_D</math>. We may draw a sample <math>X_1, X_2, ..., X_n</math> of <math>n</math> values from this distribution and then use <math>f_D</math> to compute the probability (or, in the continuous case, the probability density) associated with our observed data:

<math>\mathbb{P}(x_1,x_2,\dots,x_n) = f_D(x_1,\dots,x_n \mid \theta)</math>

However, it may be that we don't know the value of the parameter <math>\theta</math> despite knowing (or believing) that our data come from the distribution <math>D</math>. How should we estimate <math>\theta</math>? It is a sensible idea to draw a sample of <math>n</math> values <math>X_1, X_2, ..., X_n</math> and use these data to help us make an estimate.

Once we have our sample <math>X_1, X_2, ..., X_n</math>, we may seek an estimate of the value of <math>\theta</math> from that sample. MLE selects the value of <math>\theta</math> under which the observed data are most probable (i.e., we maximise the likelihood of the observed data set over all possible values of <math>\theta</math>). This is in contrast to seeking other estimators, such as an unbiased estimator of <math>\theta</math>, which need not maximise the likelihood but which, on average, neither over-estimates nor under-estimates the true value of <math>\theta</math>.
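A familiar instance of this contrast, given here as a small sketch rather than as part of the original discussion: if the data are assumed to come from a normal distribution with unknown mean and variance, the MLE of the variance divides the sum of squared deviations by <math>n</math>, whereas the usual unbiased estimator divides by <math>n-1</math>. The sample values below are made up for illustration.

```python
import statistics

# Hypothetical sample, assumed to come from a normal distribution.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# MLE of the variance: sum of squared deviations divided by n.
var_mle = statistics.pvariance(sample)

# Unbiased estimator of the variance: same sum divided by n - 1.
var_unbiased = statistics.variance(sample)

print(var_mle, var_unbiased)
```

The MLE is always the smaller of the two, and the gap shrinks as the sample size grows.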

To implement the MLE method mathematically, we define the likelihood:

<math>\mbox{lik}(\theta) = f_D(x_1,\dots,x_n \mid \theta)</math>

and maximise this function over all possible values of the parameter <math>\theta</math>. The value <math>\hat{\theta}</math> which maximises the likelihood is known as the maximum likelihood estimator (MLE) for <math>\theta</math>.
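As a worked illustration (an assumed example, not one taken from the text above), suppose the observations are independent and exponentially distributed with unknown rate <math>\theta > 0</math>, so that each observation has density <math>\theta e^{-\theta x}</math>. By independence the joint density factorises, and the likelihood is:

<math>\mbox{lik}(\theta) = \prod_{i=1}^n \theta e^{-\theta x_i} = \theta^n \exp\left(-\theta \sum_{i=1}^n x_i\right)</math>

Since the logarithm is an increasing function, maximising the likelihood is equivalent to maximising its logarithm, which is easier to differentiate:

<math>\log \mbox{lik}(\theta) = n \log \theta - \theta \sum_{i=1}^n x_i, \qquad \frac{d}{d\theta} \log \mbox{lik}(\theta) = \frac{n}{\theta} - \sum_{i=1}^n x_i</math>

Setting the derivative to zero gives <math>\hat{\theta} = n / \sum_{i=1}^n x_i = 1/\bar{x}</math>. The second derivative, <math>-n/\theta^2</math>, is negative for every <math>\theta > 0</math>, so this stationary point is indeed the maximum.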


  • The likelihood is a function of <math>\theta</math> for fixed values of <math>x_1,x_2,\ldots,x_n</math>.
  • The maximum likelihood estimator may not be unique, or indeed may not even exist.
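When no closed-form maximiser is available, the likelihood can be maximised numerically. The sketch below assumes independent exponentially distributed observations with unknown rate, for which the log-likelihood is <math>n \log\theta - \theta \sum x_i</math> and the maximiser is known to be the reciprocal of the sample mean, so the numerical answer can be checked. The data values and the helper names `log_likelihood` and `maximise` are hypothetical, and golden-section search is just one simple choice of optimiser.

```python
import math

def log_likelihood(rate, data):
    """Log-likelihood of i.i.d. exponential data for a given rate."""
    return len(data) * math.log(rate) - rate * sum(data)

def maximise(f, lo, hi, tol=1e-9):
    """Golden-section search for the maximum of a unimodal f on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) > f(d):
            b = d   # the maximum lies in [a, d]
        else:
            a = c   # the maximum lies in [c, b]
    return (a + b) / 2

# Hypothetical observations, treated as fixed while the rate varies.
data = [0.5, 1.2, 0.3, 2.0, 1.0]

rate_hat = maximise(lambda r: log_likelihood(r, data), 1e-6, 50.0)
closed_form = len(data) / sum(data)

print(rate_hat, closed_form)
```

The search relies on the log-likelihood having a single peak on the interval. If the maximiser is not unique, or none exists, a numerical search like this can return a misleading value.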


Functional invariance

If <math>\widehat{\theta}</math> is the maximum likelihood estimator (MLE) for <math>\theta</math>, then the MLE for <math>\alpha = g(\theta)</math> is <math>\widehat{\alpha} = g(\widehat{\theta})</math>. The function <math>g</math> need not be one-to-one. For details, see the proof of Theorem 7.2.10 of Statistical Inference by George Casella and Roger L. Berger.
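The invariance property can be checked numerically in a simple case. The sketch below assumes Bernoulli data (3 successes in 10 trials, made-up numbers) and takes <math>g(p) = p/(1-p)</math>, the odds. It maximises the likelihood once over <math>p</math> and once directly over the odds, using a crude grid search, and compares the results. Note that this particular <math>g</math> happens to be one-to-one.

```python
import math

successes, trials = 3, 10   # hypothetical data: 3 successes in 10 trials

def log_lik_p(p):
    """Bernoulli log-likelihood as a function of the success probability."""
    return successes * math.log(p) + (trials - successes) * math.log(1 - p)

def log_lik_odds(odds):
    """The same log-likelihood, reparameterised by odds = p / (1 - p)."""
    return log_lik_p(odds / (1 + odds))

# Grid search over each parameterisation separately (step 0.001).
p_grid = [i / 1000 for i in range(1, 1000)]
odds_grid = [i / 1000 for i in range(1, 5001)]

p_hat = max(p_grid, key=log_lik_p)
odds_hat = max(odds_grid, key=log_lik_odds)

# g(p_hat) and the directly maximised odds agree up to the grid resolution.
print(p_hat, p_hat / (1 - p_hat), odds_hat)
```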

Asymptotic behaviour

Under suitable regularity conditions, maximum likelihood estimators achieve the minimum variance given by the Cramér-Rao lower bound in the limit as the sample size tends to infinity. When the MLE is unbiased, we may equivalently say that it has minimum mean squared error in the limit.
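To see the limiting behaviour in one concrete case, assume independent exponentially distributed observations with rate <math>\lambda</math> (an example chosen here for illustration). The MLE is <math>\hat{\lambda} = n / \sum x_i</math>, its exact variance is <math>n^2 \lambda^2 / ((n-1)^2 (n-2))</math> for <math>n > 2</math>, and the Cramér-Rao bound for unbiased estimators is <math>\lambda^2 / n</math>. The ratio of the two does not depend on <math>\lambda</math>, and the sketch below evaluates it for growing <math>n</math>.

```python
def variance_ratio(n):
    """Var(MLE of the rate) divided by the Cramér-Rao bound, for n > 2."""
    return n**3 / ((n - 1) ** 2 * (n - 2))

# The ratio exceeds 1 for every finite n and approaches 1 as n grows.
for n in (5, 50, 500, 5000):
    print(n, variance_ratio(n))
```

For small samples the variance is well above the bound; only in the limit does the ratio tend to 1.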

For independent observations, the maximum likelihood estimator often follows an asymptotic normal distribution.
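A small simulation can make this plausible. The sketch below assumes exponentially distributed data with a made-up true rate, repeatedly computes the MLE <math>n / \sum x_i</math>, and standardises it using the asymptotic standard deviation <math>\lambda / \sqrt{n}</math>. If the normal approximation is good, the standardised values should have mean near 0, standard deviation near 1, and roughly 95% of them should fall within <math>\pm 1.96</math>. The sample size, number of replications and seed are arbitrary choices.

```python
import math
import random
import statistics

random.seed(0)     # fixed seed so the run is reproducible

true_rate = 2.0    # hypothetical true parameter
n = 400            # observations per simulated sample
reps = 2000        # number of simulated samples

z_values = []
for _ in range(reps):
    sample = [random.expovariate(true_rate) for _ in range(n)]
    rate_hat = n / sum(sample)   # MLE of the rate for this sample
    # Standardise with the asymptotic standard deviation true_rate / sqrt(n).
    z_values.append(math.sqrt(n) * (rate_hat - true_rate) / true_rate)

# Fraction of standardised estimates inside the central 95% normal range.
inside = sum(abs(z) < 1.96 for z in z_values) / reps

print(statistics.mean(z_values), statistics.stdev(z_values), inside)
```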


The bias of maximum likelihood estimators can be substantial. Consider a case where <math>n</math> tickets numbered from 1 to <math>n</math> are placed in a box and one is selected at random (see uniform distribution). If <math>n</math> is unknown, then the maximum likelihood estimator of <math>n</math> is the value on the drawn ticket, even though the expectation of that value is only <math>(n+1)/2</math>, so the estimator under-estimates <math>n</math> by <math>(n-1)/2</math> on average. In estimating the highest number <math>n</math>, we can only be certain that it is greater than or equal to the drawn ticket number.
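The ticket example can be verified exactly. If the box holds <math>m</math> tickets, the probability of drawing a particular ticket <math>x</math> is <math>1/m</math> when <math>m \ge x</math> and 0 otherwise, so the likelihood is largest at <math>m = x</math>. The sketch below (function names are illustrative, and the search cap of 1000 candidates is an arbitrary assumption) finds the maximiser by brute force and computes the exact expectation of the estimator for a box of 10 tickets.

```python
from fractions import Fraction

def likelihood(m, drawn):
    """Probability of drawing ticket `drawn` when the box holds m tickets."""
    return Fraction(1, m) if m >= drawn else Fraction(0)

def mle(drawn, max_candidate=1000):
    """Candidate box size with the highest likelihood (searched up to a cap)."""
    return max(range(1, max_candidate + 1), key=lambda m: likelihood(m, drawn))

def expected_mle(n):
    """Exact expectation of the MLE when the true number of tickets is n."""
    return sum(Fraction(mle(x), n) for x in range(1, n + 1))

print(mle(7))            # 7
print(expected_mle(10))  # 11/2
```

With 10 tickets the estimator averages 11/2, matching <math>(n+1)/2</math> and falling well short of the true value.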

See also

  • The mean squared error is a measure of how 'good' an estimator of a distributional parameter is (be it the maximum likelihood estimator or some other estimator).
  • The article on the Rao-Blackwell theorem for a discussion on finding the best possible unbiased estimator (in the sense of having minimal mean squared error) by a process called Rao-Blackwellisation. The MLE is often a good starting place for the process.
  • The reader may be intrigued to learn that the MLE (if it exists and is unique) will always be a function of a sufficient statistic for the parameter in question.

External resources
