Newick phylogenetic tree format


The Newick phylogenetic tree format (aka Newick Standard or New Hampshire format) for representing trees in computer-readable form makes use of the correspondence between trees and nested parentheses, noticed in 1857 by the famous English mathematician Arthur Cayley.

Joseph Felsenstein's formal examples

If we have this rooted tree:

           C
       A   |   E
        \  |  /
     B   \ | /   D
      \   \|/   /
       \   |   /
        \  |  /
         \ | /
          \|/
           |
           |

then in the tree file it is represented by the following sequence of printable characters:

(B,(A,C,E),D);

The tree ends with a semicolon. The bottommost node in this tree is an interior node, not a tip. Interior nodes are represented by a pair of matched parentheses. Between them are representations of the nodes that are immediately descended from that node, separated by commas. In the above tree, the immediate descendants are B, another interior node, and D. The other interior node is represented by a pair of parentheses, enclosing representations of its immediate descendants, A, C, and E. In our example these happen to be tips, but in general they could also be interior nodes and the result would be further nestings of parentheses, to any level.
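The correspondence between nesting and parentheses can be sketched in a few lines of Python. This is only an illustration: representing a tree as nested tuples of tip names, and the function name to_newick, are assumptions made here and are not part of the standard.

```python
def to_newick(tree):
    """Serialise a tree given as nested tuples of tip names (hypothetical representation)."""
    def render(node):
        if isinstance(node, str):
            # A tip is written as its name.
            return node
        # An interior node is a pair of matched parentheses around
        # its immediate descendants, separated by commas.
        return "(" + ",".join(render(child) for child in node) + ")"
    # The whole tree ends with a semicolon.
    return render(tree) + ";"


# The example tree: the bottommost node has descendants B, an interior node, and D.
example = ("B", ("A", "C", "E"), "D")
print(to_newick(example))  # (B,(A,C,E),D);
```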

Tips are represented by their names. A name can be any string of printable characters except blanks, colons, semicolons, parentheses, and square brackets.

Because you may want to include a blank in a name, it is assumed that an underscore character ("_") stands for a blank; any of these in a name will be converted to a blank when it is read in. Any name may also be empty: a tree like

(,(,,),);

is allowed. Trees can be multifurcating at any level.
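As a rough sketch of the naming rules, the two helpers below check a name against the excluded characters and convert underscores to blanks. The function names are hypothetical, and treating the comma as excluded is an assumption made here (it is not in the list above, but it separates descendants).

```python
# Characters excluded from names by the description above.
# Assumption: the comma is excluded too, because it separates descendants.
FORBIDDEN = set(" :;()[],")


def is_valid_name(name: str) -> bool:
    """True if every character is printable and not excluded; the empty name is allowed."""
    return all(ch.isprintable() and ch not in FORBIDDEN for ch in name)


def display_name(name: str) -> str:
    """Convert underscores to blanks, as done when a name is read in."""
    return name.replace("_", " ")


print(is_valid_name("sea_lion"))   # True
print(is_valid_name("sea lion"))   # False, a blank is not allowed
print(display_name("sea_lion"))    # sea lion
```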

Branch lengths can be incorporated into a tree by putting a real number, with or without decimal point, after a node and preceded by a colon. This represents the length of the branch immediately below that node. Thus the above tree might have lengths represented as:

(B:6.0,(A:5.0,C:3.0,E:4.0):5.0,D:11.0);

The tree starts on the first line of the file, and can continue to subsequent lines. It is best to proceed to a new line, if at all, immediately after a comma. Blanks can be inserted at any point except in the middle of a species name or a branch length.

The above description is actually of a subset of the Newick Standard. For example, interior nodes can have names in that standard. These names follow the right parenthesis for that interior node, as in this example:

(B:6.0,(A:5.0,C:3.0,E:4.0)Ancestor1:5.0,D:11.0);
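The subset described so far (nested parentheses, optional names, optional branch lengths, interior node names) is simple enough to read with a small recursive-descent parser. The following is a minimal sketch, not a reference implementation: the Node class and parse_newick are names invented here, and quoted labels and comments are deliberately not handled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str = ""
    length: Optional[float] = None          # branch below this node, if given
    children: List["Node"] = field(default_factory=list)


def parse_newick(text: str) -> Node:
    # Blanks and line breaks carry no meaning in this subset, so drop them first.
    s = "".join(text.split())
    pos = 0

    def peek() -> str:
        return s[pos] if pos < len(s) else ""

    def parse_subtree() -> Node:
        nonlocal pos
        node = Node()
        if peek() == "(":
            pos += 1                          # consume "("
            while True:
                node.children.append(parse_subtree())
                if peek() == ",":
                    pos += 1                  # another descendant follows
                elif peek() == ")":
                    pos += 1                  # end of this interior node
                    break
                else:
                    raise ValueError(f"expected ',' or ')' at position {pos}")
        # Optional name (possibly empty); underscores stand for blanks.
        start = pos
        while pos < len(s) and s[pos] not in ",():;":
            pos += 1
        node.name = s[start:pos].replace("_", " ")
        # Optional branch length, introduced by a colon.
        if peek() == ":":
            pos += 1
            start = pos
            while pos < len(s) and s[pos] not in ",();":
                pos += 1
            node.length = float(s[start:pos])
        return node

    root = parse_subtree()
    if peek() != ";":
        raise ValueError("tree must end with a semicolon")
    return root


tree = parse_newick("(B:6.0,(A:5.0,C:3.0,E:4.0)Ancestor1:5.0,D:11.0);")
print([child.name for child in tree.children])  # ['B', 'Ancestor1', 'D']
```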

To help you understand this tree representation, here are some trees in the above form:

((raccoon:19.19959,bear:6.80041):0.84600,((sea_lion:11.99700, seal:12.00300):7.52973,((monkey:100.85930,cat:47.14069):20.59201, weasel:18.87953):2.09460):3.87382,dog:25.46154); 

(Bovine:0.69395,(Gibbon:0.36079,(Orang:0.33636,(Gorilla:0.17147,(Chimp:0.19268, Human:0.11927):0.08386):0.06124):0.15057):0.54939,Mouse:1.21460):0.10; 

(Bovine:0.69395,(Hylobates:0.36079,(Pongo:0.33636,(G._Gorilla:0.17147, (P._paniscus:0.19268,H._sapiens:0.11927):0.08386):0.06124):0.15057):0.54939, Rodent:1.21460); 

A; 

((A,B),(C,D)); 

(Alpha,Beta,Gamma,Delta,,Epsilon,,,); 

The Newick Standard does not make a unique representation of a tree, for two reasons. First, the left-right order of descendants of a node affects the representation, even though it is biologically uninteresting. Thus

(A,(B,C),D);

is the same tree as

(A,(C,B),D);
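One common way to compare rooted trees regardless of the left-right order is to sort the descendants of every node before writing the tree out. The sketch below assumes, for illustration only, that a tree is held as nested tuples of tip names; the function name canonical is hypothetical.

```python
def canonical(node):
    """Return a string for a rooted tree that ignores the order of descendants."""
    if isinstance(node, str):
        return node
    # Canonicalise each descendant first, then sort the resulting strings.
    return "(" + ",".join(sorted(canonical(child) for child in node)) + ")"


first = ("A", ("B", "C"), "D")
second = ("A", ("C", "B"), "D")
print(canonical(first) == canonical(second))  # True
```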

In addition, the standard represents a rooted tree. For many biological purposes we may not be able to infer the position of the root. We would like to have a representation of an unrooted tree when describing inferences in such cases. Here the convention is simply to arbitrarily root the tree and report the resulting rooted tree. Thus

(B,(A,D),C);

would be the same unrooted tree as

(A,(B,C),D);

and as

((A,D),(C,B));
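A standard way to check that two rooted representations describe the same unrooted tree is to compare the sets of splits (bipartitions of the tips) induced by their internal branches. The sketch below assumes trees are held as nested tuples of distinct tip names; the function names are hypothetical.

```python
def leaves(node):
    """Set of tip names below a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def splits(tree):
    """Non-trivial splits of the tips, each recorded by the side lacking a fixed anchor tip."""
    all_leaves = leaves(tree)
    anchor = min(all_leaves)       # any fixed tip works; it only normalises the sides
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        below = leaves(node)
        side = below if anchor not in below else all_leaves - below
        # Keep only splits with at least two tips on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found


t1 = ("B", ("A", "D"), "C")
t2 = ("A", ("B", "C"), "D")
t3 = (("A", "D"), ("C", "B"))
print(splits(t1) == splits(t2) == splits(t3))  # True
```

All three trees induce the single split {A, D} versus {B, C}, which is why they count as the same unrooted tree.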

In spite of this non-uniqueness, the readability of the resulting representation (for trees of modest size) and the ease of writing programs that read it have kept this standard in widespread use.

The Newick Standard was adopted on 26 June 1986 by an informal committee convened by me during the Society for the Study of Evolution meetings in Durham, New Hampshire, and consisting of James Archie, William H.E. Day, Wayne Maddison, Christopher Meacham, F. James Rohlf, David Swofford, and Joseph Felsenstein. (The committee was not an activity of the SSE, nor was it endorsed by it.) The name comes from the second and final session of the committee, which met at Newick's restaurant in Dover, where we enjoyed a meal of lobsters. There has as yet been no formal publication of the Newick Standard. Gary Olsen has produced a formal description of it, which is available below.

Gary Olsen's Interpretation of the "Newick Tree Format Standard"

Conventions

  • Items in { } may appear zero or more times.
  • Items in [ ] are optional, they may appear once or not at all.
  • All other punctuation marks (colon, semicolon, parentheses, comma and single quote) are required parts of the format.

               tree ==> descendant_list [ root_label ] [ : branch_length ] ;

    descendant_list ==> ( subtree { , subtree } )

            subtree ==> descendant_list [internal_node_label] [: branch_length]
                    ==> leaf_label [: branch_length]

         root_label ==> label
internal_node_label ==> label
         leaf_label ==> label

              label ==> unquoted_label
                    ==> quoted_label

     unquoted_label ==> string_of_printing_characters
       quoted_label ==> ' string_of_printing_characters '

      branch_length ==> signed_number
                    ==> unsigned_number

Notes

  • Unquoted labels may not contain blanks, parentheses, square brackets, single quotes, colons, semicolons, or commas.
  • Underscore characters in unquoted labels are converted to blanks.
  • Single quote characters in a quoted label are represented by two single quotes.
  • Blanks or tabs may appear anywhere except within unquoted labels or branch_lengths.
  • Newlines may appear anywhere except within labels or branch_lengths.
  • Comments are enclosed in square brackets and may appear anywhere newlines are permitted.
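The notes above translate fairly directly into a tokenizer. The sketch below is one possible reading, not Olsen's code: the function name tokenize and the ("punct", ...) / ("word", ...) token shapes are assumptions, and comments are treated as non-nesting.

```python
def tokenize(text):
    """Split a Newick string into ('punct', ch) and ('word', s) tokens."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1                                # blanks, tabs, newlines between tokens
        elif ch == "[":
            end = text.find("]", i)               # comment: skip to the closing bracket
            if end == -1:
                raise ValueError("unterminated comment")
            i = end + 1
        elif ch in "(),:;":
            tokens.append(("punct", ch))
            i += 1
        elif ch == "'":
            i += 1                                # quoted label
            chars = []
            while True:
                if i >= n:
                    raise ValueError("unterminated quoted label")
                if text[i] == "'":
                    if i + 1 < n and text[i + 1] == "'":
                        chars.append("'")         # two single quotes stand for one
                        i += 2
                    else:
                        i += 1                    # closing quote
                        break
                else:
                    chars.append(text[i])
                    i += 1
            tokens.append(("word", "".join(chars)))
        else:
            start = i                             # unquoted label or branch length
            while i < n and not text[i].isspace() and text[i] not in "()[]':;,":
                i += 1
            # Underscores in unquoted labels become blanks; numbers contain none.
            tokens.append(("word", text[start:i].replace("_", " ")))
    return tokens


for token in tokenize("('Homo sapiens':0.1,[a comment]'O''Brien',sea_lion);"):
    print(token)
```

Tagging each token with its kind keeps a quoted label such as '(' distinct from the punctuation mark itself.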

Other notes

  • PAUP (David Swofford) allows nesting of comments.
  • TreeAlign (Jotun Hein) writes a root node branch length (with a value of 0.0).
  • PHYLIP (Joseph Felsenstein) requires that an unrooted tree begin with a trifurcation; it will not "uproot" a rooted tree.

Example

(((One:0.2,Two:0.3):0.3,(Three:0.5,Four:0.3):0.2):0.3,Five:0.7):0.0;

           +-+ One
        +--+
        |  +--+ Two
     +--+
     |  | +----+ Three
     |  +-+
     |    +--+ Four
     +
     +------+ Five

More examples

The Newick format describes a phylogenetic tree as a string of text. Parentheses group sequence names, and a branch length is given as a colon followed by the length. The string ends with a semicolon.

An example phylogeny:

((Human:0.1,Gorilla:0.1):0.4,(Mouse:0.2,Rat:0.2):0.3);

This describes the following phylogeny:

[Figure: Newick format example.png]

The topology of the phylogeny is specified by omitting branch lengths:

((Human,Gorilla),(Mouse,Rat));
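For simple strings like the one above, the topology can be recovered by deleting every colon and the number that follows it. The regular expression below is a sketch under the assumption that no quoted label or comment contains a colon followed by a number; the function name is hypothetical.

```python
import re

# A colon, optional blanks, then a signed or unsigned number
# (with or without decimal point, optionally with an exponent).
_BRANCH_LENGTH = re.compile(r":\s*[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def strip_branch_lengths(newick: str) -> str:
    """Remove all branch lengths, leaving only the topology."""
    return _BRANCH_LENGTH.sub("", newick)


print(strip_branch_lengths("((Human:0.1,Gorilla:0.1):0.4,(Mouse:0.2,Rat:0.2):0.3);"))
# ((Human,Gorilla),(Mouse,Rat));
```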

Copyrights

© Copyright 1986-2004 by The University of Washington. Written by Joseph Felsenstein. Permission is granted to copy this document provided that no fee is charged for it and that this copyright notice is not removed.

Algorithm by Guy Zinman and Christoph Champ

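  • Printable characters (in this notation each label precedes its list of descendants, unlike the Newick Standard described above):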
(A(B(C,D(E))F(G,H)));
  • Tree:
       A
      / \
     /   \
    /     \
   B       F
  / \     / \
 |   |   |   |
 C   D   G   H
     |
     E
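  • Matrix 1 (number of edges between a node and each of its ancestors; 0 for pairs that are not ancestor and descendant):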
  A B C D E F G H
A 0 
B 1 0
C 2 1 0
D 2 1 0 0
E 3 2 0 1 0
F 1 0 0 0 0 0
G 2 0 0 0 0 1 0
H 2 0 0 0 0 1 0 0
  • Matrix 2 (number of edges on the path between each pair of nodes):
  A B C D E F G H
A 0
B 1 0
C 2 1 0
D 2 1 2 0
E 3 2 3 1 0
F 1 2 3 3 4 0
G 2 3 4 4 5 1 0
H 2 3 4 4 5 1 2 0
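The page does not spell out the algorithm itself, so the following is only a guess at what the two matrices contain, checked against the values shown: Matrix 1 appears to hold the number of edges between a node and each of its ancestors, and Matrix 2 the number of edges on the path between any two nodes. The parent map is read off the drawing above; all function names are hypothetical.

```python
# Parent of each node, read off the drawing (A is the root).
PARENT = {"A": None, "B": "A", "F": "A", "C": "B", "D": "B",
          "E": "D", "G": "F", "H": "F"}
ORDER = "ABCDEFGH"


def path_to_root(node):
    """List of nodes from the given node up to the root, inclusive."""
    path = [node]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def ancestor_distance(u, v):
    """Edges between u and v if one is an ancestor of the other, else 0 (Matrix 1)."""
    pu, pv = path_to_root(u), path_to_root(v)
    if v in pu:
        return pu.index(v)
    if u in pv:
        return pv.index(u)
    return 0


def path_distance(u, v):
    """Edges on the path from u to v through their lowest common ancestor (Matrix 2)."""
    steps_from_u = {node: i for i, node in enumerate(path_to_root(u))}
    for j, node in enumerate(path_to_root(v)):
        if node in steps_from_u:
            return steps_from_u[node] + j
    raise ValueError("nodes are not in the same tree")


def lower_triangle(metric):
    """Rows of the lower-triangular matrix, in the order A..H."""
    return [[metric(row, col) for col in ORDER[: i + 1]]
            for i, row in enumerate(ORDER)]


for name, row in zip(ORDER, lower_triangle(path_distance)):
    print(name, *row)
```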
