Dot-star file format
Latest revision as of 04:57, 13 September 2006
The dot-star file format (or ".* format") is a standard used in phylogenetics to represent a taxon bipartition that is specified by removing a branch (edge), thereby dividing the species into those to the left and those to the right of the branch. Here, taxa to one side of the removed branch are denoted "." and those to the other side are denoted "*".
- dots ("."): the taxa that are on one side of the partition
- stars ("*"): the taxa that are on the other side of the partition
Example dot-star file
    ..********* 651
    *********.. 736
    **..******* 620
    ******.**.. 103
    ******..*.. 88
    **..**..*.. 125
    **..**..... 271
    ....**..... 312
    11112311111
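The above is a diagrammatic representation of the tree below. Each row represents one tree cycle, which defines two groups. Each column is one sequence; in each row the stars mark one group and the dots mark the other. The number at the end of a row is the number of bootstrap samples in which that grouping occurs. The final row (11112311111) carries no count; it appears to label the three groups joined at the last cycle (the trichotomy).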
This is an UNROOTED tree (note: numbers in parentheses are branch lengths):
    Cycle 1 = SEQ: 1 ( 0.00000) joins SEQ: 2 ( 0.00000)
    Cycle 2 = SEQ: 10 ( 0.00894) joins SEQ: 11 ( 0.01190)
    Cycle 3 = SEQ: 3 ( -0.00940) joins SEQ: 4 ( 0.00940)
    Cycle 4 = SEQ: 7 ( -0.00030) joins Node: 10 ( 0.00974)
    Cycle 5 = Node: 7 ( 0.00018) joins SEQ: 8 ( -0.00018)
    Cycle 6 = Node: 3 ( 0.00950) joins Node: 7 ( 0.00011)
    Cycle 7 = Node: 3 ( 0.00000) joins SEQ: 9 ( -0.00011)
    Cycle 8 = Node: 1 ( 0.01887) joins Node: 3 ( 0.00011)
    Cycle 9 (Last cycle, trichotomy):
        Node: 1 ( 0.00000) joins
        SEQ: 5 ( 0.01887) joins
        SEQ: 6 ( 0.01887)
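The correspondence between the cycle listing and the dot-star rows can be checked with a short sketch. It rests on two assumptions that are not stated in the listing itself: "Node: N" refers to the cluster whose lowest-numbered member is sequence N, and the newly merged cluster is the group drawn with dots. The function name `rows_from_joins` is made up for this illustration.

```python
def rows_from_joins(n_seqs, joins):
    """Replay pairwise joins and emit one dot-star row per join.

    Assumptions: each join names a cluster by its lowest-numbered member,
    and the newly merged cluster is the side drawn with dots.
    """
    # Every sequence starts as its own cluster, keyed by its number.
    clusters = {i: {i} for i in range(1, n_seqs + 1)}
    rows = []
    for a, b in joins:
        merged = clusters.pop(a) | clusters.pop(b)
        clusters[min(merged)] = merged
        rows.append("".join("." if i in merged else "*"
                            for i in range(1, n_seqs + 1)))
    return rows


# Cycles 1 to 8 from the listing above; the final trichotomy is left out.
JOINS = [(1, 2), (10, 11), (3, 4), (7, 10), (7, 8), (3, 7), (3, 9), (1, 3)]
ROWS = rows_from_joins(11, JOINS)

for row in ROWS:
    print(row)
```

Under these assumptions the eight printed rows match the eight bipartition rows of the example file, in the same order.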
The columns of stars and dots in the table represent the sequences in the dataset, 1 to n from left to right. Each row represents the separation of the sequences into two groups (clades), the stars and the dots. The branch that separates the star clade from the dot clade occurs in the resampled trees the number of times indicated at the right end of each line, out of the total number of resamplings. Thus the validity of any predicted branch can be quantified.
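As a rough illustration of reading such a file, the sketch below parses the pattern and count from each row and turns a count into a support fraction. The helper names are invented here, and the total number of resamplings is not stated in the example, so the value 1000 used at the end is purely a placeholder.

```python
def parse_dot_star(text):
    """Return (pattern, count) pairs; lines that are not 'pattern count' are skipped."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if (len(parts) == 2
                and set(parts[0]) <= {".", "*"}
                and parts[1].isdigit()):
            rows.append((parts[0], int(parts[1])))
    return rows


def star_members(pattern):
    """1-based positions of the sequences marked with a star."""
    return [i + 1 for i, c in enumerate(pattern) if c == "*"]


def support(count, n_resamplings):
    """Fraction of resampled trees that contain the branch."""
    return count / n_resamplings


SAMPLE = """\
..********* 651
*********.. 736
**..******* 620
******.**.. 103
******..*.. 88
**..**..*.. 125
**..**..... 271
....**..... 312
11112311111
"""

PARSED = parse_dot_star(SAMPLE)
N_RESAMPLINGS = 1000  # hypothetical total, not given in the example

for pattern, count in PARSED:
    print(pattern, star_members(pattern), support(count, N_RESAMPLINGS))
```

The last line of the sample has no count and contains digits rather than dots and stars, so the parser skips it.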
Simple Example
For each taxonomic assignment in your database, you then count how often the query sequence falls in a partition (one of the two sets defined by an edge in the tree) whose other members all belong to that taxonomic assignment.
For example, if you have 8 database sequences, of which sequences 1, 2, 3, and 5 belong to the group 'waggadoodles', and you have the following output:
    1: ...*.***.
    2: *.......*
    3: .*......*
    4: ******..*
    5: ..**....*
where the last sequence is the query sequence, then the probability of the query sequence belonging to the waggadoodles is 60%, because it formed a unique (monophyletic) group with at least some waggadoodles in 3 out of 5 cases (cases 1, 2 and 3).
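The 60% figure can be reproduced with a small sketch. It assumes each output line is one case and that a case counts when the query's side of the split contains, apart from the query itself, at least one sequence and only waggadoodles. The function names are illustrative only.

```python
def supports_assignment(pattern, query_index, group):
    """True if the query's side holds, besides the query, only members of `group`.

    `query_index` is 0-based; `group` holds 1-based sequence numbers.
    """
    side = pattern[query_index]
    companions = {i + 1 for i, c in enumerate(pattern)
                  if c == side and i != query_index}
    # Require at least one companion so an isolated query does not count.
    return bool(companions) and companions <= group


def assignment_probability(patterns, query_index, group):
    """Fraction of cases in which the query groups only with `group`."""
    hits = sum(supports_assignment(p, query_index, group) for p in patterns)
    return hits / len(patterns)


PATTERNS = ["...*.***.", "*.......*", ".*......*", "******..*", "..**....*"]
WAGGADOODLES = {1, 2, 3, 5}
QUERY_INDEX = 8  # the ninth (last) column is the query sequence

P = assignment_probability(PATTERNS, QUERY_INDEX, WAGGADOODLES)
print(P)  # 0.6
```

Cases 4 and 5 fail because the query shares its side with sequence 4 (and, in case 4, sequence 6), which are not waggadoodles.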
Complex Example
A taxon bipartition is specified by removing a branch, thereby dividing the species into those to the left and those to the right of the branch. Here, taxa to one side of the removed branch are denoted "." and those to the other side are denoted "*". The output includes the bipartition number (ID; sorted from highest to lowest probability), the bipartition (e.g., ...**..), the number of times the bipartition was observed (#obs), the posterior probability of the bipartition, and, if branch lengths were recorded on the trees in the file, the average (Mean(v)) and variance (Var(v)) of the lengths. A list of the taxa in the bipartition is given before the list of bipartitions.
In the table below, the first partition (ID 1) is the terminal branch leading to taxon 1, since it has a dot in the first position and a star in all other positions. Each row then gives the number of times the partition was sampled (#obs), the probability of the partition (Probab.), the standard deviation of the partition frequency (Stdev(s)), the mean (Mean(v)) and variance (Var(v)) of the branch length, the Potential Scale Reduction Factor (PSRF), and finally the number of runs in which the partition was sampled (Nruns). In this analysis there is overwhelming support for a single tree: every partition has a posterior probability of at least 0.995, and most have 1.0.
Summary statistics for taxon bipartitions:
    ID -- Partition     #obs  Probab.   Stdev(s)  Mean(v)   Var(v)    PSRF   Nruns
    ------------------------------------------------------------------------------
          123456789012
     1 -- .***********  1502  1.000000  0.000000  0.493059  0.004961  1.032  2
     2 -- ..**********  1502  1.000000  0.000000  0.262748  0.003413  1.016  2
     3 -- ..*********.  1502  1.000000  0.000000  0.125550  0.001955  1.000  2
     4 -- .......****.  1502  1.000000  0.000000  0.247413  0.002098  1.000  2
     5 -- .......***..  1501  0.999334  0.000942  0.047174  0.000350  1.012  2
     6 -- .......**...  1502  1.000000  0.000000  0.035448  0.000098  1.009  2
     7 -- ..*****.....  1502  1.000000  0.000000  0.126285  0.001214  0.999  2
     8 -- ..****......  1495  0.995340  0.002825  0.052450  0.000364  0.999  2
     9 -- ..***.......  1502  1.000000  0.000000  0.082370  0.000350  1.001  2
    10 -- ..**........  1498  0.997337  0.003766  0.029311  0.000116  1.000  2
    11 -- ...........*  1502  1.000000  0.000000  0.422998  0.003393  1.043  2
    12 -- ..........*.  1502  1.000000  0.000000  0.071104  0.000444  1.001  2
    13 -- .........*..  1502  1.000000  0.000000  0.056921  0.000138  1.002  2
    14 -- ........*...  1502  1.000000  0.000000  0.021871  0.000035  1.031  2
    15 -- .......*....  1502  1.000000  0.000000  0.017406  0.000029  0.999  2
    16 -- ......*.....  1502  1.000000  0.000000  0.163211  0.001024  1.018  2
    17 -- .....*......  1502  1.000000  0.000000  0.142974  0.000696  1.010  2
    18 -- ....*.......  1502  1.000000  0.000000  0.059229  0.000191  1.003  2
    19 -- ...*........  1502  1.000000  0.000000  0.063484  0.000151  1.010  2
    20 -- ..*.........  1502  1.000000  0.000000  0.050141  0.000136  1.004  2
    21 -- .*..........  1502  1.000000  0.000000  0.323419  0.003698  1.036  2
          123456789012
    ------------------------------------------------------------------------------
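A row of this table can be split on whitespace, and a terminal branch can be recognised as a partition that isolates exactly one taxon. The following is only a sketch of that idea; the function names and the dictionary keys are chosen for this example and do not come from any particular program.

```python
def parse_summary_row(line):
    """Split one table row into named fields.

    Assumed layout: ID -- Partition #obs Probab. Stdev(s) Mean(v) Var(v) PSRF Nruns
    """
    fields = line.split()
    return {
        "id": int(fields[0]),
        "partition": fields[2],  # fields[1] is the "--" separator
        "obs": int(fields[3]),
        "prob": float(fields[4]),
        "stdev": float(fields[5]),
        "mean_v": float(fields[6]),
        "var_v": float(fields[7]),
        "psrf": float(fields[8]),
        "nruns": int(fields[9]),
    }


def terminal_taxon(partition):
    """Return the 1-based taxon isolated by the split, or None for an internal branch."""
    for symbol in ".*":
        if partition.count(symbol) == 1:
            return partition.index(symbol) + 1
    return None


ROW_1 = parse_summary_row(
    " 1 -- .***********  1502  1.000000  0.000000  0.493059  0.004961  1.032  2"
)
print(ROW_1["id"], ROW_1["prob"], terminal_taxon(ROW_1["partition"]))
print(terminal_taxon("..**........"))  # internal branch, so None
```

Applied to the table above, this classifies IDs 1 and 11 to 21 as the twelve terminal branches and IDs 2 to 10 as internal branches.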
The clade credibility tree gives the probability of each partition or clade in the tree.
Clade credibility values:
    /--------------------------------------------------------- Tarsius_syrichta (1)
    |
    |--------------------------------------------------------- Lemur_catta (2)
    |
    |                                                /-------- Homo_sapiens (3)
    |                                        /--100--+
    |                                        |       \-------- Pan (4)
    |                                /--100--+
    |                                |       \---------------- Gorilla (5)
    |                       /---100--+
    +                       |        \------------------------ Pongo (6)
    |               /--100--+
    |               |       \--------------------------------- Hylobates (7)
    |               |
    |               |                                /-------- Macaca_fuscata (8)
    |       /--100--+                        /--100--+
    |       |       |                        |       \-------- M_mulatta (9)
    |       |       |                /--100--+
    |       |       |                |       \---------------- M_fascicularis (10)
    \--100--+       \-------100------+
            |                        \------------------------ M_sylvanus (11)
            |
            \------------------------------------------------- Saimiri_sciureus (12)