In any evolutionary process, speciation events cause a new species to split off from an existing one, thus creating the diversity of life forms we know today.
A key issue in evolutionary biology is to reconstruct the history of these speciation events. Given the properties of the leaf nodes, reconstruct what the tree is.
Much of the current interest in phylogenic trees follows from the increasing availability of DNA sequence data.
Biological applications include evolution studies (e.g. the out of Africa debate) and medical research (tracing HIV infection).
However, phylogenic trees also play an important role outside biology, in analyzing the history of languages, religions, chain letters, and medieval manuscripts.
Although the data available for analysis varies by application, it can usually be partitioned into distance and feature/character data.
Distance data measures (directly or indirectly) $d(i,j)$, the length of time since species $i$ and $j$ diverged.
Such time can be estimated from the distance between DNA sequences,
assuming a `molecular clock' governing the frequency of mutations.
Feature/character data measures taxonomical properties such as `warm blooded', `has wings', or `walks upright'. If such features are hard to develop twice, they describe branch points in the phylogenic tree.
Different types of reconstruction algorithms are necessary for such distinct types of data.
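Observe that there are many tree topologies possible for any set of $n$ leaves.
Every rooted binary tree on $n$ leaves has $n-1$ internal nodes, and thus $2n-1$ vertices and $2n-2$ edges.
Every unrooted binary tree on $n$ leaves has $2n-2$ vertices and $2n-3$ edges, since the in-degree 0 root can be contracted to a single edge.
Since the root can be positioned at any edge, there are $2n-3$ times as many rooted trees as unrooted trees on $n$ leaves.
Further, any rooted tree on $n-1$ leaves corresponds to an unrooted tree on $n$ leaves, since we can take the highest numbered leaf to be the root.
Thus, writing $R(n)$ and $U(n)$ for the number of rooted and unrooted binary trees on $n$ labeled leaves, we have $R(n) = (2n-3)\,U(n)$ and $U(n) = R(n-1)$, so
$$U(n) = (2n-5)\,U(n-1) = 1 \cdot 3 \cdot 5 \cdots (2n-5).$$
For $n = 10$, there are about 2,000,000 unrooted trees, and for $n = 20$ about $2.2 \times 10^{20}$, so the number of possible topologies grows very fast.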
This makes it hopeless to exhaustively search for the best possible tree beyond 10 or so species.
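The growth rate can be checked directly. The following is a minimal sketch, assuming labeled leaves and strictly binary trees; the function names `unrooted_trees` and `rooted_trees` are illustrative and not part of the original notes.

```python
def unrooted_trees(n):
    """Number of unrooted binary trees on n >= 3 labeled leaves: 1*3*5*...*(2n-5)."""
    count = 1
    for k in range(3, 2 * n - 4, 2):   # odd factors 3, 5, ..., 2n-5
        count *= k
    return count


def rooted_trees(n):
    """Each unrooted tree can be rooted on any of its 2n-3 edges."""
    return (2 * n - 3) * unrooted_trees(n)


if __name__ == "__main__":
    for n in (4, 10, 20):
        print(n, unrooted_trees(n), rooted_trees(n))
```

Python integers have arbitrary precision, so the count for 20 leaves is computed exactly rather than approximated.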
The business of reconstructing trees is very messy for several reasons:
Phylogenic tree problems have the same flavor as Steiner tree problems in graphs, where we must deduce the positions of intermediate nodes to find the best possible fit.
Distance data tree construction algorithms bear a strong relationship to clustering algorithms, particularly agglomerative algorithms which explicitly construct a tree as they merge clusters together.
Representative algorithms include nearest neighbor joining, and repeatedly merging the two clusters with the smallest average pairwise distance (the unweighted pair group method using arithmetic averages, UPGMA).
Such algorithms can yield reasonable and informative trees, but there seems no good reason to believe that they yield the correct tree.
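To make the agglomerative idea concrete, here is a minimal sketch of average-linkage merging in the spirit of UPGMA. It assumes a symmetric distance matrix given as a list of lists; the function name `upgma` and the nested-tuple tree representation `(left, right, height)` are illustrative choices, not taken from the notes or from any particular package.

```python
def upgma(D, names):
    """Average-linkage clustering; returns a nested tuple (left, right, height)."""
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}      # id -> (subtree, leaf count)
    dist = {(i, j): D[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        a, b = min(dist, key=dist.get)                    # closest pair of clusters
        d_ab = dist.pop((a, b))
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        for c in clusters:
            # Distance to the merged cluster is the size-weighted average.
            d_ac = dist.pop((min(a, c), max(a, c)))
            d_bc = dist.pop((min(b, c), max(b, c)))
            dist[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        # The merge point sits halfway between the two clusters.
        clusters[next_id] = ((tree_a, tree_b, d_ab / 2), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


if __name__ == "__main__":
    D = [[0, 2, 6],
         [2, 0, 6],
         [6, 6, 0]]
    print(upgma(D, ["A", "B", "C"]))
```

On clean ultrametric input the merge heights recover the divergence times; on noisy data the result is only a heuristic tree, as the text notes.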
Defining mathematical properties which real trees obey enables us to define optimization criteria which make it plausible to define the best tree.
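An ultrametric tree is a rooted tree where each internal node carries a numerical label, and the labels strictly decrease along every path from the root to a leaf.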
Note that if the labels mean ``time units ago'', this is true of any evolutionary tree.
Each pairwise distance $D(i,j)$ represents the label of the least common ancestor of $i$ and $j$.
There is an efficient $O(n^2)$ algorithm to reconstruct an ultrametric tree, if one exists.
Observe that the labels on the path from a leaf $i$ to the root follow from sorting the distances in row $i$, since a branching point corresponds to each unique distance.
Shared distances on the path partition the other leaves into groups in the other subtrees.
Thus each of the resulting partitions is fixed, and can be refined by considering other rows.
Unless we get a contradiction, we get an ultrametric tree. Further, this tree must be unique.
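Whether a distance matrix admits an ultrametric tree at all can be tested without building the tree. The sketch below does not implement the row-sorting construction described above; it swaps in the standard three-point condition instead: a symmetric matrix is ultrametric exactly when, for every triple of leaves, the largest of the three pairwise distances is attained at least twice. The function name `is_ultrametric` is an assumption for illustration, and the check is a brute-force $O(n^3)$ scan.

```python
from itertools import combinations


def is_ultrametric(D):
    """Three-point condition: in every triple the two largest distances are equal."""
    n = len(D)
    for i, j, k in combinations(range(n), 3):
        smallest, middle, largest = sorted((D[i][j], D[i][k], D[j][k]))
        if middle != largest:
            return False
    return True


if __name__ == "__main__":
    clean = [[0, 2, 6],
             [2, 0, 6],
             [6, 6, 0]]
    noisy = [[0, 2, 6],
             [2, 0, 5],
             [6, 5, 0]]
    print(is_ultrametric(clean), is_ultrametric(noisy))
```

The second matrix differs from the first by a single unit of noise, which is already enough to break the condition.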
The problem becomes hard when you seek the ``most'' ultrametric tree in noisy data. A little noise can throw the topology off considerably.
A weaker but well defined condition assumes that we have the distance between all pairs of leaves, and seek an unrooted, edge-weighted tree such that the sum of the distances along each path adds up to the defined distance.
An algorithm for additive tree reconstruction follows from a reduction to ultrametric trees.
First, lift one of the nodes $v$ with a maximum distance entry $m_v$ to the root.
Second, add weight to each leaf edge so that the total distance from root $v$ to each leaf is $m_v$.
Finally, assign each node the largest weight below it to give us matrix $D'$.
These node labels of $D'$ define an ultrametric tree, since they decrease along each root-to-leaf path.
Thus if we could construct this matrix $D'$ without the knowledge of the tree, we could solve the additive tree problem as an ultrametric tree problem by reversing this construction.
Since the additive input matrix $D$ determines $v$ and $m_v$, we need to compute the distance in the unknown tree from $v$ to the least common ancestor $w$ of each pair of leaves $i$ and $j$.
Because $D$ is additive, this distance is $(D(v,i) + D(v,j) - D(i,j))/2$.
Thus if $D$ is an additive matrix, then $D'$ is an ultrametric matrix, where
$$D'(i,j) = m_v - \frac{D(v,i) + D(v,j) - D(i,j)}{2}.$$
The technical assumption that all species in an ultrametric tree are leaves can be restored by hanging $v$ directly off the root.
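The reduction can be sketched in a few lines. This is an illustration under stated assumptions: the input is a symmetric additive matrix as a list of lists, the label of the least common ancestor of $i$ and $j$ is taken to be $m_v$ minus its distance from the lifted node $v$, and the function name `additive_to_ultrametric` and the dictionary output format are hypothetical.

```python
def additive_to_ultrametric(D):
    """Return (v, labels), where labels[(i, j)] is the ultrametric label of lca(i, j)."""
    n = len(D)
    v = max(range(n), key=lambda i: max(D[i]))   # a row holding a maximum entry
    m_v = max(D[v])
    others = [i for i in range(n) if i != v]
    labels = {}
    for a in range(len(others)):
        for b in range(a + 1, len(others)):
            i, j = others[a], others[b]
            # Distance from v to the point where the paths to i and j split.
            depth = (D[v][i] + D[v][j] - D[i][j]) / 2
            labels[(i, j)] = m_v - depth
    return v, labels


if __name__ == "__main__":
    # Leaves 0,1 hang off one internal node (edges 1 and 2), leaves 2,3 off
    # another (edges 4 and 5); the internal edge has weight 3.
    D = [[0, 3, 8, 9],
         [3, 0, 9, 10],
         [8, 9, 0, 9],
         [9, 10, 9, 0]]
    print(additive_to_ultrametric(D))
```

In the example the lifted node is leaf 1, and the resulting labels 8, 8, 5 satisfy the ultrametric requirement that the largest value in each triple is attained twice.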
Suppose we have $n$ species, and $m$ features each of which only evolved once (parsimony).
Each feature $j$ can be represented by an $n$-element 0/1-vector where $M[i,j] = 1$ iff species $i$ has feature $j$.
The perfect phylogeny problem asks for a phylogenic tree given an $n \times m$ binary feature matrix, if one exists.
Note that this is an edge-labeled tree, and certain features do not necessarily contribute to a split (e.g. 3), particularly when $m > n$.
Suppose we know that all features evolve from 0 to 1, i.e. the characters are ordered.
Claim: Matrix $M$ has a perfect phylogenic tree iff for every pair of columns $j$, $k$, the sets of 1s are either (1) disjoint, or (2) one contains the other.
If column $j$ contains all the 1s of column $k$, then feature $j$ evolved before feature $k$.
If two columns are disjoint, then the features evolved independently.
If species $a$ has feature $j$ but not $k$, and $b$ has feature $k$ but not $j$, while a third species has both, then no perfect phylogeny can exist.
This immediately gives an $O(nm^2)$ algorithm to test the matrix $M$ for this property, which can be improved to $O(nm)$ using radix sorting.
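The pairwise column test translates almost directly into code. The following is a minimal sketch of the straightforward quadratic-in-columns check, not the radix-sorting speedup; it assumes the feature matrix is a list of 0/1 rows, one per species, and the function name `has_perfect_phylogeny` is illustrative.

```python
def has_perfect_phylogeny(M):
    """True iff every pair of columns is disjoint or nested (ordered 0 -> 1 characters)."""
    m = len(M[0]) if M else 0
    # For each feature, the set of species that have it.
    cols = [{i for i, row in enumerate(M) if row[j]} for j in range(m)]
    for j in range(m):
        for k in range(j + 1, m):
            a, b = cols[j], cols[k]
            if a & b and not (a <= b or b <= a):
                return False        # overlapping but not nested: conflict
    return True


if __name__ == "__main__":
    compatible = [[1, 1, 0],
                  [1, 0, 0],
                  [0, 0, 1],
                  [0, 0, 1]]
    conflicting = [[1, 0],
                   [0, 1],
                   [1, 1]]
    print(has_perfect_phylogeny(compatible), has_perfect_phylogeny(conflicting))
```

In the conflicting matrix, the third species has both features while each of the other two has exactly one, which is the forbidden pattern.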
Define an $n \times n$ matrix $D$ where $D(i,j)$ equals the number of characters which species $i$ shares with $j$ for a given feature matrix $M$.
Given a perfect phylogeny for $M$, we can label each node of the tree with the number of labels encountered from the root.
These numbers are increasing along each root-to-leaf path, so negating them gives an ultrametric matrix.
Thus if $M$ has a perfect phylogeny, then $-D$ must be ultrametric.
This gives a reconstruction algorithm since ultrametric trees are unique.
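A small sketch of this connection follows, assuming that "shared characters" means features that both species possess. It builds the shared-character matrix and checks the necessary condition from the text: since $-D$ must be ultrametric, in every triple of species the smallest of the three shared counts must be attained at least twice. The names `shared_matrix` and `negation_is_ultrametric` are hypothetical.

```python
from itertools import combinations


def shared_matrix(M):
    """D[i][j] = number of features that species i and j both have."""
    n = len(M)
    return [[sum(x & y for x, y in zip(M[i], M[j])) for j in range(n)]
            for i in range(n)]


def negation_is_ultrametric(D):
    """Three-point condition on -D: the two smallest entries of each triple are equal."""
    for i, j, k in combinations(range(len(D)), 3):
        smallest, middle, _ = sorted((D[i][j], D[i][k], D[j][k]))
        if smallest != middle:
            return False
    return True


if __name__ == "__main__":
    M = [[1, 1, 0],
         [1, 0, 0],
         [0, 0, 1],
         [0, 0, 1]]
    D = shared_matrix(M)
    print(D, negation_is_ultrametric(D))
```

The check only tests the necessary condition stated above; building the tree itself would go through an ultrametric reconstruction on $-D$.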
The perfect phylogeny problem can be made more general by allowing non-binary features, e.g. the locomotion feature might be `fixed', `crawling', or `walking'.
In general, each feature is one of $r$ states.
In ordered phylogeny problems, we know the directed sequence of transitions for each character as we move down the tree.
There are polynomial ordered perfect phylogeny algorithms for any constant $r$, i.e. $r$ is in the exponent of the running time.
For unordered problems it is NP-complete for general $r$, but polynomial for small fixed values such as $r = 2$.
Another approach to reconstructing phylogenic trees is to carefully analyze small subsets of species to reconstruct their relative history, and then integrate a set of resulting trees into a consistent whole.
The smallest interesting unrooted trees contain four species, and are called quartets.
There are three possible quartets on any set of four species.
The set of $\binom{n}{4}$ quartets induced from any tree $T$ uniquely defines the topology of the tree.
Note that rooted triplets can be modeled as quartets if one species is ancestral.
In general, it is difficult to analyze the data to construct all possible quartets in a consistent manner. The problem of constructing the tree maximizing the number of satisfied quartets is NP-complete.
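The counting in this section is easy to make concrete. The sketch below lists the three possible quartet topologies on four species, each written as a split into two pairs, and counts how many four-species subsets a set of $n$ species has; the function names are illustrative assumptions.

```python
from itertools import combinations
from math import comb


def quartet_topologies(a, b, c, d):
    """The three ways to split four species into two pairs across an internal edge."""
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]


def four_species_subsets(species):
    """Every subset of four species; a tree induces exactly one topology on each."""
    return list(combinations(species, 4))


if __name__ == "__main__":
    print(quartet_topologies("human", "chimp", "mouse", "rat"))
    print(len(four_species_subsets(range(10))), comb(10, 4))
```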
Because the various algorithms and heuristics give different trees on the same data, more evidence is needed than a single tree to define the history.
For this reason, there are suites of programs (e.g. Phylip) which contain implementations of many different tree construction algorithms and heuristics.
Thus there is a need for algorithms which find the consensus of a set of trees, i.e. the branch points that all (or most) of the trees share in common.
Also there are tree compatibility problems, where we seek the most refined tree which does not contradict evidence from any of a set of partially specified trees.
Point mutations are only the simplest type of genetic modification event.
Duplication events happen when a second copy of a given gene is inserted into the chromosome. This allows for one copy of the gene to evolve a new function without preventing the production of the protein.
Reversal events happen when a portion of a chromosome is deleted and replaced with the reversed sequence. Such events occur during crossover operations, and can create entirely new genes as well as copies.
Translocation events happen when genes are moved to different locations or chromosomes. Translocated genes may be regulated in a different way in the new location. Genes can even jump across bacterial species (lateral transfer).
Reconstructing evolutionary history requires accounting for these large-scale events.
Each sperm/egg cell contains 23 chromosomes, representing the parent's genetic contribution to their offspring.
Other human cells contain 23 pairs of chromosomes, one of each pair being inherited from each parent.
Gametes form a single chromosome from each pair through recombination, where a crossover operation randomly alternates between the two chromosomes to select which gene copy to pass on.
Such sexual reproduction provides explanations for many things, including why introns may be good, and sex-linked and dominant/recessive diseases.
Errors in this process plausibly account for many genome rearrangement events.
Because there are second copies of most genes, cells can be surprisingly robust in the face of large scale changes.
Cancer cells often lose parts of chromosomes, and certain interesting but non-fatal diseases occur when extra copies of chromosomes occur.
As we have seen, large sets of homologous genes exist between pairs of organisms.
Crossover mutations cause reversals of a sequence of contiguous genes.
Biologists seek to reconstruct the evolutionary history between two species by finding the shortest sequence of reversals (crossover operations) necessary to bring all homologous pairs of genes into alignment.
Nadeau and Taylor estimate that $178 \pm 39$ crossovers occurred between mouse and man.
History reconstruction motivates the problem of sorting with reversals.
Any reversal operation changes the orientation of all reversed genes. Thus a sorting problem can be signed or unsigned depending upon whether we know/care or do not know/care about the orientation of each gene.
There are two distinct theoretical questions in sorting with reversals:
The diameter question asks which two length-$n$ permutations are farthest apart over all pairs, i.e. how many reversals always suffice to sort.
The distance question asks, for a given pair $\pi_1$, $\pi_2$ of length-$n$ permutations, what is the fewest reversals needed to bring $\pi_1$ to $\pi_2$, i.e. how many reversals are necessary.
Suppose we seek to sort a stack of pancakes by size using a spatula.
How many prefix reversals do we need to sort $n$ pancakes?
By flipping the largest unsorted pancake to the top, and then into position (i.e. selection sort), any permutation can be sorted in at most $2n$ reversals.
Any strategy must use at least $n$ reversals in the worst case, since each reversal removes at most one breakpoint, e.g. (1,3,5,7,2,4,6,8), where no two neighbors are consecutive.
Thus the diameter is between $n$ and $2n$, but tighter bounds are well known...
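The selection-sort strategy can be sketched as follows. The conventions are assumptions made for illustration: the list's first element is the top of the stack, a flip of size $k$ reverses the top $k$ pancakes, and the sorted stack has the largest pancake at the bottom (the end of the list). The function name `pancake_sort` is hypothetical.

```python
def pancake_sort(stack):
    """Sort by prefix reversals; returns (sorted_stack, list of flip sizes)."""
    s = list(stack)
    flips = []
    for size in range(len(s), 1, -1):
        big = s.index(max(s[:size]))       # largest pancake not yet in position
        if big == size - 1:
            continue                        # already where it belongs
        if big != 0:
            s[:big + 1] = s[:big + 1][::-1]  # bring it to the top
            flips.append(big + 1)
        s[:size] = s[:size][::-1]            # flip it down into position
        flips.append(size)
    return s, flips


if __name__ == "__main__":
    result, flips = pancake_sort([1, 3, 5, 7, 2, 4, 6, 8])
    print(result, flips, len(flips))
```

Each pancake costs at most two flips, which is where the $2n$ upper bound comes from.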
Any permutation can be partitioned into strips of consecutive increasing or decreasing elements.
The gaps between strips are breakpoints, and the number of breakpoints is a lower bound on the number of reversals needed to sort.
Suppose we modify the selection sort strategy to move the strip containing the largest unpositioned element into position.
One reversal brings the strip to the top, another brings the biggest element in the strip to the top (if necessary), and one last flip puts it into position.
Thus this strategy uses at most three times the number of required flips, and so approximates the actual pancake distance within a factor of 3 for every pair of permutations.
Suppose instead we just want to merge the strip on top with one of its two neighbors.
If the orientation is correct, one flip suffices; if not two flips suffice.
Since the number of breakpoints is equivalent to the number of strips, this gives a factor 2 approximation.
More careful analysis can lower the constant somewhat, but no polynomial algorithm is known to compute the minimum pancake distance.
Lower bounds on the length of the optimal solution are helpful in reasonably efficient branch and bound exhaustive search algorithms.
The motivating biological problem demands that we find the shortest sequence of general (not prefix) reversals.
Note that there are $\binom{n}{2}$ general reversals but only $n-1$ prefix reversals, so we have much more freedom.
Recently it was shown that computing the reversal distance of unsigned permutations is NP-complete.
However, there is a polynomial algorithm for computing reversal distance for signed permutations.
The signed case is biologically important since the orientation of genes can be determined from the DNA sequence.
The exact algorithm for signed permutations requires careful, technical combinatorial arguments.
Since the unsigned problem is NP-complete, we seek an approximation algorithm for it.
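A breakpoint occurs between two neighboring elements that are not consecutive; framing the permutation with 0 and $n+1$ makes out-of-position endpoints count as breakpoints too.
For example, $3\,2\,4\,5\,1$, framed as $0\,3\,2\,4\,5\,1\,6$, has breakpoints 03, 24, 51, and 16.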
Note that the number of breakpoints/2 is a lower bound on reversal distance, since any reversal can erase at most two breakpoints.
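Counting breakpoints is a one-pass scan. Here is a minimal sketch, assuming the permutation is a list of the integers 1 through $n$ and is framed by 0 and $n+1$ as in the example above; the function names `breakpoint_pairs` and `reversal_lower_bound` are illustrative.

```python
def breakpoint_pairs(perm):
    """Adjacent pairs of the framed permutation that are not consecutive numbers."""
    framed = [0] + list(perm) + [len(perm) + 1]
    return [(a, b) for a, b in zip(framed, framed[1:]) if abs(a - b) != 1]


def reversal_lower_bound(perm):
    """A reversal touches two adjacencies, so it removes at most two breakpoints."""
    b = len(breakpoint_pairs(perm))
    return (b + 1) // 2          # ceiling of b / 2


if __name__ == "__main__":
    perm = [3, 2, 4, 5, 1]
    print(breakpoint_pairs(perm), reversal_lower_bound(perm))
```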
A strip of consecutive numbers between two breakpoints is increasing if the numbers strictly increase. Otherwise it is decreasing. A singleton strip will be defined as decreasing.
Claim 1: Any non-identity permutation without decreasing strips has a reversal which does not increase breakpoints but leaves a non-trivial decreasing strip.
Proof: Flip the end strip, e.g. 45123 goes to 45321.
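Claim 2: If a permutation has a decreasing strip, then there exists a reversal which decreases the number of breakpoints.
Proof: Match up the smallest endpoint of a decreasing strip. If $k$ is the smallest element in any decreasing strip, then $k-1$ ends an increasing strip, and one reversal makes $k-1$ and $k$ adjacent.
E.g: $0\,4\,3\,1\,2\,5$ goes to $0\,4\,3\,2\,1\,5$ and $0\,1\,2\,5\,4\,3\,6$ goes to $0\,1\,2\,3\,4\,5\,6$.
The same move is not guaranteed to work when the chosen decreasing strip does not contain the smallest such element.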
If there is a decreasing strip, use one reversal and remove a breakpoint.
If not, use one reversal to create a decreasing strip.
Since one breakpoint is removed every other iteration (at least), twice the number of breakpoints is an upper bound on the reversal distance.
Since this is 4 times the lower bound we have a factor 4 approximation.
The approximation factor can be lowered to 1.5 with more careful structural analysis.
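The two-case strategy above can be sketched in code. This is an illustration under stated assumptions, not the constructive procedure from the proofs: instead of locating the smallest element of a decreasing strip, it simply tries every reversal and takes the first one that lowers the breakpoint count (Claim 2 guarantees one exists whenever there is a decreasing strip). If none exists, it flips the first interior strip, as in Claim 1. All function names are hypothetical.

```python
def breakpoints(perm):
    """Number of non-consecutive neighbors in the permutation framed by 0 and n+1."""
    framed = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if abs(a - b) != 1)


def find_reducing_reversal(p):
    """First reversal (i, j), inclusive indices, that lowers the breakpoint count."""
    current = breakpoints(p)
    for i in range(len(p)):
        for j in range(i + 1, len(p)):
            q = p[:i] + p[i:j + 1][::-1] + p[j + 1:]
            if breakpoints(q) < current:
                return i, j
    return None


def greedy_reversal_sort(perm):
    """Sort by reversals; uses at most twice as many reversals as breakpoints."""
    p = list(perm)
    reversals = []
    while breakpoints(p) > 0:
        move = find_reducing_reversal(p)
        if move is None:
            # No decreasing strip: flip the strip between the first two breakpoints.
            framed = [0] + p + [len(p) + 1]
            cuts = [k for k in range(len(p) + 1)
                    if abs(framed[k] - framed[k + 1]) != 1]
            move = (cuts[0], cuts[1] - 1)
        i, j = move
        p[i:j + 1] = p[i:j + 1][::-1]
        reversals.append((i, j))
    return reversals


if __name__ == "__main__":
    print(greedy_reversal_sort([3, 2, 4, 5, 1]))
    print(greedy_reversal_sort([4, 5, 6, 1, 2, 3]))
```

The second example has no decreasing strip and no breakpoint-reducing reversal at the start, so it exercises the strip-flipping case. The exhaustive search over reversals keeps the sketch short at the cost of extra running time per step.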