In any evolutionary process, speciation events cause a new species to split off from an existing one, thus creating the diversity of life forms we know today.
A key issue in evolutionary biology is to reconstruct the history of these speciation events: given the properties of the leaf nodes (the species we observe), reconstruct the tree.
Much of the current interest in phylogenic trees follows from the increasing availability of DNA sequence data.
Biological applications include evolution studies (e.g. the out of Africa debate) and medical research (tracing HIV infection).
However, phylogenic trees also play an important role outside biology, in analyzing the history of languages, religions, chain letters, and medieval manuscripts.
Although the data available for analysis varies by application, it can usually be partitioned into distance data and feature/character data.
Distance data measures (directly or indirectly) $d_{ij}$, the length of time since species i and j diverged. Such time can be estimated from the distance between DNA sequences, assuming a `molecular clock' governing the frequency of mutations.
Feature/character data measures taxonomical properties such as `warm blooded', `has wings', or `walks upright'. If such features are hard to develop twice, they describe branch points in the phylogenic tree.
Different types of reconstruction algorithms are necessary for such distinct types of data.
Observe that there are many tree topologies possible for any set of n leaves.
Every rooted binary tree on n leaves has n-1 internal nodes, and thus 2n-1 vertices and 2n-2 edges.
Every unrooted binary tree on n leaves has 2n-2 vertices and 2n-3 edges, since the degree-2 root can be contracted into a single edge.
Since the root can be positioned on any edge, there are 2n-3 times as many rooted trees as unrooted trees on n leaves.
Further, any rooted tree on n leaves corresponds to an unrooted tree on n+1 leaves, since we can take the highest numbered leaf to be the root.
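Thus the number of unrooted binary trees on n labeled leaves satisfies $U(n+1) = (2n-3) \, U(n)$ with $U(3) = 1$, so $U(n) = 1 \cdot 3 \cdot 5 \cdots (2n-5)$.
For n=10, there are about 2,000,000 unrooted trees, and for n=20 about $2.2 \times 10^{20}$, so the number of possible topologies grows very fast.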
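The counting argument above can be checked numerically. The following is a minimal Python sketch, assuming labeled leaves and strictly binary trees; the function names `unrooted_trees` and `rooted_trees` are illustrative and not from the original notes.

```python
def unrooted_trees(n):
    """Number of unrooted binary trees on n labeled leaves, for n >= 3.

    Uses the recurrence U(n+1) = (2n-3) * U(n) with U(3) = 1, which follows
    from the two observations in the text: a tree can be rooted on any of its
    2n-3 edges, and a rooted tree on n leaves is an unrooted tree on n+1.
    """
    if n < 3:
        raise ValueError("need at least three leaves")
    count = 1
    for k in range(3, n):
        count *= 2 * k - 3  # step from U(k) to U(k+1)
    return count


def rooted_trees(n):
    """Number of rooted binary trees on n labeled leaves, for n >= 3."""
    return (2 * n - 3) * unrooted_trees(n)


for n in (4, 10, 20):
    print(n, unrooted_trees(n), rooted_trees(n))
```

Because Python integers have arbitrary precision, the counts stay exact even when they exceed 64 bits.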
The business of reconstructing trees is very messy for several reasons:
Phylogenic tree problems have the same flavor as Steiner tree problems in graphs, where we must deduce the positions of intermediate nodes to find the best possible fit.
Distance data tree construction algorithms bear a strong relationship to clustering algorithms, particularly agglomerative algorithms which explicitly construct a tree as they merge clusters together.
Representative algorithms include nearest neighbor joining, and repeatedly merging the nearest cluster centroids (the unweighted pair group method using arithmetic averages (UPGMA)).
Such algorithms can yield reasonable and informative trees, but there seems no good reason to believe that they yield the correct tree.
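As an illustration of the agglomerative approach, here is a minimal Python sketch of UPGMA. It assumes a symmetric distance matrix given as a list of lists, and it updates cluster distances as size-weighted averages of the merged clusters' distances, which is the usual UPGMA rule. The function name `upgma` and the nested-tuple tree encoding `(left, right, height)` are assumptions made for this example.

```python
from itertools import combinations


def upgma(labels, d):
    """Build a UPGMA tree from labels and a symmetric distance matrix d.

    Returns nested tuples (left, right, height), where height is half the
    distance between the two clusters at the moment they were merged.
    """
    n = len(labels)
    # cluster id -> (subtree, number of leaves)
    clusters = {i: (labels[i], 1) for i in range(n)}
    # unordered pair of cluster ids -> average distance between the clusters
    dist = {frozenset(p): d[p[0]][p[1]] for p in combinations(range(n), 2)}
    next_id = n
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)  # closest pair of clusters
        a, b = sorted(pair)
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        height = dist.pop(pair) / 2
        for c in clusters:
            d_ac = dist.pop(frozenset((a, c)))
            d_bc = dist.pop(frozenset((b, c)))
            # size-weighted average keeps this the mean over all leaf pairs
            dist[frozenset((next_id, c))] = (
                size_a * d_ac + size_b * d_bc
            ) / (size_a + size_b)
        clusters[next_id] = ((tree_a, tree_b, height), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


matrix = [
    [0, 2, 8, 8],
    [2, 0, 8, 8],
    [8, 8, 0, 4],
    [8, 8, 4, 0],
]
print(upgma(["A", "B", "C", "D"], matrix))
```

This sketch rescans all cluster pairs on every merge, so it favors clarity over speed.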
Defining mathematical properties which real trees obey enables us to define optimization criteria which make it plausible to define the best tree.
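An ultrametric tree is a rooted tree where each internal node carries a numeric label, each leaf is a distinct species, and the labels strictly decrease along every path from the root to a leaf.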
Note that if the labels mean ``time units ago'', this is true of any evolutionary tree.
Each pairwise distance d(i,j) represents the label of the least common ancestor of i and j.
There is an efficient $O(n^2)$ algorithm to reconstruct an ultrametric tree, if one exists.
Observe that the labels on the path from a leaf a to the root follow from sorting the distances in row a, since a branching point corresponds to each unique distance.
Shared distances on the path partition the other leaves into groups in the other subtrees.
Thus each of the resulting partitions is fixed, and can be refined by considering other rows.
Unless we get a contradiction, we get an ultrametric tree. Further, this tree must be unique.
The problem becomes hard when you seek the ``most'' ultrametric tree in noisy data, and note that a little noise can throw the topology off considerably.
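Whether a distance matrix admits an ultrametric tree at all can be tested directly. The Python sketch below is not the $O(n^2)$ reconstruction algorithm described above. It uses the standard three-point characterization instead: a symmetric matrix is ultrametric exactly when, for every triple of leaves, the two largest of the three pairwise distances are equal. The function name `is_ultrametric` and the tolerance parameter `tol` are assumptions for this example.

```python
from itertools import combinations


def is_ultrametric(d, tol=1e-9):
    """Three-point test on a symmetric distance matrix d (list of lists).

    For every triple i, j, k the two largest of d[i][j], d[i][k], d[j][k]
    must agree (up to tol). Runs in O(n^3) time.
    """
    for i, j, k in combinations(range(len(d)), 3):
        _, middle, largest = sorted((d[i][j], d[i][k], d[j][k]))
        if largest - middle > tol:
            return False
    return True


clean = [
    [0, 2, 8, 8],
    [2, 0, 8, 8],
    [8, 8, 0, 4],
    [8, 8, 4, 0],
]
noisy = [row[:] for row in clean]
noisy[0][2] = noisy[2][0] = 7  # a small perturbation breaks the property

print(is_ultrametric(clean), is_ultrametric(noisy))
```

The perturbed matrix shows how easily noise destroys exact ultrametricity, which is why the noisy version of the problem is the hard one.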
A weaker but well-defined condition assumes that we have the distance between all pairs of leaves, and seeks an unrooted, edge-weighted tree (an additive tree) such that the edge weights along the path between each pair of leaves sum to the defined distance.
An algorithm for additive tree reconstruction follows from the following reduction to ultrametric trees:
In an ultrametric tree, all leaves are at equal distance from the root, namely the maximum distance in the matrix.
If D is an additive matrix, then D' is an ultrametric matrix,
where
This construction follows from:
Suppose we have n species, and m features each of which only evolved once (parsimony).
Each feature can be represented by an n-element 0/1-vector where $f_i(j) = 1$ iff species j has feature i.
The perfect phylogeny problem asks for a phylogenic tree given an $n \times m$ binary feature matrix, if one exists.
Note that this is an edge-labeled tree, and certain features do not necessarily contribute to a split (e.g. 3), particularly when m>n.
Suppose we know that all features evolve from 0 to 1, i.e. the characters are ordered.
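Claim: Matrix M has a perfect phylogenic tree iff for every pair of columns i, j the sets of 1s are either (1) disjoint, or (2) nested, i.e. one contains the other.
If column i contains all the 1s of column j, then feature i evolved before feature j, since every species with feature j descends from the ancestor that first acquired feature i.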
If two columns are disjoint, then the features evolved independently.
If species x has feature i but not j, species y has feature j but not i, and some species z has both, then no perfect phylogeny can exist.
This immediately gives an $O(m^2 n)$ algorithm to test the matrix for this property, which can be improved to $O(nm)$ using radix sorting.
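The pairwise column test is short enough to write out. The following Python sketch implements the straightforward $O(m^2 n)$ version (not the radix-sort refinement), assuming the matrix is given as a list of rows with one row per species and ordered 0-to-1 characters. The function name `has_perfect_phylogeny` is an assumption for this example.

```python
from itertools import combinations


def has_perfect_phylogeny(matrix):
    """Test whether every pair of feature columns is disjoint or nested.

    matrix[s][f] == 1 iff species s has feature f.
    """
    if not matrix:
        return True
    m = len(matrix[0])
    # for each feature, the set of species that possess it
    columns = [
        {s for s, row in enumerate(matrix) if row[f]} for f in range(m)
    ]
    for a, b in combinations(columns, 2):
        overlapping = bool(a & b)
        nested = a <= b or b <= a
        if overlapping and not nested:
            return False  # some x has only i, some y only j, some z both
    return True


compatible = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
conflicting = [
    [1, 1],
    [1, 0],
    [0, 1],
]
print(has_perfect_phylogeny(compatible), has_perfect_phylogeny(conflicting))
```

Representing each column as a set of species makes the disjoint-or-nested condition a pair of set operations.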
The perfect phylogeny problem can be made more general by allowing non-binary features, e.g. the locomotion feature might be `fixed', `crawling', or `walking'.
In general, each feature is one of r states. In ordered phylogeny problems, we know the directed sequence of transitions for each character as we move down the tree.
There are polynomial ordered perfect phylogeny algorithms for any constant r, i.e. r is in the exponent of the running time.
For unordered problems it is NP-complete for general r, but polynomial for r=2.
Another approach to reconstructing phylogenic trees is to carefully analyze small subsets of species to reconstruct their relative history, and then integrate a set of resulting trees into a consistent whole.
The smallest interesting unrooted trees contain four species, and are called quartets.
There are three possible quartets on any set of four species.
The set of $\binom{n}{4}$ quartets induced from any tree uniquely defines the topology of the tree.
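To make the counting concrete, here is a small Python sketch that lists the three quartet topologies on four species and counts how many four-species subsets a tree on n leaves induces. The function names and the pair-of-pairs encoding of a quartet split are assumptions for this example.

```python
from math import comb


def quartet_topologies(a, b, c, d):
    """The three ways an unrooted binary tree can split four species.

    Each topology is written as ((x, y), (z, w)): x and y sit on one side
    of the internal edge, z and w on the other.
    """
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]


def induced_quartet_count(n):
    """Number of four-species subsets, one induced quartet for each."""
    return comb(n, 4)


for split in quartet_topologies("human", "chimp", "mouse", "rat"):
    print(split)
print(induced_quartet_count(10))
```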
Note that rooted triplets can be modeled as quartets if one species is ancestral.
In general, it is difficult to analyze the data to construct all possible quartets in a consistent manner. The problem of constructing the tree maximizing the number of satisfied quartets is NP-complete.
Because the various algorithms and heuristics give different trees on the same data, more evidence is needed than a single tree to define the history.
For this reason, there are suites of programs (e.g. Phylip) which contain implementations of many different tree construction algorithms and heuristics.
Thus there is a need for algorithms which find the consensus of a set of trees, i.e. the branch points that all (or most) of the trees share in common.
Point mutations are only the simplest type of genetic modification event.
Duplication events happen when a second copy of a given gene is inserted into the chromosome. This allows for one copy of the gene to evolve a new function without preventing the production of the protein.
Reversal events happen when a portion of a chromosome is deleted and replaced with the reversed sequence. Such events occur during crossover operations, and can create entirely new genes as well as copies.
Translocation events happen when genes are moved to different locations or chromosomes. Translocated genes may be regulated in a different way in the new location. Genes can even jump across bacterial species (lateral transfer).
Reconstructing evolutionary history requires accounting for these large-scale events.
Each sperm/egg cell contains 23 chromosomes, representing the parent's genetic contribution to their offspring.
Other human cells contain 23 pairs of chromosomes, one of each pair being inherited from each parent.
Gametes form a single chromosome from each pair through recombination, where a crossover operation randomly alternates between the two chromosomes to select which gene copy to pass on.
Such sexual reproduction provides explanations for many things, including why introns may be good, and sex-linked and dominant/recessive diseases.
Errors in this process plausibly account for many genome rearrangement events.
Because there are second copies of most genes, cells can be surprisingly robust in the face of large scale changes.
Cancer cells often lose parts of chromosomes, and certain interesting but non-fatal diseases occur when extra copies of chromosomes occur.
As we have seen, large sets of homologous genes exist between pairs of organisms.
Crossover mutations cause reversals of a sequence of contiguous genes.
Biologists seek to reconstruct the evolutionary history between two species by finding the shortest sequence of reversals (crossover operations) necessary to bring all homologous pairs of genes into alignment.
Nadeau and Taylor estimate that
crossovers occurred between
mouse and man.
History reconstruction motivates the problem of sorting with reversals, since
Any reversal operation changes the orientation of all reversed genes. Thus a sorting problem can be signed or unsigned depending upon whether we know/care or do not know/care about the orientation of each gene.
There are two distinct theoretical problems in sorting with reversals:
The diameter question asks which two length-n permutations are farthest apart over all pairs, i.e. how many reversals always suffice to sort.
The distance question asks, for a given pair $p_1$, $p_2$ of length-n permutations, what is the fewest number of reversals needed to bring $p_1$ to $p_2$, i.e. how many reversals are necessary.
Suppose we seek to sort a stack of pancakes by size using a spatula. How many prefix reversals do we need to sort n pancakes?
By flipping the largest unsorted pancake to the top and then flipping it into position (i.e. selection sort), any permutation can be sorted in at most 2n reversals.
Any strategy must use at least n-1 reversals in the worst case, since each prefix reversal removes at most one breakpoint, e.g. (1,3,5,7,2,4,6,8).
Thus the diameter d satisfies $n-1 \le d \le 2n$, but tighter bounds are well known.
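The selection-sort strategy behind the 2n upper bound can be sketched in a few lines of Python. This is a minimal illustration, assuming index 0 is the top of the stack and that pancake sizes are distinct; the function name `pancake_sort` and the convention of recording each flip as the number of pancakes lifted are assumptions for this example.

```python
def pancake_sort(stack):
    """Sort with prefix reversals, smallest on top, using at most 2n flips.

    Returns the sorted stack and the list of flip sizes used.
    """
    stack = list(stack)
    flips = []

    def flip(k):
        # reverse the top k pancakes
        stack[:k] = stack[:k][::-1]
        flips.append(k)

    for size in range(len(stack), 1, -1):
        # position of the largest pancake among the unsorted top `size`
        biggest = max(range(size), key=stack.__getitem__)
        if biggest == size - 1:
            continue  # already in its final position
        if biggest != 0:
            flip(biggest + 1)  # bring it to the top
        flip(size)  # then flip it down into position

    return stack, flips


result, flips = pancake_sort([1, 3, 5, 7, 2, 4, 6, 8])
print(result, flips, len(flips))
```

Each pancake costs at most two flips, which is where the 2n bound comes from.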
Any permutation can be partitioned into strips of consecutive increasing or decreasing elements.
The gaps between strips are breakpoints, and the number of breakpoints is a lower bound on the number of reversals needed to sort.
Suppose we modify the selection sort strategy to move the strip containing the largest unpositioned element into position.
One reversal brings the strip to the top, another brings the biggest element in the strip to the top (if necessary), and one last flip puts it into position.
Thus this strategy uses at most three times the number of required flips, and so approximates the actual pancake distance for every pair of permutations to within a factor of 3.
Suppose instead we just want to merge the strip on top with one of its two neighbors.
If the orientation is correct, one flip suffices; if not two flips suffice.
Since the number of breakpoints is equivalent to the number of strips, this gives a factor 2 approximation.
More careful analysis can lower the constant somewhat, but no polynomial algorithm is known to compute the minimum pancake distance.
Lower bounds on the length of the optimal solution are helpful in reasonably efficient branch and bound exhaustive search algorithms.
The motivating biological problem demands that we find the shortest sequence of general (not prefix) reversals.
Note that there are $\binom{n}{2}$ general reversals but only n prefix reversals, so we have much more freedom.
Recently it was shown that computing the reversal distance of unsigned permutations is NP-complete.
However, there is a polynomial algorithm for computing reversal distance for signed permutations.
The signed case is biologically important since the orientation of genes can be determined from the DNA sequence.
The exact algorithm for signed permutations requires careful, technical combinatorial arguments.
Since the unsigned problem is NP-complete, we seek an approximation algorithm.
A breakpoint occurs wherever neighboring elements are not consecutive; framing the permutation with 0 on the left and n+1 on the right also counts out-of-position endpoints.
For example, (3,2,4,5,1) has breakpoints (0,3), (2,4), (5,1), and (1,6).
Note that half the number of breakpoints is a lower bound on reversal distance, since any reversal can eliminate at most two breakpoints.
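Counting breakpoints is mechanical once the permutation is framed by 0 and n+1. The Python sketch below reproduces the example in the text; the function names `breakpoints` and `reversal_lower_bound` are assumptions for this example.

```python
from math import ceil


def breakpoints(perm):
    """Adjacent pairs of the framed permutation that are not consecutive."""
    framed = [0] + list(perm) + [len(perm) + 1]
    return [(a, b) for a, b in zip(framed, framed[1:]) if abs(a - b) != 1]


def reversal_lower_bound(perm):
    """A reversal removes at most two breakpoints, so at least b/2 are needed."""
    return ceil(len(breakpoints(perm)) / 2)


print(breakpoints([3, 2, 4, 5, 1]))
print(reversal_lower_bound([3, 2, 4, 5, 1]))
```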
A strip of consecutive numbers between two breakpoints is increasing if the numbers strictly increase. Otherwise it is decreasing. A singleton strip will be defined as decreasing.
Claim 1: Any non-identity permutation without decreasing strips has a reversal which does not increase breakpoints but leaves a non-trivial decreasing strip.
Proof: Flip the end strip, e.g. 45123 goes to 45321.
Claim 2: If a permutation has a decreasing strip, then there exists a reversal which decreases the number of breakpoints.
Proof: Match up the smallest endpoint of a decreasing strip.
E.g: 543...12.. goes to 54321..... and 12...543.. goes to 12345......
The following examples do not work, because the decreasing strip shown does not contain the smallest such endpoint: 543...21.. and 21...543..
If there is a decreasing strip, use one reversal and remove a breakpoint.
If not, use one reversal to create a decreasing strip.
Since one breakpoint is removed every other iteration (at least), twice the number of breakpoints is an upper bound on the reversal distance.
Since this is 4 times the lower bound we have a factor 4 approximation.
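The two-case strategy can be sketched in Python as follows. To keep the code short, this version makes a simplifying assumption: instead of locating the smallest endpoint of a decreasing strip as in the proof of Claim 2, it searches all reversals for one that lowers the breakpoint count, which Claim 2 guarantees exists whenever there is a decreasing strip. When no such reversal exists, it flips the first strip that is out of place, as in Claim 1. All function names are assumptions for this example, and the search makes each step cubic in n.

```python
def breakpoint_count(perm):
    """Number of breakpoints of perm framed by 0 and n+1."""
    framed = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if abs(a - b) != 1)


def reverse(perm, i, j):
    """Return perm with positions i..j (inclusive, 0-based) reversed."""
    return perm[:i] + perm[i:j + 1][::-1] + perm[j + 1:]


def find_reducing_reversal(perm):
    """Some reversal (i, j) that lowers the breakpoint count, or None."""
    current = breakpoint_count(perm)
    for i in range(len(perm)):
        for j in range(i + 1, len(perm)):
            if breakpoint_count(reverse(perm, i, j)) < current:
                return i, j
    return None


def first_misplaced_strip(perm):
    """Bounds (0-based, inclusive) of the strip after the first breakpoint."""
    n = len(perm)
    framed = [0] + perm + [n + 1]
    start = next(
        k for k in range(1, n + 1) if abs(framed[k] - framed[k - 1]) != 1
    )
    end = start
    while end < n and abs(framed[end + 1] - framed[end]) == 1:
        end += 1
    return start - 1, end - 1


def approx_reversal_sort(perm):
    """Sort by reversals; returns the list of reversals (i, j) applied."""
    perm = list(perm)
    reversals = []
    while breakpoint_count(perm) > 0:
        # Case 1: remove a breakpoint. Case 2: create a decreasing strip.
        move = find_reducing_reversal(perm) or first_misplaced_strip(perm)
        perm = reverse(perm, *move)
        reversals.append(move)
    return reversals


print(approx_reversal_sort([3, 2, 4, 5, 1]))
```

Since a breakpoint disappears at least every other step, the number of reversals returned never exceeds twice the initial breakpoint count.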
The approximation factor can be lowered to 1.5 with more careful structural analysis.