The sequencing the human genome was a tremendous scientific accomplishment, requiring large-scale collaboration between computational and life sciences.
However, the intellectual breakthrough that lead to successful sequencing came from computer scientists, not life scientists.
We will study the basic technology underlying all sequencing projects, and compare the somewhat different experimental strategies employed by the two groups.
Amazing progress in the scale of sequencing projects has been achieved, largely through automation:
Phage Phi-x174: 5kb, 1977.
Bacteriophage lambda: 50kb, 1982.
Haemophilus influenzae: 1.8Mb, July 1995.
Drosophila: 180Mb, March 2000.
Human: 3 billion bases, February 2001.
Sequencing machines today use the same basic principles as the original Gilbert-Sanger method. There has been tremendous progress in automating the procedure, however.
Read lengths have gotten only slightly longer with time, perhaps from 500 bp to 700 bp.
The sample to be sequenced is replicated in four distinct bins, using a
distinct fluorescently labeled PCR primer.
The bin associated with a given base
is given a mixture of functional and
non-functional versions of
, where non-functional
bases terminate transcription.
This creates labeled fragments of all sizes ending in
.
Using gel electrophoresis, the fragments are separated by length. The presence or absence of a labeled band in each lane denotes whether the sequence has the given base in each position.
Modern capillary machines use smaller amounts of reagents and avoid problems with wandering lanes.
In the good regions of a read, the base error rate should be below 2%.
In traditional shotgun sequencing, whole genomes are sequenced by making clones, breaking them into small pieces, and trying to put the pieces together again based on overlaps.
Note that the fragments are randomly sampled, and thus no positional information is available.
Since we rely on fragment overlaps to identify their position, we must sample sufficient fragments to ensure enough overlaps.
Let
be the length of the target molecule being sequenced using
random fragments of length
, where we recognize all overlaps
of length
or greater.
The Lander-Waterman equation gives the expected number of gaps
as
Where does the
come from?
Suppose we have as many fragments as bases, i.e.
and each fragment is length 1.
The probability
that base
is not sampled is
The coverage of a sequencing project is the
ratio of the total sequenced fragment length to the genome length, i.e.
.
Gaps are very difficult and expensive to close in any sequencing strategy, meaning that very high coverage is necessary to use shotgun sequencing on a large genome.
The effectiveness of a genome sequencing strategy depends upon the degree of coverage, the length of the inserts, and the auxiliary mapping information available to help assembly.
The DNA fragments or clones are replicated by inserting them into a living organism, the cloning vector.
Small fragments (40,000 bp) can be cut and pasted into a bacterial cosmid. Bigger fragments (up to 2,000,000 bp) can be replicated as a bacterial or yeast artificial chromosome, a BAC or YAC.
After sequencing both ends of a given insert, we know roughly how far apart they should be in the final assembly.
Selecting the right mix of insert sizes can simplify assembly. Small inserts give tight assembly constraints, but big inserts help us build a scaffolding across the entire genome.
The internals of clones can be sequenced, but it is much more expensive than end sequencing. Thus it is done only in the closing gaps.
The high coverage necessary to sequence large genomes without gaps frightened most laboratories away from pure shotgun sequencing strategies.
A different approach is to construct a map showing where each clone lies on the human genome, and use this map to guide end sequencing and assembly.
Mapping data can be based on (1) using hybridization to detect the presence or absence of a given short sequence ( STS) in a given clone, or (2) using restriction enzymes to cut each clone at a given pattern, and looking for similar fragment lengths.
With a good enough map, the required coverage might go down to 2 or 3.
Note that the correct ordering is a Hamiltonian path on the clones. Reconstructing clone order from mapping data tends to be an NP-complete.
However, the difficulty is due to errors and ambiguity in the mapping data since the problem of recognizing interval graphs can be done in linear time!
The public consortium used a sequencing strategy based on mapping the clones first.
Celera used hundreds of high-throughput sequencing machines to obtain enough coverage to shotgun sequence the human genome.
The most natural notion of assembly is to order the fragments so as to form the shortest string containing all of them.
However, the problem of finding the shortest common superstring of a set of strings is NP-complete.
Even worse, we have to deal with significant errors in the sequence fragments.
Even worse, genomes tend to have many repeats (approximate copies of the same sequence), which are very hard to identify and reconstruct.
Due to repeats, the shortest common superstring is typically shorter than the real sequence.
Even worse, the size of the problem is very large. Celera's Human Genome sequencing project contained roughly 26.4 million fragments, each about 550 bases long.
To decide what overlaps what, we could
compare each fragment against each
other fragment via
dynamic programming, but faster methods are needed.
Celera's assembly involved 500 million trillion base to base comparisons, requiring over 20,000 CPU (central processor unit) hours on their supercomputer.
Thus efficient overlap detection is critical, more critical than the NP-complete part of the problem!
Overlap detection must be tolerant of sequencing error, but even an error
rate of 2% means one should be able to find fairly long (
25 bp)
exact matches in a long overlap.
There are several interesting data structures for speeding up exact pattern matches in strings..
A trie is a data structure which permits access to any string
in
an
word dictionary in
time for constant-sized alphabets.
This is optimal and independent of the dictionary size!
Note that binary search of an
word dictionary would take
time.
A trie has a node for each character position, with prefixes shared:
Searching in a trie is easy: just match the character and traverse down the correct path.
Building the trie is also easy: insert a new string by matching until you fail, then split the last node.
A very special set of patterns to put in a trie are the complete set of all suffixes of a string.
With such a tree, we can perform substring searches efficiently, since every substring is the prefix of some suffix.
Further, the set of all instances of a given substring
are the leaves of the
subtree rooted at
, and can be found by DFS.
A suffix tree can be stored in linear space, by collapsing degree-1 nodes into paths, and paths into references to the original string.
The incremental insertion algorithm to build a suffix tree might take
time to build the tree, because finding the split-point for each insertion
might require
time in matching.
However, there are more sophisticated algorithms (Weiner's, Ukkonen's, McCreight's) which can build the entire tree in linear time.
Given two strings
and
, what is the longest contiguous
substrings they have in common.
Example: livestock and sea liver
The naive
algorithm fully tests each alignment of
against
.
In 1970, Knuth conjectured that a linear-time algorithm was impossible. Can you prove him wrong?
Build a suffix tree of the length
concatenated string
, where
does not occur in either string.
Label each leaf node of the suffix tree with the name of the string it is contained in. Label each internal node with the union of the labels of its descendents.
By doing a DFS on the
node tree, we can find the deepest node which has
both an
and
descendant.
This defines the longest common substring!
A palindrome is a string which reads the same forwards and backwards, e.g. A man, a plan, a canal - Panama or Madam I'm Adam.
In DNA sequence analysis, a palindrome is a sequence equal to its reverse complement, e.g. GAATTC.
Such palindromes bind/fold to create secondary structures in sequences, which often have biological significance.
How can we find the longest self-binding substring in a DNA sequence?
Use the longest common substring algorithm with
as the input sequence
and
as its reverse complement sequence!
This does not guarantee that the lcs of these strings starts/ends in the same place, so does not necessarily find a palindrome.
However, after we augment the suffix tree so as to answer lowest common ancestor queries in constant time...
...we can find the longest sub-palindrome in linear time by asking
the `length' of the LCA of
and
for each
.
Certain genomic structures ( plasmids) have closed circular DNA molecules rather than linear molecules.
To look such a molecule up in a database, we much find a canonical place to break it to leave a linear string.
The most obvious place to break it uniquely is so as to always leave the lexicographically smallest string, i.e. the string which appears first in sorted order.
Building and sorting all
such strings takes
time.
Can you do better?
Break the string arbitrarily to create a linear string
.
Now build the suffix tree for string
.
This is linear in the size of the input.
Example:
so
.
Do a traversal down from the root, always picking the lexicographically
smallest character.
Assume that
is at the end of the alphabet.
The suffix array is an amazing data structure for efficiently searching
whether
is a substring of string
.
For a given string
, we construct the lexicographically
sorted array of all its suffixes.
For
, the suffix array is:
11 : i
8 : ippi
5 : issippi
2 : ississippi
1 : mississippi
10 : pi
9 : ppi
7 : sippi
4 : sissippi
6 : ssippi
3 : ssissippi
Since every substring is the prefix of some suffix,
Substring search now reduces to binary search in this array.
Example: is ``sip'' a substring of
?
Once you have the suffix array, the search time is
,
where
is the length of
and
the length of the matched substring.
Note that we can just as easily find all the occurrences of a given
string
in
by binary searching just before/after
.
The really amazing thing is that one need only store the original
string and the sorted start positions to do the search!
The
th character of the
th prefix is at
.
But how fast can be built the suffix array of an
character string?
Radix sorting
strings of
characters can be done in
,
linear in the size of the input.
But what is really amazing is that suffix arrays can be built in both linear time and space!
First, build a suffix tree
in
time.
Performing a lexicographic depth first search of a suffix tree yields a suffix array.
Suffix arrays use many times less space than suffix trees
(say
vs.
bytes), which is often
the dominating factor in large text search problems.
Through clever use of suffix arrays, the entire overlap graph can be built in near-linear time.
After building the array of all suffixes of all fragments, potential overlaps will share a prefix of a suffix, and hence be near each other in sorted order.
Accepting a fragment pair as overlapping may require several significant long matches.
Since there are
possible DNA sequences of length
, and
places for such
-mers to start if
, matches start to
get significant if
.
For human, longer than 16-mers start to get interesting, so we can expect to find significant exact matches.
Several engineering issues arise in building any assembler:
Calling sequence bases from sequence trace data is not a trivial problem for several reasons, including (1) varying density of the gel along a lane, (2) varying density of the gel across lanes, (3) separation between bands shrinks as we move along the lane.
Phred, by Phil Green, is the most famous base-calling program for automated sequencer traces. It assigns an error probability to each base called and yielded 40-50% lower errors than the software included with Applied Biosystems (ABI) sequencing machines.
Phred's base-calling pipeline consists of several phases:
The quality score
is based on an estimate of the probability of error
, where
.
These were designed to assembly small (contig-size) sequencing projects, say 40 kb to 100 kb or so. Thus quadratic time/space algorithms potentially suffice.
CAP, by Xiaoqiu Huang, is a representative contig assembly program. It does a dynamic programming alignment between pairs of fragments, taking into account that errors increase at the beginning and the end of the read.
Phrap was designed to work with Phred, and exploits base quality scores if they exist. Important processing steps include:
Assemblers for bacterial-sized genomes and beyond (such as those used by TIGR and Celera) must use more sophisticated data structures to avoid comparing every pair of fragments.
Typically, an initial assembly of `unitigs' of high-quality, long overlaps is constructed.
The expected number of fragments at every position is a function of coverage. Stacks of fragments which are too high likely indicate improperly combined repeat regions.
These unitigs are assembled into a `scaffold' of pieces. Most reads are in `mate pairs', since the left and right ends of clones of sizes up to 10 kb or so have both been sequenced. The scaffold is assembled using this pairing and distance information.
Other less high-quality reads and unitigs can now be positioned on this scaffold. Remaining gaps can be closed by sequencing the clones spanning different contigs of the scaffold.
Celera used the public consortium's Human sequence data from Genbank. To avoid being fooled by its assembly, they computationally shredded this sequence into artificial reads and incorporated this as raw data into its own project.
Certain sequencing projects do not attempt to sequence full genomic DNA. To identify the genes, they translate back the RNA transcripts of genes into cDNA and sequence this.
Although the genes are a very important part of what we are looking for, a drawback such expressed sequence tags (ESTs) is that more highly expressed genes are significantly over-represented, making it hard to find rare genes.
Sequencing a mixed population (i.e. fragments from multiple individuals of a given species) should not compromise assembly because of the overwhelming in-species similarity, while differences in bases helps study variation (single nucleotide polymorphisms or SNPs)