Sequencing machines today use the same basic principles as the original Gilbert-Sanger method. There has been tremendous progress in automating the procedure, however.
Read lengths have gotten only slightly longer with time, perhaps from 500 bp to 600 bp.
The sample to be sequenced is replicated. One end of each molecule is radioactively or fluorescently labeled. The sample is divided into four parts, each of which is exposed to an enzyme which cuts at a particular type of base.
Using gel electrophoresis, the fragments are separated by length. The presence or absence of a labeled band in each lane denotes whether the sequence has the given base in each position.
Modern capillary machines use smaller amounts of reagents and avoid problems with wandering lanes.
In the good regions of a read, the base error rate should be below 2%.
In traditional shotgun sequencing, whole genomes are sequenced by making clones, breaking them into small pieces, and trying to put the pieces together again based on overlaps.
Note that the fragments are randomly sampled, and thus no positional information is available.
Since we rely on fragment overlaps to identify their position, we must sample sufficient fragments to ensure enough overlaps.
Let T be the length of the target molecule being sequenced using n random fragments of length l, where we recognize all overlaps of length t or greater.
Then the expected number of gaps g is $g = n e^{-n(l-t)/T}$.
The coverage of a sequencing project is the ratio of the total sequenced fragment length to the genome length, i.e. n l / T.
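To get a feel for how coverage drives the number of gaps, here is a minimal sketch. It assumes the Lander-Waterman style estimate g = n e^{-n(l-t)/T} for the expected number of gaps; the function names and the numbers in the example loop are illustrative assumptions, not values from any real project.

```python
import math

def coverage(n, l, T):
    """Coverage c = n*l/T: total sequenced bases divided by target length."""
    return n * l / T

def expected_gaps(n, l, T, t):
    """Expected gaps under the assumed estimate n * exp(-n*(l - t)/T)."""
    return n * math.exp(-n * (l - t) / T)

if __name__ == "__main__":
    # Hypothetical target of 1 Mbp, 500 bp reads, 25 bp minimum overlap.
    T, l, t = 1_000_000, 500, 25
    for c in (1, 2, 5, 10):
        n = c * T // l  # number of fragments needed for coverage c
        print(f"coverage {c:2d}: n = {n:6d}, expected gaps = {expected_gaps(n, l, T, t):.1f}")
```

Under this estimate the expected number of gaps falls off exponentially with coverage, which is why a few extra multiples of coverage matter so much.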
The effectiveness of a genome sequencing strategy depends upon the degree of coverage, the length of the inserts, and the auxiliary mapping information available to help assembly.
The DNA fragments or clones are replicated by inserting them into a living organism, the cloning vector.
Small fragments (40,000 bp) can be cut and pasted into a bacterial cosmid. Bigger fragments (up to 2,000,000 bp) can be replicated as a bacterial or yeast artificial chromosome, a BAC or YAC.
After sequencing both ends of a given insert, we know roughly how far apart they should be in the final assembly.
Selecting the right mix of insert sizes can simplify assembly. Small inserts give tight assembly constraints, but big inserts help us build a scaffolding across the entire genome.
The internals of clones can be sequenced, but this is much more expensive than end sequencing. Thus it is done only when closing gaps.
The high coverage necessary to sequence large genomes without gaps frightened most laboratories away from pure shotgun sequencing strategies.
A different approach is to construct a map showing where each clone lies on the human genome, and use this map to guide end sequencing and assembly.
Mapping data can be based on (1) using hybridization to detect the presence or absence of a given short sequence (STS) in a given clone, or (2) using restriction enzymes to cut each clone at a given pattern, and looking for similar fragment lengths.
With a good enough map, the required coverage might go down to 2 or 3.
Reconstructing clone order from mapping data tends to be an NP-complete problem.
The public consortium used a sequencing strategy based on mapping the clones first.
Celera used hundreds of high-throughput sequencing machines to obtain enough coverage to shotgun sequence the human genome.
The problem of finding the shortest common superstring of a set of strings is NP-complete.
Even worse, we have to deal with significant errors in the sequence fragments.
Even worse, genomes tend to have many repeats (approximate copies of the same sequence), which are very hard to identify and reconstruct.
Due to repeats, the shortest common superstring is typically shorter than the real sequence.
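Because the exact problem is NP-complete, assemblers fall back on heuristics. The sketch below is a simple greedy merging heuristic (repeatedly merge the pair of fragments with the largest exact overlap). It is not an exact solver, it ignores sequencing errors, and the function names are illustrative assumptions.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(fragments):
    """Greedy heuristic for a short common superstring (not guaranteed optimal)."""
    # Drop duplicates and fragments contained in another fragment.
    frags = [f for f in sorted(set(fragments))
             if not any(f != g and f in g for g in fragments)]
    while len(frags) > 1:
        best = (-1, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = frags[i] + frags[j][k:]  # glue b onto a, sharing k characters
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)] + [merged]
    return frags[0] if frags else ""
```

The repeat problem shows up even in a toy case: reads "AAAA" sampled from a run of ten A's collapse to the single superstring "AAAA", which is shorter than the sequence they came from.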
Even worse, the size of the problem is very large. Celera's Human Genome sequencing project contained roughly 26.4 million fragments, each about 550 bases long.
Celera's assembly involved 500 million trillion base-to-base comparisons, requiring over 20,000 CPU (central processing unit) hours on their supercomputer.
Thus efficient overlap detection is critical, more critical than the NP-complete part of the problem!
Overlap detection must be tolerant of sequencing error, but even an error rate of 2% means one should be able to find fairly long (about 25 bp) exact matches in a long overlap.
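A rough back-of-the-envelope check of this claim can be sketched as follows. It assumes errors are independent and occur at a uniform per-base rate, which real reads only approximate; the function names are illustrative.

```python
def p_error_free(k, error_rate):
    """Probability that k consecutive bases of one read contain no error,
    assuming independent errors at a uniform per-base rate."""
    return (1.0 - error_rate) ** k

def expected_clean_windows(overlap_len, k, error_rate):
    """Expected number of length-k windows in an overlap that are error-free
    in both reads, so that the two reads match exactly there."""
    windows = max(overlap_len - k + 1, 0)
    return windows * p_error_free(k, error_rate) ** 2

if __name__ == "__main__":
    print(p_error_free(25, 0.02))                 # one read, one 25 bp window
    print(expected_clean_windows(100, 25, 0.02))  # 100 bp overlap, both reads
```

Under these assumptions a single 25 bp window is error-free in one read about 60% of the time, so a 100 bp overlap is expected to contain many exactly matching 25-mers.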
The suffix array is an amazing data structure for efficiently searching whether S is a substring of string T.
For a given string T, we construct the lexicographically sorted array of all its suffixes.
For T = mississippi, the suffix array is:
11 : i
8 : ippi
5 : issippi
2 : ississippi
1 : mississippi
10 : pi
9 : ppi
7 : sippi
4 : sissippi
6 : ssippi
3 : ssissippi
Since every substring is the prefix of some suffix, substring search now reduces to binary search in this array. Example: is "sip" a substring of T?
Once you have the suffix array, the search time by plain binary search is $O(m \log n)$, where n is the length of T and m the length of the matched substring.
Note that we can just as easily find all the occurrences of a given string S in T by binary searching just before/after S.
The really amazing thing is that one need only store the original string and the sorted start positions to do the search! The jth character of the ith suffix in sorted order is at T[ start[i]+j-1 ].
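The search described above can be sketched in a few lines. This version builds the array naively by sorting suffixes (far from the fastest construction) and keeps only the string and the sorted start positions. Positions are 0-based here, whereas the mississippi listing above is 1-based; the function names are illustrative.

```python
def build_suffix_array(text):
    """Sorted 0-based start positions of all suffixes (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, start, pattern):
    """All 0-based positions where pattern occurs, via two binary searches."""
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(start)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[start[mid]:start[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(start)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[start[mid]:start[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    # Every suffix between the two bounds begins with pattern.
    return sorted(start[first:lo])

if __name__ == "__main__":
    T = "mississippi"
    sa = build_suffix_array(T)
    print([i + 1 for i in sa])             # 1-based, to compare with the listing
    print(find_occurrences(T, sa, "sip"))
```

Each comparison inspects at most m characters and each binary search takes O(log n) steps, which is where the O(m log n) bound for plain binary search comes from.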
But how fast can the suffix array of an n-character string be built?
Radix sorting n strings of n characters can be done in $O(n^2)$, linear in the size of the input.
But what is really amazing is that suffix arrays can be built in both linear time and space!
This requires the use of an even more theoretically interesting data structure, the suffix tree, which can be built in O(n) time and supports substring matching in O(m) time.
Doing a lexicographic depth first search of a suffix tree yields a suffix array.
Suffix arrays use many times less space than suffix trees (say 3n vs. 17n bytes), which is often the dominating factor in large text search problems.
Through clever use of suffix arrays, the entire overlap graph can be built in near-linear time.
After building the array of all suffixes of all fragments and their reverses, potential overlaps will share a prefix of a suffix, and hence be near each other in sorted order.
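The idea can be illustrated with a deliberately naive sketch. It stores explicit suffix strings rather than start positions (so it uses quadratic space), handles only the forward strand, and finds only exact suffix-prefix overlaps; a real assembler would need to tolerate errors and include reversed fragments. The function name and the minimum overlap parameter t are illustrative.

```python
from bisect import bisect_left

def exact_overlaps(fragments, t):
    """Return {(i, j): k} where the last k >= t characters of fragment i
    equal the first k characters of fragment j (longest such k)."""
    # All suffixes of length >= t, tagged with their fragment, in sorted order.
    suffixes = sorted(
        (frag[p:], i)
        for i, frag in enumerate(fragments)
        for p in range(len(frag) - t + 1)
    )
    overlaps = {}
    for j, frag in enumerate(fragments):
        for k in range(t, len(frag) + 1):
            prefix = frag[:k]
            # Binary search for suffixes that equal this prefix exactly;
            # equal suffixes sit next to each other in sorted order.
            idx = bisect_left(suffixes, (prefix, -1))
            while idx < len(suffixes) and suffixes[idx][0] == prefix:
                i = suffixes[idx][1]
                if i != j:
                    overlaps[(i, j)] = max(k, overlaps.get((i, j), 0))
                idx += 1
    return overlaps

if __name__ == "__main__":
    reads = ["ACGTACGA", "ACGATTTG", "TTTGCC"]
    print(exact_overlaps(reads, 3))
```

The resulting dictionary is an edge list for the overlap graph: each key (i, j) is a directed edge from fragment i to fragment j, weighted by overlap length.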
Accepting a fragment pair as overlapping may require several significant long matches.
Since there are $4^k$ possible DNA sequences of length k, and n places for such k-mers to start if |T| = n, matches start to get significant if $4^k > n$, i.e. $k > \log_4 n$.
For the human genome, matches longer than 16-mers start to get interesting, so we can expect to find significant exact matches.
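The threshold is easy to check numerically. The sketch below finds the smallest k with 4^k > n and the expected number of chance occurrences of a fixed k-mer, assuming a uniformly random sequence; the genome length of 3 × 10^9 used in the example is a round-number assumption, and the function names are illustrative.

```python
def min_significant_k(n, alphabet_size=4):
    """Smallest k such that alphabet_size**k exceeds n."""
    k = 0
    while alphabet_size ** k <= n:
        k += 1
    return k

def expected_chance_hits(n, k, alphabet_size=4):
    """Expected occurrences of one fixed k-mer among n start positions,
    assuming a uniformly random sequence."""
    return n / alphabet_size ** k

if __name__ == "__main__":
    n = 3_000_000_000  # assumed approximate human genome length
    k = min_significant_k(n)
    print(k, expected_chance_hits(n, k))
```

Each extra base divides the expected number of chance hits by four, so matches only a few bases beyond the threshold are already very unlikely to be accidental in a random sequence. Real genomes are not random, of course, which is exactly the repeat problem noted earlier.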
Several engineering issues arise in building any assembler: