Sequence Assembly
Part I: The Drosophila Genome Project
Johannes Jaeger
Outline
1. Drosophila and its Genome
2. Sequencing Strategy
3. References and Links
1. Drosophila and its Genome
Fig. 1: The fruit fly, Drosophila melanogaster (from FlyBase)
Facts about Genomes
-
Chromosomes: Genomes of higher organisms
are arranged on chromosomes.
-
Diploidy: In
animals and plants, each organism contains two copies of each chromosome,
one from the mother, one from the father.
-
Linkage Groups: Genes that lie on the same
chromosome are usually inherited together and therefore form a linkage
group.
-
Crossover: Maternal and paternal chromosomes often exchange pieces
of their DNA. This is called genetic crossover.
-
Genetic Mapping: The relative position (or locus) of a gene on a
chromosome can be estimated by calculating the frequency of crossover
events between it and other genes.
-
Physical Mapping: The absolute locus of a gene on a chromosome is
determined by, for example:
-
Cytological mapping: DNA probes are attached
to the chromosome and localized under the microscope (you can actually
see where the gene is on the chromosome!).
-
Deletion mapping: Visible chromosome
rearrangements
are aligned along the chromosomes and compared to gene loci.
-
Sequence Tagged Site (STS) Mapping: The positions of known short
sequences (STSs) in the genome are determined, so that the STSs can be
ordered along the chromosomes.
-
Sequencing: The actual sequence of the whole
genome is determined.
-
Euchromatin: The part of the chromosome that contains most of the
genes. This is the part of the genome that is actually sequenced.
-
Heterochromatin: Contains
mostly short repetitive sequences. Its function is unknown. Heterochromatin
is difficult to clone and probably impossible to sequence.
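The genetic mapping idea above can be illustrated with a little arithmetic. The following is a minimal sketch, assuming the usual convention that 1% recombinant offspring corresponds to one map unit (centimorgan), which only holds for closely linked genes. The function name and the fly counts are hypothetical and chosen for illustration only.

```python
def map_distance_cm(recombinant: int, total: int) -> float:
    """Estimate the genetic distance between two genes in centimorgans.

    Assumes 1 cM = 1% recombinant offspring, which is a reasonable
    approximation only for genes that lie close together.
    """
    if total <= 0:
        raise ValueError("total number of offspring must be positive")
    if not 0 <= recombinant <= total:
        raise ValueError("recombinant count must lie between 0 and total")
    return 100.0 * recombinant / total


# Hypothetical cross: 120 recombinant flies among 1000 offspring.
print(map_distance_cm(120, 1000))
```

Genes with a small map distance are rarely separated by crossover and therefore behave as one linkage group.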
Why Drosophila?
The Drosophila Genome
-
Length: about 180 million base pairs (humans: more than 3000 million
bp), of which 120 Mbp are euchromatin and 60 Mbp heterochromatin
-
Contains an estimated 13600 genes (this is
a very small number for an organism as complex as a fly)
-
4 chromosomes: X/Y, 2, 3 and a very small
4th chromosome (Fig. 3)
Fig. 3: The Drosophila genome (taken from 2)
2. Sequencing Strategy
Whole-Genome Shotgun (WGS) Sequencing
-
Traditional sequencing approaches, like the one used by the public
Berkeley and European Drosophila Genome Projects, use physical maps to
determine the chromosomal position of the clones to be sequenced. In
addition, the public projects have created an STS-based physical map of
chromosomes 2 and 3.
-
WGS sequencing, as used by Celera, shears the whole genome into random
fragments of defined sizes, sequences only the ends of these fragments,
and then aligns the resulting sequences using sequence assembly
software (3). The main advantage of this approach is that the whole
genome can be tackled at once using large arrays of sequencers, which
makes the process much more efficient than the traditional approaches.
-
Fragments used in the Celera Drosophila Genome Project were 2000,
10'000 and 130'000 base pairs long. At both ends of these fragments,
stretches of about 500 bp were sequenced using the Polymerase Chain
Reaction (PCR) with primers that bind the flanking vector sequences
(2,3).
-
The coverage for the 2 kbp and the 10 kbp
fragments was 10X, coverage for the 130 kbp fragments was 15X (2).
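To get a feeling for what such coverage figures mean, here is a small back-of-the-envelope sketch. It assumes that coverage is defined as the total length of all fragments divided by the length of the genome; whether the figures quoted above are meant exactly this way is an assumption. The function names are hypothetical, and the 120 Mbp genome length is the euchromatin figure given earlier.

```python
import math


def coverage(n_fragments: int, fragment_length_bp: int, genome_length_bp: int) -> float:
    """Coverage = total fragment length / genome length (assumed definition)."""
    return n_fragments * fragment_length_bp / genome_length_bp


def fragments_needed(target_coverage: float, fragment_length_bp: int,
                     genome_length_bp: int) -> int:
    """Smallest number of fragments that reaches the target coverage."""
    return math.ceil(target_coverage * genome_length_bp / fragment_length_bp)


EUCHROMATIN_BP = 120_000_000  # 120 Mbp, as stated in section 1

# How many 10 kbp fragments would give 10X coverage of the euchromatin?
n = fragments_needed(10, 10_000, EUCHROMATIN_BP)
print(n, coverage(n, 10_000, EUCHROMATIN_BP))
```

Under this definition, shorter fragments require proportionally more clones to reach the same coverage.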
The Data Sets (3)
-
Two different data sets were used for the sequence
assembly process:
-
The WGS data set includes all the sequences
obtained from sequencing the ends of the 2, 10 and 130 kbp fragments as
well as known distances (the clone lengths) between these
sequences.
-
The joint data set includes the WGS data
set and the additional sequence data that was obtained from the public
genome projects in their traditional approach.
-
The STS data from the public genome projects were not used in the
assembly process, but were used later to cross-check the draft sequence
and improve the quality of the data.
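The "known distances" in the WGS data set can be made concrete with a small data structure. This is only an illustrative sketch: the class name MatePair and its fields are hypothetical and are not taken from the Celera software. It records the two end sequences of one fragment together with the clone length, and derives how much of the fragment's interior remains unsequenced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatePair:
    """Two end sequences from the same cloned fragment (hypothetical record)."""
    left_read: str      # sequence read from one end of the fragment
    right_read: str     # sequence read from the other end
    clone_length: int   # approximate length of the whole fragment in bp

    def unsequenced_interior(self) -> int:
        """Number of base pairs between the two reads that were not sequenced."""
        sequenced = len(self.left_read) + len(self.right_read)
        return max(0, self.clone_length - sequenced)


# A 2000 bp fragment with about 500 bp read from each end
# (dummy sequences used as placeholders).
pair = MatePair(left_read="A" * 500, right_read="C" * 500, clone_length=2000)
print(pair.unsequenced_interior())
```

The clone length is what later allows the assembler to place two reads at a known distance from each other, even though the sequence between them is unknown.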
Scaffolding (3)
-
The sequences in the data sets were checked for unique sequence
overlaps. Overlapping sequences were assembled into contiguous
sequences (contigs).
-
Contigs can be arranged into scaffolds when
distances between contigs can be calculated from the data set (based on
the length of the clones).
-
Within the scaffolds, sequence gaps remain
(i.e. the unsequenced interior parts of the inserts). These gaps can often
be closed by sequencing whole inserts.
-
Between scaffolds, physical gaps remain, mostly spanning highly
repetitive sequences. The lengths of some of these gaps can be
estimated using the STS data.
-
The biggest difficulties for sequencing are encountered when DNA
sequences are repeated in the genome. Short stretches of repetitive
DNA, such as retrotransposons, are easily bridged by the longer (10 and
130 kbp) fragments. However, long stretches of repetitive DNA, such as
repeated rRNA genes or heterochromatin repeats, leave persistent gaps in
the sequence and might never be sequenced to completion.
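The two basic steps described above, joining overlapping sequences into contigs and estimating the gap between two contigs from the clone length, can be sketched as follows. This is a toy illustration under simplifying assumptions: it uses a greedy merge of exact suffix/prefix overlaps, which is not the algorithm used by the Celera assembler (3), and it ignores sequencing errors, reverse strands and repeats. All function names are hypothetical.

```python
def overlap_length(a: str, b: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_overlap bases exists.
    """
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_contigs(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(reads)
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap_length(a, b, min_overlap)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


def estimated_gap(clone_length: int, left_to_contig_end: int,
                  contig_start_to_right: int) -> int:
    """Gap between two contigs linked by the two ends of one clone.

    left_to_contig_end: bases from the start of the left read to the end
    of the first contig. contig_start_to_right: bases from the start of
    the second contig to the end of the right read.
    """
    return clone_length - left_to_contig_end - contig_start_to_right


print(sorted(greedy_contigs(["ATGGCGT", "GCGTACC", "TTTTTTT"])))
print(estimated_gap(10_000, 3_000, 4_500))
```

The toy example also hints at why repeats are troublesome: if the same sequence occurs at several places in the genome, an overlap between two reads no longer proves that they come from the same locus, so only the distance information from longer fragments can resolve the ambiguity.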
3. References
-
Rubin, G. M. and Lewis, E. B. (2000): A Brief History of Drosophila's
Contributions to Genome Research. Science 287, pp. 2216-2218.
-
Adams et al. (2000): The Genome Sequence of Drosophila melanogaster.
Science 287, pp. 2185-2195.
-
Myers et al. (2000): A Whole-Genome Assembly of Drosophila. Science
287, pp. 2196-2204.
Some additional Drosophila links:
The Berkeley Drosophila Genome Project
The European Drosophila Genome Project
FlyBase
The Interactive Fly
The Drosophila genome sequences were deposited in GenBank, accession
numbers AE002566-AE003403.