Drosophila Genome Assembler Description
General Points
- When assembled, the genome of Drosophila will read like a long sentence
written with the same four letters repeated over and over in varying order.
- The challenge: to recreate the whole sentence using overlapping sentence
fragments of 500 letters each.
- The three million fly fragments are sampled from the gene-rich regions of
the genome (about 120 million letters).
- Together, these fragments contain enough DNA to cover the genome 14 times
over.
- The genome is sequenced at this depth to reduce the chance that the random
sampling misses any of the targeted regions.
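
The coverage figures above can be checked with a little arithmetic. The sketch below uses the rounded numbers quoted in this list. As an added assumption that is not stated in the source, it also uses a simple Poisson model of random sampling, in which a given letter is missed with probability e^(-coverage). The function names are illustrative only.

```python
import math


def coverage(num_fragments: int, fragment_length: int, genome_length: int) -> float:
    """Average number of times each letter of the target is sequenced."""
    return num_fragments * fragment_length / genome_length


def prob_letter_missed(cov: float) -> float:
    """Chance that a given letter is covered by no fragment (Poisson assumption)."""
    return math.exp(-cov)


# Rounded figures from the list above: 3 million fragments of 500 letters
# sampled from about 120 million letters.
rounded = coverage(3_000_000, 500, 120_000_000)
print(rounded)  # 12.5

# Under the Poisson assumption, deeper coverage sharply lowers the miss rate.
for cov in (1, 5, 14):
    print(cov, prob_letter_missed(cov))
```

The rounded inputs give about 12.5-fold coverage, a little below the 14-fold figure quoted above, which presumably reflects unrounded fragment counts and lengths. Under the Poisson assumption, the chance of missing any particular letter at 14-fold coverage is below one in a million.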
Steps in assembly
- The first stage in assembly is the heavy lifting: the assembler compares
the millions of fragments against each other, finding every segment of at
least 40 letters that two fragments have in common.
- The assembler must then distinguish the "true" overlaps from the
"repeat-induced" overlaps.
- The assembler now searches for groups of overlapping fragments that 1)
together spell a common sequence, and 2) do not overlap fragments with
sequences that dispute, or contest, the common sequence.
- Uncontested groups of fragments are assembled into what are called
"unitigs." Each unitig contains on average about 30 fragments. There are
about 100 times fewer overlaps between unitigs than between fragments.
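
The overlap search in the first step can be sketched as follows. This is a minimal illustration, not the assembler's actual method: it assumes exact matches only (real fragments contain sequencing errors), ignores the reverse strand, and indexes every segment of the minimum length in a dictionary. The function name and the toy fragments are hypothetical, and the demo uses a minimum length of 5 instead of 40 so the data stays readable.

```python
from collections import defaultdict
from itertools import combinations


def shared_segment_pairs(fragments: dict[str, str], min_len: int = 40) -> set[tuple[str, str]]:
    """Return pairs of fragment ids that share an exact segment of min_len letters."""
    index = defaultdict(set)  # segment -> ids of the fragments that contain it
    for frag_id, seq in fragments.items():
        for i in range(len(seq) - min_len + 1):
            index[seq[i:i + min_len]].add(frag_id)

    pairs = set()
    for ids in index.values():
        # Every two fragments listed under the same segment are overlap candidates.
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


# Toy data: f1 and f2 share "CGGTCA"; f3 shares nothing with either.
fragments = {
    "f1": "ACGTACGGTCA",
    "f2": "CGGTCATTTGA",
    "f3": "TTTTTTTTTTT",
}
print(shared_segment_pairs(fragments, min_len=5))  # {('f1', 'f2')}
```

Indexing segments avoids comparing every fragment against every other fragment directly, which matters when there are millions of them. Deciding which candidate overlaps are true and which are repeat-induced is a separate step that this sketch does not attempt.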

- A statistic called the Discriminator is used to find stacks of fragments
that are suspiciously high. The assembler identifies unitigs consisting of
repeats by looking at their "depth," the number of fragments stacked over
the same stretch of sequence. Correctly assembled unitigs that do not spell
repetitive DNA are the equivalent of no more than one deck of cards deep.
These are called U-unitigs.
- The next phase is called scaffolding. This is analogous to setting up the
frame of a building.
- A contiguous sequence of ordered unitigs is a contig. During scaffolding,
the assembler orients contigs using mates.
- Most mate pairs are reliable landmarks: they stick together and remain the
same distance apart.
- If mates from the same pair lie on different contigs, for instance, the
contigs are likely to be neighbors about 1% of the time.
- Sets of contigs that are ordered and oriented using enforcing pairs are
called scaffolds.
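
The way mates tie contigs together during scaffolding can be sketched as a simple link count. This is an illustration under stated assumptions, not the assembler's actual procedure: the function name, the input layout, and the default threshold of two confirming pairs are all hypothetical, and the sketch ignores orientation and the expected distance between mates.

```python
from collections import Counter


def contig_links(mate_pairs, fragment_to_contig, min_links=2):
    """Count mate pairs that span two different contigs.

    Returns only the contig pairs supported by at least min_links mate pairs
    (min_links=2 is an assumed threshold, not a value from the source).
    """
    links = Counter()
    for left, right in mate_pairs:
        c1 = fragment_to_contig.get(left)
        c2 = fragment_to_contig.get(right)
        # Skip mates that were never placed or that sit on the same contig.
        if c1 is None or c2 is None or c1 == c2:
            continue
        links[tuple(sorted((c1, c2)))] += 1
    return {pair: n for pair, n in links.items() if n >= min_links}


# Toy data: two mate pairs join contigA and contigB; only one joins A and C.
fragment_to_contig = {
    "a1": "contigA", "a2": "contigA",
    "b1": "contigB", "b2": "contigB",
    "c1": "contigC",
}
mate_pairs = [("a1", "b1"), ("a2", "b2"), ("a1", "c1"), ("b1", "b2")]
print(contig_links(mate_pairs, fragment_to_contig))  # {('contigA', 'contigB'): 2}
```

Requiring more than one confirming pair guards against an occasional bad mate pair joining contigs that are not really neighbors.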

- At this point, the scaffolding is continuous except for gaps. Some of
these gaps are due to missing sequence; this is unavoidable in shotgun
sequencing.
- Other gaps contain repetitive sequence that can now be closed using the
unitigs that were set aside earlier by the discriminating statistic.
- The assembler classifies repeat sequences by size and reliability, calling
the largest and most reliable repeats "rocks."
- Rocks must be linked to the contigs on either side of a gap by two or more
mates.

- Stones are linked to the contigs by only one mate. Their position in a gap
is confirmed by overlaps.

- Pebbles are placed in a gap based on the quality of the overlaps between
each other and the adjoining contigs.
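
The rock, stone, and pebble categories above can be summarized by how many mates link a repeat unitig to the contigs around a gap. The sketch below is a deliberate simplification: it looks only at the mate count, whereas the assembler is described as also weighing size and reliability, and treating a unitig with no mate links as a pebble is an assumption drawn from the descriptions above. The function name is hypothetical.

```python
def classify_repeat_unitig(mate_links: int) -> str:
    """Label a repeat unitig by the number of mates linking it to flanking contigs."""
    if mate_links < 0:
        raise ValueError("mate_links cannot be negative")
    if mate_links >= 2:
        return "rock"    # two or more mates
    if mate_links == 1:
        return "stone"   # one mate; position confirmed by overlaps
    return "pebble"      # no mates (assumed); placed by overlap quality alone


for n in (3, 1, 0):
    print(n, classify_repeat_unitig(n))
```

The ordering reflects decreasing confidence: rocks are placed on the strongest evidence and pebbles on the weakest.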

- Assembly has created a path across the unique, gene-bearing regions of the
genome and characterized the intervening repeats.
- Check Celera for more information.
Information compiled from linked sources by: Rohan Jude Fernandes