Phrap and Phred

A Chromatogram showing the sequencing output.
Why Phred?
- The output of sequencing contains errors
- Because of anomalous migration of very short fragments and unreacted
dye-primer or dye-terminator molecules, the first 50 or so peaks of a trace
are noisy and unevenly spaced.
- Toward the end of the trace, the peaks become progressively less evenly
spaced as a result of less accurate trace processing.
- In better resolved regions of the trace, the most commonly seen
electrophoretic anomalies are compressions. These are visible as a peak
shifted left of its expected position.
- Other problems include weak or variable signal strength and noise peaks
that do not correspond to a base.
What does Phred do?
- The phred base-caller uses a four-phase procedure to determine a sequence
of base-calls from the processed trace.
- In the first phase, idealized peak locations (predicted peaks) are
determined. This is done on the basis of the even placement of peaks in most
regions of the gel.
- The peak prediction attempts to find the idealized location of the base
peaks, using simple Fourier methods.
- In the second phase, the observed peaks are identified in the trace.
- In the third phase, observed peaks are matched to the predicted peak
locations, omitting some peaks and splitting others.
- In the final phase, the uncalled observed peaks are checked for any
peak that appears to represent a base but could not be assigned to a predicted
peak in the third phase.
- In this phase a peak is called only if it meets all of the following
criteria:
  - it has the largest signal;
  - it meets a minimum size criterion;
  - it is unsplit;
  - it is flanked by resolved peaks; and
  - adding it improves the peak spacing.
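- Phred returns a quality value q = -10 * log_10(P_e) for each base call,
where P_e is the estimated probability that the call is wrong.
- Its output is designed to be used as input to Phrap.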
- For more on Phred, see "Base-Calling of Automated Sequencer Traces Using
Phred. I. Accuracy Assessment", Genome Res. 1998 8: 175-185, and "Base-Calling
of Automated Sequencer Traces Using Phred. II. Error Probabilities", Genome
Res. 1998 8: 186-194.
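
The quality value formula above, q = -10 * log_10(P_e), can be illustrated
with a short conversion sketch. This is not Phred's own code; the function
names error_prob_to_quality and quality_to_error_prob are hypothetical and
chosen only for this example.

```python
import math


def error_prob_to_quality(p_error: float) -> float:
    """Convert an estimated base-call error probability into a quality value.

    Implements q = -10 * log10(P_e) as stated in the text.
    """
    if not 0.0 < p_error <= 1.0:
        raise ValueError("error probability must be in the interval (0, 1]")
    return -10.0 * math.log10(p_error)


def quality_to_error_prob(quality: float) -> float:
    """Invert the formula: P_e = 10 ** (-q / 10)."""
    return 10.0 ** (-quality / 10.0)


if __name__ == "__main__":
    # A few example error probabilities and their quality values.
    for p in (0.1, 0.01, 0.001):
        print(p, round(error_prob_to_quality(p), 2))
```

Under this formula, every increase of 10 in the quality value corresponds to a
tenfold drop in the estimated error probability.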
Phrap's Algorithm
- Find pairs of reads with matching words. Eliminate exact duplicate reads.
Do swat comparisons of pairs of reads which have matching words, compute
(complexity-adjusted) swat score.
- Find probable vector matches and mark so they aren't used in assembly.
- Find near duplicate reads.
- Find reads with self-matches.
- Find matching read pairs that are "node-rejected", i.e. that do not have
"solid" matching segments.
- Use pairwise matches to identify confirmed parts of reads; use these to
compute revised quality values.
- Compute LLR scores for each match (based on qualities of discrepant and
matching bases). (Iterate the above two steps.)
- Find best alignment for each matching pair of reads that have more than
one significant alignment in a given region (highest LLR-scores among several
overlapping).
- Identify probable chimeric and deletion reads (the latter are withheld
from assembly).
- Construct contig layouts, using consistent pairwise matches in decreasing
score order (greedy algorithm). Consistency of layout is checked at pairwise
comparison level.
- Construct contig sequence as a mosaic of the highest quality parts of the
reads. Align reads to contig; tabulate inconsistencies (read / contig
discrepancies) and possible sites of misassembly. Adjust LLR-scores of
contig sequence.
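
The greedy layout step above (taking consistent pairwise matches in decreasing
score order) can be sketched as follows. This is a toy illustration, not
Phrap's implementation: the function greedy_layout, the representation of a
match as (read_a, read_b, offset, score), and the offset-based consistency test
are all assumptions made for the example, and read orientation is ignored.

```python
from typing import Dict, List, Tuple

# Hypothetical match record: (read_a, read_b, offset of b relative to a, score)
Match = Tuple[str, str, int, float]


def greedy_layout(
    reads: List[str], matches: List[Match], tolerance: int = 0
) -> Tuple[List[Dict[str, int]], List[Match]]:
    """Merge reads into contig layouts, best-scoring matches first.

    Returns the layouts (read -> start position within its contig) and the
    matches rejected as inconsistent with the layout built so far.
    """
    contig_of = {r: r for r in reads}      # read -> id of its contig
    layout = {r: {r: 0} for r in reads}    # contig id -> {read: position}
    rejected: List[Match] = []

    for a, b, offset, score in sorted(matches, key=lambda m: m[3], reverse=True):
        ca, cb = contig_of[a], contig_of[b]
        if ca == cb:
            # Both reads are already placed: the match must agree with the
            # offset implied by the existing layout (assumed consistency test).
            implied = layout[ca][b] - layout[ca][a]
            if abs(implied - offset) > tolerance:
                rejected.append((a, b, offset, score))
            continue
        # Merge contig cb into ca, shifting it so b lands at pos(a) + offset.
        shift = layout[ca][a] + offset - layout[cb][b]
        for read, pos in layout.pop(cb).items():
            layout[ca][read] = pos + shift
            contig_of[read] = ca

    return list(layout.values()), rejected
```

Because matches are processed in decreasing score order, a low-scoring match
that contradicts a layout already built from higher-scoring matches is the one
that gets rejected, which is the essence of the greedy strategy.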
For more on Phrap, see the Phrap documentation.
Information compiled from linked sources by: Rohan Jude Fernandes