Presentation by Gregorio Valdez
Background:
A) DNA sequencing
1) Manually
2) Automated
Dye primer chemistry
Dye terminator chemistry
3) Problems in the sequencing reaction(s)
B) Computer analysis of sequenced data
Phred
Phrap

A) DNA Sequencing: the dideoxy chain-termination method of Sanger is still the preferred method.
1) Manual sequencing- laborious, two-day minimum to complete, max of 4-6 different samples sequenced per two days, the terminating nucleotide is labeled with a radioactive tracer.
2) Automated sequencing- easy, reactions are completed from 12-24 hours, can run hundreds of reactions per day, the primer (dye primer chemistry) or terminating nucleotides (dye terminator chemistry) are labeled with fluorescent dyes.
-In the dye terminator chemistry each of the four terminating nucleotide is labeled with a different dye. Therefore, the entire sequencing reaction can be run as one. A laser excites the fluorescent dyes as they come out of the gel and detectors collect the emission intensities of the four different wavelengths and generates a chromatogram.
3) Problems in the sequencing reaction(s): imperfections in the sequencing reaction, gel electrophoresis.
For example: the 1st 50 or so peaks of a trace are noisy and unevenly spaced due to greater effects of the dye and specificity of base sequences on mobility. At the end of a reaction, diffusion effects increase, mass difference of the base sequence decrease, and the number of labeled fragments decreases. Also, in the dye terminator chemistry method, the polymerase has a reduced affinity for the dye labeled nucleotide, especially for a labeled G following and A.
An ideal trace would be composed of peaks evenly spaced and nonoverlapping.
B) Computer analysis of sequences: converts the gel image to an inferred sequence.
Four phases: 1- lane tracking
Use for identifying lane boundaries
2- lane profiling
Each of the four signals is summed across the lane width to create a profile
3-trace processing
Smoothes and deconvolve the trace, reduce noise and account for errors in the sequencing reactions.
4- base-calling
Translates the processed trace into a sequence of bases (Phred)
Phred
A base-calling program for automated sequencer traces. It assigns an error probability to each base called and yields 40-50% lower errors than the ABI software: Input are in the form of chromatogram files in ABI format or standard chromatogram format (SCF).
Base-calling pipeline: consists of four-phases
Output files are in SCF and PHD format with base calls and quality information and FASTA format either as a sequence or quality information file.
Quality scores (Q) are based on probability errors (P).
Q=-10 log10 (P)
|
Processing Options |
|
|
-trim <enzyme sequence> |
Perform sequence trimming on the current sequence. Bases are trimmed from the start and end of the sequence on the basis of trace quality. In addition, <enzyme sequence> specifies a base sequence that is usedto trim bases off the start of the current sequence. You can specify a NULL enzyme sequence using empty double quotes, "". See the note below on the effect of using the trim option. |
|
-trim_alt <enzyme sequence> |
Perform sequence trimming on the current sequence. Bases are trimmed from the start and end of the sequence on the basis of trace quality. Specifically, the value .05 is subtracted from each phred error probability value and the maximum scoring subsequence is retained. Furthermore, the subsequence must have a minimum number of bases and a minimum average base quality. In addition, <enzyme sequence> specifies a base sequence that is used sequence. You can specify a NULL enzyme sequence using empty double quotes, "". |
|
-nocmpqv |
Force phred to use the four parameter quality values. By default, phred uses five parameter quality values for dye primer data (only) in order to reduce the quality values of merged CC and GG peaks. (Phred uses the four parameter quality values for dye terminator chemistry data automatically. If phred cannot determine the chemistry, it uses the four parameter quality values.) |
|
-ceilqv <ceil_qv> |
Specifies a maximum quality value assigned to bases. Bases with quality value parameters that correspond to quality values greater than <ceil_qv> are assigned the value <ceil_qv>. |
Phrap
The phragmet assembly program is used to assemble shotgun DNA sequences accurately using Phreds quality score files.
How PHRAP works
The following is just a very brief and incomplete description of how PHRAP generates assemblies. For more detailed information, please check the "Additional Readings" section.
1. PHRAP reads the sequences from the input file, and quality values from the corresponding quality file (if present).
2. PHRAP masks likely "garbage" sequence at the beginning and end of reads, which is converted to 'N's. Specifically, PHRAP looks for regions that consist almost entirely of a single base; such regions are most likely due to poor data quality, and could give rise to spurious matches.
3. PHRAP identifies all potentially overlapping pairs of sequences. Two sequences must have at least one "word" of length minmatch (typically 14 bases) in common, and the alignment between the sequences must have an alignment score of at least minscore (default: 30).
4. PHRAP calculates modified quality scores for each base in each read, taking into account confirmation by overlapping reads as well as orientation and sequencing chemistry. PHRAP also calculates "LLR" scores for each pairwise alignment. LLR scores are a measure of overlap length and quality; high quality discrepancies that might indicate different copies of a repeat lead to low LLR scores. Potential problem clones like chimeras and deletions clones are also identified.
5. Merge reads into contigs, starting at the pairwise overlaps with the highest LLR scores. Probable deletion clones and chimeras are not used in any merges.
6. PHRAP calculate the contig ("consensus") sequences, which are pieced together from individual sequence reads with the highest adjusted quality scores at any base. Since the adjusted qualities take confirmation and discrepancies in other reads into account, the quality of the consensus sequence set to be identical to the revised quality of the highest quality read. If the original quality values were linked to error probabilities (like PHRED quality values are), the quality scores of the contig sequence are conservative estimates of the error probabilities in the contig sequences.
Input Files
PHRAP typically reads input sequences from a single file, which must be in FASTA format. PHRAP will also read base-specific quality values, if it can find a file that has ".qual" appended to the name of the sequence file (for example quality files generated by PHRED or PHD2FASTA).
Any non-alphabetic characters, including '*', in the sequence lines are automatically stripped out when the file is read in, except for '-' which is converted to 'N'. The character '>' must not appear anywhere within the sequence, since it is assumed to start the header of a new sequence, even if it is not at the beginning of a line. Lower case letters are converted to upper case on readin, so it is no longer possible (as in previous PHRAP versions) to use case as a crude indication of quality.
Naming Conventions For Sequence Reads
PHRAP uses information about sequencing chemistry (dye primer or dye terminator) when calculating consensus quality values: confirmation of a dye primer read with a dye terminator counts as much as confirmation by an opposite-strand read.
If sequence from both ends of a clone is present in the assembly, this information is currently only used for consistency checking after assembly, not during the assembly itself (some independently developed finishing tools can also utilize the double-ended information).
Running PHRAP
After generating a FASTA file with all the individual sequence reads that are to be used in the assembly (for example by running PHRED and then CROSS_MATCH for vector screening), you are ready to run PHRAP. Let us assume that you the sequence file is called Reads.fasta, and that it is located in a directory called C:\Project\Asm1. Open an MS-DOS window, and type the following commands:
cd c:\Project\Asm1
phrap Reads.fasta -ace > readslog.txt
Depending on the project size, PHRAP will run for a few seconds to several hours (see next section). PHRAP will generate a number output files:
readslog.txt - a file with a lot of information about the assembly.
The following table give an overview of the files PHRAP creates:
|
File name |
Description |
|
Reads.fasta.contigs |
The contig sequences in multiple FASTA format. |
|
Reads.fasta.contigs.qual |
Quality scores for the contig sequences (this file will be produced even if PHRAP could not find quality values for the individual reads). |
|
Reads.fasta.singlets |
Sequence reads that did not overlap with any other sequences. |
|
Reads.fasta.ace |
A file that contains "the entire assembly" in ACE format. This file is read by CONSED for contig editing. |
|
readslog.txt |
A file containing additional information about the assembly (this file is typically ignored, but can be helpful when working on assembly problems). |
Memory And Speed Considerations
The amount of memory (RAM) PHRAP needs depends on both the project size and the number of repeats in a project. For cosmid-sized projects with several hundred sequences, PHRAP may need 25 MB of RAM, and complete the assembly in less than three minutes on a 300 MHz Pentium II computer (cosmids with many repeats may need more memeory and time). On Unix workstations, PHRAP has successfully been used to assemble more than 30,000 reads; such large runs may require 4 GB of memory, and several days to complete.
The actual amount of RAM available on the computer that PHRAP is running on should be larger than the amount of memory PHRAP needs; for example, computer for cosmid assembly should have at least 32 MB of RAM. If the amount of memory PHRAP needs is not available as RAM, PHRAP's performance can be drastically slower, since the computer will spend most of the time swapping block of memory to and from the hard disks.
For optimal performance on larger projects, we suggest that you quit all open applications before starting PHRAP.
Fine-tuning PHRAP
In the Windows and Unix versions, PHRAP uses command line options to control input, processing, and output. The command line options are delimited by a dash, "-".
The most commonly used command line options for PHRAP are:
|
Parameter |
Description |
Default |
|
-penalty |
mismatch (substitution) penalty for SWAT comparisons |
-2 |
|
-gap_init |
gap initiating penalty for SWAT comparisons |
- 2 |
|
-gap_ext |
gap extension penalty for SWAT comparisons |
- 1 |
|
-minmatch |
minimum length of matching word to nucleate SWAT comparison; if minmatch = 0, a full (non-banded) comparison is done |
14 |
|
-minscore |
minimum SWAT score |
30 |
|
-bandwidth |
half band width for banded SWAT searches (full width is 2 n + 1) |
14 |
|
-repeat_stringency |
Controls stringency of match required for joins. Must be less than 1, and greater than 0. The closer to 1, the higher the stringency. |
0.95 |
|
-matrix |
score matrix for SWAT comparisons (supersedes -penalty) |
[default has +1 for matches, -penalty for mismatches] |
|
-raw |
use raw rather than complexity-adjusted Smith-Waterman scores. |
(default is to adjust scores to reflect complexity of matching sequence). |
|
-tags |
tags selected output lines, to facilitate parsing |
(This is a flag; no value should be provided). |
|
-indexwordsize |
size of indexing (hashing) words in words.c. The value of this parameter has a (small) effect on run time and memory usage. |
10 |
|
-ace |
Create ".ace" file with the assembly information. |
(This is a flag; no value should be provided). |
|
-new_ace |
Create ".ace" file in a newer format |
(This is a flag; no value should be provided). |
|
-view |
create ".view" file suitable for input to phrapview |
(This is a flag; no value should be provided). |
|
-maxgap |
Maximum gap (unmatched region) allowed in merging contigs. |
30 |
|
-trim_start |
Number of bases to be removed at beginning of each read |
0 |
|
-vector_bound |
Number of bases at beginning of each read, matches within which are assumed to be vector |
60 |
|
-trim_penalty |
Penalty used for identifying degenerate sequence at beginning/end of read |
-2 |
|
-trim_score |
Minimum score for identifying degenerate sequence at beginning/end of read |
20 |
|
-confirm_length |
Minimum size of confirming segment (starts at 3d distinct nuc following discrepancy) |
8 |
|
-confirm_trim |
Amount by which confirming segments are trimmed at edges |
1 |
|
-confirm_penalty |
Penalty used in aligning against "confirming" reads |
-5 |
|
-confirm_score |
Minimum alignment score for a read to be allowed to "confirm" part of another read |
30 |
|
-qual_show |
LLR cutoff for flagging "low_quality" regions in contig sequence and "high quality" discrepancies between read and contig. Bases in the .contigs and .ace file are lower case if their LLR-converted quality is below this value. |
20 |
|
-node_seg |
Minimum segment size (for purposes of traversing weighted directed graph) |
8 |
|
-node_space |
Spacing between nodes (in weighted directed graph) |
4 |
|
-revise_greedy |
Splits initial greedy assembly into pieces at "weak joins", and then tries to reattach them to give higher overall score. |
(This is a flag; no value should be provided). |
|
-shatter_greedy |
Breaks assembly at weak joins (as with revise_greedy) but does not try to reattach pieces. |
(This is a flag; no value should be provided). |
|
-forcelevel |
Relaxes stringency to varying degree during final contig merge pass. Ranges from 0 (most stringent) to 10 (least stringent). |
0 |
|
-force_high |
Allows ignoring high-quality discrancies during final contig merge pass. |
(This is a flag; no value should be provided). |
An Example run from CodonCode corp
.|
Command |
Explanation |
|
cd C:\Progra~1\CodonC~1\Demo |
Switch to the Demo directory. |
|
mkdir test |
Create a directory that we'll work in. |
|
cd test |
Go there. |
|
mkdir phd_dir |
Create a sub-directory for the PHD files PHRED will generate. |
|
P HRED -id ../demo7 -pd phd_dir |
Run PHRED, and tell it to process all files in the input directory (-id) ../demo7, and to generate PHD files in the output directory (-pd) phd_dir. |
|
File name |
Description |
|
test.fasta.screen.contigs |
The consensus sequences of all contigs in multiple fasta formats. In our example, the file should contain just one sequence, named "Contig1". |
|
test.fasta.screen.contigs.qual |
Quality values for each base in each consensus sequence, also in fasta format. |
|
test.fasta.screen.singlets |
A multiple fasta file containing all input sequence which does not have a significant overlap with any other input sequence. In our example, this file should be empty. |