Presentation by Gregorio Valdez

Background:

A) DNA sequencing

1) Manually

2) Automated

Dye primer chemistry

Dye terminator chemistry

3) Problems in the sequencing reaction(s)

B) Computer analysis of sequenced data

Phred

Phrap

A) DNA Sequencing: the dideoxy chain-termination method of Sanger is still the preferred method.

1) Manual sequencing- laborious, two-day minimum to complete, max of 4-6 different samples sequenced per two days, the terminating nucleotide is labeled with a radioactive tracer.

2) Automated sequencing- easy, reactions are completed from 12-24 hours, can run hundreds of reactions per day, the primer (dye primer chemistry) or terminating nucleotides (dye terminator chemistry) are labeled with fluorescent dyes.

-In the dye terminator chemistry each of the four terminating nucleotide is labeled with a different dye. Therefore, the entire sequencing reaction can be run as one. A laser excites the fluorescent dyes as they come out of the gel and detectors collect the emission intensities of the four different wavelengths and generates a chromatogram.

3) Problems in the sequencing reaction(s): imperfections in the sequencing reaction, gel electrophoresis.

For example: the 1st 50 or so peaks of a trace are noisy and unevenly spaced due to greater effects of the dye and specificity of base sequences on mobility. At the end of a reaction, diffusion effects increase, mass difference of the base sequence decrease, and the number of labeled fragments decreases. Also, in the dye terminator chemistry method, the polymerase has a reduced affinity for the dye labeled nucleotide, especially for a labeled G following and A.

An ideal trace would be composed of peaks evenly spaced and nonoverlapping.

 

 

 

 

B) Computer analysis of sequences: converts the gel image to an inferred sequence.

Four phases: 1- lane tracking

Use for identifying lane boundaries

2- lane profiling

Each of the four signals is summed across the lane width to create a profile

3-trace processing

Smoothes and deconvolve the trace, reduce noise and account for errors in the sequencing reactions.

4- base-calling

Translates the processed trace into a sequence of bases (Phred)

 

 

 

 

 

 

 

 

 

 

Phred

A base-calling program for automated sequencer traces. It assigns an error probability to each base called and yields 40-50% lower errors than the ABI software: Input are in the form of chromatogram files in ABI format or standard chromatogram format (SCF).

Base-calling pipeline: consists of four-phases

    1. It determines the ideal or predicted peak locations using the fact that fragments should be locally evenly spaced. It then uses this information to determine the correct number of bases in a region where peaks are not well resolved, noisy or displaced.
    2. Observed peaks are identified in the trace
    3. Observed peaks are matched to predicted peak location, omitting some and splitting other. This is the main base sequence.
    4. Unmatched peaks are analyzed to see if they represent bases. If they represent bases, the base is inserted into the sequence.

Output files are in SCF and PHD format with base calls and quality information and FASTA format either as a sequence or quality information file.

Quality scores (Q) are based on probability errors (P).

Q=-10 log10 (P)

 

 

 

Processing Options

-trim <enzyme sequence>

Perform sequence trimming on the current sequence. Bases are trimmed from the start and end of the sequence on the basis of trace quality. In addition, <enzyme sequence> specifies a base sequence that is usedto trim bases off the start of the current sequence. You can specify a NULL enzyme sequence using empty double quotes, "". See the note below on the effect of using the trim option.

-trim_alt <enzyme sequence>

Perform sequence trimming on the current sequence. Bases are trimmed from the start and end of the sequence on the basis of trace quality. Specifically, the value .05 is subtracted from each phred error probability value and the maximum scoring subsequence is retained. Furthermore, the subsequence must have a minimum number of bases and a minimum average base quality. In addition, <enzyme sequence> specifies a base sequence that is used sequence. You can specify a NULL enzyme sequence using empty double quotes, "".

-nocmpqv

Force phred to use the four parameter quality values. By default, phred uses five parameter quality values for dye primer data (only) in order to reduce the quality values of merged CC and GG peaks. (Phred uses the four parameter quality values for dye terminator chemistry data automatically. If phred cannot determine the chemistry, it uses the four parameter quality values.)

-ceilqv <ceil_qv>

Specifies a maximum quality value assigned to bases. Bases with quality value parameters that correspond to quality values greater than <ceil_qv> are assigned the value <ceil_qv>.

 

 

 

 

 

 

Phrap

The phragmet assembly program is used to assemble shotgun DNA sequences accurately using Phred’s quality score files.

    1. Allows use of entire reads (2) uses a combination of user-supplied and internally computed data quality information to improve accuracy of assembly in the presence of repeats; (3) constructs contig sequence as a mosaic of the highest quality parts of reads, rather than a consensus.

 

How PHRAP works

The following is just a very brief and incomplete description of how PHRAP generates assemblies. For more detailed information, please check the "Additional Readings" section.

1. PHRAP reads the sequences from the input file, and quality values from the corresponding quality file (if present).

2. PHRAP masks likely "garbage" sequence at the beginning and end of reads, which is converted to 'N's. Specifically, PHRAP looks for regions that consist almost entirely of a single base; such regions are most likely due to poor data quality, and could give rise to spurious matches.

3. PHRAP identifies all potentially overlapping pairs of sequences. Two sequences must have at least one "word" of length minmatch (typically 14 bases) in common, and the alignment between the sequences must have an alignment score of at least minscore (default: 30).

4. PHRAP calculates modified quality scores for each base in each read, taking into account confirmation by overlapping reads as well as orientation and sequencing chemistry. PHRAP also calculates "LLR" scores for each pairwise alignment. LLR scores are a measure of overlap length and quality; high quality discrepancies that might indicate different copies of a repeat lead to low LLR scores. Potential problem clones like chimeras and deletions clones are also identified.

5. Merge reads into contigs, starting at the pairwise overlaps with the highest LLR scores. Probable deletion clones and chimeras are not used in any merges.

6. PHRAP calculate the contig ("consensus") sequences, which are pieced together from individual sequence reads with the highest adjusted quality scores at any base. Since the adjusted qualities take confirmation and discrepancies in other reads into account, the quality of the consensus sequence set to be identical to the revised quality of the highest quality read. If the original quality values were linked to error probabilities (like PHRED quality values are), the quality scores of the contig sequence are conservative estimates of the error probabilities in the contig sequences.

 

Input Files

PHRAP typically reads input sequences from a single file, which must be in FASTA format. PHRAP will also read base-specific quality values, if it can find a file that has ".qual" appended to the name of the sequence file (for example quality files generated by PHRED or PHD2FASTA).

Any non-alphabetic characters, including '*', in the sequence lines are automatically stripped out when the file is read in, except for '-' which is converted to 'N'. The character '>' must not appear anywhere within the sequence, since it is assumed to start the header of a new sequence, even if it is not at the beginning of a line. Lower case letters are converted to upper case on readin, so it is no longer possible (as in previous PHRAP versions) to use case as a crude indication of quality.

 

Naming Conventions For Sequence Reads

PHRAP uses information about sequencing chemistry (dye primer or dye terminator) when calculating consensus quality values: confirmation of a dye primer read with a dye terminator counts as much as confirmation by an opposite-strand read.

If sequence from both ends of a clone is present in the assembly, this information is currently only used for consistency checking after assembly, not during the assembly itself (some independently developed finishing tools can also utilize the double-ended information).

 

 

 

 

 

Running PHRAP

After generating a FASTA file with all the individual sequence reads that are to be used in the assembly (for example by running PHRED and then CROSS_MATCH for vector screening), you are ready to run PHRAP. Let us assume that you the sequence file is called Reads.fasta, and that it is located in a directory called C:\Project\Asm1. Open an MS-DOS window, and type the following commands:

cd c:\Project\Asm1

phrap Reads.fasta -ace > readslog.txt

 

Depending on the project size, PHRAP will run for a few seconds to several hours (see next section). PHRAP will generate a number output files:

readslog.txt - a file with a lot of information about the assembly.

The following table give an overview of the files PHRAP creates:

File name

Description

Reads.fasta.contigs

The contig sequences in multiple FASTA format.

Reads.fasta.contigs.qual

Quality scores for the contig sequences (this file will be produced even if PHRAP could not find quality values for the individual reads).

Reads.fasta.singlets

Sequence reads that did not overlap with any other sequences.

Reads.fasta.ace

A file that contains "the entire assembly" in ACE format. This file is read by CONSED for contig editing.

readslog.txt

A file containing additional information about the assembly (this file is typically ignored, but can be helpful when working on assembly problems).

 

 

Memory And Speed Considerations

The amount of memory (RAM) PHRAP needs depends on both the project size and the number of repeats in a project. For cosmid-sized projects with several hundred sequences, PHRAP may need 25 MB of RAM, and complete the assembly in less than three minutes on a 300 MHz Pentium II computer (cosmids with many repeats may need more memeory and time). On Unix workstations, PHRAP has successfully been used to assemble more than 30,000 reads; such large runs may require 4 GB of memory, and several days to complete.

The actual amount of RAM available on the computer that PHRAP is running on should be larger than the amount of memory PHRAP needs; for example, computer for cosmid assembly should have at least 32 MB of RAM. If the amount of memory PHRAP needs is not available as RAM, PHRAP's performance can be drastically slower, since the computer will spend most of the time swapping block of memory to and from the hard disks.

For optimal performance on larger projects, we suggest that you quit all open applications before starting PHRAP.

 

Fine-tuning PHRAP

In the Windows and Unix versions, PHRAP uses command line options to control input, processing, and output. The command line options are delimited by a dash, "-".

The most commonly used command line options for PHRAP are:

Parameter

Description

Default

-penalty

mismatch (substitution) penalty for SWAT comparisons

-2

-gap_init

gap initiating penalty for SWAT comparisons

- 2

-gap_ext

gap extension penalty for SWAT comparisons

- 1

-minmatch

minimum length of matching word to nucleate SWAT comparison; if minmatch = 0, a full (non-banded) comparison is done

14

-minscore

minimum SWAT score

30

-bandwidth

half band width for banded SWAT searches (full width is 2 n + 1)

14

-repeat_stringency

Controls stringency of match required for joins. Must be less than 1, and greater than 0. The closer to 1, the higher the stringency.

0.95

-matrix

score matrix for SWAT comparisons (supersedes -penalty)

[default has +1 for matches, -penalty for mismatches]

-raw

use raw rather than complexity-adjusted Smith-Waterman scores.

(default is to adjust scores to reflect complexity of matching sequence).
(This is a flag; no value should be provided).

-tags

tags selected output lines, to facilitate parsing

(This is a flag; no value should be provided).

-indexwordsize

size of indexing (hashing) words in words.c. The value of this parameter has a (small) effect on run time and memory usage.

10

-ace

Create ".ace" file with the assembly information.

(This is a flag; no value should be provided).

-new_ace

Create ".ace" file in a newer format

(This is a flag; no value should be provided).

-view

create ".view" file suitable for input to phrapview

(This is a flag; no value should be provided).

-maxgap

Maximum gap (unmatched region) allowed in merging contigs.

30

-trim_start

Number of bases to be removed at beginning of each read

0

-vector_bound

Number of bases at beginning of each read, matches within which are assumed to be vector

60

-trim_penalty

Penalty used for identifying degenerate sequence at beginning/end of read

-2

-trim_score

Minimum score for identifying degenerate sequence at beginning/end of read

20

-confirm_length

Minimum size of confirming segment (starts at 3d distinct nuc following discrepancy)

8

-confirm_trim

Amount by which confirming segments are trimmed at edges

1

-confirm_penalty

Penalty used in aligning against "confirming" reads

-5

-confirm_score

Minimum alignment score for a read to be allowed to "confirm" part of another read

30

-qual_show

LLR cutoff for flagging "low_quality" regions in contig sequence and "high quality" discrepancies between read and contig. Bases in the .contigs and .ace file are lower case if their LLR-converted quality is below this value.

20

-node_seg

Minimum segment size (for purposes of traversing weighted directed graph)

8

-node_space

Spacing between nodes (in weighted directed graph)

4

-revise_greedy

Splits initial greedy assembly into pieces at "weak joins", and then tries to reattach them to give higher overall score.

(This is a flag; no value should be provided).

-shatter_greedy

Breaks assembly at weak joins (as with revise_greedy) but does not try to reattach pieces.

(This is a flag; no value should be provided).

-forcelevel

Relaxes stringency to varying degree during final contig merge pass. Ranges from 0 (most stringent) to 10 (least stringent).

0

-force_high

Allows ignoring high-quality discrancies during final contig merge pass.

(This is a flag; no value should be provided).

 

 

 

 

An Example run from CodonCode corp.

  1. Collect all trace files in one directory. Readable formats are sample files generated by Applied Biosystems software, or SCF ("Standard Chromatogram Format") files generated by other programs.
    In our example, seven example trace files in
    C:\Program Files\CodonCode\Demo\demo7 will be used.
  2. Run PHRED to create PHD files. PHD files are text file created by PHRED that contain the actual base picks, their position in each trace, and a quality value for each base. To run PHRED, open a DOS window, and switch to a directory you want to work in; for example:
  3.  Command

     Explanation

     cd C:\Progra~1\CodonC~1\Demo

     Switch to the Demo directory.

     mkdir test

     Create a directory that we'll work in.

     cd test

     Go there.

     mkdir phd_dir

     Create a sub-directory for the PHD files PHRED will generate.

    PHRED -id ../demo7 -pd phd_dir

     Run PHRED, and tell it to process all files in the input directory (-id) ../demo7, and to generate PHD files in the output directory (-pd) phd_dir.

  4. PHRED will generate one PHD file for each trace file it finds, and name it by appending ".phd.1" to the original file name.
  5.  

  6. Run PHD2FASTA to create a sequence file and a quality file. PHD2FASTA is a little utility program that extracts information from PHD files and create input files for CROSS_MATCH and PHRAP. The command is:
    PHD2FASTA -id phd_dir -os test.fasta -oq test.fasta.qual
  7. Run CROSS_MATCH to screen the input sequences for vector sequences. CROSS_MATCH can compare all input sequences to a file containing one of more vector sequences, and mask all vector sequences it finds by replacing the bases with "X". In this example, the vector sequence is in the file "vector.seq" one directory up. The command is:
    CROSS_MATCH test.fasta ../vector.seq -screen -minscore 20 -minmatch 12 > screen.txt
    The result of this run will be a new file named "test.fasta.screen", which is identical with the test.fasta file - except that all vector sequences of sufficient length have been replaced by "X". Since CROSS_MATCH is rather verbose, we re-direct output to a log file called screen.txt; some output will still appear on the screen.
  8. Rename the quality file (or the sequence files). PHRAP expects that the quality file has the same name as the sequence file, with ".qual" appended. Therefore, we have to rename either the .qual file or the .fasta file and the .fasta.screen file. In this example, we will rename the .qual file:
    ren test.fasta.qual test.fasta.screen.qual
  9. Run PHRAP. The command to run PHRAP is:
    PHRAP test.fasta.screen -ace > test.log.txt
    Again, we re-direct PHRAP output to a file, this time called test.log.txt. The -ace option tells PHRAP to generate a file in "ACE" format, which can be read by a number of contig editing and finishing programs, including CONSED. In our example, the output file will be called test.fasta.screen.ace. PHRAP will also generate a number of additional output files, even without the -ace option. The most important ones are:
  10. File name

    Description

    test.fasta.screen.contigs

    The consensus sequences of all contigs in multiple fasta formats. In our example, the file should contain just one sequence, named "Contig1".

    test.fasta.screen.contigs.qual

    Quality values for each base in each consensus sequence, also in fasta format.

    test.fasta.screen.singlets

    A multiple fasta file containing all input sequence which does not have a significant overlap with any other input sequence.

    In our example, this file should be empty.

  11.  The next section gives a description of command line options for all programs.
  12. Copyright © 1998-1999 Codoncode Corporation