next up previous
Next: About this document ... Up: My Home Page

Projects

I am eager to see some interesting projects done by pairing biologists and computer scientists.

For next class, give me a list of your three favorite project topics and I will do initial matchings based on interest and qualifications since many of you don't know each other.

An initial progress report is due in three weeks, enough time to determine whether progress is likely. You can switch projects then.

Talk to me if you have your own idea for a project. Group projects are preferred.

Each team will submit a report and a WWW site, and present the results to me during finals week.

I anticipate that at least one project will eventually lead to a publication if the class takes this seriously.

String Comparison

The two ``killer apps'' of modern string processing algorithms have been computational biology and searching the Internet.

While exact matching algorithms are more important in text processing than biology, they are important (1) to illustrate techniques, and (2) as a component of heuristics for approximate matching.

Three primary classes of string matching problems arise, determining whether preprocessing techniques are appropriate:

Efficient Exact String Matching

We have seen the naive brute force algorithm searching for pattern p in text t in O(mn) time, where m = |p| and n = |t|.

The brute force algorithm can be viewed as sliding the pattern across from left to right by one position when we detect a mismatch between the pattern and the text.

But in certain circumstances we can slide the pattern to the right by more than one position:

pattern:        ABCDABCE
text:       ....ABCDABCD....

Since we know the last seven characters of the text must be ABCDABC, we can shift the pattern four positions without missing any matches.

The Knuth-Morris-Pratt Algorithm

Whenever a character match fails, we can shift the pattern forward according to the failure function fail(q), which is the length of the longest prefix of P which is a proper suffix of Pq

          i:        1 2 3 4 5 6 7 8 9 10
          P[i]:     a b a b a b a b c a
          fail[i]:  0 0 1 2 3 4 5 6 0 1

NOTE: IS THIS FAILURE FUNCTION CORRECT?

NOTE: I SHOULD ADD PSEUDOCODE

NOTE: I SHOULD ADD MORE EXAMPLES

Given this prefix function we can match efficiently - on a character match we increment the pointers, on a mismatch we slide back according to the failure function.

Note that we will not match the pattern from the begining, but according to where we are in the text - we never look backwards in the text.

Computing the failure function for the pattern can be done in O(m) preprocessing, and hence does not change the asymptotic complexity.

Time Complexity

If the pattern rejects near the beginning, we did not waste much time in the search.

If the pattern rejects near the end, there is an opportunity to slide it back many positions.

Note that we never move backwards through the text to compare a character again - we slide the pattern forward accordingly.

The linearity of KMP follows from an amortized analysis. Each time we move forward in the text we add c steps to our account, while each mismatch costs us one step.

Provided the account never goes negative, we only did O(n) work total since there were a total of $\leq c n$ steps put in the account.

Thus if many consecutive mismatches requires the pattern to shift repeatedly, it is only because we have enough previous forward moves to compensate.

Correctness is harder to verify in the details than time complexity.

There are a host of ``improved'' failure functions which do even cleverer preprocessing to reduce the number of pattern shifts.

However, such improvements have little effect in practice and are tricky to program correctly.

The Boyer-Moore Algorithm

An alternate linear algorithm, tBoyer-Moore, starts matching from right side of the pattern instead of the left side:

pattern:    ABCDABCD
text:       ABCDABCE....

If the last character in the window does not occur anywhere in the pattern, we can shift the pattern m positions after one comparison.

Thus the best case time for Boyer-Moore is sub-linear, i.e. O(n/m). The worst-case is O(n+rm), where r is the number of times the pattern occurs in the text.

The algorithm precomputes a table of size $\vert\Sigma\vert$ describing how far to shift for a mismatch on each letter of the alphabet, i.e. depending upon the position of the rightmost occurrence of that letter in the pattern.

Further, since we know that the suffix of the pattern matched up to the point of mismatch, we can precompute a table recording the next place where the suffix occurs in the pattern.

We can use the bigger of the two shifts, since no intermediate position can define a legal match.

Boyer-Moore may be the fastest algorithm for alphanumeric string matching in practice when implemented correctly.

NOTE: I SHOULD ADD PSEUDOCODE

NOTE: I SHOULD ADD EXAMPLES

Randomized String Matching

The Rabin-Karp algorithm computes an appropriate hash function on all m-length strings of the text, and does a brute force comparison only if the hash value is the same for the text window and the pattern.

An appropriate hash function is

\begin{displaymath}H(S) = \sum_{i=1}^m d^i S_i ~\mbox{mod}~ q \end{displaymath}

which treats each string as an m-digit base-d number, mod q.

This hash function can be computed incrementally in constant time as we slide the window from left to right since

H(Sj+1) = d H(Sj) + Sj+1 - dm Sj-m

Further, if q is a random prime the expected number of false positives is small enough to yield a randomized linear algorithm.

Multiple Exact Patterns

Many applications require searching a text for occurrences of any one of many patterns, e.g. searching text for dirty words.

Pattern matching with wild card characters (ACG?T) is an important special case of multiple patterns.

Techniques from automata theory come into play, since any finite set of patterns can be modeled by regular expressions, and many interesting infinite sets (e.g. G(AT)*C) as well.

The standard UNIX tool grep stands for ``general regular expression pattern matcher''.

The Aho-Corasick algorithm is Thompson's NFA matching.

NOTE: I SHOULD ADD EXAMPLES AND PSEUDOCODE

Approximate String Comparison

An important generalization of exact string matching is measuring the distance between two or more strings.

These problems arise naturally in biology because DNA/protein sequences tend to be structurally conserved across species over the course of evolution, so functionally similar genes in different organisms can be detected via approximate string matching.

Computer science applications of approximate string matching included spell checking and file difference testing.

Edit Distance and Similarity Scores

Computer scientists typically measure distance between strings x and y as the minimum number of insertions, deletions, and substitutions required to transform x to y.

Certain mathematical properties are expected of any distance measure, or metric:

1.
$d(x,y) \geq 0$ for all x, y.
2.
d(x,y) = 0 iff x=y.
3.
d(x,y) = d(y,x) (symmetry)
4.
$d(x,y) \leq d(x,z) + d(z,y)$ for all x, y, and z. (triangle inequality)

Biologists typically instead measure a sequence similarity score which gets larger the more similar the sequences are.

Similar algorithms can be used to optimize both measures, but we will generally stick to the biological scoring convention.

Pairwise Alignments

NOTE: GIVE SOME EXAMPLES






































DNA Scoring Matrices

The weights/scores governing the cost of changing between bases are given by PAM matrices.

DNA PAM matrices include Blast similarity and transition / transversion matrices.

            A  T  C  G           A  T  C  G
         A  5 -4 -4 -4        A  0  5  5  1
         T -4  5 -4 -4        T  5  0  1  5
         C -4 -4  5 -4        C  5  1  0  5
         G -4 -4 -4  5        G  1  5  5  0

The four nucleotide bases are classified as either purines (Adenine and Guanine) or pyrimidines (Cytosine and Thymine).

Transitions are mutations which stay within the class (e.g. A $\rightarrow$G or C $\rightarrow$T), while transversions cross classes (e.g. A $\rightarrow$C, A $\rightarrow$T etc.).

Transitions are more common than transversions.

Genetic Code Matrix

The following genetic code matrix calculates the minimum number of DNA base changes to go from a codon for i to a codon for j.

Met $\rightarrow$Tyr requires all 3 positions to change.

       A  S  G  L  K  V  T  P  E  D  N  I  Q  R  F  Y  C  H  M  W  Z  B  X
Ala=A  0  1  1  2  2  1  1  1  1  1  2  2  2  2  2  2  2  2  2  2  2  2  2
Ser=S  1  0  1  1  2  2  1  1  2  2  1  1  2  1  1  1  1  2  2  1  2  2  2
Gly=G  1  1  0  2  2  1  2  2  1  1  2  2  2  1  2  2  1  2  2  1  2  2  2
Leu=L  2  1  2  0  2  1  2  1  2  2  2  1  1  1  1  2  2  1  1  1  2  2  2
Lys=K  2  2  2  2  0  2  1  2  1  2  1  1  1  1  2  2  2  2  1  2  1  2  2
Val=V  1  2  1  1  2  0  2  2  1  1  2  1  2  2  1  2  2  2  1  2  2  2  2
Thr=T  1  1  2  2  1  2  0  1  2  2  1  1  2  1  2  2  2  2  1  2  2  2  2
Pro=P  1  1  2  1  2  2  1  0  2  2  2  2  1  1  2  2  2  1  2  2  2  2  2
Glu=E  1  2  1  2  1  1  2  2  0  1  2  2  1  2  2  2  2  2  2  2  1  2  2
Asp=D  1  2  1  2  2  1  2  2  1  0  1  2  2  2  2  1  2  1  2  2  2  1  2
Asn=N  2  1  2  2  1  2  1  2  2  1  0  1  2  2  2  1  2  1  2  2  2  1  2
Ile=I  2  1  2  1  1  1  1  2  2  2  1  0  2  1  1  2  2  2  1  2  2  2  2
Gln=Q  2  2  2  1  1  2  2  1  1  2  2  2  0  1  2  2  2  1  2  2  1  2  2
Arg=R  2  1  1  1  1  2  1  1  2  2  2  1  1  0  2  2  1  1  1  1  2  2  2
Phe=F  2  1  2  1  2  1  2  2  2  2  2  1  2  2  0  1  1  2  2  2  2  2  2
Tyr=Y  2  1  2  2  2  2  2  2  2  1  1  2  2  2  1  0  1  1  3  2  2  1  2
Cys=C  2  1  1  2  2  2  2  2  2  2  2  2  2  1  1  1  0  2  2  1  2  2  2
His=H  2  2  2  1  2  2  2  1  2  1  1  2  1  1  2  1  2  0  2  2  2  1  2
Met=M  2  2  2  1  1  1  1  2  2  2  2  1  2  1  2  3  2  2  0  2  2  2  2
Trp=W  2  1  1  1  2  2  2  2  2  2  2  2  2  1  2  2  1  2  2  0  2  2  2
Glx=Z  2  2  2  2  1  2  2  2  1  2  2  2  1  2  2  2  2  2  2  2  1  2  2
Asx=B  2  2  2  2  2  2  2  2  2  1  1  2  2  2  2  1  2  1  2  2  2  1  2
???=X  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2

Hydrophobicity Matrix

Amino acids differ to the extent that they like (hydrophilic) or don't like (hydrophobic) water.

Hydrophobic residues do not want to be on the surface of a protein.

       R  K  D  E  B  Z  S  N  Q  G  X  T  H  A  C  M  P  V  L  I  Y  F  W
Arg=R 10 10  9  9  8  8  6  6  6  5  5  5  5  5  4  3  3  3  3  3  2  1  0
Lys=K 10 10  9  9  8  8  6  6  6  5  5  5  5  5  4  3  3  3  3  3  2  1  0
Asp=D  9  9 10 10  8  8  7  6  6  6  5  5  5  5  5  4  4  4  3  3  3  2  1
Glu=E  9  9 10 10  8  8  7  6  6  6  5  5  5  5  5  4  4  4  3  3  3  2  1
Asx=B  8  8  8  8 10 10  8  8  8  8  7  7  7  7  6  6  6  5  5  5  4  4  3
Glx=Z  8  8  8  8 10 10  8  8  8  8  7  7  7  7  6  6  6  5  5  5  4  4  3
Ser=S  6  6  7  7  8  8 10 10 10 10  9  9  9  9  8  8  7  7  7  7  6  6  4
Asn=N  6  6  6  6  8  8 10 10 10 10  9  9  9  9  8  8  8  7  7  7  6  6  4
Gln=Q  6  6  6  6  8  8 10 10 10 10  9  9  9  9  8  8  8  7  7  7  6  6  4
Gly=G  5  5  6  6  8  8 10 10 10 10  9  9  9  9  8  8  8  8  7  7  6  6  5
???=X  5  5  5  5  7  7  9  9  9  9 10 10 10 10  9  9  8  8  8  8  7  7  5
Thr=T  5  5  5  5  7  7  9  9  9  9 10 10 10 10  9  9  8  8  8  8  7  7  5
His=H  5  5  5  5  7  7  9  9  9  9 10 10 10 10  9  9  9  8  8  8  7  7  5
Ala=A  5  5  5  5  7  7  9  9  9  9 10 10 10 10  9  9  9  8  8  8  7  7  5
Cys=C  4  4  5  5  6  6  8  8  8  8  9  9  9  9 10 10  9  9  9  9  8  8  5
Met=M  3  3  4  4  6  6  8  8  8  8  9  9  9  9 10 10 10 10  9  9  8  8  7
Pro=P  3  3  4  4  6  6  7  8  8  8  8  8  9  9  9 10 10 10  9  9  9  8  7
Val=V  3  3  4  4  5  5  7  7  7  8  8  8  8  8  9 10 10 10 10 10  9  8  7
Leu=L  3  3  3  3  5  5  7  7  7  7  8  8  8  8  9  9  9 10 10 10  9  9  8
Ile=I  3  3  3  3  5  5  7  7  7  7  8  8  8  8  9  9  9 10 10 10  9  9  8
Tyr=Y  2  2  3  3  4  4  6  6  6  6  7  7  7  7  8  8  9  9  9  9 10 10  8
Phe=F  1  1  2  2  4  4  6  6  6  6  7  7  7  7  8  8  8  8  9  9 10 10  9
Trp=W  0  0  1  1  3  3  4  4  4  5  5  5  5  5  6  7  7  7  8  8  8  9 10

PAM Matrices

This variety of possible matrix criteria make judging the significance of an alignment tricky.

Most widely used are PAM matrices, for ``point accepted mutations''.

These were constructed by aligning very similar proteins, and tabulating how often each substitution occurred.

The PAM1 matrix scores the transitions when the proteins differ in 1% of the residues.

Raising this to higher powers gives us PAM matrices suited for comparing more distantly related proteins.



figure=figures/pam50.eps,width=3.5in figure=figures/pam250.eps,width=3.5in


Note the main diagonal on these plots of the PAM50 and PAM250 matrices.

More modern than the PAM matrices are the Blosum matrices, reflecting additional sequence alignment data.

Gap Penalties

Gaps in protein alignments occur because the important conserved parts of a protein may be primarily relatively small docking sites.

Gap in alignments can be modeled as repeated base deletions, where the penalty for a gap is just a linear function of its length.

However, gaps can grow relatively easily once started.

Thus a more meaningful model charges a fixed penalty for the existence of each gap, plus another penalty depending upon the length of the gap.

Affine gap penalties are A + Bt for gaps of length t, while logarithmic gap penalties of the form $A + B \lg t$ are suggested by empirical data.

Pairwise Alignment Algorithms

The standard algorithms for computing exact alignment scores are based on dynamic programming.

In dynamic programming, we build a table of partial results to help up compute the optimum global results.

Dynamic programming is a very powerful technique on any ordered structures, like character strings, permutations, and rooted trees.

Global Alignment without Gaps

NOTE: MUST ADD RECURRENCE, PSEUDOCODE, EXAMPLE, AND FIGURE.






































Reconstructing the Alignment

Once we have the dynamic programming matrix, we have to walk backwards through it to reconstruct the alignment.

Either explicit back pointers can be kept, or we can remake decisions of how we got to the critical cells starting from the back.

NOTE: MUST ADD PSEUDOCODE


























Local Alignment without Gaps

Typically called the Smith-Waterman algorithm.

NOTE: MUST ADD RECURRENCE AND EXAMPLE
































Global Alignment with Gaps

Recall that the cost of a gap is less than the cost of the equivalent number of deletions.

NOTE: MUST ADD RECURRENCE
































Fast Alignment with Gaps

In fact, by being clever we can avoid the extra linear cost of looking for the start of the gap.

NOTE: MUST ADD RECURRENCE
































This only works for linear gap penalties, not logarithmic penalties.

In banded alignments, we can further speed up search by limiting attention to cells near the diagonal, at a cost in accuracy when the alignment is poor.

Space Efficient Dynamic Programming

Quadratic space will kill you faster than quadratic time.

(30,000)2 bytes equals one gigabyte, while (30,000)2 operations takes 1000 seconds on a machine doing 1 million steps per second.

The dynamic programming algorithms we have seen only look at neighboring rows/columns to make their decision.

Computing the highest cell score in the matrix does not require keeping more than then last column and the best value to date, for a total of O(n) space, where $n \leq m$.

Note that reconstructing the optimal alignment does seem to require keeping the entire matrix, however.

But Hirshberg found a clever way to reconstruct the alignment in O(nm) time using only O(n) space, by recomputing the appropriate portions of the matrix.

For each cell, we drag along the row number where the optimal path to in crossed the middle (m/2nd) column.

This requires only O(n) extra memory, one cell per row.

This works because the crossing point k of the (m/2)nd column means that the optimal alignment lies in the sub matrices A from (1,1) to (m/2,k), and B from (m/2,k) to (m,n).

Note that the number of cells in A and B totals only half of the the original mn cells.

Further, these dynamic programming algorithms are linear in the number of cells they compute.

Thus the total amount of recomputation done is $\sum_{i=0}^{\lg m} m n / 2^i = 2 mn$, so the total work remains O(mn)

Heuristic String Comparison

Dynamic programming methods for sequence alignment give the highest quality results.

However, quadratic O(nm) algorithms are only feasible for comparing two modest sized sequences.

Comparing the human genome against mouse ( 3,000,000,0002 operations) at a billion operations per second equals 285 years!

Comparing your target sequence against the entire database is hopeless, even with special purpose hardware.

Instead, heuristic algorithms such as BLAST and FASTA are used to make an initial scan of the database to find a small number of hits.

Smith-Waterman can then be used to find the optimal alignment to display.

FASTA

FASTA is a heuristic string alignment program which is based upon finding short exact matches (k-mers) between the query sequence and the database.

The trick is to choose k large enough that there are relatively few hits, but small enough that we are likely to have an exact k-mer match between related sequences.

Recommended values of k are 2 for protein sequences and 6 for DNA sequences.

Note that there are at most n-k+1 distinct k-mers in a sequence of length n.

A hash table of all k-mers in the database can be built once and used repeatedly for efficiently looking up the k-mers in query strings.

A dynamic programming-like algorithm is used to align the k-mer hits between query and database, but the problem is much smaller since there are far fewer hits than bases.

BLAST

BLAST stands for ``Basic Local Alignment Search Tool''.

It also seeks to exploit the speed of exact pattern matching while factoring in the impact of the scoring matrix.

BLAST also breaks the query into k-mers in order to search a hash table, but for input k-mer q constructs all other k-mers which lie within a distance t of q.

By processing all these patterns into an automata, BLAST can then make one linear-time pass through the database to find all exact matches and group them in an alignment.

The window size k is typically 3-5 for protein sequences and 12 for DNA sequences.

Conventional wisdom has BLAST as faster than FASTA, but perhaps a little less accurate.



 
next up previous
Next: About this document ... Up: My Home Page
Steve Skiena
2000-10-03