
Administrivia

If you took a previous instantiation of CSE 648, register under an independent study code (CSE 587).

Make sure I get your name and email address written clearly.

I need volunteers particularly for the first presentation, one week from today.

Do we want to assign scribes to prepare notes for each lecture?

Project lists should go out in about a week or so.

Why Computational Biology?

Computational biology is the application of a core technology of computer science (e.g. algorithms, artificial intelligence, databases) to problems arising from biology.

Computational biology is particularly exciting today because (1) the problems are large enough to motivate efficient algorithms, (2) the problems are accessible, fresh and interesting, and (3) biology is increasingly becoming a computational science.

Developments in biology are coming astonishingly quickly, and with amazing possibilities.

Many problem ideas go from biology to CS: e.g. fragment assembly, sequence analysis, algorithms for phylogenetic trees.

Many problem ideas go from CS to biology: e.g. sequencing by hybridization, DNA computing.

Computer Scientists vs. Biologists

There are many different types of life scientists (biologists, ecologists, medical doctors, etc.), just as there are many different types of computational scientists (algorists, software engineers, statisticians, etc.).

There are many fundamental cultural differences between computational/life scientists:

Biology for Computer Scientists

DNA sequences can be thought of as strings of bases on a four-letter alphabet, $\{A,C,G,T\}$.

Each base binds with its complement, A-T and C-G, so each sequence has a unique complementary sequence.
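The pairing rule above can be sketched in a few lines of Python. The function names are illustrative, not from the lecture. The sketch also assumes the usual convention that the two strands run in opposite directions, so the partner strand is normally written as the reversed complement.

```python
# Watson-Crick pairing: A<->T and C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(seq: str) -> str:
    """Replace each base by its complement, position by position."""
    return seq.translate(COMPLEMENT)

def reverse_complement(seq: str) -> str:
    """Complement the sequence, then reverse it (assumed strand convention)."""
    return complement(seq)[::-1]

print(complement("AACGT"))          # TTGCA
print(reverse_complement("AACGT"))  # ACGTT
```

Complementing twice returns the original sequence, which is one way to see that the complementary sequence is unique.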

The human genome is approximately 3 billion base-pairs long, and contains all the information necessary to make all the proteins which you are made of.

Proteins are sequences of amino acids, and hence all proteins can be thought of as strings on a 20-letter alphabet.

A gene is a DNA sequence which acts as a template for building a specific protein.

Genes specify how to build proteins according to the triplet code, where each of the $4^3 = 64$ possible sequences or codons of three consecutive nucleotide bases maps to one of the 20 different amino acids or the stop symbol.



[Figure: transcription]


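RNA is an intermediate step in going from DNA to protein: a gene is first transcribed into RNA, which maps 1-to-1 with the DNA (with U in place of T), and the RNA is then translated into protein.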
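As a rough illustration of the triplet code, the sketch below builds the 64-entry codon table of the standard genetic code and translates a short made-up sequence. The function names and the example sequence are assumptions for illustration. The sketch reads the coding strand directly and ignores reading-frame detection and everything else a real gene finder must handle.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order; '*' marks stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def transcribe(dna: str) -> str:
    """DNA coding strand to RNA: same letters, with U in place of T."""
    return dna.replace("T", "U")

def translate(dna: str) -> str:
    """Read codons from position 0 until a stop codon or the end."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(len(CODON_TABLE))         # 64
print(transcribe("ATGGCCTAA"))  # AUGGCCUAA
print(translate("ATGGCCTAA"))   # MA
```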

The human genome contains about 100,000 genes, meaning that your body is made up of about that many different components.

The recently ``completed'' human genome project seeks to sequence or read the entire set of DNA and protein strings.

But sequencing is just a first step towards understanding what the proteins do and how to manipulate them.

Only a small portion of the human genome consists of genes. The rest contains various binding/signaling sites and less well understood ``junk''.

Organisms

Living organisms differ greatly in complexity and organization.

Viruses are the simplest organisms ( $\sim 10,000$ bp. long), which require a living host.

Prokaryotes are the simplest free-living organisms, e.g. bacteria ( $\sim 1,000,000$ bp. long).

Eukaryotes have cells which contain internal structures such as a nucleus, e.g. yeast.

Multi-celled organisms involve cell specialization, requiring differential gene expression and inter-cellular signaling.

Historically, many biologists focused their careers on one model organism: E. coli, yeast, Drosophila, Arabidopsis, zebrafish, sea urchins, mouse.

The advent of genomics has focused more attention on the similarity between organisms.

Evolution

Evolutionary change happens because of changes in genomes due to mutations and recombination.

Mutations are rare events, sometimes single base changes, sometimes larger events.

Recombination is how your genome was constructed as a mixture of the genomes of your two parents.

Through natural selection, favorable changes tend to accumulate in the genome.

Evolution motivates homology (similarity) search, because different species are assumed to have common ancestors.

Thus DNA/amino acid sequences for a given protein (e.g. hemoglobin) in two species or individuals should be more similar the closer the ancestry between them.
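One very crude way to quantify "more similar" is the percentage of positions at which two equal-length sequences agree. This is only a sketch under a strong assumption, namely that the sequences are already aligned position by position. Real homology search must also handle insertions and deletions. The function name and the example strings are made up for illustration.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions where two equal-length sequences match."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

print(percent_identity("ACGT", "ACGA"))  # 75.0
```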

The genetic variation between different people is surprisingly small, perhaps only 1 in 1000 base-pairs.

Homology searches can often detect similarities between extremely distant organisms (e.g. humans and yeast).

Phylogenetic trees based on gene homologies have provided an independent confirmation of many phylogenies proposed by taxonomists. This is convincing evidence of the Theory of Evolution.

A host of interesting computational problems arise in trying to reconstruct evolutionary history.

Biotechnologies

Amazing biotechnologies for manipulating DNA molecules have been developed, and are used as building blocks for even more powerful technologies.

These technologies are as amazing to me as the silicon etching/masking of VLSI fabrication.

DNA synthesis machines enable one to grow short DNA molecules of a specified sequence.

The polymerase chain reaction (PCR) enables one to make a large number of copies of a particular DNA sequence anywhere in solution given only the starting and ending sequences (primers).

PCR is one foundation of DNA fingerprinting, by turning a single molecule into billions.

Electrophoresis enables one to approximately measure the length of a DNA molecule, by measuring the time it takes to migrate through an electrically charged gel.

Since certain regions of the human genome have varying numbers of repeated characters, measuring their length by electrophoresis yields one method of DNA fingerprinting / identification.

DNA sequencing machines are built from both these technologies, and will be discussed when we talk about assembly.

Computer Science for Biologists

We will be interested in the correctness and efficiency of computer algorithms.

We seek algorithms which provably always return the best possible solution to a well-defined combinatorial problem.

Heuristics are procedures which might return good answers in practice, but are not provably correct.

We seek to extract clean, well-defined problems from the typically messy ``real'' problem to gain insight into it.

This process is analogous to in vitro versus in vivo experimentation.

Exact String Matching

Input: A text string T, where |T|=n, and a pattern string P, where |P|=m.

Output: An index i such that $T_{i+j-1} = P_j$ for all $1 \leq j \leq m$, i.e. showing that P is a substring of T.

The following brute force search algorithm always uses at most $n \times m$ steps:


for i = 1 to n-m+1 do
		 j=1 
		 while ($j \leq m$) and (T[i+j-1] == P[j])
				  do j=j+1 
		 if (j>m) print "pattern at position ", i 

This algorithm might use only n steps if we are lucky, e.g. T=aaaaaaaaaaa, and P=bbbbbbb.

We might need $\sim n \times m$ steps if we are unlucky, e.g. T=aaaaaaaaaaa, and P=aaaaaab.
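To make the lucky and unlucky cases concrete, here is a minimal Python sketch of the brute force search that also counts character comparisons. The function name and the comparison counter are additions for illustration. Positions are reported 1-based to match the notation above.

```python
def brute_force_search(text: str, pattern: str):
    """Return (1-based match positions, number of character comparisons)."""
    n, m = len(text), len(pattern)
    positions, comparisons = [], 0
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if text[i + j] != pattern[j]:
                break
            j += 1
        if j == m:                     # every pattern character matched
            positions.append(i + 1)
    return positions, comparisons

print(brute_force_search("a" * 11, "b" * 7))     # ([], 5)
print(brute_force_search("a" * 11, "aaaaaab"))   # ([], 35)
```

In the lucky case each of the n-m+1 = 5 alignments fails on its first comparison. In the unlucky case each alignment needs all m = 7 comparisons before failing.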

We can't say what happens ``in practice'', so we settle for a worst case analysis.

By being more clever, we can reduce the worst case running time to O(n+m).
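The notes do not say which linear-time method is meant. One well-known choice is the Knuth-Morris-Pratt algorithm, sketched below under that assumption. It precomputes, for each prefix of the pattern, the length of the longest proper prefix that is also a suffix, so the scan over the text never backs up. Function names are illustrative.

```python
def failure_table(pattern: str) -> list:
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text: str, pattern: str) -> list:
    """Return all 1-based positions where pattern occurs in text."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    positions, k = [], 0               # k = number of pattern chars matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]            # fall back without moving i
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            positions.append(i - k + 2)
            k = fail[k - 1]
    return positions

print(kmp_search("abracadabra", "abra"))  # [1, 8]
```

Building the table takes O(m) steps and the scan takes O(n) steps, for O(n+m) in total.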

Certain generalizations won't change this, like stopping after the first occurrence of the pattern.

Certain other generalizations seem more complicated, like matching with gaps.

Algorithm Complexity

We use the Big oh notation to state an upper bound on the number of steps that an algorithm takes in the worst case.

Thus the brute force string matching algorithm is O(mn), or takes quadratic time.

A linear time algorithm, i.e. O(n+m), is fast enough for almost any application.

A quadratic time algorithm is usually fast enough for small problems, but not big ones, since $1000^2 = 1,000,000$ steps is reasonable but $1,000,000^2$ is not.

An exponential-time algorithm, i.e. $O(2^n)$ or $O(n!)$, can only be fast enough for tiny problems, since $2^{20}$ and $10!$ already exceed 1,000,000.
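The arithmetic behind these growth rates is easy to check directly. The helper below is only an illustration and its name is made up.

```python
import math

def growth(n: int):
    """Return (n, n^2, 2^n, n!) for a given problem size n."""
    return n, n ** 2, 2 ** n, math.factorial(n)

for n in (10, 20, 30):
    print(growth(n))

print(1000 ** 2)  # 1000000
print(2 ** 20)    # 1048576
```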

``A billion here, a billion there, and soon you are talking about real money'' - Senator Everett Dirksen

NP-Completeness

Unfortunately, for many problems, there is no known polynomial-time algorithm.

Even worse, most of these problems can be proven NP-complete, meaning that no such algorithm can exist unless P = NP, which is widely believed to be false!

At the 1999 RECOMB conference, I witnessed a rebellion by biologists tired of seeing all their problems shown NP-complete.

But proving a problem NP-complete can be a useful thing to do, because it focuses our attention on heuristics and tells us why it is difficult.

NP-completeness proofs work by showing that the target problem is as ``hard'' as some famous hard problem, e.g. satisfiability, vertex cover, Hamiltonian cycle.

Shortest Common Superstring

Input: A set $S=\{s_1,\ldots,s_m\}$ of text strings on some alphabet $\Sigma$.

Output: The shortest possible string T such that each $s_i$ is a substring of T.

This problem arises in DNA sequence assembly.

What is the shortest common superstring of $\{abba, baba, bbaa\}$?

Can you suggest an algorithm to find the shortest common superstring?
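One straightforward but exponential-time answer is to try every ordering of the strings, merge each ordering with maximal overlaps, and keep the shortest result. The sketch below assumes a non-empty input and is only practical for a handful of strings. All function names are illustrative.

```python
from itertools import permutations

def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def merge_in_order(strings) -> str:
    """Merge strings left to right, overlapping each as much as possible."""
    result = strings[0]
    for s in strings[1:]:
        if s in result:                # already covered, nothing to add
            continue
        result += s[overlap(result, s):]
    return result

def shortest_superstring_exhaustive(strings) -> str:
    """Try all n! orderings; assumes a non-empty collection of strings."""
    return min((merge_in_order(p) for p in permutations(strings)), key=len)

best = shortest_superstring_exhaustive(["abba", "baba", "bbaa"])
print(len(best), best)
```

With n! orderings to examine, this approach stops being usable well before n reaches the sizes that arise in sequence assembly.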

The Greedy Heuristic

The most obvious strategy is one where we merge the two strings with the longest overlap, put the combined string back, and repeat until only one string remains.

This greedy strategy can yield a string which is almost twice as long as necessary:

     Optimal           Greedy
          ababababc              babababab
       babababab        ababababc
     aabababab         aabababab

The greedy heuristic for shortest common superstring of n strings of length l can be easily implemented in n rounds of $n^2$ string comparisons, each of which takes $l^2$ steps, for a total of $O(n^3 l^2)$.

But faster implementations exist using the ``right'' data structure, and avoiding redundant string comparisons.
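The simple version of the greedy heuristic described above can be sketched as follows. This is the slow implementation, not one of the faster ones. It assumes a non-empty input, and ties between equally long overlaps are broken by input order, which is an arbitrary choice made here. Function names are illustrative.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(strings) -> str:
    """Repeatedly merge the pair with the longest overlap."""
    pool = list(dict.fromkeys(strings))            # drop exact duplicates
    # Drop strings already contained in another string.
    pool = [s for s in pool if not any(s != t and s in t for t in pool)]
    while len(pool) > 1:
        best = (-1, 0, 1)                          # (overlap, i, j)
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:                # first maximum wins ties
                        best = (k, i, j)
        k, i, j = best
        merged = pool[i] + pool[j][k:]
        pool = [s for idx, s in enumerate(pool) if idx not in (i, j)]
        pool.append(merged)
    return pool[0]

print(len(greedy_superstring(["aabababab", "ababababc", "babababab"])))  # 19
print(len(greedy_superstring(["ababababc", "babababab", "aabababab"])))  # 12
```

On the three strings of the example, two pairs tie with an overlap of 8. Depending on which pair is merged first, this sketch returns a superstring of 19 or of 12 characters, so the outcome of the heuristic can hinge on tie-breaking alone.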

Directed Hamiltonian Path is NP-Complete

The Hamiltonian cycle problem asks whether there is a tour using the edges of a given graph such that every vertex is visited exactly once.

When computer scientists talk about graphs, they mean networks of nodes or vertices where certain pairs are connected by edges.



[Figure: Hamiltonian cycle, two panels]


The Hamiltonian path problem is well known to be NP-complete, even if (a) every edge is directed, (b) a particular node is designated as the start vertex, and (c) a particular node is designated as the stop vertex.
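For intuition about what is being asked, the directed version with designated start and stop vertices can be decided by brute force on tiny graphs. The sketch below tries every ordering of the remaining vertices, so it takes exponential time, as expected for an NP-complete problem. The function name and the example graphs are made up, and the start and stop vertices are assumed to be distinct.

```python
from itertools import permutations

def hamiltonian_path(vertices, edges, start, stop):
    """Return a directed path from start to stop that visits every vertex
    exactly once, or None if no such path exists."""
    edge_set = set(edges)
    middle = [v for v in vertices if v not in (start, stop)]
    for perm in permutations(middle):
        path = [start, *perm, stop]
        if all((a, b) in edge_set for a, b in zip(path, path[1:])):
            return path
    return None

print(hamiltonian_path([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4), (1, 3)], 1, 4))
# [1, 2, 3, 4]
print(hamiltonian_path([1, 2, 3, 4], [(1, 2), (1, 3), (2, 4), (3, 4)], 1, 4))
# None
```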

Shortest Common Superstring is NP-Complete

We prove this by constructing an instance of SCS from any directed Hamiltonian path problem such that any solution to the SCS gives the Hamiltonian path.

Since Hamiltonian path is NP-complete, and so is not believed to be solvable in polynomial time, this means that SCS is not either, because if SCS could be solved in polynomial time then HP could too!

For all edges $(v,x_i)$ out of vertex v, we will construct two strings, $\bar{v} x_i \bar{v}$ and $x_i \bar{v} x_{i+1}$.

Thus if there are three edges from v, i.e. (v,4), (v,7), (v,8), we will construct the following strings:


\begin{displaymath}\bar{v}4\bar{v}, 4 \bar{v} 7, \bar{v}7\bar{v}, 7 \bar{v} 8,
\bar{v}8\bar{v}, 8 \bar{v} 4 \end{displaymath}

Note that these have a superstring of length 8 starting with $\bar{v}$ and ending with any other vertex, by breaking the cycle at the right point.
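The edge strings for a single vertex can be generated and checked mechanically. In the sketch below each string is a tuple of symbols, "v'" stands for $\bar{v}$, and the index i+1 wraps around cyclically, which is inferred from the string $8 \bar{v} 4$ in the example. The function names are illustrative.

```python
def edge_strings(v, successors):
    """Build the two strings per outgoing edge (v, x_i) as symbol tuples."""
    bar = v + "'"                      # stands for v-bar
    k = len(successors)
    strings = []
    for i, x in enumerate(successors):
        strings.append((bar, x, bar))
        strings.append((x, bar, successors[(i + 1) % k]))  # cyclic i+1
    return strings

def contains(superstring, s):
    """True if the symbol tuple s occurs contiguously in superstring."""
    m = len(s)
    return any(superstring[i:i + m] == s
               for i in range(len(superstring) - m + 1))

strings = edge_strings("v", ["4", "7", "8"])
superstring = ("v'", "4", "v'", "7", "v'", "8", "v'", "4")
print(strings)
print(len(superstring), all(contains(superstring, s) for s in strings))
# 8 True
```

Starting the same cycle at 7 or 8 instead gives the other superstrings of length 8, ending in 7 or 8 respectively.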

We will also construct n ``connector'' strings $v\#\bar{v}$ to join each vertex with its complement.

Finally, we have a start string to connect to the first vertex in the path, $@\#\bar{v_1}$, and an end string to connect to the last vertex in the path, $v_n\#\$$.

These strings have a superstring of length 2m+3n iff the graph has a Hamiltonian path.

The weird characters (#, $, @) ensure there can be no shorter way to put the strings together than the intended way.



 
Steve Skiena
2000-09-07