next up previous
Next: About this document ... Up: My Home Page

CSE 648 Computational Biology Projects

Steven S. Skiena

The project will involve some concentrated work in one research project related to computational biology. Below I list several possible projects, but don't feel restricted by these subjects. If you can convince me that what you want to do is interesting, you can do it. However I strongly discourage projects derived from your own thesis research.

My hope is that most projects will involve interaction between life science and computational students. To help form such interdisciplinary teams and to spread the projects out among groups, I propose that every student submit to me ASAP an ordered list of their three favorite projects, with a brief description of their background.

I will expect 1-page written progress reports from each group on October 15 and November 15. Each group will have to turn in a written report/WWW site and make a personal presentation to me during finals week. It is important to get started early as I expect several projects to quickly discover insurmountable roadblocks in data acquisition or problem definition.

My hope is that one or more of these projects will lead to published work. This has been the case almost every time I have taught such a course. The projects I believe are best have been starred (*).

1.
Sequence Design for Molecular Motors (*) - Yurke, et.al. (Nature and New York Times, August 10, 2000) describe a molecular motor design based on DNA hybridization. In particular, a single stranded sequence (E) pulls a strand (D) from a complex of four strands, causing the remainder to recoil. It is clear that selecting the best possible sequences should improve the performance of such a motor, and such an assignment involves balancing the degree of similarity in subsequences. Can you isolate the `right' combinatorial problem and solve it?

Need: at least one real biologist and at least one algorithmically strong CS-type

2.
Optimization Problems in Experimental Biotechnologies (*) - The following recently developed laboratory technologies seem to be based at least in part on the basis of combinatorial techniques. Working together, try to understand the proposed technology, isolate a combinatorial/algorithmic problem, and solve it. Once you understand the technology well enough to explain it to me, I'll be happy to try to help you isolate a problem.

Need: at least one real biologist and at least one algorithmically strong CS-type.

3.
Increasing Phage Resistance by Eliminating Restriction Sites (*) - I have developed an algorithm to find the coding sequence for a protein which minimizes the number of restriction sites in any target set of enzymes. Does replacing the original gene with an optimized coding sequence increase the robustness of the phage? Can you propose and perform laboratory experiments to answer this question?

Need: at least one real biologist.

4.
The Impact of RNA secondary structure on gene expression (*) - How much (if at all) does the DNA second structure of a gene impact its expression? We have developed an algorithm to find the coding sequence for a protein which maximizes the amount of secondary structure. Is this good, bad or irrelevant? Can you show it in a lab?

Need: at least one real biologist.

5.
Optimally Linearizing Tree Clustered Data (*) - I have an idea for a dynamic programming algorithm to optimally linearize tree clustered data, i.e. order the clusters in a column so successive elements differ by as little as possible. This will help visualize interesting data sets. Implement the algorithm and see how well it works.

Need: one or two algorithmically oriented CS-types.

6.
Phylogenic Trees of Baseball Players (*) - A good phylogenic tree algorithm should be able to cluster baseball players (say batters) according to similarity of career record. Such a tree should have interesting properties if the distance function is right, for example clustering all sluggers, all weak-hitting shortstops, and all singles hitters appropriately. Can you produce and display interesting trees for all major-league hitters and pitchers with significant careers? Alternately, can you produce phylogenic trees of other interesting "data" sets, like U.S. presidents, countries by size or economies,etc.?

Need: one or two CS-types

7.
Racehorses as a Model Organism? (*) - Extensive geneological and performance data is available for thoroughbred race horses. Further, these pedigrees are tremendously inbred, which makes them a very interesting model organism for genetic studies. To what extend can you get and parse the geneological and performance data, and make predictions of which breeding pairs will be most successful?

Need: at least one real biologist and at least one CS-type.

8.
Trimming sequences - Sean McCorkle has been working on a program to perform a crude quality trim of DNA sequences, right off the sequencer. He wants to use an old pre-phred-quality method, which is to extract the longest substring that is less than 2% N's. He's got an algorithm which is order M2 in time, M being the number of Ns in the sequence, but M can sometimes be large, and he wants to process an astronomical number of sequences, so he wonders if better than M2 can be achieved. Contact Sean R. McCorkle, mccorkle@bnl.gov, Biology Department, Genome Group BNL, (631) 344-4270.

Need: one CS-type

9.
Help the Notes Become a Book? (*) - I have been pleased with how the lecture notes have turned out so far, and suddenly started pondering maybe trying to turn them into a book if I can get enough help. Writing a few chapters from the notes is a good way to really learn the material deeply. Extracting pictures and other resources from the WWW will improve the notes and provide useful resources.

Need: one or more people who can write, preferably including a biologist.

10.
Optimizing Tagging Systems (*) - Gunter Blobel won the Nobel prize for the "signal hypothesis", which states that many newly-made proteins bear molecular "tags" that are not part of the final molecule. These tags are special coded markers. Each marker directs the cell to carry the protein molecule to a specific place. When the protein molecule arrives at its destination the tag is removed. See http://linkage.rockefeller.edu/wli/news/blobel.html Is there something combinatorial about this which can be optimized or studied?

Need: at least one real biologists and at least one CS-type

11.
Software for Microarray Data Analysis (*) - With the university's new microarray facility, there is fresh data to analyze. Make contact with a current user of the facility, find out what kind of analysis they would like to do with their data, and develop a program to help out. I would be happy to discuss ideas with you once you have made contact with a live user.

Need: at least one real biologist and at least one CS-type

12.
Genetic Algorithms - Genetic algorithms are a computer optimization technique nominally inspired by evolution. How well they perform is still somewhat of an open question as far as I am concerned. I am looking for some students to build a rigorous enough testbed to try some ideas on heuristic optimization. This is an interesting project, but with no real biology.

Need: at least one algorithmic CS-type.

13.
Tiling Genes (*) - Give an algorithm which explains how to take as input the genome of an organism, and a protein, and as output produces a gene for the protein made of as few consecutive subsequences from the genome as possible. Further, can one design a protocol using PCR, restriction enzymes, etc. to have these new genes cut and assemble themselves?

Need: at least one real biologist and at least one CS-type

14.
Space Efficient Suffix Arrays - Suffix arrays are data structures which are particularly useful because of their low memory requirements. They can be constructed in linear time using suffix trees, but the space blows up in construction. Can you give an $o(n \lg n)$ construct algorithm which uses less space? See the discussion in Manber and Myer's paper.

Need: at least one algorithmic CS-type.

15.
Genetic Drift and Human Diseases (*) - I once formulated a theory of the rare blood disease aplastic anemia based on random genetic evolution or drift. The theory doesn't seem right for aplastic anemia, but it is so pretty and natural it must hold for some rare disease, somewhere. Study the medical literature and prove I'm right or wrong.

Need: one or two biologists.

16.
A Theory of Restriction Enzyme Cut Sequences - An analysis of restriction enzyme cut sequences from REBASE shows some interesting biases, such as a high proportion of palindromic sequences. Why is this? What can we estimate or conclude about the spectrum of different enzymes which should be in a given bacteria, based on all known sequences, the amount of laboratory effort which has gone into finding them, and the evolutionary distance between organisms.

Need: at least one real biologists and at least one CS-type

17.
Unfolding Proteins on a Lattice using Crankshaft Moves - What can you say about unfolding proteins on a lattice such that each protein remains simple (non-intersecting) at each point? Can you test whether there exist a sequence of crankshaft moves (which flip the curve between two points) and unfolding moves (which bend the curve at one joint) to move between curves C1 and C2 or is it hard? Kleinberg's RECOMB 99 paper and various Arkin-Mitchell-Skiena unfolding papers may be relevant.

Need: at least one algorithmically oriented CS-type

18.
Characterizing Proteins by Residue Distributions - Consider proteins that have been characterized only in terms of their amino acid compositions, unordered. This data can be obtained by sequencing them, and then discarding the ordering of the residues, for example. What can be said about the proteins?

This type of data was popular before routine sequencing was possible, and is still being generated in many instances. A recent paper: Karlin, Blaisdell, and Bucher. Protein Engineering. 5(8):729-738. 1992. (from Jared Roach (roach@u.washington.edu))

Need: At least one biologist and at least one CS-type

19.
Understanding/predicting post-transcriptional modifications - Are there interesting algorithmic problems in identifying where post-transcriptional modifications can occur?

Need: At least one biologist and at least one CS-type

20.
Whole Genome Comparison Programs - Can you build programs to compare two distinct genomes and find interesting homologies, etc.?

21.
DNA Computing - Write a survey paper on the state of the art in DNA computing, hopefully finding new directions.



 
next up previous
Next: About this document ... Up: My Home Page
Steve Skiena
2000-09-21