The project will involve some concentrated work in one research project
related to
computational biology.
Below I list several possible projects, but don't feel restricted
by these subjects.
If you can convince me that what you want to do is interesting, you can do it.
However I strongly discourage projects derived from your own thesis research.
My hope is that most projects will involve interaction between
life science and computational students.
To help form such interdisciplinary teams and to spread the projects
out among groups, I propose that every student submit to me ASAP an
ordered list of their three favorite projects, with a brief description
of their background.
I will expect 1-page written progress reports from each group
on October 15 and November 15.
Each group will have to turn in a written report/WWW site and make a personal
presentation to me during finals week.
It is important to get started early as I expect several projects to
quickly discover insurmountable roadblocks in data acquisition or problem
definition.
My hope is that one or more of these projects will lead to published work.
This has been the case almost every time I have taught such a course.
The projects I believe are best have been starred (*).
- 1.
- Sequence Design for Molecular Motors (*) -
Yurke, et.al. (Nature and New York Times, August 10, 2000)
describe a molecular motor design based on DNA hybridization.
In particular, a single stranded sequence (E) pulls a strand
(D) from a complex of four strands, causing the remainder to
recoil. It is clear that selecting the best possible sequences
should improve the performance of such a motor, and such an assignment
involves balancing the degree of similarity in subsequences.
Can you isolate the `right' combinatorial problem and solve it?
Need: at least one real biologist and at least one algorithmically
strong CS-type
- 2.
- Optimization Problems in Experimental Biotechnologies (*) -
The following recently developed laboratory technologies seem to be based
at least in part on the basis of combinatorial techniques.
Working together, try to understand the proposed technology,
isolate a combinatorial/algorithmic problem, and solve it.
Once you understand the technology well enough to explain it
to me, I'll be happy to try to help you isolate a problem.
-
Phage display
-
Lynx Corp. MPSS Technology
-
High speed cell sorting systems
Need: at least one real biologist and at least one algorithmically
strong CS-type.
- 3.
- Increasing Phage Resistance by Eliminating Restriction Sites (*) -
I have developed an algorithm to find the coding sequence for a protein
which minimizes the number of restriction sites in any target set of
enzymes. Does replacing the original gene with an optimized coding
sequence increase the robustness of the phage?
Can you propose and perform laboratory experiments to answer this
question?
Need: at least one real biologist.
- 4.
- The Impact of RNA secondary structure on gene expression (*) -
How much (if at all) does the DNA second structure of a gene impact its
expression? We have developed an algorithm to find the coding sequence
for a protein which maximizes the amount of secondary structure.
Is this good, bad or irrelevant? Can you show it in a lab?
Need: at least one real biologist.
- 5.
- Optimally Linearizing Tree Clustered Data (*) -
I have an idea for a dynamic programming algorithm to optimally linearize
tree clustered data, i.e. order the clusters in a column so successive
elements differ by as little as possible.
This will help visualize interesting data sets.
Implement the algorithm and see how well it works.
Need: one or two algorithmically oriented CS-types.
- 6.
- Phylogenic Trees of Baseball Players (*) -
A good phylogenic tree algorithm should be able to cluster baseball
players (say batters) according to similarity of career record.
Such a tree should have interesting properties if the distance function
is right, for example clustering all sluggers, all weak-hitting shortstops,
and all singles hitters appropriately.
Can you produce and display interesting trees for all major-league
hitters and pitchers with significant careers?
Alternately, can you produce phylogenic trees of other interesting
"data" sets, like U.S. presidents, countries by size or economies,etc.?
Need: one or two CS-types
- 7.
- Racehorses as a Model Organism? (*) -
Extensive geneological and performance data is available for thoroughbred
race horses.
Further, these pedigrees are tremendously inbred,
which makes them a very interesting model organism for genetic studies.
To what extend can you get and parse the geneological and performance
data, and make predictions of which breeding pairs will be most successful?
Need: at least one real biologist and at least one CS-type.
- 8.
- Trimming sequences -
Sean McCorkle has been working on a program to
perform a crude quality trim of DNA sequences, right off the sequencer.
He wants to use an old pre-phred-quality method, which is to extract
the longest substring that is less than 2% N's. He's got an algorithm
which is order M2 in time, M being the number of Ns in the sequence,
but M can sometimes be large, and he wants to process an astronomical number
of sequences, so he wonders if better than M2 can be achieved.
Contact Sean R. McCorkle, mccorkle@bnl.gov, Biology Department, Genome Group
BNL, (631) 344-4270.
Need: one CS-type
- 9.
- Help the Notes Become a Book? (*) -
I have been pleased with how the lecture notes have turned out so far,
and suddenly started pondering maybe trying to turn them into
a book if I can get enough help.
Writing a few chapters from the notes is a good way to really learn the
material deeply.
Extracting pictures and other resources from the WWW will improve the
notes and provide useful resources.
Need: one or more people who can write, preferably including a biologist.
- 10.
- Optimizing Tagging Systems (*) -
Gunter Blobel won the Nobel prize for the "signal
hypothesis", which states that many newly-made proteins bear
molecular "tags" that are not part of
the final molecule. These tags are special coded markers. Each
marker directs the cell to carry the protein molecule to a specific
place.
When the protein molecule
arrives at its destination the tag is removed.
See
http://linkage.rockefeller.edu/wli/news/blobel.html
Is there something combinatorial about this which can be optimized or studied?
Need: at least one real biologists and at least one CS-type
- 11.
- Software for Microarray Data Analysis (*) -
With the university's new microarray facility, there is fresh data to analyze.
Make contact with a current user of the facility, find out what kind of
analysis they would like to do with their data, and develop a program to
help out. I would be happy to discuss ideas with you once you have made
contact with a live user.
Need: at least one real biologist and at least one CS-type
- 12.
- Genetic Algorithms -
Genetic algorithms are a computer optimization technique nominally
inspired by evolution.
How well they perform is still somewhat of an open question as far
as I am concerned.
I am looking for some students to build a rigorous enough testbed
to try some ideas on heuristic optimization.
This is an interesting project, but with no real biology.
Need: at least one algorithmic CS-type.
- 13.
- Tiling Genes (*) -
Give an algorithm which explains how to take as input the genome of an
organism, and a protein, and as output produces a gene for the protein
made of as few consecutive subsequences from the genome as possible.
Further, can one design a protocol using PCR, restriction enzymes, etc.
to have these new genes cut and assemble themselves?
Need: at least one real biologist and at least one CS-type
- 14.
- Space Efficient Suffix Arrays -
Suffix arrays are data structures which are particularly useful
because of their low memory requirements.
They can be constructed in linear time using suffix trees,
but the space blows up in construction.
Can you give an
construct algorithm which uses
less space?
See the discussion in Manber and Myer's paper.
Need: at least one algorithmic CS-type.
- 15.
- Genetic Drift and Human Diseases (*) -
I once formulated a theory of the rare blood disease aplastic anemia
based on random genetic evolution or drift.
The theory doesn't seem right for aplastic anemia, but it is so pretty
and natural it must hold for some rare disease, somewhere.
Study the medical literature and prove I'm right or wrong.
Need: one or two biologists.
- 16.
- A Theory of Restriction Enzyme Cut Sequences -
An analysis of restriction enzyme cut sequences from REBASE
shows some interesting biases, such as a high proportion of palindromic
sequences.
Why is this?
What can we estimate or conclude about the spectrum of different enzymes
which should be in a given bacteria, based on all known sequences,
the amount of laboratory effort which has gone into finding them,
and the evolutionary distance between organisms.
Need: at least one real biologists and at least one CS-type
- 17.
- Unfolding Proteins on a Lattice using Crankshaft Moves -
What can you say about unfolding proteins on a lattice such that
each protein remains simple (non-intersecting) at each point?
Can you test whether there exist a sequence of crankshaft
moves (which flip the curve between two points) and unfolding moves
(which bend the curve at one joint) to move between curves C1 and C2
or is it hard?
Kleinberg's RECOMB 99 paper and various Arkin-Mitchell-Skiena unfolding
papers may be relevant.
Need: at least one algorithmically oriented CS-type
- 18.
- Characterizing Proteins by Residue Distributions -
Consider proteins that have been characterized only in terms of
their amino acid compositions, unordered. This data can
be obtained by sequencing them, and then discarding the
ordering of the residues, for example. What can be said about the
proteins?
This type of data was popular before routine sequencing was possible, and is still being generated in many instances.
A recent paper: Karlin, Blaisdell, and Bucher.
Protein Engineering. 5(8):729-738. 1992.
(from Jared Roach (roach@u.washington.edu))
Need: At least one biologist and at least one CS-type
- 19.
- Understanding/predicting post-transcriptional modifications -
Are there interesting algorithmic problems in identifying where
post-transcriptional modifications can occur?
Need: At least one biologist and at least one CS-type
- 20.
- Whole Genome Comparison Programs -
Can you build programs to compare two distinct genomes and find
interesting homologies, etc.?
- 21.
- DNA Computing -
Write a survey paper on the state of the art in DNA computing,
hopefully finding new directions.