Semester Project
Proposal due: Thursday, October 25, 2007
Progress report due: Tuesday, November 20, 2007
The project will involve concentrated work in one research project
related to
computational biology.
Below I list several possible projects.
You may also choose your own topic
if you can convince me that what you want to do is interesting.
However I discourage projects derived from your own thesis research.
While I anticipate that much of the work will be done as
the deadline approaches, it
is important to get started early enough to
discover insurmountable roadblocks in data acquisition or problem
definition before it is too late.
Each group is responsible to turn in 3-5 page project proposals
and progress reports
as of the dates above.
I will award 50% of the grade for each project on the
strength of the preliminary reports.
This is to encourage starting early, and to make sure that
you and I both know what you are to do before it is too late to avoid trouble.
My hope is that one or more of these projects will lead to published work.
This has been the case most times I have taught such a course.
The projects I believe are best have been starred (*).
Certain projects which I believe might be extendible to MS or
PhD projects have been so annotated - anyone is free to work on them,
but I particularly recommend them to students thinking about doing
research under me.
- Text Mining of PubMed Abstracts (*) -
The PubMed database contains the full text of 12.5 million
journal article abstracts from the biomedical literature.
By studying the juxaposition of gene names, organisms, diseases,
and drug names in abstracts we can extract large-scale structured
data about the relationships of these entities.
This project will involve building and improving text-mining tools
of our Lydia text analysis system (www.textmed.org)
to analyze PubMed abstracts in several ways:
- Reinvigorate www.textmed.org by restoring active processing and identifying
interesting trends.
- Extract the nature of relationships ("regulates", "inhibits", "treats")
identified between drugs, genes, and diseases.
- Cluster drugs, genes, and diseases by similarity based on the associations
they share in common.
- Build a medical news service (akin to Google News) -
what are the latest pubmed abstracts in each subfield?
- Propose new drug candidates for a given disease based on similarities of
references.
See Friedman, et.al (ISBM 2001) for details on a similar project.
(from Steven Skiena)
Need: two or more strong CS-types - MS or PhD
- Patent Analysis (*) -
Patents are very important in the biotechnology and pharmaceutical
industries.
Can we identify which patents are most valuable by doing NLP and
citation network analysis on the complete set of patents?
This project will involve building and improving text-mining tools
of our Lydia text analysis system to retreive and analyze
patents:
- Spider and parse patent documents.
- Analyze/cluster them by similarity, ownership, citations, etc. to
determine the relative significance of them.
- Estimate the value of the patent portfolio of each company.
There is a literature on such analysis to start from.
- Partitioning motifs for PSSMs (*) -
Constructing a position specific sequence matrix (PSSM)
from a collection of
-mers is trivial; just count
the frequency of each symbol in each position.
But what if the k-mers we are given come from two different
motifs. How should we partition them? What is the right
measure of success? Is it NP-complete?
A paper on the application is "Context-specific independence
mixture modeling for positional weight matrices" by B. Georgi
and A. Schliep, Bioinformatics, v. 22 (2006) e166-173
(from Martin Vingron)
- Angle Agreement Statistic -
Pyne proposed an angle agreement statistic for cell cycle genes
I did not like. We need a better one. I want it based on one
of the following
What is the probability that k random angles all fall in a sector of angle a?
What is the probability that k random angles have a standard deviation of d
(with Pyne and Bruce)
- Prediction of Health costs -
Prediction of health costs based on medical expenditures is an
important and open research problem [Willan and Briggs]. Insurance
companies try to estimate the probability of potential costs such as
reimbursements in the next month. For any disease, this can be
answered by fitting a suitable distribution to the histogram of
expenses incurred by patients with the disease (and computing its tail
probability).
The student is required to search and obtain real medical expenditure
data for a chosen set of diseases, as well as simulate data, transform
these as needed, run programs to fit models like Gamma mixture to the
histograms, and report the materials, the methods and the quality of
fit.
(from S. Pyne (spyne@broad.mit.edu))
- Generating DNA motifs in a Profile -
A profile matrix gives probability of each base appearing in each position,
so a length-
profile matrix scores the likelihood of each of
sequences.
Given a profile matrix and appropriate data structure, we seek to generate
these sequences in order of decreasing weight.
By taking logs of the probabilities, this reduces to scoring sequences
in terms of sums of weight elements.
A nice project from a previous class implemented such a generation algorithm.
An extension project would be to write a program (using suffix arrays)
to exploit this incremental algorithm for fast motif checking.
(from Paul Fodor (pfodor@cs) - check out http://www.sinc.sunysb.edu/Stu/pfodor/
)
Need: one CS-type - MS
- String Matching in a Bit Torrent? -
Suppose we are receiving a string of length
one character at a time,
but not from left to right. Each character arrives as a pair
,
which means an a will be the ith character in the string.
How fast can we support the following data structure operations?
- GetChar - incorporate
in the string
- Substring(s) - is
now a substring of the string
- PossibleSubstring(s) - might the string still be resolved so as to make
a substring?
- HaveFixedSubstring - Starting with a null string and a goal string
,
does the current string have
? Note that this can be potentially
solved in constant time per new chrancter given preprocessing of
the pattern, where substring must take
.
This seems a model of searching in a protocol where you don't have the
text at once, like Bit Torrent.
Need: one algorithmically strong CS-type
- Shortest Subset Subsequence -
As discussed in algorithm reading group, what is the string on an
alpabet of size
such that every one of the
subsets
is represented by a substring of minimum size, ie. equal to
the cardinality.
Need: one algorithmically strong CS-type
- Path Consistant Walk in a String -
Can you find a low-period string defining
a path between two vertices in a character-labeled path,
i.e. a string?
The problem is hard for trees and dags.
A related problem:
Given two strings
and
, can you
find a string
such that both
and
can be realized by a
back and forth walk on
?
(from Steven Skiena)
Need: one algorithmically strong CS-type
Evolutionary theory and Baseball Competitiveness -
Many professional sports demonstrate long runs of
mismatches between different leagues/conferences/divisions
(e.g. World Series, Super Bowl, interleague play,
and All-Star game competitions), more so than chance
should allow.
I argue this is because teams strive to win in their nitch
(league), and so many better teams are likely to exist in
the league which happens to have the best teams.
There must be something in evolutionary theory to explain this
(differnt populations diverage and both thrive, but one will
dominate if then put together)
(from Steven Skiena)
Need: one CS or bio type
- Probe Chip Design Problem (*) -
We want to design an oligonucleotide microarray based on the human
genome. This will involve:
- Extracting short sequences from the genome;
- Removing known repetitive sequences;
- Selecting subsets that have similar melting temperatures;
- Excluding sequences that have high secondary structure;
- Doing above for different sizes of oligonucleotides - 30, 40, 50 etc.
- Interfacing with a database I have which lists unique oligos.
(from Eli Hatchwell).
Need: at two or three CS-types.
- Microarray Analysis and Website for Platelets -
My friends in the medical school study
platelet gene and proteomic profiling. They
are seeking students interested in learning how to process microarray
data, and in development of databases and web-faced interfaces.
Can you help?
(from Wadie Bahou)
Need: one CS type - MS
Warning - I have not confirmed his current interest.
- Ion Channel Modeling (*) -
This is a systems biology project in which we are taking an overview of
one particular aspect of biology, the regulation of ion channel expression
in heart, and are attempting to describe this system quantitatively using
a mathematical model.
(from David McKinnon)
Need: one or two CS/AMS types
Warning - What he really is interested in now is probably different.
- Exploiting DNA Ancestry Data (*) -
Personal DNA ancestry test kits are now commericially
available (e.g. http://www.dnaancestryproject.com/) enabling the
tracking of family histories from such sequence data/databases.
Get such data and something interesting with it.
- Building HMMs - Build a general-purpose Hidden-Markov Model
implementation, supporting one or more learning algorithms, general topologies,
probability computations, etc.
Report the results of experiments with your program on some classification
problem, perhaps one we discussed in class or in natural language processing
(actor type recognition, sentence boundary identification).
Need: at least one CS-type - MS
- Using SVMs - Use support vector machines (SVMs)
to solve some classification
problem, perhaps one we discussed in class or in natural language processing
and report the results.
Need: at least one CS-type - MS
- Six Degrees of Seperation (*) - Given a network of related
biological terms,
come up with an efficient way of finding the shortest path between two
terms; and create a web based user interface for accepting shortest path
queries, and displaying shortest paths.
(from Andrew Mehler)
Need: at least one CS-type - MS
- Using Baysian Networks - Use Baysian networks to model data
from an entity, citation, or other relevant network.
Need: at least one CS-type - MS/PhD
- Microarray Spotting Verifier -
Human mistakes associated with shuttling and orienting sample plates in
spotting robots sometimes leads to incorrect labelings of microarray
locations, thus rendering meaningless the associated data.
Develop a program which validates the plausibility of an array image
from test probe data, and then develop test-probe placement strategies
to maximize the chance of identifying sample placement errors.
(from Bruce Futcher)
Need: one or two CS types - MS
Warning - What he is really interested in now is probably different.
- Phylogenies of Texts/Disciplines -
Use our text analysis system to generate distances between entities
or document sets, and then phylogenic tree algorithms to reconstruct
history of (say) geopolitical splits, language/cultural groups,
or partitioning of academic fields..
(from Steven Skiena)
Need: one or two CS types - MS
- Finding the Most Frequent Subsequence -
Given a string
, which string
occurs most often as a
scattered subsequence of
?
Note that the number of candidates and possible occurences are exponential.
Can you prove this is NP-complete, and maybe give an approximation algorithm?
Can you use dynamic programming to count the number of occurences of a
candidate
?
Need: One algorithmically inclined CS-type. - PhD
- Software Survey Papers -
Write a 10-15 page in-depth survey paper on the state of software
available for one of the following topics of
interest, or pick your own topic:
- Haplotyping
- Synthetic biology
- RNA interference.
- Genetic linkage analysis.
In all cases, discuss the different programs available for the problem,
and how
they compare by performance, speed, input requirements, and algorithms.
Your paper should be based on some actual experience with the programs.
All survey papers must appropriately cite all sources, and be written in
your own words.
Any student caught plagerizing from Web or published sources will receive
a Q grade and be subject to academic dishonesty proceedings.
Don't be stupid - I know how to use Google, too.
Need: each survey paper should be done individually by either a
CS-type or a biologist