next up previous
Next: About this document ... Up: My Home Page

Computer Science 549 - Computational Biology
Prof. Steven Skiena
Fall 2007

Semester Project

Proposal due: Thursday, October 25, 2007

Progress report due: Tuesday, November 20, 2007

The project will involve concentrated work in one research project related to computational biology. Below I list several possible projects. You may also choose your own topic if you can convince me that what you want to do is interesting. However I discourage projects derived from your own thesis research.

While I anticipate that much of the work will be done as the deadline approaches, it is important to get started early enough to discover insurmountable roadblocks in data acquisition or problem definition before it is too late.

Each group is responsible to turn in 3-5 page project proposals and progress reports as of the dates above. I will award 50% of the grade for each project on the strength of the preliminary reports. This is to encourage starting early, and to make sure that you and I both know what you are to do before it is too late to avoid trouble.

My hope is that one or more of these projects will lead to published work. This has been the case most times I have taught such a course. The projects I believe are best have been starred (*). Certain projects which I believe might be extendible to MS or PhD projects have been so annotated - anyone is free to work on them, but I particularly recommend them to students thinking about doing research under me.

  1. Text Mining of PubMed Abstracts (*) - The PubMed database contains the full text of 12.5 million journal article abstracts from the biomedical literature. By studying the juxaposition of gene names, organisms, diseases, and drug names in abstracts we can extract large-scale structured data about the relationships of these entities.

    This project will involve building and improving text-mining tools of our Lydia text analysis system (www.textmed.org) to analyze PubMed abstracts in several ways:

    See Friedman, et.al (ISBM 2001) for details on a similar project. (from Steven Skiena) Need: two or more strong CS-types - MS or PhD

  2. Patent Analysis (*) - Patents are very important in the biotechnology and pharmaceutical industries. Can we identify which patents are most valuable by doing NLP and citation network analysis on the complete set of patents?

    This project will involve building and improving text-mining tools of our Lydia text analysis system to retreive and analyze patents:

    There is a literature on such analysis to start from.

  3. Partitioning motifs for PSSMs (*) - Constructing a position specific sequence matrix (PSSM) from a collection of $n$ $k$-mers is trivial; just count the frequency of each symbol in each position.

    But what if the k-mers we are given come from two different motifs. How should we partition them? What is the right measure of success? Is it NP-complete?

    A paper on the application is "Context-specific independence mixture modeling for positional weight matrices" by B. Georgi and A. Schliep, Bioinformatics, v. 22 (2006) e166-173 (from Martin Vingron)

  4. Angle Agreement Statistic - Pyne proposed an angle agreement statistic for cell cycle genes I did not like. We need a better one. I want it based on one of the following

    What is the probability that k random angles all fall in a sector of angle a? What is the probability that k random angles have a standard deviation of d (with Pyne and Bruce)

  5. Prediction of Health costs - Prediction of health costs based on medical expenditures is an important and open research problem [Willan and Briggs]. Insurance companies try to estimate the probability of potential costs such as reimbursements in the next month. For any disease, this can be answered by fitting a suitable distribution to the histogram of expenses incurred by patients with the disease (and computing its tail probability).

    The student is required to search and obtain real medical expenditure data for a chosen set of diseases, as well as simulate data, transform these as needed, run programs to fit models like Gamma mixture to the histograms, and report the materials, the methods and the quality of fit. (from S. Pyne (spyne@broad.mit.edu))

  6. Generating DNA motifs in a Profile - A profile matrix gives probability of each base appearing in each position, so a length-$k$ profile matrix scores the likelihood of each of $4^k$ sequences. Given a profile matrix and appropriate data structure, we seek to generate these sequences in order of decreasing weight. By taking logs of the probabilities, this reduces to scoring sequences in terms of sums of weight elements.

    A nice project from a previous class implemented such a generation algorithm. An extension project would be to write a program (using suffix arrays) to exploit this incremental algorithm for fast motif checking. (from Paul Fodor (pfodor@cs) - check out http://www.sinc.sunysb.edu/Stu/pfodor/ ) Need: one CS-type - MS

  7. String Matching in a Bit Torrent? - Suppose we are receiving a string of length $n$ one character at a time, but not from left to right. Each character arrives as a pair $(a,i)$, which means an a will be the ith character in the string. How fast can we support the following data structure operations? This seems a model of searching in a protocol where you don't have the text at once, like Bit Torrent. Need: one algorithmically strong CS-type

  8. Shortest Subset Subsequence - As discussed in algorithm reading group, what is the string on an alpabet of size $n$ such that every one of the $2^n$ subsets is represented by a substring of minimum size, ie. equal to the cardinality. Need: one algorithmically strong CS-type

  9. Path Consistant Walk in a String - Can you find a low-period string defining a path between two vertices in a character-labeled path, i.e. a string? The problem is hard for trees and dags.

    A related problem: Given two strings $S_1$ and $S_2$, can you find a string $S$ such that both $S_1$ and $S_2$ can be realized by a back and forth walk on $S$? (from Steven Skiena) Need: one algorithmically strong CS-type

    Evolutionary theory and Baseball Competitiveness - Many professional sports demonstrate long runs of mismatches between different leagues/conferences/divisions (e.g. World Series, Super Bowl, interleague play, and All-Star game competitions), more so than chance should allow. I argue this is because teams strive to win in their nitch (league), and so many better teams are likely to exist in the league which happens to have the best teams.

    There must be something in evolutionary theory to explain this (differnt populations diverage and both thrive, but one will dominate if then put together) (from Steven Skiena) Need: one CS or bio type

  10. Probe Chip Design Problem (*) - We want to design an oligonucleotide microarray based on the human genome. This will involve: (from Eli Hatchwell). Need: at two or three CS-types.

  11. Microarray Analysis and Website for Platelets - My friends in the medical school study platelet gene and proteomic profiling. They are seeking students interested in learning how to process microarray data, and in development of databases and web-faced interfaces. Can you help? (from Wadie Bahou) Need: one CS type - MS

    Warning - I have not confirmed his current interest.

  12. Ion Channel Modeling (*) - This is a systems biology project in which we are taking an overview of one particular aspect of biology, the regulation of ion channel expression in heart, and are attempting to describe this system quantitatively using a mathematical model. (from David McKinnon) Need: one or two CS/AMS types

    Warning - What he really is interested in now is probably different.

  13. Exploiting DNA Ancestry Data (*) - Personal DNA ancestry test kits are now commericially available (e.g. http://www.dnaancestryproject.com/) enabling the tracking of family histories from such sequence data/databases. Get such data and something interesting with it.

  14. Building HMMs - Build a general-purpose Hidden-Markov Model implementation, supporting one or more learning algorithms, general topologies, probability computations, etc. Report the results of experiments with your program on some classification problem, perhaps one we discussed in class or in natural language processing (actor type recognition, sentence boundary identification). Need: at least one CS-type - MS

  15. Using SVMs - Use support vector machines (SVMs) to solve some classification problem, perhaps one we discussed in class or in natural language processing and report the results. Need: at least one CS-type - MS

  16. Six Degrees of Seperation (*) - Given a network of related biological terms, come up with an efficient way of finding the shortest path between two terms; and create a web based user interface for accepting shortest path queries, and displaying shortest paths. (from Andrew Mehler) Need: at least one CS-type - MS

  17. Using Baysian Networks - Use Baysian networks to model data from an entity, citation, or other relevant network. Need: at least one CS-type - MS/PhD

  18. Microarray Spotting Verifier - Human mistakes associated with shuttling and orienting sample plates in spotting robots sometimes leads to incorrect labelings of microarray locations, thus rendering meaningless the associated data. Develop a program which validates the plausibility of an array image from test probe data, and then develop test-probe placement strategies to maximize the chance of identifying sample placement errors. (from Bruce Futcher) Need: one or two CS types - MS

    Warning - What he is really interested in now is probably different.

  19. Phylogenies of Texts/Disciplines - Use our text analysis system to generate distances between entities or document sets, and then phylogenic tree algorithms to reconstruct history of (say) geopolitical splits, language/cultural groups, or partitioning of academic fields.. (from Steven Skiena) Need: one or two CS types - MS

  20. Finding the Most Frequent Subsequence - Given a string $S$, which string $S'$ occurs most often as a scattered subsequence of $S$? Note that the number of candidates and possible occurences are exponential. Can you prove this is NP-complete, and maybe give an approximation algorithm? Can you use dynamic programming to count the number of occurences of a candidate $S'$? Need: One algorithmically inclined CS-type. - PhD

  21. Software Survey Papers - Write a 10-15 page in-depth survey paper on the state of software available for one of the following topics of interest, or pick your own topic:

    In all cases, discuss the different programs available for the problem, and how they compare by performance, speed, input requirements, and algorithms. Your paper should be based on some actual experience with the programs.

    All survey papers must appropriately cite all sources, and be written in your own words. Any student caught plagerizing from Web or published sources will receive a Q grade and be subject to academic dishonesty proceedings. Don't be stupid - I know how to use Google, too. Need: each survey paper should be done individually by either a CS-type or a biologist




next up previous
Next: About this document ... Up: My Home Page
Steve Skiena
2007-09-27