next up previous
Next: About this document ... Up: My Home Page

Computer Science 549 - Computational Biology
Prof. Steven Skiena
Fall 2005

Homework 1

Due Thursday, September 22, 2005

Each of the problems should be solved on a separate sheet of paper to facilitate grading. Limit the solution of each problem to one sheet of paper unless otherwise stated.

Many of these problems are deliberately open-ended and vague; do the best you can on them but don't make yourself crazy. Please don't wait until the last minute to look at the problems.

You must do this assignment in groups of 2 to 3 people. Mixed groups of computational and life scientists are best, if possible.

  1. Spend some time playing with the Genbank and SwissProt databases.

    Write a one page paper about your experiences. Below are representative topics for investigation:

    1. Hunt for the sequence of the gene which produces Myoglobin. Why are there so many results - how do they differ?

    2. Can you find the complete sequence of some human chromosomes? Where are they?

    3. Find a relatively obscure molecule of interest. What can you learn about it from these database entries?

    4. How many organisms have had their genomes completely sequenced? How many of these are mammals, plants, fungi, bacteria, and viruses, respectively.

  2. Spend some time playing with the PubMed, the database of all biological/medical literature available at www.ncbi.nlm.nih.gov/PubMed/ .

    Write a one page paper about your experiences. Below are representative topics for investigation:

    1. Find a disease you have never heard of which caused somebody's death. (Check out http://www.deadpeople.info or http://www.obituaries.com for candidates). Look up this disease on PubMed to see what that most up-to-date treatments are. What is known about the cause of this disease?

    2. Find some obscure organism (such as the Duck-billed Platypus) and see what is know about it. You might need to find Latin names for such a beast and search on that.

    3. How many of Skiena's papers are listed on Pubmed? Are any of these any good? How many of these do not appear in standard computer science bibliographic sources such as http://liinwww.ira.uka.de/bibliography/, the ACM Digital Library, or the Web of Science database. The later two should be available through the Stony Brook Library webpage.

  3. Experiment with the assembly program CAP-2, written by Xiaoqiu Huang. This is a good assembler for small sequencing projects whose C language source code is available from the course web page.

    Conduct systematic experiments to study the accuracy of reconstruction on both random and real sequences under different conditions, and write a three page paper describing the results.

    You should start by writing a program which takes an input sequence and generates simulated fragments with the following parameters: $m$ random fragments, each of length $l$, each of which has $e$% random sequencing errors (i.e. base changes).

    The following questions are representative of those you can address in your study. Note that it might be more useful to conduct repeated experiments on scaled-down problem sizes than fewer experiments on realistic instances if computation time becomes a problem.

    1. How large a project does it take, based on your computer configuration, before reconstruction times start to get to be a problem? How about the speed-up using a computer with higher configuration?

    2. What happens when your sequence has repeats in it? How long a repeat is sufficient to cause trouble?

    3. How does the number of gaps vary (if any) as a function of coverage, sequence length and sequencing error rate?

    4. How is the accuracy of reconstruction affected (if any) by coverage, sequence length and sequencing error rate?

    5. How does the reconstruction accuracy differ over random, viral, and mammalian sequences?

  4. Let $T$ be a text string of length $n$ and $S$ be a multiset of $m$ characters. Give an efficient $O(n+m)$ algorithm to find the positions of all substrings of $T$ which are exactly formed by the characters of $S$. For example, for $S=(a,a,b,c)$ and $T=eabahgcabah$ the string caba starting in position 7 is the only solution. You may assume a constant-sized alphabet if necessary.

  5. Assume that DNA is a random string on 4-letters. What is the shortest length $l$ such that any length $l$ sequence on $(A,C,G,T)$ is very likely not to occur in a random sequence of length $n$? What is the longest length $l'$ such that any length $l'$ sequence on $(A,C,G,T)$ is very likely to occur in a random sequence of length $n$?

    Very precise analysis is difficult - I am looking for a gross analysis (i.e. upper bounds and lower bounds). Do this analysis for viral, bacterial, and human-sized genomes.




next up previous
Next: About this document ... Up: My Home Page
Steve Skiena
2005-09-07