Next: About this document ...
Up: My Home Page
Computer Science 549 - Computational Biology
Prof. Steven Skiena
Fall 2005
Homework 1
Due Thursday, September 22, 2005
Each of the problems should be solved on a
separate sheet of paper to facilitate grading. Limit the solution
of each problem to one sheet of paper unless otherwise stated.
Many of these problems are deliberately
open-ended and vague; do the best you can on them but don't make
yourself crazy. Please don't wait until the last minute to look at
the problems.
You must do this assignment in groups of 2 to 3 people.
Mixed groups of computational and life scientists are best, if possible.
- Spend some time playing with the Genbank and SwissProt databases.
Write a one page paper about your experiences.
Below are representative topics for investigation:
- Hunt for the sequence of the gene which produces Myoglobin.
Why are there so many results - how do they differ?
- Can you find the complete sequence of some human
chromosomes? Where are they?
- Find a relatively obscure molecule of interest. What can you
learn about it from these database entries?
- How many organisms have had their genomes completely sequenced?
How many of these are mammals, plants, fungi, bacteria, and viruses,
respectively.
- Spend some time playing with the PubMed, the database of all
biological/medical literature
available at www.ncbi.nlm.nih.gov/PubMed/ .
Write a one page paper about your experiences.
Below are representative topics for investigation:
- Find a disease you have never heard of which caused somebody's death.
(Check out http://www.deadpeople.info or
http://www.obituaries.com for candidates).
Look up this disease on PubMed
to see what that most up-to-date treatments are.
What is known about the cause of this disease?
- Find some obscure organism (such as the Duck-billed
Platypus) and see what is know about it. You might need to find
Latin names for such a beast and search on that.
- How many of Skiena's papers are listed on Pubmed?
Are any of these any good?
How many of these do not appear in standard computer science bibliographic
sources
such as
http://liinwww.ira.uka.de/bibliography/,
the ACM Digital Library,
or the Web of Science database.
The later two should be available through the
Stony Brook Library webpage.
- Experiment with the assembly program CAP-2, written by
Xiaoqiu Huang. This is a good assembler for small sequencing
projects whose C language source code is available from the course
web page.
Conduct systematic experiments to study the accuracy of
reconstruction on both random and real sequences under different
conditions, and write a three page paper describing the results.
You should start by writing a program which takes an input
sequence and generates simulated fragments with the following
parameters:
random fragments, each of length
, each of
which has
% random sequencing errors (i.e. base changes).
The following questions are representative of those you can
address in your study. Note that it might be more useful to
conduct repeated experiments on scaled-down problem sizes than
fewer experiments on realistic instances if computation time
becomes a problem.
- How large a project does it take, based on your computer
configuration, before reconstruction times start to get to be a
problem? How about the speed-up using a computer with higher
configuration?
- What happens when your sequence has repeats in it? How long
a repeat is sufficient to cause trouble?
- How does the number of gaps vary (if any) as a function of
coverage, sequence length and sequencing error rate?
- How is the accuracy of reconstruction affected (if any) by
coverage, sequence length and sequencing error rate?
- How does the reconstruction accuracy differ over random,
viral, and mammalian sequences?
- Let
be a text string of length
and
be a multiset
of
characters. Give an efficient
algorithm to find
the positions of all substrings of
which are exactly formed by
the characters of
. For example, for
and
the string caba starting in position 7 is
the only solution. You may assume a constant-sized alphabet if
necessary.
- Assume that DNA is a random string on 4-letters. What is the
shortest length
such that any length
sequence on
is very likely not to occur in a random sequence
of length
? What is the longest length
such that any
length
sequence on
is very likely to occur in a
random sequence of length
?
Very precise analysis is difficult
- I am looking for a gross analysis (i.e. upper bounds and lower bounds).
Do this analysis for viral,
bacterial, and human-sized genomes.
Next: About this document ...
Up: My Home Page
Steve Skiena
2005-09-07