Linear protein molecules rapidly fold into predefined 3D shapes or structures.
The properties of any protein are largely determined by its structure.
Proteins can be denatured by heat or chemical agents, but many then fold back to their original shape once the denaturing conditions are removed.
Protein structures can be experimentally determined by X-ray crystallography (which requires crystallizing the protein) or by NMR to find the positions of the atoms, but this is a very difficult procedure.
The folded structure of a sequence is determined by the sequence of successive solid bend angles, where each solid angle can be represented by two planar angles.
Such a problem can be made discrete (at some loss of accuracy) by limiting the number of ways to bend each joint to, say, 7 solid angles.
Even so, a 100-residue protein then has a search space of $7^{100}$ configurations.
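To get a feel for the size of this discretized search space, the count can be computed directly. This is only a sketch: the function name `search_space_size` is made up for illustration, and it follows the text's convention of one choice of angle per residue.

```python
import math


def search_space_size(num_residues: int, angles_per_joint: int = 7) -> int:
    """Number of conformations when each residue picks one of a fixed set of bend angles."""
    return angles_per_joint ** num_residues


size = search_space_size(100)
# Python integers are arbitrary precision, so the exact count is available.
print(f"7^100 has {len(str(size))} decimal digits")
print(f"log10(7^100) is about {math.log10(size):.1f}")
```

Even at a billion configurations per second, enumerating a space of this size is out of the question, which is why the search procedures discussed later matter.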
Determining the shape of proteins from sequence is one of today's great computational challenges.
The primary structure of a protein is simply its amino acid sequence.
The secondary structure of a protein is the labeling of each residue with whether it is part of (1) an α-helix, (2) a β-sheet, or (3) a connecting loop.
Secondary structure prediction is important because the helices and sheets determine the protein core, which is typically conserved.
Different amino acids have different probabilities of appearing in each of these structures. But beware, since there is a sequence of 5 residues which appears in both an α-helix and a β-sheet.
Although the notion of secondary structure seems somewhat ill-defined, there are reasonably successful prediction programs (say, correctly labeling […] of all residues) based on ideas like hidden Markov models.
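A hidden Markov model for this task can be sketched with three hidden states (helix, sheet, loop) that emit residues, decoded with the Viterbi algorithm. Everything numeric below is an assumption made up for illustration: the transition and emission values are not trained or published parameters, the emission table covers only a few residues, and the names `viterbi`, `emit`, `TRANS`, and `EMIT` are hypothetical.

```python
import math

STATES = ("H", "S", "L")  # helix, sheet, loop

# Hypothetical numbers for illustration only; a real predictor would
# estimate these from proteins with known secondary structure.
START = {"H": 1 / 3, "S": 1 / 3, "L": 1 / 3}
TRANS = {
    "H": {"H": 0.80, "S": 0.05, "L": 0.15},
    "S": {"H": 0.05, "S": 0.80, "L": 0.15},
    "L": {"H": 0.20, "S": 0.20, "L": 0.60},
}
# Made-up emission scores for a few residues; unlisted residues get DEFAULT_EMIT.
EMIT = {
    "H": {"A": 0.30, "E": 0.30, "L": 0.20},
    "S": {"V": 0.30, "I": 0.30, "Y": 0.20},
    "L": {"G": 0.30, "P": 0.30, "N": 0.20},
}
DEFAULT_EMIT = 0.02


def emit(state, residue):
    return EMIT[state].get(residue, DEFAULT_EMIT)


def viterbi(seq):
    """Return the most probable state label (H/S/L) for each residue of seq."""
    if not seq:
        return ""
    # Log-probabilities avoid numerical underflow on long sequences.
    score = {s: math.log(START[s]) + math.log(emit(s, seq[0])) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for residue in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(emit(s, residue)))
            pointers[s] = prev
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("AEALAGPGNVIVYV"))
```

The high self-transition probabilities encode the observation that helices and sheets come in runs, so a single atypical residue does not immediately flip the label.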
The 3D or tertiary structure of a protein describes the coordinates in space of each amino acid. This geometric information helps determine whether two proteins interact or dock with each other.
Protein folding programs seek to determine the tertiary structure of any protein from its sequence.
The computational difficulty of protein folding has led to proofs that the problem of finding the minimum energy configuration is NP-complete under a variety of models, e.g. maximizing the number of adjacent hydrophobic pairs in a 3D lattice model.
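The objective of the 3D lattice model mentioned above can be written down in a few lines. This sketch assumes the common hydrophobic/polar (HP) encoding, with the sequence given as a string of `H` and `P` and each residue placed on an integer lattice point; the function name `hp_contacts` is made up for illustration. Evaluating a given fold is easy; the hardness lies in finding the fold that maximizes the count.

```python
from itertools import combinations


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def hp_contacts(sequence, coords):
    """Count hydrophobic pairs that are lattice neighbors but not chain neighbors.

    sequence: string over {"H", "P"}; coords: one (x, y, z) integer tuple per residue.
    """
    if len(sequence) != len(coords):
        raise ValueError("need one lattice point per residue")
    if len(set(coords)) != len(coords):
        raise ValueError("fold must be self-avoiding")
    if any(manhattan(a, b) != 1 for a, b in zip(coords, coords[1:])):
        raise ValueError("consecutive residues must occupy adjacent lattice points")
    contacts = 0
    for i, j in combinations(range(len(sequence)), 2):
        if j - i < 2:
            continue  # chain neighbors are adjacent in every fold, so they do not count
        if sequence[i] == "H" and sequence[j] == "H" and manhattan(coords[i], coords[j]) == 1:
            contacts += 1
    return contacts


# The same four-residue chain, folded into a square and stretched out straight.
square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
line = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
print(hp_contacts("HPPH", square), hp_contacts("HPPH", line))
```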
Levinthal's paradox is that proteins correctly fold into their pre-ordained shape less than a minute after being synthesized. How does nature solve this NP-complete problem?
Possible ways around this paradox are (1) that the theoretical models used to prove hardness are not what nature is trying to optimize, (2) that evolution may have selected for proteins which fold easily, and (3) that proteins may well fold in locally, not globally, optimal ways.
Prions, infectious agents which work by "tricking" proteins into folding in non-functional ways, are presumed responsible for mad-cow disease.
De novo (or ab initio) prediction programs work by defining a global energy function and searching the possible bond-angle configurations for one which minimizes total energy.
The process is similar to watching a restless sleeper settle into the most comfortable (minimum energy) position.
The most important issues are (1) the energy function selected, and (2) the optimization procedure employed to search the space.
Reasonable energy functions include terms for hydrophobic/hydrophilic interactions, size and flexibility properties of different amino acids, and electrostatic and van der Waals interactions of nearby atoms.
Standard optimization methods to employ are gradient descent, simulated annealing, and genetic algorithms, often sped up by parallel computation.
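As one illustration of these search procedures, here is a minimal simulated annealing sketch over discretized bend angles. It is a toy under stated assumptions, not a realistic folding program: the chain lives in 2D with unit bond lengths, each joint picks one of 7 turn angles, and the energy function (`toy_energy`, with a clash penalty of 10 and a reward of 1 per nearby hydrophobic pair) is invented for this example. All names and parameter values are hypothetical.

```python
import math
import random

# Seven allowed turn angles per joint (an assumed discretization).
ANGLES = [math.radians(a) for a in (-90, -60, -30, 0, 30, 60, 90)]


def coordinates(turns):
    """Place residues in the plane: unit-length bonds, one turn angle per interior joint."""
    x, y, heading = 1.0, 0.0, 0.0
    points = [(0.0, 0.0), (x, y)]  # the first bond lies along the x axis
    for turn in turns:
        heading += turn
        x += math.cos(heading)
        y += math.sin(heading)
        points.append((x, y))
    return points


def toy_energy(sequence, turns):
    """Made-up energy: +10 per steric clash, -1 per nearby hydrophobic (H) pair."""
    pts = coordinates(turns)
    energy = 0.0
    for i in range(len(pts)):
        for j in range(i + 2, len(pts)):
            d = math.dist(pts[i], pts[j])
            if d < 0.8:
                energy += 10.0
            elif d < 1.5 and sequence[i] == "H" and sequence[j] == "H":
                energy -= 1.0
    return energy


def anneal(sequence, steps=20000, t_start=2.0, t_end=0.05, seed=0):
    """Return (best_turns, best_energy) found by simulated annealing."""
    if len(sequence) < 3:
        raise ValueError("need at least three residues to have a joint")
    rng = random.Random(seed)
    turns = [0.0] * (len(sequence) - 2)  # start from a fully extended chain
    energy = toy_energy(sequence, turns)
    best_turns, best_energy = list(turns), energy
    for step in range(steps):
        # Geometric cooling schedule from t_start down toward t_end.
        temperature = t_start * (t_end / t_start) ** (step / steps)
        k = rng.randrange(len(turns))
        old = turns[k]
        turns[k] = rng.choice(ANGLES)
        delta = toy_energy(sequence, turns) - energy
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            energy += delta
            if energy < best_energy:
                best_turns, best_energy = list(turns), energy
        else:
            turns[k] = old  # reject the move
    return best_turns, best_energy


turns, energy = anneal("HPHPPHHPHPPH", steps=5000)
print(energy, [round(math.degrees(t)) for t in turns])
```

Accepting some uphill moves at high temperature is what lets the search escape local minima that would trap plain gradient descent; the result is still only a locally good fold, with no optimality guarantee.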
IBM's Blue Gene project seeks to build a massively parallel computer for doing such de novo protein folding computations.
How can we judge how well a protein prediction program works?
One measure is to align the correct and predicted 3D structures and compute the root-mean-square (RMS) deviation per residue.
Finding this alignment is not trivial, and the measure misses the fact that the core structure is what is most important.
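Once the two structures have been superimposed, the RMS deviation itself is a short computation. The sketch below assumes the hard part (finding the optimal rotation and translation, for example with the Kabsch algorithm) has already been done, and that each structure is given as one (x, y, z) coordinate per residue; the function name `rmsd` is chosen for illustration.

```python
import math


def rmsd(predicted, actual):
    """Root-mean-square deviation between two already-superimposed structures.

    predicted, actual: equal-length lists of (x, y, z) tuples, one per residue.
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty structures of the same length")
    total = sum(
        (p - a) ** 2
        for point_p, point_a in zip(predicted, actual)
        for p, a in zip(point_p, point_a)
    )
    return math.sqrt(total / len(predicted))


actual = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.5, 0.0)]
predicted = [(x + 3.0, y + 4.0, z) for x, y, z in actual]  # every residue displaced by 5 units
print(rmsd(predicted, actual))
```

Because every residue contributes equally, a prediction with a correct core and badly placed loops can score worse than one that is mediocre everywhere, which is the weakness noted above.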
The CASP project/competition regularly invites structure predictions of proteins about to be experimentally determined, and determines the winner on a more ad hoc basis.
Since de novo structure prediction is hard, many programs use known 3D structures as a crutch to help fold new sequences.
This makes sense since all proteins likely descend from a small number of original structures.
Two amino acid sequences with […] identical residues likely have similar three-dimensional structures.
Thus there may be only a small number of different folds/substructures common to all proteins, and we will likely have seen them all after determining a sufficiently large number of structures.
In general, threading or inverse folding programs are more accurate than de novo prediction programs.
The input to a threading program is (1) a protein sequence, (2) a core model describing the position of the core residues and allowable lengths of loops, and (3) a scoring function to evaluate the given threading.
Reasonable factors in the cost model include (1) the similarity of the residue at each position to the original, (2) the length and similarities of the loops, and (3) pairwise interactions between residues at core positions.
Without modeling pairwise interactions, this becomes a simple dynamic programming-type problem.
However, incorporating pairwise interactions makes the problem NP-complete.
Why isn't threading just finding the best alignment with the structure model, solved with dynamic programming?
A scoring function with pairwise interactions requires tabulating the possible substructures for every residue assignment, not just the best matching prefix structures, so dynamic programming becomes far less feasible.
Thus exhaustive search/heuristics are used in threading programs, but the options are much more constrained than for de novo folding algorithms.
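For contrast, the easy case mentioned above (no pairwise interactions) can be sketched as a dynamic program over core-segment placements. The model here is a deliberate simplification and every name is hypothetical: each core segment is given as the template's residues, the score of a placement is just the number of identical residues (a real program would use a substitution matrix and loop scores), and `loops[k]` gives the allowed (min, max) loop length between core `k` and core `k + 1`.

```python
NEG_INF = float("-inf")


def thread(sequence, cores, loops):
    """Best placement of core segments on sequence, ignoring pairwise interactions.

    Returns (score, start position of each core), or (-inf, []) if nothing fits.
    """
    if not cores or len(loops) != len(cores) - 1:
        raise ValueError("need at least one core and one loop range between adjacent cores")
    n = len(sequence)

    def segment_score(k, start):
        core = cores[k]
        if start + len(core) > n:
            return NEG_INF
        return sum(1 for a, b in zip(sequence[start:start + len(core)], core) if a == b)

    # best[k][p]: best total score with cores 0..k placed and core k starting at p.
    best = [[NEG_INF] * (n + 1) for _ in cores]
    prev_start = [[None] * (n + 1) for _ in cores]
    for p in range(n + 1):
        best[0][p] = segment_score(0, p)
    for k in range(1, len(cores)):
        shortest, longest = loops[k - 1]
        for p in range(n + 1):
            s = segment_score(k, p)
            if s == NEG_INF:
                continue
            for gap in range(shortest, longest + 1):
                q = p - gap - len(cores[k - 1])  # implied start of the previous core
                if q >= 0 and best[k - 1][q] + s > best[k][p]:
                    best[k][p] = best[k - 1][q] + s
                    prev_start[k][p] = q
    last = len(cores) - 1
    end = max(range(n + 1), key=lambda p: best[last][p])
    if best[last][end] == NEG_INF:
        return NEG_INF, []
    starts = [end]
    for k in range(last, 0, -1):
        starts.append(prev_start[k][starts[-1]])
    return best[last][end], starts[::-1]


print(thread("GGACDEGGGKLMNGG", ["ACDE", "KLMN"], [(2, 4)]))
```

The table works because the score of core `k` depends only on where core `k` sits, so the best prefix placement is all that needs to be remembered. A pairwise term between residues in different cores would make the score of core `k` depend on the positions of all earlier cores, which is exactly what breaks this recurrence.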