GRAIL – a group of tools designed to analyze raw sequence data.

     Download it from  ftp://ftp.lsd.ornl.gov/xgrail

     It can locate:

     promoter regions – red boxes; TATA/ATA elements.

     exons – bars above the sequence.

     poly-A sites – cyan bars; AATAAA within 5 kb of the stop codon.

          Highly suggestive of the end of a gene.

repetitive elements – yellow and orange boxes. Satellite DNA consists of

     sequences (e.g. ACAAACT) that are repeated millions of times

     around the centromeres and telomeres (the centers and ends of

     chromosomes). There are also smaller repeats of up to a couple

     thousand bp scattered throughout the genome. The presence of

     repetitive DNA is highly suggestive of an intron.

         

CpG islands – purple boxes. About 56% of human genes are

     associated with CpG islands. Often CpG islands overlap the

     promoter and extend about 1000 base pairs downstream into the

     transcription unit. Identification of potential CpG islands during

     sequence analysis helps to define the extreme 5' ends of genes.

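     CpG islands are commonly defined as regions of DNA at

     least 200 bp in length that have a G+C content above 50%.

     The Cs in most CpG dinucleotides are methylated, and

     methylated Cs tend to mutate to T, so CpG is depleted in most

     of the genome. CpG islands are largely unmethylated and

     therefore keep their CpGs.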
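The CpG island definition above can be turned into a rough sliding-window scan. The sketch below is not how GRAIL itself finds CpG islands. It applies the 200 bp and 50% G+C thresholds from these notes, plus one extra criterion the notes do not mention: an observed/expected CpG ratio of at least 0.6, a commonly used additional cutoff. The function names are made up for illustration.

```python
def gc_fraction(window):
    """Fraction of bases in the window that are G or C."""
    window = window.upper()
    return (window.count("G") + window.count("C")) / len(window)


def cpg_obs_exp(window):
    """Observed CpG count divided by the count expected from C and G content."""
    window = window.upper()
    c, g = window.count("C"), window.count("G")
    if c == 0 or g == 0:
        return 0.0
    return window.count("CG") * len(window) / (c * g)


def find_cpg_islands(seq, window=200, min_gc=0.5, min_obs_exp=0.6):
    """Return merged (start, end) half-open intervals of qualifying windows.

    min_obs_exp is an assumption added for this sketch; it is not part of
    the definition given in the notes.
    """
    islands = []
    for start in range(0, len(seq) - window + 1):
        w = seq[start:start + window]
        if gc_fraction(w) > min_gc and cpg_obs_exp(w) >= min_obs_exp:
            end = start + window
            if islands and start <= islands[-1][1]:
                # Overlaps or touches the previous island: extend it.
                islands[-1] = (islands[-1][0], end)
            else:
                islands.append((start, end))
    return islands


if __name__ == "__main__":
    # Synthetic test sequence: AT-rich flanks around a CG-rich core.
    demo = "AT" * 150 + "CG" * 150 + "AT" * 150
    print(find_cpg_islands(demo))
```

The scan re-counts every window from scratch, which is fine for a 135 kb sequence but would want a running count on anything much larger.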

 

When tested against a set of sequence data with known exons, GRAIL recognized 91% of the exons in the set, with a false positive rate of 8.6%. Now it's time to play with XGRAIL on a 135 kb sequence of human chromosome 22 (Z83838.2).

The sequence contains a gene, Rho GTPase activating protein 8, which spans 48,048 bp starting at position 123.
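To compare GRAIL's predictions with this known gene, it helps to know where the span ends. The small sketch below works that out, assuming that position 123 is a 1-based coordinate (the notes do not say which convention is used). The function names are made up for illustration.

```python
def gene_interval(start, span):
    """Return (first, last) 1-based inclusive coordinates of a feature."""
    return start, start + span - 1


def extract(seq, start, span):
    """Slice a feature out of a sequence string using 1-based coordinates."""
    first, last = gene_interval(start, span)
    return seq[first - 1:last]


if __name__ == "__main__":
    first, last = gene_interval(123, 48048)
    print(f"gene occupies positions {first}..{last}")
```

Under that assumption the gene occupies positions 123 through 48,170, so predicted exons beyond that point belong to something else.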

Get your sequence data in FASTA format and save it to a file.

Open up the file in a text editor and replace the first line (> [info]) with >[filename].

Run xgrail and open your sequence. Select the correct organism in the Open dialog so that the codon biases are correct.

In the Features menu you can choose which features you’d like to display. Be prepared to sit and wait for the repetitive DNA analysis, which is slow.

After running all the analyses you can save them. The annotated sequence is ~4 times the size of the FASTA.
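The header-editing step described earlier (replacing the first line of the FASTA file with the file name) can also be scripted instead of done by hand. The sketch below assumes that ">[filename]" means the file's own name including its extension; the notes do not spell this out. The function name is made up for illustration.

```python
from pathlib import Path


def rename_fasta_header(path):
    """Replace the FASTA header line with '>' plus the file's own name.

    Rewrites the file in place and returns the new header line.
    """
    path = Path(path)
    lines = path.read_text().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError(f"{path} does not start with a FASTA header line")
    lines[0] = ">" + path.name  # assumption: use the name with its extension
    path.write_text("\n".join(lines) + "\n")
    return lines[0]
```

Calling `rename_fasta_header("chr22.fasta")` rewrites the file in place, so keep a copy of the original download if you want the original description line.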

Once probable exons have been located, the protein must be assembled.

GRAIL uses BLAST to search a database of known proteins and returns high-scoring alignments.

There is also a web interface to GRAIL at

          http://grail.lsd.ornl.gov/grailexp/


Procrustes – A homology-based gene prediction tool

http://www-hto.usc.edu/software/procrustes/

Procrustes was a legendary Greek robber baron who would lay his victims down on an iron bed and either stretch them until they fit if they were too short, or cut off their legs if they were too long.

Based on the theory that genes are well conserved across species.

Best for complex exon assemblies and short exon prediction.

Procrustes runs through each possible exon assembly and compares it to the database of known proteins, saving the best matches.

This is an important difference from GRAIL, which chooses its exons based on known codon preferences and other factors, assembles the protein, and then uses genquest to compare its guess against known proteins.
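The "try every exon assembly and keep the best match" idea can be illustrated with a toy brute-force search. This is only a sketch of the principle, not Procrustes' actual algorithm or scoring. It assumes candidate exons are given as half-open (start, end) intervals on the forward strand, compares against a single target protein rather than a database, and uses difflib's similarity ratio as a stand-in for a real alignment score. All names are made up for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations, product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate DNA in frame 0, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = CODON_TABLE.get(dna[i:i + 3].upper(), "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def assemblies(exons):
    """Yield every ordered chain of non-overlapping candidate exons."""
    ordered = sorted(exons)
    for r in range(1, len(ordered) + 1):
        for chain in combinations(ordered, r):
            if all(a[1] <= b[0] for a, b in zip(chain, chain[1:])):
                yield chain


def best_assembly(genomic, exons, target_protein):
    """Return (score, chain, protein) for the assembly most like the target."""
    best = (0.0, (), "")
    for chain in assemblies(exons):
        spliced = "".join(genomic[s:e] for s, e in chain)
        protein = translate(spliced)
        # Stand-in similarity score; a real tool would use an alignment score.
        score = SequenceMatcher(None, protein, target_protein).ratio()
        if score > best[0]:
            best = (score, chain, protein)
    return best


if __name__ == "__main__":
    # Two real exons separated by an intron, plus one decoy candidate.
    genomic = "ATGGCC" + "GTAAGTTTAG" + "AAATAA"
    candidates = [(0, 6), (6, 12), (16, 22)]
    print(best_assembly(genomic, candidates, "MAK"))
```

The number of chains grows exponentially with the number of candidate exons, so enumerating them like this only works for a handful of candidates.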

 

 
