Perl & Bioperl - Great Tools for Sequence Analysis

http://genome.bnl.gov/users/mccorkle/bioperl_talk.html

Sean McCorkle
BNL Biology Dept.

The Perl Programming Language

http://www.perl.com/ - main source
http://www.perl.org/ - advocacy

Salient features

In addition to the usual features you would expect from a language that grew up in a post-C world (if-then-else, while, for loops, subroutines, ++, etc), perl has:

Example script

Here's an example script, which translates DNA sequences to Amino Acid sequences

#!/usr/bin/perl

                               ########
                               # Main #
                               ########


load_translation();                      # read %trans from file

while ( <> )                             # read lines into $_
   {
    if ( /^>(.*)/ )                      # is this a fasta header line?
       {
        $new_header = $1;                # if yes, then if we've got somthing
        process() if $header;            # sitting in our buffer, translate
        $header = $new_header;           # and output, then reset our buffer
        $dna_seq = "";                   # globals
       }
    else                                 # not a header, must be sequence!
       { 
        s/\s//g;                         # squeeze out any blanks
        $dna_seq .= uc( $_ );            # append an uppercase copy to $dna_seq
       }
   }
process() if $header;                    # don't forget the last one in the

                                         # buffer before we go. 


                            ###############
                            # Subroutines #
                            ###############

sub process
   {
    my  $i = 0;                             # local counter

    print ">TRANSLATION: $header\n";        # write out a fasta header
    while ( $dna_seq =~ s/^(...)// )        # grab next three nucleotides,
      {                                     # putting them into $1
       print $trans{$1};                    # print the translated amino acid.
       print "\n" unless ( ++$i % 50 );     # and a newline if 
      }
    print( "\n" );
   }


sub load_translation
   {
    open( TRANS, "trans.dat" ) || die "Can't open trans.dat\n";
    while ( $_ =  )
       {
        /^\s*([ACGT]{3})\s+([A-Z\*])/ || die "Bad trans.dat line $_\n";
        $trans{$1} = $2;
       }
    close( TRANS );
   }

The file trans.dat has 64 lines which look like
    ...

    GTT  V
    TAA  *
    TAC  Y
    TAG  *
    TAT  Y
    TCA  S
    TCC  S

    ...
and when you run it on this FASTA format file
>test1
ACGTTACGATCCCACTTA
>test2
ACGTTACGATCCCACTTAACGTTACGATCCCACTTAACGTTACGATCCCACTTA
the output is
>TRANSLATION: test1
TLRSHL
>TRANSLATION: test2
TLRSHLTLRSHLTLRSHL

Perl Modules

Since perl 5 added modular "packages", making language extensions and libraries is easy & convenient.

The CPAN module archive is gargantuan. It contains:

... to name just a few. Like perl, these are free. Some are implemented completely in perl, some in C. However, theres typically no need to be concerned with how they are written - typically these are "ready to use"


Bioperl package

http://bio.perl.org/

Sequence manipulation

Blast interface

Blast generates a large, complex report for each query. While this is very useful to the biologist running a couple of searches, we quite often need to run large numbers (>500) of blast queries and generate overviews for the purpose of sequence annotation. (i.e. one may want to find all genes in a fresh ~3 Mbase DNA sequence to see which ones have matches in GenBank - this could easily be in excess of 1000 queries!)

Unfortunately, Blast only provides a text (or html) output, and while quite informative to the biologist, its very complicated to write software to extract overview data from it.

The Bio::Tools::Blast package provides a parser which does this for you, as well as the ability to submit blast queries to a remote server, and retrieve and analyze the results.


Other Parallel Efforts

biojava.org
biopython.org
bioxml.org
biocorba.org