Started in unix world, was quickly ported to Windows world, MacOS, VAX/VMS, Plan9 etc.
dynamic typing - strings, integers, floating point are converted as necessary.
no boolean type. 0 and "" are false, everything else is true.$a = "cacgtgan"; $b = 1.5;
Lots of good functions for lists: push, pop, shift, join, ...@lets = ( "a", "c", "g", "t"); @nums = ( 1, 3..5, 7 ); $nums[1] = 5; print @nums[2..4];
%comp = ( 'A', 'T', 'C', 'G', 'G', 'C', 'T', 'A' );
%trans = ( 'ATG' => 'M', 'TTA' => 'L', 'TTG' => L, 'CTT' => 'L', ... );
$trans{ 'ATG' } = 'M';
You can convert between lists & hashes.
Both can also form argument lists passed to subroutines:
routine( 6, 2, @nums ); # same as routine( 6, 2, 1, 3, 4, 5, 7 )
Can make programs very terse.
Basic:
open( INFILE, "seq1.fasta" ) || die "Can't open seq1.fasta\n";
while( $line = <INFILE> ) # reads one line, returns a character string
{
... # process $line
}
close( INFILE );
More common:
while ( <> ) # read lines from each file (sequentially) on command { # line, putting them into pattern buffer $_ ... }unix command:% prog seq1.fasta seq2.fasta ...
/ATG{6,12}/; # match a subsequece of ATG repeated between
# 6 and 12 times in pattern buffer $_
$str =~ s/(TAG|TAA)/*/gi; # replace all occurances of TAG with * in $str
You can capture desired substrings in special variables $1,
$2, ...After this code executes, $acnum contains AR080483 and $len contains 10754$_ = "LOCUS AR080483 10754 bp DNA PAT 31-AUG-2000"; /^LOCUS\s+(\S+)\s+(\d+)\s+bp/ || die "bad input $_\n"; ($acnum,$len) = ($1,$2);
#!/usr/bin/perl
########
# Main #
########
load_translation(); # read %trans from file
while ( <> ) # read lines into $_
{
if ( /^>(.*)/ ) # is this a fasta header line?
{
$new_header = $1; # if yes, then if we've got somthing
process() if $header; # sitting in our buffer, translate
$header = $new_header; # and output, then reset our buffer
$dna_seq = ""; # globals
}
else # not a header, must be sequence!
{
s/\s//g; # squeeze out any blanks
$dna_seq .= uc( $_ ); # append an uppercase copy to $dna_seq
}
}
process() if $header; # don't forget the last one in the
# buffer before we go.
###############
# Subroutines #
###############
sub process
{
my $i = 0; # local counter
print ">TRANSLATION: $header\n"; # write out a fasta header
while ( $dna_seq =~ s/^(...)// ) # grab next three nucleotides,
{ # putting them into $1
print $trans{$1}; # print the translated amino acid.
print "\n" unless ( ++$i % 50 ); # and a newline if
}
print( "\n" );
}
sub load_translation
{
open( TRANS, "trans.dat" ) || die "Can't open trans.dat\n";
while ( $_ = )
{
/^\s*([ACGT]{3})\s+([A-Z\*])/ || die "Bad trans.dat line $_\n";
$trans{$1} = $2;
}
close( TRANS );
}
The file trans.dat has 64 lines which look like
...
GTT V
TAA *
TAC Y
TAG *
TAT Y
TCA S
TCC S
...
and when you run it on this FASTA format file
the output is>test1 ACGTTACGATCCCACTTA >test2 ACGTTACGATCCCACTTAACGTTACGATCCCACTTAACGTTACGATCCCACTTA
>TRANSLATION: test1 TLRSHL >TRANSLATION: test2 TLRSHLTLRSHLTLRSHL
The CPAN
module archive is gargantuan. It contains:
Personal recommendation: Perl novices shouldn't bother trying to understand the example scripts - they use a lot of advanced perl.
DNA, Protein alphabets;
Basic operations - reverse complement, length, etc.
Sequences can have attached features, annotations
Input and output operations is provided for all these formats:
| Fasta | FASTA format |
| EMBL | EMBL format |
| GenBank | GenBank format |
| swiss | Swissprot format |
| SCF | SCF tracefile format |
| PIR | Protein Information Resource format |
| GCG | GCG format |
| raw | Raw format (one sequence per line, no ID) |
#!/usr/bin/perl
use Bio::Seq;
use Bio::SeqIO;
$input_seqs = Bio::SeqIO->new ( '-format' => 'Fasta' ,
'-file' => 'test.fasta'
);
while ( $s = $input_seqs->next_seq() )
{
print( "sequence is ", $s->seq(), "\n" );
print( "sequence length is ", $s->length(), "\n" );
$r = $s->revcom();
print( "rev. complement is ", $r->seq(), "\n" );
}
#!/usr/bin/perl
use Bio::Seq;
use Bio::SeqIO;
$input_seqs = Bio::SeqIO->new ( '-format' => 'GenBank',
'-file' => 't7.gb'
);
while ( $s = $input_seqs->next_seq() )
{
print( "sequence length is ", $s->length(), "\n" );
print( "ac number is ", $s->accession(), "\n" );
foreach $f ( $s->all_SeqFeatures() )
{
print "Feature from ", $f->start, " to ", $f->end,
" Primary tag ", $f->primary_tag,
" From", $f->source_tag(), "\n";
print "Feature description is ", $f->gff_string(), "\n";
}
}
Unfortunately, Blast only provides a text (or html) output, and while quite informative to the biologist, its very complicated to write software to extract overview data from it.
The Bio::Tools::Blast package provides a parser which does this for you, as well as the ability to submit blast queries to a remote server, and retrieve and analyze the results.
Blast object - created for a parsed report or a result from a remote or local query. Created different ways, but all the subsequent methods are the same. A blast object contains a list of
Sbjct or hit objects - a report typically has 1 or more matched subject sequence or hit A sbjct object contains one or more
HSP objects or High-Scoring Segment Pair objects - there's one alignment shown for each of these in each hit.
Each object comes equiped with many useful methods to access all the data.
@$blast->num_hits # returns number of hits or sbjcts
@hits = $blast->hit_list # returns list of hits
Far too many to describe here - see the documentation
at bio.perl.org
#!/usr/bin/perl
use Bio::Tools::Blast;
$bl = Bio::Tools::Blast->new( '-file' => 'query.blast',
'-parse' => 1,
'-signif' => '1e-10',
);
print $bl->num_hits, " hits\n";
foreach $hit ( $bl->hits() )
{
print $hit->n(), " ", $hit->score(), " ", $hit->expect(),
" ", $hit->start(), "...", $hit->end(), " ",
$hit->desc(),
"\n";
}
This produces output like:
5 hits 1 504 0.0 137651...50438154 Bacteriophage T7, complete genome 1 504 0.0 137651...50438154 Genome of bacteriophage T7 1 164 1e-86 237238...39337629 Bacteriophage phi-YeO3-12, complete genome 1 164 1e-86 237238...39337629 Bacteriophage phi-YeO3-12 complete genome 1 158 4e-83 83140...3933525 Bacteriophage T3 strain amNG220B right end, tai ...
%runParam = (
-method => 'remote',
-prog => 'blastp',
-database => 'swissprot',
-seqs => [ $seq ], # Bio::Seq.pm objects.
);
$blastObj = Bio::Tools::Blast->new( -run => \%runParam,
-parse => 1,
-signif => '1e-10',
-strict => 1,
);
A more fleshed out set of inputs may look like this:
%runParam = (
-method => 'remote',
-prog => 'blastp',
-version => 2, # BLAST2
-database =>'swissprot',
-html => 0,
-seqs => [ $seqObject ], # Bio::Seq.pm object(s)
-descr => 250,
-align => 250,
-expect => 10,
-gap => 'on',
-matrix => 'PAM250',
-email => undef, # don't send report via e-mail if parsing.
-filter => undef, # use default
-gap_c => undef, # use default
-gap_e => undef, # use default
-word => undef, # use default
-min_len => undef, # use default
);
parse_blast.pl
print_blasts.pl
run_blast_remote.pl
retrieve_blast.pl