FASTA General Facts

·       www2.ebi.ac.uk/fasta3/

·       First widely used program for database similarity searching

·       FASTA performs optimized searches for local alignment using a substitution matrix

·       Program better for nucleotides than for proteins

·       Compares a DNA sequence to a DNA database or a protein sequence to a protein database

·       Today FASTA is thought of more as a format than a program


FASTA features

·       Gap penalties

§       Gap open = penalty for the first residue in a gap (-12 default for proteins,-16 for DNA)

§       Gap extension = penalty for additional residues in a gap (-2 default for proteins, -4 for DNA)

·       Scores and Alignments

§       sets the maximum number of reported scores/alignments in the output file

·       ktup

§       parameter to specify word length

§       default for proteins = 2, for DNA = 6

§       increasing ktup creates less sensitive searches but increases the speed

·       histogram

§       display expected frequency of chance occurrence of the database matches found

·       database

§       many choices for both amino acid and nucleotide searches

§       ex. SWISS-PROT, PDB (protein database of Brookhaven), EMBL

·       Z-score

§       score normalized by sequence length

·       expectation value E()

§       how many hits we expect to find by chance with such a score, while comparing this query to the database. 

§       Note: This does not represent a measure of similarity between two sequences.


FASTA - algorithm details

 

·       basic algorithm details... all variations use some form of this basic algorithm

 

 

 

·       4 stages to algorithm:

 

1)   find initial regions in search sequence

 

2)   re-score to find top 10 initial regions (init1)

 

3)   attempt to join initial regions together (initn)

 

4)   optimize around initial region to find best fit (opt)

 


FASTA - stage 1

 

 

1)   using a look-up table (generally implemented as a fast hash table) locate all identities between 2 DNA or amino acid sequences

 

·       Ktup value is the number of identities to match

 

·       narrow to the 10 best diagonal regions based on:

                                                             i.      # of Ktup matches

                                                          ii.      distance between matches...

 

these 10 matches are called the initial regions

 

 

example of hash table use

 

sequence       Position

number     1 2 3 4 5 6 7 8 9 10

1     F L W R T W S T W T

2                              S W K T W T F L R R

 

SEQ1 - location

F - 1 

L - 2

W - 3, 6, 9

R - 4

T - 5, 8

S - 7

 

    offset = seq1 loc - seq2 loc

 

·       can use offset to create diagonals in the homology plot


FASTA - stage 2

 

 

2)   re-score the 10 initial regions using a scoring matrix (generally the PAM250 is used for protein sequences)

 

·       allow for conservative replacements

 

·       allow for shorter then Ktup identities to contribute to score

 

 

the best scoring initial region is given as the init1 score


FASTA - stage 3

 

 

3)   check to see if several of the initial regions can be joined together

 

·       only non-overlapping regions can be joined

 

·       the score is the sum of the scores for each individual region minus a joining penalty for each gap

 

·       only regions with a score above a given join threshold may be used

 

this score is given as the initn (new) score and is used to rank the library sequences


FASTA - stage 4

 

4)   the top ranking library sequences are then further compared by a variation of the Needleman-Wunsch algorithm.

 

·       performs local optimization, so dissimilar portions of the sequence outside of the optimized alignment region do not affect the score

 

considers all possible alignments of the queue and library sequences that fall within a band around the highest scoring initial region

 

·       obtains the optimal global alignment between the two sequences, allowing gaps

 

·       gaps allowed in a 23 residue wide band

 

 

score is given as the final opt (optimized) score


FASTA - final alignment

 

 

·       Final alignment is performed by a full Smith-Waterman algorithm, with unlimited gaps allowed

 

         score is given as the Smith-Waterman score

 


Statistical analysis of results

 

 

·       Significance of a database search depends on the relative magnitude of the initial score, the change in the score after optimization, and the alignment itself.

 

·       The best way to obtain an estimate of statistical significance is to compare the queue sequence with a randomly permuted version of the potentially related sequence.

 

·       Since similarity doesn’t follow a normal distribution, a P value (Probability) can not be calculated.

 

·       the significance of the score is expressed as a z value, where

 

         (observed score - mean of shuffled scores)

z =     -----------------------------------------------------------

(standard deviation of shuffled scores)


Statistical analysis of results

Z-score

 

 

It was originally suggested (by Altschul, et.al.) that local sequence similarity scores follow the extreme value distribution, so that:

 

 

P(s > x) = 1 - exp ( -exp (-lambda (x-u))

 

         where

         u = ln (Kmn) / lambda

         m,n are the lengths of the query and library sequences

 

 

re-written as:

 

1 - exp ( -Kmn exp ( - lambda x))

 

 

 

important things to note are:

 

·       average score for an unrelated library sequence increases with the logarithm of the length of the library sequence

 

·       FASTA uses simple linear regression against the log of the library sequence length to calculate a normalized score for the sequence pair

 

this score is reported as the Z-score


Statistical analysis of results

E()-score  (1 of 2)

 

 

 

·       the Z-scores can then be used with the extreme value distribution and the poisson distribution to calculate the number of library sequences to obtain a score greater than or equal to the score obtain in the search

 

 

simpler -

·       an estimate of the number of sequences that would be expected to produce, purely by chance, a z-score greater then obtained in the search.

 

 

note -

·       this score is database specific, because it is based on the distribution of the score within the database searched

 

 

This score is reported as the E() score


Statistical analysis of results

E()-score   (2 of 2)

 

 

·       statistics assume library contains large sample of unrelated sequences

 

·       the expectations are meaningless if this is not the case

 

·       fewer then 20 sequences, statistics are not done

 

 

 

For protein searches:

 

·       library sequences with E() values < 0.01 from a search of 10,000 entries are almost always homologous

 

·       E() values of 1-10 are frequently related as well

 


Where to find FASTA programs?

 

 

·       FASTA searches can be done via the FASTA web server at the European BioInformatics Institute.

www2.ebi.ac.uk/fasta3

 

 

·       source can be found for most platforms at:

ftp.virginia.edu/pub/fasta

 

 

·       GCG software package

 

 

 

Search Examples

 

FASTA can be found and run at: www2.ebi.ac.uk/fasta3