· www2.ebi.ac.uk/fasta3/
· First widely used program for database similarity searching
· FASTA performs optimized searches for local alignment using a substitution matrix
· Program better for nucleotides than for proteins
· Compares a DNA sequence to a DNA database or a protein sequence to a protein database
·
Today FASTA is thought of more as a format than a
program
FASTA features
· Gap penalties
§ Gap open = penalty for the first residue in a gap (-12 default for proteins,-16 for DNA)
§ Gap extension = penalty for additional residues in a gap (-2 default for proteins, -4 for DNA)
· Scores and Alignments
§ sets the maximum number of reported scores/alignments in the output file
· ktup
§ parameter to specify word length
§ default for proteins = 2, for DNA = 6
§ increasing ktup creates less sensitive searches but increases the speed
· histogram
§ display expected frequency of chance occurrence of the database matches found
· database
§ many choices for both amino acid and nucleotide searches
§ ex. SWISS-PROT, PDB (protein database of Brookhaven), EMBL
· Z-score
§ score normalized by sequence length
· expectation value E()
§ how many hits we expect to find by chance with such a score, while comparing this query to the database.
§ Note: This does not represent a measure of similarity between two sequences.
· basic algorithm details... all variations use some form of this basic algorithm
· 4 stages to algorithm:
1) find initial regions in search sequence
2) re-score to find top 10 initial regions (init1)
3) attempt to join initial regions together (initn)
4) optimize around initial region to find best fit (opt)
1) using a look-up table (generally implemented as a fast hash table) locate all identities between 2 DNA or amino acid sequences
· Ktup value is the number of identities to match
· narrow to the 10 best diagonal regions based on:
i. # of Ktup matches
ii. distance between matches...
these 10 matches are called the initial regions
example of hash table use
sequence Position
number 1 2 3 4 5 6 7 8 9 10
1 F L W R T W S T W T
2
S W K T W T F L R R
SEQ1 - location
F - 1
L - 2
W - 3, 6, 9
R - 4
T - 5, 8
S - 7
offset = seq1 loc - seq2 loc
· can use offset to create diagonals in the homology plot
2) re-score the 10 initial regions using a scoring matrix (generally the PAM250 is used for protein sequences)
· allow for conservative replacements
· allow for shorter then Ktup identities to contribute to score
the best scoring initial region is given as the init1 score
3) check to see if several of the initial regions can be joined together
· only non-overlapping regions can be joined
· the score is the sum of the scores for each individual region minus a joining penalty for each gap
· only regions with a score above a given join threshold may be used
this score is given as the initn (new) score and is used to rank the library sequences
4) the top ranking library sequences are then further compared by a variation of the Needleman-Wunsch algorithm.
· performs local optimization, so dissimilar portions of the sequence outside of the optimized alignment region do not affect the score
considers all possible alignments of the queue and library sequences that fall within a band around the highest scoring initial region
· obtains the optimal global alignment between the two sequences, allowing gaps
· gaps allowed in a 23 residue wide band
score is given as the final opt (optimized) score
· Final alignment is performed by a full Smith-Waterman algorithm, with unlimited gaps allowed
score is given as the Smith-Waterman score
· Significance of a database search depends on the relative magnitude of the initial score, the change in the score after optimization, and the alignment itself.
· The best way to obtain an estimate of statistical significance is to compare the queue sequence with a randomly permuted version of the potentially related sequence.
· Since similarity doesn’t follow a normal distribution, a P value (Probability) can not be calculated.
· the significance of the score is expressed as a z value, where
(observed score - mean of shuffled scores)
z = -----------------------------------------------------------
It was originally suggested (by Altschul, et.al.) that local sequence similarity scores follow the extreme value distribution, so that:
P(s > x) = 1 - exp ( -exp (-lambda (x-u))
where
u = ln (Kmn) / lambda
m,n are the lengths of the query and library sequences
re-written as:
1 - exp ( -Kmn exp ( - lambda x))
important things to note are:
· average score for an unrelated library sequence increases with the logarithm of the length of the library sequence
· FASTA uses simple linear regression against the log of the library sequence length to calculate a normalized score for the sequence pair
this score is reported as the Z-score
· the Z-scores can then be used with the extreme value distribution and the poisson distribution to calculate the number of library sequences to obtain a score greater than or equal to the score obtain in the search
simpler -
· an estimate of the number of sequences that would be expected to produce, purely by chance, a z-score greater then obtained in the search.
note -
· this score is database specific, because it is based on the distribution of the score within the database searched
This score is reported as the E() score
· statistics assume library contains large sample of unrelated sequences
· the expectations are meaningless if this is not the case
· fewer then 20 sequences, statistics are not done
For protein searches:
· library sequences with E() values < 0.01 from a search of 10,000 entries are almost always homologous
· E() values of 1-10 are frequently related as well
· FASTA searches can be done via the FASTA web server at the European BioInformatics Institute.
www2.ebi.ac.uk/fasta3
· source can be found for most platforms at:
ftp.virginia.edu/pub/fasta
· GCG software package
Search Examples
FASTA can be found and run at: www2.ebi.ac.uk/fasta3