BLAST General Facts

 

·       Basic Local Alignment Search Tool

·       www.ncbi.nlm.nih.gov/BLAST/

·       improved overall search speed and put database searching on firm statistical foundation.

·       uses heuristics to speed the pairwise comparison. 

·       creates sequence abstraction by listing exact and similar words

·       uses local alignment


BLAST  Element 1 of 4

 

1 Name of BLAST program

blastp

amino acid query sequence to protein sequence database

blastn

nucleotide query sequence to nucleotide sequence database

blastx

nucleotide query sequence translated in all reading frames against a protein sequence database

tblastn

protein query sequence against nucleotide sequence database translated in all reading frames

tblastx

most intensive computationally

compares 6 frame translations of nucleotides query sequence against 6 frame translation of nucleotide sequence database


BLAST Elements  2 & 3 of 4

 

 

2 Name of database

 

·       There are a number of different databases to use for both peptide and nucleotide sequences. 

·       Both contain:

   nr = all non-redundant sequences (from

           particular databases)

        month = new releases from the past 30 days

                genomes from Drosophila, yeast, E.coli

 

 

3 query sequence

 

·       This is either your amino acid sequence or your nucleotide sequence that you wish to input your search. 

·       This can be typed in or pasted in FASTA format, or use the Accession or GI number.

 


BLAST Element 4 of 4

 

4 various optional parameters

(many seen in the Advanced BLAST search)

 

·       choose and organism to limit your search

·       change Expect Value

§       This is the threshold for reporting matches against a database sequence. 

§       default = 10 therefore 10 matches are expected to be found merely by chance (Karlin and Altschul)

§       Lower expect thresholds are more stringent, leading to fewer chance matches being reported

·       Filter

§       masks sequences of low compositional complexity (SEG program Wooton and Federhen)

§       can eliminate statistically significant but biologically uninteresting reports from the output

§       only applied to query sequence

·       Matrix

§       default = BLOSUM62

§       should be altered for short alignments because they need to be relatively strong to rise above background.

§       Short alignments should use the PAM series

·       Descriptions

§       restricts number of short descriptions of matching sequences reported to a number specified

·       Alignments

§       restricts database sequence to a number specified for which HSPs (high scoring pairs) are reported


Just a word about other BLAST programs

 

 

PSI-Blast

·       Position specific iterative

·       constructed on the fly and iteratively refined

·       begins with a standard database search using a single query sequence

·       a profile is constructed from highly significant alignments in this initial search result

·       the profile is then used in a second pass search of the database

·       this is repeated as wanted with additional sequences found in each cycle being used to refine the profile

·       can lead to hits where blastp searches come with no significant hits.


BLAST algorithm details

 

 

·       basic algorithm details... all variations use some form of this basic algorithm

 

·       3 stages to algorithm:

 

1)   compiling a list of high-scoring words

 

2)   scanning the database for hits

 

3)   extending those hits

 


BLAST stage 1

 

 

1)   A list of words (w-mers) is compiled for the query sequence

 

·       the queue words are lists of residues that score at least T when compared to some word in the query sequence

 

·       typically 50 words / residue in the queue sequence

12,500 words for a sequence of length 250

 

 

 

·       this list of words can be compiled in time essentially proportional to the length of the list

 

·       trade off between speed and sensitivitiy in T and w

higher T yields greater speed, but increased probability of missing weak links


BLAST stage 2

 

 

2)   scan database for hits on those words

 

·       classic algorithmic problem:

“search long sequence for all occurrences of short sequences”

 

·       find all hits in the database sequence that match a word in the query list

 

 

·       2 approaches to solving problem:

1.               array -

·       suppose w = 4, map each word to an integer between 1 and 204 so word can be used as an index in an array of size 204 = 160,000

·       let each ith entry in the array point to the list of all occurrence in the query sequence

·       as we scan the database each word leads us directly to the hits in the query string

 

·       typically only a few thousand of the spaces in the array are used

·       can be creative in the design of the array to take as little space as possible

 

 

2.               finite state machine -

·       signal acceptance on transitions not states

·       saves space and time in construction


BLAST stage 3

 

 

3)   extend a hit to find a locally maximal segment pair containing that hit

 

·       two non-overlapping word pairs on the same diagonal and within a distance of A from each other (hit)

 

·       extends the hit in both directions

 

·       extension considers only alignments that drop in score no more then X below the best score yet scene gives the ability to adapt the region of the graph explored to the data it sees

 

·       if extension has score large enough it will trigger a gapped extension of the hit

 

·       can generate gaps in alignments via a dynamic programming algorithm we have already seen, same idea as FASTA

 

·       report resulting alignment if it has an E-score low enough to be of interest

 


 Comparison of BLAST and FASTA

 

BLAST

FASTA

can produce more than one HSP per database entry

produces only the one best alignment

better for proteins than nucleotides

better for nucleotides than proteins

faster than FASTA

slower than BLAST

less sensitive when using default settings

more sensitive; misses less homologs

less separation between true homologs and random hits

more separation between true homologs and random hits

calculates probabilities (sometimes fails entirely if some assumptions are invalid)

calculates significance ‘on the fly’ from the given dataset (problems if dataset is small)

 


Conclusions on General Issues Regarding

BLAST and FASTA

 

 

·       run BLAST  first because it’s a more general search, then if needed run FASTA

 

·       use translated amino acid sequence wherever you can

 

·       E() ,0.05 is statistically significant and usually biologically interesting

 

·       if query has repeated segments, remove them and repeat the search

 

·       use proteins for database similarity searches when possible because:

 

§       code is degenerate and those with different amino acid sequence can still code for similar protein sequences

 

§       more random hits with DNA than protein

 

§       DNA databases are much larger than protein therefore more random hits conservation in evolution, proteins are rarely mutated

 

 

 

Search Examples

 

BLAST can be found and run at: www.ncbi.nlm.nih.gov/BLAST/