· www.ncbi.nlm.nih.gov/BLAST/
· improved overall search speed and put database searching on firm statistical foundation.
· uses heuristics to speed the pairwise comparison.
· creates sequence abstraction by listing exact and similar words
· uses local alignment
BLAST Element 1 of 4
|
blastp |
amino acid query sequence to protein sequence database |
|
blastn |
nucleotide query sequence to nucleotide sequence database |
|
blastx |
nucleotide query sequence translated in all reading frames against a protein sequence database |
|
tblastn |
protein query sequence against nucleotide sequence database translated in all reading frames |
|
tblastx |
most intensive computationally compares 6 frame translations of nucleotides query sequence against 6 frame translation of nucleotide sequence database |
BLAST Elements 2
& 3 of 4
2 Name of database
· There are a number of different databases to use for both peptide and nucleotide sequences.
· Both contain:
nr = all non-redundant sequences (from
particular databases)
month = new releases from the past 30 days
genomes from Drosophila, yeast, E.coli
3 query
sequence
· This is either your amino acid sequence or your nucleotide
sequence that you wish to input your search.
· This can be typed in or pasted in FASTA format, or use the
Accession or GI number.
BLAST Element 4 of 4
4
various optional parameters
(many
seen in the Advanced BLAST search)
· choose and organism to limit your search
· change Expect Value
§ This is the threshold for reporting matches against a
database sequence.
§ default = 10 therefore 10 matches are expected to be found
merely by chance (Karlin and Altschul)
§ Lower expect thresholds are more stringent, leading to fewer
chance matches being reported
· Filter
§ masks sequences of low compositional complexity (SEG program
Wooton and Federhen)
§ can eliminate statistically significant but biologically
uninteresting reports from the output
§ only applied to query sequence
· Matrix
§ default = BLOSUM62
§ should be altered for short alignments because they need to
be relatively strong to rise above background.
§ Short alignments should use the PAM series
· Descriptions
§ restricts number of short descriptions of matching sequences
reported to a number specified
· Alignments
§ restricts database sequence to a number specified for which
HSPs (high scoring pairs) are reported
Just a word about other
BLAST programs
PSI-Blast
· Position specific iterative
· constructed on the fly and iteratively refined
· begins with a standard database search using a single query
sequence
· a profile is constructed from highly significant alignments
in this initial search result
· the profile is then used in a second pass search of the
database
· this is repeated as wanted with additional sequences found
in each cycle being used to refine the profile
· can lead to hits where blastp searches come with no
significant hits.
· basic algorithm details... all variations use some form of this basic algorithm
· 3 stages to algorithm:
1) compiling a list of high-scoring words
2) scanning the database for hits
3) extending those hits
1) A list of words (w-mers) is compiled for the query sequence
· the queue words are lists of residues that score at least T when compared to some word in the query sequence
· typically 50 words / residue in the queue sequence
12,500 words for a sequence of length 250
· this list of words can be compiled in time essentially proportional to the length of the list
· trade off between speed and sensitivitiy in T and w
higher T yields greater speed, but increased probability of missing weak links
2) scan database for hits on those words
· classic algorithmic problem:
“search long sequence for all occurrences of short sequences”
· find all hits in the database sequence that match a word in the query list
· 2 approaches to solving problem:
1. array -
· suppose w = 4, map each word to an integer between 1 and 204 so word can be used as an index in an array of size 204 = 160,000
· let each ith entry in the array point to the list of all occurrence in the query sequence
· as we scan the database each word leads us directly to the hits in the query string
· typically only a few thousand of the spaces in the array are used
· can be creative in the design of the array to take as little space as possible
2. finite state machine -
· signal acceptance on transitions not states
· saves space and time in construction
3) extend a hit to find a locally maximal segment pair containing that hit
· two non-overlapping word pairs on the same diagonal and within a distance of A from each other (hit)
· extends the hit in both directions
· extension considers only alignments that drop in score no more then X below the best score yet scene gives the ability to adapt the region of the graph explored to the data it sees
· if extension has score large enough it will trigger a gapped extension of the hit
· can generate gaps in alignments via a dynamic programming algorithm we have already seen, same idea as FASTA
· report resulting alignment if it has an E-score low enough to be of interest
Comparison of BLAST and FASTA
|
BLAST |
FASTA |
|
can
produce more than one HSP per database entry |
produces
only the one best alignment |
|
better
for proteins than nucleotides |
better
for nucleotides than proteins |
|
faster
than FASTA |
slower
than BLAST |
|
less
sensitive when using default settings |
more
sensitive; misses less homologs |
|
less
separation between true homologs and random hits |
more
separation between true homologs and random hits |
|
calculates
probabilities (sometimes fails entirely if some assumptions are invalid) |
calculates
significance ‘on the fly’ from the given dataset (problems if dataset is
small) |
Conclusions on General Issues
Regarding
BLAST and FASTA
· run BLAST first
because it’s a more general search, then if needed run FASTA
· use translated amino acid sequence wherever you can
· E() ,0.05 is statistically significant and usually
biologically interesting
· if query has repeated segments, remove them and repeat the
search
· use proteins for database similarity searches when possible
because:
§ code is degenerate and those with different amino acid
sequence can still code for similar protein sequences
§ more random hits with DNA than protein
§ DNA databases are much larger than protein therefore more
random hits conservation in evolution, proteins are rarely mutated
Search Examples
BLAST can be found and run at: www.ncbi.nlm.nih.gov/BLAST/