8. BLAST Flashcards

Question

BLAST: Evaluation. Which scores need to be evaluated? What scores does BLAST compare them to, and how are these obtained?

Answer 1

Each HSP (high-scoring segment pair) has an associated score, but how good is this score? Is it significantly better than the score expected for random sequences?
--> We need to compare the score of an HSP with the scores of random sequences of equal length and composition. We could:
- evaluate empirically
- evaluate analytically
The empirical evaluation is not done in practice!

Answer 2

Derive the score distribution theoretically:
* the best scores follow a Gumbel extreme value distribution
* use it to compute the probability of obtaining, by chance, a score equal to or greater than a given score
* Karlin & Altschul, 1990, PNAS
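The link between the expected number of chance hits and this probability can be sketched in a few lines. This is a minimal illustration, assuming the standard Karlin-Altschul relation P = 1 - e^(-E), where E is the expected number of chance HSPs with a score of at least S; the function name `p_value_from_e` is hypothetical and not part of any BLAST tool.

```python
import math


def p_value_from_e(e_value):
    """Probability of at least one chance HSP with score >= S, given the expected count E.

    Assumption: the number of chance HSPs is Poisson-distributed with mean E.
    """
    if e_value < 0:
        raise ValueError("E must be non-negative")
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


# For small E the two quantities are nearly identical; for large E, P saturates at 1.
for e in (1e-10, 0.01, 1.0, 10.0):
    print(f"E = {e:g}  ->  P = {p_value_from_e(e):.6g}")
```

For E well below 1, P and E are practically the same number, which is one reason the E-value alone is usually reported.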

Answer 3

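S = ∑ s_ij (sum of the substitution scores over all aligned residue pairs; gap penalties enter as negative terms)
Parameters, e.g.:
- a substitution matrix (e.g. BLOSUM62)
- a gap opening penalty (e.g. -12)
- a gap extension penalty (e.g. -1)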
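As a rough sketch of how such a raw score is summed over an alignment, the example below scores two already-aligned strings. It is an illustration under stated assumptions: a toy match/mismatch scheme (+2 / -1) stands in for a real substitution matrix such as BLOSUM62, and a gap of length k is charged `gap_open + k * gap_extend`, which is only one of several affine-gap conventions. The function name `raw_score` is hypothetical.

```python
def raw_score(aln_a, aln_b, match=2, mismatch=-1, gap_open=-12, gap_extend=-1):
    """Sum the column scores of two aligned strings of equal length ('-' marks a gap).

    Assumptions: toy match/mismatch scores instead of a substitution matrix,
    and a gap of length k costs gap_open + k * gap_extend.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned strings must have equal length")
    score = 0
    gap_in = None  # which sequence the current gap run is in: "a", "b", or None
    for a, b in zip(aln_a, aln_b):
        if a == "-" or b == "-":
            current = "a" if a == "-" else "b"
            if current != gap_in:
                score += gap_open  # a new gap run starts here
            score += gap_extend    # every gap column pays the extension penalty
            gap_in = current
        else:
            score += match if a == b else mismatch
            gap_in = None
    return score


print(raw_score("ACGT", "ACGA"))      # three matches, one mismatch
print(raw_score("ACGTTA", "ACG--A"))  # four matches, one gap of length 2
```

With these penalties a single two-column gap outweighs all four matches, which shows how strongly the gap parameters shape the raw score.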

Answer 4

S' = (λS − ln K) / ln 2
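The conversion from raw score to bit score can be sketched as follows. The values used for λ and K are illustrative placeholders and are not taken from these cards; in practice both depend on the scoring matrix and gap penalties. The function name `bit_score` is hypothetical.

```python
import math


def bit_score(raw_score, lam, K):
    """Normalize a raw score S to a bit score S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


# Placeholder parameters for illustration only (assumed, not from the cards).
LAMBDA, K_PARAM = 0.267, 0.041

for s in (30, 50, 100):
    print(f"raw score {s:>3} -> bit score {bit_score(s, LAMBDA, K_PARAM):.1f}")
```

Because λ and K are absorbed into S', bit scores from searches with different scoring systems can be compared, which is not true for raw scores.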

Answer 5

E = K m n e^{-λS}

Answer 6

E = m n 2^{-S'}
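The raw-score and bit-score forms of the E-value are the same quantity: with S' = (λS − ln K) / ln 2, the factor 2^{-S'} equals e^{-(λS − ln K)} = K e^{-λS}. The sketch below checks this numerically. All parameter values are illustrative assumptions, and the function names are hypothetical.

```python
import math


def bit_score(raw_score, lam, K):
    """S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value_from_raw(raw_score, m, n, lam, K):
    """E = K * m * n * exp(-lam * S), with m = query length, n = database size."""
    return K * m * n * math.exp(-lam * raw_score)


def e_value_from_bits(bits, m, n):
    """E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


# Illustrative values only (assumed): query of 250 residues, database of 5e7 residues.
m, n, lam, K = 250, 5e7, 0.267, 0.041
S = 60
e_raw = e_value_from_raw(S, m, n, lam, K)
e_bits = e_value_from_bits(bit_score(S, lam, K), m, n)
print(e_raw, e_bits)
print(math.isclose(e_raw, e_bits, rel_tol=1e-9))
```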

Answer 7

raw score, bit score, and E-value

Answer 8

normalizing factor for the scoring system

Answer 9

The K parameter is a scaling factor for the search space, i.e. for the product of query length (m) and database size (n). It is needed because alignments starting at different places in the two sequences may be highly correlated.

Answer 10

Expect value (E):
* under comparable conditions: the expected number of matches found by chance with S' ≥ S'obs
* can be much smaller or greater than 1
* there may be many or no matches with E << 1, depending on the homologs in the database
The smaller the E-value, the better the match.

Answer 11

depends on:
- alignment score
- length of query (m)
- size of database (n)

Answer 12

E-values cannot be compared directly across searches of different databases, because E depends on the size of the database (n).

Answer 13

* can be much smaller or greater than 1
* there may be many or no matches with E << 1, depending on the homologs in the database

Answer 14

n (size of the database):
if n increases, the E-value increases (worse E-value)
--> the same sequence hit gets a better E-value when it is found in a smaller database
Why: large databases increase the chance of false positive hits; the E-value corrects for this higher chance.
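The effect of database size can be made concrete with a short sketch that evaluates E = K m n e^{-λS} for the same hit in databases of different sizes. The default values for λ and K, the query length, and the score are illustrative assumptions; `expected_chance_hits` is a hypothetical helper name.

```python
import math


def expected_chance_hits(raw_score, m, n, lam=0.267, K=0.041):
    """E = K * m * n * exp(-lam * S); lam and K defaults are placeholders."""
    return K * m * n * math.exp(-lam * raw_score)


# Same query (m = 300) and same raw score (S = 80), three database sizes.
for n in (1e6, 1e8, 1e10):
    print(f"n = {n:.0e}  ->  E = {expected_chance_hits(80, 300, n):.3g}")
```

Since E is linear in n, a database that is 100 times larger gives an E-value that is 100 times larger for an identical alignment.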

Answer 15

m (length of the query):
according to the equation for E, if m increases, E increases.
However, a short identical match may still have a high E-value and may be regarded as a "false positive" hit, because a short alignment can only reach a low score. This is often seen when searching for short primer regions, small domain regions, etc.

Answer 16

Low-complexity regions (LCRs):
- alignment statistics require that symbols occur randomly in strings
- long substrings of one or a few symbols violate this assumption

Answer 17

in homology search to improve sensitivity and specificity and to avoid misinterpretation / artifacts

Answer 18

Masking low-complexity regions: sequences are pre-processed to identify LCRs. LCRs are masked -->
* they do not contribute to the alignment or the score
* they appear as X's in the BLAST alignment
Tools:
- DUST for DNA
- SEG for protein sequences
Other types of repeats may also be masked by BLAST.
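The idea of masking can be illustrated with a deliberately simple sketch. It is not the SEG or DUST algorithm: it assumes a sliding window scored by Shannon entropy, with a window length and entropy cutoff chosen arbitrarily for the example. Every window below the cutoff is replaced by X's. The names `shannon_entropy` and `mask_low_complexity` are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits per symbol) of the symbol composition of a string."""
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in Counter(window).values())


def mask_low_complexity(seq, window=12, min_entropy=2.2, mask_char="X"):
    """Replace every window whose entropy is below min_entropy with mask_char.

    Toy stand-in for a low-complexity filter; window and min_entropy are
    arbitrary example values, not the defaults of any real tool.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


print(mask_low_complexity("MKVLAAGIVGLSSSSSSSSSSSSSSSSPQRTWYHDE"))
```

Masked positions then take no part in seeding or scoring, so a run of a single residue cannot produce a high-scoring but biologically meaningless hit.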

Answer 19

Smaller word size: --> requires exact matches of shorter subsequences only.
Lower T-value (neighborhood word threshold T): a lower value of T increases the probability of finding weak similarities. Decreasing T --> the number of neighboring words increases.
Higher E-value threshold: allows matches with lower significance to be reported (however, this may also increase false-positive matches).
Relaxed gap penalties: gaps are opened or extended more easily, allowing more flexible alignments (however, possibly more false positives).
Different substitution matrix: choosing a less specific scoring matrix (e.g. BLOSUM with a lower number) --> better reflects the evolutionary divergence between the sequences.
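How the threshold T controls the number of neighboring words can be sketched with a toy example. The assumptions are loud ones: a four-letter alphabet and a match/mismatch score (+4 / -1) replace the 20 amino acids and a real substitution matrix, and the neighborhood is found by brute-force enumeration rather than by the method BLAST actually uses. The names `toy_score` and `neighborhood` are hypothetical.

```python
from itertools import product


def toy_score(a, b):
    """Toy substitution score (assumed): +4 for identity, -1 otherwise."""
    return 4 if a == b else -1


def neighborhood(word, alphabet, T, score=toy_score):
    """All words of the same length whose summed pairwise score against `word` is >= T."""
    return [
        "".join(candidate)
        for candidate in product(alphabet, repeat=len(word))
        if sum(score(a, b) for a, b in zip(word, candidate)) >= T
    ]


# Lowering T admits words with more mismatches, so the neighborhood grows.
for T in (12, 7, 2):
    print(f"T = {T:>2}: {len(neighborhood('ACD', 'ACDE', T))} neighborhood words")
```

More neighborhood words mean more seed hits to extend, which is why a lower T raises sensitivity and lowers speed at the same time.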

Answer 20

Larger word size: requires longer exact matches, i.e. allows fewer mismatches in the seed (however, may reduce sensitivity).
Lower E-value threshold: filters out matches with lower significance (may reduce sensitivity).
Scoring matrix: choosing a more specific scoring matrix (e.g. BLOSUM with a higher number).
Gap penalties: gaps are opened or extended less easily, giving stricter alignments.

Answer 21

Decrease the sensitivity with the following parameters:
- Word size: ??
- E-value threshold: lower
- Database size: smaller, or prefilter
- Search space size: increase the neighborhood word threshold T --> the number of neighboring words drops and thus limits the search space

Answer 22

Parameters: word size, T-value, E-value threshold, gap penalties, substitution matrix.

Word size:
- increase sensitivity: smaller
- increase specificity: larger
- increase speed: larger

T-value:
- increase sensitivity: lower
- increase specificity: higher
- increase speed: higher

Gap penalties:
- increase sensitivity: lower
- increase specificity: higher
- increase speed: higher

Different substitution matrix:
- increase sensitivity: e.g. BLOSUM with a lower number
- increase specificity: BLOSUM with a higher number
- increase speed: BLOSUM with a higher number

E-value threshold:
- increase sensitivity: higher
- increase specificity: lower
- increase speed: lower

Answer 23

Word size:
- increase sensitivity: smaller
- increase specificity: larger
- increase speed: larger

Answer 24

T-value:
- increase sensitivity: lower
- increase specificity: higher
- increase speed: higher

Answer 25

E-value threshold:
- increase sensitivity: higher
- increase specificity: lower
- increase speed: lower

Answer 26

Gap penalties:
- increase sensitivity: lower
- increase specificity: higher
- increase speed: higher

Answer 27

Different substitution matrix:
- increase sensitivity: e.g. BLOSUM with a lower number
- increase specificity: BLOSUM with a higher number
- increase speed: BLOSUM with a higher number

(52 cards)