Sequence Similarity Searching Flashcards Preview

Flashcards in Sequence Similarity Searching Deck (112):
1

What is database structure determined by?

The requirements of designers/users

2

Complete this statement: databases can be local or...?

Remote

3

Complete this statement: querying can be manual or...?

Automated

4

What must providers such as NCBI/EBI balance across users?

Demand on computation resources

5

What does sequence similarity in DNA/proteins suggest?

Common ancestry

6

What might common ancestry imply?

Common function

7

What is the name given to homologs separated by a speciation event?

Orthologs

8

What is the name given to homologs separated by a duplication event?

Paralogs

9

Paralogs and orthologs are two types of homologous sequence, true or false?

True

10

What does the alignment or equivalencing of bases enable?

Maximisation of similarity

11

What could a database query look like?

Could simply be a sequence (DNA/protein)
Could be a logical structure, e.g. human + mitochondrial + HVS2

12

Why do sequence databases require specialised search tools?

Due to size and similarity

13

Is quantification of biological similarity easy or difficult?

Can be difficult

14

What can searching sequence databases for similar sequences predict about novel sequences?

Possible functions

15

What can alignments of sequences contain?

Mismatches and gaps

16

How are mismatches and gaps interpreted in sequence alignments?

As substitutions and indels respectively

17

What do alignment algorithms ideally try to identify about sequences?

The most likely evolutionary 'path' between sequences

18

What are databases?

Searchable collections of information

19

What does how we search databases depend on?

Database access, design and location

20

What does the quantification of sequence similarity require?

Alignment

21

What is the constant gap penalty?

Opening a gap of any size attracts a constant (a) negative score
= -a

22

What is the proportional gap penalty?

Opening a gap attracts a penalty proportional to its length (L)
= -(aL)

23

What is the affine gap penalty?

Opening a gap attracts a constant penalty (a); extending it attracts a further penalty (b) for each position, so that part is proportional to the gap's length (L)
= -(a+bL) where a>>b
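
The three gap penalty schemes on the cards above can be compared side by side. This is a minimal sketch: the function names and the default values of a and b are illustrative assumptions, not values taken from any particular program.

```python
def constant_gap_penalty(length, a=10):
    """Any gap scores -a, regardless of its length."""
    return -a if length > 0 else 0


def proportional_gap_penalty(length, a=2):
    """Every gapped position costs a, so a gap of length L scores -(a*L)."""
    return -(a * length)


def affine_gap_penalty(length, a=10, b=1):
    """Opening costs a and each gapped position costs b: -(a + b*L), with a >> b."""
    return -(a + b * length) if length > 0 else 0


# Compare how each scheme treats short and long gaps.
for gap_length in (1, 5, 20):
    print(
        gap_length,
        constant_gap_penalty(gap_length),
        proportional_gap_penalty(gap_length),
        affine_gap_penalty(gap_length),
    )
```

With these assumed values, the affine scheme makes opening a gap expensive but lengthening an existing gap cheap, which is why one long gap is preferred over many short ones.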

24

What type of gap penalty is generally the most relevant biologically?

Affine

25

What does the choice of gap penalty depend on?

Software

26

What do amino acid side chains share?

Chemical properties (acidic/basic etc.)

27

What is the accepted theory about amino acid substitutions?

Chemically similar amino acids substitute more readily than chemically dissimilar amino acids

28

What is 'built into' amino acid substitution matrices?

Physico-chemical classification of amino acids

29

In the PAM250 (accepted point mutation) substitution matrix, what do similar amino acids score?

+ve score

30

In the PAM250 (accepted point mutation) substitution matrix, what do dissimilar amino acids score?

-ve score

31

What is the PAM250 (accepted point mutation) substitution matrix based on alignments of?

Closely related proteins

32

What are accepted point mutation substitution matrices extrapolated to?

Large (PAM120) and very large (PAM250) evolutionary distances

33

What is BLOSUM62 (blocks substitution matrix) based on alignments of?

Gap free alignments of short protein motifs (blocks)

34

What do the numbers represent in BLOSUM62 (blocks substitution matrix)?

The level of identity at which sequences in the alignments were clustered (BLOSUM62 = 62%)

35

BLOSUM is continually updated, true or false?

False, it is no longer updated but is still widely used

36

BLOSUM62 (blocks substitution matrix) has no extrapolation. What does this mean for distant relationships?

More reliable

37

What is the default amino acid substitution matrix of BLAST?

BLOSUM62 (blocks substitution matrix)

38

What are the types of BLAST searches and their uses?

Nucleotide query versus nucleotide database (blastn), i.e. what gene is this?
Protein query versus protein database (blastp), i.e. what protein is this?
Translated nucleotide query versus protein database (blastx), i.e. does this DNA sequence code for a known protein?
Protein query versus translated nucleotide database (tblastn), i.e. can we identify a DNA sequence that might encode this protein?

39

What are the 4 sections in the results page of a BLAST search?

Search information (including RID)
Graphical summary (conserved domain search)
Results table (hyperlinked to alignments)
Alignments (download links)

40

Databases contain more information than can be searched practically by observation, true or false?

True

41

Most databases are relational. What does this mean?

The data are organised into tables with defined inter-relationships

42

Does the manual querying of remote databases require specialist knowledge?

Little

43

What might automated querying of local databases enable?

Greater throughput and flexibility

44

Cheaper hardware for databases can increase locally available resources but what does it also make quite costly?

Administration

45

What do highly similar (>80%) sequences probably share?

A common ancestor, so they are probably homologous

46

Why is it necessary to quantify similarity of sequences?

To distinguish between 'chance' and 'real' similarity (common ancestry)

47

What are alignments generally interpreted in terms of?

An explicit model of molecular evolution (substitutions and indels)

48

Sequences are either homologous or not, true or false?

True, i.e. they either share a common ancestor or they do not

49

Any two sequences can show similarity simply by 'chance', true or false?

True

50

Is the alignment of pairs of long sequences computationally easy or intensive?

Intensive due to a large number of equivalences

51

What do dynamic programming algorithms (DPAs) theoretically allow?

Exhaustive identification of optimal alignments

52

Are DPA methods slow or fast for searching large databases?

Often considered too slow

53

What can DPA methods find?

Optimal global or local alignments

54

Give an example of a global algorithm

Needleman-Wunsch

55

What do global algorithms try to find?

Similarity over the whole sequence
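
A global alignment score can be computed with a small dynamic programming table. This is a minimal sketch in the style of Needleman-Wunsch, assuming a simple match/mismatch score and a proportional gap penalty; the function name and the scoring values are illustrative assumptions, and only the score is returned, not the alignment itself.

```python
def needleman_wunsch_score(s, t, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of s and t."""
    rows, cols = len(s) + 1, len(t) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per position.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            gap_in_t = score[i - 1][j] + gap  # s[i-1] aligned to a gap
            gap_in_s = score[i][j - 1] + gap  # t[j-1] aligned to a gap
            score[i][j] = max(diagonal, gap_in_t, gap_in_s)

    # The bottom-right cell covers both sequences end to end.
    return score[-1][-1]


print(needleman_wunsch_score("GATTACA", "GATACA"))
```

Because every cell is filled, the work grows with the product of the two sequence lengths, which is the cost the later cards describe as too slow for large databases.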

56

Give an example of a local algorithm

Smith-Waterman

57

What do local alignments try to find?

Just local regions
They do not try to align the whole sequence
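
A local alignment score can be sketched in the style of Smith-Waterman. The difference from a global table is that no cell may fall below zero and the best cell anywhere in the table is reported. The function name and scoring values here are illustrative assumptions, and a proportional gap penalty is used for brevity.

```python
def smith_waterman_score(s, t, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between s and t."""
    rows, cols = len(s) + 1, len(t) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0

    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # Flooring at zero lets an alignment restart anywhere.
            score[i][j] = max(0, diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            best = max(best, score[i][j])

    return best


# The shared GATTACA region is found despite unrelated flanking sequence.
print(smith_waterman_score("TTTTGATTACATTTT", "CCCGATTACACCC"))
```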

58

Which is the most biologically relevant, global or local alignment?

Local alignment

59

Different algorithms will give similar alignments of the same sequences, true or false?

False, different algorithms can give very different alignments of the same sequences

60

What are the pros and cons of exhaustive algorithms to find optimal alignments?

Accurate but slow

61

Genes with shared common ancestors are homologous and so may be what?

Very similar in sequence

62

Do exhaustive mathematical algorithms find the alignment with the maximum or minimum similarity?

Maximum

63

What does distinguishing between 'real' and 'chance' alignments require?

The comparison of some measure of similarity

64

What do scoring procedures need to take into account?

Biology

65

How can we measure similarity of sequences?

By scoring the alignment

66

What do E-values enable?

They enable comparison between searches and provide an objective threshold for significance

67

What can protein sequences identify?

More distant evolutionary relationships (depending on scoring matrix)

68

What can DNA searches find?

Evolutionarily close relationships

69

BLAST is fast, but what might it miss?

Optimal alignments

70

DNA and protein databases can only be searched effectively using what algorithms?

Heuristic algorithms (speed)

71

How can we compare results when using different methods for alignment?

Using p values and E values

72

What does E value stand for?

The expect value

73

What do E values indicate?

How often a match at a given p value would be expected to occur in the database, by chance (i.e. when the sequences are unrelated)

74

Should 'real' alignments between sequences that do share a common ancestor show more or less similarity than 'chance' alignments of sequences that do not?

Greater similarity

75

What are the pros and cons of maximisation scores?

They ensure the best alignment is found, but the raw scores are not comparable between alignments

76

Are default scores and comparing alignments good to use?

Generally good, but experimentation is possible

77

What does the percentage of identically aligned residues allow?

Comparison of different alignments

78

How is the percentage of identically aligned residues calculated?

The number of matches divided by the length of the alignment, multiplied by 100
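
The percentage identity calculation can be written out for a pair of already aligned sequences. This is a minimal sketch: the function name is hypothetical, gaps are assumed to be written as "-", and the alignment length is taken to include gapped columns.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns holding the same residue in both rows."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")

    # A column counts as a match only if both residues are equal and not gaps.
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


print(round(percent_identity("GATTACA", "GACTA-A"), 1))
```

Other conventions for the denominator exist (for example, the length of the shorter sequence), so percentages are only comparable when the same convention is used throughout.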

79

When scoring alignments, what is the assumption about nucleotides (G, A, T, C)?

They substitute equally for one another

80

What type of selection are protein sequences under for structure and function?

Stabilising selection

81

Are changes between chemically similar amino acids more or less likely to be deleterious?

Less likely

82

When scoring protein alignments, particular substitutions can have different scores. What does this depend upon?

Their chemical similarity, e.g. LEU to ILE (similar) versus PRO to TRP (dissimilar)
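
The two substitutions named on this card can be looked up in a substitution matrix. The sketch below uses a six-entry excerpt with values as they appear in BLOSUM62; the dictionary and function names are hypothetical, and a real search would load the full 20 x 20 matrix from a published copy rather than a hand-typed excerpt.

```python
# Excerpt only: keys are residue pairs in alphabetical order (one-letter codes).
BLOSUM62_EXCERPT = {
    ("I", "I"): 4,
    ("L", "L"): 4,
    ("P", "P"): 7,
    ("W", "W"): 11,
    ("I", "L"): 2,   # chemically similar: both aliphatic and hydrophobic
    ("P", "W"): -4,  # chemically dissimilar
}


def substitution_score(a, b, matrix=BLOSUM62_EXCERPT):
    """Look up a symmetric substitution score for residues a and b."""
    key = tuple(sorted((a, b)))
    return matrix[key]


print(substitution_score("L", "I"), substitution_score("P", "W"))
```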

83

Almost any 2 sequences can be aligned by using what?

Gaps (indels)

84

How is the quality of an alignment assessed?

Using a scoring matrix
(e.g. matches are +ve, mismatches are 0 or -ve, gaps are -ve)

85

What do heavy gap penalties help the discovery of?

Biologically meaningful alignments

86

What do gaps in a query represent?

Insertion/deletion events, relative to ancestor

87

How are mismatches (substitutions) in an alignment scored?

Equivalently, i.e. every mismatch receives the same score
(e.g. G>T = G>C = -1)
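
Scoring a nucleotide alignment in which all mismatches are equivalent can be sketched as follows. The mismatch score of -1 follows the card; the match score of +1, the gap score of -2 per position, and the function name are illustrative assumptions.

```python
def score_alignment(aligned_a, aligned_b, match=1, mismatch=-1, gap=-2):
    """Sum column scores for two aligned sequences, with '-' marking a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    total = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            total += gap        # indel
        elif x == y:
            total += match      # identical bases
        else:
            total += mismatch   # any substitution scores the same
    return total


print(score_alignment("GATTACA", "GACTA-A"))
```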

88

Algorithms will introduce indels to maximise the alignment score, but what does biology indicate about indels?

That they should be rare

89

What are the pros and cons of heuristic algorithms?

They are not guaranteed to find the best alignment but are much faster (by 5-20x)

90

What is the main problem with DPAs?

They are too slow for large databases

91

What does sequence database searching involve?

Aligning query to all database sequences and ranking 'hits' by similarity/quality
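
The search-and-rank step can be illustrated with a toy database. This sketch does not perform a real alignment: it assumes the number of shared three-letter words as a stand-in similarity score, and the function names and sequences are hypothetical.

```python
def shared_word_count(query, subject, w=3):
    """Stand-in similarity score: number of distinct words of length w in common."""
    query_words = {query[i:i + w] for i in range(len(query) - w + 1)}
    subject_words = {subject[i:i + w] for i in range(len(subject) - w + 1)}
    return len(query_words & subject_words)


def rank_hits(query, database, w=3):
    """Score the query against every database sequence and sort best first."""
    scored = [(name, shared_word_count(query, seq, w)) for name, seq in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


toy_database = {
    "seq1": "GATTACAGATTACA",
    "seq2": "CCCCCCCC",
    "seq3": "TTGATTTT",
}
for name, score in rank_hits("GATTACA", toy_database):
    print(name, score)
```

In a real search the stand-in score would be replaced by an alignment score, and the ranking would normally be reported with E values.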

92

What is the main advantage to using DPAs?

They are guaranteed to find the highest scoring alignment (although this may not reflect biological reality)

93

What does alignment allow?

Quantification of similarity between sequences

94

What does the statistical theory of alignment enable?

P values to be calculated

95

What is a p value?

The probability of observing an alignment that scores at least as high between two unrelated sequences of similar length and composition

96

What p value indicates a significant match?

p<0.05

97

Why are E values greater than 0.01 unlikely to reflect real matches?

Because a match at least as good would be expected to occur by chance at least 1% of the time, even for unrelated sequences

98

E values are always calculated in the same way, true or false?

False

99

How are E values calculated for BLAST?

E = pX
where X is the total length of all the sequences in the database (the database is treated as one large sequence)

100

How are E values calculated for FASTA?

E = pN
where N is the number of sequences in the database
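
The two simplified E value formulas on these cards can be applied to the same p value to show why the results differ. This is a sketch of the cards' formulas only; the function names and the toy database are assumptions, and the real programs use more detailed statistics.

```python
def blast_style_e_value(p, total_database_length):
    """E = pX, where X is the summed length of all database sequences."""
    return p * total_database_length


def fasta_style_e_value(p, number_of_sequences):
    """E = pN, where N is the number of database sequences."""
    return p * number_of_sequences


# Toy database: three sequences of lengths 100, 100 and 300.
toy_database = ["ACGT" * 25, "AC" * 50, "A" * 300]
X = sum(len(seq) for seq in toy_database)
N = len(toy_database)

p = 1e-4
print(blast_style_e_value(p, X), fasta_style_e_value(p, N))
```

The same p value gives different E values under the two formulas, so E values from different programs should not be compared directly.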

101

What is the default word size (W) for BLAST?

3 for proteins
11 for DNA

102

BLAST searches only for word matches above what?

A threshold, T

103

What did early versions of BLAST only allow for?

Ungapped alignments (but T allows mismatches)

104

What happens to high-scoring segment pairs (HSPs) along the BLAST query?

They are reported and ordered by score

105

What do current versions of BLAST allow which improves its sensitivity?

Gapped alignments

106

In BLAST searches, all matches above T are extended until what?

The introduction of gaps causes the alignment score to fall quickly
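
The extension step can be sketched in a simplified, ungapped form: starting from an exact word hit, the match is extended one position at a time and stopped once the running score has fallen too far below the best score seen. The function name, the scores and the `drop_off` parameter are illustrative assumptions, and only rightward extension is shown.

```python
def extend_right(query, subject, qpos, spos, w=3, match=1, mismatch=-1, drop_off=2):
    """Extend an exact word hit to the right without gaps.

    Returns (best_score, best_length), counted from the start of the word.
    """
    score = best = w * match      # the seed word itself is all matches
    length = best_length = w
    i, j = qpos + w, spos + w

    while i < len(query) and j < len(subject):
        score += match if query[i] == subject[j] else mismatch
        length += 1
        if score > best:
            best, best_length = score, length
        elif best - score > drop_off:
            break                 # score has fallen too far: stop extending
        i += 1
        j += 1

    return best, best_length


print(extend_right("ACGTACGTTTTT", "ACGTACGAAAAA", 0, 0))
```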

107

Alignments are scored in order to quantify what?

Similarity

108

What are the most common heuristic algorithms?

FASTA and BLAST

109

Initial alignment hits are examined to see if they can be what?

Extended

110

Generally speaking, what do heuristic methods do?

Break the query into short 'words' and look for matches above a threshold level in the subject database
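
Breaking a query into words and looking them up in a subject sequence can be sketched as below. For simplicity this assumes exact word matches only, so the scoring threshold is left out; the function names and sequences are hypothetical.

```python
from collections import defaultdict


def split_into_words(sequence, w=3):
    """Return every overlapping word of length w, in order."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def build_word_index(subject, w=3):
    """Map each word in the subject to the positions where it starts."""
    index = defaultdict(list)
    for position, word in enumerate(split_into_words(subject, w)):
        index[word].append(position)
    return index


def find_seed_hits(query, subject, w=3):
    """Return (query_position, subject_position) for every exact word match."""
    index = build_word_index(subject, w)
    hits = []
    for qpos, word in enumerate(split_into_words(query, w)):
        for spos in index.get(word, []):
            hits.append((qpos, spos))
    return hits


print(split_into_words("MKVLA"))
print(find_seed_hits("MKVLA", "GGKVLAGG"))
```

Indexing the subject once means each query word is found by a dictionary lookup rather than by scanning the sequence, which is where much of the speed of heuristic methods comes from.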

111

What do DPAs find?

A computationally optimal solution to the alignment problem

112

What do heuristic methods assume?

That high scoring alignments contain short regions of exact matches