Bioinformatics Flashcards Preview

Flashcards in Bioinformatics Deck (66)
1

Define bioinformatics

the application of computers to problems in biology

3

What is the aim of bioinformatics?

Based on known protein structure and function, to enable understanding and modulation of protein function

4

How often does DNA data double?

every ~18 months

5

How often does structure data double?

every ~6 years

6

How big is the human genome?

3.2 Gbp

7

What percentage of the human genome is coding?

8

What percentage of the human genome is repeated sequences?

>50%

9

What percentage of genes have alternative splicing?

~35% by early estimates (later RNA-seq studies put the figure above 90% for human multi-exon genes)

10

What is a database?

a structured collection of data with some tool enabling it to be 'queried'

11

What is a databank?

a collection of data (normally in simple text files) without an associated query tool

12

What types of databank are there?

primary, secondary and meta-databanks

13

What is a primary databank?

- contains only sequence data (DNA or protein)
- may also hold 'feature' information (splice sites, signal sequences, disulphides, active sites, etc.)
- DNA databanks may also contain translations (known or predicted)

14

What is a meta-databank?

collections of links between databanks and databases

15

Give some examples of primary databanks

GenBank, EMBL, DDBJ, UniProtKB/Swiss-Prot, PIR, PDB, Enzyme

16

What is the implication of imperfect gene-prediction methods?

a protein identified from genome data is hypothetical until verified by experiment

17

What information is found in PDB?

structural data

18

What information is found in Enzyme?

enzyme classifications (EC numbers)

19

What is a secondary databank?

- these contain derived information
- patterns that characterise a protein family
- detailed annotation

24

What is the meaning of some characters and symbols used in PROSITE?

- the standard IUPAC one letter code for the amino acids is used
- the symbol 'x' is used for a position where any amino acid is accepted
- [ALT] stands for Ala or Leu or Thr
- {AM} stands for any amino acid except Ala and Met
- each element in a pattern is separated from its neighbour by '-'
- x(3) corresponds to x-x-x
- x(2,4) corresponds to x-x or x-x-x or x-x-x-x
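
The syntax rules above map closely onto regular expressions. The following is a minimal sketch, assuming only the elements listed on this card ('x', [..], {..}, '-' separators and repeat counts); the function name prosite_to_regex and the test sequence are hypothetical, and other PROSITE syntax, such as the '<' and '>' anchors, is not handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    # Database entries end a pattern with '.', so drop it before splitting on '-'.
    for element in pattern.strip(".").split("-"):
        # Separate an optional repeat count such as (3) or (2,4) from the element.
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."                       # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any amino acid except those listed
        # [ALT] and single letters are already valid regex syntax.
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}")
match = re.search(regex, "MKNGSAATVR")  # hypothetical test sequence
print(regex, match.group(0) if match else None)
```

Translating to a regex rather than writing a custom matcher keeps the sketch short, because [..] and repeat ranges already have direct regex equivalents.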

25

What is a dotplot?

a graphical method that allows the comparison of two biological sequences and identification of regions of close similarity between them
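
A text-mode dotplot can illustrate the idea. This is a minimal sketch, assuming the simplest rule of marking identical residues only; the function name dotplot and the two example sequences are hypothetical, and practical tools usually add a sliding window and threshold to suppress noise.

```python
def dotplot(seq_a: str, seq_b: str) -> list[str]:
    """Return one text row per residue of seq_a; '*' marks an identical residue in seq_b."""
    return ["".join("*" if a == b else "." for b in seq_b) for a in seq_a]


# Regions of similarity show up as diagonal runs of '*'.
for row in dotplot("GATTACA", "GATCACA"):
    print(row)
```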

26

What is a similarity measure?

a score for how alike two aligned residues are; similarity matrices tabulate these scores and are used to align sequences of nucleic acids or amino acids

27

In the simplest scoring scheme, what is scored for a match, and what for a mismatch?

1 for a match; 0 for a mismatch

28

What is an example of a more complex scoring system?

a more complicated matrix would give a higher score to transitions (pyrimidine to pyrimidine or purine to purine) than to transversions (pyrimidine to purine or vice versa); the match/mismatch ratio of the matrix sets the target evolutionary distance
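
A scoring function of this kind might look like the sketch below. The function name and the numeric scores (1, -1, -2) are illustrative assumptions, not values from the course; the only property taken from the card is that a transition scores higher than a transversion.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(a: str, b: str, match: int = 1,
                       transition: int = -1, transversion: int = -2) -> int:
    """Score one aligned pair of DNA bases (default scores are hypothetical)."""
    if a == b:
        return match
    # A transition stays within one chemical class: purine<->purine or pyrimidine<->pyrimidine.
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_class else transversion


print(substitution_score("A", "G"), substitution_score("A", "C"))
```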

29

What is the Needleman-Wunsch algorithm?

an algorithm used to align protein or nucleotide sequences; one of the first applications of dynamic programming to compare biological sequences
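
The dynamic-programming idea can be sketched as follows. This is a minimal global-alignment example, assuming the 1/0 match/mismatch scores from the earlier card and a hypothetical linear gap penalty of -1; the function name and parameters are illustrative, and real implementations normally use a substitution matrix and affine gap penalties.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = 0, gap: int = -1):
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
```

Filling the whole table costs time proportional to the product of the two sequence lengths, which is why faster approximate methods are used for database searches.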

30

When were similarity measures first introduced?

by Dayhoff in 1978 (the PAM matrices)

31

When were similarity measures improved?

by Henikoff and Henikoff in 1992, with the BLOSUM matrices

32

When was dynamic programming introduced for sequence alignment?

by Needleman and Wunsch (global alignment) in 1970; extended to local alignment by Smith and Waterman in 1981

37

Give some examples of secondary databanks

PROSITE, PRINTS, BLOCKS, INTERPRO

38

What is an algorithm?

A complete and precise set of steps that will solve a problem and achieve an identical result whenever given the same set of data to a defined level of accuracy.

39

What is PROSITE?

PROSITE is a protein database consisting of entries describing the protein families, domains and functional sites, as well as amino acid patterns and profiles in them.

40

What is the PROSITE pattern for a protein kinase C phosphorylation site?

[ST]-x-[RK]

41

What is the PROSITE pattern for N-linked glycosylation?

N-{P}-[ST]-{P}

42

What is the PROSITE pattern for the Kringle domain?

[FY]-C-[RH]-[NS]-x(7,8)-[WY]-C

45

Define annotation

attaching biological information (e.g. gene locations and functions) to sequence data; a subfield of genome analysis, which itself covers anything that can be done with genome sequences by computational means

46

Why might gene-prediction methods be imperfect?

- a coding region may be missed
- an incomplete protein may be reported
- splicing may be predicted incorrectly
- coding regions may overlap
- exon assembly (splicing) may be different in different tissues
- some apparent coding sequences may be defective or not expressed

48

What is meant by heuristics?

approximate fast methods

49

What does a heuristic database search entail?

- index the database by finding locations of short 'words'
- take 'words' from the probe sequence and look them up in the index
- look for multiple matches and extend to find likely hits to full alignment
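
The first two steps above, plus the counting of multiple matches, can be sketched as below. The function names, the word length k=3 and the toy database are assumptions for illustration; the final extension to a full alignment is left out, and no particular search tool's implementation is implied.

```python
from collections import defaultdict


def build_word_index(database: dict[str, str], k: int = 3) -> dict[str, list[tuple[str, int]]]:
    """Index every k-letter word in the database by (sequence name, position)."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def seed_hits(probe: str, index: dict, k: int = 3) -> dict[tuple[str, int], int]:
    """Count word matches per (sequence, diagonal); several on one diagonal suggest a likely hit."""
    diagonals = defaultdict(int)
    for q in range(len(probe) - k + 1):
        for name, pos in index.get(probe[q:q + k], []):
            # Matches that share the offset (pos - q) lie on the same diagonal.
            diagonals[(name, pos - q)] += 1
    return dict(diagonals)


database = {"s1": "AAACGTACGTTT", "s2": "GGGGGGGG"}  # hypothetical toy database
print(seed_hits("CGTACG", build_word_index(database)))
```

Only the diagonals with several word matches would then be passed to a slower, exact alignment step.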

50

How is DNA sequenced?

by the Sanger method: di-deoxy chain termination

51

How does this apply to sequencing entire genomes?

each segment

56

What is fragment assembly?

aligning and merging fragments from a longer DNA sequence in order to reconstruct the original sequence
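
One simple way to picture this is a greedy merge of the best-overlapping pair of fragments. The sketch below assumes error-free reads from a single strand and a hypothetical minimum overlap of 3 bases; the function names and fragments are illustrative, and real assemblers must also handle sequencing errors, reverse complements and repeats.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of contigs with the longest suffix-prefix overlap."""
    contigs = list(fragments)
    while len(contigs) > 1:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]
    return contigs


print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))  # hypothetical fragments
```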

58

What are two approaches to finding genes in eukaryotic sequences?

1. detect similarity with known coding regions
2. ab initio methods; make predictions based on typical features

59

What are ESTs?

expressed sequence tags; short subsequences of a cDNA sequence, used to identify gene transcripts and instrumental in gene discovery and in gene-sequence determination

60

What are some typical features used in ab initio methods?

- initial 5' exon (begins at the transcription start point, with an upstream promoter; ends immediately before a GT splice signal)
- internal exons (begin after an AG splice signal; end before a GT splice signal)
- final 3' exon (begins after an AG splice signal; ends with a stop codon and poly-A tail)
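
The GT and AG signals above can be located with a simple scan. This is a minimal sketch with a hypothetical function name and test sequence; most such dinucleotides are not real splice sites, so a gene finder would combine these candidates with other evidence rather than use them directly.

```python
def splice_signal_candidates(dna: str) -> dict[str, list[int]]:
    """Report 0-based positions of GT (candidate donor) and AG (candidate acceptor) dinucleotides."""
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return {"donor_GT": donors, "acceptor_AG": acceptors}


print(splice_signal_candidates("ATGGTAAGCAGGT"))  # hypothetical sequence
```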

61

How do computers deal with this information?

machine learning methods; a general class of computer software which learns from examples and is then able to make predictions

62

What are some examples of these methods?

- artificial neural networks
- support vector machines
- decision trees
- naive Bayesian classifiers

63

What are artificial neural networks (ANNs)?

- family of models inspired by biological neural networks
- used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown

65

Explain data quality

the quality of raw data is only as good as the methods that produce it

the quality of annotations is only as good as the curators who write them
