Lecture 3 - Probability and Stats for Sequence Analysis Flashcards

1
Q

Why would you want to identify the GC composition of a human genome?

A

GC composition is the fraction of G and C bases in the genome. GC-rich DNA is often hypermethylated, and because G-C pairs form more hydrogen bonds (three, versus two for A-T), a higher GC content raises the melting point of DNA.
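
A minimal sketch of computing the GC fraction of a sequence. The function name `gc_fraction` and the example string are illustrative assumptions, not from the lecture.

```python
def gc_fraction(seq: str) -> float:
    """Return the fraction of bases in seq that are G or C (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence is empty")
    gc = sum(1 for base in seq if base in "GC")
    return gc / len(seq)


# Hypothetical 8-base example: G, C, G, C are 4 of the 8 bases.
print(gc_fraction("ATGCGCTA"))
```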

2
Q

Given the GC composition of a list of possible source genomes and a sequence read, can you identify which is the most likely genome the read was sequenced from?

A
  1. encode each genome and the read numerically (for example, as GC fractions)
  2. perform a statistical test to find which genome's composition best matches the read
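
One possible way to sketch the two steps above, assuming each base of the read is independently G/C with probability equal to the genome's GC fraction. This uses a maximum-likelihood comparison rather than a formal hypothesis test, and the genome names and GC values are made up for illustration.

```python
import math


def log_likelihood(read: str, gc: float) -> float:
    """Log-probability of the read's GC/AT pattern given a genome GC fraction.

    Assumes 0 < gc < 1 and independent positions.
    """
    read = read.upper()
    n_gc = sum(base in "GC" for base in read)
    n_at = len(read) - n_gc
    return n_gc * math.log(gc) + n_at * math.log(1.0 - gc)


def most_likely_genome(read: str, genomes: dict) -> str:
    """Return the name of the genome whose GC fraction best explains the read."""
    return max(genomes, key=lambda name: log_likelihood(read, genomes[name]))


# Hypothetical candidate genomes and their GC fractions.
candidates = {"genome_low_gc": 0.30, "genome_mid_gc": 0.50, "genome_high_gc": 0.65}
print(most_likely_genome("GCGCGGCCAT", candidates))
```

Working in log space avoids underflow when the read is long.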
3
Q

What are DNA-binding proteins?

A

-they are proteins that form noncovalent bonds with short stretches of DNA, and there are multiple binding sites within a genome
-the specific loci where DNA-binding proteins (DBPs) bind are called binding sites, and they lie near genes - the sequence pattern bound is called a motif
-when a genome is initially sequenced, the locations of its binding sites are unknown
-the bonds they form include ionic bonds

4
Q

What does the LOGO binding motif show?

A

the relative frequency of each nucleotide at each position of the DNA binding sites - in the example logo, every binding site has an A at the fourth, fifth, and 12th positions and a G at the third position
-consider a model where the binding motif is a fixed number of bases k and each nucleotide of the motif occurs with 100% frequency

5
Q

What is the probability that a certain chromosomal sequence is overrepresented in the genome?

A

consider a single chromosomal sequence on the forward strand, read from 5’ to 3’

6
Q

What is the null model for DNA?

A

a random nucleotide generator in which each output does not depend on the outputs before or after it, meaning the positions are independent
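
A small sketch of such a generator, assuming the four nucleotide probabilities are supplied as a dictionary. The function name `random_sequence`, the `seed` parameter, and the example probabilities are illustrative choices, not part of the lecture.

```python
import random


def random_sequence(n: int, probs: dict, seed=None) -> str:
    """Draw n nucleotides independently from the same distribution `probs`."""
    bases = list(probs)
    weights = [probs[b] for b in bases]
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("nucleotide probabilities must sum to 1")
    rng = random.Random(seed)  # a seed makes the draw reproducible
    # Each position is drawn without reference to any other position.
    return "".join(rng.choices(bases, weights=weights, k=n))


# Hypothetical non-uniform composition (60% GC).
print(random_sequence(20, {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}, seed=1))
```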

7
Q

When generating random nucleotides what is each base at position i considered to be?

A

the random variable Li

8
Q

What is the domain of the random variable Li, i.e. the values that it can take on?

A

Li ∈ {A, C, G, T}

9
Q

What are discrete random variables?

A

they are random variables which can take on a finite (or countably infinite) set of values

10
Q

If a discrete random variable Li can take on J separate values with probabilities p1, …, pJ, then the sum of all the probabilities is?

A

1

11
Q

What forms the probability distribution of a random variable?

A

the probabilities p1, …, pJ of the values it can take

12
Q

For the random nucleotide generator, the only constraint we place on the nucleotide probabilities is what?

A

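pA + pC + pG + pT = 1
the four probabilities are not constrained to be equal unless a uniform model is specified; under the i.i.d. assumption every position uses the same four probabilities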

13
Q

What is i.i.d.?

A

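independent and identically distributed: every position Li is drawn independently from the same distribution (pA, pC, pG, pT). The uniform special case is pA = pC = pG = pT = 0.25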

14
Q

When are two random variables independent?

A

when the probability of both taking particular values is the product of the individual probabilities: P(X = x, Y = y) = P(X = x) P(Y = y). For example, if X is rolling a one on a six-sided die and Y is getting heads in a coin toss, the probability of both occurring is 1/6 × 1/2 = 1/12
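
The die-and-coin example can be checked by enumerating all 12 equally likely outcomes. This is a sketch using exact fractions; the helper name `prob` is an illustrative choice.

```python
from fractions import Fraction
from itertools import product

# All (die face, coin side) pairs; each of the 12 is equally likely.
outcomes = list(product(range(1, 7), "HT"))


def prob(event) -> Fraction:
    """Probability of an event given as a predicate over outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


p_one = prob(lambda o: o[0] == 1)
p_heads = prob(lambda o: o[1] == "H")
p_both = prob(lambda o: o[0] == 1 and o[1] == "H")

print(p_one, p_heads, p_both)  # 1/6 1/2 1/12
assert p_both == p_one * p_heads  # the product rule holds, so X and Y are independent
```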

15
Q

What is an example of dependent random variables?

A

the height X and weight Y of a person

16
Q

How can you use the complement of A to find the P(A)?

A

subtract the probability of the complement from 1: P(A) = 1 - P(not A)

17
Q

How do you find the probability of A or B?

A

sum the two probabilities and subtract the probability of their intersection: P(A or B) = P(A) + P(B) - P(A and B)
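
A quick check of the union rule (and of the complement rule) on a single fair die. The events A = "even face" and B = "face greater than 3" are made-up examples.

```python
from fractions import Fraction

outcomes = set(range(1, 7))  # faces of a fair six-sided die


def prob(event: set) -> Fraction:
    """Probability of an event given as a subset of equally likely outcomes."""
    return Fraction(len(event), len(outcomes))


A = {x for x in outcomes if x % 2 == 0}  # {2, 4, 6}
B = {x for x in outcomes if x > 3}       # {4, 5, 6}

# Subtracting the intersection avoids counting 4 and 6 twice.
p_union = prob(A) + prob(B) - prob(A & B)
assert p_union == prob(A | B)
print(p_union)  # 2/3

# Complement rule: P(A) = 1 - P(not A).
assert prob(A) == 1 - prob(outcomes - A)
```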

18
Q
A
19
Q

What is the expectation of a random variable?

A

the probability-weighted average of the values it can take: E[X] = x1 p1 + … + xJ pJ

20
Q

What is the expectation of a random variable scaled by a constant?

A

multiply the expectation by the constant: E[cX] = c E[X]

21
Q

Can you sum expectations?

A

yes: E[X + Y] = E[X] + E[Y], whether or not X and Y are independent

22
Q

What is the variance of a random variable?

A

the mean of the squares minus the square of the mean: Var(X) = E[X^2] - (E[X])^2
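
A sketch of the expectation and variance formulas, and of scaling by a constant, for a fair six-sided die. The helper names and the choice of a die are illustrative assumptions.

```python
from fractions import Fraction


def expectation(dist: dict) -> Fraction:
    """E[X] for a distribution given as {value: probability}."""
    return sum(x * p for x, p in dist.items())


def variance(dist: dict) -> Fraction:
    """Var(X) = E[X^2] - (E[X])^2."""
    mean = expectation(dist)
    mean_of_squares = sum(x * x * p for x, p in dist.items())
    return mean_of_squares - mean ** 2


def scale(dist: dict, c) -> dict:
    """Distribution of cX; assumes c != 0 so the scaled values stay distinct."""
    return {c * x: p for x, p in dist.items()}


die = {face: Fraction(1, 6) for face in range(1, 7)}
print(expectation(die), variance(die))  # 7/2 35/12

doubled = scale(die, 2)
# The expectation scales by c, the variance by c squared.
print(expectation(doubled), variance(doubled))  # 7 35/3
```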

23
Q

How do you scale a variance?

A

multiply by the square of the constant: Var(cX) = c^2 Var(X)

24
Q

Can you sum variances?

A

yes, if the random variables are independent: Var(X + Y) = Var(X) + Var(Y)
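
Variances add when the variables are independent. The sketch below checks this by enumeration for two independent dice, and shows a dependent case (adding a die to itself) where the rule fails. The setup is a made-up illustration.

```python
from fractions import Fraction
from itertools import product


def mean(values) -> Fraction:
    """Mean of a list of equally likely integer values."""
    return Fraction(sum(values), len(values))


def variance(values) -> Fraction:
    """Mean of squares minus square of the mean, over equally likely values."""
    return mean([v * v for v in values]) - mean(values) ** 2


faces = list(range(1, 7))
var_one = variance(faces)

# Two independent dice: all 36 (x, y) pairs are equally likely.
var_independent = variance([x + y for x, y in product(faces, faces)])
assert var_independent == var_one + var_one

# Fully dependent case: Y is the same roll as X, so X + Y = 2X.
var_dependent = variance([x + x for x in faces])
assert var_dependent == 4 * var_one  # not 2 * var_one

print(var_one, var_independent, var_dependent)  # 35/12 35/6 35/3
```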

25
Q

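Finding the expected number of a given nucleotide in a sequence uses an auxiliary variable Xi, which is what?

A

an indicator variable: Xi = 1 if Li is the nucleotide of interest (e.g. A), and Xi = 0 otherwise, so E[Xi] = pA
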
26
Q

If N is the random variable representing the number of As in a sequence of length n then…

A

N = X1 + … + Xn, and because expectations sum, E[N] = E[X1] + … + E[Xn] = n pA

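
A sketch of counting a nucleotide with indicator variables: each position contributes 1 if it holds the base of interest and 0 otherwise, and the count is the sum. Under the i.i.d. null model with probability p for that base, the expected count is n times p. The function names and the example values are illustrative assumptions.

```python
def count_by_indicators(seq: str, base: str = "A") -> int:
    """Sum the indicators X_i, where X_i = 1 when position i holds `base`."""
    indicators = [1 if b == base else 0 for b in seq.upper()]
    return sum(indicators)


def expected_count(n: int, p: float) -> float:
    """E[N] = n * p for n independent positions, each equal to the base with probability p."""
    return n * p


print(count_by_indicators("AACGTA"))  # three of the six positions hold A
print(expected_count(100, 0.25))      # uniform model, 100 positions
```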
27
Q

Consider the example below when calculating deviation from expectation…

A genome is thought to have a 50% GC composition and a sequence of 100 bases is observed with 65 nucleotides that are either G or C. What is the probability of such a deviation?

A
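  1. First find the expected number of GC bases in the sequence: 100 × 0.5 = 50 GC nucleotides (and 50 AT)
  2. Use a chi-square (χ²) test comparing the observed counts (65 GC, 35 AT) with the expected counts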
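
The two steps above can be sketched with the standard library only. With two categories (GC and AT) the test has 1 degree of freedom, and the upper-tail probability of a chi-square variable with 1 df can be written with the complementary error function. The function names are illustrative.

```python
import math


def chi_square_stat(observed, expected) -> float:
    """Sum of (observed - expected)^2 / expected over the categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_sf_df1(x: float) -> float:
    """P(X >= x) for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


n = 100
expected = [n * 0.5, n * 0.5]  # 50 GC, 50 AT under a 50% GC genome
observed = [65, 35]            # 65 GC observed, so 35 AT

chi2 = chi_square_stat(observed, expected)
p_value = chi_square_sf_df1(chi2)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

Here the statistic is 15²/50 + 15²/50 = 9, so a deviation this large is unlikely under the 50% GC hypothesis.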
28
Q

Consider the example below when comparing nucleotide composition…

Given a genome of length n with Sa, Sc, Sg, St counts of each nucleotide and a sample sequence of 100 nucleotides, what is the probability that the nucleotide counts deviate from what would be expected if the sample sequence was derived from the genome?

A
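  1. Calculate each nucleotide's probability from the genome counts (e.g. pA = Sa / n) and multiply by 100 to get the expected count of each nucleotide in the read
  2. use a chi-square test with multiple degrees of freedom on the observed vs expected counts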
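
A sketch of the four-nucleotide comparison, treating the genome fractions as known so that this is a goodness-of-fit test with 4 - 1 = 3 degrees of freedom. The genome and sample counts are invented for illustration, and the upper-tail probability uses the closed form for 3 df rather than a statistics library.

```python
import math


def chi_square_stat(observed, expected) -> float:
    """Sum of (observed - expected)^2 / expected over the categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_sf_df3(x: float) -> float:
    """P(X >= x) for a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Hypothetical counts: Sa, Sc, Sg, St for the genome and for a 100-base sample.
genome_counts = {"A": 300, "C": 200, "G": 200, "T": 300}
sample_counts = {"A": 20, "C": 30, "G": 30, "T": 20}

bases = "ACGT"
n_genome = sum(genome_counts.values())
n_sample = sum(sample_counts.values())

# Expected count of each base = sample length * genome fraction of that base.
expected = [n_sample * genome_counts[b] / n_genome for b in bases]
observed = [sample_counts[b] for b in bases]

chi2 = chi_square_stat(observed, expected)
p_value = chi_square_sf_df3(chi2)
print(f"chi2 = {chi2:.3f}, df = {len(bases) - 1}, p = {p_value:.2g}")
```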
29
Q

What is the formula for the degrees of freedom (df) of a chi-square test with multiple rows and columns?

A

(#rows - 1)(#cols - 1) for a contingency table; for a goodness-of-fit test over k categories it is k - 1

30
Q
A