Lecture 3 - Probability and Statistics for Sequence Analysis Flashcards
(31 cards)
What are some motivating examples for statistics for DNA?
-if you want to figure out the GC composition of a genome, i.e. the fraction of the genome's bases that are cytosine or guanine
-due to nonrandom variation some genomes have a low GC composition
Why is GC composition analysis important?
-it affects methylation, since GCs tend to be methylated, and it affects the melting point of DNA, since a higher GC composition raises it
Given the GC compositions of a list of possible source genomes and a sequence read can you identify which is the most likely genome the read was sequenced from?
- encode the genome numerically so it can be processed computationally
- perform a statistical test (need to create a null model)
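A minimal sketch of the read-assignment question above. It swaps the statistical test for a simple likelihood comparison, which is one possible approach and not necessarily the one used in the lecture. It assumes an i.i.d. null model in which P(G) = P(C) = gc / 2 and P(A) = P(T) = (1 - gc) / 2. The function names, genome names, GC values, and read are all hypothetical.

```python
import math

def gc_fraction(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)

def log_likelihood(read, gc):
    """Log-probability of read under an i.i.d. model with GC composition gc.

    Assumption: G and C share probability gc equally, A and T share 1 - gc equally.
    """
    p = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    return sum(math.log(p[base]) for base in read.upper())

def most_likely_genome(read, genome_gc):
    """Name of the genome whose GC composition gives the read the highest likelihood."""
    return max(genome_gc, key=lambda name: log_likelihood(read, genome_gc[name]))

# Hypothetical GC compositions, for illustration only.
genome_gc = {"low_gc": 0.30, "mid_gc": 0.50, "high_gc": 0.70}
read = "GCGCGGCCATGC"  # hypothetical read: 10 of its 12 bases are G or C

print(gc_fraction(read))
print(most_likely_genome(read, genome_gc))
```

Working in log space avoids numerical underflow when the read is long, since the product of many small probabilities quickly becomes too small to represent.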
What is a DNA binding protein?
-they are proteins that form noncovalent bonds with DNA and grab onto short 8-16 bp stretches of it; binding causes repression or expression of a gene
-has multiple binding sites in a genome
What are the specific loci where DNA binding proteins (DBPs) bind called, and where are they often found?
-they are called binding sites and are often found near genes; the pattern bound is called a motif
Are the locations of binding sites known when a genome is sequenced?
-no they are unknown
What are binding motifs and what is an example of one?
-the patterns that the DNA binding proteins typically bind to
-here nearly every binding site has an A at the fourth and fifth positions and a T at the 12th position and the third position is typically a G
-here we will consider a model where the binding motif is a fixed number of bases (k) long and every position of the motif is always the same nucleotide (100% frequency)
What is a random nucleotide generator?
-a null model for DNA: each output is generated independently of the earlier ones, so the previous output does not dictate the next output
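A minimal sketch of such a generator, assuming each base is drawn independently from a fixed set of base probabilities (uniform by default). The function name `random_nucleotides` and its parameters are hypothetical.

```python
import random

def random_nucleotides(n, probs=None, seed=None):
    """Draw n bases independently; each draw ignores all previous draws."""
    rng = random.Random(seed)
    if probs is None:
        # Assumption: uniform base probabilities unless others are given.
        probs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    bases = list(probs)
    weights = [probs[base] for base in bases]
    return "".join(rng.choices(bases, weights=weights, k=n))

print(random_nucleotides(20, seed=1))
```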
What random variable is used to denote each base at a position i?
Li
What is a random variable?
-a variable whose value is determined by the outcome of some random generating process
What is the domain of a random variable?
Li ∈ {A, C, G, T}
-the values a random variable can take on
What is a discrete random variable?
-a random variable that takes on a finite (or countably infinite) set of values
If a discrete random variable like Li can take on J separate values with probabilities p1, …, pJ, then what is the value of the sum of its probabilities and what are the rough values of the individual probabilities?
each probability lies between 0 and 1, and the sum of the probabilities is 1
What forms the probability distribution of a random variable?
-the values p1, …, pJ
-that is, the probability that the random variable X takes on the value j, for all values of j
What does it mean for the random nucleotide generator if its outputs are independent and identically distributed (i.i.d.)?
-every position is drawn independently from the same distribution (pA, pC, pG, pT); in the simplest case pA=pT=pC=pG=0.25
-however, i.i.d. does not by itself constrain the probabilities to be equal; the problem must state both that the outputs are i.i.d. and what the base probabilities are
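Under an i.i.d. model, the probability of a whole sequence is the product of its per-base probabilities. A minimal sketch; the function name and the non-uniform probabilities are hypothetical values chosen for illustration.

```python
import math

def sequence_probability(seq, probs):
    """Probability of seq when each base is drawn i.i.d. from probs."""
    return math.prod(probs[base] for base in seq)

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
# Hypothetical non-uniform distribution: still i.i.d., just not equal probabilities.
at_rich = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

print(sequence_probability("ACGT", uniform))  # 0.25 ** 4
print(sequence_probability("ACGT", at_rich))
```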
What determines if two variables are independent?
if their joint probability equals the product of their individual probabilities: P(X = x, Y = y) = P(X = x) * P(Y = y) for all values x and y
What is an example of an independent and dependent variable?
-Independent - X is the outcome of a 6-sided die and Y is a coin toss
-Dependent - X is the height of a person and Y is the person’s weight
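The product rule for independence can be checked directly on small distributions. A minimal sketch: the die and coin follow the card above, while the dependent pair (a die roll and its own parity) is a hypothetical stand-in for height and weight, which have no simple table of probabilities.

```python
from fractions import Fraction
from itertools import product

die = {face: Fraction(1, 6) for face in range(1, 7)}
coin = {"H": Fraction(1, 2), "T": Fraction(1, 2)}

def is_independent(joint, px, py):
    """True if P(X=x, Y=y) == P(X=x) * P(Y=y) for every pair (x, y)."""
    return all(
        joint.get((x, y), Fraction(0)) == px[x] * py[y]
        for x in px
        for y in py
    )

# Independent: all 12 (face, side) pairs are equally likely.
joint_die_coin = {(x, y): Fraction(1, 12) for x, y in product(die, coin)}

# Dependent (hypothetical example): Y is the parity of the die roll X.
parity = {"even": Fraction(1, 2), "odd": Fraction(1, 2)}
joint_die_parity = {
    (x, "even" if x % 2 == 0 else "odd"): Fraction(1, 6) for x in die
}

print(is_independent(joint_die_coin, die, coin))      # True
print(is_independent(joint_die_parity, die, parity))  # False
```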
Given an event A, what is the probability of A in terms of the probability of its complement?
Given two events A and B, what is the probability of A or B happening?
On a 3-way Venn diagram, show the area corresponding to P(A ∩ ¬(B ∪ C))
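The three set questions above can be explored with Python sets. A minimal sketch, assuming a hypothetical sample space of 12 equally likely outcomes and hypothetical events A, B, and C. It uses the standard identities P(A) = 1 - P(not A) and P(A or B) = P(A) + P(B) - P(A and B).

```python
from fractions import Fraction

omega = set(range(1, 13))  # hypothetical sample space, all outcomes equally likely
A = {1, 2, 3, 4, 5, 6}
B = {4, 5, 7, 8}
C = {5, 6, 8, 9}

def prob(event):
    """Probability of an event as (outcomes in event) / (outcomes in omega)."""
    return Fraction(len(event), len(omega))

# Complement rule: P(A) = 1 - P(not A)
p_a_from_complement = 1 - prob(omega - A)

# Union rule: P(A or B) = P(A) + P(B) - P(A and B)
p_a_or_b = prob(A) + prob(B) - prob(A & B)

# Region of A outside both B and C, i.e. A intersected with not (B union C)
a_only = A - (B | C)

print(p_a_from_complement, p_a_or_b, sorted(a_only), prob(a_only))
```

On a Venn diagram, `a_only` is the part of circle A that overlaps neither B nor C.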
What is the expectation of a random variable?
-the average (mean) value of the random variable
-each value multiplied by its probability, then summed: E[X] = sum over all x of x * P(X = x)
How do you scale the expectation of a random variable by c?
just multiply the calculated expectation of the random variable by c: E[cX] = c * E[X]
Can you sum expectations? How do you sum them if the variables are independent and identically distributed?
Yes, you can sum them: E[X + Y] = E[X] + E[Y], and this holds even without independence; if the n variables are identically distributed, then you can just multiply the expectation of one variable by n
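A minimal sketch of expectation, scaling, and summing, using a fair 6-sided die as the example distribution. The helper name `expectation` and the constants c and n are hypothetical.

```python
from fractions import Fraction

def expectation(dist):
    """E[X]: each value times its probability, summed."""
    return sum(value * p for value, p in dist.items())

die = {face: Fraction(1, 6) for face in range(1, 7)}
e_die = expectation(die)  # (1 + 2 + ... + 6) / 6 = 7/2

# Scaling: the distribution of c * X has the same probabilities on scaled values.
c = 3
scaled_die = {c * face: p for face, p in die.items()}
e_scaled = expectation(scaled_die)  # equals c * E[X]

# Sum of n identically distributed rolls: n * E[X].
n = 10
e_sum = n * e_die

print(e_die, e_scaled, e_sum)
```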
What is the variance of a random variable?
the mean of the squares minus the square of the mean: Var(X) = E[X^2] - (E[X])^2
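A minimal sketch of the variance formula on a fair 6-sided die, checked against the equivalent definition E[(X - E[X])^2]. The function names are hypothetical.

```python
from fractions import Fraction

def variance(dist):
    """Var(X) = E[X^2] - (E[X])^2."""
    mean = sum(value * p for value, p in dist.items())
    mean_of_squares = sum(value * value * p for value, p in dist.items())
    return mean_of_squares - mean ** 2

def variance_from_definition(dist):
    """Var(X) = E[(X - E[X])^2], used here as a cross-check."""
    mean = sum(value * p for value, p in dist.items())
    return sum((value - mean) ** 2 * p for value, p in dist.items())

die = {face: Fraction(1, 6) for face in range(1, 7)}
print(variance(die))  # 91/6 - 49/4 = 35/12
```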