Lecture 4 - k-mers and probability Flashcards

1
Q

What is a k-mer?

A

a sequence of k bases, where k is a fixed parameter
-biologically relevant examples include TF (transcription factor) binding sites, which have a sequence of length k
-one of the most common units in computational sequence analysis
-when stored in binary (2 bits per base), a single 32-bit integer holds up to a 16-base k-mer
-k-mers are used in many types of analysis, including genome assembly, alignment-free comparison, and genotyping
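
The 32-bit storage point can be illustrated with a minimal sketch. The bit assignment (A=0, C=1, G=2, T=3) and the function names are assumptions made for illustration; any fixed 2-bit code per base would work the same way.

```python
# Hypothetical 2-bit code per base; the specific assignment is an assumption.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_kmer(kmer):
    """Pack a k-mer (k <= 16) into an integer that fits in 32 bits."""
    if len(kmer) > 16:
        raise ValueError("a 32-bit integer holds at most 16 bases")
    value = 0
    for base in kmer:
        # Shift the bases packed so far left by 2 bits and append the new one.
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def decode_kmer(value, k):
    """Unpack an integer produced by encode_kmer back into a k-mer of length k."""
    bases = []
    for _ in range(k):
        bases.append(BITS_TO_BASE[value & 3])  # read the lowest 2 bits
        value >>= 2
    return "".join(reversed(bases))
```

Sixteen bases at 2 bits each use exactly 32 bits, which is why 16 is the limit for a single 32-bit integer.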

2
Q

The probability of observing a specific k-mer generated by the random nucleotide generator is…

A
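
A sketch of the standard result, assuming the random nucleotide generator draws each base independently with probabilities p_A, p_C, p_G, p_T (the symbols w and w_i are introduced here for illustration): the probability of a specific k-mer w is the product of the probabilities of its bases.

```latex
P(w) = \prod_{i=1}^{k} p_{w_i},
\qquad \text{and if } p_A = p_C = p_G = p_T = \tfrac{1}{4}: \quad
P(w) = \left(\tfrac{1}{4}\right)^{k}
```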
3
Q

The probability of observing two k-mers in a sequence is affected by whether or not the k-mers overlap, and is an instance of conditional probability?

A

-a k-mer at one position and the adjacent k-mer at the next position overlap by k-1 bases

4
Q

How is the conditional probability defined (i.e. like the probability of A given B)?

A
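
The standard definition, stated here as a reference sketch (it requires P(B) > 0):

```latex
P(A \mid B) = \frac{P(A \cap B)}{P(B)}
```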
5
Q

Given a sequence L1,…,Ln, each a nucleotide with probabilities Pa,Pc,Pg,Pt, where the first k-mer (k=4) is GACT, what is the probability that the k-mer starting at the third base is CTGG?

A
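
A worked sketch, assuming the bases are independent draws: the 4-mer starting at base 3 covers positions 3 to 6. Positions 3 and 4 are already fixed as C and T by GACT, which matches the first two letters of CTGG, so only positions 5 and 6 remain free and both must be G.

```latex
P(L_3 L_4 L_5 L_6 = \text{CTGG} \mid L_1 L_2 L_3 L_4 = \text{GACT})
  = P(L_5 = \text{G}) \, P(L_6 = \text{G})
  = P_g^{2}
```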
6
Q

There are biological examples where the probability of observing (generating) a particular nucleotide depends on what the nucleotide before it was

A

-in the matrix of transition probabilities, each row must sum to one, but the columns do not need to sum to one

7
Q

What is a Markov chain?

A

-the next state depends on the state you are currently in
-the transition probabilities form a matrix in which every row sums to 1
-when generating a sequence, the probabilities used for the next base come from the row indexed by the current base
-a sequence generated this way is called a Markov chain
-each letter is a dependent random variable
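
The row-lookup rule above can be sketched as follows. The transition matrix values and the function name are hypothetical, chosen only so that each row sums to 1; they are not taken from the lecture.

```python
import random

# Hypothetical transition matrix: outer key = current base,
# inner dict = P(next base | current base). Each row sums to 1;
# the columns need not.
TRANSITIONS = {
    "A": {"A": 0.4, "C": 0.2, "G": 0.3, "T": 0.1},
    "C": {"A": 0.1, "C": 0.5, "G": 0.1, "T": 0.3},
    "G": {"A": 0.3, "C": 0.3, "G": 0.2, "T": 0.2},
    "T": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}


def generate_markov_sequence(length, start, transitions, rng):
    """Generate a sequence of `length` >= 1 bases beginning with `start`."""
    sequence = [start]
    while len(sequence) < length:
        # The row for the current (last) base gives the next-base probabilities.
        row = transitions[sequence[-1]]
        bases = list(row)
        weights = [row[base] for base in bases]
        sequence.append(rng.choices(bases, weights=weights)[0])
    return "".join(sequence)
```

For example, `generate_markov_sequence(50, "G", TRANSITIONS, random.Random(0))` returns a 50-base string starting with G; passing the random generator explicitly keeps runs reproducible.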

8
Q

Given a genome of size n how can we test if the bases are generated randomly (i.i.d.) or through a Markov process?

A

-calculate the frequency of each nucleotide and use this to estimate the probability of each base
-calculate the probability of each dinucleotide under independence (the product of its two base probabilities)
-calculate the expected number of times each dinucleotide should appear in the genome
-count the number of times each dinucleotide actually appears in the genome
-compare the observed counts of each dinucleotide to the expected counts using the chi-squared test (how many df?)
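
The steps above can be sketched as follows. This is a minimal illustration that computes only the chi-squared statistic; the function name is hypothetical, overlapping dinucleotides are counted at every position (an assumption), and converting the statistic to a p-value, including the choice of degrees of freedom, is left open as in the card.

```python
from collections import Counter


def dinucleotide_chi_squared(genome):
    """Chi-squared statistic comparing observed dinucleotide counts with the
    counts expected if bases were independent (i.i.d.)."""
    n = len(genome)
    # Steps 1-2: base frequencies estimate the per-base probabilities.
    base_prob = {base: count / n for base, count in Counter(genome).items()}
    # Step 4: count every overlapping dinucleotide (n - 1 positions).
    n_pairs = n - 1
    observed = Counter(genome[i:i + 2] for i in range(n_pairs))
    statistic = 0.0
    for first in base_prob:
        for second in base_prob:
            # Step 3: expected count = P(first) * P(second) * number of positions.
            expected = base_prob[first] * base_prob[second] * n_pairs
            diff = observed[first + second] - expected
            # Step 5: accumulate (observed - expected)^2 / expected.
            statistic += diff * diff / expected
    return statistic
```

A large statistic suggests the dinucleotide counts are poorly explained by independent bases, which is what a Markov process would produce.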

9
Q

How do you find the distribution of counts of an outcome?

A

-treat each trial as an event with outcome 0 or 1; we want to know how many events are 0 and how many are 1

10
Q
A
11
Q

What is the binomial distribution?

A

-the distribution of the total number of 1s in n Bernoulli trials, each with probability p, defined by P(x) = C(n,x) p^x (1-p)^(n-x)
-the expected value of a binomial variable with parameters n,p is np
-its variance is np(1-p)

12
Q

What two parameters is the binomial distribution defined by?

A

the number of trials N and the probability of success of each trial p

13
Q

A DNA sequence R of 100 nucleotides is generated with pA = 0.3 and p(not A) = 0.7. What is the probability of v As in the sequence?

A
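
A minimal sketch, assuming each of the 100 positions is an independent Bernoulli trial with success probability 0.3 (so the count of As is binomial); the function name is hypothetical.

```python
from math import comb


def binomial_pmf(v, n, p):
    """Probability of exactly v successes in n independent trials,
    each with success probability p."""
    return comb(n, v) * p**v * (1 - p) ** (n - v)
```

For this card, the probability of v As is `binomial_pmf(v, 100, 0.3)`.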
14
Q

How to test if a sequence has a different composition than what was expected?

A
15
Q
A
16
Q

Probability distribution for counts with a low probability and a high number of trials: given a genome of length n and a k-mer of probability p, we can estimate the number of times the k-mer appears as np and the distribution as … (look at image) … Which distribution lets us calculate this for a low probability and a high number of trials?

A

The Poisson distribution

17
Q

What is the Poisson distribution defined by?

A

A parameter lambda (λ)
where lambda = np
-this is the mean or expected value
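
For reference, the standard Poisson probability mass function, with X denoting the count (a symbol introduced here for illustration):

```latex
P(X = x) = \frac{\lambda^{x} e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \dots,
\qquad E[X] = \operatorname{Var}(X) = \lambda
```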

18
Q

A genome G is composed of N = 3,000,000 bases that may be modeled as independent random variables (e.g. Li has values A, C, G, T with pA, pC, pG, pT = 0.1, 0.3, 0.4, 0.2). If a k-mer ACCGACGGT is observed 30 times, what is the probability that this k-mer would appear at least this many times in the genome?

A
1. Calculate the probability that the k-mer occurs by multiplying the per-base probabilities: 0.1 × 0.3 × 0.3 × 0.4 × 0.1 × 0.3 × 0.4 × 0.4 × 0.2 = 3.456 × 10^-6

2. Calculate lambda = np, the expected number of times the k-mer appears in the genome G: np = 3,000,000 × 3.456 × 10^-6 = 10.368

3. Calculate the probability that there are 30 or more occurrences of the k-mer in G, using the Poisson distribution with this lambda
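
The calculation can be sketched as below, taking N = 3,000,000, the base probabilities, and the threshold of 30 observations from the question. The function names are hypothetical, the number of k-mer positions is approximated by N, and the infinite upper tail is truncated after a fixed number of terms (an assumption that is harmless here because the terms shrink very quickly).

```python
import math

BASE_PROBS = {"A": 0.1, "C": 0.3, "G": 0.4, "T": 0.2}


def kmer_probability(kmer, base_probs):
    """Probability of a specific k-mer under independent bases."""
    p = 1.0
    for base in kmer:
        p *= base_probs[base]
    return p


def poisson_upper_tail(lam, threshold, extra_terms=500):
    """Approximate P(X >= threshold) for X ~ Poisson(lam) by summing the
    pmf directly over [threshold, threshold + extra_terms)."""
    total = 0.0
    for x in range(threshold, threshold + extra_terms):
        # Work in log space to avoid overflow in lam**x and x!.
        total += math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))
    return total


p = kmer_probability("ACCGACGGT", BASE_PROBS)  # step 1
lam = 3_000_000 * p                            # step 2: lambda = np
tail = poisson_upper_tail(lam, 30)             # step 3: P(X >= 30)
```

Summing the upper tail directly, instead of computing one minus the lower tail, avoids losing precision when the tail probability is very small.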
19
Q

The binomial and Poisson distributions are examples of distributions of what type of variable?

A

-discrete random variables

20
Q

What is a continuous random variable?

A

variables that can take any value in a continuous range, so probabilities are described by infinitesimal densities

21
Q

What describes the probability of a continuous random variable and what is it denoted by and what does it mean?

A

-probability density function or pdf
-denoted by f(x)
-the probability that random variable X lies in the interval [a,b] is given by the integral of the pdf over that interval

22
Q

For any continuous random variable X with a pdf f(x), what should the total area under the curve be equal to?

A

One

23
Q

What is the mean or expected value of a continuous random variable X with a pdf f(x)?

A
24
Q

What is the variance of a continuous random variable X defined as?

A

the expected value of the squared deviation from the mean
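
Written out as standard formulas for a continuous random variable X with pdf f(x) (the symbol μ for the mean is introduced here for illustration):

```latex
\mu = E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx,
\qquad
\operatorname{Var}(X) = E\!\left[(X - \mu)^{2}\right]
  = \int_{-\infty}^{\infty} (x - \mu)^{2} f(x) \, dx
```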

25
Q

What is the probability of a single point in a uniform distribution?

A

f(x) dx, an infinitesimal; for the uniform distribution on [0,1], where f(x) = 1, this is dx

26
Q

What is the probability density function or pdf of the standard normal distribution?

A
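
For reference, the standard normal distribution has mean 0 and variance 1, and its pdf is:

```latex
f(x) = \frac{1}{\sqrt{2\pi}} \, e^{-x^{2}/2}
```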
27
Q
A
28
Q

Central Limit Theorem

A
29
Q
A