R and DESeq2 Flashcards

1
Q

Name some benefits of using R

A
  • open source
  • reproducible research with R Markdown
  • huge community of developers
  • custom packages are available
2
Q

What are the uses of R projects (.Rproj)?

A
  • links all files and outputs to the project directory
  • imported data is looked for in the project directory, so no full file path has to be specified
  • can save the environment in .RData and reload the workspace where you left it
3
Q

What are some challenges when trying to identify differentially expressed genes? (3)

A
  • distinguishing technical variation from treatment variation (e.g. technical factors that can’t be controlled during library prep)
  • the majority of genes don’t change between treatments (hard to perform statistics and reach significance)
  • only a few replicates per treatment, so variance is hard to estimate (more replicates are often not feasible because of price, limited material, and experiment execution)
4
Q

What is count normalization?

A

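The determination of size factors to normalize for differences in sequencing depth and RNA composition between samples (DESeq2’s size factors do not correct for gene length, which only matters when comparing different genes within a sample)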

5
Q

Why can’t you use total sampling depth to normalize counts?

A

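A few highly expressed genes can take up a large share of a sample’s total reads, so ratios to the total don’t reflect actual expression. A gene might be expressed at the same level in two samples, but after dividing by total depth it can appear lower in the sample where the highly expressed genes dominate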
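
A small worked example may help here. The counts below are made up for illustration: genes A, B and C have identical counts in both samples, while gene D is strongly up in sample 2. Dividing by the total makes A, B and C look five times lower in sample 2.

```python
# Hypothetical counts, invented for illustration only.
sample_1 = {"A": 100, "B": 100, "C": 100, "D": 100}
sample_2 = {"A": 100, "B": 100, "C": 100, "D": 1700}


def fraction_of_total(sample):
    """Normalize each count by the sample's total read count."""
    total = sum(sample.values())
    return {gene: count / total for gene, count in sample.items()}


frac_1 = fraction_of_total(sample_1)
frac_2 = fraction_of_total(sample_2)

# Gene A has the same raw count in both samples, but its share of the
# total drops in sample 2 because gene D dominates the library there.
print(frac_1["A"], frac_2["A"])
```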

6
Q

Why should count normalization be done?

A

The numerical value of non-differentially expressed genes should not vary due to sampling depth or RNA composition, so we need to determine a size factor for each sample.

This is needed to make accurate comparisons of gene expression between samples.

7
Q

What are the steps in DESeq2 count normalization? (6)

A
  1. take the natural log of the gene counts
  2. calculate the geometric mean of each row (gene) to use as a pseudo-reference sample (in log space: the mean of the logs)
  3. remove infinite values (genes with a zero count in any sample)
  4. subtract the log of the geometric mean from the log of the counts (equivalent to the log ratio of counts to reference)
  5. calculate the median of these log ratios for each sample
  6. convert the log median back to a number (exponentiate) to get the size factor
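
The six steps can be sketched in plain Python. This is a minimal illustration of the median-of-ratios idea and not DESeq2's own code; the function name `size_factors` and the toy count matrix are assumptions made for the example.

```python
import math
from statistics import median


def size_factors(counts):
    """counts: one row per gene, one column per sample (raw counts)."""
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        # Steps 1 and 3: log(0) is -infinity, so genes with any zero are skipped.
        if any(c == 0 for c in gene):
            continue
        logs = [math.log(c) for c in gene]
        # Step 2: the mean of the logs is the log of the geometric mean.
        log_reference = sum(logs) / n_samples
        # Step 4: log(count) - log(reference) = log(count / reference).
        for j, value in enumerate(logs):
            log_ratios[j].append(value - log_reference)
    # Steps 5 and 6: median log ratio per sample, converted back with exp().
    return [math.exp(median(column)) for column in log_ratios]


# Toy matrix (hypothetical): sample 2 was sequenced twice as deeply.
counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 7],  # contains a zero, so it is ignored when computing size factors
]
factors = size_factors(counts)
normalized = [[c / s for c, s in zip(gene, factors)] for gene in counts]
print(factors)
print(normalized[0])
```

After dividing by the size factors, the first three genes have equal normalized counts in both samples, which is what is expected for genes that are not differentially expressed.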
8
Q

What are 2 limits to FDR-controlling procedures? & what are the solutions?

A
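  1. multiple testing produces false positives (some p-values fall below 0.05 by chance)
  2. with FDR correction, the more tests of truly unchanged genes are included, the more true positives are lost (false negatives)

solutions:

  • the variance of low-expressed genes can’t be estimated reliably, so these tests are unlikely to reach significance
  • remove low-expressed genes before the FDR correction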
9
Q

What is the p-value distribution if a gene is not differentially expressed in 2 different conditions?

A

The samples come from the same distribution

The p-values are uniformly distributed from 0 to 1

10
Q

What is the p-value distribution if a gene is differentially expressed in 2 different conditions?

A

The samples come from 2 different distributions
The p-value distribution is skewed towards 0
Most p-values are below 0.05

11
Q

What are the p-values for true positives and false negatives?

A

true positives : 0 - 0.05

false negatives: > 0.05

12
Q

What is the Benjamini-Hochberg method?

A

A method to control the FDR and account for the fact that some p-values below 0.05 occur by chance.
It adjusts p-values by making them larger.
It keeps the expected share of false positives among all positives at or below the chosen cutoff (5% when adjusted p-values below 0.05 are called significant).

13
Q

In independent filtering how is the filter threshold calculated?

A

filter threshold = the lowest quantile at which the number of rejections is within 1 SD of the maximum of the fitted curve

14
Q

How does the Benjamini-Hochberg method work?

A

It sets the adjusted p-value to whichever is the lower value of

  1. the adjusted p-value of the next higher rank (after ranking p-values from lowest to highest); adjusted p-value(rank+1)
  2. p-value(rank) * [total # of p-values] / rank
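
The rule above can be sketched in a few lines of Python. The function name `bh_adjust` and the example p-values are made up for illustration; in R the same adjustment is available through `p.adjust(p, method = "BH")`.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values ranked from lowest to highest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the highest rank down, so the adjusted value of the
    # next higher rank is always known (kept in running_min).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        candidate = pvalues[i] * m / rank
        running_min = min(running_min, candidate)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]
adj = bh_adjust(pvals)
print([round(a, 4) for a in adj])  # [0.02, 0.04, 0.04, 0.02]
```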
15
Q

What happens to the number of true positives after the Benjamini-Hochberg method is applied?

A

fewer true positives are identified

16
Q

What does FDR do?

A

Limits the number of false positives reported, at the cost of losing some true positives

17
Q

Why are genes with low read count very noisy?

A

With low counts, small changes in read number cause dramatic changes in the calculated fold change; remove these genes from the dataset
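
A quick arithmetic check, using made-up counts: two extra reads double a gene with 2 reads but barely move a gene with 1000 reads.

```python
import math


def log2_fold_change(count_a, count_b):
    """log2 of the ratio between two counts (hypothetical helper)."""
    return math.log2(count_b / count_a)


low = log2_fold_change(2, 4)         # two extra reads on a low-count gene
high = log2_fold_change(1000, 1002)  # two extra reads on a high-count gene

print(low)   # 1.0, a two-fold change
print(high)  # close to zero
```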

18
Q

What does DESeq2 do to counteract loss of true positives when there are lots of non-differentially expressed genes?

A

Removes tests that are unlikely to show significant differences and determines which quantiles return the largest number of rejections

19
Q

What is independent filtering?

A

The removal of genes with very low counts before multiple-testing correction; maximizes the number of positives

20
Q

What is gene dispersion?

A

A measure of the gene’s variability: the parameter a that links the variance to the mean, Var = u + a * u^2

21
Q

What are the steps in independent filtering?

A

Genes with low counts are removed (only genes with sample mean > filter threshold are kept)

  1. determine the number of significant genes for different thresholds (expressed as quantiles) and plot the number of significant genes vs. quantiles
  2. fit a curve
  3. determine the filter threshold
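
The first step can be sketched as follows. This is a simplified illustration, not DESeq2's procedure: it skips the curve fit and simply picks the lowest quantile with the most rejections. The function names, the toy data and the cutoff `alpha=0.1` are assumptions for the example.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def rejections_by_quantile(mean_counts, pvalues, quantiles, alpha=0.1):
    """For each filter quantile, count genes significant after BH adjustment."""
    ranked = sorted(mean_counts)
    results = []
    for q in quantiles:
        # Mean count at quantile q is used as the filter threshold.
        threshold = ranked[min(int(q * len(ranked)), len(ranked) - 1)]
        # Keep genes at or above the threshold (q = 0 keeps every gene).
        kept = [p for mc, p in zip(mean_counts, pvalues) if mc >= threshold]
        n_rejected = sum(a < alpha for a in bh_adjust(kept))
        results.append((q, threshold, n_rejected))
    return results


# Toy data (hypothetical): low-count genes carry uninformative p-values.
mean_counts = [1, 2, 3, 4, 50, 60, 70, 80]
pvalues = [0.9, 0.8, 0.7, 0.6, 0.04, 0.03, 0.06, 0.05]

results = rejections_by_quantile(mean_counts, pvalues, [0, 0.25, 0.5, 0.75])
# max() returns the first maximum, i.e. the lowest quantile with most rejections.
best = max(results, key=lambda r: r[2])
for row in results:
    print(row)
print("chosen quantile:", best[0])
```

With no filtering, none of the toy genes pass the cutoff; once the lowest-count genes are removed, four do, because fewer tests enter the correction.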
22
Q

Describe properties of read counts (4)

A
  1. sparse events (i.e. a small likelihood p of a read mapping to a specific gene), so the read count of a given gene is likely small
  2. discrete, with a high number of events n (sampling depth / number of reads)
  3. raw counts can be modelled with a Poisson distribution (mean = variance)
23
Q

What distribution is suitable for a large sampling depth (n) and a very small number for p?

A

Poisson

important property: mean = variance
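
A short numerical check of why this works: for a binomial with n trials and success probability p, the mean is n*p and the variance is n*p*(1-p), so with a very small p the two are almost equal. The values of n and p below are arbitrary choices for illustration.

```python
def binomial_mean_variance(n, p):
    """Exact mean and variance of a binomial(n, p) distribution."""
    mean = n * p
    variance = n * p * (1 - p)
    return mean, variance


# Large sampling depth, tiny chance that a read maps to one given gene.
mean, variance = binomial_mean_variance(n=10_000_000, p=1e-6)
print(mean, variance, variance / mean)
```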

24
Q

What does over-dispersed mean, and what is it caused by?

A

When the variance of the data increases faster than the mean.
Caused by biological variation.

25
Q

What factors can make variance difficult to deal with?

A
  • only a few replicates per condition, which makes it hard to estimate variance
  • at low expression levels, the data is very noisy
26
Q

How does DESeq2 deal with challenges with variation? (3)

A
  • models the count matrix using a negative binomial distribution
  • borrows information across genes to estimate dispersion and ultimately variance (to determine the variance at an expected mean expression); can calculate a dispersion for each gene
  • determines the log2 fold change and its significance (Wald statistic)
27
Q

What is negative binomial distribution?

A

A distribution with overdispersion

K_ij ~ NB(u_ij, a_i)

K_ij = raw count of gene i in sample j, with fitted mean u_ij and gene dispersion a_i
a_i = value to estimate to determine the gene variance

u_ij = s_j * q_ij

s_j = sample-specific size factor
q_ij = number of fragments of gene i in sample j (expression level)
28
Q

What is the formula for variance?

A

Var(K_ij) = u_ij + a_i * (u_ij)^2

The dispersion estimates of all genes are fitted to a curve, which is assumed to give the expected dispersion for any given mean.

a_i is used to determine the log2 fold change and the error of that estimate.
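
To see what the variance formula implies, here is a small sketch comparing a dispersion of 0 (the Poisson case, variance = mean) with a dispersion of 0.1. The names `mu` and `alpha` stand in for u and a, and the dispersion value is an arbitrary choice for illustration.

```python
def nb_variance(mu, alpha):
    """Negative binomial variance: mean plus dispersion times mean squared."""
    return mu + alpha * mu ** 2


for mu in (10, 100, 1000):
    poisson_like = nb_variance(mu, 0.0)   # alpha = 0: variance equals the mean
    overdispersed = nb_variance(mu, 0.1)  # alpha > 0: variance outgrows the mean
    print(mu, poisson_like, overdispersed)
```

The quadratic term dominates at high means, which is the overdispersion that a Poisson model cannot capture.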

29
Q

On a plot of estimating gene-wise dispersion what happens to genes that are too far away from the best fit line?

A

The dispersion of those genes won’t be shrunk toward the best fit line (they keep their gene-wise estimate)

30
Q

Describe the steps in estimating gene-wise dispersion

A
  1. dispersion is estimated for each gene on its own
  2. information from all genes is used to create a best fit line, giving a general estimate of what to expect for a given mean
  3. the estimates of individual genes are shrunk towards the best fit line