Transcriptomics Flashcards
(20 cards)
What is the RNA seq pipeline?
- RNA isolation and selection
- DNA library
- Sequencing
- Quality control
- Read alignment
- Processing raw reads to counts
- Normalize
- Data analysis
How can we select only mRNA in our sample?
polyA selection: beads carrying a polyT sequence bind the polyA tail that almost all mRNAs have. This does not capture non-polyadenylated RNAs such as many lncRNAs, so if we want those we need to do rRNA depletion instead.
How do we remove rRNA from a sample?
We do rRNA depletion: beads bind to specific sequences on the rRNA, which is then washed away.
How is a DNA library prepared for RNA seq?
- We have mRNA in a sample (meaning we have done rRNA depletion/polyA selection). RNA is single stranded.
- It is fragmented and annealed with cDNA primers.
- cDNA is synthesized, adding a tagging sequence that binds at a random place on the mRNA and indicates directionality from the 3’ end.
- Remove the RNA
- Anneal another tag to mark directionality at the 5’ end, then synthesize cDNA again
- Purify the cDNA
- Add PCR primers and amplify with PCR
What is cDNA?
DNA reverse transcribed from mRNA (by the enzyme reverse transcriptase)
Computational pipeline from raw reads
- Quality control
- Align reads to genome (BAM files)
- OPTIONAL: Assembly of transcriptome by making contigs (de novo technique)
- Read count per gene/isoform
- Normalize
- Statistics
How do we take splicing into account when mapping to a genome/transcriptome?
TopHat pipeline.
Maps reads that span splice junctions.
First everything is mapped to the genome, and the reads that cannot be mapped are set aside. Splice sites are often flanked by consensus sequences, so we find candidate splice sites in the reference genome and try to map the remaining reads across them.
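As a toy illustration of the junction-search idea (a deliberate simplification, not TopHat's actual algorithm), the sketch below tries to place an unmapped read as two exact segments separated by a GT…AG intron in a reference sequence:

```python
def find_spliced_alignment(read, ref, min_intron=4, max_intron=50):
    """Toy splice-aware mapper: try to place the read as two exact
    segments separated by a GT...AG intron in the reference."""
    for split in range(1, len(read)):
        left, right = read[:split], read[split:]
        start = ref.find(left)
        while start != -1:
            donor = start + len(left)          # intron would start here
            for ilen in range(min_intron, max_intron + 1):
                acceptor = donor + ilen        # exon resumes here
                if (ref[donor:donor + 2] == "GT"
                        and ref[acceptor - 2:acceptor] == "AG"
                        and ref[acceptor:acceptor + len(right)] == right):
                    # (exon1 start, intron start, intron end)
                    return start, donor, acceptor
            start = ref.find(left, start + 1)
    return None

# Reference built as exon1 + GT..AG intron + exon2
ref = "AAACCC" + "GT" + "TTTT" + "AG" + "GGGTTT"
read = "AAACCCGGGTTT"  # read spans the junction, so it fails normal mapping
print(find_spliced_alignment(read, ref))
```

Real aligners of course score mismatches and use indexes rather than brute-force string search; the point is only that consensus dinucleotides constrain where introns can be placed.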
What bias does different size isoforms carry?
If isoforms have different sizes, different numbers of reads will map to them. The scaling should be linear: double the length means double the number of mapped reads.
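The linear-scaling assumption can be stated directly: at equal transcript abundance, expected read counts are proportional to isoform length. A minimal sketch with hypothetical numbers:

```python
def expected_reads(isoform_length, abundance, depth_per_base=0.1):
    """Expected reads = transcript copies * length * per-base sampling rate.
    depth_per_base is an arbitrary illustrative constant."""
    return abundance * isoform_length * depth_per_base

short = expected_reads(1000, abundance=50)   # 1 kb isoform
long_ = expected_reads(2000, abundance=50)   # 2 kb isoform, same abundance
print(short, long_)  # the longer isoform collects twice the reads
```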
Cufflinks can be used to find isoforms and estimate likely abundance
What is cufflinks?
Cufflinks can be used to find isoforms and estimate their likely abundance. It uses reads that exclude certain isoforms (e.g. junction-spanning reads compatible with only some isoforms), together with the likelihood of a fragment mapping to a given isoform (based on fragment and isoform sizes)
Why do we need to normalize the data? Which factors do we need to compensate for?
Factors:
- Sequencing depth: Different number of reads per sample (efficiency of the run)
- Gene length (within the library)
- Differences in count distributions
How do we normalize?
If we want to compare expression between genes:
- TPM (transcripts per million)
Normalizes for gene length (to reads per base), then rescales each sample to one million
If we want to know change in gene expression in the same gene in different situations:
- Scaling factors (divide all counts by a constant)
The scaling factor reflects the size of the library between different runs.
We are always looking at the same gene, therefore we don’t need to normalize for gene length
If we have redistribution of reads:
- TMM: scaling-factor normalization based on a weighted trimmed mean of the log expression ratios between two conditions
- Median Ratio Method: median of per-gene ratios to a geometric-mean reference
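A minimal sketch of both approaches (the counts, lengths, and reference values below are made up for illustration):

```python
def tpm(counts, lengths_kb):
    """Transcripts per million: length-normalize counts to reads per
    kilobase, then rescale so the sample sums to 1e6."""
    rpk = [c / l for c, l in zip(counts, lengths_kb)]
    scale = sum(rpk) / 1e6
    return [x / scale for x in rpk]

def median_ratio_size_factor(counts, ref_counts):
    """Median-ratio scaling factor: median of per-gene ratios to a
    reference sample (genes with zero reference count are skipped)."""
    ratios = sorted(c / r for c, r in zip(counts, ref_counts) if r > 0)
    n = len(ratios)
    return (ratios[n // 2] if n % 2 else
            (ratios[n // 2 - 1] + ratios[n // 2]) / 2)

counts = [100, 400, 500]        # one sample, three genes
lengths_kb = [1.0, 2.0, 5.0]    # gene lengths in kilobases
print(tpm(counts, lengths_kb))  # values now sum to one million
```

In the median-ratio method the reference is typically the per-gene geometric mean across all samples; dividing every count in a sample by its size factor makes samples of different depths comparable.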
Which error distribution would be most fitting for bulk RNA seq data?
Poisson (discrete).
Sidenote: in practice, genes with higher mean counts generally have higher variance than the Poisson predicts (overdispersion), which is why negative binomial models are often used.
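The sidenote can be made concrete: a Poisson model forces variance = mean, while a negative binomial adds a quadratic term that captures the extra biological variance. The dispersion value below is an arbitrary illustration:

```python
def poisson_var(mean):
    """For Poisson counts, the variance equals the mean."""
    return mean

def neg_binomial_var(mean, dispersion):
    """Negative binomial variance: mean + dispersion * mean^2.
    The quadratic term models biological variability between replicates."""
    return mean + dispersion * mean ** 2

# The gap between the two grows with expression level
for mu in (10, 100, 1000):
    print(mu, poisson_var(mu), neg_binomial_var(mu, dispersion=0.05))
```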
What are the three types of noise?
- Sampling noise from the Poisson distribution
- Technical noise from preparation and sequencing
- Biological noise (often dominant for highly expressed genes)
What is meant by the power of a statistical analysis and how can we improve the power?
The statistical power: the probability that the test correctly rejects a false null hypothesis
Things that lower the power:
- Multiple testing (specifically using Bonferroni to correct for it)
- Adjusting p-values
Things that increase the power:
- Running fewer tests
- Do more replicates (more sequencing runs etc)
- Reduce errors in estimates (back to before sequencing to reduce errors)
To better correct for multiple testing:
- Family-wise error rate (FWER) control: Corrects for the likelihood of having at least 1 false positive
- False discovery rate (FDR): Controls the expected proportion of false positives among our null-hypothesis rejections.
Consider: Is it more costly to have false positives in the data or is it more costly to miss interesting positives because of over correction?
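The two corrections can be contrasted in code. A minimal sketch of Bonferroni (FWER) versus Benjamini-Hochberg (FDR) on a hypothetical list of p-values:

```python
def bonferroni(pvals, alpha=0.05):
    """FWER control: reject only p-values below alpha / m."""
    m = len(pvals)
    return [p < alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """FDR control: reject all p-values up to the largest rank k
    (in sorted order) with p_(k) <= (k/m) * alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k_max = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject

pvals = [0.001, 0.008, 0.025, 0.041, 0.20]
print(sum(bonferroni(pvals)))          # stricter: fewer rejections
print(sum(benjamini_hochberg(pvals)))  # more power at a controlled FDR
```

On this list Bonferroni rejects two hypotheses while BH rejects three, illustrating the power/false-positive trade-off the card asks about.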
Why are false positives a problem in statistics?
Across many tests we expect many false positives, so we need to control their rate. This can be done by adjusting the p-values to reflect the expected rate, but that lowers the power, which is bad.
The more tests we run, the stronger the correction must be, which lowers the power further.
How do you optimize a GLM with empirical bayes?
- Estimate variability for every gene
- Estimate which part of the uncertainty in the variability estimates is due to sampling noise
- Use this to make a prior of true variability per gene, given the expression level
- Use this to update the initial estimates and iterate however many times.
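A simplified numeric sketch of the shrinkage step (in the spirit of limma-style moderated variances, not the full iterative GLM fit; the degrees-of-freedom weights and prior variance are assumed values):

```python
def moderated_variance(s2_gene, s2_prior, d_gene, d_prior):
    """Shrink a per-gene sample variance toward the prior variance.
    Genes measured with few replicates (small d_gene) are shrunk more."""
    return (d_prior * s2_prior + d_gene * s2_gene) / (d_prior + d_gene)

# Noisy per-gene variance estimates from 3 replicates (d_gene = 2),
# shrunk toward a prior variance of 1.0 with prior weight d_prior = 4:
for s2 in (0.1, 1.0, 5.0):
    print(s2, "->", moderated_variance(s2, s2_prior=1.0, d_gene=2, d_prior=4))
```

Extreme estimates (0.1 and 5.0) are pulled toward the prior while an estimate already at the prior is unchanged, which is exactly the stabilizing effect the card describes.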
What kind of plots can be made?
- Hierarchical clustering
- PCA
- Volcano plot (test outliers and patterns)
- MA plot (fold change vs abundance)
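The MA-plot coordinates are simple to compute from two samples of normalized counts: M is the log2 fold change and A the mean log2 abundance (a pseudocount of 1 is assumed here to avoid taking log of zero):

```python
import math

def ma_values(sample_a, sample_b):
    """Return (A, M) pairs per gene: A = mean log2 abundance,
    M = log2 fold change of sample_b over sample_a."""
    points = []
    for a, b in zip(sample_a, sample_b):
        la, lb = math.log2(a + 1), math.log2(b + 1)
        points.append(((la + lb) / 2, lb - la))
    return points

control = [100, 50, 0]   # hypothetical normalized counts
treated = [400, 50, 7]
for A, M in ma_values(control, treated):
    print(f"A={A:.2f}  M={M:.2f}")
```

Plotting M against A makes fold-change outliers easy to spot at every abundance level, which is why the MA plot pairs naturally with the volcano plot.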
What is ribosomal profiling/footprinting?
A method used to study translational control by analyzing which mRNAs are being translated by ribosomes at a given time.
What is the process of obtaining samples for ribosomal profiling and doing library prep?
- Cells are treated with chemicals (e.g. cycloheximide) that freeze the ribosomes in place on the mRNA they are currently translating
- Cell lysis (breakdown membrane)
- Nuclease digestion: all mRNA outside of the ribosome protection is digested with nucleases so only the ribosomal protected fragments (RPFs) are left
- Purification
- Library preparation: ligating linkers, reverse transcribing into cDNA, performing PCR amplification
- Sequencing
What are the purposes of ribosomal profiling?
- Measuring the mRNAs that are being actively translated in the specific moment.
- Identify translation start sites