Transcriptomics Flashcards
(20 cards)
What is the RNA seq pipeline?
- RNA isolation and selection
- DNA library
- Sequencing
- Quality control
- Read alignment
- Processing raw reads to counts
- Normalize
- Data analysis
How can we select only mRNA in our sample?
polyA selection: beads carrying a polyT sequence bind the polyA tail that almost all mRNAs have. This does not capture non-polyadenylated RNAs such as many lncRNAs, so if we want those we need to do rRNA depletion instead.
How do we remove rRNA from a sample?
We do rRNA depletion: beads bind to specific sequences on the rRNA, which is then washed away.
How is a DNA library prepared for RNA seq?
- We have mRNA in a sample (meaning we have done rRNA depletion/polyA selection). RNA is single stranded.
- It is fragmented and annealed with cDNA primers.
- cDNA is synthesized, adding a tagging sequence that binds at a random place on the mRNA and indicates directionality from the 3’ end.
- Remove the RNA
- Anneal another tag to mark directionality at the 5’ end, then synthesize cDNA again
- Purify the cDNA
- Add PCR primers and amplify with PCR
What is cDNA?
DNA reverse transcribed from mRNA (by the enzyme reverse transcriptase)
Computational pipeline from raw reads
- Quality control
- Align reads to genome (BAM files)
- OPTIONAL: Assembly of transcriptome by making contigs (de novo technique)
- Read count per gene/isoform
- Normalize
- Statistics
How do we take splicing into account when mapping to a genome/transcriptome?
TopHat pipeline.
Maps reads that span splice junctions.
First everything is mapped to the genome, and the reads that cannot be mapped are set aside. Splice sites are often flanked by consensus sequences, so we find candidate splice sites in the reference genome and try to map the remaining reads across them.
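As a toy illustration of the junction-search idea (a deliberate simplification, not TopHat's actual algorithm), the sketch below tries to place an unmapped read as two exact segments separated by a GT…AG intron in a reference sequence:

```python
def find_spliced_alignment(read, ref, min_intron=4, max_intron=50):
    """Toy splice-aware mapper: try to place the read as two exact
    segments separated by a GT...AG intron in the reference."""
    for split in range(1, len(read)):
        left, right = read[:split], read[split:]
        start = ref.find(left)
        while start != -1:
            donor = start + len(left)          # intron would start here
            for ilen in range(min_intron, max_intron + 1):
                acceptor = donor + ilen        # exon resumes here
                if (ref[donor:donor + 2] == "GT"
                        and ref[acceptor - 2:acceptor] == "AG"
                        and ref[acceptor:acceptor + len(right)] == right):
                    # (exon1 start, intron start, intron end)
                    return start, donor, acceptor
            start = ref.find(left, start + 1)
    return None

# Reference built as exon1 + GT..AG intron + exon2
ref = "AAACCC" + "GT" + "TTTT" + "AG" + "GGGTTT"
read = "AAACCCGGGTTT"  # read spans the junction, so it fails normal mapping
print(find_spliced_alignment(read, ref))
```

Real aligners of course score mismatches and use indexes rather than brute-force string search; the point is only that consensus dinucleotides constrain where introns can be placed.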
What bias does different size isoforms carry?
If isoforms have different sizes, different numbers of reads will map to them. The scaling should be linear: double the length means double the number of mapped reads.
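The linear-scaling assumption can be stated directly: at equal transcript abundance, expected read counts are proportional to isoform length. A minimal sketch with hypothetical numbers:

```python
def expected_reads(isoform_length, abundance, depth_per_base=0.1):
    """Expected reads = transcript copies * length * per-base sampling rate.
    depth_per_base is an arbitrary illustrative constant."""
    return abundance * isoform_length * depth_per_base

short = expected_reads(1000, abundance=50)   # 1 kb isoform
long_ = expected_reads(2000, abundance=50)   # 2 kb isoform, same abundance
print(short, long_)  # the longer isoform collects twice the reads
```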
Cufflinks can be used to find isoforms and estimate likely abundance
What is cufflinks?
Cufflinks can be used to find isoforms and estimate their likely abundance. It uses reads that exclude certain isoforms (e.g. junction-spanning reads compatible with only some isoforms), together with the likelihood of a fragment mapping to a given isoform (based on fragment and isoform sizes)
Why do we need to normalize the data? Which factors do we need to compensate for?
Factors:
- Sequencing depth: Different number of reads per sample (efficiency of the run)
- Gene length (within the library)
- Differences in count distributions
How do we normalize?
If we want to compare expression between genes:
- TPM (transcripts per million)
Normalizes for gene length (to reads per base), then rescales each sample to one million
If we want to know change in gene expression in the same gene in different situations:
- Scaling factors (divide all counts by a constant)
The scaling factor reflects the size of the library between different runs.
We are always looking at the same gene, therefore we don’t need to normalize for gene length
If we have redistribution of reads:
- TMM: scaling-factor normalization based on a weighted trimmed mean of the log expression ratios between two conditions
- Median Ratio Method: median of per-gene ratios to a geometric-mean reference
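A minimal sketch of both approaches (the counts, lengths, and reference values below are made up for illustration):

```python
def tpm(counts, lengths_kb):
    """Transcripts per million: length-normalize counts to reads per
    kilobase, then rescale so the sample sums to 1e6."""
    rpk = [c / l for c, l in zip(counts, lengths_kb)]
    scale = sum(rpk) / 1e6
    return [x / scale for x in rpk]

def median_ratio_size_factor(counts, ref_counts):
    """Median-ratio scaling factor: median of per-gene ratios to a
    reference sample (genes with zero reference count are skipped)."""
    ratios = sorted(c / r for c, r in zip(counts, ref_counts) if r > 0)
    n = len(ratios)
    return (ratios[n // 2] if n % 2 else
            (ratios[n // 2 - 1] + ratios[n // 2]) / 2)

counts = [100, 400, 500]        # one sample, three genes
lengths_kb = [1.0, 2.0, 5.0]    # gene lengths in kilobases
print(tpm(counts, lengths_kb))  # values now sum to one million
```

In the median-ratio method the reference is typically the per-gene geometric mean across all samples; dividing every count in a sample by its size factor makes samples of different depths comparable.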
Which error distribution would be most fitting for bulk RNA seq data?
Poisson (discrete).
Sidenote: in practice, genes with higher mean counts generally have higher variance than the Poisson predicts (overdispersion), which is why negative binomial models are often used.
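The sidenote can be made concrete: a Poisson model forces variance = mean, while a negative binomial adds a quadratic term that captures the extra biological variance. The dispersion value below is an arbitrary illustration:

```python
def poisson_var(mean):
    """For Poisson counts, the variance equals the mean."""
    return mean

def neg_binomial_var(mean, dispersion):
    """Negative binomial variance: mean + dispersion * mean^2.
    The quadratic term models biological variability between replicates."""
    return mean + dispersion * mean ** 2

# The gap between the two grows with expression level
for mu in (10, 100, 1000):
    print(mu, poisson_var(mu), neg_binomial_var(mu, dispersion=0.05))
```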
What are the three types of noise?
- Sampling noise from the Poisson distribution
- Technical noise from preparation and sequencing
- Biological noise (often dominant for highly expressed genes)
What is meant by the power of a statistical analysis and how can we improve the power?
The statistical power: the probability that the test correctly rejects a false null hypothesis
Things that lower the power:
- Multiple testing (specifically using Bonferroni to correct for it)
- Adjusting p-values
Things that increase the power:
- Running fewer tests
- Do more replicates (more sequencing runs etc)
- Reduce errors in estimates (back to before sequencing to reduce errors)
To better correct for multiple testing:
- Family-wise error rate (FWER) control: Corrects for the likelihood of having at least 1 false positive
- False discovery rate (FDR): Controls the expected proportion of false positives among our null-hypothesis rejections.
Consider: Is it more costly to have false positives in the data or is it more costly to miss interesting positives because of over correction?
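The two corrections can be contrasted in code. A minimal sketch of Bonferroni (FWER) versus Benjamini-Hochberg (FDR) on a hypothetical list of p-values:

```python
def bonferroni(pvals, alpha=0.05):
    """FWER control: reject only p-values below alpha / m."""
    m = len(pvals)
    return [p < alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """FDR control: reject all p-values up to the largest rank k
    (in sorted order) with p_(k) <= (k/m) * alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k_max = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject

pvals = [0.001, 0.008, 0.025, 0.041, 0.20]
print(sum(bonferroni(pvals)))          # stricter: fewer rejections
print(sum(benjamini_hochberg(pvals)))  # more power at a controlled FDR
```

On this list Bonferroni rejects two hypotheses while BH rejects three, illustrating the power/false-positive trade-off the card asks about.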
Why are false positives a problem in statistics?
Across many tests we expect many false positives, so we need to control their rate. This can be done by adjusting the p-values to reflect the expected rate, but that lowers the power, which is bad.
The more tests we run, the stronger the correction must be, which lowers the power further.
How do you optimize a GLM with empirical bayes?
- Estimate variability for every gene
- Estimate which part of the uncertainty in the variability estimates is due to sampling noise
- Use this to make a prior of true variability per gene, given the expression level
- Use this to update the initial estimates and iterate however many times.
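A simplified numeric sketch of the shrinkage step (in the spirit of limma-style moderated variances, not the full iterative GLM fit; the degrees-of-freedom weights and prior variance are assumed values):

```python
def moderated_variance(s2_gene, s2_prior, d_gene, d_prior):
    """Shrink a per-gene sample variance toward the prior variance.
    Genes measured with few replicates (small d_gene) are shrunk more."""
    return (d_prior * s2_prior + d_gene * s2_gene) / (d_prior + d_gene)

# Noisy per-gene variance estimates from 3 replicates (d_gene = 2),
# shrunk toward a prior variance of 1.0 with prior weight d_prior = 4:
for s2 in (0.1, 1.0, 5.0):
    print(s2, "->", moderated_variance(s2, s2_prior=1.0, d_gene=2, d_prior=4))
```

Extreme estimates (0.1 and 5.0) are pulled toward the prior while an estimate already at the prior is unchanged, which is exactly the stabilizing effect the card describes.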
What kind of plots can be made?
- Hierarchical clustering
- PCA
- Volcano plot (test outliers and patterns)
- MA plot (fold change vs abundance)
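The MA-plot coordinates are simple to compute from two samples of normalized counts: M is the log2 fold change and A the mean log2 abundance (a pseudocount of 1 is assumed here to avoid taking log of zero):

```python
import math

def ma_values(sample_a, sample_b):
    """Return (A, M) pairs per gene: A = mean log2 abundance,
    M = log2 fold change of sample_b over sample_a."""
    points = []
    for a, b in zip(sample_a, sample_b):
        la, lb = math.log2(a + 1), math.log2(b + 1)
        points.append(((la + lb) / 2, lb - la))
    return points

control = [100, 50, 0]   # hypothetical normalized counts
treated = [400, 50, 7]
for A, M in ma_values(control, treated):
    print(f"A={A:.2f}  M={M:.2f}")
```

Plotting M against A makes fold-change outliers easy to spot at every abundance level, which is why the MA plot pairs naturally with the volcano plot.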
What is ribosomal profiling/footprinting?
A method used to study translational control by analyzing which mRNAs are being translated by ribosomes at a given time.
What is the process of obtaining samples for ribosomal profiling and doing library prep?
- Cells are treated with chemicals (e.g. cycloheximide) that freeze the ribosomes in place on the mRNA they are currently translating
- Cell lysis (breakdown membrane)
- Nuclease digestion: all mRNA outside of the ribosome protection is digested with nucleases so only the ribosomal protected fragments (RPFs) are left
- Purification
- Library preparation: ligating linkers, reverse transcribing into cDNA, performing PCR amplification
- Sequencing
What are the purposes of ribosomal profiling?
- Measuring the mRNAs that are being actively translated in the specific moment.
- Identify translation start sites