RNA seq Flashcards
What are the objectives of RNA seq?
- Study gene regulation and expression variation (e.g. compare different tissues, time points, disease states)
- Understand the structure, function and organization of information within the genome
- And many more: sub-classifying cancer, spatial transcriptomics, host-pathogen interaction
Describe microarrays
- quick and cost effective
- based on hybridization to complementary sequence
- in Affymetrix (Affy) arrays you need two chips per experiment: one for the control and one for the actual experiment
- very noisy!
What are some limitations of microarrays?
- The data is very “noisy”
- Expression levels are determined by a spot of light against a noisy background
- Probes are not available for all genes - Affy probes are only present for approx 75-80% of human genes
- Genes with very low expression may not be detected
- The data requires a large degree of statistical manipulation
- Result only shows a gene is expressed but gives no information about which transcript
Outline the workflow used in RNA seq
Compare RNA sequencing and Microarrays
- RNA-seq works because it can be assumed that every mRNA molecule present has the same chance of being sequenced, so read counts reflect abundance
- If the experiment shows twice as much mRNA for a particular gene as the control, then expression of that gene is 2-fold greater
- RNA-seq gives more accurate quantification and has better dynamic range (ability to quantify genes expressed at low and high levels)
- Not limited by microarray probe sequences and availability
- RNA-seq can potentially identify novel transcripts (e.g. new splice sites)
- RNA-seq can be used to study alternative splicing
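The fold-change reasoning above (twice as much mRNA means 2-fold greater expression) can be sketched in Python. This is only an illustration: the counts are made up, `log2_fold_change` is a hypothetical helper, and the pseudocount is an assumption added to avoid dividing by zero. Real analyses normalize for library size and use replicates before comparing.

```python
import math

def log2_fold_change(treated, control, pseudocount=1.0):
    """Log2 ratio of treated to control counts.

    The pseudocount (an assumption of this sketch) keeps the ratio
    defined when a gene has zero reads in one condition.
    """
    return math.log2((treated + pseudocount) / (control + pseudocount))

# Hypothetical normalized read counts for one gene
control_count = 99
treated_count = 199

lfc = log2_fold_change(treated_count, control_count)  # log2(200 / 100)
fold = 2 ** lfc                                        # back to a plain ratio
```

Here `lfc` is 1.0, i.e. a 2-fold increase. Working on the log2 scale makes up- and down-regulation symmetric (+1 and -1 for doubling and halving).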
Outline the RNA seq analysis procedure
Describe library preparation
- Total RNA extraction and target RNA enrichment
- Poly(A) capture
- Ribosomal RNA depletion
- Fragment RNA and reverse transcribe
- Ligate adapters and PCR amplify
- indexes/barcodes allow multiplexing
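How barcodes allow multiplexing can be sketched as a small demultiplexing step. This is a toy under stated assumptions: the barcode table is hypothetical, and the barcode is assumed to sit at the start of each read (in-line), whereas many platforms read the index separately.

```python
from collections import defaultdict

# Hypothetical barcode-to-sample table (not from any real kit)
BARCODES = {"ACGT": "sample_A", "TGCA": "sample_B"}

def demultiplex(reads, barcodes, barcode_len=4):
    """Group pooled reads by their leading barcode and strip the barcode."""
    by_sample = defaultdict(list)
    for read in reads:
        tag, insert = read[:barcode_len], read[barcode_len:]
        # Reads whose barcode is not in the table are kept aside
        sample = barcodes.get(tag, "undetermined")
        by_sample[sample].append(insert)
    return dict(by_sample)

pooled = ["ACGTGGGTTT", "TGCACCCAAA", "ACGTTTTGGG", "NNNNAAAAAA"]
result = demultiplex(pooled, BARCODES)
```

After this step each sample has its own list of reads, so several libraries can share one sequencing lane.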
What should you consider in your experimental design for RNA seq?
- Single vs paired end (latter helps identify e.g. isoforms)
- Sequencing depth (deeper sequencing detects more transcripts)
- Biological replicates (important for differential expression) - you need to have many samples to identify any errors
- Spike-in RNAs (can help with normalization and quality control) - you add a known amount of RNA and then you can normalize your data
- Multiplexing (pool barcoded samples, then split across lanes)
- Batch design (randomize samples across experimental batches, cannot correct for batch effects if technical and experimental factors are confounded)
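The batch design point can be illustrated with a small randomization sketch. The sample names, the two conditions and the helper `assign_batches` are all hypothetical; the idea is only that samples are shuffled within each condition and then dealt across batches, so that condition and batch are not confounded.

```python
import random
from collections import Counter

def assign_batches(samples, n_batches, seed=0):
    """Shuffle within each condition, then deal samples round-robin into batches.

    samples: list of (sample_name, condition) pairs.
    Returns a dict mapping sample_name -> batch index.
    """
    rng = random.Random(seed)
    by_condition = {}
    for name, condition in samples:
        by_condition.setdefault(condition, []).append(name)

    assignment = {}
    slot = 0
    for condition in sorted(by_condition):
        names = by_condition[condition]
        rng.shuffle(names)  # random order within the condition
        for name in names:
            assignment[name] = slot % n_batches
            slot += 1
    return assignment

# Hypothetical experiment: 4 controls and 4 treated samples, 2 batches
samples = [(f"ctrl_{i}", "control") for i in range(4)]
samples += [(f"trt_{i}", "treated") for i in range(4)]

batches = assign_batches(samples, n_batches=2)
condition_of = dict(samples)
per_batch = Counter((batches[name], condition_of[name]) for name in batches)
```

With this layout every batch contains both conditions in equal numbers. If instead all controls were in one batch and all treated samples in the other, a batch effect could not be separated from the treatment effect.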
Describe the quality control step in RNA seq
- Assess the quality and trim the reads if needed
- Quality control is an essential step in the analysis as poor quality reads can significantly impact results
What are some problems you can face during QC?
- Low-quality sequences (low confidence bases)
- Sequencing artefacts (duplicate reads, sequence bias)
- Sequence contamination (reads from another organism)
How can you solve the problems of low quality in QC?
- FastQC for simple QC checks on raw reads - helps you identify low-quality reads to remove or trim
- Discard low quality reads, and trim adapter sequences & poor quality bases (e.g. using Trimmomatic)
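Trimming poor quality bases and discarding low quality reads can be sketched as follows. This is a simplified illustration, not Trimmomatic's actual algorithm: it assumes Phred+33 quality encoding, and the thresholds `min_q` and `min_len` are arbitrary values chosen for the example.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]

def trim_3prime(seq, quality, min_q=20, min_len=5):
    """Remove low-quality bases from the 3' end of a read.

    Returns (trimmed_seq, trimmed_quality), or None when the read
    becomes shorter than min_len and should be discarded.
    """
    scores = phred_scores(quality)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    if end < min_len:
        return None
    return seq[:end], quality[:end]

# 'I' encodes Phred 40 (high confidence), '#' encodes Phred 2 (low confidence)
good = trim_3prime("ACGTACGTAC", "IIIIIIII##")  # last two bases trimmed
bad = trim_3prime("ACGTACGTAC", "II########")   # too short after trimming
```

The first read keeps its eight confident bases, the second is discarded entirely.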
Describe read alignment
After quality control, reads are aligned to a reference genome or transcriptome.
Method depends on experiment aims and availability of suitable references.
What will you have to do if you’re not confident in your mRNA reads?
- If you are not confident in your mRNA reads, you will probably have to map against the genome. This is more difficult because reads then have to be mapped across exon boundaries (splice junctions).
What methods of alignment can you have?
- alignment to reference genome
- alignment to reference transcriptome
- alignment to de novo assembled transcriptome
Describe alignment to reference genome
- requires splice-aware aligners (e.g. STAR, HISAT2)
- use known splice junctions, but can also discover new ones
- computational challenge is to accurately align reads that span splice junctions
- you can give it an annotation file listing the exons (e.g. a GTF file) so it is aware of where they are
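In SAM output, a read that spans a splice junction is typically reported with an `N` operation in its CIGAR string (a skipped region on the reference, i.e. the intron). A minimal sketch of reading junctions back out of an alignment, where `splice_junctions` is a hypothetical helper and the example alignment is made up:

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# CIGAR operations that consume reference positions (per the SAM specification)
REF_CONSUMING = set("MDN=X")

def splice_junctions(pos, cigar):
    """Return introns implied by a spliced alignment.

    pos: 1-based leftmost reference position of the alignment.
    Returns a list of (intron_start, intron_end), 1-based inclusive.
    """
    junctions = []
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            junctions.append((ref, ref + length - 1))
        if op in REF_CONSUMING:
            ref += length
    return junctions

# 50 bases in one exon, a 200-base intron, then 50 bases in the next exon
spliced = splice_junctions(1000, "50M200N50M")
unspliced = splice_junctions(1000, "100M")
```

Here `spliced` is `[(1050, 1249)]` and `unspliced` is an empty list, which mirrors the distinction between reads that jump regions and reads that sit inside a single exon.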
Describe alignment to reference transcriptome
- unspliced alignment (e.g. Bowtie2)
- generally faster, but requires comprehensive reference transcriptome
- main challenge is dealing with multi-mapping (reads that map to several transcripts)
Describe alignment to de novo assembled transcriptome
- If there is no suitable reference genome, first assemble reads into contigs, then align reads to this de novo transcriptome (e.g. for novel genome, cancer samples)
Describe reference based mapping
- If the reference genome is well annotated e.g. human, the reads can be mapped to known genes using the GTF file
- For less well defined annotations, or where more accurate mapping is required, the mapped reads need to be assembled into transcripts
- Novel genome
- To identify variable transcripts/isoforms
- Cancer samples
- Tools to produce these assembled transcripts include Cufflinks and StringTie
- Read mapping aligns reads to the reference genome, marking reads that align with and without splice junctions.
- Reads that map unspliced must lie within exons; any that jump between regions must span introns
- Cufflinks/StringTie look at the distribution of reads and estimate a transcript and transcript/gene read counts
Do we have alignment-free methods? What are the benefits of them?
- Recent methods (e.g. Kallisto, Salmon) avoid full alignments of each read, and instead use ‘pseudoalignments’ that identify which transcripts are compatible with a given read (but not exactly where that read aligns).
- Very fast, accurate and computationally efficient
- Assume well annotated transcriptomes and cannot identify novel transcripts
- Pseudo aligners have been developed to provide faster and more efficient read mapping
- Kallisto “can quantify 30 million human reads in less than 3 minutes on a Mac desktop computer using only the read sequences and a transcriptome index that itself takes less than 10 minutes to build”
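The idea of a pseudoalignment (which transcripts are compatible with a read, not where the read aligns) can be sketched with a toy k-mer index. This is not how Kallisto or Salmon are implemented; the transcripts, reads and k-mer length are invented for illustration, and the function names are hypothetical.

```python
def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_index(transcripts, k):
    """Map each k-mer to the set of transcripts that contain it."""
    index = {}
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(name)
    return index

def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of every k-mer in the read."""
    compatible = None
    for kmer in kmers(read, k):
        hits = index.get(kmer, set())
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            return set()
    return compatible if compatible is not None else set()

# Two made-up isoforms sharing a first exon (AAACCC) but not the second
transcripts = {
    "isoform_1": "AAACCCGGGTTT",
    "isoform_2": "AAACCCTATATA",
}
K = 4
index = build_index(transcripts, K)

shared = compatible_transcripts("AAACCC", index, K)  # from the shared exon
unique = compatible_transcripts("CCGGGT", index, K)  # crosses into isoform_1 only
novel = compatible_transcripts("GGGGGG", index, K)   # not in the annotation
```

The read from the shared exon is compatible with both isoforms, the junction-crossing read with only one, and the unannotated read with none, which is why these methods cannot identify novel transcripts.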
How can you quantify expression?
- direct fragment counting
- transcript level quantification
- alignment-free quantification
Describe direct fragment counting
- Count fragments that overlap each gene (e.g. using featureCounts, HTSeq)
- Simple and fast
- How to deal with multi-mapping reads?
- No information on relative transcript abundances
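Direct fragment counting can be sketched as an interval-overlap count. This is a simplification and not the behaviour of featureCounts or HTSeq: genes are treated as single intervals rather than sets of exons, the coordinates are made up, and fragments that overlap more than one gene are simply set aside, which is just one possible policy.

```python
def count_fragments(genes, fragments):
    """Count fragments overlapping each gene.

    genes: dict of gene name -> (chrom, start, end), inclusive coordinates.
    fragments: list of (chrom, start, end).
    Returns (counts per gene, number of ambiguous fragments).
    """
    counts = {name: 0 for name in genes}
    ambiguous = 0
    for chrom, start, end in fragments:
        hits = [
            name
            for name, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start <= g_end and end >= g_start
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif len(hits) > 1:
            ambiguous += 1  # overlaps several genes: not assigned
    return counts, ambiguous

# Hypothetical annotation with two overlapping genes
genes = {"geneA": ("chr1", 100, 500), "geneB": ("chr1", 450, 900)}
fragments = [
    ("chr1", 120, 220),    # geneA only
    ("chr1", 600, 700),    # geneB only
    ("chr1", 460, 490),    # overlaps both genes
    ("chr1", 1000, 1100),  # overlaps no gene
    ("chr1", 300, 400),    # geneA only
]
counts, ambiguous = count_fragments(genes, fragments)
```

The result is one number per gene, which shows why this approach gives no information on relative transcript abundances.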
Describe transcript level quantification
- Assign fragments to specific transcripts
- Can aggregate over all possible isoforms to obtain gene-level count
- Use a statistical model to handle multi-mapping reads, and assign these fragments probabilistically
- Can observe e.g. changes in isoform usage
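Aggregating transcript-level estimates to a gene-level count, and looking at isoform usage, can be sketched like this. The transcript counts and the transcript-to-gene table are invented, and both function names are hypothetical.

```python
from collections import defaultdict

def gene_level_counts(transcript_counts, tx2gene):
    """Sum transcript-level counts over all isoforms of each gene."""
    totals = defaultdict(float)
    for tx, count in transcript_counts.items():
        totals[tx2gene[tx]] += count
    return dict(totals)

def isoform_usage(transcript_counts, tx2gene):
    """Fraction of each gene's count contributed by each isoform."""
    totals = gene_level_counts(transcript_counts, tx2gene)
    return {
        tx: count / totals[tx2gene[tx]]
        for tx, count in transcript_counts.items()
        if totals[tx2gene[tx]] > 0
    }

# Hypothetical estimated counts (can be fractional after probabilistic assignment)
transcript_counts = {"tx1": 30.0, "tx2": 10.0, "tx3": 5.0}
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

gene_counts = gene_level_counts(transcript_counts, tx2gene)
usage = isoform_usage(transcript_counts, tx2gene)
```

Comparing `usage` between conditions is one simple way to see a change in isoform usage even when the gene-level count stays the same.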
Describe alignment free quantification
- Bypasses full alignment; fast and accurate (e.g. Kallisto, Salmon)
How can you estimate transcript abundances?
- Many methods have been developed to deal with multimapping reads.
- These use statistical models to link the probability of observing a given fragment to the relative transcript abundances.
- Optimisation algorithms are then used to infer transcript abundances given the observed reads and the assumed model.
- Cufflinks is an example which combines transcript assembly with abundance estimation. It uses fragment length information to help assign reads.
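The model-plus-optimisation idea above can be sketched with a toy expectation-maximisation (EM) loop. This is not the Cufflinks model: it ignores transcript length and fragment length information, and the reads and transcript names are invented. Each read is represented only by the set of transcripts it is compatible with.

```python
def em_abundances(compat, transcripts, n_iter=200):
    """Estimate relative transcript abundances from read compatibility sets.

    compat: list of sets, one per read, naming the compatible transcripts.
    Returns a dict transcript -> estimated proportion of reads.
    """
    # Start from equal abundances
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        # E-step: split each read across its transcripts in proportion to theta
        expected = {t: 0.0 for t in transcripts}
        for hits in compat:
            total = sum(theta[t] for t in hits)
            for t in hits:
                expected[t] += theta[t] / total
        # M-step: new abundances are the normalized expected read counts
        n_reads = sum(expected.values())
        theta = {t: expected[t] / n_reads for t in transcripts}
    return theta

# 6 reads unique to tx1, 2 unique to tx2, 4 multi-mapping to both
reads = [{"tx1"}] * 6 + [{"tx2"}] * 2 + [{"tx1", "tx2"}] * 4
theta = em_abundances(reads, ["tx1", "tx2"])
```

In this example the estimates converge to 0.75 for tx1 and 0.25 for tx2: the multi-mapping reads end up shared 3:1, following the ratio suggested by the uniquely mapping reads.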