Gene module Flashcards
(32 cards)
Explain why RNA-seq reads should be mapped with splicing-aware read mappers.
RNA-Seq reads need splicing-aware mappers because RNA comes from spliced transcripts where introns are removed, and some reads span exon-exon junctions. Regular mappers can’t handle these split reads, but splicing-aware tools (e.g., STAR, HISAT2) can align them correctly, ensuring accurate gene expression analysis and detection of splicing events.
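A toy sketch of the junction problem, assuming exact string matching in place of real alignment. The sequences, the helper name `find_split_alignment`, and the `min_anchor` parameter are all made up for illustration; real tools such as STAR or HISAT2 work very differently internally.

```python
def find_split_alignment(read, genome, min_anchor=3):
    """Try to place the read as two exact pieces separated by a gap.

    Returns (start_of_left_piece, split_position_in_read, start_of_right_piece)
    or None if no such placement exists.
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        start1 = genome.find(left)
        while start1 != -1:
            # The right piece must start downstream of the left piece, with a gap.
            start2 = genome.find(right, start1 + len(left) + 1)
            if start2 != -1:
                return start1, split, start2
            start1 = genome.find(left, start1 + 1)
    return None


# Hypothetical toy gene: two exons separated by an intron (GT...AG).
exon1 = "ATGGCCAAGT"
intron = "GTAAGTTTTTCTCTAG"
exon2 = "CCTGAAGGTC"
genome = exon1 + intron + exon2

# A read from the mature mRNA that spans the exon-exon junction.
read = exon1[-5:] + exon2[:5]

contiguous_hit = genome.find(read)            # -1: no contiguous match in the genome
split_hit = find_split_alignment(read, genome)  # the read fits once it is split at the junction
print(contiguous_hit, split_hit)
```

The gap between the two pieces corresponds to the intron, which is what a splicing-aware mapper has to allow for.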
What are the RPKM/FPKM and DESeq2/VST techniques?
Normalization techniques for bulk RNA-seq
RPKM/FPKM (Reads/Fragments Per Kilobase of transcript per Million mapped reads):
- Normalizes for gene length and sequencing depth
- RPKM (single-end reads), FPKM (paired-end reads)
TPM (Transcripts per million):
- Normalizes for gene length first, then sequencing depth
- Makes expression levels comparable across genes and samples
DESeq2/VST (Variance Stabilizing Transformation):
- DESeq2 normalizes count data and performs differential gene expression analysis using a negative binomial model
- VST is a transformation within DESeq2 that stabilizes the variance across genes, making the data more suitable for visualization and clustering
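The RPKM and TPM definitions above can be sketched in a few lines of plain Python. The function names and the toy counts and gene lengths are assumptions for illustration only; DESeq2/VST is not reproduced here.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    per_million = sum(counts) / 1e6
    return [c / (length / 1e3) / per_million for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to one million."""
    rates = [c / (length / 1e3) for c, length in zip(counts, lengths_bp)]
    scale = sum(rates) / 1e6
    return [r / scale for r in rates]


# Toy sample: three genes whose counts are proportional to their lengths.
counts = [100, 200, 700]
lengths_bp = [1000, 2000, 7000]

rpkm_values = rpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
print(rpkm_values)
print(tpm_values, sum(tpm_values))
```

Because TPM divides by the sum of length-normalized rates, the values of a sample always add up to one million, which is what makes them easier to compare across samples.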
What are the key metrics for QC (in bulk DNA analysis)?
Read Quality: A measure of the accuracy and reliability of sequencing reads, often represented as a Phred score indicating the probability of an error in each base call.
Adapter Content: The presence of adapter sequences (used in library preparation) within the sequencing reads, which can interfere with downstream analysis if not removed.
Sequence Length Distribution: A summary of the lengths of the sequencing reads, used to check for consistency and identify potential trimming or sequencing issues.
GC Content: The proportion of guanine (G) and cytosine (C) bases in the sequences, often analyzed for biases that may affect sequencing coverage or downstream analysis.
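Two of these metrics are simple enough to compute by hand. The sketch below assumes Phred+33 quality encoding (the usual FASTQ convention) and uses made-up function names; QC tools report much more than this.

```python
def phred_to_error_prob(q):
    """Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_phred33(quality_string):
    """Convert a FASTQ quality string to Phred scores (assumes Phred+33 encoding)."""
    return [ord(ch) - 33 for ch in quality_string]


def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


scores = decode_phred33("I!")        # 'I' encodes Q40, '!' encodes Q0
p_error_q30 = phred_to_error_prob(30)  # Q30 is roughly 1 error in 1000 base calls
gc = gc_content("GGCCAATT")
print(scores, p_error_q30, gc)
```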
Behavioral module
What is the purpose and general idea of a Linear Mixed-Effects Model (LME)?
Purpose: Account for fixed and random effects
* Fixed effects: consistent and systematic across all observations (e.g. treatment or condition)
* Random effects: batch effects, individual variability
* An LME makes it possible to control for sources of variation modeled as random effects while estimating the impact of the variables of interest (fixed effects)
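In its common textbook form, the model can be written as below. The symbols are not from the card and are assumptions of this sketch: $y$ is the vector of observations, $X$ and $\beta$ are the design matrix and coefficients of the fixed effects, $Z$ and $u$ are the design matrix and coefficients of the random effects, and $\varepsilon$ is the residual error.

```latex
y = X\beta + Zu + \varepsilon,
\qquad u \sim \mathcal{N}(0, G),
\qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I)
```

The fixed-effect coefficients $\beta$ are estimated directly, whereas the random effects $u$ are treated as draws from a distribution whose covariance $G$ is estimated.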
What is the issue with testing many genes and how can this be mitigated?
- With thousands of genes, a massive number of statistical tests is performed
- Some genes will be detected as differential purely by chance
- Correction methods mitigate the risk of false positives, but increase the likelihood of false negatives (missing truly differentially expressed genes)
Multiple testing correction
* Differential expression: many tests are performed
* Need to take this into account, e.g. using Benjamini-Hochberg (BH) multiple testing correction
* BH adjusts each p-value based on the number of tests and the rank of the p-value
* It controls the False Discovery Rate (FDR): among all genes called significantly differentially expressed, the expected proportion that in reality comes from the null model (i.e. is not differentially expressed)
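A minimal sketch of the BH adjustment in plain Python, with a made-up function name and toy p-values. Each p-value is multiplied by (number of tests / rank), and a running minimum from the largest p-value downwards keeps the adjusted values monotone.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices from smallest to largest p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value (rank m) down to the smallest (rank 1).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(pvals)
significant = [p <= 0.05 for p in adjusted]  # genes called at an FDR of 5%
print(adjusted, significant)
```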
Applications of PCA in RNA-seq
- Visualizing relationships between samples
- Detecting outliers (problematic samples)
- Identifying patterns (e.g. influence of treatment)
What are average linkage and complete linkage used for, and what is the difference between them?
Both are methods used in hierarchical clustering to determine how clusters are formed by measuring the distance between groups of data points.
Complete linkage uses maximal intercluster dissimilarity.
The largest of the pairwise dissimilarities is used.
Average linkage uses mean intercluster dissimilarity.
The average of the pairwise dissimilarities is used.
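The difference is easiest to see on two tiny clusters. The sketch below assumes Euclidean distance as the dissimilarity; the function names and points are illustrative only.

```python
import math
from itertools import product


def pairwise_distances(cluster_a, cluster_b):
    """All Euclidean distances between a point in cluster_a and a point in cluster_b."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]


def complete_linkage(cluster_a, cluster_b):
    """Largest pairwise dissimilarity between the two clusters."""
    return max(pairwise_distances(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    """Mean pairwise dissimilarity between the two clusters."""
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)


cluster_a = [(0.0, 0.0), (0.0, 3.0)]
cluster_b = [(4.0, 0.0)]

d_complete = complete_linkage(cluster_a, cluster_b)  # uses the farthest pair
d_average = average_linkage(cluster_a, cluster_b)    # averages over both pairs
print(d_complete, d_average)
```

In hierarchical clustering, the pair of clusters with the smallest such distance is merged at each step, so the choice of linkage changes which clusters join first.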
What is Enrichment Analysis for and what are the steps?
Statistical techniques used to identify whether specific biological categories (e.g., pathways, gene sets, or functional annotations) are overrepresented or “enriched” in a given list of genes, compared to what would be expected by chance.
Steps:
1. Input Gene List: A set of genes of interest (e.g., differentially expressed genes, genes from a specific cluster, or genes with mutations).
2. Reference Background: A larger set of genes representing the entire genome, transcriptome, or experimental dataset.
3. Gene Annotations: Categories or functional terms, often from curated databases such as:
* Gene Ontology (GO) terms (e.g., biological processes, cellular components, molecular functions).
* Pathway databases
* Disease databases
4. Statistical Testing: Compares the overlap between the input gene list and annotated gene sets to assess overrepresentation. Methods include:
* Fisher's Exact Test or Hypergeometric Test: Determines whether the overlap is statistically significant.
5. Multiple Testing Correction
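The statistical testing step can be sketched with the hypergeometric distribution using only the standard library. The function name, argument names, and the toy numbers are assumptions for illustration; this upper-tail probability is the same quantity a one-sided Fisher's exact test reports.

```python
from math import comb


def enrichment_pvalue(n_background, n_annotated, n_list, n_overlap):
    """P(overlap >= n_overlap) when n_list genes are drawn at random from the background.

    n_background: genes in the reference background
    n_annotated:  background genes that carry the annotation (e.g. a GO term)
    n_list:       genes in the input list
    n_overlap:    input-list genes that carry the annotation
    """
    upper = min(n_annotated, n_list)
    total = comb(n_background, n_list)
    tail = sum(
        comb(n_annotated, i) * comb(n_background - n_annotated, n_list - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy example: 20 background genes, 5 annotated, 5 genes of interest, 3 of them annotated.
p_value = enrichment_pvalue(20, 5, 5, 3)
print(p_value)
```

In practice this test is repeated for every annotation category, which is why the multiple testing correction in the last step is needed.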
What procedure is commonly used to reduce the FDR?
Benjamini-Hochberg (BH)
What are the benefits of single-cell approaches compared to bulk RNA-seq?
- Bulk RNA-seq analyzes average gene expression: masks cell-to-cell variability
- Single-cell approaches: capture heterogeneity
- Profile gene expression at single-cell level
- Give insights into (rare) cell types and cell states
- Can capture dynamic processes
Applications and workflow of single-cell RNA sequencing (scRNA-seq) preprocessing?
- Identifying rare cell types
- In bulk RNA-seq these would not be picked up
- Understanding differentiation
- Define “cell trajectories”
- Disease progression
Workflow scRNA-seq
1. Cell dissociation and isolation (e.g., FACS, microfluidics)
2. Cell barcoding and amplification
* Amplification using PCR
* Barcoding is needed to distinguish the individual cells during data analysis: a short nucleotide sequence is added to the mRNA
* All the molecules from a single cell will have the same barcode
3. After barcoding and amplification, the material from all cells is pooled into one sequencing library
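Why the barcode matters after pooling can be shown with a toy sketch. It assumes, purely for illustration, that each read starts with a 4-nt cell barcode followed by the cDNA sequence; real protocols use longer barcodes, often on a separate read, and the names below are hypothetical.

```python
from collections import defaultdict

BARCODE_LEN = 4  # hypothetical barcode length, chosen only to keep the example short


def group_reads_by_barcode(reads, barcode_len=BARCODE_LEN):
    """Assign each pooled read back to its cell of origin using the barcode prefix."""
    cells = defaultdict(list)
    for read in reads:
        barcode, cdna = read[:barcode_len], read[barcode_len:]
        cells[barcode].append(cdna)
    return dict(cells)


# Pooled library: reads from two cells are mixed together.
pooled_reads = ["AAAAGGCTA", "CCCCTTGAC", "AAAATTGAC"]
reads_by_cell = group_reads_by_barcode(pooled_reads)
print(reads_by_cell)
```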
Challenges in scRNA-seq
- Single cell vs. single nucleus:
* Some cells are hard to capture intact during dissociation.
* Nuclei are more resistant to mechanical force, which makes it easier to isolate nuclei than whole cells in some cases.
* Nuclei reflect transcriptional patterns: transcription in the nucleus can approximate gene expression but may lack full context.
- Dropouts:
* A phenomenon where a gene is expressed in one cell but not detected in another cell of the same type, due to low expression levels or technical issues.
* Can complicate interpretation.
- Batch Effects:
* Variations caused by technical differences between experiments (e.g., processing on different days or in different labs).
* These differences may overshadow true biological variation, requiring normalization to remove non-biological effects.
What is spatial transcriptomics and what are its applications?
- Maps gene expression to tissue locations, preserving spatial context (techniques: Slide-seq, Visium, MERFISH, Stereo-seq)
Applications:
* Reveals spatial organization of tissues, e.g. by mapping gene expression to brain anatomy
* Understanding cell-type diversity
* Interactions and cell-cell communication
What does single-cell ATAC-seq do and what insights can be gained from it?
Single-Cell ATAC-seq: Profiling Chromatin Accessibility
Purpose: Profiles chromatin accessibility at single-cell resolution to identify active regulatory regions (e.g., enhancers and promoters).
Insight: Reveals which regions of the genome are open and potentially regulating gene expression in specific cell types.
What does Single-Cell DNA Methylation Sequencing do and what insights can be gained from it?
- Purpose: Profiles DNA methylation (an epigenetic modification) at single-cell resolution, using bisulfite sequencing.
- Insight: Studies cell-to-cell variation in methylation, helping understand stable epigenetic regulation and its role in cell identity and development.
Explain Multi-Omics at Single-Cell Resolution
- Integrates multiple omics data from the same single cell
- scNMT-seq (combines methylation, chromatin accessibility, transcriptomics)
- Comprehensive understanding of cellular states and functions
What are the QC metrics in single-cell RNA-seq?
Importance of QC: Crucial due to variability in cell quality
Metrics:
* Total reads per cell
* Number of detected genes
* Mitochondrial gene content
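The three metrics can be computed per cell from a vector of gene counts. The sketch assumes mitochondrial genes are recognizable by an `MT-` name prefix (the human gene symbol convention); the function name and the toy cell are illustrative.

```python
def cell_qc_metrics(counts, gene_names, mito_prefix="MT-"):
    """Total counts, number of detected genes, and mitochondrial fraction for one cell."""
    total = sum(counts)
    detected = sum(1 for c in counts if c > 0)
    # Assumption: mitochondrial genes are identified by their name prefix.
    mito = sum(
        c for c, gene in zip(counts, gene_names)
        if gene.upper().startswith(mito_prefix)
    )
    return {
        "total_counts": total,
        "n_genes": detected,
        "mito_fraction": mito / total if total else 0.0,
    }


gene_names = ["ACTB", "GAPDH", "MT-CO1", "MT-ND1"]
cell_counts = [10, 0, 6, 4]
metrics = cell_qc_metrics(cell_counts, gene_names)
print(metrics)
```

A high mitochondrial fraction combined with few detected genes is a common sign of a damaged or dying cell.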
What are doublets?
(Problem in single-cell RNA-seq)
* Reads originating from two cells are assigned to a single cell (the two cells were captured together and share one barcode)
* Doublets can skew results
* Can be computationally removed
What are the unique challenges of single-cell RNA-seq in the alignment and quantification process? How are these addressed?
- Low Read Depth:
* Each cell is sequenced at a shallower depth due to the high number of cells, resulting in fewer reads per cell.
- Dropout Events:
* Genes with low expression might not be detected in some cells, leading to zero counts for genes that are actually expressed.
- High Technical Noise:
* Variability caused by the amplification and sequencing process, rather than biological differences.
How these are addressed:
- Filtering: low-quality cells and genes are removed (e.g., cells with low gene counts, genes not expressed in enough cells)
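A minimal sketch of such a filtering step on a small cells-by-genes count matrix. The thresholds, the function name, and the order (genes first, then cells) are assumptions made for the example, not recommended settings.

```python
def filter_cells_and_genes(matrix, min_genes_per_cell=2, min_cells_per_gene=2):
    """Drop rarely detected genes, then drop cells with too few detected genes.

    matrix: list of cells, each a list of per-gene counts (same gene order in every cell).
    Returns the filtered matrix plus the indices of the kept cells and genes.
    """
    n_genes = len(matrix[0])
    keep_genes = [
        j for j in range(n_genes)
        if sum(1 for cell in matrix if cell[j] > 0) >= min_cells_per_gene
    ]
    keep_cells = [
        i for i, cell in enumerate(matrix)
        if sum(1 for j in keep_genes if cell[j] > 0) >= min_genes_per_cell
    ]
    filtered = [[matrix[i][j] for j in keep_genes] for i in keep_cells]
    return filtered, keep_cells, keep_genes


counts = [
    [5, 0, 3, 0],  # cell 0
    [2, 1, 0, 0],  # cell 1
    [0, 0, 1, 0],  # cell 2
    [4, 0, 2, 7],  # cell 3
]
filtered, kept_cells, kept_genes = filter_cells_and_genes(counts)
print(filtered, kept_cells, kept_genes)
```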
What are the steps in the single cell RNA-seq pipeline?
- QC, Alignment, Quantification, normalization
- Cell clustering
- Cell annotation
- differential expression
- trajectory inference
- multi-modal integration
Normalization and Scaling in scRNA-seq (challenges and solutions)
Challenges:
* Zero Inflation: Excessive zero counts due to dropout events or technical issues.
* Variable Sequencing Depth: Uneven read counts between cells.
Solutions:
* Imputation: Fills in missing values using statistical models (e.g., negative binomial).
* Log-Normalization: Scales counts for sequencing depth and applies log transformation to stabilize variance.
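Log-normalization for a single cell can be sketched as follows. The scale factor of 10,000 and the use of log(1 + x) are common choices but are assumptions here, as is the function name.

```python
import math


def log_normalize(cell_counts, scale_factor=10_000):
    """Divide by the cell's total counts, multiply by a scale factor, then take log(1 + x)."""
    total = sum(cell_counts)
    if total == 0:
        return [0.0] * len(cell_counts)
    return [math.log1p(c / total * scale_factor) for c in cell_counts]


# Two cells with the same expression profile but a tenfold difference in depth.
shallow_cell = [1, 3, 0]
deep_cell = [10, 30, 0]

norm_shallow = log_normalize(shallow_cell)
norm_deep = log_normalize(deep_cell)
print(norm_shallow)
print(norm_deep)
```

After normalization the two cells have the same values, so the difference in sequencing depth no longer looks like a difference in expression. Zero counts stay zero, which is why zero inflation needs separate handling.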
How do we evaluate clusters in single-cell RNA-seq?
- Cluster Validation: Methods for evaluating cluster quality (e.g., silhouette scores, differential expression analysis).
- Biological Interpretation: Associating clusters with cell types or states.
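As an example of cluster validation, the mean silhouette score can be computed in plain Python. The sketch assumes Euclidean distance, at least two clusters, and toy 2-D points standing in for cells; the function name is illustrative.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all points; requires at least two clusters."""
    scores = []
    for i, p in enumerate(points):
        # Collect distances from point i to every other point, grouped by cluster label.
        dists_by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                dists_by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = dists_by_cluster.get(labels[i])
        if not own:
            scores.append(0.0)  # a point alone in its cluster gets a score of 0
            continue
        a = sum(own) / len(own)  # mean distance to the point's own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(d) / len(d)
            for label, d in dists_by_cluster.items()
            if label != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_score = silhouette_score(points, [0, 0, 1, 1])  # labels match the two obvious groups
bad_score = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the groups
print(good_score, bad_score)
```

Scores close to 1 indicate compact, well-separated clusters, while scores near or below 0 suggest that cells sit closer to another cluster than to their own.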
What is annotation?
Annotating clusters involves linking them to cell types using marker genes identified by differential expression analysis, either manually or with automated tools like Garnett.
Goal, methods and applications of pseudotime analysis
Goal: Arrange cells along a temporal trajectory based on their gene expression profiles, simulating a time order of cellular processes without actual time points.
Methods:
- Clustering-Based Approach:
* Group cells into clusters.
* Connect clusters to form a trajectory, reflecting transitions between cell states.
- Probabilistic Frameworks:
* Calculate transition probabilities between cells or clusters.
* Build trajectories by modeling the most likely paths cells follow.
Applications:
- Study cell differentiation (e.g., stem cells becoming specialized).
- Analyze developmental processes (e.g., organ formation).
- Explore cell responses to stimuli (e.g., immune activation).
Summary: Pseudotime analysis reconstructs cellular transitions, revealing dynamic processes like differentiation or development from static single-cell data.
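The clustering-based approach can be sketched in a strongly simplified form: connect cluster centroids with a minimum spanning tree and use the distance along the tree from a chosen root cluster as pseudotime. The centroids, the choice of root, and the function names are assumptions for illustration; real tools also order the individual cells within and between clusters.

```python
import math


def minimum_spanning_tree(centroids):
    """Prim's algorithm on Euclidean distances; returns {node: [(neighbor, distance), ...]}."""
    n = len(centroids)
    in_tree = {0}
    adjacency = {i: [] for i in range(n)}
    while len(in_tree) < n:
        # Pick the shortest edge that connects the tree to a node outside it.
        d, u, v = min(
            (math.dist(centroids[u], centroids[v]), u, v)
            for u in in_tree
            for v in range(n)
            if v not in in_tree
        )
        adjacency[u].append((v, d))
        adjacency[v].append((u, d))
        in_tree.add(v)
    return adjacency


def cluster_pseudotime(centroids, root):
    """Pseudotime of each cluster = path length from the root cluster along the tree."""
    adjacency = minimum_spanning_tree(centroids)
    pseudotime = {root: 0.0}
    stack = [root]
    while stack:
        u = stack.pop()
        for v, d in adjacency[u]:
            if v not in pseudotime:
                pseudotime[v] = pseudotime[u] + d
                stack.append(v)
    return pseudotime


# Hypothetical cluster centroids in a 2-D embedding; cluster 0 is taken as the starting state.
centroids = [(0.0, 0.0), (5.0, 0.0), (2.0, 0.0), (5.0, 3.0)]
pseudotime = cluster_pseudotime(centroids, root=0)
ordering = sorted(pseudotime, key=pseudotime.get)
print(pseudotime, ordering)
```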