Gene module Flashcards by Alma Hed

Why should RNA-seq reads be mapped with splicing-aware read mappers?

RNA-Seq reads need splicing-aware mappers because RNA comes from spliced transcripts where introns are removed, and some reads span exon-exon junctions. Regular mappers can’t handle these split reads, but splicing-aware tools (e.g., STAR, HISAT2) can align them correctly, ensuring accurate gene expression analysis and detection of splicing events.

What are the RPKM/FPKM and DESeq2/VST techniques?

Normalization techniques for bulk RNA-seq
RPKM/FPKM (Reads/Fragments Per Kilobase of transcript per
Million mapped reads):
- Normalizes for gene length and sequencing depth
- RPKM (single-end reads), FPKM (paired-end reads)
TPM (Transcripts per million):
- Normalizes for gene length first, then sequencing depth
- Makes expression levels comparable across genes and samples

DESeq2/VST (variance stabilizing transformation):
- DESeq2 normalizes count data and performs differential gene expression analysis using a negative binomial model
- VST is a transformation within DESeq2 that stabilizes the variance across the range of mean expression, making the data more suitable for visualization and clustering
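
The length and depth normalizations above can be illustrated with a minimal sketch. The function names `rpkm` and `tpm` and the toy counts are illustrative assumptions, not part of any specific package; real pipelines typically rely on dedicated quantification tools.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads (one sample)."""
    per_million = sum(counts) / 1_000_000
    return [c / (length / 1_000) / per_million
            for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rates = [c / (length / 1_000) for c, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [r / total_rate * 1_000_000 for r in rates]


# Toy sample (made-up numbers): three genes whose counts scale with length.
counts = [100, 200, 700]
lengths = [1_000, 2_000, 7_000]
print(rpkm(counts, lengths))
print(tpm(counts, lengths))
```

Because TPM rescales after the length normalization, the TPM values of every sample sum to one million, which is what makes them easier to compare across samples than RPKM/FPKM.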

What are the key metrics for QC (in bulk DNA analysis)?

Read Quality: A measure of the accuracy and reliability of sequencing reads, often represented as a Phred score indicating the probability of an error in each base call.

Adapter Content: The presence of adapter sequences (used in library preparation) within the sequencing reads, which can interfere with downstream analysis if not removed.

Sequence Length Distribution: A summary of the lengths of the sequencing reads, used to check for consistency and identify potential trimming or sequencing issues.

GC Content: The proportion of guanine (G) and cytosine (C) bases in the sequences, often analyzed for biases that may affect sequencing coverage or downstream analysis.
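
As a small illustration of two of these metrics, the sketch below converts Phred scores to error probabilities (Q = -10 log10 P) and computes GC content. The helper names are hypothetical, and the quality decoding assumes the common Phred+33 ASCII encoding of FASTQ files.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p (Q = -10 * log10(p))."""
    return -10 * math.log10(p)


def decode_quality(quality_string, offset=33):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - offset for ch in quality_string]


def gc_content(sequence):
    """Fraction of G and C bases in a read."""
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


print(phred_to_error_prob(30))
print(decode_quality("I5!"))
print(gc_content("ATGCGC"))
```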

What is the purpose and general idea of a Linear Mixed-Effects Model (LME)?

Purpose: Account for fixed and random effects
* Fixed effects: consistent and systematic across all observations (e.g.
treatment or condition)
* Random effects: batch effects, individual variability
* LME makes it possible to control for confounding sources of variation (random effects) while estimating the impact of the variables of interest (fixed effects)

What is the issue with testing many genes and how can this be mitigated?

With thousands of genes, massive number of statistical tests performed
Some will be detected as differential purely by chance
Correction methods mitigate the risk of false positives, but increase the
likelihood of false negatives (missing truly differentially expressed genes)

Multiple test correction
* Differential expression: many tests are performed
* Need to take this into account, e.g. using Benjamini–Hochberg
(BH) multiple testing correction
* BH adjusts each p-value based on its rank and the total number of tests
* It controls the False Discovery Rate (FDR): among all genes called
significantly differentially expressed, the expected proportion that in reality
comes from the null model (i.e. is not differentially expressed)
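
The BH adjustment can be sketched in a few lines. This is a minimal illustration with a hypothetical function name and made-up p-values; in practice one would normally use an established implementation such as `p.adjust(method = "BH")` in R.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for four genes.
pvals = [0.01, 0.04, 0.03, 0.005]
padj = benjamini_hochberg(pvals)
significant = [p <= 0.05 for p in padj]
print(padj, significant)
```

Calling genes significant when the adjusted p-value is at most 0.05 then aims for an expected FDR of at most 5% among the called genes.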

Applications of PCA in RNA-seq

Visualizing relationships between samples
Detecting outliers (problematic samples)
Identifying patterns (e.g. influence of treatment)

What are average linkage and complete linkage methods for and what is the difference between them?

Both are methods used in hierarchical clustering to determine how clusters are formed, by measuring the distance between groups of data points.

Complete linkage uses maximal intercluster dissimilarity.
The largest of the pairwise dissimilarities is used.

Average linkage uses mean intercluster dissimilarity.
The average of the pairwise dissimilarities is used
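
The difference between the two linkage criteria can be shown with a short sketch. The function names and the toy clusters are illustrative assumptions, and Euclidean distance is assumed as the dissimilarity.

```python
import math
from itertools import product


def pairwise_distances(cluster_a, cluster_b):
    """All Euclidean distances between points of two clusters."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]


def complete_linkage(cluster_a, cluster_b):
    """Largest pairwise dissimilarity between the clusters."""
    return max(pairwise_distances(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    """Mean pairwise dissimilarity between the clusters."""
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)


# Toy one-dimensional clusters (made-up values).
a = [(0.0,), (1.0,)]
b = [(3.0,), (5.0,)]
print(complete_linkage(a, b), average_linkage(a, b))
```

Complete linkage is driven by the most distant pair, so it tends to favor compact clusters, while average linkage is less sensitive to single outlying points.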

What is Enrichment Analysis for and what are the steps?

Enrichment analysis comprises statistical techniques used to identify whether specific biological categories (e.g., pathways, gene sets, or functional annotations) are overrepresented or “enriched” in a given list of genes, compared to what would be expected by chance.

Steps:
1. Input gene list: A set of genes of interest (e.g., differentially expressed genes, genes from a specific cluster, or genes with mutations).
2. Reference background: A larger set of genes representing the entire genome, transcriptome, or experimental dataset.
3. Gene annotations: Categories or functional terms, often from curated databases such as:
* Gene Ontology (GO) terms (e.g., biological processes, cellular components, molecular functions)
* Pathway databases
* Disease databases
4. Statistical testing: Compares the overlap between the input gene list and annotated gene sets to assess overrepresentation. Methods include:
* Fisher's exact test or hypergeometric test: determines whether the overlap is statistically significant
5. Multiple testing correction
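
The statistical testing step can be sketched with a one-sided hypergeometric test. The function name and the toy numbers are assumptions for illustration; this computes the probability of seeing at least the observed overlap by chance.

```python
from math import comb


def enrichment_pvalue(n_background, n_annotated, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random
    from n_background genes, of which n_annotated belong to the gene set."""
    max_overlap = min(n_annotated, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Made-up example: 20,000 background genes, 200 in a pathway,
# 500 genes of interest, 15 of which are in the pathway (5 expected by chance).
p = enrichment_pvalue(20_000, 200, 500, 15)
print(p)
```

One such p-value is computed per gene set, which is why the multiple testing correction in step 5 is needed afterwards.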

What procedure is commonly used to control the FDR?

Benjamini-Hochberg (BH)

What are the benefits with single cell-approaches compared to Bulk RNA-seq?

bulk RNA-seq analyzes average gene expression: masks cell-to-cell variability
Single-Cell Approaches: capture heterogeneity
profiles gene expression at single-cell level
insights into (rare) cell types, cell states
dynamic processes

Applications and workflow of single-cell RNA sequencing (scRNA-seq) preprocessing?

Identifying rare cell types
In bulk RNA-seq these would not be picked up
Understanding differentiation
Define “cell trajectories”
Disease progression

Workflow scRNA-seq:
1. Cell dissociation and isolation (e.g., FACS, microfluidics)
2. Cell barcoding and amplification
* Amplification using PCR
* Barcoding is needed to distinguish the individual cells during data analysis: a short nucleotide sequence is added to the mRNA
* All the molecules from a single cell will have the same barcode
* After barcoding and amplification, the barcoded molecules from all cells are pooled into one sequencing library

Challenges in scRNA-seq

single cell vs single Nucleus:
* Some cells are harder to capture during dissociation:
* Nuclei are more resistant to force: This makes it easier to isolate nuclei than whole cells in some cases.
* Nuclei reflect transcriptional patterns: Transcription in the nucleus can approximate gene expression but may lack full context.
Dropouts:
* A phenomenon where a gene is expressed in one cell but not detected in another cell of the same type, due to low expression levels or technical issues.
* Can complicate interpretation.
Batch Effects:
* Variations caused by technical differences between experiments (e.g., processing on different days or labs).
* These differences may overshadow true biological variation, requiring normalization to remove non-biological effects.

What is spatial transcriptomics and what are its applications?

Maps gene expression to tissue locations, preserving spatial context
(Techniques: Slide-seq, Visium, MERFISH, stereo-seq)

Applications:
* Reveals spatial organization of tissues: map gene expression to brain anatomy
* Understanding cell-type diversity
* Interactions and cell-cell communication

What does single-cell ATAC-seq do and what insights can be gained from it?

Single-Cell ATAC-seq: Profiling Chromatin Accessibility

Purpose: Profiles chromatin accessibility at single-cell resolution to identify active regulatory regions (e.g., enhancers and promoters).
Insight: Reveals which regions of the genome are open and potentially regulating gene expression in specific cell types.

What does Single-Cell DNA Methylation Sequencing do and what insights can be gained from it?

Purpose: Profiles DNA methylation (an epigenetic modification) at single-cell resolution, using bisulfite sequencing.

Insight: Studies cell-to-cell variation in methylation, helping understand stable epigenetic regulation and its role in cell identity and development.

Explain Multi-Omics at Single-Cell Resolution

Integrates multiple omics data from the same single cell
scNMT-seq (combines methylation, chromatin accessibility,
transcriptomics)
Comprehensive understanding of cellular states and functions

What are the QC metrics in single cell rna-seq?

Importance of QC: Crucial due to variability in cell quality

Metrics:
* Total reads per cell
* number of detected genes
* mitochondrial gene content
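
A minimal sketch of how these three metrics could be computed for one cell. The function name is hypothetical, and identifying mitochondrial genes by the `MT-` prefix is an assumption based on the human gene symbol convention.

```python
def cell_qc_metrics(counts, gene_names, mito_prefix="MT-"):
    """Per-cell QC summary from a vector of gene counts."""
    total = sum(counts)
    detected = sum(1 for c in counts if c > 0)
    mito = sum(c for c, gene in zip(counts, gene_names)
               if gene.startswith(mito_prefix))
    pct_mito = 100 * mito / total if total else 0.0
    return {"total_counts": total, "n_genes": detected, "pct_mito": pct_mito}


# Made-up counts for a single cell.
genes = ["ACTB", "MT-CO1", "GAPDH", "MT-ND1"]
counts = [50, 30, 0, 20]
print(cell_qc_metrics(counts, genes))
```

Cells with very few counts or detected genes, or with a high mitochondrial percentage, are commonly flagged as low quality; the exact thresholds depend on the dataset.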

What are doublets?

(Problem in single-cell RNA-seq)
* Reads originating from two cells are assigned to a single cell
* Doublets can skew results
* Can be computationally removed

What are the unique challenges of single-cell RNA-seq in the alignment and quantification process? How are these addressed?

Low Read Depth:
- Each cell is sequenced at a shallower depth due to the high number of cells, resulting in fewer reads per cell.
Dropout Events:
- Genes with low expression might not be detected in some cells, leading to zero counts for genes that are actually expressed.
High Technical Noise:
- Variability caused by the amplification and sequencing process, rather than biological differences.

Filtering: low-quality cells and genes are removed (e.g., cells with low gene counts, genes not expressed in enough cells)

What are the steps in the single cell RNA-seq pipeline?

QC, Alignment, Quantification, normalization
Cell clustering
Cell annotation
differential expression
trajectory inference
multi-modal integration

Normalization and Scaling in scRNA-seq (challenges and solutions)

Challenges:
* Zero Inflation: Excessive zero counts due to dropout events or technical issues.
* Variable Sequencing Depth: Uneven read counts between cells.

Solutions:
* Imputation: Fills in missing values using statistical models (e.g., negative binomial).
* Log-Normalization: Scales counts for sequencing depth and applies log transformation to stabilize variance.
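
Log-normalization can be sketched as follows. The function name is hypothetical, and the scale factor of 10,000 is an assumed, commonly used default rather than a fixed rule.

```python
import math


def log_normalize(cell_counts, scale_factor=10_000):
    """Scale one cell's counts to a common depth, then apply log(1 + x)."""
    total = sum(cell_counts)
    if total == 0:
        return [0.0] * len(cell_counts)
    return [math.log1p(c / total * scale_factor) for c in cell_counts]


# Two made-up cells with the same composition but 10x different depth.
shallow = [1, 1, 2]
deep = [10, 10, 20]
print(log_normalize(shallow))
print(log_normalize(deep))
```

After normalization the two cells have the same values, so differences in sequencing depth no longer dominate the comparison. Zero counts stay at zero, so this step does not address dropouts.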

How do we evaluate clusters in single-cell RNA-seq?

Cluster Validation: Methods for evaluating cluster quality (e.g., silhouette scores, differential
expression analysis).
Biological Interpretation:
Associating clusters with cell types or states
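
As an illustration of cluster validation, the sketch below computes silhouette scores by hand. The function name is hypothetical; it assumes Euclidean distance and at least two clusters, and real analyses would typically use a library implementation on a reduced-dimension embedding.

```python
import math


def silhouette_scores(points, labels):
    """Silhouette score per point: (b - a) / max(a, b), where a is the mean
    distance to the point's own cluster and b is the mean distance to the
    nearest other cluster. Assumes at least two clusters."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        by_cluster = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if j != i:
                by_cluster.setdefault(other, []).append(math.dist(p, q))
        own = by_cluster.pop(label, [])
        if not own:  # singleton cluster: score defined as 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_cluster.values())
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Made-up one-dimensional data with two well-separated groups.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
print(silhouette_scores(points, [0, 0, 1, 1]))
print(silhouette_scores(points, [0, 1, 0, 1]))
```

Scores near 1 indicate that a cell sits well inside its cluster, while negative scores (as with the second, mismatched labeling) suggest it is closer to another cluster.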

What is annotation?

Annotating clusters involves linking them to cell types using marker genes, either manually or with automated tools like Garnett, based on differential expression analysis

goal, methods and applications of pseudotime analysis

Goal: Arrange cells along a temporal trajectory based on their gene expression profiles, simulating a time order of cellular processes without actual time points.
Methods:
Clustering-Based Approach:
Group cells into clusters.
Connect clusters to form a trajectory, reflecting transitions between cell states.

Probabilistic Frameworks:
Calculate transition probabilities between cells or clusters.
Build trajectories by modeling the most likely paths cells follow.

Applications:
Study cell differentiation (e.g., stem cells becoming specialized).
Analyze developmental processes (e.g., organ formation).
Explore cell responses to stimuli (e.g., immune activation).

Summary: Pseudotime analysis reconstructs cellular transitions, revealing dynamic processes like differentiation or development from static single-cell data.

How does advanced trajectory inference differ from pseudotime analysis?

Pseudotime Analysis: Simpler, primarily linear or unbranched paths.
Advanced Trajectory Inference:
* Extends pseudotime to include branching and more complex biological processes.
* Ideal for studies of cell fate decisions and differentiation.

Approaches to dimensionality reduction

* Density/distribution-based approaches: t-SNE and UMAP
* Autoencoders
* Approaches specifically developed for single-cell data (e.g. trajectory inference)

t-SNE vs UMAP in terms of focus

t-SNE: Focuses heavily on local clusters; great for visualizing small datasets but lacks meaningful global structure.
UMAP: Balances local and global relationships; faster and better suited for larger datasets while retaining interpretable structures.

Explain the two most important steps of t-SNE and UMAP, and where they differ

Step 1: Construction of the high-dimensional probability distribution
t-SNE:
* Computes pairwise similarities using a Gaussian distribution.
* The bandwidth of the Gaussian is adjusted based on perplexity (a hyperparameter that controls the effective number of neighbors).
UMAP: uses a graph-based approach:
* Points are connected based on the overlap of radii.
* The radius is chosen locally, based on the distance to the nth nearest neighbor (key hyperparameter: number of neighbors).
* Ensures every point is connected to at least its closest neighbor for continuity.

Step 2: Mapping to lower dimensions
t-SNE:
* Minimizes Kullback-Leibler (KL) divergence to preserve local relationships, focusing on grouping nearby points.
UMAP:
* Similar, but minimizes cross-entropy loss, balancing local and global structure.

Regularized AE

Put a constraint in the loss function to prevent the model from becoming too complex (we don't want it to overfit or fit noise):
* Penalty on latent variables
* Denoising AE: the loss compares the original image with the model applied to a corrupted version of the image
* Contractive AE: penalty on the derivatives of the hidden-layer nodes w.r.t. the input; disregards small changes in the input

Tensor factorization (+ its relation to PCA and denoising autoencoders)

Tensor Factorization (for Imputation):
What It Is: A method that decomposes high-dimensional data (e.g., gene expression matrices) into simpler factors (like a sum of smaller matrices or tensors).
Purpose: Used to impute missing values (e.g., dropout events in single-cell data) by reconstructing the data from these factors.
Relation to PCA and Denoising Autoencoders:
* Like PCA: Breaks data into components, identifying dominant patterns.
* Like Denoising Autoencoders: Learns latent representations to reconstruct noisy or incomplete data, enhancing signal clarity.

PAGA

PAGA (Partition-Based Graph Abstraction):
What It Is: A method for trajectory inference in single-cell data that represents cells as nodes and their relationships as edges in a graph.
Key Idea: Simplifies trajectories by grouping similar cells into clusters (partitions) and modeling transitions between these clusters instead of individual cells.
Purpose: Captures both global and local cellular relationships, useful for branching or complex trajectories.

RNA velocity

Example answer: An approach to predict future cell states based on splicing kinetics.
ChatGPT answer: RNA velocity is a computational method used in single-cell RNA-seq analysis to infer the direction and speed of gene expression changes within individual cells. It provides insights into the dynamic processes of cellular state transitions, such as differentiation or response to stimuli.
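
The splicing-kinetics idea can be sketched with a simple model, assumed here for illustration: spliced mRNA is produced from unspliced mRNA at rate beta and degraded at rate gamma, so the velocity of a gene is ds/dt = beta * u - gamma * s. The function name and all parameter values are hypothetical; dedicated tools estimate these rates from the data.

```python
def rna_velocity(unspliced, spliced, beta=1.0, gamma=0.5):
    """Rate of change of spliced mRNA under the assumed model
    ds/dt = beta * unspliced - gamma * spliced."""
    return beta * unspliced - gamma * spliced


# Made-up abundances for one gene in two cells.
induced = rna_velocity(unspliced=4.0, spliced=2.0, gamma=1.0)
repressed = rna_velocity(unspliced=1.0, spliced=4.0, gamma=0.5)
print(induced, repressed)
```

A positive velocity (more unspliced mRNA than expected at steady state) suggests the gene is being upregulated, while a negative velocity suggests downregulation. Combining these values across genes gives the predicted direction of a cell's future state.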

Gene module Flashcards

(32 cards)