Viva prep all Flashcards

(135 cards)

1
Q

What is the equation for causal effects on a phenotype?

A

P = G x E, where P is phenotype, G is genetics, and E is environment.

2
Q

Why can heritability estimates from twin studies be inflated?

A

Due to the assumption that environment is shared equally across monozygotic and dizygotic twins, which may not hold true in all cases.

3
Q

What percentage of AD SNPs are found in non-coding regions?

A

98%

4
Q

What is an example of an extreme distal enhancer?

A

The limb enhancer that regulates expression of the sonic hedgehog (SHH) gene. This enhancer, the Zone of Polarizing Activity Regulatory Sequence (ZRS), lies approximately 800,000 to 1,000,000 base pairs away from SHH (in mouse and human), within a non-coding intron of a neighbouring gene, yet its elimination causes truncation of limbs in mice. Moreover, this regulation can be altered by single base-pair changes in humans.

5
Q

What percentage of the genome is protein coding?

A

1.5%

6
Q

Which protein-coding genes are implicated in the pathogenesis of AD?

A

APOE, APP, PSEN1 and PSEN2. APOE, produced predominantly by astrocytes and activated microglia in the brain and involved in transporting lipids between cells and organs, is the greatest genetic risk factor for AD.

7
Q

Why is RNA-Seq considered better than microarrays?

A

RNA-Seq offers a full view of the whole transcriptome, profiling RNA via sequencing of complementary DNA (cDNA), i.e. the whole repertoire of transcripts for the particular tissue or cells, including allele-specific expression and alternative splicing. Microarrays, in contrast, rely on pre-defined transcripts or genes.

8
Q

What is the methodological approach of scRNA-Seq?

A

Cells are isolated using methods such as microfluidics, microwells, droplet-based capture or in situ barcoding, co-encapsulating each cell with a uniquely DNA-barcoded bead. This means mRNA can be linked back to its cell of origin.

9
Q

What are the steps in processing scRNA-Seq data?

A

scFlow Nextflow pipeline steps -
1. EmptyDrops (distinguishes true nuclei from empty droplets and determines the ambient RNA profile, i.e. cell-free mRNA. Models the distribution of UMIs in empty droplets to establish a background distribution, then classifies droplets with significant deviation from the background as cell-containing.)
2. Nuclei filtering based on total read counts and total genes expressed (200 minimum for each) or > 4 median absolute deviations (MAD) for either
3. Mitochondrial reads: 10% or greater is indicative of cell death (steps 2-3 are sketched in code below)
4. DoubletFinder for doublet detection: generates random doublets from the input data, projects them into a lower-dimensional space with the real cells and uses a nearest-neighbour algorithm to calculate doublet scores; repeated iteratively
5. LIGER: calculate integrative factors across samples
6. UMAP: two-dimensional embeddings of the LIGER integrated factors were calculated using UMAP
7. Leiden community detection algorithm: detect clusters of cells from the 2D UMAP (LIGER) embeddings
8. Cell-typing of clusters using EWCE; the top five marker genes for each automatically annotated cell type were determined using Monocle 3 and validated against canonical cell-type markers
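
A minimal scanpy sketch of the nuclei filtering in steps 2-3 above (scanpy is assumed here as a generic stand-in for the scFlow internals; the input path is hypothetical):

```python
import scanpy as sc

adata = sc.read_10x_mtx("filtered_feature_bc_matrix/")  # hypothetical path

# Step 2: filter nuclei on total genes expressed and total counts (min 200 each)
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_cells(adata, min_counts=200)

# Step 3: drop nuclei with >= 10% mitochondrial reads (indicative of cell death)
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)
adata = adata[adata.obs["pct_counts_mt"] < 10].copy()
```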

10
Q

How does LIGER work?

A

Uses integrative non-negative matrix factorization to identify factors shared among datasets. NMF formula: V = W × H, giving a lower-dimensional representation. W is the feature matrix; H is the coefficient matrix (the weights associated with W).
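
A minimal scikit-learn sketch of plain NMF (illustrative only; LIGER's integrative NMF additionally learns dataset-specific factors):

```python
import numpy as np
from sklearn.decomposition import NMF

# V: non-negative expression matrix (cells x genes), random here for illustration
rng = np.random.default_rng(0)
V = rng.poisson(1.0, size=(100, 50)).astype(float)

model = NMF(n_components=10, init="nndsvda", random_state=0)
W = model.fit_transform(V)   # feature/usage matrix (cells x factors)
H = model.components_        # coefficient matrix (factors x genes)

print(np.linalg.norm(V - W @ H))  # reconstruction error of V ~ W x H
```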

11
Q

How does the Leiden community detection algorithm work?

A

A hierarchical clustering algorithm that recursively merges communities into single nodes by greedily optimizing the modularity, repeating the process on the condensed graph. It modifies the Louvain algorithm to address some of its shortcomings, namely that some of the communities found by Louvain are not well-connected. This is achieved by periodically randomly breaking down communities into smaller, well-connected ones.
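
A minimal sketch using the igraph/leidenalg packages (assuming a cell-cell k-nearest-neighbour graph has already been built; a toy graph stands in here):

```python
import igraph as ig
import leidenalg

# Toy graph standing in for a cell-cell kNN graph
g = ig.Graph.Famous("Zachary")

# Partition by greedily optimising modularity, Leiden-style
partition = leidenalg.find_partition(g, leidenalg.ModularityVertexPartition)
print(partition.membership)  # community label per node (cell)
```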

12
Q

What are pseudoreplication approaches?

A

Approaches such as ROTS, which treat individual cells as independent replicates.

13
Q

What are pseudobulk approaches?

A

Pseudobulk + edgeR LRT: aggregate a cell type's reads to an individual, often by sum or mean, to help avoid issues with cell dropout and low sequencing depth.
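
A minimal pandas sketch of the aggregation step (column names are hypothetical):

```python
import pandas as pd

# counts: one row per cell, with its sample/cell-type labels plus gene counts
counts = pd.DataFrame({
    "sample": ["s1", "s1", "s2", "s2"],
    "cell_type": ["microglia"] * 4,
    "GeneA": [3, 0, 5, 1],
    "GeneB": [0, 2, 1, 4],
})

# Sum each gene over all cells of a type within an individual -> pseudobulk
pseudobulk = counts.groupby(["sample", "cell_type"]).sum()
print(pseudobulk)  # one aggregated profile per sample and cell type
```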

14
Q

What are mixed model approaches? Give an example.

A

Generalised linear mixed models (GLMMs) account for both subject- and cell-level information, often using a random effect for samples to account for the group's subject-level heterogeneity.
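
A minimal statsmodels sketch (a linear mixed model standing in for the GLMM idea; the data frame and its columns are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "expression": rng.normal(size=200),              # one gene, one row per cell
    "diagnosis": np.repeat(["AD", "control"], 100),  # fixed effect
    "donor": np.tile([f"d{i}" for i in range(10)], 20),
})

# Fixed effect for diagnosis, random intercept per donor: the sample-level
# random effect that captures subject-level heterogeneity
model = smf.mixedlm("expression ~ diagnosis", df, groups=df["donor"])
result = model.fit()
print(result.summary())
```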

15
Q

What is the motivation for looking into scRNA-Seq in AD?

A

Many studies exist, but the DEGs identified vary dramatically because analysis approaches differ massively across them. A consensus on the approach is needed.

16
Q

What is the definition of epigenetics?

A

changes in gene function that are mitotically (somatic - non-reproductive cells) and/or meiotically (germline - reproductive cells) heritable and that do not entail a change in DNA sequence i.e. dynamic cellular regulation.

17
Q

Can epigenetic marks be inherited?

A

A recent study found certain marks can be inherited, perhaps due to partial escape from complete reprogramming, resulting in cross-generational changes in developmental and cellular features.

18
Q

What are some biological roles of epigenetics?

A

cell differentiation and specialisation (as shown in Waddington’s landscape), cell cycle entry and exit and immune cell activation.

19
Q

Can tumorigenesis occur without genetic alteration?

A

A recent study found tumorigenesis can arise even in the absence of any genetic mutations, following transient loss of epigenetic gene silencing (plausibly DNA methylation: when located in a gene promoter, DNA methylation typically acts to repress transcription).

20
Q

What are the types of epigenetics?

A

DNA methylation, DNA-binding proteins, non-coding RNAs and histone modifications

21
Q

What is the order of epigenetic modifications?

A

Pioneer transcription factors bind to compacted chromatin, inducing nucleosome structural changes, recruiting histone marks and removing methyl marks.

22
Q

What is the ChIP-Seq approach?

A

Chromatin immunoprecipitation followed by sequencing - cross-linking proteins to DNA, cleaving the chromatin, immunoprecipitating with protein-specific antibodies and finally, amplifying and sequencing the associated DNA fragments

23
Q

What is the ATAC-Seq approach?

A

assay for transposase-accessible chromatin using sequencing - Open chromatin regions are tagged with sequencing adaptors and cleaved by hyperactive Tn5 transposases. The tagged DNA fragments are next purified, amplified and sequenced

24
Q

What did Nott 2019 find?

A

Histone modifications associated with noncoding regulatory regions such as enhancers have also been assayed and found to harbour AD variants in cell type-specific analyses (AD & Microglia)

25
What did Skene 2016 find?
Disease associated gene lists (taken from literature) had cell type-specific enrichments (AD & Microglia).
26
Which transcription factors are dysregulated in AD?
Sp1, AP-2 and PPARγ. Sp1 has been shown to regulate _BACE1_, which processes APP (amyloid precursor protein, with a role in neural growth and maturation during brain development) and is essential for Aβ production. AP-2 and PPARγ regulate APOE (the greatest genetic risk factor for AD, with a role in transporting lipids between cells and organs, found predominantly in activated microglia and astrocytes).
27
What is the percentage difference in the genome between any two individuals?
Originally estimated at 0.4%; now 1%, and higher in genomes from African and other ancestral populations.
28
What are the two types of genetic variations?
variation between individuals (germline variants) or within individuals (somatic mutations)
29
Why are somatic mutations not studied in neurodegenerative diseases?
Due to the lack of cell division in neuronal populations.
30
What is neurogenesis?
The _de novo_ formation of neurons.
31
What is genetic drift?
A change in the frequency of an existing gene variant (allele) in a population due to random chance. Genetic drift may cause gene variants to disappear completely, thereby reducing genetic variation. It can also cause initially rare alleles to become much more frequent, even fixed. Genetic drift is a less powerful force than selection.
32
What is genetic selection?
the process by which certain traits become more prevalent in a species than other traits
33
What is genetic recombination?
Also known as genetic reshuffling: the exchange of genetic material between different organisms, which leads to the production of offspring with combinations of traits that differ from those found in either parent.
34
What did the Human Genome Project do?
Created the first, albeit incomplete, map of the genome, facilitating large-scale genetic studies like GWAS.
35
How does MAGMA work?
A gene-based analysis tool that converts SNP-level P-values identified from GWAS to gene-level P-values, assigning variants to their target genes.
36
Why are summary statistics valuable?
They do not require the transfer of individual-level, personally identifiable information from participants and can be integrated into meta-analyses.
37
What are examples of movements to standardize summary statistic file formats?
NHGRI-EBI GWAS Catalogue standardised format, the MRC IEU OpenGWAS infrastructure and catalogue, the SMR Tool binary format and the variant call format (VCF) to store GWAS summary statistics (GWAS-VCF).
38
What are copy number variations (CNVs)?
CNVs include insertions, deletions, and duplications of segments of DNA.
39
What is LD?
LD occurs when alleles are co-inherited based on their physical proximity, making it difficult to identify the causal SNPs from the tagged SNPs
40
How does finemapping SuSiE work?
The SUm of SIngle Effects (SuSiE) model is an extension of single effect regression (SER):
* SuSiE models the genetic signal as a **sparse linear regression** problem, where the phenotype is a function of a small number of causal variants plus noise. The sparsity assumption reflects the biological reality that only a few variants in a region are likely to be causal.
* SuSiE breaks the sparse regression problem into a sum of independent single effects. By decomposing the problem in this way, SuSiE can estimate the effects of multiple causal variants in the same region simultaneously, avoiding the issue of "masking" where one signal obscures another.
* SuSiE iteratively fits the model, identifying one single effect at a time. At each step, it estimates a "single effect" (a sparse vector) while accounting for the effects of previously identified signals. This is achieved using Bayesian methods and variational inference to approximate the posterior distribution of the effect sizes.
* For each signal, SuSiE computes a **posterior inclusion probability (PIP)** for every variant, which quantifies how likely that variant is to be causal given the data.
* SuSiE explicitly models LD between variants, ensuring that multiple correlated variants are appropriately handled. This avoids incorrectly attributing causal signals to multiple nearby variants.
* The process stops when the model has accounted for all detectable signals (up to L) or the remaining signal is indistinguishable from noise.
* Disadvantage: SuSiE assumes that genetic effects are additive, which may not capture complex interactions.
41
What was some of the first research into the cis-regulatory code?
In the 1960s, it was discovered that bacterial genes are regulated by nearby DNA sequences. In the 1980s, it was discovered that enhancers could function, at least to a certain extent, independently of their native genomic context, orientation or precise distance from the gene. It was later shown that enhancer activity depends on the arrangement of the specific bases and the distance to other motifs, i.e. motif syntax.
42
What is the idea of MPRAs (MAVEs)?
To test the activities of thousands of candidate genomic regulatory elements simultaneously via next-generation sequencing of barcoded reporter transcripts.
43
What is JASPAR?
database of manually curated, high-quality and non-redundant DNA-binding profiles for transcription factors across differing species (sequence motifs with a median length of 9 bases)
44
How do transcription factors contribute to the cis-regulatory code?
Transcription factors provide a means by which cells can control the activation of regulatory regions like enhancers and their associated genes through processes like extracellular signals or transcriptional regulation in a spatial and temporal fashion, for example during embryonic development.
45
Why has there been no breakthrough in understanding the cis-regulatory code?
Due to the sheer complexity of the regulation, where a multitude of regulatory motifs interact to, for example, control expression of a gene, and to how regulation occurs in a cell type- or cell state-specific manner.
46
What software standardizes the application of DNNs in genomics?
Kipoi, tangermeme, gReLU, and EUGENe.
47
What is an example of genomic DNN in synthetic biology?
Predicting cell viability under augmented gene expression modules.
48
What are genomic Language Models?
Models trained with a self-supervised step where the input is masked. Types of masking: causal language modelling (next-token prediction) and masked language modelling (masking anywhere in the sequence).
49
Why are sequence-to-function and gLMs' outputs not useful?
We already know the answer; what is useful is model interpretation, to understand the cis-regulatory code learned by the model.
50
How does TF-MoDisco work?
**(Transcription Factor Motif Discovery from Importance Scores)** is a computational tool designed to identify and interpret transcription factor (TF) binding motifs from deep learning models trained on genomic sequences:
* Takes feature importance scores from interpretability methods (DeepLIFT/SHAP).
* Extracts meaningful sequence motifs (short, recurring patterns) that drive the model's predictions, groups similar motifs together and provides a human-interpretable summary of the model's learned features.
* Gets "sequence hits": sliding windows of the input sequences are scanned to extract regions with high cumulative importance scores.
* The extracted sequence hits are clustered based on similarity in their importance score profiles and sequence content, grouping hits that likely represent the same underlying motif.
* Within each cluster, the sequences are aligned to identify the consensus motif. This alignment captures the shared pattern underlying the cluster. The importance score profiles are also aligned to refine the motif and ensure it reflects the contribution of each base.
* The consensus sequence and importance score profiles for each cluster are summarized into a final motif. These motifs are ranked by their prevalence or contribution to the model's predictions.
* The discovered motifs can be compared to known motif databases (e.g., JASPAR, CIS-BP) to annotate them with putative transcription factor identities.
51
How does DeepLIFT/SHAP work?
* Gradient-based methods assign an importance to each feature simultaneously based on the gradient with respect to an output, so they don't require 3 × seq_len forward passes like ISM. They do require backward passes: a model with a single output needs one backward pass per input, while a multi-task model needs one backward pass per task of interest. This is more efficient.
* Basic gradient-based/saliency methods are unstable and suffer from not having a reference to compare against.
* DeepLIFT provides a "rescale" correction for the instability of gradients and also calculates the gradient _with respect to a reference sequence_.
* The choice of reference sequence is critical for getting meaningful attributions. Ideally, a reference should be a biologically plausible sequence that is not predicted to have the activity you care about. In `tangermeme`, the default reference function is `dinucleotide_shuffle`, preserving GC content. It should be run with multiple references, i.e. multiple shuffles.
* DeepSHAP, which was developed concurrently, extends this idea to using multiple reference sequences and averaging over them. They are commonly called DeepLIFT/SHAP to recognize the connections between the methods and their concurrent development.
* The importance of an input feature is assessed based on the difference between the actual input and the baseline input. The sum of attributions is equal to the difference in predictions (known as convergence).
* DeepLIFT operates by propagating information backward through the network, similar to backpropagation. For each neuron, DeepLIFT calculates its contribution to the activation of the next neuron, using the baseline to calculate the difference. It redistributes the contributions of the output back to the input features in proportion to their effect on the output, ensuring that the sum of contributions matches the observed difference in the output. Like gradients, DeepLIFT uses a chain rule to propagate contributions through the network; however, instead of raw gradients, it propagates contribution scores. DeepLIFT handles nonlinearities (e.g., ReLU, sigmoid) by computing contributions based on the difference from the baseline. For example, for ReLU, if the baseline input results in a zero activation, DeepLIFT attributes the entire output change to the inputs that activated the ReLU.
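
A minimal Captum sketch of DeepLIFT attributions against a reference (the tiny model and shapes are hypothetical; in practice the baseline would be a dinucleotide shuffle of the input rather than zeros):

```python
import torch
import torch.nn as nn
from captum.attr import DeepLift

# Tiny stand-in model: one-hot DNA (batch, 4, seq_len) -> scalar activity
model = nn.Sequential(
    nn.Conv1d(4, 8, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 100, 1),
)
model.eval()

x = torch.zeros(1, 4, 100)
x[0, torch.randint(0, 4, (100,)), torch.arange(100)] = 1.0  # random one-hot sequence
baseline = torch.zeros_like(x)  # simplistic reference; shuffles are preferred

# Attributions sum to model(x) - model(baseline): the convergence property
attr = DeepLift(model).attribute(x, baselines=baseline)
print(attr.shape)  # (1, 4, 100): per-position, per-nucleotide contributions
```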
52
What does DeepLIFT provide for gradient instability?
DeepLIFT provides a 'rescale' correction for the instability of gradients and also calculates the gradient with respect to a reference sequence.
53
Why is the choice of reference sequence critical in DeepLIFT?
The choice of reference sequence is critical for getting meaningful attributions. Ideally, a reference should be a biologically plausible sequence that is not predicted to have the activity that you care about.
54
What is the default reference function in tangermeme?
In tangermeme, the default reference function is 'dinucleotide_shuffle' to maintain the same GC content.
55
What does DeepSHAP extend in relation to DeepLIFT?
DeepSHAP extends the idea of using multiple reference sequences and averaging over them.
56
How is the importance of an input feature assessed in DeepLIFT?
The importance of an input feature is assessed based on the difference between the actual input and the baseline input. The sum of attributions is equal to the difference in predictions (known as convergence).
57
How does DeepLIFT operate? (backprop)
DeepLIFT operates by propagating information backward through the network, similar to backpropagation, calculating contributions to the activation of the next neuron using the baseline.
58
What is the term for predicting genetic effects with a DNN?
The term is _in silico_ mutagenesis.
59
What is the first genomic DNN and its receptive field?
The first genomic DNN was DeepSEA, with a receptive field of 500 base-pairs.
60
Explain multi-headed attention layers
These layers enhance the model's ability to focus on different parts of input sequences simultaneously.

**Single attention mechanism**: attention operates by calculating how much focus (or "weight") each element in a sequence should receive relative to others. It does so by computing a weighted sum of input values (V), where the weights are determined by comparing queries (Q) with keys (K):
1. **Inputs**:
   - Query (Q): what are we looking for?
   - Key (K): what do we have?
   - Value (V): what information should we attend to if there's a match?
   - Three learned weight matrices project the input into queries (Q), keys (K) and values (V). The weight matrices specific to head h reduce the dimensionality of the input to the dimensionality of each head, often chosen as input dim / number of attention heads. These projections are learned parameters, optimized during training.
2. **Computation**:
   - Similarity between Q and K is measured (e.g., dot product) and normalized (softmax) to produce attention weights.
   - These weights are used to compute a weighted sum of V.

**Multi-headed attention** extends the single attention mechanism by running multiple attention heads in parallel, each focusing on different aspects of the input:
1. Parallel attention heads:
   - Instead of a single attention operation, the input sequence is split into multiple subspaces (via learned projections), one per head.
   - For each head, separate Q, K and V matrices are learned and used to compute attention; each head operates on a smaller, lower-dimensional version of the input (to reduce computational cost).
2. Diversity of focus:
   - Different heads can attend to different parts of the input or capture distinct relationships (e.g., short-range vs. long-range dependencies).
3. Concatenation and projection:
   - Outputs from all heads are concatenated.
   - A final linear projection combines these into a single output, summarizing the diverse perspectives learned by the heads.

Equation: $\text{Attention}_h = \text{softmax}\left(Q_h K_h^\top / \sqrt{d_k}\right) V_h$

Breaking it down:
- Q (queries): represents the "questions" the model is asking about the sequence.
- K (keys): represents the "knowledge" or context available for each position in the sequence.
- V (values): contains the actual content or information to be passed to the next layer.

LOOK AT NOTES FOR REST
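
A minimal numpy sketch of multi-headed scaled dot-product attention as described above (dimensions are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 10, 32, 4
d_k = d_model // n_heads            # per-head dim = input dim / number of heads

X = rng.normal(size=(seq_len, d_model))          # token embeddings
Wq, Wk, Wv = (rng.normal(size=(n_heads, d_model, d_k)) for _ in range(3))
Wo = rng.normal(size=(n_heads * d_k, d_model))   # final output projection

heads = []
for h in range(n_heads):
    Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]    # learned per-head projections
    weights = softmax(Q @ K.T / np.sqrt(d_k))    # Attention_h weights
    heads.append(weights @ V)                    # weighted sum of values

out = np.concatenate(heads, axis=-1) @ Wo        # concatenate + project
print(out.shape)  # (10, 32)
```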
61
What is the function of the Query (Q) in the attention mechanism?
Query (Q) represents what the model is looking for.
62
What does the Key (K) represent in the attention mechanism?
Key (K) represents what the model has.
63
What is the Value (V) in the attention mechanism?
Value (V) contains the actual content or information to be passed to the next layer.
64
What is the process of multi-headed attention?
Multi-headed attention runs multiple attention heads in parallel, each focusing on different aspects of the input.
65
What do the outputs from all heads in multi-headed attention do?
The outputs from all heads are concatenated and a final linear projection combines these into a single output.
66
What is the process of Transformers?
- Tokenize the input sequence and add positional encoding.
- Pass through stacked encoder layers, with self-attention and feedforward steps.
- If sequence generation is required, pass outputs to the decoder, which applies masked self-attention, cross-attention and feedforward steps.
- Generate outputs or predictions.
67
What are histone marks?
Post-translational modifications on the N-terminal tails of histone proteins which are a key epigenetic mechanism by which eukaryotic cells regulate transcriptional activity, via altering chromatin structure and interacting with other transcriptional regulators
68
What is H3K9ac associated with?
H3K9ac is associated with active promoter regions.
69
What is H3K4me1 associated with?
H3K4me1 is associated with active/poised distal enhancers. (The poised aspect is important; it may be why this mark did not perform well in Chapter 6.)
70
What is H3K4me3 associated with?
H3K4me3 is associated with active promoters.
71
What is H3K36me3 associated with?
A repressive gene-body mark; a binding partner for histone deacetylases (HDACs), which prevent run-away RNA polymerase II (Pol II) transcription.
72
What are the aims of the thesis?
* Standardisation of differential expression for cell type-specific transcriptional changes
* Standardisation of analysis for cell type-specific transcriptional changes specifically in the study of AD
* Standardisation of processing and quality control of genetic information
* Predicting the cell type-specific effects of genetic variants while accounting for distal regulation with genomic deep learning models
* Prioritising functional and disease-relevant genomic loci _in silico_ with deep learning by linking epigenetics to transcription
73
What are residual dilated convolutions?
A dilated convolution introduces spaces between the kernel elements, enabling the network to capture a larger receptive field without increasing the number of parameters or reducing resolution. The dilation rate determines how much the kernel is "spread out". Residual layers alleviate the vanishing gradient problem and enable the training of very deep networks: a residual block adds the input of a layer to its output. A projection (e.g., 1×1 convolution) on the residual path ensures channel dimensions match if they differ.
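
A minimal PyTorch sketch of a residual dilated convolution block of this kind (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class ResidualDilatedBlock(nn.Module):
    def __init__(self, in_ch, out_ch, dilation):
        super().__init__()
        # padding = dilation gives "same" length for kernel_size=3
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=3,
                              dilation=dilation, padding=dilation)
        self.relu = nn.ReLU()
        # 1x1 projection so the residual's channels match the output's
        self.proj = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        return self.relu(self.conv(x)) + self.proj(x)  # residual addition

x = torch.randn(1, 4, 1000)  # one-hot DNA: (batch, channels, seq_len)
block = ResidualDilatedBlock(4, 16, dilation=4)
print(block(x).shape)  # torch.Size([1, 16, 1000])
```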
74
What do the findings of the PhD thesis address?
The findings of this PhD thesis address critical gaps in the standardisation of genetic data processing, analysis of cell type-specific transcriptional changes, and the prioritization of functional genomic loci. AD field has identified many genetic variants but whether they are causal or in LD, their function, instigated regulatory roles (as 98% are non-coding) and cell type-specific effect remains elusive. Moreover, standardisation of analysis is needed. My PhD thesis focused on developing computational and machine learning techniques to robustly detect the genome’s cell type-specific, protein coding and non-coding effects in AD. This work spanned both AD-specific and broader advancements, addressing shortcomings in the standardisation of processing and quality control of genetic information, the standardisation of analysis of single-cell transcriptional changes in disease and the development of genomic and histone mark deep learning models to attempt to elucidate the functional role of disease relevant genomic loci and genetic variants in a cell type-specific manner. The developed open-source, computational and deep learning techniques are all broadly applicable to our comprehension of the cis-regulatory code.
75
What was the motivation for Chapter 2?
Many benchmarks have shown the inadequacy of pseudoreplication approaches and showed pseudobulk as the best performing, but one paper showed mixed models as better. We wanted to reinvestigate.
76
What was the main additional analysis in Chapter 2?
Previously, only type 1 error rates were considered (the proportion of non-differentially expressed genes indicated as differentially expressed by a model; FP/(FP+TN)). We considered both type 1 and type 2 (FN/(FN+TP)) error rates with MCC and ROC curves, and set a seed for reanalysis.
77
Why is pseudobulk better than mixed models?
From Squair et al. 2021, there is a "systematic tendency for single-cell methods to identify highly expressed genes as DE"; the aggregation approach of pseudobulk protects against this. Pseudobulk also accounts for the intrinsic variability of biological replicates; mixed models should do this too but, based on our and others' benchmarks, do a worse job (especially at low sample/cell numbers).
78
Why does pseudobulk sum perform poorly on imbalanced data?
hierarchicell does not normalise the simulated datasets before passing them to the pseudobulk approaches. This is a standard step in such analyses to account for differences in sequencing depth and library sizes. This approach was taken by Zimmerman _et al._ as their data is simulated one independent gene at a time, without considering differences in library size. The effect of this step is more apparent with imbalanced numbers of cells, where pseudobulk sum's performance degrades dramatically. Pseudobulk mean appears invariant to this missing normalisation step because of the averaging's own normalisation effect. This was a flaw in the simulation software strategy.
79
What was the motivation for Chapter 3?
Best practices for single-cell processing and DE are now defined, but many different approaches have been used in the AD field. This chapter is a case study showing how disparate the results can be, using the first AD scRNA-Seq study. It also aims to stop the AD field concentrating on spuriously identified genes in downstream work.
80
What was the main additional analysis in Chapter 3?
Reprocessing of the data using current best-practice approaches (resulting in ~20k fewer cells), then running pseudobulk DE on both the original and reprocessed data to show the disparity with the authors' results. We showed the authors' DEGs were largely a product of cell counts, randomly permuting labels 100 times for pseudoreplication and pseudobulk to highlight this effect.
81
How many cells were in the Mathys et al. data?
About 80k cells from 48 individuals (24 with AD pathology).
82
How did authors filter mitochondrial reads in Chapter 3?
The authors filtered out high-mitochondrial-read nuclei based on clusters from their t-SNE projection of the data, but this was ineffective as some nuclei were kept with >75% mitochondrial reads; in a single-nucleus study there should be few or none.
83
How does edgeR work?
uses an overdispersed Poisson model/negative binomial model to model count data and account for both biological and technical variability. EdgeR estimates the genewise dispersions by conditional maximum likelihood, conditioning on the total count for that gene. An empirical Bayes procedure is used to shrink the dispersions towards a consensus value, effectively borrowing information between genes. Finally differential expression is assessed for each gene using an exact test analogous to Fisher's exact test, but adapted for overdispersed data.
84
What was the motivation for Chapter 4?
GWAS summary statistics have popularised and accelerated genetic research. However, a lack of standardisation of the file formats used and tools for formatting and quality control have proven problematic when running secondary analysis tools or performing meta-analysis studies
85
What was the motivation for Chapter 5?
Understanding genetic variants' effects in a cell type-specific manner is crucial for interpreting GWAS results. However, profiling these effects across the non-coding genome remains challenging due to the scalability limits of experimental methods assaying a cell's epigenome. Many cell types have been assayed in projects like ENCODE, but not the brain cell types of interest for AD; can we impute them with DNNs? We are interested in TFs as they could eventually be targets, but histone mark data is what is available. Recent DNNs have increased receptive fields but cannot predict in new cell types.
86
What was the training data for EC?
104 GRCh37 DNA sequence and p-value continuous tracks after peak calling in EpiMap. The p-value track was used over fold change due to its greater signal-to-noise ratio, and ATAC rather than DNase due to its prevalence in recent work. Arcsinh-transformed signal was used to negate the effects of differing sequencing depths.
87
What model was benchmarked against with EC?
Epitome, which predicts bed regions of peaks (1/0 predictions) at 200 base-pair resolution. Predictions had to be compared over 3,200 base-pairs, the lowest common multiple of the two models' resolutions.
88
What was the motivation for uncovering the cis-regulatory code of AD? Nathan's goal for how to do this? What should be done with future research?
* A major goal of this work was to understand the effect of genetic variants on the cis-regulatory code in a cell type-specific manner.
* For example, epigenetic regulation of transcription factor binding in microglia within the hippocampus, a cell type implicated in the genetic burden of AD.
* By applying the same SLDP method developed in Chapter 5 with AD GWAS data, it would be possible to systematically assess associations for all known transcription factors across multiple cell types.
* The identified transcription factors could then be inhibited in a cell type-specific manner as a form of targeted therapeutics, as has been explored in cancer research.
* However, such epigenetic experimental assays in these targeted brain regions and cell types have only been sparsely explored to date. To address this gap, in **Chapter 5**, we developed Enformer Celltyping to predict epigenetic signals in previously unseen cell types, circumventing the lack of experimental assays.
* We chose histone marks as a first attempt due to the availability of hQTL sets to validate this approach. The original goal was then to apply the same approach to predict transcription factor binding assays; however, we found that current genomic deep learning models cannot capture the effect of genetic variants.
* Future research aimed at elucidating the cis-regulatory code of AD should prioritise the collation of experimental data for all relevant cell types and transcription factors, as extrapolating predictions to previously unseen cell types leads to worse performance, and models that predict assays in new cell types cannot account for genetic variants.
89
What was the EC pre-training step?
Separate training of the DNA sequence and chromatin accessibility submodules: the DNA side predicts the average signal and its distribution (10 bins, 0.5 intervals), and the celltyping side predicts the difference between the average histone mark signal and the cell type-specific signal. Pre-training sensibly initialised the weights of the chromatin accessibility layers before combining them with the DNA layers (which contained the Enformer architecture) for the full training step.
90
What does global chromatin accessibility correspond to in EC?
Global chromatin accessibility corresponds to the chromatin accessibility for 3,000 base-pairs around the transcriptional start site of 1,216 marker genes (from PanglaoDB), averaged at 250 base-pair resolution.
91
What are embedding layers used for in EC?
Used for local and global CA: randomly initialised vectors of a set size (done at multiple resolutions for EC), updated through backpropagation. These can be visualised in 2D with UMAP.
92
What histone marks were predicted by EC?
H3K27ac, H3K4me1, H3K4me3, H3K9me3, H3K27me3, and H3K36me3.
93
What loss functions were used in EC?
Pre-training: a Poisson negative log-likelihood loss was used for the average signal prediction and cross-entropy loss for the distribution in the DNA submodule, whereas mean squared error (MSE) was used for the celltyping submodule, following other epigenetic embedding approaches, given the possibility of negative values. Full training: the Poisson negative log-likelihood loss.
94
How was EC's receptive field tested?
Random permutations of DNA (in silico mutations) at incrementally distant positions, measuring the effect on the prediction.
95
What is the Average Precision calculation?
(Used to compare EC and Epitome.) The weighted mean of precisions at each threshold, where precision = true positives / total number of predicted positives.
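
A minimal sketch of the calculation, where the weights are the increases in recall between thresholds (as in scikit-learn's `average_precision_score`); the toy labels and scores are illustrative:

```python
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([0, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.9, 0.6, 0.75, 0.7, 0.2])

# Manual AP: sum over ranks of (recall step) * precision at that rank
order = np.argsort(-y_score)
tp = np.cumsum(y_true[order])
precision = tp / np.arange(1, len(y_true) + 1)
recall = tp / y_true.sum()
ap = np.sum(np.diff(np.concatenate([[0.0], recall])) * precision)

print(ap, average_precision_score(y_true, y_score))  # the two values match
```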
96
What is the Off-centre correlation analysis EC?
Checking that predicted SNP effects are the same when the input is centred on the SNP as when it is centred on the region of regulatory effect.
97
Explain how SLDP works.
Measures the statistical concordance between the signed variant effects (our model's predictions) and the genome-wide association study's marginal correlations. SLDP uses generalized least-squares regression to measure the agreement between these, iteratively inverting the direction of the signed variant effect measures along with their neighbouring entries in LD blocks to derive a null distribution. The measured agreement defines how important the variants are to the phenotype's heritability.
98
What is the global chromatin accessibility signal cell type-specific motif enrichment approach?
Used histone marks that act outside of the TSS or gene body, as these may harbour cell type-specific transcription factors, and identified all peaks with a -log10 p-value cut-off. For each peak, the influence of the global signal was approximated by calculating the partial derivatives of the model output with respect to the input, i.e. the gradient on the input. The results for each peak were ordered by absolute value and the top 10% of peaks reliant on the global signal were identified. The DNA at these positions was run through Homer to get TFs, and the TFs' genes were then tested for cell type specificity with EWCE.
99
Explain how s-LDSC works.
A method of estimating heritability. Uses the assumption that genetic effect values for true associations are positively correlated with LD scores, whereas genetic effect values for false positives (e.g. due to population stratification/drift) are not correlated with LD scores. s-LDSC estimates heritability across functional annotations. LD score regression quantifies the contributions of polygenicity (many small genetic effects) and bias so they can be separated. Regression equation: test statistic = average causal effect per SNP × LD score + inflation due to population stratification + 1.
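
For reference, the standard LDSC regression equation this paraphrases (from Bulik-Sullivan et al.; N = sample size, M = number of SNPs, ℓ_j = LD score of SNP j, a = confounding/stratification term, h² = SNP heritability):

$$\mathbb{E}\left[\chi^2_j\right] = \frac{N h^2}{M}\,\ell_j + N a + 1$$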
100
What is the cell type-specific genetic enrichment for complex traits?
H3K27ac peaks (predicted with an arcsinh(1) cut-off, actual, and ATAC). Looked for enrichment in each cell type's peak file with s-LDSC. Key point: predicted H3K27ac gave better enrichment than ATAC (using real H3K27ac as the baseline), improving on ATAC in 14 of the 16 expected enrichments.
101
What is the equation on a node in a neural network?
w · X + b, where w is the weight, X the input and b the bias.
102
What are two limitations of hQTL SLDP EC analysis?
Firstly, the hQTL set contained interactions in which a histone mark binding position was removed by a genetic variant, but these histone mark binding positions were not captured in Enformer Celltyping's prediction on the major allele. Secondly, genomic deep learning models, including Enformer Celltyping, inherently struggle to accurately predict the effect of genetic variants under their current training paradigm.
103
What did Sasse et al. show?
Sasse _et al._, comprehensively showed how Enformer underperforms at predicting the effect of genetic variants on transcription, even predicting the incorrect direction of effect in up to 40% of tested cases
104
What is the motivation behind Chapter 6 - Predicting gene expression from histone...?
Many have previously linked histone marks to expression but did not consider all three of: differing histone mark functions, distal regulatory effects and level of cell differentiation. Also, little has been done to use in silico perturbations to learn about regulation from these models (as has been done with DNA models).
105
What data and models are discussed in Chapter 6?
Eleven cell types/tissues; 7 histone marks (average log2-transformed, 100 base-pair binned read depth of the histone mark signal); two neural networks: a simple local model (6k bp, convolutions) and a distal model (Chromoformer, attention-based, 40k bp, 3 input resolutions of 100, 500 and 2k bp, each with its own transformer block). Both use 100 bp bins, 4-fold cross-validation, the ADAM optimiser, early stopping, learning rate decay and a separate model per fold and cell type; both predict log2-transformed RPKM.
106
How is an active gene defined in Chapter 6?
A gene was defined as active or inactive based on whether its expression level was above or below the median for that cell type
107
What finding is highlighted in Chapter 6?
We find that there is no universal histone mark which is consistently the most predictive of expression. We recommend that researchers consider all three influencing factors when determining the effect of histone mark levels on the transcriptional state of a cell. Although H3K4me3 was always good, H3K27ac, H3K9ac and H3K36me3 were also interesting. Active marks performed better in ESCs, whereas repressive marks like H3K9me3 performed relatively better in adult primary tissues: active marks performed better at early stages of lineage commitment, i.e. ESCs, whereas repressive marks were more predictive in fully differentiated cells, i.e. differentiated tissue (there is more accessible DNA in ESCs and at early stages of lineage commitment). A key point of our findings is the marginal return in performance from: 1. extending from an intentionally simple local promoter model to an attention-based, computationally complex model which accounts for distal histone mark levels, and 2. increasing the number of histone marks included in the model.
108
Which histone marks were used in in silico perturbations?
We selected one active and one repressive mark which are found at both promoter and distal regulatory elements: H3K27ac and H3K27me3.
109
What was the approach for in silico perturbations?
The predictions from the different k-fold versions of the model were averaged, similar to the approach commonly used in sequence-to-expression models. For the promoter histone signal, the full 6,000 base pairs around the TSS were perturbed, whereas for the distal histone signal, bins of 2,000 base pairs across the 40,000 base-pair receptive field were perturbed iteratively. The implemented perturbation levels were between 0 and 1 inclusive, in 0.1 steps. As well as averaging the predictions from the different k-fold versions, we also tested the correlation between the different folds to ensure the model was learning a consistent regulatory code. Moreover, we benchmarked this concordance against Borzoi (over an equal 40k receptive field); our model did slightly better than Borzoi.
110
What was tested for enrichment in QTL during in silico perturbations?
Only upstream and downstream regions were used, not the TSS; the top decile of predicted expression change was taken from both and checked for overlap with fine-mapped eQTLs. To test for enrichment of the fine-mapped, tissue-specific SNPs, a bootstrap sampling experiment was implemented where the proportion of SNPs found in each decile was compared against 10,000 randomly sampled regions from all deciles. P-values were derived and adjusted using FDR correction for multiple testing. Enrichment was also tested against Hi-C, maximum histone mark activity and proximal loci (6k bp up- or downstream).
111
What is the in silico perturbation disease enrichment?
s-LDSC applied to the top decile of predicted expression change upstream and downstream of the TSS.
112
What is a possible advantage of Chapter 6 over DNA deep learning?
By identifying genomic loci of interest based on perturbing histone mark levels, our model captures significant enrichment of eQTLs in the most predictive regions, offering an alternative to genomic deep learning models trained on DNA; there is no need to worry about the effects of LD.
113
What are Siamese networks?
An architecture inherently designed to uncover differences in inputs. Siamese networks have been heavily used in tasks like identifying differences and similarities between images. Traditionally, a Siamese network's architecture has identical layers for each input with shared weights across them, and uses a loss function like contrastive loss based on differences in the inputs.
114
How can DL on genetic variants be improved?
1. QTLs (small variance)
2. Evolutionary diversity (though naive genomic DNA only represents a limited subset of all possible genomic DNA arrangements)
3. Synthetic sequences (MAVEs - MPRAs)
4. Withholding entire genomes (De Boer)
All are a trade-off between the amount of genetic diversity they include and their similarity to the functional assays used by genomic deep learning models trained in the current paradigm.
115
What is an alternative view on why current genomic deep learning models fail?
That it is due to loss functions focusing on variation across genes or, more generally, genomic loci. The model is not penalised for the variation of each gene across cell contexts, so a naïve model could predict the average expression of a gene across cell types to minimise loss. An argument could be made that penalising a model for not identifying cell type-specific changes in expression (or another functional assay) would force it to pay attention to cell type-specific regulatory motifs like enhancer motifs, which in turn would benefit the model's variant effect predictions. In the same manner as considering multiple personalised genomes at a location, the argument here is to consider multiple cellular states. Although the DNA input does not change, the output will, forcing the model to understand cell type-specific regulation. This view, although promising, would be non-trivial to incorporate and may require multiple loss functions to ensure the model maintains strong performance across genes or genomic loci or, if modelling expression, using pretrained models which capture epigenetic functional assays as a starting point, as their learnt regulatory syntax is paramount (e.g. chromatin accessibility as a proxy for transcription factor binding sites, or regulatory histone marks like H3K27ac). This had yet to be fully evaluated in the field; it has since been done by Decima.
116
Why is creating a benchmark for deep learning in genomics hard?
The functional assays used to measure genomic deep learning models' ability do not provide a straightforward measure of comprehending the cis-regulatory code: each assay is subject to experimental variability within the technique, such as batch effects, but also variability across differing techniques resulting from technique-specific biases. For example, ChIP-Seq results differ dramatically from CUT&Tag under the same experimental conditions.
117
What are the issues with GV-Rep?
1. The goal of GV-Rep is to validate genomic language models for the classification of genetic variants in clinical settings (Li _et al._, 2024). For example, the records collated from ClinVar classify the pathogenicity of genetic variants (benign, likely benign, likely pathogenic or pathogenic) for differing diseases; these are not functional assay outputs.
2. Other records are from GWAS summary statistics, associating SNPs to diseases based on their p-values. However, I believe these tasks are too broad for genomic deep learning models, as they lack any functional measurements or cell type/cell state contexts, and should not be included in such a benchmark dataset.
3. xQTL datasets, which do not consider LD. Even with fine-mapping, usually only the highest-confidence positives and negatives are taken, which could bias models towards the easiest cases.
4. _In vitro_ experiments like MAVEs and CRISPRi assays. There are barriers to their immediate use, however: when deposited, most experiments lack the information needed to map their imputed genetic code to the reference genome, so only the immediate ~150 base-pairs are available, far smaller than most genomic deep learning models' receptive fields. Their outputs also differ a lot from what these models are trained to predict, and they do not capture in vivo measurements from the whole genome, like cell type-specific effects.
118
What is an alternative way to measure SNP effects?
We could consider individual genomes rather than genetic variants in isolation to avoid the effect of LD; however, this does not test a model's ability to differentiate between causal and tagged SNPs, and whole genome sequencing yields a small amount of variation for the monetary investment. Coordinated efforts to decide which cell types and functional measurements to use across the differing data sources will also be paramount for the field: if a model is trained to predict in cell A, a benchmark on cell B is not going to be very useful.
119
Define the cis-regulatory code
the regulatory information of the genome
120
What is AD defined by
Neuronal loss and gliosis (a reactive change of glial cells in response to damage to the central nervous system; in most cases, gliosis involves the proliferation of glial cells).
121
Which regions are affected at the early stages of AD?
The hippocampus and entorhinal cortex (regions with a major role in memory and learning).
122
There is selective vulnerability of specific neuronal subtypes in AD; name one such neuronal type.
cholinergic neurons in the nucleus basalis of Meynert which forms part of the basal forebrain are selectively lost
123
What comes before neuronal loss in AD
synaptic loss, shown in animal models and cell cultures
124
AD treatments approved by FDA
lecanemab and donanemab; moderate slowing of cognitive decline, by clearing build-ups of extracellular plaque deposits of the β-amyloid peptide (Aβ) in the brain – a hallmark of AD
125
How does Homer work?
Scans input sequences and background sequences for k-mers (6-12 bp). Each k-mer is tested for significant enrichment in the input sequences using a hypergeometric test. Homer then clusters similar significant k-mers and retrieves TFs for the motifs from databases like JASPAR.
126
Difference CAGE-Seq (Enformer) & RNA-Seq (Borzoi)
CAGE measures transcription at individual promoters (there can be multiple per gene), so moving to predicting RNA-Seq means the model must learn transcriptional regulation, splicing regulation, etc., which combined lead to a particular gene's expression.
127
EC When flattening the cell type embeddings, how was it performed? Was the position information affected in this process?
The embedding layers convert the chromatin accessibility information from a 1D to a 2D shape, where the size of the second dimension is chosen by the user. The embedded vector representation is then updated through backpropagation to better reflect the inputted signal. Flattening after the embedding layers does not affect the positional information of the chromatin accessibility signal. The example below explains and motivates flattening after embedding and why positional information is not affected.

Say we have two sentences as our entire dictionary:

Sentence 1: I play football
Sentence 2: I play basketball

and so we assign integers to the words as follows:

Sequence 1: [1, 2, 3]
Sequence 2: [1, 2, 4]

Then we pass the input (the integer sequences) to an embedding layer, which assigns random numbers in 2 dimensions and uses backpropagation to update these values:

| Word | Index | Vector |
|------|-------|--------|
| I | 1 | [0.4, 0.2] |
| play | 2 | [0.7, 0.3] |
| football | 3 | [0.2, 0.1] |
| basketball | 4 | [0.2, 0.8] |

So the embedding of the sentences looks like:

Sequence 1: [[0.4, 0.2], [0.7, 0.3], [0.2, 0.1]]
Sequence 2: [[0.4, 0.2], [0.7, 0.3], [0.2, 0.8]]

Next the model flattens the embedded sequences to make them 1-dimensional again:

Sequence 1: [0.4, 0.2, 0.7, 0.3, 0.2, 0.1]
Sequence 2: [0.4, 0.2, 0.7, 0.3, 0.2, 0.8]

The words' order remains the same after embedding and flattening, and the order of the chromatin accessibility data is similarly preserved. This same flattening approach has been used in other genomic deep learning models like Avocado (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-01977-6, model architecture: https://github.com/jmschrei/avocado/blob/master/avocado/model.py) and is also used commonly in NLP without affecting positional information, such as in Google Research's BUSTLE.
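
A minimal Keras sketch of this embed-then-flatten pattern (vocabulary and dimensions follow the toy example above; real trained weights would differ):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Toy "dictionary": 1=I, 2=play, 3=football, 4=basketball (0 reserved)
sequences = np.array([[1, 2, 3],   # "I play football"
                      [1, 2, 4]])  # "I play basketball"

model = keras.Sequential([
    layers.Embedding(input_dim=5, output_dim=2),  # each token -> 2D vector
    layers.Flatten(),                             # (3, 2) -> (6,), order preserved
])

out = model(sequences)
print(out.shape)  # (2, 6): one flattened embedding per sequence
```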
128
Aerts work on Synthetic Enhancers (2023 - Dec)
Stein Aerts' and Stark's groups:
* Proved that enhancers contain code specific to one cell type but can affect multiple cell types by having intertwined codes in the same 500 bp region. Shown by making a cell type-specific enhancer active in another cell type, and by making an enhancer that is active in multiple cell types active in just one.
* Aerts' work was done in the fruit fly brain and then in humans too, but the human results were validated with Enformer rather than experimental work.
129
Diff Mamba, Hyena, Attention
| Feature | Attention | Hyena | Mamba |
|---------|-----------|-------|-------|
| Complexity | O(n²) | O(n log n) | O(n) or O(n log n) |
| Long-range dependencies | Excellent | Good | Good |
| Efficiency for long sequences | Poor | Excellent | Excellent |
| Mechanism | Token-token interactions | Convolution + gating | Structured linear operations |
| Best use cases | General-purpose NLP tasks | Long-sequence modelling | Long-sequence modelling |
130
131
positional encoding with attention
Attention mechanisms (e.g., multi-headed attention) do not inherently capture order (they treat input tokens as a "bag of words"), so positional encoding introduces this sequence information. Positional encodings can be learnt (via backprop) or fixed. After calculating positional encodings, they are added element-wise to the token embeddings: input representation = token embeddings + positional encodings. So this happens before attention: raw tokens are first mapped to learned token embeddings using an embedding layer, then the positional encodings are added, then the result is passed to the transformer (attention). Rotary Positional Embedding (RoPE) is now commonly used for relative positional encodings, but not in genomics.
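
A minimal numpy sketch of the fixed sinusoidal encoding being added to token embeddings before attention (shapes are illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[p, 2i] = sin(p / 10000^(2i/d)), PE[p, 2i+1] = cos(p / 10000^(2i/d))
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

token_embeddings = np.random.default_rng(0).normal(size=(10, 32))
x = token_embeddings + sinusoidal_positional_encoding(10, 32)  # element-wise add
# x is what gets passed into the attention layers
```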
132
Rotary Positional Embedding (ROPE) positional encoding
Encodes positional information directly into the attention mechanism by applying rotations to the key (K) and query (Q) vectors. Instead of adding positional encodings to the input embeddings (as in sinusoidal or learned encodings), RoPE applies the positional information as a multiplicative transformation of the input vectors, allowing the attention mechanism to naturally consider relative positions between tokens. RoPE uses a sinusoidal pattern (like the original fixed positional encoding) but embeds it into the attention mechanism by rotating the embeddings. For a given position p, the transformation involves a rotation in the complex plane. SEE NOTES FOR FORMULA
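
For reference (the standard published RoPE rotation, not taken from the card's notes): each pair of dimensions (2i, 2i+1) of a query/key vector at position p is rotated by the angle p·θ_i:

$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos(p\,\theta_i) & -\sin(p\,\theta_i) \\ \sin(p\,\theta_i) & \cos(p\,\theta_i) \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}, \qquad \theta_i = 10000^{-2i/d}$$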
133
Enformer's relative positional encodings
Exponential (lower value further away), central mask (0 for positions beyond a certain point) and gamma (another distribution with lower values further away, but parameterised differently to the exponential).
134
Diff atten, mamba, hyena:
Attention = token-token interactions, O(n²); Hyena = convolution + gating, O(n log n); Mamba = structured linear operations, O(n) or O(n log n).
135
Aerts' work on synthetic enhancers
Enhancers contain code specific to one cell type but can affect multiple cell types by having intertwined codes in the same 500 bp region. Shown by making a cell type-specific enhancer active in another cell type, and by making an enhancer that is active in multiple cell types active in just one.