Viva prep all Flashcards

(135 cards)

1
Q

What is the equation for causal effects on a phenotype?

A

P = G x E, where P is phenotype, G is genetics, and E is environment.

2
Q

Why can heritability estimates from twin studies be inflated?

A

Due to the assumption that environment is shared equally across monozygotic and dizygotic twins, which may not hold true in all cases.

3
Q

What percentage of AD SNPs are found in non-coding regions?

A

98%

4
Q

What is an example of an extreme distal enhancer?

A

The limb enhancer that regulates expression of the sonic hedgehog (SHH) gene. This enhancer, the Zone of Polarizing Activity Regulatory Sequence (ZRS), lies approximately 800,000 to 1,000,000 base pairs away from SHH (in mouse and human), within a non-coding intron of a neighbouring gene, yet its elimination causes truncation of limbs in mice. Moreover, this regulation can be altered by single base-pair changes in humans.

5
Q

What percentage of the genome is protein coding?

A

1.5%

6
Q

Which protein-coding genes are implicated in the pathogenesis of AD?

A

APOE, APP, PSEN1 and PSEN2. APOE, produced predominantly by astrocytes and activated microglia in the brain and involved in transporting lipids between cells and organs, is the greatest genetic risk factor for AD.

7
Q

Why is RNA-Seq considered better than microarrays?

A

RNA-Seq offers a full view of the whole transcriptome, profiling RNA via sequencing of complementary DNA (cDNA), i.e. the whole repertoire of transcripts for the particular tissue or cells, including allele-specific expression and alternative splicing. Microarrays, in contrast, rely on pre-defined transcripts or genes.

8
Q

What is the methodological approach of scRNA-Seq?

A

Cells are isolated using methods such as microfluidics, microwells, droplet-based capture or in situ barcoding, co-encapsulating each cell with a uniquely DNA-barcoded bead. This means mRNA can be linked back to its cell of origin.

9
Q

What are the steps in processing scRNA-Seq data?

A

scFlow Nextflow pipeline steps -
1. EmptyDrops (distinguishes true nuclei from empty droplets and determines the ambient RNA profile, i.e. cell-free mRNA. Models the distribution of UMIs in empty droplets to establish a background distribution, then classifies droplets with significant deviation from the background as cell-containing.)
2. Nuclei filtering based on total read counts and total genes expressed (200 minimum for each) or > 4 median absolute deviations (MAD) for either
3. Mitochondrial reads: 10% or greater is indicative of cell death (steps 2-3 are sketched in code below)
4. DoubletFinder for doublet detection: generates random doublets from the input data, projects them into a lower-dimensional space with the real cells and uses a nearest-neighbour algorithm to calculate doublet scores; repeated iteratively
5. LIGER: calculate integrative factors across samples
6. UMAP: two-dimensional embeddings of the LIGER integrated factors were calculated using UMAP
7. Leiden community detection algorithm: detect clusters of cells from the 2D UMAP (LIGER) embeddings
8. Cell-typing of clusters using EWCE; the top five marker genes for each automatically annotated cell type were determined using Monocle 3 and validated against canonical cell-type markers
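
A minimal scanpy sketch of the nuclei filtering in steps 2-3 above (scanpy is assumed here as a generic stand-in for the scFlow internals; the input path is hypothetical):

```python
import scanpy as sc

adata = sc.read_10x_mtx("filtered_feature_bc_matrix/")  # hypothetical path

# Step 2: filter nuclei on total genes expressed and total counts (min 200 each)
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_cells(adata, min_counts=200)

# Step 3: drop nuclei with >= 10% mitochondrial reads (indicative of cell death)
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)
adata = adata[adata.obs["pct_counts_mt"] < 10].copy()
```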

10
Q

How does LIGER work?

A

Uses integrative non-negative matrix factorization to identify factors shared among datasets. NMF formula: V = W × H, giving a lower-dimensional representation. W is the feature matrix; H is the coefficient matrix (the weights associated with W).
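
A minimal scikit-learn sketch of plain NMF (illustrative only; LIGER's integrative NMF additionally learns dataset-specific factors):

```python
import numpy as np
from sklearn.decomposition import NMF

# V: non-negative expression matrix (cells x genes), random here for illustration
rng = np.random.default_rng(0)
V = rng.poisson(1.0, size=(100, 50)).astype(float)

model = NMF(n_components=10, init="nndsvda", random_state=0)
W = model.fit_transform(V)   # feature/usage matrix (cells x factors)
H = model.components_        # coefficient matrix (factors x genes)

print(np.linalg.norm(V - W @ H))  # reconstruction error of V ~ W x H
```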

11
Q

How does the Leiden community detection algorithm work?

A

A hierarchical clustering algorithm that recursively merges communities into single nodes by greedily optimizing the modularity, repeating the process on the condensed graph. It modifies the Louvain algorithm to address some of its shortcomings, namely that some of the communities found by Louvain are not well-connected. This is achieved by periodically randomly breaking down communities into smaller, well-connected ones.
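
A minimal sketch using the igraph/leidenalg packages (assuming a cell-cell k-nearest-neighbour graph has already been built; a toy graph stands in here):

```python
import igraph as ig
import leidenalg

# Toy graph standing in for a cell-cell kNN graph
g = ig.Graph.Famous("Zachary")

# Partition by greedily optimising modularity, Leiden-style
partition = leidenalg.find_partition(g, leidenalg.ModularityVertexPartition)
print(partition.membership)  # community label per node (cell)
```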

12
Q

What are pseudoreplication approaches?

A

Approaches such as ROTS, which treat individual cells as independent replicates.

13
Q

What are pseudobulk approaches?

A

Pseudobulk + edgeR LRT: aggregate a cell type's reads to an individual, often by sum or mean, to help avoid issues with cell dropout and low sequencing depth.
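
A minimal pandas sketch of the aggregation step (column names are hypothetical):

```python
import pandas as pd

# counts: one row per cell, with its sample/cell-type labels plus gene counts
counts = pd.DataFrame({
    "sample": ["s1", "s1", "s2", "s2"],
    "cell_type": ["microglia"] * 4,
    "GeneA": [3, 0, 5, 1],
    "GeneB": [0, 2, 1, 4],
})

# Sum each gene over all cells of a type within an individual -> pseudobulk
pseudobulk = counts.groupby(["sample", "cell_type"]).sum()
print(pseudobulk)  # one aggregated profile per sample and cell type
```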

14
Q

What are mixed model approaches? Give an example.

A

Generalised linear mixed models (GLMMs) account for both subject- and cell-level information, often using a random effect for samples to account for the group's subject-level heterogeneity.
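
A minimal statsmodels sketch (a linear mixed model standing in for the GLMM idea; the data frame and its columns are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "expression": rng.normal(size=200),              # one gene, one row per cell
    "diagnosis": np.repeat(["AD", "control"], 100),  # fixed effect
    "donor": np.tile([f"d{i}" for i in range(10)], 20),
})

# Fixed effect for diagnosis, random intercept per donor: the sample-level
# random effect that captures subject-level heterogeneity
model = smf.mixedlm("expression ~ diagnosis", df, groups=df["donor"])
result = model.fit()
print(result.summary())
```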

15
Q

What is the motivation for looking into scRNA-Seq in AD?

A

Many studies exist, but the DEGs identified vary dramatically because analysis approaches differ massively across them. A consensus on the approach is needed.

16
Q

What is the definition of epigenetics?

A

changes in gene function that are mitotically (somatic - non-reproductive cells) and/or meiotically (germline - reproductive cells) heritable and that do not entail a change in DNA sequence i.e. dynamic cellular regulation.

17
Q

Can epigenetic marks be inherited?

A

A recent study found certain marks can be inherited, perhaps due to partial escape from complete reprogramming, resulting in cross-generational changes in developmental and cellular features.

18
Q

What are some biological roles of epigenetics?

A

cell differentiation and specialisation (as shown in Waddington’s landscape), cell cycle entry and exit and immune cell activation.

19
Q

Can tumorigenesis occur without genetic alteration?

A

A recent study found tumorigenesis can arise even in the absence of any genetic mutations, following transient loss of epigenetic gene silencing (plausibly DNA methylation: when located in a gene promoter, DNA methylation typically acts to repress transcription).

20
Q

What are the types of epigenetics?

A

DNA methylation, DNA-binding proteins, non-coding RNAs and histone modifications

21
Q

What is the order of epigenetic modifications?

A

Pioneer transcription factors bind to compacted chromatin, inducing nucleosome structural changes, recruiting histone marks and removing methyl marks.

22
Q

What is the ChIP-Seq approach?

A

Chromatin immunoprecipitation followed by sequencing - cross-linking proteins to DNA, cleaving the chromatin, immunoprecipitating with protein-specific antibodies and finally, amplifying and sequencing the associated DNA fragments

23
Q

What is the ATAC-Seq approach?

A

assay for transposase-accessible chromatin using sequencing - Open chromatin regions are tagged with sequencing adaptors and cleaved by hyperactive Tn5 transposases. The tagged DNA fragments are next purified, amplified and sequenced

24
Q

What did Nott 2019 find?

A

Histone modifications associated with noncoding regulatory regions such as enhancers have also been assayed and found to harbour AD variants in cell type-specific analyses (AD & Microglia)

25
What did Skene 2016 find?
Disease associated gene lists (taken from literature) had cell type-specific enrichments (AD & Microglia).
26
Which transcription factors are dysregulated in AD?
Sp1, AP-2 and PPARγ. Sp1 has been shown to regulate _BACE1_, which processes APP (amyloid precursor protein, with a role in neural growth and maturation during brain development) and is essential for Aβ production. AP-2 and PPARγ regulate APOE (the greatest genetic risk factor for AD, with a role in transporting lipids between cells and organs, found predominantly in activated microglia and astrocytes).
27
What is the percentage difference in the genome between any two individuals?
Originally estimated at 0.4%; now 1%, and higher in genomes from African and other ancestral populations.
28
What are the two types of genetic variations?
variation between individuals (germline variants) or within individuals (somatic mutations)
29
Why are somatic mutations not studied in neurodegenerative diseases?
Due to the lack of cell division in neuronal populations.
30
What is neurogenesis?
The _de novo_ formation of neurons.
31
What is genetic drift?
A change in the frequency of an existing gene variant (allele) in a population due to random chance. Genetic drift may cause gene variants to disappear completely, thereby reducing genetic variation. It can also cause initially rare alleles to become much more frequent, even fixed. Genetic drift is a less powerful force than selection.
32
What is genetic selection?
the process by which certain traits become more prevalent in a species than other traits
33
What is genetic recombination?
Also known as genetic reshuffling: the exchange of genetic material between different organisms, which leads to the production of offspring with combinations of traits that differ from those found in either parent.
34
What did the Human Genome Project do?
Created the first, albeit incomplete, map of the genome, facilitating large-scale genetic studies like GWAS.
35
How does MAGMA work?
A gene-based analysis tool that converts SNP-level P-values identified from GWAS to gene-level P-values, assigning variants to their target genes.
36
Why are summary statistics valuable?
They do not require the transfer of individual-level, personally identifiable information from participants and can be integrated into meta-analyses.
37
What are examples of movements to standardize summary statistic file formats?
NHGRI-EBI GWAS Catalogue standardised format, the MRC IEU OpenGWAS infrastructure and catalogue, the SMR Tool binary format and the variant call format (VCF) to store GWAS summary statistics (GWAS-VCF).
38
What are copy number variations (CNVs)?
CNVs include insertions, deletions, and duplications of segments of DNA.
39
What is LD?
LD occurs when alleles are co-inherited based on their physical proximity, making it difficult to identify the causal SNPs from the tagged SNPs
40
How does finemapping SuSiE work?
The SUm of SIngle Effects (SuSiE) model is an extension of single effect regression (SER):
* SuSiE models the genetic signal as a **sparse linear regression** problem, where the phenotype is a function of a small number of causal variants plus noise. The sparsity assumption reflects the biological reality that only a few variants in a region are likely to be causal.
* SuSiE breaks the sparse regression problem into a sum of independent single effects. By decomposing the problem in this way, SuSiE can estimate the effects of multiple causal variants in the same region simultaneously, avoiding the issue of "masking" where one signal obscures another.
* SuSiE iteratively fits the model, identifying one single effect at a time. At each step, it estimates a "single effect" (a sparse vector) while accounting for the effects of previously identified signals. This is achieved using Bayesian methods and variational inference to approximate the posterior distribution of the effect sizes.
* For each signal, SuSiE computes a **posterior inclusion probability (PIP)** for every variant, which quantifies how likely that variant is to be causal given the data.
* SuSiE explicitly models LD between variants, ensuring that multiple correlated variants are appropriately handled. This avoids incorrectly attributing causal signals to multiple nearby variants.
* The process stops when the model has accounted for all detectable signals (up to L) or the remaining signal is indistinguishable from noise.
* Disadvantage: SuSiE assumes that genetic effects are additive, which may not capture complex interactions.
41
What was some of the first research into the cis-regulatory code?
In the 1960s, it was discovered that bacterial genes are regulated by nearby DNA sequences. In the 1980s, it was discovered that enhancers could function, at least to a certain extent, independently of their native genomic context, orientation or precise distance from the gene. It was later shown that enhancer activity depends on the arrangement of the specific bases and the distance to other motifs, i.e. motif syntax.
42
What is the idea of MPRAs (MAVEs)?
To test the activities of thousands of candidate genomic regulatory elements simultaneously via next-generation sequencing of barcoded reporter transcripts.
43
What is JASPAR?
database of manually curated, high-quality and non-redundant DNA-binding profiles for transcription factors across differing species (sequence motifs with a median length of 9 bases)
44
How do transcription factors contribute to the cis-regulatory code?
Transcription factors provide a means by which cells can control the activation of regulatory regions like enhancers and their associated genes through processes like extracellular signals or transcriptional regulation in a spatial and temporal fashion, for example during embryonic development.
45
Why has there been no breakthrough in understanding the cis-regulatory code?
Due to the sheer complexity of the regulation, where a multitude of regulatory motifs interact to, for example, control expression of a gene, and to how regulation occurs in a cell type- or cell state-specific manner.
46
What software standardizes the application of DNNs in genomics?
Kipoi, tangermeme, gReLU, and EUGENe.
47
What is an example of genomic DNN in synthetic biology?
Predicting cell viability under augmented gene expression modules.
48
What are genomic Language Models?
Models trained with a self-supervised step where the input is masked. Types of masking: causal language modelling (next-token prediction) and masked language modelling (masking anywhere in the sequence).
49
Why are sequence-to-function and gLMs' outputs not useful?
We already know the answer; what is useful is model interpretation, to understand the cis-regulatory code learned by the model.
50
How does TF-MoDisco work?
**(Transcription Factor Motif Discovery from Importance Scores)** is a computational tool designed to identify and interpret transcription factor (TF) binding motifs from deep learning models trained on genomic sequences:
* Takes feature importance scores from interpretability methods (DeepLIFT/SHAP).
* Extracts meaningful sequence motifs (short, recurring patterns) that drive the model's predictions, groups similar motifs together and provides a human-interpretable summary of the model's learned features.
* Gets "sequence hits": sliding windows of the input sequences are scanned to extract regions with high cumulative importance scores.
* The extracted sequence hits are clustered based on similarity in their importance score profiles and sequence content, grouping hits that likely represent the same underlying motif.
* Within each cluster, the sequences are aligned to identify the consensus motif. This alignment captures the shared pattern underlying the cluster. The importance score profiles are also aligned to refine the motif and ensure it reflects the contribution of each base.
* The consensus sequence and importance score profiles for each cluster are summarized into a final motif. These motifs are ranked by their prevalence or contribution to the model's predictions.
* The discovered motifs can be compared to known motif databases (e.g., JASPAR, CIS-BP) to annotate them with putative transcription factor identities.
51
How does DeepLIFT/SHAP work?
* Gradient-based methods assign an importance to each feature simultaneously based on the gradient with respect to an output, so they don't require 3 × seq_len forward passes like ISM. They do require backward passes: a model with a single output needs one backward pass per input, while a multi-task model needs one backward pass per task of interest. This is more efficient.
* Basic gradient-based/saliency methods are unstable and suffer from not having a reference to compare against.
* DeepLIFT provides a "rescale" correction for the instability of gradients and also calculates the gradient _with respect to a reference sequence_.
* The choice of reference sequence is critical for getting meaningful attributions. Ideally, a reference should be a biologically plausible sequence that is not predicted to have the activity you care about. In `tangermeme`, the default reference function is `dinucleotide_shuffle`, preserving GC content. It should be run with multiple references, i.e. multiple shuffles.
* DeepSHAP, which was developed concurrently, extends this idea to using multiple reference sequences and averaging over them. They are commonly called DeepLIFT/SHAP to recognize the connections between the methods and their concurrent development.
* The importance of an input feature is assessed based on the difference between the actual input and the baseline input. The sum of attributions is equal to the difference in predictions (known as convergence).
* DeepLIFT operates by propagating information backward through the network, similar to backpropagation. For each neuron, DeepLIFT calculates its contribution to the activation of the next neuron, using the baseline to calculate the difference. It redistributes the contributions of the output back to the input features in proportion to their effect on the output, ensuring that the sum of contributions matches the observed difference in the output. Like gradients, DeepLIFT uses a chain rule to propagate contributions through the network; however, instead of raw gradients, it propagates contribution scores. DeepLIFT handles nonlinearities (e.g., ReLU, sigmoid) by computing contributions based on the difference from the baseline. For example, for ReLU, if the baseline input results in a zero activation, DeepLIFT attributes the entire output change to the inputs that activated the ReLU.
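
A minimal Captum sketch of DeepLIFT attributions against a reference (the tiny model and shapes are hypothetical; in practice the baseline would be a dinucleotide shuffle of the input rather than zeros):

```python
import torch
import torch.nn as nn
from captum.attr import DeepLift

# Tiny stand-in model: one-hot DNA (batch, 4, seq_len) -> scalar activity
model = nn.Sequential(
    nn.Conv1d(4, 8, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 100, 1),
)
model.eval()

x = torch.zeros(1, 4, 100)
x[0, torch.randint(0, 4, (100,)), torch.arange(100)] = 1.0  # random one-hot sequence
baseline = torch.zeros_like(x)  # simplistic reference; shuffles are preferred

# Attributions sum to model(x) - model(baseline): the convergence property
attr = DeepLift(model).attribute(x, baselines=baseline)
print(attr.shape)  # (1, 4, 100): per-position, per-nucleotide contributions
```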
52
What does DeepLIFT provide for gradient instability?
DeepLIFT provides a 'rescale' correction for the instability of gradients and also calculates the gradient with respect to a reference sequence.
53
Why is the choice of reference sequence critical in DeepLIFT?
The choice of reference sequence is critical for getting meaningful attributions. Ideally, a reference should be a biologically plausible sequence that is not predicted to have the activity that you care about.
54
What is the default reference function in tangermeme?
In tangermeme, the default reference function is 'dinucleotide_shuffle' to maintain the same GC content.
55
What does DeepSHAP extend in relation to DeepLIFT?
DeepSHAP extends the idea of using multiple reference sequences and averaging over them.
56
How is the importance of an input feature assessed in DeepLIFT?
The importance of an input feature is assessed based on the difference between the actual input and the baseline input. The sum of attributions is equal to the difference in predictions (known as convergence).
57
How does DeepLIFT operate? (backprop)
DeepLIFT operates by propagating information backward through the network, similar to backpropagation, calculating contributions to the activation of the next neuron using the baseline.
58
What is the term for predicting genetic effects with a DNN?
The term is _in silico_ mutagenesis.
59
What is the first genomic DNN and its receptive field?
The first genomic DNN was DeepSEA, with a receptive field of 500 base-pairs.
60
Explain multi-headed attention layers
These layers enhance the model's ability to focus on different parts of input sequences simultaneously.

**Single attention mechanism**: attention operates by calculating how much focus (or "weight") each element in a sequence should receive relative to others. It does so by computing a weighted sum of input values (V), where the weights are determined by comparing queries (Q) with keys (K):
1. **Inputs**:
   - Query (Q): what are we looking for?
   - Key (K): what do we have?
   - Value (V): what information should we attend to if there's a match?
   - Three learned weight matrices project the input into queries (Q), keys (K) and values (V). The weight matrices specific to head h reduce the dimensionality of the input to the dimensionality of each head, often chosen as input dim / number of attention heads. These projections are learned parameters, optimized during training.
2. **Computation**:
   - Similarity between Q and K is measured (e.g., dot product) and normalized (softmax) to produce attention weights.
   - These weights are used to compute a weighted sum of V.

**Multi-headed attention** extends the single attention mechanism by running multiple attention heads in parallel, each focusing on different aspects of the input:
1. Parallel attention heads:
   - Instead of a single attention operation, the input sequence is split into multiple subspaces (via learned projections), one per head.
   - For each head, separate Q, K and V matrices are learned and used to compute attention; each head operates on a smaller, lower-dimensional version of the input (to reduce computational cost).
2. Diversity of focus:
   - Different heads can attend to different parts of the input or capture distinct relationships (e.g., short-range vs. long-range dependencies).
3. Concatenation and projection:
   - Outputs from all heads are concatenated.
   - A final linear projection combines these into a single output, summarizing the diverse perspectives learned by the heads.

Equation: $\text{Attention}_h = \text{softmax}\left(Q_h K_h^\top / \sqrt{d_k}\right) V_h$

Breaking it down:
- Q (queries): represents the "questions" the model is asking about the sequence.
- K (keys): represents the "knowledge" or context available for each position in the sequence.
- V (values): contains the actual content or information to be passed to the next layer.

LOOK AT NOTES FOR REST
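
A minimal numpy sketch of multi-headed scaled dot-product attention as described above (dimensions are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 10, 32, 4
d_k = d_model // n_heads            # per-head dim = input dim / number of heads

X = rng.normal(size=(seq_len, d_model))          # token embeddings
Wq, Wk, Wv = (rng.normal(size=(n_heads, d_model, d_k)) for _ in range(3))
Wo = rng.normal(size=(n_heads * d_k, d_model))   # final output projection

heads = []
for h in range(n_heads):
    Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]    # learned per-head projections
    weights = softmax(Q @ K.T / np.sqrt(d_k))    # Attention_h weights
    heads.append(weights @ V)                    # weighted sum of values

out = np.concatenate(heads, axis=-1) @ Wo        # concatenate + project
print(out.shape)  # (10, 32)
```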
61
What is the function of the Query (Q) in the attention mechanism?
Query (Q) represents what the model is looking for.
62
What does the Key (K) represent in the attention mechanism?
Key (K) represents what the model has.
63
What is the Value (V) in the attention mechanism?
Value (V) contains the actual content or information to be passed to the next layer.
64
What is the process of multi-headed attention?
Multi-headed attention runs multiple attention heads in parallel, each focusing on different aspects of the input.
65
What do the outputs from all heads in multi-headed attention do?
The outputs from all heads are concatenated and a final linear projection combines these into a single output.
66
What is the process of Transformers?
- Tokenize the input sequence and add positional encoding.
- Pass through stacked encoder layers, with self-attention and feedforward steps.
- If sequence generation is required, pass outputs to the decoder, which applies masked self-attention, cross-attention and feedforward steps.
- Generate outputs or predictions.
67
What are histone marks?
Post-translational modifications on the N-terminal tails of histone proteins which are a key epigenetic mechanism by which eukaryotic cells regulate transcriptional activity, via altering chromatin structure and interacting with other transcriptional regulators
68
What is H3K9ac associated with?
H3K9ac is associated with active promoter regions.
69
What is H3K4me1 associated with?
H3K4me1 is associated with active/poised distal enhancers. (The poised aspect is important; it may be why this mark did not perform well in Chapter 6.)
70
What is H3K4me3 associated with?
H3K4me3 is associated with active promoters.
71
What is H3K36me3 associated with?
A repressive gene-body mark; a binding partner for histone deacetylases (HDACs), which prevent run-away RNA polymerase II (Pol II) transcription.
72
What are the aims of the thesis?
* Standardisation of differential expression for cell type-specific transcriptional changes
* Standardisation of analysis for cell type-specific transcriptional changes specifically in the study of AD
* Standardisation of processing and quality control of genetic information
* Predicting the cell type-specific effects of genetic variants while accounting for distal regulation with genomic deep learning models
* Prioritising functional and disease-relevant genomic loci _in silico_ with deep learning by linking epigenetics to transcription
73
What are residual dilated convolutions?
A dilated convolution introduces spaces between the kernel elements, enabling the network to capture a larger receptive field without increasing the number of parameters or reducing resolution. The dilation rate determines how much the kernel is "spread out". Residual layers alleviate the vanishing gradient problem and enable the training of very deep networks: a residual block adds the input of a layer to its output. A projection (e.g., 1×1 convolution) on the residual path ensures channel dimensions match if they differ.
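
A minimal PyTorch sketch of a residual dilated convolution block of this kind (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class ResidualDilatedBlock(nn.Module):
    def __init__(self, in_ch, out_ch, dilation):
        super().__init__()
        # padding = dilation gives "same" length for kernel_size=3
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=3,
                              dilation=dilation, padding=dilation)
        self.relu = nn.ReLU()
        # 1x1 projection so the residual's channels match the output's
        self.proj = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        return self.relu(self.conv(x)) + self.proj(x)  # residual addition

x = torch.randn(1, 4, 1000)  # one-hot DNA: (batch, channels, seq_len)
block = ResidualDilatedBlock(4, 16, dilation=4)
print(block(x).shape)  # torch.Size([1, 16, 1000])
```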
74
What do the findings of the PhD thesis address?
The findings of this PhD thesis address critical gaps in the standardisation of genetic data processing, analysis of cell type-specific transcriptional changes, and the prioritization of functional genomic loci. AD field has identified many genetic variants but whether they are causal or in LD, their function, instigated regulatory roles (as 98% are non-coding) and cell type-specific effect remains elusive. Moreover, standardisation of analysis is needed. My PhD thesis focused on developing computational and machine learning techniques to robustly detect the genome’s cell type-specific, protein coding and non-coding effects in AD. This work spanned both AD-specific and broader advancements, addressing shortcomings in the standardisation of processing and quality control of genetic information, the standardisation of analysis of single-cell transcriptional changes in disease and the development of genomic and histone mark deep learning models to attempt to elucidate the functional role of disease relevant genomic loci and genetic variants in a cell type-specific manner. The developed open-source, computational and deep learning techniques are all broadly applicable to our comprehension of the cis-regulatory code.
75
What was the motivation for Chapter 2?
Many benchmarks have shown the inadequacy of pseudoreplication approaches and showed pseudobulk as the best performing, but one paper showed mixed models as better. We wanted to reinvestigate.
76
What was the main additional analysis in Chapter 2?
Previously, only type 1 error rates were considered (the proportion of non-differentially expressed genes indicated as differentially expressed by a model; FP/(FP+TN)). We considered both type 1 and type 2 (FN/(FN+TP)) error rates with MCC and ROC curves, and set a seed for reanalysis.
77
Why is pseudobulk better than mixed models?
From Squair et al. 2021, there is a "systematic tendency for single-cell methods to identify highly expressed genes as DE"; the aggregation approach of pseudobulk protects against this. Pseudobulk also accounts for the intrinsic variability of biological replicates; mixed models should do this too but, based on our and others' benchmarks, do a worse job (especially at low sample/cell numbers).
78
Why does pseudobulk sum perform poorly on imbalanced data?
hierarchicell does not normalise the simulated datasets before passing them to the pseudobulk approaches. This is a standard step in such analyses to account for differences in sequencing depth and library sizes. This approach was taken by Zimmerman _et al._ as their data is simulated one independent gene at a time, without considering differences in library size. The effect of this step is more apparent with imbalanced numbers of cells, where pseudobulk sum's performance degrades dramatically. Pseudobulk mean appears invariant to this missing normalisation step because of the averaging's own normalisation effect. This was a flaw in the simulation software strategy.
79
What was the motivation for Chapter 3?
Best practices for single-cell processing and DE are now defined, but many different approaches have been used in the AD field. This chapter is a case study showing how disparate the results can be, using the first AD scRNA-Seq study. It also aims to stop the AD field concentrating on spuriously identified genes in downstream work.
80
What was the main additional analysis in Chapter 3?
Reprocessing of the data using current best-practice approaches (resulting in ~20k fewer cells), then running pseudobulk DE on both the original and reprocessed data to show the disparity with the authors' results. We showed the authors' DEGs were largely a product of cell counts, randomly permuting labels 100 times for pseudoreplication and pseudobulk to highlight this effect.
81
How many cells were in the Mathys et al. data?
About 80k cells from 48 individuals (24 with AD pathology).
82
How did authors filter mitochondrial reads in Chapter 3?
The authors filtered out high-mitochondrial-read nuclei based on clusters from their t-SNE projection of the data, but this was ineffective as some nuclei were kept with >75% mitochondrial reads; in a single-nucleus study there should be few or none.
83
How does edgeR work?
uses an overdispersed Poisson model/negative binomial model to model count data and account for both biological and technical variability. EdgeR estimates the genewise dispersions by conditional maximum likelihood, conditioning on the total count for that gene. An empirical Bayes procedure is used to shrink the dispersions towards a consensus value, effectively borrowing information between genes. Finally differential expression is assessed for each gene using an exact test analogous to Fisher's exact test, but adapted for overdispersed data.
84
What was the motivation for Chapter 4?
GWAS summary statistics have popularised and accelerated genetic research. However, a lack of standardisation of the file formats used and tools for formatting and quality control have proven problematic when running secondary analysis tools or performing meta-analysis studies
85
What was the motivation for Chapter 5?
Understanding genetic variants' effects in a cell type-specific manner is crucial for interpreting GWAS results. However, profiling these effects across the non-coding genome remains challenging due to the scalability limits of experimental methods assaying a cell's epigenome. Many cell types have been assayed in projects like ENCODE, but not the brain cell types of interest for AD; can we impute them with DNNs? We are interested in TFs as they could eventually be targets, but histone mark data is what is available. Recent DNNs have increased receptive fields but cannot predict in new cell types.
86
What was the training data for EC?
104 GRCh37 DNA sequence and p-value continuous tracks after peak calling in EpiMap. The p-value track was used over fold change due to its greater signal-to-noise ratio, and ATAC rather than DNase due to its prevalence in recent work. Arcsinh-transformed signal was used to negate the effects of differing sequencing depths.
87
What model was benchmarked against with EC?
Epitome, which predicts bed regions of peaks (1/0 predictions) at 200 base-pair resolution. Predictions had to be compared over 3,200 base-pairs, the lowest common multiple of the two models' resolutions.
88
What was the motivation for uncovering the cis-regulatory code of AD? Nathan's goal for how to do this? What should be done with future research?
* A major goal of this work was to understand the effect of genetic variants on the cis-regulatory code in a cell type-specific manner.
* For example, epigenetic regulation of transcription factor binding in microglia within the hippocampus, a cell type implicated in the genetic burden of AD.
* By applying the same SLDP method developed in Chapter 5 with AD GWAS data, it would be possible to systematically assess associations for all known transcription factors across multiple cell types.
* The identified transcription factors could then be inhibited in a cell type-specific manner as a form of targeted therapeutics, as has been explored in cancer research.
* However, such epigenetic experimental assays in these targeted brain regions and cell types have only been sparsely explored to date. To address this gap, in **Chapter 5**, we developed Enformer Celltyping to predict epigenetic signals in previously unseen cell types, circumventing the lack of experimental assays.
* We chose histone marks as a first attempt due to the availability of hQTL sets to validate this approach. The original goal was then to apply the same approach to predict transcription factor binding assays; however, we found that current genomic deep learning models cannot capture the effect of genetic variants.
* Future research aimed at elucidating the cis-regulatory code of AD should prioritise the collation of experimental data for all relevant cell types and transcription factors, as extrapolating predictions to previously unseen cell types leads to worse performance, and models that predict assays in new cell types cannot account for genetic variants.
89
What was the EC pre-training step?
Separate training of the DNA sequence and chromatin accessibility submodules: the DNA side predicts the average signal and its distribution (10 bins, 0.5 intervals), and the celltyping side predicts the difference between the average histone mark signal and the cell type-specific signal. Pre-training sensibly initialised the weights of the chromatin accessibility layers before combining them with the DNA layers (which contained the Enformer architecture) for the full training step.
90
What does global chromatin accessibility correspond to in EC?
Global chromatin accessibility corresponds to the chromatin accessibility for 3,000 base-pairs around the transcriptional start site of 1,216 marker genes (from PanglaoDB), averaged at 250 base-pair resolution.
91
What are embedding layers used for in EC?
Used for local and global CA: randomly initialised vectors of a set size (done at multiple resolutions for EC), updated through backpropagation. These can be visualised in 2D with UMAP.
92
What histone marks were predicted by EC?
H3K27ac, H3K4me1, H3K4me3, H3K9me3, H3K27me3, and H3K36me3.
93
What loss functions were used in EC?
Pre-training: a Poisson negative log-likelihood loss was used for the average signal prediction and cross-entropy loss for the distribution in the DNA submodule, whereas mean squared error (MSE) was used for the celltyping submodule, following other epigenetic embedding approaches, given the possibility of negative values. Full training: the Poisson negative log-likelihood loss.
94
How was EC's receptive field tested?
Random permutations of DNA (in silico mutations) at incrementally distant positions, measuring the effect on the prediction.
95
What is the Average Precision calculation?
(Used to compare EC and Epitome.) The weighted mean of precisions at each threshold, where precision = true positives / total number of predicted positives.
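
A minimal sketch of the calculation, where the weights are the increases in recall between thresholds (as in scikit-learn's `average_precision_score`); the toy labels and scores are illustrative:

```python
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([0, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.9, 0.6, 0.75, 0.7, 0.2])

# Manual AP: sum over ranks of (recall step) * precision at that rank
order = np.argsort(-y_score)
tp = np.cumsum(y_true[order])
precision = tp / np.arange(1, len(y_true) + 1)
recall = tp / y_true.sum()
ap = np.sum(np.diff(np.concatenate([[0.0], recall])) * precision)

print(ap, average_precision_score(y_true, y_score))  # the two values match
```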
96
What is the Off-centre correlation analysis EC?
Checking that predicted SNP effects are the same when the input is centred on the SNP as when it is centred on the region of regulatory effect.
97
Explain how SLDP works.
Measures the statistical concordance between the signed variant effects (our model's predictions) and the genome-wide association study's marginal correlations. SLDP uses generalized least-squares regression to measure the agreement between these, iteratively inverting the direction of the signed variant effect measures along with their neighbouring entries in LD blocks to derive a null distribution. The measured agreement defines how important the variants are to the phenotype's heritability.
98
What is the global chromatin accessibility signal cell type-specific motif enrichment approach?
Used histone marks that act outside of the TSS or gene body, as these may harbour cell type-specific transcription factors, and identified all peaks with a -log10 p-value cut-off. For each peak, the influence of the global signal was approximated by calculating the partial derivatives of the model output with respect to the input, i.e. the gradient on the input. The results for each peak were ordered by absolute value and the top 10% of peaks reliant on the global signal were identified. The DNA at these positions was run through Homer to get TFs, and the TFs' genes were then tested for cell type specificity with EWCE.
99
Explain how s-LDSC works.
A method of estimating heritability. Uses the assumption that genetic effect values for true associations are positively correlated with LD scores, whereas genetic effect values for false positives (e.g. due to population stratification/drift) are not correlated with LD scores. s-LDSC estimates heritability across functional annotations. LD score regression quantifies the contributions of polygenicity (many small genetic effects) and bias so they can be separated. Regression equation: test statistic = average causal effect per SNP × LD score + inflation due to population stratification + 1.
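
For reference, the standard LDSC regression equation this paraphrases (from Bulik-Sullivan et al.; N = sample size, M = number of SNPs, ℓ_j = LD score of SNP j, a = confounding/stratification term, h² = SNP heritability):

$$\mathbb{E}\left[\chi^2_j\right] = \frac{N h^2}{M}\,\ell_j + N a + 1$$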
100
What is the cell type-specific genetic enrichment for complex traits?
H3K27ac peaks (predicted with an arcsinh(1) cut-off, actual, and ATAC). Looked for enrichment in each cell type's peak file with s-LDSC. Key point: predicted H3K27ac gave better enrichment than ATAC (using real H3K27ac as the baseline), improving on ATAC in 14 of the 16 expected enrichments.
101
What is the equation on a node in a neural network?
w · X + b, where w is the weight, X the input and b the bias.
102
What are two limitations of hQTL SLDP EC analysis?
Firstly, the hQTL set contained interactions in which a histone mark binding position was removed by a genetic variant, but these histone mark binding positions were not captured in Enformer Celltyping's prediction on the major allele. Secondly, genomic deep learning models, including Enformer Celltyping, inherently struggle to accurately predict the effect of genetic variants under their current training paradigm.
103
What did Sasse et al. show?
Sasse _et al._, comprehensively showed how Enformer underperforms at predicting the effect of genetic variants on transcription, even predicting the incorrect direction of effect in up to 40% of tested cases
104
What is the motivation behind Chapter 6 - Predicting gene expression from histone...?
Many have previously linked histone marks to expression but did not consider all three of: differing histone mark functions, distal regulatory effects and level of cell differentiation. Also, little has been done to use in silico perturbations to learn about regulation from these models (as has been done with DNA models).
105
What data and models are discussed in Chapter 6?
Eleven cell types/tissues; 7 histone marks (average log2-transformed, 100 base-pair binned read depth of the histone mark signal); two neural networks: a simple local model (6k bp, convolutions) and a distal model (Chromoformer, attention-based, 40k bp, 3 input resolutions of 100, 500 and 2k bp, each with its own transformer block). Both use 100 bp bins, 4-fold cross-validation, the ADAM optimiser, early stopping, learning rate decay and a separate model per fold and cell type; both predict log2-transformed RPKM.
106
How is an active gene defined in Chapter 6?
A gene was defined as active or inactive based on whether its expression level was above or below the median for that cell type
107
What finding is highlighted in Chapter 6?
We find that there is no universal histone mark which is consistently the most predictive of expression. We recommend that researchers consider all three influencing factors when determining the effect of histone mark levels on the transcriptional state of a cell. Although H3K4me3 was always good, H3K27ac, H3K9ac and H3K36me3 were also interesting. Active marks performed better in ESCs, whereas repressive marks like H3K9me3 performed relatively better in adult primary tissues: active marks performed better at early stages of lineage commitment, i.e. ESCs, whereas repressive marks were more predictive in fully differentiated cells, i.e. differentiated tissue (there is more accessible DNA in ESCs and at early stages of lineage commitment). A key point of our findings is the marginal return in performance from: 1. extending from an intentionally simple local promoter model to an attention-based, computationally complex model which accounts for distal histone mark levels, and 2. increasing the number of histone marks included in the model.
108
Which histone marks were used in in silico perturbations?
We selected one active and one repressive mark which are found at both promoter and distal regulatory elements: H3K27ac and H3K27me3.
109
What was the approach for in silico perturbations?
The predictions from the different k-fold versions of the model were averaged, similar to the approach commonly used in sequence-to-expression models. For the promoter histone signal, the full 6,000 base pairs around the TSS were perturbed, whereas for the distal histone signal, bins of 2,000 base pairs across the 40,000 base-pair receptive field were perturbed iteratively. The implemented perturbation levels were between 0 and 1 inclusive, in 0.1 steps. As well as averaging the predictions from the different k-fold versions, we also tested the correlation between the different folds to ensure the model was learning a consistent regulatory code. Moreover, we benchmarked this concordance against Borzoi (over an equal 40k receptive field); our model did slightly better than Borzoi.
110
What was tested for enrichment in QTL during in silico perturbations?
Only upstream and downstream regions were used, not the TSS; the top decile of predicted expression change was taken from both and checked for overlap with fine-mapped eQTLs. To test for enrichment of the fine-mapped, tissue-specific SNPs, a bootstrap sampling experiment was implemented where the proportion of SNPs found in each decile was compared against 10,000 randomly sampled regions from all deciles. P-values were derived and adjusted using FDR correction for multiple testing. Enrichment was also tested against Hi-C, maximum histone mark activity and proximal loci (6k bp up- or downstream).
111
What is the in silico perturbation disease enrichment?
s-LDSC applied to the top decile of predicted expression change upstream and downstream of the TSS.
112
What is a possible advantage of Chapter 6 over DNA deep learning?
By identifying genomic loci of interest based on perturbing histone mark levels, our model captures significant enrichment of eQTLs in the most predictive regions, offering an alternative to genomic deep learning models trained on DNA; there is no need to worry about the effects of LD.
113
What are Siamese networks?
An architecture inherently designed to uncover differences in inputs. Siamese networks have been heavily used in tasks like identifying differences and similarities between images. Traditionally, a Siamese network's architecture has identical layers for each input with shared weights across them, and uses a loss function like contrastive loss based on differences in the inputs.
114
How can DL on genetic variants be improved?
1. QTLs (small variance)
2. Evolutionary diversity (though naive genomic DNA only represents a limited subset of all possible genomic DNA arrangements)
3. Synthetic sequences (MAVEs - MPRAs)
4. Withholding entire genomes (De Boer)
All are a trade-off between the amount of genetic diversity they include and their similarity to the functional assays used by genomic deep learning models trained in the current paradigm.
115
What is an alternative view on why current genomic deep learning models fail?
That it is due to loss functions focusing on variation across genes or, more generally, genomic loci. The model is not penalised for the variation of each gene across cell contexts, so a naïve model could predict the average expression of a gene across cell types to minimise loss. An argument could be made that penalising a model for not identifying cell type-specific changes in expression (or another functional assay) would force it to pay attention to cell type-specific regulatory motifs like enhancer motifs, which in turn would benefit the model's variant effect predictions. In the same manner as considering multiple personalised genomes at a location, the argument here is to consider multiple cellular states. Although the DNA input does not change, the output will, forcing the model to understand cell type-specific regulation. This view, although promising, would be non-trivial to incorporate and may require multiple loss functions to ensure the model maintains strong performance across genes or genomic loci or, if modelling expression, using pretrained models which capture epigenetic functional assays as a starting point, as their learnt regulatory syntax is paramount (e.g. chromatin accessibility as a proxy for transcription factor binding sites, or regulatory histone marks like H3K27ac). This had yet to be fully evaluated in the field; it has since been done by Decima.
116
Why is creating a benchmark for deep learning in genomics hard?
The functional assays used to measure genomic deep learning models' ability do not provide a straightforward measure of comprehending the cis-regulatory code: each assay is subject to experimental variability within the technique, such as batch effects, but also variability across differing techniques resulting from technique-specific biases. For example, ChIP-Seq results differ dramatically from CUT&Tag under the same experimental conditions.
117
What are the issues with GV-Rep?
1. The goal of GV-Rep is to validate genomic language models for the classification of genetic variants in clinical settings (Li _et al._, 2024). For example, the records collated from ClinVar classify the pathogenicity of genetic variants (benign, likely benign, likely pathogenic or pathogenic) for differing diseases; these are not functional assay outputs.
2. Other records are from GWAS summary statistics, associating SNPs to diseases based on their p-values. However, I believe these tasks are too broad for genomic deep learning models, as they lack any functional measurements or cell type/cell state contexts, and should not be included in such a benchmark dataset.
3. xQTL datasets, which do not consider LD. Even with fine-mapping, usually only the highest-confidence positives and negatives are taken, which could bias models towards the easiest cases.
4. _In vitro_ experiments like MAVEs and CRISPRi assays. There are barriers to their immediate use, however: when deposited, most experiments lack the information needed to map their imputed genetic code to the reference genome, so only the immediate ~150 base-pairs are available, far smaller than most genomic deep learning models' receptive fields. Their outputs also differ a lot from what these models are trained to predict, and they do not capture in vivo measurements from the whole genome, like cell type-specific effects.
118
What is an alternative way to measure SNP effects?
We could consider individual genomes rather than genetic variants in isolation to avoid the effect of LD; however, this does not test a model's ability to differentiate between causal and tagged SNPs, and whole genome sequencing yields a small amount of variation for the monetary investment. Coordinated efforts to decide which cell types and functional measurements to use across the differing data sources will also be paramount for the field: if a model is trained to predict in cell A, a benchmark on cell B is not going to be very useful.
119
Define the cis-regulatory code
the regulatory information of the genome
120
What is AD defined by
Neuronal loss and gliosis (a reactive change of glial cells in response to damage to the central nervous system; in most cases, gliosis involves the proliferation of glial cells).
121
Which regions are affected at the early stages of AD?
The hippocampus and entorhinal cortex (regions with a major role in memory and learning).
122
There is selective vulnerability of specific neuronal subtypes in AD; name one such neuronal type.
cholinergic neurons in the nucleus basalis of Meynert which forms part of the basal forebrain are selectively lost
123
What comes before neuronal loss in AD
synaptic loss, shown in animal models and cell cultures
124
AD treatments approved by FDA
lecanemab and donanemab; moderate slowing of cognitive decline, by clearing build-ups of extracellular plaque deposits of the β-amyloid peptide (Aβ) in the brain – a hallmark of AD
125
How does Homer work?
Scans input sequences and background sequences for k-mers (6-12 bp). Each k-mer is tested for significant enrichment in the input sequences using a hypergeometric test. Homer then clusters similar significant k-mers and retrieves TFs for the motifs from databases like JASPAR.
126
Difference CAGE-Seq (Enformer) & RNA-Seq (Borzoi)
CAGE measures transcription at individual promoters (there can be multiple per gene), so moving to predicting RNA-Seq means the model must learn transcriptional regulation, splicing regulation, etc., which combined lead to a particular gene's expression.
127
EC When flattening the cell type embeddings, how was it performed? Was the position information affected in this process?
The embedding layers convert the chromatin accessibility information from a 1D to a 2D shape, where the size of the second dimension is chosen by the user. The embedded vector representation is then updated through backpropagation to better reflect the inputted signal. Flattening after the embedding layers does not affect the positional information of the chromatin accessibility signal. The example below explains and motivates flattening after embedding and why positional information is not affected.

Say we have two sentences as our entire dictionary:

Sentence 1: I play football
Sentence 2: I play basketball

and so we assign integers to the words as follows:

Sequence 1: [1, 2, 3]
Sequence 2: [1, 2, 4]

Then we pass the input (the integer sequences) to an embedding layer, which assigns random numbers in 2 dimensions and uses backpropagation to update these values:

| Word | Index | Vector |
|------|-------|--------|
| I | 1 | [0.4, 0.2] |
| play | 2 | [0.7, 0.3] |
| football | 3 | [0.2, 0.1] |
| basketball | 4 | [0.2, 0.8] |

So the embedding of the sentences looks like:

Sequence 1: [[0.4, 0.2], [0.7, 0.3], [0.2, 0.1]]
Sequence 2: [[0.4, 0.2], [0.7, 0.3], [0.2, 0.8]]

Next the model flattens the embedded sequences to make them 1-dimensional again:

Sequence 1: [0.4, 0.2, 0.7, 0.3, 0.2, 0.1]
Sequence 2: [0.4, 0.2, 0.7, 0.3, 0.2, 0.8]

The words' order remains the same after embedding and flattening, and the order of the chromatin accessibility data is similarly preserved. This same flattening approach has been used in other genomic deep learning models like Avocado (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-01977-6, model architecture: https://github.com/jmschrei/avocado/blob/master/avocado/model.py) and is also used commonly in NLP without affecting positional information, such as in Google Research's BUSTLE.
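
A minimal Keras sketch of this embed-then-flatten pattern (vocabulary and dimensions follow the toy example above; real trained weights would differ):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Toy "dictionary": 1=I, 2=play, 3=football, 4=basketball (0 reserved)
sequences = np.array([[1, 2, 3],   # "I play football"
                      [1, 2, 4]])  # "I play basketball"

model = keras.Sequential([
    layers.Embedding(input_dim=5, output_dim=2),  # each token -> 2D vector
    layers.Flatten(),                             # (3, 2) -> (6,), order preserved
])

out = model(sequences)
print(out.shape)  # (2, 6): one flattened embedding per sequence
```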
128
Aerts work on Synthetic Enhancers (2023 - Dec)
Stein Aerts' and Stark's groups:
* Proved that enhancers contain code specific to one cell type but can affect multiple cell types by having intertwined codes in the same 500 bp region. Shown by making a cell type-specific enhancer active in another cell type, and by making an enhancer that is active in multiple cell types active in just one.
* Aerts' work was done in the fruit fly brain and then in humans too, but the human results were validated with Enformer rather than experimental work.
129
Diff Mamba, Hyena, Attention
| Feature | Attention | Hyena | Mamba |
|---------|-----------|-------|-------|
| Complexity | O(n²) | O(n log n) | O(n) or O(n log n) |
| Long-range dependencies | Excellent | Good | Good |
| Efficiency for long sequences | Poor | Excellent | Excellent |
| Mechanism | Token-token interactions | Convolution + gating | Structured linear operations |
| Best use cases | General-purpose NLP tasks | Long-sequence modelling | Long-sequence modelling |
130
131
positional encoding with attention
Attention mechanisms (e.g., multi-headed attention) do not inherently capture order (they treat input tokens as a "bag of words"), so positional encoding introduces this sequence information. Positional encodings can be learnt (via backprop) or fixed. After calculating positional encodings, they are added element-wise to the token embeddings: input representation = token embeddings + positional encodings. So this happens before attention: raw tokens are first mapped to learned token embeddings using an embedding layer, then the positional encodings are added, then the result is passed to the transformer (attention). Rotary Positional Embedding (RoPE) is now commonly used for relative positional encodings, but not in genomics.
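
A minimal numpy sketch of the fixed sinusoidal encoding being added to token embeddings before attention (shapes are illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[p, 2i] = sin(p / 10000^(2i/d)), PE[p, 2i+1] = cos(p / 10000^(2i/d))
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

token_embeddings = np.random.default_rng(0).normal(size=(10, 32))
x = token_embeddings + sinusoidal_positional_encoding(10, 32)  # element-wise add
# x is what gets passed into the attention layers
```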
132
Rotary Positional Embedding (ROPE) positional encoding
Encodes positional information directly into the attention mechanism by applying rotations to the key (K) and query (Q) vectors. Instead of adding positional encodings to the input embeddings (as in sinusoidal or learned encodings), RoPE applies the positional information as a multiplicative transformation of the input vectors, allowing the attention mechanism to naturally consider relative positions between tokens. RoPE uses a sinusoidal pattern (like the original fixed positional encoding) but embeds it into the attention mechanism by rotating the embeddings. For a given position p, the transformation involves a rotation in the complex plane. SEE NOTES FOR FORMULA
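
For reference (the standard published RoPE rotation, not taken from the card's notes): each pair of dimensions (2i, 2i+1) of a query/key vector at position p is rotated by the angle p·θ_i:

$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos(p\,\theta_i) & -\sin(p\,\theta_i) \\ \sin(p\,\theta_i) & \cos(p\,\theta_i) \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}, \qquad \theta_i = 10000^{-2i/d}$$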
133
Enformer's relative positional encodings
Exponential (lower value further away), central mask (0 for positions beyond a certain point) and gamma (another distribution with lower values further away, but parameterised differently to the exponential).
134
Diff atten, mamba, hyena:
Attention = token-token interactions, O(n²); Hyena = convolution + gating, O(n log n); Mamba = structured linear operations, O(n) or O(n log n).
135
Aerts' work on synthetic enhancers
Enhancers contain code specific to one cell type but can affect multiple cell types by having intertwined codes in the same 500 bp region. Shown by making a cell type-specific enhancer active in another cell type, and by making an enhancer that is active in multiple cell types active in just one.