Computational paleogenomics and data types Flashcards
(37 cards)
Why are mitogenomes so great in paleogenomics?
The advantages of using mitogenomes are:
- Mitogenomes are way more abundant than nuclear DNA
- The genome is much smaller
- Haploid (no heterozygous sites)
- Inherited maternally
- No recombination
- Conserved across eukaryotes… but also has higher mutation rates than nuclear DNA
Can you be 100% sure that the mitogenome you have assembled is correct? Motivate your answer.
No, you can never be 100% sure: there are many approaches and software packages out there (some of them questionable and/or difficult to understand), and every ancient DNA sample comes with its own complexity (I’m looking at you, museum specimens!).
The best you can do is:
- To find the best tool according to your needs and try more than one
- To tailor your analysis to take care of DNA damage (if not removed before sequencing)
What is mapping iterative assembly? When is it useful?
Mapping iterative assembly (MIA) is sort of like an intermediate point between mapping to a reference genome and a de novo assembly. It is a method used in bioinformatics that works by:
- Initial mapping: reads are mapped to a reference genome (complete or partial), which anchors them in order and maps out conserved regions. This requires a reasonably good reference genome.
- Consensus calling: the resulting read alignment is collapsed into a consensus sequence. The more stringent you are with the consensus threshold (% agreement), the more data you lose. At this step you can, for example, spot positions carrying both C and T; this could be due to deamination (C→T), and if damage was supposedly removed before sequencing it tells you the removal may not have been complete.
- Iterative refinement: the consensus sequence is used as the new reference genome for the next iteration. This cycle of mapping, consensus calling and reference updating is repeated until the consensus sequence no longer changes significantly; the final consensus is the assembled genome or target region of interest (sketched below).
This is useful if you don’t have a reference genome for your species, or if you suspect your samples are very divergent from the closest available reference.
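A minimal sketch of that iterative loop, assuming bwa, samtools and bcftools are installed; all file names are placeholders, and details such as read groups, quality filtering and damage handling are omitted:

```python
# Sketch of an iterative mapping assembly loop (placeholder file names; not a
# production pipeline - filtering, read groups and damage handling are omitted).
import subprocess

def run(cmd):
    subprocess.run(cmd, shell=True, check=True)

def read_seq(path):
    """Return the concatenated sequence of a FASTA file, ignoring headers."""
    with open(path) as fh:
        return "".join(line.strip() for line in fh if not line.startswith(">"))

reference = "starting_reference.fa"   # complete or partial reference to seed the assembly
reads = "trimmed_reads.fq"            # adapter-trimmed (and ideally damage-treated) reads

for iteration in range(1, 11):                         # cap the number of rounds
    run(f"bwa index {reference}")
    run(f"bwa mem {reference} {reads} | samtools sort -o aln.bam -")
    run("samtools index aln.bam")
    # Call variants against the current reference and build an updated consensus
    # (--ploidy 1 because the mitogenome is haploid).
    run(f"bcftools mpileup -f {reference} aln.bam | "
        f"bcftools call -mv --ploidy 1 -Oz -o calls.vcf.gz")
    run("bcftools index -f calls.vcf.gz")
    consensus = f"consensus_iter{iteration}.fa"
    run(f"bcftools consensus -f {reference} calls.vcf.gz > {consensus}")
    # Stop once the consensus no longer changes between iterations.
    if read_seq(consensus) == read_seq(reference):
        break
    reference = consensus
```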
There has been a big improvement in reference genomes and the number of mapped individuals in phylogenetics over the last 40 years; what are the benefits of this?
With more individuals and good reference genomes you get much higher temporal resolution and can draw narrower, more precise conclusions.
What is the Molecular clock hypothesis?
The molecular clock hypothesis posits that the rate of evolutionary change of any specified protein, DNA or RNA sequence is approximately constant over time and across lineages. This allows you to estimate the time of divergence between samples. For DNA, the substitution rate is roughly 1 substitution per 20 nucleotides per million years, so for a 20 nt stretch of DNA the number of substitutions relative to the modern sample roughly equals the number of million years since divergence (worked example below).
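A toy calculation of that arithmetic; the rate and the number of observed substitutions are just the illustrative values quoted above, not universal constants:

```python
# Divergence time = substitutions / (rate per site per Myr * sequence length).
rate_per_site_per_myr = 1 / 20       # ~1 substitution per 20 nt per million years
seq_length = 20                       # nucleotides compared
observed_substitutions = 3            # example: 3 differences vs. the modern sample

divergence_myr = observed_substitutions / (rate_per_site_per_myr * seq_length)
print(divergence_myr)                 # -> 3.0 million years since divergence (roughly)
```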
There are many phylogenetic inference methods, describe two.
The most commonly used ones are:
- Maximum Likelihood (ML): ML calculates the probability of observing the given data (e.g. DNA sequences) given a specific tree topology and set of branch lengths, then seeks the topology and branch lengths that maximize this probability (see the sketch after this list). ML has sound statistical foundations and often performs well in simulations, but it can be computationally expensive, especially for large datasets.
- Bayesian Inference: Bayesian methods incorporate prior knowledge about the evolutionary process into the analysis, allowing for a more robust assessment of the phylogenetic tree. They use Markov Chain Monte Carlo (MCMC) to explore the space of possible phylogenetic trees and estimate the posterior probability distribution of the tree. Bayesian methods offer more flexibility in modeling evolutionary processes and are particularly well suited for analyzing large datasets.
Others:
- Maximum Parsimony (MP)
- Distance-matrix methods: e.g. Neighbor-Joining (NJ) or UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
- Coalescence-based methods
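To make the “likelihood of the data given branch lengths” idea concrete, here is a minimal sketch for the simplest possible case: two sequences separated by a single branch under the Jukes-Cantor model. The alignment summary numbers are made up; real ML tree inference (e.g. RAxML, IQ-TREE) generalizes this to many taxa and full tree topologies.

```python
# Maximum likelihood for a single Jukes-Cantor branch length d between two sequences.
import math

def jc69_loglik(d, n_sites, n_diff):
    """Log-likelihood of branch length d given n_diff mismatches out of n_sites."""
    p_diff = 0.75 * (1 - math.exp(-4 * d / 3))   # P(site differs | d)
    return n_diff * math.log(p_diff) + (n_sites - n_diff) * math.log(1 - p_diff)

n_sites, n_diff = 1000, 120                      # toy alignment summary: 12% differences

# Grid search for the branch length maximizing the likelihood...
d_ml = max((d / 10000 for d in range(1, 5000)),
           key=lambda d: jc69_loglik(d, n_sites, n_diff))
# ...which agrees with the closed-form JC69 estimate.
d_closed = -0.75 * math.log(1 - 4 * (n_diff / n_sites) / 3)
print(round(d_ml, 4), round(d_closed, 4))        # ~0.1308 substitutions/site for both
```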
List the steps in a typical phylogenetic workflow.
- Raw data processing
- Mitogenome assembly
- Alignment
- Bayesian analysis
- Phylogeny construction
What is the goal of Bayesian molecular clock dating (AKA: tip dating)? What are the challenges?
To confidently infer the age of samples that are beyond the radiocarbon limit (> 50 ka). One challenge is to obtain estimates that match stratigraphic context information (when available). It is also hard and complex because many factors are taken into account: several models can be used for the rate and accumulation of mutations, and priors need to be specified, e.g. the divergence time of the species from an outgroup, with radiocarbon-dated (or modern) samples used as calibration references. Several probability distributions can be used to model their uncertainty.
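A hedged sketch of what such priors might look like; the sample names, dates and distribution parameters are invented for illustration, and real tip-dating analyses specify these in dedicated software such as BEAST or MrBayes:

```python
# Illustrative tip-dating priors using scipy.stats (all numbers are made up).
from scipy import stats

priors = {
    # Radiocarbon-dated tip: age prior centred on the calibrated date (years BP)
    "sample_A_age": stats.norm(loc=32_000, scale=500),
    # Undated tip beyond the radiocarbon limit: broad uniform prior (50-200 ka BP)
    "sample_B_age": stats.uniform(loc=50_000, scale=150_000),
    # Root calibration, e.g. divergence from an outgroup (~1 Myr, lognormal)
    "root_age": stats.lognorm(s=0.3, scale=1_000_000),
    # Clock rate prior (substitutions/site/year)
    "clock_rate": stats.lognorm(s=0.5, scale=2e-8),
}
print(priors["sample_B_age"].mean())   # 125000.0 years BP
```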
Give two examples of applications for molecular tip dating.
- With tip dating we are able to infer better phylogenies → more accurate reconstructions of species’ evolutionary histories.
- Tip dating also allows for establishing which samples are truly old (oftentimes the context in which they are found is messy).
Are all regions of DNA mutating at the same rate?
No, different regions of DNA - coding, non-coding, tRNA, mtDNA, etc. - mutate at different rates, and because of that they are usually handled as separate fractions (partitions), each with its own mutation rate (based on previous knowledge).
After having received the sequencing data back, when read mapping, filtering and QC is done, one can use single or multi-sample calling to infer genotypes. What is the difference and which one is mostly used in aDNA studies?
In genomic analysis, single-sample calling involves analyzing each sample’s data independently to identify variants, while multi-sample calling analyzes the data from multiple samples simultaneously. Single-sample calling is useful when you want to get the best consensus for each sample, while multi-sample calling is preferred for large cohorts and for identifying variants with low false positive rates. As aDNA data is typically scarce, we often don’t have large population data, so single-sample calling is more often used to identify SNPs and associated genotypes.
A common approach is to aggregate all reads covering each position, calculate the likelihood of each possible genotype (using dedicated tools/algorithms), select the most likely genotype, and compare it to the reference genome (sketched below).
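A minimal sketch of such a per-position genotype likelihood calculation; this is a simplified toy version of the kind of model used by callers such as GATK or bcftools, ignoring base/mapping qualities and post-mortem damage:

```python
# Toy single-sample genotype likelihoods at one position.
from itertools import combinations_with_replacement
import math

def genotype_likelihoods(pileup, error=0.01):
    """pileup: the observed bases at one position, e.g. ['A', 'A', 'G', ...]."""
    logliks = {}
    for g1, g2 in combinations_with_replacement("ACGT", 2):   # 10 diploid genotypes
        loglik = 0.0
        for base in pileup:
            # P(read base | allele) is 1-e if it matches, e/3 otherwise;
            # average over the two alleles of the genotype.
            p = sum((1 - error) if base == a else error / 3 for a in (g1, g2)) / 2
            loglik += math.log(p)
        logliks[g1 + g2] = loglik
    return logliks

gl = genotype_likelihoods(list("AAGAGAAAGG"))
print(max(gl, key=gl.get))   # 'AG': the heterozygous genotype is most likely here
```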
What is good to remove/filter out from the sequencing data (after filtering out other species)?
It’s a good idea to remove:
- Genotype calls with low confidence
- Repetitive (low-complexity) regions: they often contain more sequencing and mapping errors.
- CpG regions (if damage was removed by USER): methylated cytosines in CpG context deaminate to thymine rather than uracil, so the USER enzyme cannot repair damage there.
- Allele variants with low support (possible false positives): This is done through allele balance filtering.
Quality controls and filters are your friends (it’s better to lose some data than to reach erroneous conclusions)
Explain how allele balance filtering works briefly.
Allele balance refers to the ratio of reads supporting the reference allele to reads supporting the alternate allele in heterozygous individuals. In an ideal scenario, a heterozygous genotype should show roughly equal support for both alleles. If the allele balance deviates significantly from the expected 50/50 ratio, it suggests a potential issue with the variant call, due for example to sequencing errors, alignment issues or other technical artifacts. When filtering, you set a threshold for the minimum fraction of reads an alternative allele needs in order to be counted as “real” variation, e.g. over 20%. If you have 10 reads and 9 of them show a G while 1 shows a C, the support for C is below 20% and it is probably not real variation; if, however, 6 of the 10 reads show a G and 4 show a C, there is good support for C being an alternative allele. To use this, the coverage (mainly depth) needs to be high, ideally over 10X. The higher the coverage, the higher the confidence in establishing heterozygous sites.
Allele balance filtering helps to identify and remove false positive variant calls, improving the accuracy of downstream analysis and reducing the risk of interpreting errors as true genetic variations. Imagine a sequencing read supporting a variant at a specific locus. If a large majority of reads support one allele, and very few support the other, the allele balance would be highly skewed. This skewed balance might be a sign of a sequencing or alignment error, and the variant call might be filtered out based on this.
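A minimal filter along those lines; the 20% cut-off, 10X minimum depth and read counts mirror the example above, and real pipelines tune these thresholds per dataset:

```python
# Simple allele-balance check for a candidate heterozygous site.
def passes_allele_balance(ref_reads, alt_reads, min_fraction=0.2, min_depth=10):
    depth = ref_reads + alt_reads
    if depth < min_depth:
        return False                      # too little data to judge the site
    alt_fraction = alt_reads / depth
    return min_fraction <= alt_fraction <= 1 - min_fraction

print(passes_allele_balance(9, 1))   # False: likely error or damage, filter out
print(passes_allele_balance(6, 4))   # True: well-supported heterozygote
```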
What does the number for genome-wide heterozygosity tell you?
Genome-wide heterozygosity is an estimate of how many heterozygous sites are present in the sample’s genome, which is useful for comparison between samples. From this data you can also see if some regions of the genome are overrepresented in heterozygosity/variability, which can say something about dual-fitness alleles etc. Genome-wide heterozygosity can also tell you something about the genetic variation within populations; many endangered species have a low GWH.
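A toy calculation of the measure, using invented genotype calls; in practice this is computed over millions of confidently genotyped sites and often reported per kb:

```python
# Genome-wide heterozygosity = heterozygous calls / confidently genotyped sites.
genotypes = ["AA", "AG", "CC", "CT", "GG", "GG", "TT", "AC", "GG", "AA"]   # toy calls

het_sites = sum(1 for g in genotypes if g[0] != g[1])
heterozygosity = het_sites / len(genotypes)
print(heterozygosity)   # 0.3 in this toy example
```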
What are Runs of Homozygosity (ROH), how can this measure be useful?
The inheritance of identical haplotypes from a common ancestor creates long tracts of homozygous genotypes known as runs of homozygosity (ROH). They are widely used as an estimator of inbreeding, F_ROH (the fraction of the genome lying in ROH), and can tell you something about population history.
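A small sketch of the related summary statistics (NROH, SROH and F_ROH) from a list of detected ROH segments; the segment lengths and genome size are made-up numbers:

```python
# NROH = number of ROH, SROH = summed length of ROH, F_ROH = SROH / genome length.
roh_segments_bp = [2_500_000, 1_200_000, 800_000, 4_000_000]   # detected ROH lengths
autosomal_genome_bp = 2_500_000_000                            # assumed genome size

nroh = len(roh_segments_bp)
sroh = sum(roh_segments_bp)
f_roh = sroh / autosomal_genome_bp
print(nroh, sroh, round(f_roh, 4))   # 4 8500000 0.0034
```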
In the context of Runs of Homozygosity (ROH) in different populations, what can you infer from looking at SROH (Sum of Runs of Homozygosity) as a function of NROH (Number of Runs of Homozygosity)?
Larger populations usually have lower SROH and NROH than smaller ones, while admixed populations have the lowest. Bottlenecked populations usually have high SROH and NROH.
Keep in mind that populations that have been small for a long time can have high ROH even though they don’t share a recent common ancestor, so one needs to be careful with interpretations.
What is a Principal Component Analysis (PCA) used for?
Principal Component Analysis (PCA) is an exploratory analysis method used to reduce the complexity (dimensionality) of a dataset and identify its principal axes of variation. It is very useful for visualizing data, but it is hard to interpret, so interpretations should not be based on PCA alone.
Do not torture the data until it confesses; often simpler analyses are more straightforward to interpret.
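A minimal PCA on a toy genotype matrix (individuals x SNPs, coded as 0/1/2 copies of the alternate allele); real aDNA studies typically use smartpca/EIGENSOFT and project low-coverage ancient samples onto axes defined by modern reference panels:

```python
# PCA via SVD of the centred genotype matrix.
import numpy as np

geno = np.array([[0, 1, 2, 0, 1],      # individual 1
                 [0, 1, 2, 1, 1],      # individual 2 (similar to 1)
                 [2, 0, 0, 2, 0],      # individual 3
                 [2, 1, 0, 2, 0]],     # individual 4 (similar to 3)
                dtype=float)

centered = geno - geno.mean(axis=0)              # centre each SNP
u, s, vt = np.linalg.svd(centered, full_matrices=False)
pcs = u * s                                      # individual coordinates on the PCs
print(np.round(pcs[:, :2], 2))                   # PC1 separates the two groups
```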
Name two admixture modelling methods.
Examples of admixture modelling methods are:
- Model-based clustering: e.g. ADMIXTURE or STRUCTURE, which model each individual’s genome as a mixture of a chosen number of ancestral source populations.
- F statistics: specifically the F3 statistic, defined as the (averaged) product of allele frequency differences between population C and A, and between C and B, used to test whether population C is admixed between A and B. A significantly negative F3 provides unambiguous evidence that population C is admixed between populations A and B (see the sketch below).
For further explanation on F3 statistics, see https://mpi-eva-archaeogenetics.github.io/comp_human_adna_book/fstats.html
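A bare-bones sketch of the F3 computation from allele frequencies, using invented frequencies at six SNPs; real analyses (e.g. ADMIXTOOLS qp3Pop) add a correction for finite sample size in C and estimate standard errors with a block jackknife:

```python
# Uncorrected f3(C; A, B) = mean over SNPs of (c - a) * (c - b).
import numpy as np

a = np.array([0.10, 0.80, 0.50, 0.30, 0.90, 0.20])   # allele freqs in population A
b = np.array([0.90, 0.10, 0.40, 0.80, 0.20, 0.70])   # allele freqs in population B
c = np.array([0.50, 0.45, 0.45, 0.55, 0.55, 0.45])   # candidate admixed population C

f3 = np.mean((c - a) * (c - b))
print(round(f3, 3))   # about -0.089: negative, consistent with C being admixed
```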
What method is used to study demographic history inferred from genomic data?
Sequentially Markovian Coalescent (SMC) methods use patterns of heterozygosity across the genome (where the two chromosomes differ) to infer the distribution of coalescence times, i.e. when the two alleles at a given site last shared a common ancestor, and from that how effective population size (Ne) has changed over time. A short time to the last common ancestor corresponds to a smaller effective population size at that time, since effective population size is inversely proportional to the coalescence rate (see the sketch below).
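A sketch of just the final conversion step, turning per-interval coalescence rates into Ne with the diploid relation Ne ≈ 1 / (2 × rate); the time bins and rates are invented, and real methods (PSMC/MSMC) infer them from the data with a hidden Markov model:

```python
# Effective population size from coalescence rates, interval by interval.
time_bins_gen = [1_000, 5_000, 20_000, 100_000]     # interval start (generations ago)
coal_rate = [2.0e-5, 5.0e-5, 1.0e-5, 4.0e-5]         # inferred coalescence rate per interval

for t, r in zip(time_bins_gen, coal_rate):
    ne = 1 / (2 * r)                                 # higher rate -> smaller Ne
    print(f"{t:>7} generations ago: Ne ~ {ne:,.0f}")
```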
Which genetic data types can we analyze?
From the nucleus (which mostly undergoes recombination), we can analyze:
- Autosomal chromosomes
- Sex chromosomes: X and Y (X recombines in females, but X and Y mostly do not recombine in males)
From the mitochondria (no recombination) we can analyze:
- mtDNA
What is the difference between the census population (Nc) and effective population (Ne)?
Census Population size (Nc): total number of individuals
Effective population size (Ne): the number of breeding individuals that contribute to the gene pool in an idealized population
Usually Ne is lower than Nc as not all individuals will have viable offspring.
What contribution does the autosomal, X and Y chromosome and mitochondrial DNA make to the effective pop size?
- Autosomal chromosomes: the baseline, 100% of Ne (four autosomal copies per breeding pair, all of which can be passed on to offspring)
- X chromosome: effective population size is 75% of the autosomal one (three X copies per breeding pair, two in the female and one in the male, versus four autosomal copies: 3/4 = 75%)
- Y chromosome: effective population size is 25% of the autosomal one (one Y copy per breeding pair). It is a bit tricky to analyze and sequence due to repetitive regions, and there are not many coding genes on the Y chromosome.
- mtDNA: effective population size is 25% of the autosomal one (one maternally inherited, effectively haploid copy per breeding pair). High mutation rate (evolves fast). The human mitochondrial genome is 16,569 bp.
What can mtDNA be used to study?
mtDNA can be used to study/establish:
- Phylogeny
- Haplotype networks
- Demographic history
- Molecular dating
Compare mtDNA and Nuclear DNA with pros and cons.
MtDNA:
+ Easy to extract, sequence, and analyze
+ Evolutionary interesting (high variability but also conserved regions)
+ Universal PCR primers (for example for mammals) can be used
+ High copy number → better preservation
- Fairly small proportion of the genome: low resolution as only 25% contribution to Ne
- Only gives insight into maternal population history, no recombination
Nuclear DNA:
+ Provides a more complete picture, including recombination
+ High resolution: 100% contribution to Ne
- But also more complex → more complicated to generate and analyze
- Lower copy number → in ancient and historical samples often less preserved
- Expensive to generate