Computational paleogenomics and data types Flashcards
(37 cards)
Why are mitogenomes so great in paleogenomics?
The advantages of using mitogenomes are:
- Mitogenomes are way more abundant than nuclear DNA
- The genome is much smaller
- Haploid (no heterozygous sites)
- Inherited maternally
- No recombination
- Conserved across eukaryotes… but also has higher mutation rates than nuclear DNA
Can you be 100% sure that the mitogenome you have assembled is correct? Motivate your answer.
No, you can never be 100% sure: there are many approaches and software packages out there (some of them questionable and/or difficult to understand), and every ancient DNA sample comes with its own complexity (I’m looking at you, museum specimens!).
The best you can do is:
- To find the best tool according to your needs and try more than one
- To tailor your analysis to take care of DNA damage (if not removed before sequencing)
What is mapping iterative assembly? When is it useful?
Mapping iterative assembly (MIA) is sort of like an intermediate point between mapping to a reference genome and a de novo assembly. It is a method used in bioinformatics that works by:
- Initial mapping: reads are mapped to a reference genome (complete or partial), which anchors them in order and maps out conserved regions. This requires a reasonably good reference genome.
- Consensus calling: the resulting read alignment is collapsed into a consensus sequence. The more stringent you are with the consensus threshold (% agreement), the more data you lose. At this step you can, for example, spot positions carrying both C and T; this could be due to deamination (C→T), and if damage was supposedly removed before sequencing it tells you the removal may not have been complete.
- Iterative refinement: the consensus sequence is used as the new reference genome for the next iteration. This cycle of mapping, consensus calling and reference updating is repeated until the consensus sequence no longer changes significantly; the final consensus is the assembled genome or target region of interest (sketched below).
This is useful if you don’t have a reference genome for your species, or if you suspect your samples are very divergent from the closest available reference.
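A minimal sketch of that iterative loop, assuming bwa, samtools and bcftools are installed; all file names are placeholders, and details such as read groups, quality filtering and damage handling are omitted:

```python
# Sketch of an iterative mapping assembly loop (placeholder file names; not a
# production pipeline - filtering, read groups and damage handling are omitted).
import subprocess

def run(cmd):
    subprocess.run(cmd, shell=True, check=True)

def read_seq(path):
    """Return the concatenated sequence of a FASTA file, ignoring headers."""
    with open(path) as fh:
        return "".join(line.strip() for line in fh if not line.startswith(">"))

reference = "starting_reference.fa"   # complete or partial reference to seed the assembly
reads = "trimmed_reads.fq"            # adapter-trimmed (and ideally damage-treated) reads

for iteration in range(1, 11):                         # cap the number of rounds
    run(f"bwa index {reference}")
    run(f"bwa mem {reference} {reads} | samtools sort -o aln.bam -")
    run("samtools index aln.bam")
    # Call variants against the current reference and build an updated consensus
    # (--ploidy 1 because the mitogenome is haploid).
    run(f"bcftools mpileup -f {reference} aln.bam | "
        f"bcftools call -mv --ploidy 1 -Oz -o calls.vcf.gz")
    run("bcftools index -f calls.vcf.gz")
    consensus = f"consensus_iter{iteration}.fa"
    run(f"bcftools consensus -f {reference} calls.vcf.gz > {consensus}")
    # Stop once the consensus no longer changes between iterations.
    if read_seq(consensus) == read_seq(reference):
        break
    reference = consensus
```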
There has been a big improvement in reference genomes and the number of mapped individuals in phylogenetics over the last 40 years; what are the benefits of this?
With more individuals and good reference genomes you get much higher temporal resolution and can draw narrower, more precise conclusions.
What is the Molecular clock hypothesis?
The molecular clock hypothesis posits that the rate of evolutionary change of any specified protein, DNA or RNA sequence is approximately constant over time and across lineages. This allows you to estimate the time of divergence between samples. For DNA, the substitution rate is roughly 1 substitution per 20 nucleotides per million years, so for a 20 nt stretch of DNA the number of substitutions relative to the modern sample roughly equals the number of million years since divergence (worked example below).
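A toy calculation of that arithmetic; the rate and the number of observed substitutions are just the illustrative values quoted above, not universal constants:

```python
# Divergence time = substitutions / (rate per site per Myr * sequence length).
rate_per_site_per_myr = 1 / 20       # ~1 substitution per 20 nt per million years
seq_length = 20                       # nucleotides compared
observed_substitutions = 3            # example: 3 differences vs. the modern sample

divergence_myr = observed_substitutions / (rate_per_site_per_myr * seq_length)
print(divergence_myr)                 # -> 3.0 million years since divergence (roughly)
```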
There are many phylogenetic inference methods, describe two.
The most commonly used ones are:
- Maximum Likelihood (ML): ML calculates the probability of observing the given data (e.g. DNA sequences) given a specific tree topology and set of branch lengths, then seeks the topology and branch lengths that maximize this probability (see the sketch after this list). ML has sound statistical foundations and often performs well in simulations, but it can be computationally expensive, especially for large datasets.
- Bayesian Inference: Bayesian methods incorporate prior knowledge about the evolutionary process into the analysis, allowing for a more robust assessment of the phylogenetic tree. They use Markov Chain Monte Carlo (MCMC) to explore the space of possible phylogenetic trees and estimate the posterior probability distribution of the tree. Bayesian methods offer more flexibility in modeling evolutionary processes and are particularly well suited for analyzing large datasets.
Others:
- Maximum Parsimony (MP)
- Distance-matrix methods: e.g. Neighbor-Joining (NJ) or UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
- Coalescence-based methods
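To make the “likelihood of the data given branch lengths” idea concrete, here is a minimal sketch for the simplest possible case: two sequences separated by a single branch under the Jukes-Cantor model. The alignment summary numbers are made up; real ML tree inference (e.g. RAxML, IQ-TREE) generalizes this to many taxa and full tree topologies.

```python
# Maximum likelihood for a single Jukes-Cantor branch length d between two sequences.
import math

def jc69_loglik(d, n_sites, n_diff):
    """Log-likelihood of branch length d given n_diff mismatches out of n_sites."""
    p_diff = 0.75 * (1 - math.exp(-4 * d / 3))   # P(site differs | d)
    return n_diff * math.log(p_diff) + (n_sites - n_diff) * math.log(1 - p_diff)

n_sites, n_diff = 1000, 120                      # toy alignment summary: 12% differences

# Grid search for the branch length maximizing the likelihood...
d_ml = max((d / 10000 for d in range(1, 5000)),
           key=lambda d: jc69_loglik(d, n_sites, n_diff))
# ...which agrees with the closed-form JC69 estimate.
d_closed = -0.75 * math.log(1 - 4 * (n_diff / n_sites) / 3)
print(round(d_ml, 4), round(d_closed, 4))        # ~0.1308 substitutions/site for both
```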
List the steps in a typical phylogenetic workflow.
- Raw data processing
- Mitogenome assembly
- Alignment
- Bayesian analysis
- Phylogeny construction
What is the goal of Bayesian molecular clock dating (AKA: tip dating)? What are the challenges?
To confidently infer the age of samples that are beyond the radiocarbon limit (> 50 ka). One challenge is to obtain estimates that match stratigraphic context information (when available). It is also hard and complex because many factors are taken into account: several models can be used for the rate and accumulation of mutations, and priors need to be specified, e.g. the divergence time of the species from an outgroup, with radiocarbon-dated (or modern) samples used as calibration references. Several probability distributions can be used to model their uncertainty.
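A hedged sketch of what such priors might look like; the sample names, dates and distribution parameters are invented for illustration, and real tip-dating analyses specify these in dedicated software such as BEAST or MrBayes:

```python
# Illustrative tip-dating priors using scipy.stats (all numbers are made up).
from scipy import stats

priors = {
    # Radiocarbon-dated tip: age prior centred on the calibrated date (years BP)
    "sample_A_age": stats.norm(loc=32_000, scale=500),
    # Undated tip beyond the radiocarbon limit: broad uniform prior (50-200 ka BP)
    "sample_B_age": stats.uniform(loc=50_000, scale=150_000),
    # Root calibration, e.g. divergence from an outgroup (~1 Myr, lognormal)
    "root_age": stats.lognorm(s=0.3, scale=1_000_000),
    # Clock rate prior (substitutions/site/year)
    "clock_rate": stats.lognorm(s=0.5, scale=2e-8),
}
print(priors["sample_B_age"].mean())   # 125000.0 years BP
```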
Give two examples of applications for molecular tip dating.
- With tip dating we are able to infer better phylogenies → more accurate reconstructions of species’ evolutionary histories.
- Tip dating also allows for establishing which samples are truly old (oftentimes the context in which they are found is messy).
Are all regions of DNA mutating at the same rate?
No, different regions of DNA - coding, non-coding, tRNA, mtDNA, etc. - mutate at different rates, and because of that they are usually handled as separate fractions (partitions), each with its own mutation rate (based on previous knowledge).
After having received the sequencing data back, when read mapping, filtering and QC is done, one can use single or multi-sample calling to infer genotypes. What is the difference and which one is mostly used in aDNA studies?
In genomic analysis, single-sample calling involves analyzing each sample’s data independently to identify variants, while multi-sample calling analyzes the data from multiple samples simultaneously. Single-sample calling is useful when you want to get the best consensus for each sample, while multi-sample calling is preferred for large cohorts and for identifying variants with low false positive rates. As aDNA data is typically scarce, we often don’t have large population data, so single-sample calling is more often used to identify SNPs and associated genotypes.
A common approach is to aggregate all reads covering each position, calculate the likelihood of each possible genotype (using dedicated tools/algorithms), select the most likely genotype, and compare it to the reference genome (sketched below).
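A minimal sketch of such a per-position genotype likelihood calculation; this is a simplified toy version of the kind of model used by callers such as GATK or bcftools, ignoring base/mapping qualities and post-mortem damage:

```python
# Toy single-sample genotype likelihoods at one position.
from itertools import combinations_with_replacement
import math

def genotype_likelihoods(pileup, error=0.01):
    """pileup: the observed bases at one position, e.g. ['A', 'A', 'G', ...]."""
    logliks = {}
    for g1, g2 in combinations_with_replacement("ACGT", 2):   # 10 diploid genotypes
        loglik = 0.0
        for base in pileup:
            # P(read base | allele) is 1-e if it matches, e/3 otherwise;
            # average over the two alleles of the genotype.
            p = sum((1 - error) if base == a else error / 3 for a in (g1, g2)) / 2
            loglik += math.log(p)
        logliks[g1 + g2] = loglik
    return logliks

gl = genotype_likelihoods(list("AAGAGAAAGG"))
print(max(gl, key=gl.get))   # 'AG': the heterozygous genotype is most likely here
```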
What is good to remove/filter out from the sequencing data (after filtering out other species)?
It’s a good idea to remove:
- Genotype calls with low confidence
- Repetitive (low-complexity) regions: they often contain more sequencing and mapping errors.
- CpG regions (if damage was removed by USER): methylated cytosines in CpG context deaminate to thymine rather than uracil, so the USER enzyme cannot repair damage there.
- Allele variants with low support (possible false positives): This is done through allele balance filtering.
Quality controls and filters are your friends (it’s better to lose some data than to reach erroneous conclusions)
Explain how allele balance filtering works briefly.
Allele balance refers to the ratio of reads supporting the reference allele to reads supporting the alternate allele in heterozygous individuals. In an ideal scenario, a heterozygous genotype should show roughly equal support for both alleles. If the allele balance deviates significantly from the expected 50/50 ratio, it suggests a potential issue with the variant call, due for example to sequencing errors, alignment issues or other technical artifacts. When filtering, you set a threshold for the minimum fraction of reads an alternative allele needs in order to be counted as “real” variation, e.g. over 20%. If you have 10 reads and 9 of them show a G while 1 shows a C, the support for C is below 20% and it is probably not real variation; if, however, 6 of the 10 reads show a G and 4 show a C, there is good support for C being an alternative allele. To use this, the coverage (mainly depth) needs to be high, ideally over 10X. The higher the coverage, the higher the confidence in establishing heterozygous sites.
Allele balance filtering helps to identify and remove false positive variant calls, improving the accuracy of downstream analysis and reducing the risk of interpreting errors as true genetic variations. Imagine a sequencing read supporting a variant at a specific locus. If a large majority of reads support one allele, and very few support the other, the allele balance would be highly skewed. This skewed balance might be a sign of a sequencing or alignment error, and the variant call might be filtered out based on this.
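A minimal filter along those lines; the 20% cut-off, 10X minimum depth and read counts mirror the example above, and real pipelines tune these thresholds per dataset:

```python
# Simple allele-balance check for a candidate heterozygous site.
def passes_allele_balance(ref_reads, alt_reads, min_fraction=0.2, min_depth=10):
    depth = ref_reads + alt_reads
    if depth < min_depth:
        return False                      # too little data to judge the site
    alt_fraction = alt_reads / depth
    return min_fraction <= alt_fraction <= 1 - min_fraction

print(passes_allele_balance(9, 1))   # False: likely error or damage, filter out
print(passes_allele_balance(6, 4))   # True: well-supported heterozygote
```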
What does the number for genome-wide heterozygosity tell you?
Genome-wide heterozygosity is an estimate of how many heterozygous sites are present in the sample’s genome, which is useful for comparison between samples. From this data you can also see if some regions of the genome are overrepresented in heterozygosity/variability, which can say something about dual-fitness alleles etc. Genome-wide heterozygosity can also tell you something about the genetic variation within populations; many endangered species have a low GWH.
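A toy calculation of the measure, using invented genotype calls; in practice this is computed over millions of confidently genotyped sites and often reported per kb:

```python
# Genome-wide heterozygosity = heterozygous calls / confidently genotyped sites.
genotypes = ["AA", "AG", "CC", "CT", "GG", "GG", "TT", "AC", "GG", "AA"]   # toy calls

het_sites = sum(1 for g in genotypes if g[0] != g[1])
heterozygosity = het_sites / len(genotypes)
print(heterozygosity)   # 0.3 in this toy example
```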
What are Runs of Homozygosity (ROH), how can this measure be useful?
The inheritance of identical haplotypes from a common ancestor creates long tracts of homozygous genotypes known as runs of homozygosity (ROH). They are widely used as an estimator of inbreeding, F_ROH (the fraction of the genome lying in ROH), and can tell you something about population history.
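A small sketch of the related summary statistics (NROH, SROH and F_ROH) from a list of detected ROH segments; the segment lengths and genome size are made-up numbers:

```python
# NROH = number of ROH, SROH = summed length of ROH, F_ROH = SROH / genome length.
roh_segments_bp = [2_500_000, 1_200_000, 800_000, 4_000_000]   # detected ROH lengths
autosomal_genome_bp = 2_500_000_000                            # assumed genome size

nroh = len(roh_segments_bp)
sroh = sum(roh_segments_bp)
f_roh = sroh / autosomal_genome_bp
print(nroh, sroh, round(f_roh, 4))   # 4 8500000 0.0034
```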
In the context of Runs of Homozygosity (ROH) in different populations, what can you infer from looking at SROH (Sum of Runs of Homozygosity) as a function of NROH (Number of Runs of Homozygosity)?
Larger populations usually have lower SROH and NROH than smaller ones, while admixed populations have the lowest. Bottlenecked populations usually have high SROH and NROH.
Keep in mind that populations that have been small for a long time can have high ROH even though they don’t share a recent common ancestor, so one needs to be careful with interpretations.
What is a Principal Component Analysis (PCA) used for?
Principal Component Analysis (PCA) is an exploratory analysis method used to reduce the complexity (dimensionality) of a dataset and identify its principal axes of variation. It is very useful for visualizing data, but it is hard to interpret, so interpretations should not be based on PCA alone.
Do not torture the data until it confesses; often simpler analyses are more straightforward to interpret.
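A minimal PCA on a toy genotype matrix (individuals x SNPs, coded as 0/1/2 copies of the alternate allele); real aDNA studies typically use smartpca/EIGENSOFT and project low-coverage ancient samples onto axes defined by modern reference panels:

```python
# PCA via SVD of the centred genotype matrix.
import numpy as np

geno = np.array([[0, 1, 2, 0, 1],      # individual 1
                 [0, 1, 2, 1, 1],      # individual 2 (similar to 1)
                 [2, 0, 0, 2, 0],      # individual 3
                 [2, 1, 0, 2, 0]],     # individual 4 (similar to 3)
                dtype=float)

centered = geno - geno.mean(axis=0)              # centre each SNP
u, s, vt = np.linalg.svd(centered, full_matrices=False)
pcs = u * s                                      # individual coordinates on the PCs
print(np.round(pcs[:, :2], 2))                   # PC1 separates the two groups
```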
Name two admixture modelling methods.
Examples of admixture modelling methods are:
- Model-based clustering: e.g. ADMIXTURE or STRUCTURE, which model each individual’s genome as a mixture of a chosen number of ancestral source populations.
- F statistics: specifically the F3 statistic, defined as the (averaged) product of allele frequency differences between population C and A, and between C and B, used to test whether population C is admixed between A and B. A significantly negative F3 provides unambiguous evidence that population C is admixed between populations A and B (see the sketch below).
For further explanation on F3 statistics, see https://mpi-eva-archaeogenetics.github.io/comp_human_adna_book/fstats.html
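A bare-bones sketch of the F3 computation from allele frequencies, using invented frequencies at six SNPs; real analyses (e.g. ADMIXTOOLS qp3Pop) add a correction for finite sample size in C and estimate standard errors with a block jackknife:

```python
# Uncorrected f3(C; A, B) = mean over SNPs of (c - a) * (c - b).
import numpy as np

a = np.array([0.10, 0.80, 0.50, 0.30, 0.90, 0.20])   # allele freqs in population A
b = np.array([0.90, 0.10, 0.40, 0.80, 0.20, 0.70])   # allele freqs in population B
c = np.array([0.50, 0.45, 0.45, 0.55, 0.55, 0.45])   # candidate admixed population C

f3 = np.mean((c - a) * (c - b))
print(round(f3, 3))   # about -0.089: negative, consistent with C being admixed
```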
What method is used to study demographic history inferred from genomic data?
Sequentially Markovian Coalescent (SMC) methods use patterns of heterozygosity across the genome (where the two chromosomes differ) to infer the distribution of coalescence times, i.e. when the two alleles at a given site last shared a common ancestor, and from that how effective population size (Ne) has changed over time. A short time to the last common ancestor corresponds to a smaller effective population size at that time, since effective population size is inversely proportional to the coalescence rate (see the sketch below).
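A sketch of just the final conversion step, turning per-interval coalescence rates into Ne with the diploid relation Ne ≈ 1 / (2 × rate); the time bins and rates are invented, and real methods (PSMC/MSMC) infer them from the data with a hidden Markov model:

```python
# Effective population size from coalescence rates, interval by interval.
time_bins_gen = [1_000, 5_000, 20_000, 100_000]     # interval start (generations ago)
coal_rate = [2.0e-5, 5.0e-5, 1.0e-5, 4.0e-5]         # inferred coalescence rate per interval

for t, r in zip(time_bins_gen, coal_rate):
    ne = 1 / (2 * r)                                 # higher rate -> smaller Ne
    print(f"{t:>7} generations ago: Ne ~ {ne:,.0f}")
```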
Which genetic data types can we analyze?
From the nucleus (which mostly undergoes recombination), we can analyze:
- Autosomal chromosomes
- Sex chromosomes: X and Y (X recombines in females, but X and Y mostly do not recombine in males)
From the mitochondria (no recombination) we can analyze:
- mtDNA
What is the difference between the census population (Nc) and effective population (Ne)?
Census Population size (Nc): total number of individuals
Effective population size (Ne): the number of breeding individuals that contribute to the gene pool in an idealized population
Usually Ne is lower than Nc as not all individuals will have viable offspring.
What contribution does the autosomal, X and Y chromosome and mitochondrial DNA make to the effective pop size?
- Autosomal chromosomes: the baseline, 100% of Ne (four autosomal copies per breeding pair, all of which can be passed on to offspring)
- X chromosome: effective population size is 75% of the autosomal one (three X copies per breeding pair, two in the female and one in the male, versus four autosomal copies: 3/4 = 75%)
- Y chromosome: effective population size is 25% of the autosomal one (one Y copy per breeding pair). It is a bit tricky to analyze and sequence due to repetitive regions, and there are not many coding genes on the Y chromosome.
- mtDNA: effective population size is 25% of the autosomal one (one maternally inherited, effectively haploid copy per breeding pair). High mutation rate (evolves fast). The human mitochondrial genome is 16,569 bp.
What can mtDNA be used to study?
mtDNA can be used to study/establish:
- Phylogeny
- Haplotype networks
- Demographic history
- Molecular dating
Compare mtDNA and Nuclear DNA with pros and cons.
MtDNA:
+ Easy to extract, sequence, and analyze
+ Evolutionary interesting (high variability but also conserved regions)
+ Universal PCR primers (for example for mammals) can be used
+ High copy number → better preservation
- Fairly small proportion of the genome: low resolution as only 25% contribution to Ne
- Only gives insight into maternal population history, no recombination
Nuclear DNA:
+ Provides a more complete picture, including recombination
+ High resolution: 100% contribution to Ne
- But also more complex → more complicated to generate and analyze
- Lower copy number → in ancient and historical samples often less preserved
- Expensive to generate