14 | Large scale analyses Flashcards
Define population genetics.
- subfield of genetics
- part of evolutionary biology
- deals with genetic differences within and among pops
- examine phenomena eg: adaptation, speciation, population structure
from 1920s on
(Fisher, Haldane, Wright)
* formal models of evolution, statistical methods
* allele and gene frequencies in populations over time
* data: few genotypes of a limited number of individuals
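A minimal sketch of the "allele frequencies over time" idea: allele frequencies computed from genotype counts at one biallelic locus. The function name and the genotype counts are hypothetical and only serve as an illustration.

```python
def allele_frequencies(n_AA, n_Aa, n_aa):
    """Return (p, q): frequencies of alleles A and a from diploid genotype counts."""
    total_alleles = 2 * (n_AA + n_Aa + n_aa)
    if total_alleles == 0:
        raise ValueError("no individuals sampled")
    # Each AA individual carries two A alleles, each Aa individual carries one.
    p = (2 * n_AA + n_Aa) / total_alleles
    return p, 1.0 - p


# Hypothetical genotype counts (AA, Aa, aa) sampled in two generations.
samples = {"generation 0": (30, 40, 30), "generation 10": (50, 40, 10)}
for label, counts in samples.items():
    p, q = allele_frequencies(*counts)
    print(f"{label}: p(A) = {p:.2f}, q(a) = {q:.2f}")
```

Comparing p between sampled generations is the kind of change over time that population genetic models try to explain.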
Define phylogenomics
Ultimate goal: reconstruct the evolutionary history of species through their genomes
Intersection of the fields of evolution and genomics
Analysis that involves genome data and evolutionary reconstructions.
What is molecular genetics / genomics about?
What type of data?
from 1950s on
(Watson & Crick)?
* makeup, expression, regulation of genes;
genotype-phenotype
* data: gene & genome sequences, phenotypes
Define Population Genomics.
Data used?
Concepts and tools?
Questions?
- large-scale application of genomic technologies to study populations of individuals
- data: multiple genomes from the same (or closely related) species; thousands / millions of SNPs per individual
- studies genome-wide effects to improve our understanding of microevolution –> learn phylogenetic history and demography of a pop
concepts & tools
- linkage disequilibrium, genetic drift, coalescent, multivariate statistics, …
questions: population structure & history; detect evolutionary processes along the genome
- gene & genome evolution, recombination events, split times, gene flow, population sizes, demographic events, selection, diversification, relatedness, pop. structure, etc
- contemporary & ancestral populations/species
Genome-scale evolutionary analysis
Based on which regions? (two options)
In both cases: several gene sequences (can’t be one like in the previous small scale analyses we studied)
based on coding regions:
several (or all) gene sequences from different species –> families of homologs / orthologs
- often one individual per species
- homology/orthology assignment required
based on any/all genomic regions:
independent of gene content –> homologous genomic regions –> alignments and phylogenies
- generally done based on re-sequencing data
- separate homology assignment step not necessary
- same or closely related species
Which step in the pipeline sets large scale analyses apart from the small scale analyses we studied previously?
Just this step is done differently and makes large scale more difficult:
Assigning sequences (genomic data from multiple species) to families of homologous/orthologous genes
What are the three general approaches to Genome-scale inference of homology / orthology?
And what are the possible types of data? (plus examples)
Approaches:
- Tree based
- Graph based
- Hybrid
Data:
- Databases: use pre-computed sets of homologs / orthologs (eg TreeFam, OMA)
- Customized data set: compute project-specific homologs / orthologs
What is an example of a database which can be used for tree-based genome scale inference of homology/orthology?
TreeFam
What are the two types of graph-based genome scale inference of homology/orthology?
Name an example database for each type.
graph-based
- reciprocal best hit (RBH): COG
- clustering (MCL): OrthoMCL
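The reciprocal-best-hit idea can be sketched as follows: a gene pair is kept only if each gene is the other's top-scoring match in the other genome. This is a toy illustration, not the COG or OrthoMCL procedure; the function name, gene names, and similarity scores are all hypothetical.

```python
def reciprocal_best_hits(scores_ab, scores_ba):
    """Return sorted (a, b) pairs where b is a's best hit and a is b's best hit.

    scores_ab maps each gene of genome A to {gene of genome B: similarity score};
    scores_ba holds the same information for the reverse search.
    """
    best_ab = {a: max(hits, key=hits.get) for a, hits in scores_ab.items() if hits}
    best_ba = {b: max(hits, key=hits.get) for b, hits in scores_ba.items() if hits}
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical similarity scores (e.g. alignment bit scores).
scores_ab = {
    "a1": {"b1": 250.0, "b2": 90.0},
    "a2": {"b2": 180.0, "b1": 60.0},
    "a3": {"b2": 120.0},
}
scores_ba = {
    "b1": {"a1": 245.0, "a2": 58.0},
    "b2": {"a2": 175.0, "a3": 118.0, "a1": 88.0},
}

pairs = reciprocal_best_hits(scores_ab, scores_ba)
print(pairs)  # [('a1', 'b1'), ('a2', 'b2')]
```

Gene a3 is dropped because its best hit b2 prefers a2. Cases like this show how a strictly pairwise criterion can miss paralogs and co-orthologs.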
What is usually the best approach for genome-scale inference of homology / orthology?
Name a database that can be used for this
A hybrid approach combining graph-based and tree-based methods.
For example OMA (Orthologous MAtrix)
What is OMA?
OMA (Orthologous MAtrix) algorithm & database
- best strategy for large-scale assignment of orthologs/homologs
- using a combination of graph- and tree-based methods
- multi-step pipeline
- pre-computed (and cross-linked from other dbs)
- 1:1 orthologs
- homologs (orthologs & paralogs)
- stand-alone software
- companion tools (analysis, visualization)
- HOGs (hierarchical orthologous groups)
–> large-scale homology assignment is difficult: there are many competing strategies, results are often incorrect, and it is time-consuming
–> prefer pre-computed assignments where they exist
What are seven challenges of assigning homology / orthology?
- pairwise orthology definition (non-transitive)
- differential gene loss (or incomplete sampling)
- multi-domain proteins / mosaics
- horizontal transfer (xenologs)
- high rates of sequence divergence
- poor genome assembly / annotation
- computational demand
What is resequencing?
Resequencing is typically performed when a reference genome sequence is available.
Sequencing reads are aligned back to the reference to determine the location in the genome the specific read best matches.
Only works for same or closely related species
What is a disadvantage of re-sequencing projects compared to independently (de novo) assembled genomes?
Re-sequencing against a reference genome can lead to reference bias: reads that match the reference align more readily than reads carrying divergent alleles, so divergent variation is under-represented.
What can genome-scale phylogenetic analysis result in when carried out on one gene, entire genomes, or many genes (genomic windows)?
one gene –> one tree
entire genomes –> one tree
many genes
- if you concatenate the genes –> one tree
- separate analyses of each gene –> many trees
What is phylogenetic incongruence?
gene/locus/window trees
- are different from each other
- (are different from the known/expected species tree)
Phylogenetic incongruence - technical explanation?
- insufficient taxon sampling
- orthology mis-assignment
- misalignment
- excessive trimming
- inappropriate model, …
Phylogenetic incongruence - biological explanation? name 5
different genome regions have different evol. histories!
- incomplete lineage sorting / deep coalescence
- hybridization or introgression
- horizontal gene transfer (HGT)
- differential duplication and loss
- natural selection
Define incomplete lineage sorting / deep coalescence
A cause of phylogenetic incongruence - can lead to gene tree ≠ gene tree ≠ species tree
allelic polymorphisms exist across speciation events
alleles coalesce first with alleles from more distantly related species
Random sorting of ancestral polymorphisms:
Anything other than perfect segregation of all alleles into all lineages is called “incomplete lineage sorting” – and for a large genome, it is a given that at least some genes will exhibit this effect.
Incomplete lineage sorting, also termed hemiplasy, deep coalescence, retention of ancestral polymorphism, or trans-species polymorphism, is a phenomenon in population genetics in which ancestral gene copies fail to coalesce (looking backwards in time) into a common ancestral copy until further back than the preceding speciation event.
Consider an ancestral polymorphism: in the common ancestor we have three alleles A, B, C, then two speciation events resulting in the following species tree for the corresponding species: ((A,B)C)
What are the three possible gene trees according to when the alleles coalesced?
What is the expected frequency of gene trees discordant with the species tree? (UNLESS …?)
- ((A,B)C)
- ((A,C)B) –> discordant
- ((B,C)A) –> discordant
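If the A and B lineages have not coalesced by the time they reach the ancestor of all three species, each gene tree is equally likely (1/3), so a discordant tree then has probability 2/3. Under incomplete lineage sorting alone, the two discordant trees are expected at equal frequency.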
UNLESS there was gene flow
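The expected gene-tree frequencies can be sketched with the standard three-taxon coalescent result. With an internal branch of length T (in coalescent units) between the two speciation events, the A and B lineages fail to coalesce within that branch with probability e^(-T), after which all three trees are equally likely. The function name and the T values below are illustrative assumptions, and the sketch assumes no gene flow.

```python
import math


def gene_tree_probabilities(t):
    """Gene-tree probabilities for species tree ((A,B)C), no gene flow.

    t is the internal branch length in coalescent units (assumed >= 0).
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    p_no_coalescence = math.exp(-t)  # A and B lineages both survive the internal branch
    return {
        "((A,B)C)": 1.0 - (2.0 / 3.0) * p_no_coalescence,  # concordant
        "((A,C)B)": p_no_coalescence / 3.0,                # discordant
        "((B,C)A)": p_no_coalescence / 3.0,                # discordant
    }


for t in (0.0, 0.5, 2.0):
    probs = gene_tree_probabilities(t)
    discordant = probs["((A,C)B)"] + probs["((B,C)A)"]
    print(f"T = {t}: P(discordant) = {discordant:.3f}")
```

The 2/3 figure is the limit for a very short internal branch (T close to 0). For any T the two discordant trees stay equally frequent, which is the symmetry that gene flow breaks.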
Define gene flow
Gene flow (aka gene migration)
- transfer of genetic material from one population to another
- between two pops of closely related species (or lineages) or within the same species
- mediated by reproduction and vertical gene transfer from parent to offspring.
- ancient or recent / rare or ongoing
- a lot more frequent than initially thought!
Potential outcomes of gene flow?
- nothing
- merge into 1 species
- invasion of 1 species
- form hybrid zone
- form new hybrid species
most important for us: exchange few genes = introgression
Define introgression
(aka introgressive hybridization; occurs through repeated backcrossing)
gene flow between closely related species (lineages)
- ancient or recent / rare or ongoing
- a lot more frequent than initially thought!
- hybrids are rare
- they backcross with parental species
- parental species remain distinct
How can introgression be detected? Most important method for us?
using genomic data
- trees that are discordant from species tree?
- tests to identify introgressed genomic regions, direction or amount of gene flow
- …
Most important for us:
- excess of shared alleles between hybridizing taxa
- D-statistic / ABBA-BABA test, f-statistics
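A minimal sketch of the ABBA-BABA count behind the D-statistic, assuming taxa ordered as (((P1,P2),P3),Outgroup), one allele per taxon per site, and the outgroup carrying the ancestral allele A. D = (ABBA - BABA) / (ABBA + BABA). The function name and the toy sequences are hypothetical.

```python
def d_statistic(p1, p2, p3, outgroup):
    """Return (D, n_abba, n_baba) from four aligned sequences of equal length."""
    abba = baba = 0
    for a1, a2, a3, anc in zip(p1, p2, p3, outgroup):
        if a1 == anc and a2 == a3 and a2 != anc:
            abba += 1  # P2 and P3 share the derived allele
        elif a2 == anc and a1 == a3 and a1 != anc:
            baba += 1  # P1 and P3 share the derived allele
    if abba + baba == 0:
        raise ValueError("no informative ABBA or BABA sites")
    return (abba - baba) / (abba + baba), abba, baba


# Toy alignment, one column per site (made-up data).
p1 = "ACGATA"
p2 = "GTAACA"
p3 = "GTGGCA"
out = "ACAATA"

d, n_abba, n_baba = d_statistic(p1, p2, p3, out)
print(d, n_abba, n_baba)  # 0.5 3 1
```

With no gene flow, ABBA and BABA sites are expected in equal numbers, so D is near 0. A positive D indicates an excess of alleles shared between P2 and P3, and a negative D an excess shared between P1 and P3. Real analyses use genome-wide counts or allele frequencies and assess significance with a resampling scheme such as a block jackknife, which this sketch omits.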