14 | Large scale analyses Flashcards by Stevie Davies

Define population genetics.

subfield of genetics
part of evolutionary biology
deals with genetic differences within and among pops
examine phenomena eg: adaptation, speciation, population structure

from 1920s on
(Fisher, Haldane, Wright)
* formal models of evolution, statistical methods
* allele and gene frequencies in populations over time
* data: few genotypes of a limited number of individuals

Define phylogenomics

Ultimate goal: reconstruct the evolutionary history of species through their genomes

Intersection of the fields of evolution and genomics

Analysis that involves genome data and evolutionary reconstructions.

What is molecular genetics / genomics about?

What type of data?

from 1950s on
(Watson & Crick)?
* makeup, expression, regulation of genes;
genotype-phenotype
* data: gene & genome sequences, phenotypes

Define Population Genomics.

Data used?
Concepts and tools?
Questions?

large-scale application of genomic technologies to study populations of individuals
data: multiple genomes from the same (or closely
related) species; thousands / millions of SNPs per
individual
studies genome-wide effects to improve our understanding of microevolution –> learn phylogenetic history and demography of a pop

concepts & tools
- linkage disequilibrium, genetic drift, coalescent,
multivariate statistics, …

questions: population structure & history; detect evolutionary processes along the genome
- gene & genome evolution, recombination events, split times, gene flow, population sizes, demographic events, selection, diversification, relatedness, pop. structure, etc
- contemporary & ancestral populations/species

Genome-scale evolutionary analysis

Based on which regions? (two options)

In both cases: several gene sequences (can’t be one like in the previous small scale analyses we studied)

based on coding regions:
several (or all) gene sequences from different species –> families of homologs / orthologs
- often one individual per species
- homology/orthology assignment required

based on any/all genomic regions:
independent of gene content –> homologous genomic regions –> alignments and phylogenies
- generally done based on re-sequencing data
- separate homology assignment step not necessary
- same or closely related species

Which step in the pipeline sets large scale analyses apart from the small scale analyses we studied previously?

Just this step is done differently and makes large scale more difficult:

Assigning sequences (genomic data from multiple species) to families of homologous/orthologous genes

What are the three general approaches to Genome-scale inference of homology / orthology?

And what are the possible types of data? (plus examples)

Approaches:
- Tree based
- Graph based
- Hybrid

Data:
- Databases: use pre-computed sets of homologs / orthologs (e.g. TreeFam, OMA)
- Customized data set: compute project-specific homologs / orthologs

What is an example of a database which can be used for tree-based genome scale inference of homology/orthology?

TreeFam

What are the two types of graph-based genome scale inference of homology/orthology?

Name an example database for each type.

graph-based
- reciprocal best hit (RBH): COG
- clustering (MCL): OrthoMCL
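
The reciprocal-best-hit idea can be sketched in a few lines. This is only a toy illustration, not how COG or any real database is built: the function name `reciprocal_best_hits`, the nested-dict input format, and the scores are all made up for the example (in practice the scores would come from an all-against-all search).

```python
def reciprocal_best_hits(scores_ab, scores_ba):
    """Return gene pairs (a, b) that are each other's best-scoring hit.

    scores_ab[a][b] is the (hypothetical) similarity score of gene a from
    genome A searched against gene b from genome B; scores_ba is the
    reverse search.
    """
    # Best hit in the other genome for every query gene that has any hit.
    best_ab = {a: max(hits, key=hits.get) for a, hits in scores_ab.items() if hits}
    best_ba = {b: max(hits, key=hits.get) for b, hits in scores_ba.items() if hits}
    # Keep only pairs where the best hit is mutual.
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Made-up scores: a3 hits b1, but b1's best hit is a1, so (a3, b1) is dropped.
scores_ab = {
    "a1": {"b1": 250.0, "b2": 90.0},
    "a2": {"b2": 120.0, "b1": 80.0},
    "a3": {"b1": 200.0},
}
scores_ba = {
    "b1": {"a1": 245.0, "a3": 190.0, "a2": 75.0},
    "b2": {"a2": 118.0, "a1": 88.0},
}
pairs = reciprocal_best_hits(scores_ab, scores_ba)
print(pairs)
```

Only mutual best hits survive, so genes such as a3 (a possible paralog of a1) are left unassigned rather than paired.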

What is usually the best approach for genome-scale inference of homology / orthology?

Name a database that can be used for this

A hybrid approach combining graph-based and tree-based methods.

For example OMA (Orthologous MAtrix)

What is OMA?

OMA (Orthologous MAtrix) algorithm & database

best strategy for large-scale assignment of orthologs/homologs
using a combination of graph- and tree-based methods
multi-step pipeline
pre-computed (and cross-linked from other dbs)
- 1:1 orthologs
- homologs (orthologs & paralogs)
stand-alone software
companion tools (analysis, visualization)
HOGs (hierarchical orthologous groups)

–> large-scale analysis and assignment of homology is really difficult: there are many strategies, results are often incorrect, and it is time-consuming
–> rather go with pre-existing (pre-computed) assignments

What are seven challenges of assigning homology / orthology?

pairwise orthology definition (non-transitive)
differential gene loss (or incomplete sampling)
multi-domain proteins / mosaics
horizontal transfer (xenologs)
high rates of sequence divergence
poor genome assembly / annotation
computational demand

What is resequencing?

Resequencing is typically performed when a reference genome sequence is available.

Sequencing reads are aligned back to the reference to determine the location in the genome the specific read best matches.

Only works for same or closely related species

What is a disadvantage of re-sequencing projects compared to independently (de novo) assembled genomes ?

Re-sequencing against a reference genome can lead to reference bias.

What can genome-scale phylogenetic analysis result in when carried out on one gene, entire genomes, or many genes (genomic windows)?

one gene –> one tree

entire genomes –> one tree

many genes
- if you concatenate the genes –> one tree
- separate analyses of each gene –> many trees

What is phylogenetic incongruence?

gene/locus/window trees
- are different from each other
- (are different from the known/expected species tree)

Phylogenetic incongruence - technical explanation?

insufficient taxon sampling
orthology mis-assignment
misalignment
excessive trimming
inappropriate model, …

Phylogenetic incongruence - biological explanation? Name five.

different genome regions have different evol. histories!
- incomplete lineage sorting / deep coalescence
- hybridization or introgression
- horizontal gene transfer (HGT)
- differential duplication and loss
- natural selection

Define incomplete lineage sorting / coalescence

A cause of phylogenetic incongruence - can lead to gene tree ≠ gene tree ≠ species tree

allelic polymorphisms exist across speciation events

alleles coalesce first with alleles from more distantly
related species

Random sorting of ancestral polymorphisms:
Anything other than perfect segregation of all alleles into all lineages is called “incomplete lineage sorting” – and for a large genome, it is a given that at least some genes will exhibit this effect.

Also termed hemiplasy, deep coalescence, retention of ancestral polymorphism, or trans-species polymorphism. It describes a phenomenon in population genetics in which ancestral gene copies fail to coalesce (looking backwards in time) into a common ancestral copy until deeper than the previous speciation events.

Consider an ancestral polymorphism: the common ancestor carries three alleles A, B, C; two speciation events then result in the following species tree for the corresponding species: ((A,B),C)

What are the three possible gene trees according to when the alleles coalesced?

What is the expected frequency of gene trees discordant with the species tree? (UNLESS …?)

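((A,B),C) –> concordant
((A,C),B) –> discordant
((B,C),A) –> discordant

If none of the alleles coalesce before the common ancestor of all three species, each gene tree is equally likely (1/3), so the probability of a discordant tree is 2/3. The two discordant trees are expected at equal frequencies.

UNLESS there was gene flow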
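
The 2/3 figure is the limiting case of a standard coalescent result for three species: with an internal branch of length t (in coalescent units, i.e. generations divided by 2N for a diploid population), the two lineages of the sister species fail to coalesce on that branch with probability exp(-t). A minimal sketch, assuming no gene flow and a constant population size; the function name `gene_tree_probabilities` is made up for this note.

```python
import math


def gene_tree_probabilities(t):
    """Gene tree probabilities for the species tree ((A,B),C).

    t is the length of the internal branch in coalescent units
    (assumption: pure incomplete lineage sorting, no gene flow).
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    p_no_coalescence = math.exp(-t)  # A and B lineages stay separate on the branch
    each_discordant = p_no_coalescence / 3.0  # three lineages then join at random
    return {
        "((A,B),C)": 1.0 - 2.0 * each_discordant,
        "((A,C),B)": each_discordant,
        "((B,C),A)": each_discordant,
    }


for t in (0.0, 0.5, 2.0):
    probs = gene_tree_probabilities(t)
    discordant = probs["((A,C),B)"] + probs["((B,C),A)"]
    print(f"t = {t}: P(discordant) = {discordant:.3f}")
```

With t = 0 all three trees have probability 1/3; longer internal branches make discordance rarer, but the two discordant trees stay equally frequent. An asymmetry between them is the signal that tests for gene flow look for.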

Define gene flow

Gene flow (aka gene migration)

transfer of genetic material from one population to another
between populations of the same species, or between populations of closely related species (or lineages)
mediated by reproduction and vertical gene transfer from parent to offspring.
ancient or recent / rare or ongoing
a lot more frequent than initially thought!

Potential outcomes of gene flow?

nothing
merge into 1 species
invasion of 1 species
form hybrid zone
form new hybrid species

most important for us: exchange few genes = introgression

Define introgression

(aka introgressive hybridization; arises through repeated backcrossing)

gene flow between closely related species (lineages)
- ancient or recent / rare or ongoing
- a lot more frequent than initially thought!
- hybrids are rare
- they backcross with parental species
- parental species remain distinct

How can introgression be detected? Most important method for us?

using genomic data
- trees that are discordant from species tree?
- tests to identify introgressed genomic regions, direction or amount of gene flow
- …

Most important for us:
- excess of shared alleles between hybridizing taxa
- D-statistic / ABBA-BABA test, f-statistic

What is the ABBA BABA test / statistic (D-Statistic)? (research more)

ABBA BABA statistics (also called D statistics)
- simple and powerful test for a deviation from a strictly bifurcating evolutionary history
- frequently used to test for introgression using genome-scale SNP data
- developed to quantify the amount of genetic exchange between Neanderthals and modern humans

An excess of either ABBA or BABA, resulting in a D-statistic that is significantly different from zero, is indicative of gene flow between two taxa.
- positive D-statistic (i.e. an excess of ABBA): points to introgression between P2 and P3
- negative D-statistic (i.e. an excess of BABA): points to introgression between P1 and P3

Explain the D-statistic and the ABBA-BABA test with an example. (research more)

Consider four taxa P1, P2, P3, and O (outgroup) with the following species tree: (((P1,P2),P3),O)

At each site, a taxon carries either the ancestral (‘A’) or the derived (‘B’) allele. The informative sites are those where:
- the outgroup has A
- P3 has B
- P1 and P2 differ: one has A, the other has B

This gives two site patterns (alleles in the order P1, P2, P3, O), each matching a gene tree that is discordant with the species tree:
- ABBA --> (((P2,P3),P1),O)
- BABA --> (((P1,P3),P2),O)

Comparing the genome-wide frequencies of the ABBA and BABA patterns shows whether introgression has taken place between P3 and either P1 or P2.

if D = 0:
- equal frequencies of ABBA and BABA
- only incomplete lineage sorting, no introgression

if D ≠ 0 (significantly):
- introgression has taken place
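
A minimal sketch of the count-based D-statistic, D = (nABBA - nBABA) / (nABBA + nBABA), assuming one sampled allele per taxon and site. The function name `d_statistic`, the string encoding of site patterns (order P1, P2, P3, O), and the counts below are invented for illustration; a real analysis would also need a significance test (typically a block jackknife), which is not shown here.

```python
from collections import Counter


def d_statistic(site_patterns):
    """Compute D from per-site allele patterns.

    Each pattern is a 4-character string giving the allele in
    P1, P2, P3 and the outgroup O ('A' = ancestral, 'B' = derived).
    Returns (D, n_abba, n_baba).
    """
    counts = Counter(site_patterns)
    n_abba, n_baba = counts["ABBA"], counts["BABA"]
    if n_abba + n_baba == 0:
        raise ValueError("no ABBA or BABA sites: D is undefined")
    return (n_abba - n_baba) / (n_abba + n_baba), n_abba, n_baba


# Made-up data: BBAA and AABA sites are not informative for D and are ignored.
sites = ["ABBA"] * 30 + ["BABA"] * 10 + ["BBAA"] * 60 + ["AABA"] * 5
d, n_abba, n_baba = d_statistic(sites)
print(d, n_abba, n_baba)  # 0.5 30 10
```

Here D is positive (an excess of ABBA), which would point to gene flow between P2 and P3 if the deviation from zero were significant.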

one genome = one phylogeny? one single genome/species tree??

different genome regions have different evolutionary histories
different gene/locus/window trees can differ from each other and from the species/organismal tree

Phylogenetic incongruence: challenge or opportunity?

challenge!
- assumptions about species evolution

opportunity!
- a tool to learn about the evolution of lineages & their genomes

Genome-level evolutionary analyses in R?

Many libraries have been developed, e.g. to:
- identify species tree nodes affected by gene flow
- identify admixed genomic regions
- identify direction of admixture
- determine relative age of gene flow
- (graphically) summarize discordance
- …

What method did we learn which can be used to evaluate incongruent trees ?

D-Statistic / ABBA-BABA

EXAM (2019, 2020) List three biological reasons for which we may get incongruences in gene trees. Explain one of them, and how it is reflected in the tree.

- incomplete lineage sorting / deep coalescence
- hybridization or introgression
- horizontal gene transfer (HGT)
- differential duplication and loss
- natural selection

still to do: how is it reflected in the tree?

EXAM (2020) Given a multi-FASTA file of homolog sequences, how do you extract orthologs and paralogs?

All-against-all comparisons
- based on score & length criteria
--> homologs (candidate pairs)

Formation of stable pairs
- analysis within and between genomes
- pairwise & multiple sequence comparisons
- ML evolutionary distances
- protein similarity graph, clustering
--> putative orthologs (stable pairs)

Verification of stable pairs
- compare with third genome: check for hidden paralogs, differential loss
- use species tree information
- graph theoretic approaches
--> orthologs (verified pairs)

Is orthology transitive?

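No. Orthology is defined pairwise, so it is non-transitive: two genes can each be orthologous to a third gene and still be paralogous to each other (e.g. the two copies from a lineage-specific duplication are both orthologs of the single copy in another species, but paralogs of one another).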

Explain tree-based vs graph-based approaches for inferring orthology

Graph-based
- rely on graphs with genes as nodes and evolutionary relationships as edges
- infer whether edges represent orthology or paralogy
- build clusters of genes on the basis of the graph

Tree-based
- gene/species tree reconciliation
- annotate all splits of a given gene tree as duplication or speciation, given the phylogeny of the relevant species
- reconciled tree --> trivial to derive all pairs of orthologous and paralogous genes
- gene pairs that coalesce at a speciation node = orthologs
- gene pairs that split at a duplication node = paralogs
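
The last two tree-based points can be illustrated with a toy gene tree. Note that this sketch does not perform a full gene/species tree reconciliation: it uses the simpler species-overlap heuristic (a split is called a duplication when the same species occurs on both sides, otherwise a speciation). The nested-tuple tree format, the "species:gene" leaf labels, and the function names are assumptions made for the example.

```python
from itertools import product


def species_of(leaf):
    """Leaf labels are assumed to look like 'species:gene'."""
    return leaf.split(":")[0]


def classify_pairs(tree):
    """Label all gene pairs in a rooted binary gene tree.

    tree is a nested 2-tuple with string leaves. Returns two sets of
    frozenset pairs: (orthologs, paralogs).
    """
    orthologs, paralogs = set(), set()

    def visit(node):
        if isinstance(node, str):
            return [node]
        left, right = visit(node[0]), visit(node[1])
        shared = {species_of(x) for x in left} & {species_of(x) for x in right}
        # Same species on both sides: treat the split as a duplication.
        target = paralogs if shared else orthologs
        for a, b in product(left, right):
            target.add(frozenset((a, b)))
        return left + right

    visit(tree)
    return orthologs, paralogs


# Toy tree: a duplication (copies a and b) before sp1 and sp2 split,
# with sp3 carrying a single copy.
gene_tree = ((("sp1:a", "sp2:a"), ("sp1:b", "sp2:b")), "sp3:c")
orthologs, paralogs = classify_pairs(gene_tree)
print(len(orthologs), len(paralogs))  # 6 4
```

The example also shows non-transitivity: sp1:a and sp1:b are both orthologs of sp3:c, yet they are paralogs of each other.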

14 | Large scale analyses Flashcards

(34 cards)