Bioinformatics - Final Exam Content Flashcards by Raegen Esenwein

What are the differences between substitution models?

substitution models differ in which parameters they include: the simplest just count the observed differences (hamming distance), others correct for unobserved mutations, some treat transitions and transversions differently, and others add a proportion of invariable sites or gamma distributed rate variation among sites

What parameters are included in substitution models?

transitions vs transversions
hamming distance
jukes and cantor distance (correcting for unobserved mutations)
equal/unequal base frequencies
proportion of invariable sites
gamma distributed rate variation among sites

How do you find the best substitution model?

the best thing to do is test ALL models and find the one that best fits your sequence data; this is done under the maximum likelihood framework, choosing the model with the lowest BIC or AIC value (lower is better for both)
after the model is chosen you also want to include a bootstrap analysis

What are the steps to finding the best Tree?

do a tree search under each model
calculate the maximum likelihood score of the best tree for each model
compare them using BIC or AIC scores, which are estimators of relative quality of statistical models
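
The comparison step can be sketched in a few lines. This is a minimal illustration, not the output of any particular phylogenetics program: the model names, log-likelihoods, and parameter counts below are hypothetical, and using the number of alignment sites as the BIC sample size is an assumption. It uses the standard definitions AIC = 2k - 2lnL and BIC = k ln(n) - 2lnL, where lower scores are better for both.

```python
import math

def aic(log_likelihood: float, num_params: int) -> float:
    """Akaike information criterion: 2k - 2lnL (lower is better)."""
    return 2 * num_params - 2 * log_likelihood

def bic(log_likelihood: float, num_params: int, num_sites: int) -> float:
    """Bayesian information criterion: k*ln(n) - 2lnL (lower is better)."""
    return num_params * math.log(num_sites) - 2 * log_likelihood

def best_model(fits: dict, num_sites: int, criterion: str = "BIC") -> str:
    """Return the name of the model with the lowest score.

    fits maps a model name to (log-likelihood of its best tree, number of parameters).
    """
    scores = {}
    for name, (lnl, k) in fits.items():
        if criterion == "BIC":
            scores[name] = bic(lnl, k, num_sites)
        else:
            scores[name] = aic(lnl, k)
    return min(scores, key=scores.get)

# Hypothetical results of a tree search under three models (illustration only).
fits = {
    "model_A": (-1250.0, 1),
    "model_B": (-1240.0, 5),
    "model_C": (-1239.5, 10),
}
print(best_model(fits, num_sites=500, criterion="AIC"))
print(best_model(fits, num_sites=500, criterion="BIC"))
```

With these made-up numbers the two criteria can pick different models, because BIC penalizes extra parameters more heavily than AIC.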

How do phylogenetic approaches provide insight on evolution?

phylogeny - compare phylogenies to biogeography and major paleoecological events
evolutionary processes - pattern heterogeneity and selection ratios (dN/dS)

How do we use the Disparity Index (I) to estimate pattern heterogeneity?

a common WRONG assumption is that sequences evolve homogeneously (under the same conditions and processes)
we know that sequences evolve differently depending on location and selective pressures
we measure pattern heterogeneity via the disparity index
the disparity index identifies pairs of sequences that evolved under substantially different evolutionary processes

What is the basis for dN/dS ratio tests?

it is a means to test whether selection is occurring: substitution rate outliers will include sequences which affect an organism's ability to survive and reproduce, so substitution patterns reflect selection, and dN/dS is the best measure we have for this

How do you interpret I (disparity index) statistics?

I = 0 means the sequences evolved under the same processes and pressures
I > 0 means the sequences evolved under different processes and pressures

how do you interpret dN/dS statistics?

dN/dS = 1 : neutral, not undergoing selection
dN/dS > 1 : positive selection, amino acid changes are beneficial and favored
dN/dS < 1 : purifying selection, amino acid changes are harmful and are removed, so sites stay conserved
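
As a rough sketch of this interpretation, the function below turns a dN and dS estimate into one of the three labels. The function name and the `tolerance` band around 1 are hypothetical choices made for illustration; real analyses decide this with a statistical test rather than a fixed cutoff.

```python
def interpret_dn_ds(dn: float, ds: float, tolerance: float = 0.05):
    """Return (dN/dS ratio, label) using the rules from the card.

    tolerance is an illustrative assumption: ratios within this distance
    of 1 are reported as neutral.
    """
    if ds <= 0:
        raise ValueError("dS must be positive to form the ratio")
    ratio = dn / ds
    if abs(ratio - 1.0) <= tolerance:
        return ratio, "neutral"
    if ratio > 1.0:
        return ratio, "positive selection"
    return ratio, "purifying selection"

# Hypothetical per-site rates for three genes (illustration only).
for dn, ds in [(0.02, 0.10), (0.10, 0.10), (0.30, 0.10)]:
    print(interpret_dn_ds(dn, ds))
```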

Transition

a change between A and G or between C and T
- in other words these are substitutions which are more likely to happen because a purine stays a purine and a pyrimidine stays a pyrimidine

Transversion

a change between A and C, A and T, G and C, or G and T
- these are substitutions which happen less frequently and are more serious because they change a purine to a pyrimidine or vice versa
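
The transition/transversion distinction is easy to check in code. A minimal sketch, assuming DNA bases only (the function name is hypothetical):

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(base_from: str, base_to: str) -> str:
    """Classify a single-base change as 'transition', 'transversion', or 'none'."""
    a, b = base_from.upper(), base_to.upper()
    valid = PURINES | PYRIMIDINES
    if a not in valid or b not in valid:
        raise ValueError("bases must be one of A, C, G, T")
    if a == b:
        return "none"
    # Same chemical class (purine<->purine or pyrimidine<->pyrimidine) is a transition.
    same_class = (a in PURINES) == (b in PURINES)
    return "transition" if same_class else "transversion"

print(classify_substitution("A", "G"))
print(classify_substitution("A", "C"))
```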

Hamming Distance (Dh)

the simplest approach to modeling substitutions: count the number of sites that differ and divide by the alignment length
Dh = n / N
n is the number of sites which are different
N is the length of the alignment
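
A minimal sketch of Dh = n / N for two aligned sequences of equal length (the function name and example sequences are made up for illustration):

```python
def hamming_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites, Dh = n / N, for two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    n = sum(1 for a, b in zip(seq_a, seq_b) if a != b)  # differing sites
    return n / len(seq_a)                               # divide by alignment length N

print(hamming_distance("ACGTACGT", "ACGTTCGA"))
```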

Jukes and Cantor (1969)

a distance model which corrects for unobserved mutations (multiple substitutions at the same site)
$D_{JC1969} = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$
p = the proportion of sites which differ between sequences
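
A small sketch of the Jukes and Cantor correction, assuming p has already been computed as the proportion of differing sites. The formula is undefined once p reaches 3/4, so the function (a hypothetical name) rejects such values.

```python
import math

def jukes_cantor_distance(p: float) -> float:
    """Jukes-Cantor (1969) distance: -(3/4) * ln(1 - (4/3) * p)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)

# The corrected distance is always at least as large as the raw proportion p.
for p in (0.0, 0.1, 0.25, 0.5):
    print(p, jukes_cantor_distance(p))
```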

Distance (phylogenetic tree sense)

essentially it is a measure of how different the sequences in the alignment are, taking into account the differences or substitutions which have occurred

Proportion of invariable sites (I)

a parameter to significantly improve models
(I) is the extent of static, unchanging sites in the dataset

gamma distribution (G)

a parameter to significantly improve models
indicates a gamma distributed rate variation among sites

BIC value

Bayesian information criterion (lowest scored model is best)

AIC value

Akaike information criterion (lowest scored model is best)

pattern heterogeneity

if two sequences evolved under the same processes their nucleotide composition will be similar, however if they evolved under separate pressures their nucleotide composition will reflect that

dN/dS ratio

a highly important and common approach for testing if selection has occurred
nonsynonymous subs per site / synonymous subs per site
= 1 : neutral, not undergoing selection
> 1 : positive selection, amino acid changes are beneficial and favored
< 1 : purifying selection, amino acid changes are harmful and are removed, so sites stay conserved

disparity index (I)

the observed difference in evolutionary patterns for a pair of sequences, based on nucleotide composition
- $I = \frac{1}{2}\sum_i (x_i - y_i)^2 - N_d$
- $x_i$ = composition of the ith nucleotide in the first sequence
- $y_i$ = composition of the ith nucleotide in the second sequence
- $N_d$ = composition distance expected under homogeneity
values associated w disparity index:
I = 0 -> same evolutionary pressures
I > 0 -> different evolutionary pressures
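
A rough sketch of the calculation for one pair of aligned DNA sequences. Two things here are assumptions that the card does not state: the composition terms are taken as raw nucleotide counts, and Nd is estimated by the number of observed differences between the two sequences. The function name is hypothetical, and the raw value is returned without any per-site scaling or significance test.

```python
from collections import Counter

def disparity_index(seq_x: str, seq_y: str) -> float:
    """Half the summed squared composition difference, minus Nd.

    Assumptions (not from the card): compositions are nucleotide counts and
    Nd is the number of observed differences between the aligned sequences.
    """
    if len(seq_x) != len(seq_y):
        raise ValueError("sequences must be aligned to the same length")
    counts_x = Counter(seq_x.upper())
    counts_y = Counter(seq_y.upper())
    composition_distance = 0.5 * sum(
        (counts_x[base] - counts_y[base]) ** 2 for base in "ACGT"
    )
    n_d = sum(1 for a, b in zip(seq_x.upper(), seq_y.upper()) if a != b)
    return composition_distance - n_d

print(disparity_index("ACGTACGT", "ACGTACGT"))  # identical composition, no differences
print(disparity_index("AAAA", "CCCC"))          # strongly different composition
```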

neutral theory of molecular evolution (Kimura 1968)

most mutations are neutral or "nearly neutral"
it is a basic principle that differences in fecundity lead to natural selection and fixation of mutations
substitution patterns reflect selection

synonymous

sub where the amino acid will stay the same
more likely to be neutral

nonsynonymous

sub where the amino acid will change
more likely to change phenotype
positive selection may result from a beneficial change in phenotype

neutrality

- dN/dS ratio = 1, where dN and dS are equal, indicates no selection happening

positive selection

- when the dN/dS ratio > 1 - a mutation is beneficial so selection is occurring in favor of that mutation

purifying selection

- when the dN/dS ratio < 1 - a mutation is detrimental so selection removes it and keeps the site conserved in the population

What is the perspective that molecular genetics uses to examine variation?

molecular evolution/genetics focuses on fixed differences between species

What is the perspective that population genetics uses to examine variation?

population genetics focuses on the differences between populations of one species - so like how does a mountain range separating two populations of the same species affect how those populations have evolved

What parameters are estimated in population genetics?

gene pool, allele frequency, genotype frequency - these population parameters will affect the gene pool in a predicted way

what are the basics of the hardy weinberg equilibrium (HWE)?

- extending Mendel's laws of inheritance to populations yields HWE
- when gametes containing either of two alleles, A or a, unite at random to form the next generation, the genotype frequencies in the offspring (zygotes) are AA : Aa : aa ($p^2 : 2pq : q^2$)
- genotype frequencies are maintained by allele frequencies

what are the assumptions of the HWE?

allele frequencies will remain constant over time if these assumptions are met:
- random mating
- infinite population size
- no migration
- no selection
- no mutation
violations of these assumptions have predictable effects on allele and genotype frequencies

how does violating assumptions of HWE affect parameters?

inbreeding - decreases heterozygosity, so genotype frequencies change but allele frequencies do not, can lead to heritable diseases
genetic drift (small pop) - allele frequencies drift randomly until we converge on one allele type (fixation), but which allele becomes fixed is random
migration - may lead to admixture, combining two or more pops w different allele frequencies into one group
selection - may be recessive, dominant, or additive, a certain allele becomes fixed in a population
mutation - randomly changes genotype
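
To make the genetic drift case concrete, here is a small simulation sketch. It uses a simple Wright-Fisher-style resampling scheme, which is an assumption brought in for illustration (the card does not name a model): each generation, the 2N allele copies are redrawn at random from the previous generation's allele frequency. The function name and parameter values are hypothetical.

```python
import random

def simulate_drift(p0: float, pop_size: int, generations: int, seed: int = 0) -> list[float]:
    """Track the frequency of allele A in a finite diploid population under drift alone."""
    if not 0.0 <= p0 <= 1.0:
        raise ValueError("p0 must be between 0 and 1")
    if pop_size <= 0 or generations < 0:
        raise ValueError("pop_size must be positive and generations non-negative")
    rng = random.Random(seed)
    copies = 2 * pop_size          # diploid: two allele copies per individual
    p = p0
    trajectory = [p]
    for _generation in range(generations):
        # Each copy in the next generation is A with probability p.
        count_a = sum(1 for _copy in range(copies) if rng.random() < p)
        p = count_a / copies
        trajectory.append(p)
    return trajectory

small = simulate_drift(p0=0.5, pop_size=10, generations=50)
large = simulate_drift(p0=0.5, pop_size=1000, generations=50)
print(small[-1], large[-1])
```

In runs like this, small populations tend to wander much further from the starting frequency than large ones, and once the frequency hits 0 or 1 it stays there (fixation).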

what can you estimate if you know allele and genotype frequencies?

we can go backwards and guess which assumptions were violated
- inbreeding rates
- population sizes
- effective population size (number of breeders)
- migration/dispersal
- population structure/gene flow
- recent changes in population sizes
- selection coefficients
- genotype-phenotype associations

how do we estimate population structure with fixation index (Fst) values?

- we look at how alleles are distributed among vs within populations
- Fst is an estimate of the genetic divergence between populations
- Fst = AP / (WI + AI + AP)
AP = estimated variance in allele frequencies Among Populations
WI = estimated variance in allele frequencies Within Individuals
AI = estimated variance in allele frequencies Among Individuals
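
The Fst formula is just a ratio of variance components, which a short sketch makes explicit. The function name and the example variance values are hypothetical; estimating the components themselves from genotype data is a separate step not shown here.

```python
def fixation_index(ap: float, ai: float, wi: float) -> float:
    """Fst = AP / (WI + AI + AP), from already-estimated variance components.

    ap: variance among populations
    ai: variance among individuals
    wi: variance within individuals
    """
    total = ap + ai + wi
    if total <= 0:
        raise ValueError("total variance must be positive")
    return ap / total

# Hypothetical variance components (illustration only).
print(fixation_index(ap=0.1, ai=0.2, wi=0.7))
```

A value near 0 means almost all variation is within populations, while a larger value means more of the variation lies among populations.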

what are microsatellites?

short repeats found within a species and certain populations may have varying numbers of these repeats - short segment of DNA, usually one to six or more base pairs in length, that is repeated multiple times in succession at a particular genomic location

how are microsatellites genotyped?

- obtain primers for the microsatellite
- PCR
- fragment analysis - see how big the pieces are to determine how many repeats they have
- genotype

populations

group of individuals of one species living in the same geographical area

subpopulations

local populations within which most individuals find their mates

gene pool

all genetic variation within a population

allele

variant at a locus, comes from a mutation

locus

independent location on a chromosome, can be a gene

allele frequency

proportion of any specific allele in a population

genotype frequency

proportion of individuals in a population with a specific genotype (in diploids, the genotype is the combination of the two alleles in an individual, heterozygous or homozygous)
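
Allele and genotype frequencies are linked by simple counting, sketched below for one locus with two alleles in a diploid population. The function names and genotype counts are made up for illustration.

```python
def genotype_frequencies(n_AA: int, n_Aa: int, n_aa: int) -> dict:
    """Proportion of individuals carrying each genotype."""
    total = n_AA + n_Aa + n_aa
    if total <= 0:
        raise ValueError("need at least one individual")
    return {"AA": n_AA / total, "Aa": n_Aa / total, "aa": n_aa / total}

def allele_frequencies(n_AA: int, n_Aa: int, n_aa: int) -> tuple:
    """Return (p, q): the frequencies of alleles A and a.

    Each AA individual carries two copies of A and each Aa carries one.
    """
    total = n_AA + n_Aa + n_aa
    if total <= 0:
        raise ValueError("need at least one individual")
    p = (2 * n_AA + n_Aa) / (2 * total)
    return p, 1.0 - p

# Hypothetical sample of 100 individuals.
print(genotype_frequencies(25, 50, 25))
print(allele_frequencies(25, 50, 25))
```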

Hardy Weinberg equilibrium

when gametes containing either of two alleles, A or a, unite at random to form the next generation, the genotype frequencies in the offspring (zygotes) are AA : Aa : aa (also $p^2 : 2pq : q^2$) and p + q = 1
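
A minimal sketch of the expected genotype frequencies under Hardy Weinberg equilibrium, given only the frequency p of allele A (the function name is hypothetical):

```python
def hwe_genotype_frequencies(p: float) -> dict:
    """Expected AA : Aa : aa frequencies, p^2 : 2pq : q^2, with q = 1 - p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}

expected = hwe_genotype_frequencies(0.5)
print(expected)
print(sum(expected.values()))  # the three frequencies always sum to 1
```

Comparing these expected values with observed genotype frequencies is the usual starting point for asking which assumption might have been violated.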

inbreeding

violates the random mating assumption, decreases heterozygosity and usually fitness

genetic drift

migration

- movement of individuals between populations followed by breeding

selection

additive selection

recessive selection

dominant selection

fixation index (Fst)

microsatellites

a short segment of DNA, usually one to six or more base pairs in length, that is repeated multiple times in succession at a particular genomic location. These DNA sequences are typically non-coding

how are phenotypes associated with genotypes?

why are phenotypes associated with genotypes?

how do we model gene-phenotype interactions?

how does linkage disequilibrium lead to haplotype blocks?

how are GWAS studies performed?

how are GWAS studies interpreted?

what are some ways to decrease error in GWAS studies?

genome wide association studies (GWAS)

quantitative traits

genotype-phenotype association

genotype-phenotype models

multiplicative : genotype-phenotype model

additive : genotype-phenotype model

recessive : genotype-phenotype model

common dominant : genotype-phenotype model

polygenic : genotype-phenotype model

linkage map

linkage disequilibrium

haplotype block

coefficient of linkage disequilibrium (D)

TAG SNP

Bonferroni correction

power

odds ratio

multi-stage approach

permutation

false positives

population stratification

admixture

why do we need NGS? what did we hope to learn?

most phenotypes and diseases are complex
Health things to learn
- genetic factors affecting health
- predict, prevent, detect disease
- personalized effective treatment
- monitor disease progression
Wildlife/domestic animals things to learn
- genes that affect traits
- better management and conservation
- improve important traits

what makes up our genome?

- 45% of the genome is repetitive elements
- 30% of the genome is from genes, of that only about 2% is coding exons, there are also noncoding RNAs
- 70% of the genome is intergenic (between genes), this includes repetitive elements (simple repeats, transposons, SINEs and LINEs), conserved noncoding regions, regulatory regions, and structural regions (centromeres and telomeres)

what types of variation are present in genomes?

- deletion
- duplication
- inversion
- translocation

why do we need next gen sequencing (NGS)?

elaborate on the development of NGS

how has NGS impacted genomics

what is illumina sequencing technology?

what are the methods of illumina sequencing technology?

how is sequence data presented and formatted in FASTQ files?

how is a De Novo sequencing assembly constructed using NGS?

how do you evaluate how good an assembly is?

how do you deal with repeats when assembling contigs and scaffolds?

how/why do we re-sequence genomes to characterize variation?

repetitive elements

Alu transposable element

L1 transposon

Hemophilia A

indel

SNP

structural variation

insertion : structural variation

deletion : structural variation

translocation : structural variation

inversion : structural variation

alternative splicing

MAPT gene

next generation sequencing (NGS)

illumina

adapter

barcode

flow cell

cluster

bridge amplification

cycle

paired reads

FASTQ

phred score

vector

De Novo assembly

C (coverage)

string graph

consensus

N50

contigs

scaffolds

collapsed contig

repeat region

mate pair reads

assembly programs

velvet

re-sequencing

split mapping

what are the strategies behind genome re-sequencing?

what is the design of low coverage re-sequencing?

what are different types of reduced-representation sequencing?

what are ampliconic libraries?

what are the different types of targeted enrichment libraries?

elaborate on the methods for RadSeq libraries

how does one interpret the results of RadSeq libraries?

how can genomics be used to understand adaptation?

genome re-sequencing

low-coverage sequencing

reduced-representation sequencing

restriction enzyme digestion

Plasmodium falciparum

amylase

targeted enrichment

uniplex

multiplex

RainStorm

hybridization

oligo probes

biotin

streptavidin

miller syndrome

RadSeq

Sbf

ApeKI

GBS

RadTag

sliding window analysis

selective sweep

Bobcat

GPR158

LECT2

LECT

TRPM

what is meant when referring to the dynamic nature of gene expression?

what are the pitfalls of gene expression analysis?

what are some experimental approaches needed to understand gene expression?

how are microarrays designed?

how are microarrays analyzed?

what are the 7 main steps to differential gene expression?

how is RNAseq data analyzed?

elaborate on microarray and RNAseq analysis?

what are the main approaches to data analysis of gene expression?

what is involved in the pre-processing to clean up data?

elaborate a bit on inferential (t-tests and ANOVA) and descriptive statistics (scatter plots, volcano plots)

what are inferential statistics?

what are descriptive statistics?

how do we interpret results for biological significance?

how do we analyze clustering and heatmaps?

how does gene ontology allow for understanding function?

functional analysis

gene expression differences

microarrays

RNAseq

inferential statistics

exploratory statistics

oligos

probes

cDNA

hybridization

fluorescent tags

Rett syndrome

alpha B crystallin

clustering

classification

northern blots

western blots

RT-PCR

in situ hybridization

technical replicates

biological replicates

RNAseq pipeline

gene expression omnibus (GEO) databases

metadata

MIAME

annotated reference

FPKM

fragment count

isoforms

preprocessing

systematic bias

normalization

scatter plot

volcano plot

heat map

validation

gene ontology

cellular component

biological process

molecular function

enrichment analysis

pathways

Bioinformatics - Final Exam Content Flashcards

(232 cards)