Bioinformatics - Final Exam Content Flashcards

(232 cards)

1
Q

What are the differences between substitution models?

A

Substitution models differ in which parameters they include. The simplest just count the observed differences (Hamming distance); others correct for unobserved mutations, treat transitions and transversions differently, or add a proportion of invariable sites and gamma-distributed rate variation among sites.

2
Q

What parameters are included in substitution models?

A
  • transitions vs transversions
  • Hamming distance
  • Jukes and Cantor distance (correcting for unobserved mutations)
  • equal/unequal base frequencies
  • proportion of invariable sites
  • gamma distributed rate variation among sites
3
Q

How do you find the best substitution model?

A
  • the best approach is to test ALL candidate models and find the one that best fits your sequence data; this is done under the maximum likelihood framework, choosing the model with the lowest BIC or AIC value (lower is better for both)
  • once the model is chosen, you also want to include bootstrap analysis
4
Q

What are the steps to finding the best Tree?

A
  1. do a tree search under each model
  2. calculate the maximum likelihood score of the best tree for each model
  3. compare them using BIC or AIC scores, which are estimators of relative quality of statistical models
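
A minimal sketch of step 3, assuming the usual definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the number of free parameters and n is the number of alignment sites. The log-likelihoods and parameter counts below are made-up illustration values, not results from real data.

```python
import math

def aic(log_likelihood: float, k: int) -> float:
    """Akaike information criterion: lower is better."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood: float, k: int, n_sites: int) -> float:
    """Bayesian information criterion: lower is better."""
    return k * math.log(n_sites) - 2 * log_likelihood

# Hypothetical fits: model name -> (ln L of the best tree, free parameters).
fits = {
    "JC69": (-2450.0, 0),
    "K2P": (-2410.0, 1),
    "GTR+G": (-2405.0, 9),
}
n_sites = 1000  # assumed alignment length

best_by_aic = min(fits, key=lambda m: aic(*fits[m]))
best_by_bic = min(fits, key=lambda m: bic(*fits[m], n_sites))
print("best by AIC:", best_by_aic)
print("best by BIC:", best_by_bic)
```

BIC penalizes extra parameters more heavily than AIC once n is larger than about 7 sites, so the two criteria can disagree on parameter-rich models.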
5
Q

How do phylogenetic approaches provide insight on evolution?

A

phylogeny - compare phylogenies to biogeography and major paleoecological events
evolutionary processes - pattern heterogeneity and selection ratios (dN/dS)

6
Q

How do we use the Disparity Index (I) to estimate pattern heterogeneity?

A
  • a common WRONG assumption is that sequences evolve homogeneously (under the same conditions and processes)
  • we know that sequences evolve differently depending on location and selective pressures
  • we measure pattern heterogeneity with the disparity index
  • the disparity index identifies pairs of sequences that evolved under substantially different evolutionary processes
7
Q

What is the basis for dN/dS ratio tests?

A

it is a way to test whether selection is occurring: substitution-rate outliers include sequences that affect an organism’s ability to survive and reproduce, substitution patterns reflect selection, and dN/dS is the best tool we have for detecting it

8
Q

How do you interpret I (disparity index) statistics?

A

I = 0 means the sequences evolved under the same processes and pressures
I > 0 means the sequences evolved under different processes and pressures

9
Q

how do you interpret dN/dS statistics?

A

dN/dS = 1 : neutral, not undergoing selection
dN/dS > 1 : positive selection, amino acid changes are beneficial and favored
dN/dS < 1 : purifying selection, amino acid changes are harmful and removed, so sites stay conserved
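
A minimal sketch of the interpretation above. The tolerance `tol` around 1 is an assumption made for illustration; in practice a statistical test is needed before calling a ratio different from 1.

```python
def interpret_dn_ds(dn: float, ds: float, tol: float = 0.05):
    """Return (dN/dS, label) for given per-site substitution rates.

    tol is a hypothetical band around 1 treated as 'neutral'.
    """
    if ds <= 0:
        raise ValueError("dS must be positive to form the ratio")
    omega = dn / ds
    if abs(omega - 1.0) <= tol:
        label = "neutral"
    elif omega > 1.0:
        label = "positive selection"
    else:
        label = "purifying selection"
    return omega, label

for dn, ds in [(0.10, 0.10), (0.30, 0.10), (0.02, 0.10)]:
    print(dn, ds, interpret_dn_ds(dn, ds))
```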

10
Q

Transition

A

a change between A and G (purine to purine) or between C and T (pyrimidine to pyrimidine)
- these substitutions are more likely to happen because the base does not change from purine to pyrimidine or vice versa

11
Q

Transversion

A

a change between A and C, A and T, G and C, or G and T
- these substitutions happen less frequently and are more serious because the base changes from purine to pyrimidine or vice versa
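
A minimal sketch of how a substitution can be classified as a transition or a transversion, using only the purine/pyrimidine rule. The function name is chosen here for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(base_a: str, base_b: str) -> str:
    """Classify a change between two DNA bases."""
    a, b = base_a.upper(), base_b.upper()
    valid = PURINES | PYRIMIDINES
    if a not in valid or b not in valid:
        raise ValueError("bases must be one of A, C, G, T")
    if a == b:
        return "identical"
    # Same chemical class on both sides means a transition.
    same_class = (a in PURINES) == (b in PURINES)
    return "transition" if same_class else "transversion"

print(classify_substitution("A", "G"))
print(classify_substitution("G", "T"))
```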

12
Q

Hamming Distance (Dh)

A
  • the simplest approach to modeling substitutions: count the number of differences and divide by the alignment length
  • Dh = n / N
  • n is the number of sites that differ
  • N is the length of the alignment
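
A minimal sketch of Dh = n / N, assuming the two sequences are already aligned to the same length and that gaps are treated like any other character.

```python
def hamming_distance(seq_a: str, seq_b: str) -> float:
    """Return Dh = n / N for two aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    n = sum(1 for a, b in zip(seq_a, seq_b) if a != b)  # differing sites
    big_n = len(seq_a)                                  # alignment length
    return n / big_n

# Two of the eight sites differ here.
print(hamming_distance("ACGTACGT", "ACGTTCGA"))
```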
13
Q

Jukes and Cantor (1969)

A
  • a distance model for substitutions that corrects for unobserved mutations
  • Djc1969 = -(3/4) ln(1 - (4/3)p)
  • p = the proportion of sites that differ between sequences
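
A minimal sketch of the Jukes and Cantor correction. The formula is only defined for p < 0.75, so the function rejects larger values; that guard is an addition for illustration.

```python
import math

def jukes_cantor_distance(p: float) -> float:
    """Return d = -(3/4) * ln(1 - (4/3) * p) for a proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)

p = 0.25
print("observed p:", p)
print("corrected distance:", jukes_cantor_distance(p))
```

The corrected distance is always at least as large as p, because some substitutions are hidden by later changes at the same site.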
14
Q

Distance (phylogenetic tree sense)

A

essentially, a measure of how different the sequences in the alignment are, taking into account the differences (substitutions) that have occurred

15
Q

Proportion of invariable sites (I)

A
  • a parameter to significantly improve models
  • (I) is the proportion of static, unchanging sites in the dataset
16
Q

gamma distribution (G)

A
  • a parameter to significantly improve models
  • indicates a gamma distributed rate variation among sites
17
Q

BIC value

A

Bayesian information criterion (the model with the lowest score is best)

18
Q

AIC value

A

Akaike information criterion (the model with the lowest score is best)

19
Q

pattern heterogeneity

A

if two sequences evolved under the same processes their nucleotide composition will be similar, however if they evolved under separate pressures their nucleotide composition will reflect that

20
Q

dN/dS ratio

A
  • a highly important and common approach for testing whether selection has occurred
  • nonsynonymous substitutions per nonsynonymous site / synonymous substitutions per synonymous site
  • = 1 : neutral, not undergoing selection
  • > 1 : positive selection, amino acid changes are beneficial and favored
  • < 1 : purifying selection, amino acid changes are harmful and removed, so sites stay conserved
21
Q

disparity index (I)

A

the observed difference in evolutionary patterns for a pair of sequences, based on nucleotide composition
- I = (1/2) * [sum over i of (xi - yi)^2] - Nd
- xi = composition of the ith nucleotide in the first sequence
- yi = composition of the ith nucleotide in the second sequence
- Nd = composition distance expected under homogeneity
values associated with the disparity index:
I = 0 -> same evolutionary pressures
I > 0 -> different evolutionary pressures
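
A minimal sketch of the formula above. It assumes xi and yi are counts of each nucleotide in the two sequences, and it takes Nd as an input supplied by the caller rather than estimating it; both choices are assumptions made for illustration.

```python
from collections import Counter

def composition_distance(seq_x: str, seq_y: str) -> float:
    """Return (1/2) * sum over A, C, G, T of (xi - yi)^2, using base counts."""
    counts_x = Counter(seq_x.upper())
    counts_y = Counter(seq_y.upper())
    return 0.5 * sum((counts_x[b] - counts_y[b]) ** 2 for b in "ACGT")

def disparity_index(seq_x: str, seq_y: str, n_d: float) -> float:
    """Return I = composition distance - Nd, with Nd given by the caller."""
    return composition_distance(seq_x, seq_y) - n_d

# Hypothetical toy sequences and a hypothetical Nd value.
print(composition_distance("AACC", "AAGG"))
print(disparity_index("AACC", "AAGG", n_d=2))
```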

22
Q

neutral theory of molecular evolution (Kimura 1968)

A
  • most mutations are neutral or “nearly neutral”
  • it is a basic principle that differences in fecundity lead to natural selection and fixation of mutations
  • substitution patterns reflect selection
23
Q

synonymous

A
  • substitution where the amino acid stays the same
  • more likely to be neutral
24
Q

nonsynonymous

A
  • substitution where the amino acid changes
  • more likely to change phenotype
  • positive selection may result from a beneficial change in phenotype
25
neutrality
- dN/dS ratio = 1, where the rates of nonsynonymous (dN) and synonymous (dS) substitution are the same; indicates no selection is happening
26
positive selection
- when the dN/dS ratio > 1: a mutation is beneficial, so selection is occurring in favor of that mutation
27
purifying selection
- when the dN/dS ratio < 1: a mutation is detrimental, so selection removes it and keeps the site conserved in the population
28
What is the perspective that molecular genetics uses to examine variation?
molecular evolution/genetics focuses on fixed differences between species
29
What is the perspective that population genetics uses to examine variation?
population genetics focuses on the differences between populations of one species, for example how a mountain range separating two populations of the same species affects how those populations have evolved
30
What parameters are estimated in population genetics?
gene pool, allele frequency, genotype frequency - these population parameters will affect the gene pool in a predicted way
31
what are the basics of the Hardy-Weinberg equilibrium (HWE)?
- extending Mendel's laws of inheritance to populations yields HWE
- when gametes containing either of two alleles, A or a, unite at random to form the next generation, the genotype frequencies in the offspring (zygotes) are AA : Aa : aa (p^2 : 2pq : q^2)
- genotype frequencies are maintained by the allele frequencies
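A minimal sketch of the expected genotype frequencies under HWE for one locus with two alleles, where p is the frequency of A and q = 1 - p is the frequency of a.

```python
def hwe_genotype_frequencies(p: float) -> dict:
    """Return expected AA, Aa, aa frequencies (p^2, 2pq, q^2)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency p must be between 0 and 1")
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2.0 * p * q, "aa": q * q}

expected = hwe_genotype_frequencies(0.7)
for genotype, freq in expected.items():
    print(genotype, round(freq, 4))
```

The three frequencies always sum to 1, since (p + q)^2 = 1.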
32
what are the assumptions of the HWE?
allele frequencies will remain constant over time if these assumptions are met:
- random mating
- infinite population size
- no migration
- no selection
- no mutation
violations of these assumptions have predictable effects on allele and genotype frequencies
33
how does violating assumptions of HWE affect parameters?
- inbreeding: decreases heterozygosity, so genotype frequencies change but allele frequencies do not; can lead to heritable diseases
- genetic drift (small population): allele frequencies drift randomly toward one allele, so we converge on one allele type (fixation), but which allele becomes fixed is random
- migration: may lead to admixture, combining two or more populations with different allele frequencies into one group
- selection: may be recessive, dominant, or additive; the frequency of a favored allele increases until it becomes fixed in the population
- mutation: randomly changes genotypes
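A minimal sketch of genetic drift in a small population, assuming a simple Wright-Fisher style model (each generation resamples 2N allele copies at random from the previous one). The model choice, function name, and parameters are assumptions made for illustration.

```python
import random

def simulate_drift(p0: float, n_individuals: int,
                   max_generations: int = 10_000, seed=None):
    """Return (final frequency of allele A, generations elapsed)."""
    rng = random.Random(seed)
    n_copies = 2 * n_individuals  # diploid population
    p = p0
    for generation in range(max_generations):
        if p in (0.0, 1.0):
            return p, generation  # one allele has become fixed
        # Each allele copy in the next generation is A with probability p.
        count_a = sum(1 for _ in range(n_copies) if rng.random() < p)
        p = count_a / n_copies
    return p, max_generations

# Same starting frequency, different random seeds: which allele fixes varies.
for seed in range(5):
    print(seed, simulate_drift(0.5, n_individuals=10, seed=seed))
```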
34
what can you estimate if you know allele and genotype frequencies?
we can go backwards and guess which assumptions were violated:
- inbreeding rates
- population sizes
- effective population size (number of breeders)
- migration/dispersal
- population structure/gene flow
- recent changes in population sizes
- selection coefficients
- genotype-phenotype associations
35
how do we estimate population structure with fixation index (Fst) values?
- we look at how alleles are distributed among vs within populations
- Fst is an estimate of the genetic divergence between populations
- Fst = AP / (WI + AI + AP)
- AP = estimated variance in allele frequencies Among Populations
- WI = estimated variance in allele frequencies Within Individuals
- AI = estimated variance in allele frequencies Among Individuals
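A minimal sketch of the Fst ratio above, taking the three variance components as inputs. Estimating those components from genotype data is not shown, and the numbers are hypothetical.

```python
def fst_from_variance_components(ap: float, ai: float, wi: float) -> float:
    """Return Fst = AP / (WI + AI + AP)."""
    total = wi + ai + ap
    if total <= 0:
        raise ValueError("total variance must be positive")
    return ap / total

# Hypothetical components: most variance lies within individuals.
print(fst_from_variance_components(ap=0.10, ai=0.10, wi=0.80))
```

A value near 0 means little variance lies among populations (weak structure); a value near 1 means most of it does.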
36
what are microsatellites?
short repeats found within a species; populations may vary in the number of these repeats - a short segment of DNA, usually one to six or more base pairs in length, that is repeated multiple times in succession at a particular genomic location
37
how are microsatellites genotyped?
- obtain primers for the microsatellite
- PCR
- fragment analysis: see how big the pieces are to determine how many repeats they have
- genotype
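A minimal sketch of the last two steps, turning fragment sizes into repeat counts. It assumes the fragment is the flanking sequence (including primers) plus the repeat region, and that a single peak means a homozygote; the flank length and motif length are hypothetical inputs.

```python
def repeat_count(fragment_bp: int, flank_bp: int, motif_bp: int) -> int:
    """Estimate the number of repeats from a PCR fragment size."""
    if motif_bp <= 0:
        raise ValueError("motif length must be positive")
    repeat_region = fragment_bp - flank_bp
    if repeat_region < 0:
        raise ValueError("fragment is shorter than the flanking sequence")
    return round(repeat_region / motif_bp)

def genotype(peaks_bp, flank_bp: int, motif_bp: int):
    """Return a sorted pair of repeat counts for a diploid individual."""
    counts = sorted(repeat_count(size, flank_bp, motif_bp) for size in peaks_bp)
    if len(counts) == 1:
        counts = counts * 2  # one peak: assume both alleles are the same
    return tuple(counts)

# Hypothetical locus: 100 bp of flanking sequence, 4 bp repeat motif.
print(genotype([148, 156], flank_bp=100, motif_bp=4))
print(genotype([148], flank_bp=100, motif_bp=4))
```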
38
populations
group of individuals of one species living in the same geographical area
39
subpopulations
local populations within which most individuals find their mates
40
gene pool
all genetic variation within a population
41
allele
variant at a locus, comes from a mutation
42
locus
independent location on a chromosome, can be a gene
43
allele frequency
proportion of any specific allele in a population
44
genotype frequency
proportion of individuals in a population with a specific genotype (in diploids, the genotype is the combination of the two alleles in an individual, heterozygous or homozygous)
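A minimal sketch of both definitions, computing genotype and allele frequencies from genotype counts. It assumes a diploid population and one locus with two alleles, A and a; the counts are hypothetical.

```python
def frequencies_from_genotype_counts(n_AA: int, n_Aa: int, n_aa: int):
    """Return (genotype frequencies, allele frequencies) as two dicts."""
    total = n_AA + n_Aa + n_aa
    if total <= 0:
        raise ValueError("need at least one individual")
    genotype_freqs = {
        "AA": n_AA / total,
        "Aa": n_Aa / total,
        "aa": n_aa / total,
    }
    # Each individual carries two allele copies; heterozygotes carry one A.
    p = (2 * n_AA + n_Aa) / (2 * total)
    allele_freqs = {"A": p, "a": 1.0 - p}
    return genotype_freqs, allele_freqs

genotypes, alleles = frequencies_from_genotype_counts(30, 50, 20)
print(genotypes)
print(alleles)
```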
45
Hardy Weinberg equilibrium
when gametes containing either of two alleles, A or a, unite at random to form the next generation, the genotype frequencies in the offspring (zygotes) are AA : Aa : aa (also p^2 : 2pq : q^2) and p + q = 1
46
inbreeding
violates the random mating assumption; decreases heterozygosity and usually fitness
47
genetic drift
48
migration
- movement of individuals between populations followed by breeding
49
selection
50
additive selection
51
recessive selection
52
dominant selection
53
fixation index (Fst)
54
microsatellites
a short segment of DNA, usually one to six or more base pairs in length, that is repeated multiple times in succession at a particular genomic location. These DNA sequences are typically non-coding
55
how are phenotypes associated with genotypes?
56
why are phenotypes associated with genotypes?
57
how do we model gene-phenotype interactions?
58
how does linkage disequilibrium lead to haplotype blocks?
59
how does linkage disequilibrium lead to haplotype blocks?
60
how does linkage disequilibrium lead to haplotype blocks?
61
how are GWAS studies performed?
62
how are GWAS studies interpreted?
63
what are some ways to decrease error in GWAS studies?
64
genome wide association studies (GWAS)
65
quantitative traits
66
genotype-phenotype association
67
genotype-phenotype models
68
multiplicative : genotype-phenotype model
69
additive : genotype-phenotype model
70
additive : genotype-phenotype model
71
recessive : genotype-phenotype model
72
common dominant : genotype-phenotype model
73
polygenic : genotype-phenotype model
74
linkage map
75
cM
76
linkage disequilibrium
77
haplotype block
78
coefficient of linkage disequilibrium (D)
79
TAG SNP
80
Bonferroni correction
81
power
82
odds ratio
83
multi-stage approach
84
permutation
85
false positives
86
population stratification
87
admixture
88
why do we need NGS? what did we hope to learn?
most phenotypes and diseases are complex
Health things to learn:
- genetic factors affecting health
- predict, prevent, detect disease
- personalized effective treatment
- monitor disease progression
Wildlife/domestic animals things to learn:
- genes that affect traits
- better management and conservation
- improve important traits
89
what makes up our genome?
- 45% of the genome is repetitive elements
- 30% of the genome is genes; of that, only about 2% is coding exons, and there are also noncoding RNAs
- 70% of the genome is intergenic (between genes); this includes repetitive elements (simple repeats, transposons, SINEs and LINEs), conserved noncoding regions, regulatory regions, and structural regions (centromeres and telomeres)
90
what types of variation are present in genomes?
- deletion
- duplication
- inversion
- translocation
91
why do we need next gen sequencing (NGS)?
92
elaborate on the development of NGS
93
how has NGS impacted genomics
94
what is illumina sequencing technology?
95
what are the methods of illumina sequencing technology?
96
how is sequence data presented and formatted in FASTQ files?
97
how is a De Novo sequencing assembly constructed using NGS?
98
how do you evaluate how good an assembly is?
99
how do you deal with repeats when assembling contigs and scaffolds?
100
how/why do we re-sequence genomes to characterize variation?
101
repetitive elements
102
Alu transposable element
103
L1 transposon
104
Hemophilia A
105
indel
106
SNP
107
structural variation
108
insertion : structural variation
109
deletion : structural variation
110
translocation : structural variation
111
inversion : structural variation
112
alternative splicing
113
MAPT gene
114
next generation sequencing (NGS)
115
illumina
116
adapter
117
barcode
118
flow cell
119
cluster
120
bridge amplification
121
cycle
122
paired reads
123
FASTQ
124
phred score
125
vector
126
De Novo assembly
127
C (coverage)
128
string graph
129
consensus
130
N50
131
contigs
132
scaffolds
133
collapsed contig
134
repeat region
135
mate pair reads
136
assembly programs
137
velvet
138
re-sequencing
139
split mapping
140
what are the strategies behind genome re-sequencing?
141
what is the design of low coverage re-sequencing?
142
what are different types of reduced-representation sequencing?
143
what are ampliconic libraries?
144
what are the different types of targeted enrichment libraries?
145
elaborate on the methods for RadSeq libraries
146
how does one interpret the results of RadSeq libraries?
147
how can genomics be used to understand adaptation?
148
genome re-sequencing
149
low-coverage sequencing
150
reduced-representation sequencing
151
restriction enzyme digestion
152
Plasmodium falciparum
153
amylase
154
targeted enrichment
155
uniplex
156
multiplex
157
RainStorm
158
hybridization
159
oligo probes
160
biotin
161
streptavidin
162
miller syndrome
163
RadSeq
164
Sbf
165
ApeKI
166
GBS
167
RadTag
168
sliding window analysis
169
selective sweep
170
Bobcat
171
GPR158
172
LECT2
173
LECT
174
TRPM
175
what is meant when referring to the dynamic nature of gene expression?
176
what are the pitfalls of gene expression analysis?
177
what are some experimental approaches needed to understand gene expression?
178
how are microarrays designed?
179
how are microarrays analyzed?
180
what are the 7 main steps to differential gene expression?
181
how is RNAseq data analyzed?
182
elaborate on microarray and RNAseq analysis?
183
what are the main approaches to data analysis of gene expression?
184
what is involved in the pre-processing to clean up data?
185
elaborate a bit on inferential (t-tests and ANOVA) and descriptive statistics (scatter plots, volcano plots)
186
what are inferential statistics?
187
what are descriptive statistics?
188
how do we interpret results for biological significance?
189
how do we analyze clustering and heatmaps?
190
how does gene ontology allow for understanding function?
191
functional analysis
192
gene expression differences
193
microarrays
194
RNAseq
195
inferential statistics
196
exploratory statistics
197
oligos
198
probes
199
cDNA
200
hybridization
201
fluorescent tags
202
Rett syndrome
203
αB-crystallin
204
clustering
205
classification
206
northern blots
207
western blots
208
RT-PCR
209
in situ hybridization
210
technical replicates
211
biological replicates
212
RNAseq pipeline
213
gene expression omnibus (GEO) databases
214
metadata
215
MIAME
216
annotated reference
217
FPKM
218
fragment count
219
isoforms
220
preprocessing
221
systematic bias
222
normalization
223
scatter plot
224
volcano plot
225
heat map
226
validation
227
gene ontology
228
cellular component
229
biological process
230
molecular function
231
enrichment analysis
232
pathways