Week 11 (1000 Genomes Project) Flashcards

1
Q

what is the 1000 Genomes Project?

A

The 1000 Genomes Project is an international research consortium that was set up in 2007 with the aim of sequencing the genomes of at least 1,000 volunteers from multiple populations worldwide in order to improve our understanding of the genetic contribution to human health and disease.

2
Q

what was the first model for the 1000 genomes project? why?

A

humans! human research is better funded, so the money was available to do this.

3
Q

what combination of sequencing approaches did they use to complete the 1000 Genomes Project?

A
  • low-coverage whole-genome sequencing
  • exome sequencing
4
Q

the 1000 Genomes Project validated a haplotype map of ____ _____ single nucleotide polymorphisms

A

38 million

5
Q

why do low frequency variants tend to be recent?

A

frequency is how often something shows up in the population. A new variant or mutation starts out as a single copy, and it takes many generations for it to become common, so variants that are still at low frequency tend to be recent

6
Q

is it possible for mutations to occur over time? if so, how?

A

yes! mutations can arise during cell division (for example, as errors when the DNA is copied)

7
Q

what is the equation that you use to determine the frequency of a new mutation in a population?

A

1/(2N) (N = number of diploid individuals, so there are 2N chromosome copies and the new mutation is on one of them)
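
A quick way to check the 1/(2N) answer is to plug in a number. The sketch below is only illustrative: the function name and the population size of 1,000 are assumptions made up for the example, not values from the course.

```python
def new_mutation_frequency(n_individuals: int) -> float:
    """Starting frequency of a brand-new mutation in a diploid population."""
    if n_individuals <= 0:
        raise ValueError("n_individuals must be positive")
    # Each diploid individual carries 2 chromosome copies, so the population
    # holds 2N copies and the new mutation sits on exactly one of them.
    return 1 / (2 * n_individuals)


# Hypothetical population of 1,000 individuals: 1 copy out of 2,000
print(new_mutation_frequency(1000))  # 0.0005
```

The larger the population, the lower the starting frequency of any new mutation.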

8
Q

what is the chance of transmission from parent to offspring?

A

50/50 (to transmit or to not transmit)

9
Q

recombination occurs in every generation, and over time it breaks down _______ __________

A

linkage disequilibrium

10
Q

the 1000 Genomes Project found about 3.6 million SNPs per individual. On average, what fraction of the genome differs between individuals?

A

0.1%
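
The 0.1% figure can be roughly recovered by dividing the SNP count by the genome size. This is a back-of-the-envelope sketch; the genome size of about 3.2 billion bp is an assumption added here (an approximate haploid human genome size), not a number from the cards.

```python
SNPS_PER_INDIVIDUAL = 3.6e6  # from the card above
GENOME_SIZE_BP = 3.2e9       # assumption: approximate haploid human genome size

# Fraction of positions that carry a SNP in a typical individual
fraction_different = SNPS_PER_INDIVIDUAL / GENOME_SIZE_BP
print(f"{fraction_different:.2%}")  # 0.11%
```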

11
Q

what is low coverage?

A

<5x

12
Q

what is high coverage?

A

>20x

13
Q

why did the 1000 genomes project use 5x coverage?

A

it was really expensive to do more than that! (it cost $5 million)

14
Q

what is the typical amount of coverage that we use today?

A

30x
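
To make "30x" concrete: average coverage is the total number of sequenced bases divided by the genome size. The sketch below assumes a genome of about 3.2 billion bp and 150 bp reads; both numbers and the function name are illustrative assumptions, not values from the course.

```python
def mean_coverage(n_reads: float, read_length_bp: int, genome_size_bp: float) -> float:
    """Average depth = total sequenced bases / genome size."""
    return n_reads * read_length_bp / genome_size_bp


GENOME_SIZE_BP = 3.2e9  # assumption: approximate human genome size
READ_LENGTH_BP = 150    # assumption: a typical short-read length

# Number of reads needed to reach 30x and 5x average coverage
reads_for_30x = 30 * GENOME_SIZE_BP / READ_LENGTH_BP
reads_for_5x = 5 * GENOME_SIZE_BP / READ_LENGTH_BP

print(f"{reads_for_30x:,.0f} reads for 30x")
print(f"{reads_for_5x:,.0f} reads for 5x")
print(mean_coverage(reads_for_30x, READ_LENGTH_BP, GENOME_SIZE_BP))
```

Under these assumptions 30x needs six times as many reads as 5x, which is why low coverage was the affordable choice when sequencing was expensive.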

15
Q

we transmit ________ NOT _______ to the next generation

A

chromosomes; alleles

16
Q

what amount of coverage did the 1000 Genomes Project use?

A

low coverage (2-6x)

17
Q

the 1000 Genomes Project used wide sampling and low coverage, why?

A

they wanted to characterize common variation. Sequencing each genome at lower coverage let them sample more individuals for the same cost

18
Q

how did the 1000 Genomes Project construct an integrated map of variation?

A
  1. primary data
  2. candidate variants and quality metrics
  3. variant calls and genotype likelihoods
  4. integrated haplotypes
19
Q

which would produce more accurate variant calls, low coverage WGS or high coverage exome?

A

high coverage exome

20
Q

pro and con of low coverage WGS?

A
  • pro: cost effective, can conduct large scale studies
  • con: less accurate variant calls
21
Q

pro and con to high coverage exome?

A
  • pro: more accurate variant calls
  • con: only sequencing 2% of the genome
22
Q

what is exome sequencing?

A

sequencing only the exons (the protein-coding regions) and nothing else in the genome, so only about 2% of the genome is sequenced

23
Q

why 0, 1, or 2 copies of a variant for an individual?

A

each individual has two copies of each chromosome, so the variant can be on neither, one, or both

24
Q

why is the evidence for a single genotype typically weak in low coverage regions?

A

(low coverage = 5x) each position is covered by only about 5 reads, so there are only about 5 reads available to support a call at that position

25
the evidence for a single genotype is typically weak in low coverage regions. why is it more difficult for heterozygous sites?
a single read carrying the alternate allele could be a sequencing error, or it could mean the site is truly heterozygous, so your confidence in the call is low
26
the evidence for a single genotype is typically weak in low coverage regions. how can we address this?
sequence deeper (increase coverage)
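
The effect of depth on heterozygous calls can be sketched with a simple binomial model. The model is an assumption added here for illustration: each read is taken to sample either allele with probability 0.5, independently, with no sequencing error or mapping bias. The function name is hypothetical.

```python
from math import comb


def prob_at_most_k_alt_reads(depth: int, k: int, p_alt: float = 0.5) -> float:
    """P(a truly heterozygous site shows k or fewer alternate-allele reads)."""
    return sum(
        comb(depth, i) * p_alt**i * (1 - p_alt) ** (depth - i)
        for i in range(k + 1)
    )


for depth in (5, 30):
    p_zero = prob_at_most_k_alt_reads(depth, 0)  # alternate allele never seen
    p_one = prob_at_most_k_alt_reads(depth, 1)   # seen at most once (looks like an error)
    print(f"{depth}x: P(0 alt reads) = {p_zero:.3g}, P(<=1 alt read) = {p_one:.3g}")
```

Under this model, at 5x almost 1 in 5 heterozygous sites show the alternate allele in at most one read, which is hard to tell apart from an error; at 30x that situation essentially never happens.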
27
what procedure/ what is it called when you try to determine if a variant is true or not?
variant quality score recalibration
28
the 1000 Genomes Project identified 38 million variants. how many variants (SNPs) have been discovered today?
1.1 billion
29
remember that other type of variation we said we were NOT going to talk about?
structural variation
30
what was another name we gave to "regions of low complexity"?
repetitive sequence
31
what technology should we use in regions with low complexity? why?
long read sequencers, so we can span across the repeat
32
when we make a call about DNA at a position, what are the options for the condition?
  • true positive
  • false positive
  • false negative
33
FDR
false discovery rate
34
FDR equation
FP / (FP + TP) (FDR = false discovery rate, FP = false positive, TP = true positive)
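
A small worked example of the false discovery rate, written as FP divided by the sum of FP and TP. The counts below are made up for illustration and the function name is hypothetical.

```python
def false_discovery_rate(false_positives: int, true_positives: int) -> float:
    """FDR = FP / (FP + TP): the share of positive calls that are wrong."""
    total_calls = false_positives + true_positives
    if total_calls == 0:
        raise ValueError("no positive calls, so FDR is undefined")
    return false_positives / total_calls


# Hypothetical call set: 100 variant calls, 5 of which turn out to be false
print(false_discovery_rate(false_positives=5, true_positives=95))  # 0.05
```

Note that the denominator is all positive calls, so the parentheses around FP + TP matter.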
35
de novo
new
36
accessible genome
the fraction of the reference genome in which short-read data can lead to reliable variant discovery
37
the 1000 genomes project had challenges identifying large and complex structural variants and shorter indels in regions of low complexity. so what conservative but high quality subsets did they focus on?
  • biallelic indels
  • large deletions
38
everyone carries "bad" variants. however, not everyone shows them or they never cause issues. why can this happen?
we have two copies of each chromosome, so if the copy on the other chromosome is functioning it can mask the bad variant
39
variation among samples in genotype accuracy is primarily driven by sequencing depth. WHY is this true?
Sequencing depth is the number of reads covering each nucleotide position. With more reads, both alleles at a heterozygous site are more likely to be sampled and a stray sequencing error is outvoted by the correct reads, so samples sequenced at higher depth get more accurate genotype calls
40
moderate to high frequency variants tend to be "______" while low frequency variants tend to be "_______"
old; new
41
what biological trait is necessary for new mutation to spread and increase in frequency in a population?
the mutation cannot kill the person it is affecting, it must allow the individual to survive and reproduce
42
human populations are expanding. why would you expect many low frequency variants?
A growing population has more individuals, so more new mutations enter the population each generation, and each one starts out rare. Because the expansion is recent, most of these variants have not had time to rise in frequency, so there is an excess of low-frequency variants relative to common ones.
43
why would "old" mutations be present on short haplotypes while "new" mutations be present on long haplotypes?
Old mutations tend to be found on short haplotypes while new mutations are more likely on long haplotypes due to the process of recombination and mutation breaking down ancestral haplotypes over time
44
______ mutations are found on SHORT haplotypes. _______ mutations are found on LONG haplotypes. This is due to _____________.
old; new; recombination
45
why (how) would regulatory sequence tolerate deleterious variation?
redundancy (we have two copies of each chromosome, so a second copy of everything)
46
T/F: the 1000 genomes project supported the hypothesis that regulatory sequences contain substantial amounts of weakly deleterious variation
true
47
define deleterious variation
genetic changes, or mutations, that negatively impact an individual's fitness and reproductive success
48
How is it that we can have on average 150 “broken” genes but still be normal?
redundancy (we have a second copy on the other chromosome that, if functional, can support the organism)
49
T/F: everyone carries "bad" variants
true
50
Why (how) would regulatory sequence tolerate deleterious variation?
redundancy, some deleterious variation does not have as big of an impact as others
51
a second major use of the 1000 Genomes Project data in medical genetics is ________ genotypes in existing GWAS
imputing
52
regions of low sequence complexity, satellite regions, large repeats and many large-scale structural variants continue to present a major challenge for short read technologies. What technology helps us in these regions?
long read sequencing
53
errors can occur and sometimes you can end up with bad data. Would you throw out this bad data?
it depends! It depends on whether you are able to obtain better data or whether this is the best you can do with the resources you have (ex: ancient DNA)
54
rare variants need to be evaluated using the correct null distribution. Why? What does this mean?
rare variants found in individuals with a particular disease should be interpreted within the context of the local (geographic or ancestral) genetic background. Variation differs between populations, so it is important to sequence individuals from diverse populations
55
what is the purpose of RNAseq?
the method of choice to study gene expression and identify novel RNA species
56
define transcriptomics
the study of transcriptomes (the complete set of RNA transcripts, both coding and non-coding, within a cell, tissue, or organism) and their functions
57
what are the two main differences between DNA and RNA?
  • RNA contains uracil in place of thymine
  • RNA's sugar is ribose rather than deoxyribose
58
what is a required step for RNA-Seq because commercial instruments are made for DNA based sequencing?
cDNA library preparation (converting RNA to DNA)
59
what is the most common application of RNAseq? (the most common type of RNA that is sequenced)
polyadenylated RNA
60
what type of RNA are we trying to avoid when sequencing polyadenylated RNA?
ribosomal RNA
61
what primer is used to target polyadenylated RNA?
oligo dT
62
oligo-dT priming based methods can exhibit 3' bias so the ______ __________ is a preferred method to select poly(A) RNA
poly(A) purification
63
oligo-dT priming based methods can exhibit ______ bias
3'
64
what is the most abundant type of RNA?
ribosomal RNA
65
what are methods to remove ribosomal RNA when we are trying to isolate polyadenylated RNA?
  1. ribosomal depletion (more expensive)
  2. oligo-dT (most common)
66
what is a major issue in sequencing polyadenylated RNA?
eliminating ribosomal RNA
67
after poly(A) RNA selection, RNA samples are typically subject to RNA fragmentation to a certain size range. why?
the size limitation of most current sequencing platforms (ex: Illumina needs <600 bp)
68
what sequencing technology can RNA be run on?
nanopore
69
For RNA-seq, if the fragment sizes in your library have this size distribution, what is the optimal format for sequencing?
long read sequencing (but it is expensive so short read is more common)
70
a lack of strand specificity would make it difficult to identify antisense and novel RNA species and cause inaccurate measurement of sense RNA expression. WHY?
without strand information, antisense RNA, which is complementary to the sense RNA, is incorrectly counted as sense RNA, leading to an overestimation of sense RNA expression and the inability to accurately quantify antisense RNA
71
a lack of strand specificity would make it difficult to identify antisense and novel RNA species and cause inaccurate measurement of sense RNA expression. what is a common solution to this?
incorporate dUTP into the second strand of cDNA, then use enzymes to degrade the strand that contains uracil
72
what are methods for strand specific RNA seq? (what is the most common)
  • ligation of the 3' preadenylated and 5' adapters
  • labeling the second strand with dUTP followed by enzymatic degradation (MOST COMMON)
  • the peregrine method
73
almost all multi-exon genes display ______ _______. this plays an important role in regulation of cellular processes, and aberrations of the process are associated with many human diseases.
alternative splicing
74
the ultimate solution to unravel the complexity of alternative splicing and gene fusion isoforms is to...?
sequence each transcript from beginning to end
75
is this high quality, medium quality or low quality total RNA image using fragment analyzer?
high quality (we want to see the tall ribosomal RNA peaks)
76
is this high quality, medium quality or low quality total RNA image using fragment analyzer?
medium quality
77
is this high quality, medium quality or low quality total RNA image using fragment analyzer?
low quality
78
what are the 4 parts of gene regulation?
what, when, where, how much
79
what is the power of RNA sequencing?
aspects of discovery and quantification can be combined
80
what is a crucial prerequisite for a successful RNAseq study?
experimental design, that the data generated have the potential to answer the biological questions of interest
81
single end: if you only want to know "____ ____" paired end: if you also want to know "_______"
how much; which
82
one important aspect of the experimental design is the ____________ protocol used to remove ribosomal RNA
RNA-extraction
83
what does the best sequencing option depend on?
the analysis goals
84
_____ _____ reads are normally sufficient for studies of gene expression levels in well-annotated organisms, whereas _____ _____ are preferable to characterize poorly annotated transcriptomes
short read; long read
85
define isoform
different versions of the same gene, arising from variations in alternative splicing or different transcription initiation and termination sites
86
what is the minimum number of biological replicates that you must have for an RNA-seq experiment?
3 (more is always better)
87
a crucial design factor for RNA-seq is the number of replicates. the number of replicates that should be included in an RNA-seq experiment depends on both the amount of _____ _______ and the _______ _________ as well as on the desired statistical power
technical variability; biological variability
88
what are the three factors that determine the number of replicates required in a RNA-seq experiment?
technical variability, biological variability, and the desired statistical power
89
during RNAseq, when mapping, sometimes you will have unmapped reads. what can we assume these are?
sequence from the bacteria, viruses, or fungi that were present in the tissue you collected
90
should we discard unmapped reads?
NO, it will give you information on the tissue that you collected and indicate info about expression
91
what are the three basic strategies for regular RNA-seq analysis?
  • genome mapping
  • transcriptome mapping
  • reference free assembly
92
what is the most common application of RNA-seq?
to estimate the gene and transcript expression (the how much)
93
in RNAseq, what are normalization methods used for?
normalization methods are used to address technical biases and ensure accurate comparisons of gene expression levels across samples
94
Consider two genes, A & B, that are truly expressed at the same level in your tissue of interest. Gene A has a coding sequence of 1000 bp while Gene B has a coding sequence of 10,000 bp.
In this scenario, RPKM (Reads Per Kilobase per Million mapped reads) or similar normalization methods would be crucial. Since Gene B is 10 times longer than Gene A, raw read counts for Gene B would inherently be higher than for Gene A, even if the true expression levels are identical. Normalization is needed to account for this length bias and ensure that comparisons reflect the true expression differences, not simply the length of the gene.
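The Gene A / Gene B scenario can be worked through with the standard RPKM definition (reads, per kilobase of gene, per million mapped reads). The read counts and the library size of 20 million mapped reads are hypothetical numbers chosen for illustration, and the function name is an assumption.

```python
def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads Per Kilobase of gene per Million mapped reads."""
    length_kb = gene_length_bp / 1_000
    library_millions = total_mapped_reads / 1_000_000
    return read_count / length_kb / library_millions


TOTAL_MAPPED_READS = 20_000_000  # hypothetical library size

# Same true expression, but Gene B is 10x longer and so collects 10x the reads
gene_a = rpkm(read_count=100, gene_length_bp=1_000, total_mapped_reads=TOTAL_MAPPED_READS)
gene_b = rpkm(read_count=1_000, gene_length_bp=10_000, total_mapped_reads=TOTAL_MAPPED_READS)

print(gene_a, gene_b)  # 5.0 5.0
```

The raw counts differ tenfold, but after dividing by gene length the two genes come out equal, which matches their true expression.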
95
when is it necessary to correct for gene length in order to correctly rank gene expression levels? when is it not necessary to correct for gene length?
  • necessary: within a sample, to account for the fact that longer genes accumulate more reads
  • NOT necessary: when comparing changes in gene expression within the same gene across samples
96
what is an example of when you would not want to throw out unmapped reads?
COVID: if you collect samples from patients, the unmapped reads can tell you the amount of exposure to the virus and give you information on its level of expression
97
in the absence of _______ ________, no population inference can be made and hence any p value calculation is invalid
biological replication
98
T/F: in the absence of biological replication, no population inference can be made and hence any p value calculation is invalid
true
99
how do we genotype positions in the genome?
DNA sequencing