Final Exam - Quizzes Flashcards
___________ in genetics refers to the statistical inference of unobserved genotypes. It is achieved by using known haplotypes in a population, for instance from the HapMap or the 1000 Genomes Project in humans
imputation
why would one choose to use a SNP-chip rather than genome sequencing?
SNP chips are cheaper to run
The Illumina Infinium genotyping chemistry uses one probe for A/C, A/G, C/T, and G/T variants, but it uses two probes for A/T and G/C variants. Why?
because A and T are both labeled red and G and C are both labeled green, the two alleles of an A/T or G/C variant would give the same color and could not be distinguished with one probe. Two allele-specific probes are used to tell them apart.
You performed a GWAS with say 500,000 SNPs. You are interested in a particular gene, call it YourFavoriteGene (YFG). There are 3 variants present within YFG that have the p-values below. which variant is most strongly associated with the trait that you used for your GWAS?
a) Marker1 p=1.0E-5
b) Marker2 p=1.0E-6
c) Marker3 p=1.0E-7
c) Marker3 p=1.0E-7
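A minimal sketch of the reasoning behind this answer: the smallest p-value is the strongest evidence of association. The marker names and p-values are taken from the card. The significance level `alpha = 0.05` and the Bonferroni correction are assumptions added for illustration and were not part of the question.

```python
import math

# p-values copied from the card
p_values = {"Marker1": 1.0e-5, "Marker2": 1.0e-6, "Marker3": 1.0e-7}

# The smallest p-value is the strongest statistical evidence of association
best = min(p_values, key=p_values.get)

# -log10(p) is the scale used on a Manhattan plot; larger means stronger
neg_log10 = {name: -math.log10(p) for name, p in p_values.items()}

# Assumed for illustration: Bonferroni threshold for 500,000 tests at alpha = 0.05
n_tests = 500_000
alpha = 0.05
bonferroni = alpha / n_tests

print(best)
print(neg_log10)
print(bonferroni)
```

Note that the strongest association is not necessarily the causal variant, since nearby markers can be correlated through LD.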
the statistical power to detect associations between DNA variants and a trait depends on:
- sample size
- distribution of effect sizes
- allele frequency
- linkage disequilibrium (LD)
“The non-random association of alleles at different loci in a given population” is the definition of what genetics term?
linkage disequilibrium
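As a rough illustration of "non-random association", LD between two biallelic loci is commonly summarized with D and r². The sketch below uses the standard definitions; the function name `ld_stats` and the example frequencies are hypothetical and not from the course material.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, r2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    """
    # D is the difference between the observed haplotype frequency
    # and the frequency expected if the alleles associated at random
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    # r2 scales D squared by the allele frequencies, giving a value from 0 to 1
    r2 = d * d / denom
    return d, r2


# Random association: the haplotype frequency equals the product of allele frequencies
print(ld_stats(0.25, 0.5, 0.5))
# Complete association: allele A is always found together with allele B
print(ld_stats(0.5, 0.5, 0.5))
```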
genome assembly is a computationally hard problem. why do longer sequencing reads make assembly comparably easier than using short reads?
a longer read includes more unique bases (and can span repeats), so you can more easily find where it belongs in the genome sequence
N50 is a metric used to quantify the quality of a genome assembly. when comparing two assemblies that used the same underlying data, assembly ‘A’ has an N50=25 Mb and assembly ‘B’ has an N50=23 Mb. considering only this information, which assembly is “better”?
assembly A
(for N50, higher is better; for L50, lower is better)
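For reference, a minimal sketch of how N50 and L50 can be computed from a list of contig lengths, using the usual definitions: N50 is the length of the contig at which the running total of the longest-first sorted lengths reaches half of the assembly size, and L50 is the number of contigs needed to get there. The function name `n50_l50` and the toy lengths are made up for illustration.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig")
    half = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half the assembly is covered
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half:
            return length, count


# Toy assembly: total length 32, so the halfway point is 16
n50, l50 = n50_l50([2, 2, 2, 3, 3, 4, 8, 8])
print(n50, l50)
```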
to date, most reference genomes are a linear representation of one individual, which presents limitations. Today, there is an increasing interest in providing a reference genome representation incorporating many individual reference genomes. What is the name of this method?
pangenome
what is the air speed velocity of an unladen swallow?
is it African or European?
a simplifying assumption made in the field of population genetics is that a new variant (mutation) only arises once. When a new mutation occurs, what is the frequency of the variant in the general population?
1/(2N)
(the frequency is low because the new variant exists on only one of the 2N chromosome copies in a diploid population of N individuals)
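A small worked example of the 1/(2N) answer, assuming a diploid population. The function name and the population size of 1,000 are illustrative only.

```python
def new_mutation_frequency(n_individuals, ploidy=2):
    """Frequency of a variant present on a single chromosome copy."""
    if n_individuals <= 0:
        raise ValueError("population size must be positive")
    # One mutant copy out of ploidy * N total copies
    return 1 / (ploidy * n_individuals)


# Hypothetical population of 1,000 diploid individuals: 1 copy out of 2,000
print(new_mutation_frequency(1000))
```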
the 1000 Genomes Project sequenced ~1000 individuals at 2-6x coverage using whole genome sequencing. We referred to that as “wide but shallow” sequencing. They also did targeted deep exome sequencing at 50-100x. what was the main motivation and justification for this experimental design?
They wanted to sequence many individuals to capture variation across the population. Low-coverage WGS covered the entire genome at a lower cost per individual, so more people could be sequenced for the same budget, while deep exome sequencing gave more confident calls in the protein-coding regions.
in the 1000 Genomes Project paper they stated “the accuracy of individual genotype calls at heterozygous sites is more than 99% for common SNPs and 95% for SNPs at a frequency of 0.5%”. Why would individual calls be more accurate at common variants than at low frequency variants?
common variants are carried by many of the sequenced individuals, so we have more data supporting them, which gives us more confidence when making the variant call. A rare variant is seen in only a few individuals, each at low coverage.
In the 1000 genomes paper, they stated that “the initial call set was found to have a high False Discovery Rate (FDR), which led to the application of further filters….” Consider your own hypothetical experiment where you wanted to evaluate FDR based on a small sample of all the variants you discovered. For your experiment, you found that you obtained 27 false positives out of 76 total variants tested. What is the FDR for your experiment?
FDR = FP/(FP+TP) = 27/76 ≈ 0.36
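The arithmetic above can be checked with a short sketch. The counts come from the card; the function name `false_discovery_rate` is only illustrative.

```python
def false_discovery_rate(false_positives, total_called):
    """FDR = FP / (FP + TP), where FP + TP is the total number of variants tested."""
    if total_called <= 0:
        raise ValueError("need at least one tested variant")
    return false_positives / total_called


# 27 false positives out of 76 variants tested (so 49 true positives)
fdr = false_discovery_rate(27, 76)
print(round(fdr, 2))
```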
the 1000 Genomes Project estimated “…that individuals carry an excess of 76-190 rare deleterious non-synonymous variants and up to 20 LOF and disease associated variants.” What is the biological reason that explains how each individual can carry such a seemingly large number of “bad” variants?
we are diploid, with two copies of each chromosome. As long as the copy on the other chromosome is still functional, the bad variant may not affect the individual.
most RNAseq experiments focus on protein coding genes. what method is most commonly used to enrich for protein coding genes? what type of RNA transcripts does the above method seek to exclude?
- Poly A selection
- it seeks to exclude ribosomal RNA (rRNA)
why has RNAseq become such a prolific tool in the field of biology?
the appeal of RNAseq lies in the fact that the twin aspects of discovery and quantification can be combined in a single sequencing assay
what is probably the single most important prerequisite of a successful RNAseq experiment?
the data generated has the potential to answer the biological question of interest
what is the minimum number of biological replicates that you must have for an RNAseq experiment?
3
in the RNAseq best practice paper they recommended that unmapped reads “should not be discarded.” give one example of how there is valuable information contained in the unmapped reads.
when studying a disease, the unmapped reads may include sequences from the bacteria or virus causing it, which can give you more information about your samples. For example, if you are studying COVID-19, reads from the virus would not map to the human reference.
we discussed four methods that can be performed to analyze single cell data downstream from the data cleaning, normalization, integration, clustering, and cell type identification. Name these four methods.
- differential expression
- cell-cell communication
- gene set enrichment
- trajectory inference
what are the three key advancements that were described when we discussed the historical timelines for the development of single cell technology?
- integrated fluidic circuits
- nanodroplets
- cell barcoding
in the isolation, sequencing, and analysis of single nuclei data we discussed the ambient signal as being problematic in the analysis of the liquid droplet containing the nuclei transcript. This is a source of data noise that we need to consider and remove as best as possible. define what is the ambient signal in the single nuclei transcriptome data.
ambient signal comes from RNA that does not belong to the nucleus in the droplet (for example, free-floating RNA released from damaged cells), so it can skew our evaluation of which genes the nucleus is actually expressing
when isolating cells or nuclei from tissue for single cell or single nuclei transcriptome sequencing experiments, what is the necessary starting reagent in the isolation of each?
- cells = proteases
- nuclei = detergents