Topic D & E. DNA Sequencing... Function Genomics Flashcards Preview


Flashcards in Topic D & E. DNA Sequencing... Function Genomics Deck (121):

What are these large regions of chromosomes that maintain homology between grape and poplar termed? Describe an approach you would use to further characterize these regions.

Syntenic Regions
I will look within these highly conserved regions to see what proteins they code for, which would indicate their functionality.


What are the genes connected by lines between grape and poplar known as? Describe a computational approach used to define these types of genes between organisms.

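Orthologs. They can be defined computationally by sequence alignment: genes from the two species whose sequences are very similar to each other (for example, reciprocal best matches) are candidate orthologs. Tools include BLASTZ and ClustalW.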


If there were lines connecting genes within grape what would these genes be known as? Describe a computational approach used to define these types of genes within an organism.

Paralogs. They can be found with a homology search within the genome, for example using HMMs.


Three years after the human genome was declared essentially finished, gaps in the sequence persist. Describe briefly 3 reasons for the remaining gaps in the euchromatic region of the genome. Do you think it is possible with current technology to close the heterochromatic gaps? Why or why not?

Tandem repeats, non-uniquely mapping reads, and structural variations.
Longer reads are needed to close the gaps.


What sequence features or genetic properties might be associated with these gaps? How might they be causing the gaps?

Repeats: it is hard to determine how long a repeat region is when reads fall entirely within it.
Heterochromatic regions: it is hard to obtain the sequence because the DNA does not dissociate well.


Acquisition and mapping of fosmid end sequences derived from unrelated individual genomes to the current human reference sequence forms the basis for the human Structural Variation Project. What kinds of important genetic information might one expect to discover from this analysis? Give 3 examples.

CNVs, inversions, translocations, and SNPs


Whole genome shotgun sequencing strategy:

An approach to genome sequencing where the whole genome is sheared into sequenceable fragments and computationally assembled. All sequencing is done ahead of time using PCR products, to form shotgun libraries of sequence reads.


Clone-by-clone sequencing strategy:

An alternative to WGS that uses a divide-and-conquer approach. First, create genomic libraries of clones immortalized in vectors such as BACs. Ideally you want 5-10x redundancy of genomic coverage in your libraries. Then form a tiling path by end-sequencing clones and aligning overlapping fragments. In so doing, you will be able to quantify gaps where clones lack coverage. Next, sequence individual clones along the tiling path and assemble contigs spanning the genome. Finally, work on finishing the sequence and plugging gaps.


Hybrid sequencing strategies:

A combination of clone-by-clone and WGS, which was used for the mouse and chicken genome projects. Such a compartmentalized shotgun could, for example, break the genome up into chromosomes and then do shotgun sequencing on each chromosome. Probably the best of both worlds, as many genome projects are now adopting a combined approach.


Draft Sequence:

Sequence with an error rate of 10^-3 → Q = 30


Finished Sequence:

Sequence with an error rate of 10^-4 → Q = 40


Segmental Duplications:

Duplicated sequences longer than 1 kb with more than 90% sequence similarity



Phred quality score (Q):

Q = -10 log10(p), where p = the error rate (or probability of an error)
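The relationship between error rate and quality score can be checked numerically. This is a minimal sketch; the function names `phred_from_error` and `error_from_phred` are chosen here for illustration and do not come from any particular library.

```python
import math

def phred_from_error(p):
    """Quality score for an error probability p: Q = -10 * log10(p)."""
    return -10.0 * math.log10(p)

def error_from_phred(q):
    """Inverse relationship: p = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)

# Error rates of 1 in 1,000 and 1 in 10,000 bases
for p in (1e-3, 1e-4):
    print(p, round(phred_from_error(p)))
```

An error rate of 10^-3 corresponds to Q30 and 10^-4 to Q40, matching the draft and finished thresholds on the cards above.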


Mate-pair sequences:

A pair of sequences derived from the two ends of a single clone. An essential component of shotgun sequencing, as the distance between the pairs gives spatial information and assists in resolving repeats.


BAC end sequences:

Used to establish mate pairs and construct the tiling path in clone-by-clone sequencing.


mRNA sequences:

Messenger RNA. Eukaryotic transcribed sequences that have been processed (i.e., spliced and exported out of the nucleus).


EST sequences:

Expressed Sequence Tags. An EST is a sequenced piece of cDNA, which may not span the whole cDNA transcript. cDNA library generation uses primers to the poly(A) tail of the mRNA transcript, and a single sequencing trace is usually performed toward the 5′ portion of the gene (all this is done on the complementary strand).



STS sequences:

Sequence Tagged Site: any sequenced fragment of DNA derived from a library of clones that is placed on the physical map of the genome. Each STS is unique, and its primers, PCR conditions, and product size are immediately quantifiable and storable in a database. Fundamental to the HGP.



Microsatellites:

Stretch of repetitive DNA made up of a variable number (several to one hundred or more) of tandem repeats of a small number of nucleotides, e.g. (AG)n or (CAG)n. Highly polymorphic (in n at least) and heterozygous; they occur at around several per hundred kilobases in higher eukaryotes.



SNPs:

Single Nucleotide Polymorphisms. Useful for mapping phenotype to gene. Highest resolution of the polymorphic markers, at about 1 per kb.


Meiotic Linkage Maps:

Linkage maps based on natural meiotic breaks from homologous recombination.


Radiation Hybrid Maps:

Linkage maps based on induced chromosomal breaks from X-ray irradiation. Fragmented chromosomes are then exposed to hamster cell lines and fragments become either incorporated into the hamster chromosomes (via homologous recombination), or segregate as mini chromosomes.



Cytogenetics:

Study of chromosomes and the related disease states caused by numerical and structural chromosome abnormalities. FISH is especially used in cytogenetics.



FISH:

Fluorescence In Situ Hybridization. Hybridize a fluorescent DNA probe to mitotic chromosomes at metaphase. Used in "chromosome painting", where one species' chromosomes are labeled and synteny with another species is sought.



BACs:

Bacterial Artificial Chromosomes. A system to clone approximately 100 kb of DNA into bacteria.


Clone-based Physical Maps:

Assembled genomic sequence based on hierarchical sequencing of clone libraries. Contig alignment to chromosomes.



Euchromatin:

Open, active DNA with genes being actively transcribed. Classically associated with acetylation of histones and HATs.



Heterochromatin:

Closed, inactive DNA, tightly coiled and not actively transcribed. Classically associated with methylation and methyltransferases.



Centromeres:

Structures of eukaryotic chromosomes that serve as the attachment point for the spindle apparatus during mitosis. Highly repetitive; the centromere separates the long arm from the short arm in human chromosomes.



Telomeres:

Sequences toward the ends of chromosomes that contain mainly simple repeats and duplicates. They prevent chromosomes from fusing with each other by forming tertiary structures that protect the termini. Interestingly, their ends are not fully replicated by the standard replicative polymerases but are instead extended by their own enzyme, telomerase.


What are the two differences between finished and draft genome sequences?

The finished genome closed many, but not all, of the gaps in the draft sequence; some heterochromatic gaps, gaps at euchromatic boundary regions, and interior gaps remained. The finished genome increased continuity, with an increase in N50 contig size. It also corrected the order and orientation of draft contigs and eliminated artifactual sequence duplications.
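Since the answer above uses N50 contig size as the measure of continuity, here is a minimal sketch of how N50 is computed from a list of contig lengths. The function name `n50` and the example lengths are illustrative assumptions, not data from any assembly.

```python
def n50(lengths):
    """Return the length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")

# Hypothetical contig lengths (arbitrary units)
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Merging contigs during finishing moves bases into longer contigs, so the N50 rises even when the total assembled length barely changes.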


Why is sequencing telomeric DNA more difficult than euchromatic DNA?

telomeric DNA is more condensed and contains many repeating sequences that are hard to assemble with short reads


You are part of a large consortium that performs a large GWAS of 10,000 individuals that aims to identify risk factors for coronary artery disease (CAD). You identify four genomic locations that show significant association with CAD. Together these loci explain 2 percent of the heritability of CAD in your population, with relative risks ranging from 1.3 to 1.8. Is this surprising? Name at least three reasons that might explain why the heritability is so low.

Rare alleles, interactions, and environmental effects. Many variants with smaller effects may be acting together, rather than one or two variants with large effect sizes. Also make sure there was no population stratification underlying your study.


Name three or more possible sources of bias introduced by T7 RNA polymerase amplification of mRNA from single cells



Name three or more possible strategies that a cell can use to reduce gene expression noise in vivo, assuming the same steady-state protein concentration.



For your graduate research project, you are interested in studying the highly repetitive genome of Sequoia trees. You need to produce a reference genome sequence. What high throughput sequencing technique would you use and why?

hierarchical sequencing. you want to use technology that allows for longer reads and paired end reads because the genome is so highly repetitive


As a postdoc you identify a novel class of human RNA molecules that are likely not polyadenylated. You want to know how prevalent they are in the human transcriptome. What high throughput sequencing technique would you use and why?

RNA sequencing with rRNA depletion instead of poly(A) selection for library prep, because it captures non-polyadenylated RNAs and can measure relative expression levels of these novel RNAs.


As a PI, you become obsessed with identifying all transcripts that have expressed, overlapping 3 prime UTRs in K562 cells, a human blood model cell line. What high throughput sequencing technique would you use and why?



A pseudogene is a locus that resembles a protein-coding gene but lacks the ability to encode a functional protein. Given this observation, what are three possible ways that you could distinguish whether the sequence you identify is a gene or a pseudogene?



Bisulfite sequencing

Bisulfite treatment converts C's to U's at unmethylated sites, while C's at methylated sites are unchanged. The green signals indicate sites where methylation patterns are not significantly different in normal versus tumor cells, as the signals in the bottom two panels are similar. The red regions indicate regions that are significantly more methylated, or repressed, in tumor cells than in normal cells.
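The conversion rule can be illustrated with a toy in-silico example. This is only a sketch: the function names, the example sequence, and the simplification that converted U is read as T after amplification (and that only one strand is considered) are assumptions made for illustration.

```python
def bisulfite_convert(seq, methylated_positions):
    """Convert unmethylated C to T (U is read as T after PCR);
    methylated C stays C. Positions are 0-based."""
    methylated = set(methylated_positions)
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)

def infer_methylated(reference, converted_read):
    """Positions where the reference has C and the read still shows C."""
    return [i for i, (r, c) in enumerate(zip(reference.upper(), converted_read.upper()))
            if r == "C" and c == "C"]

reference = "ACGTCCGA"
read = bisulfite_convert(reference, methylated_positions=[1])
print(read, infer_methylated(reference, read))
```

Comparing the converted read back to the reference recovers the methylated position, which is the basis for calling methylation from bisulfite reads.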


Exome sequencing is becoming a standard tool for mapping Mendelian disease-causing (or pathogenic) nonsynonymous single nucleotide variants (nsSNVs). A minor allele frequency (MAF) filtering approach is often used to identify candidate pathogenic mutations in these studies. However, hard filtering in exome sequencing of Mendelian diseases still leaves a large number (typically around 100 to 1000) of candidate nsSNVs. Please provide at least three different ideas/methods that you can use to predict which of the remaining ones have serious functional consequences and prioritize them for validation.



The ENCODE Project has generated hundreds of ChIP-seq experiments spanning 119 transcription factors, histone marks, and other DNA-binding proteins and hundreds of cell lines for public use.
Suppose you have a list of genes that are dysregulated in a particular condition or tissue. Based on the ENCODE database, how do you identify the possible transcription factors regulating these genes?

Look at the ChIP-seq peaks from the ENCODE database for your cell type or tissue of interest and overlay them with RNA-seq data. Check that you are seeing dysregulation of the same genes. Compare ChIP-seq peaks for normal tissue and the tissue of interest to identify significant differences in peaks for transcription factors, indicating up- or down-regulation of a transcription factor that may affect expression of your gene of interest.


Suppose that you have identified a GWAS lead SNP (SNP1), which is in a LD block of 5 other SNPs. Explain how you can use the ENCODE data to potentially identify the functional SNPs.

Align multiple sources of binding information. Look at how prevalent the SNP is across similar tissues or conditions compared to surrounding SNPs. Check whether the SNP is in a functional or nonfunctional region: is it in an exon of a gene? Check whether there are any other nearby regulatory marks that may affect this SNP but not the others.


What can you conclude about chromatin modification with respect to cell types?

Chromatin modifications near promoters seem to be similar irrespective of cell type.
Chromatin modifications near non-redundant enhancers seem to be more variable and more cell-type specific.


The prairie dog genome has been predicted to be 1.9 Gb. Using the brand new Illumina HiSeq 2500, 2 x 150 paired-end sequencing with an average output of 600 million reads per lane is possible, and on average, 75% of all bases are Q30 or above (1 error in 1,000). Using this system, what level of coverage would one lane give you with all bases of Q30 or better? Show your work for full credit.

(600,000,000 × 2 × 150 × 0.75) / 1,900,000,000 ≈ 71X coverage
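The same arithmetic as a short script, in case you want to vary the inputs. This is a sketch under the card's own assumptions (600 million reads per lane, each counted with two 150-base ends, 75% of bases at Q30 or above); the function and parameter names are made up for illustration.

```python
def q30_coverage(reads_per_lane, ends_per_read, read_length, q30_fraction, genome_size):
    """Fold coverage from bases at Q30 or better."""
    usable_bases = reads_per_lane * ends_per_read * read_length * q30_fraction
    return usable_bases / genome_size

coverage = q30_coverage(
    reads_per_lane=600_000_000,
    ends_per_read=2,          # 2 x 150 paired-end
    read_length=150,
    q30_fraction=0.75,
    genome_size=1_900_000_000,
)
print(f"{coverage:.1f}x")  # roughly 71x
```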


The sequencing gives you great assembly of gene-rich regions of the genome, but you still have 2,000 scaffolds with a total size of 1.7 Gb and 28 predicted chromosomes (n=28). You decide that an assembled genome is essential to your research. Therefore, you decide to create a BAC library with average insert sizes of 100 kilobases. With this in mind, how many total BAC clones will you need to reach 10X coverage for this genome? Show your work for full credit.

1.7 Gb × 10 (10X coverage) / 100,000 bp (100 kilobases) = 1.7 × 10^5 BAC clones
BAC: inserts up to 200 kb, more commonly used
YAC: really huge inserts, sometimes 1 Mb
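The clone-count calculation follows the same pattern: total bases needed divided by bases per clone. A minimal sketch, using the 1.7 Gb figure from the answer above; the function name `clones_needed` is an illustrative choice.

```python
def clones_needed(genome_size_bp, fold_coverage, insert_size_bp):
    """Number of clones required for the requested fold coverage,
    rounded up to a whole clone."""
    total_bases = genome_size_bp * fold_coverage
    return -(-total_bases // insert_size_bp)  # integer ceiling division

print(clones_needed(1_700_000_000, 10, 100_000))
```

With larger inserts (for example 200 kb BACs) the same coverage needs half as many clones.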


The human and mouse genomes are said to be finished, whereas all other vertebrate genomes currently in NCBI are said to be either high-quality draft sequences or low-coverage draft sequences. What criteria are used to declare a vertebrate genome "finished", and what is meant by "high-quality draft genome" and "low-coverage draft genome"?

Finished has as few gaps as possible by focused strategies, high quality of base calls q>40 (


The finished human genome still contains gaps. Give 4 different reasons for why there are still gaps

Repetitive regions which cannot be resolved by relatively short read sequences
Heterochromatic regions: hard to sequence
Multigene families that have a lot of structural similarities but polymorphisms between individual gene members
Structural variations: inversions, segmental duplications, insertions, deletions


Why are heterochromatin regions hard to sequence?

Constitutive heterochromatin is composed mainly of high copy number tandem repeats known as satellite repeats, minisatellite and microsatellite repeats, and transposon repeats


Describe 2 features that a finished genome provides that are lacking or clearly suboptimal in a high-quality draft genome.

The draft has many more gaps, less continuity, more errors in the order and orientation of contigs, more artifactual sequence duplications, and unresolved segmental duplications and structural variations. The finished genome provides a definitive reference for experimentation.


Describe 2 applications for which a low-coverage draft genome is useful.

SNP calling
Simple sequence motif matching


What are "mate-pair" or "paired-end" sequences, and how are they used in assembling a high-quality draft genome?

Paired-end sequences are derived from opposite ends of the same BAC clone (or general reference sequence). Having sequences from both ends of a BAC clone is important for arranging the relative order of clones or contigs in producing a tiling map for assembling high-quality draft genome and resolving gaps. Connections aid assembly and allow inference of full sequence.


An assembly of the mouse genome prepared using Next-gen paired-end sequencing has recently been completed. What are the potential problem areas you would be wary of in this assembly? Why?

Next-gen sequencing reads are very short and thus certain regions of the genome are unlikely to be accurately constructed. This includes repetitive regions (such as pericentromeric or peritelomeric) and regions with high frequency structural variations. To accurately sequence such regions, longer reads are necessary.


Briefly describe a novel sequencing strategy that does not involve the standard matrix-based (e.g. polyacrylamide gel) length separation.

Pyrosequencing, single-molecule sequencing...


What is the main computational challenge in DNA sequence assembly? Briefly describe one way to alleviate this problem.

Aligning the reads even with sequencing errors – determining true variation compared to sequence errors. Reads that map to many regions of the genome.


What biologically relevant information can a finished genome sequence tell you that a high-quality draft genome sequence cannot? Give 2 examples.

Draft genomes are in scaffolds and contigs.
1 - lengths of repeats that may be missed in the draft
2 - actual genetic positions, full chromosomes, etc.
3 - SNPs vs. sequencing errors


Diagram and describe how a mammalian genome is sequenced and assembled using a hierarchical shotgun sequencing strategy. Describe 3 ways to test the assembly for completeness and fidelity

Create DNA libraries
Form a tiling path by aligning overlapping fragments
Sequence individual clones along the tiling path
Assemble the draft genome
Tests of the assembly: quality scores, coverage, resequencing


Solexa and 454 strategies have recently become available. What problems would you foresee in using this data for sequencing new vertebrate genomes using the whole-genome shotgun method? Name at least two.

Reads too short, especially harder to map repeat regions
Less accurate than Sanger
More prone to error in poly-nucleotide sequences (e.g. 5 A’s v. 6 A’s in a row)


What are the advantages of true single-molecule sequencing ? Name at least 2.

1. No PCR errors
2. No PCR “jackpots”
3. get the haplotype directly


How are paired-end read datasets from next-generation sequencers used to analyze both structural variation and SNPs?

1. We can detect CNVs and segmental duplications from variation in the distance between paired-end reads.
2. Paired-end reads handle assembly and mapping problems better than single short reads, so true SNP variation can be distinguished from mis-mapping.
3. They can also identify haplotypes, if both reads of a pair cover SNPs.


Why is it so difficult to assemble whole genomes using current next-generation sequencing technology?

Reads are too short, which makes it hard to close gaps and to span repetitive regions, so assembly remains a challenge.


Give 2 ways that DNA sequences from large, specific genome regions can be enriched prior to next-gen sequencing.

1. ChIP
2. PCR
3. Target capture for interesting regions


How is the ABI SOLiD sequencing platform different from both the Illumina and the 454 platforms?

ABI SOLiD is sequencing by ligation whereas Illumina and 454 are sequencing by synthesis.


A class of transposable elements (Alus) has increased significantly in abundance in humans compared to chimps. Speculate why there might be a difference in the abundance of this transposable element between these two recently diverged organisms? Describe two approaches you could use to test your hypotheses.

Alu elements were reactivated in humans under positive selection, implying functional significance.
We can compare the human and chimp genomes, looking for regions under strong positive selection by calculating the Ka/Ks ratio across the genome or by applying Tajima's D test.


Provide two models whereby an increase in a specific class of transposable element could drive the process of speciation. Describe an experiment you would develop to test each of your two models.

make new promoters, make new alternative splice sites, etc


You are a graduate student in a lab attempting to identify expressed genes in a strain of corn that has been bred over many generations for extremely high content of polysaccharides and polyphenolic compounds. These compounds make this strain of corn extremely resistant to freezing and herbivory by insects. You’ve spent the last 5 months attempting to extract RNA from this strain of corn, but are still unsuccessful. Your advisor informs you that polysaccharides and polyphenols interfere with your ability to extract RNA and that you should think of different approaches for identifying the expressed genes in this corn strain. Propose two methods for identifying the expressed genes in this plant that do not involve the use of RNA.

We can look for open chromatin, nucleosome shifts for DNA, should see peak of H3K4me3 peak over promoter indicates activity, should see H3K36me3 over gene body showing that it is being actively transcribed.
ChIP-seq of RNA pol II


Give a genome-wide approach to Project 1: The lab has identified a new human transcription factor Maniac, but nothing is known about its regulatory targets. Your first project is to identify all regulatory targets of this novel transcription regulator.



Genome-wide approach: Project 2: Some preliminary results suggest that Maniac may co-localize with a specific histone modification, H3K4 mono-methylation. Your second project is to test this hypothesis.

We can perform ChIP-seq for H3K4me1 and for Maniac, using MACS for peak calling. Then we compare the two sets of binding peaks to test whether they overlap more than expected by chance (Fisher's exact test). Co-IP is another option, although it is a bit more complicated in terms of bench work.


The Yeast Artificial Chromosome (YAC) cloning method played a crucial role in the first 5 years of the Human Genome Project, yet it is almost never used today. Why?

Low cloning efficiency


ways to knockout gene function

homologous recombination (HR), double stranded RNA (DS), short hairpin RNA (SH), or small interfering RNA (SI), CRISPR


Explain the retroviral life cycle, including 5 specific steps. What distinguishes retroviruses from other types of viruses?

Fusion – binds receptors on the host cell and releases RNA and proteins upon degradation of the capsid
Reverse transcription – RNA → cDNA
Integration – transports into the nucleus and inserts itself into the host genome
Transcription from genome – viral genes are transcribed and translated using host machinery
Budding – progeny virions pinch off
Retroviruses reverse transcribe their RNA and integrate it as cDNA into the host genome – other viruses don't integrate into the host genome


Explain briefly the key steps in acquiring a finished vertebrate genome assembly using hierarchical shotgun sequencing. Describe three ways to test the assembly for completeness and accuracy.

1) Create DNA libraries of genomic clones of 50-200 kb
2) Form a tiling path by aligning overlapping fragments
3) Sequence individual clones along the tiling path
4) Assemble the draft genome
5) Finishing – filling in gaps, increasing quality of reads, etc.
3 ways to test the assembly for completeness and accuracy:
Compare against STS and EST information
Look at BAC clone fingerprints
Quality scores, resequencing, coverage


The Rhesus monkey genome has recently been completed. What important new information can it give us that a two-way comparison of the already-completed human and chimpanzee genomes cannot?

The Rhesus monkey serves as an outgroup when comparing this 3-taxon phylogeny. It allows us to determine what was present in the MRCA of the three primates. We can see what are the human specific mutations as well as chimp specific, etc.


Describe GWAS.

We recruit DNA samples from 3000 affected patients and 3000 healthy controls. All DNA samples are genotyped on an Illumina 1M SNP microarray, and genotypes for the 1M SNPs are called with GenomeStudio software. QC is performed to remove low-quality samples and SNPs. After that, we exclude duplicate and related samples and match controls to cases by ethnicity to avoid population stratification. A chi-square test can then be applied to detect SNP association with the disease.


Describe the computational steps that you used to identify the significant SNPs in GWAS?

If population structure or substructure has been detected, we can use logistic regression with a few principal components as covariates to identify which SNPs are significantly associated with the disease. If cases and controls are well matched in ethnicity, we can use a chi-square test or Fisher's exact test to discover disease-associated SNPs.
Since there is multiple testing issue, we need to correct the association P-value by FDR control or Bonferroni correction.
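As a rough illustration of the well-matched case, the sketch below runs a Pearson chi-square test (1 degree of freedom) on a 2x2 table of allele counts for one SNP and computes a Bonferroni threshold. The allele counts are invented for the example, the function names are illustrative, and the test assumes every row and column total is nonzero. The p-value uses the identity that the upper tail of a 1-df chi-square at x equals erfc(sqrt(x/2)).

```python
import math

def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square statistic and p-value (1 df) for a 2x2 allele-count table."""
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    total = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_sums = [sum(row) for row in table]
    col_sums = [case_alt + ctrl_alt, case_ref + ctrl_ref]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square upper tail, 1 df
    return chi2, p_value

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests

# Hypothetical allele counts for a single SNP
chi2, p = allelic_chi_square(case_alt=60, case_ref=40, ctrl_alt=40, ctrl_ref=60)
threshold = bonferroni_threshold(0.05, 1_000_000)  # 1M SNPs tested
print(chi2, p, p < threshold)
```

With one million SNPs the Bonferroni threshold is 5e-8, so a nominally significant SNP like the one in this toy table would not survive correction.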


Name 4 omics technologies used in systems biology approaches.

Genomics: DNA sequencing to find polymorphisms
Transcriptomics: RNA-Seq, microarrays
Proteomics: Mass Spectrometry, Yeast 2 hybrid
Epigenomics: ChIP-Seq, ChIP-Chip to screen for genome-wide epigenetic changes or transcription factor loading


Identify two possible applications of single cell sequencing

Single cell RNA-Seq: study gene expression in individual neurons.
Single cell DNA-Seq: study mutations in individual cancer cells.


Briefly describe the purpose of Hi-C sequencing

Study the organization of chromosomes and interaction of the chromosome regions.


Briefly define CpG islands and their significance in genomics.

CpG islands are genomic regions that contain a high frequency of CpG sites (>200 bp and GC% > 50%). CpG islands typically occur at or near promoter regions. The gene is repressed when the island is methylated and active when it is unmethylated.
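The two thresholds on this card can be applied to a candidate sequence as follows. This sketch checks only the length and GC criteria stated above; other definitions add further criteria that are not covered here. The function names and test sequences are illustrative assumptions.

```python
def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def cpg_count(seq):
    """Number of CpG dinucleotides (C immediately followed by G)."""
    return seq.upper().count("CG")

def meets_card_criteria(seq, min_length=200, min_gc=0.5):
    """True if the sequence is longer than min_length and has GC fraction above min_gc."""
    return len(seq) > min_length and gc_fraction(seq) > min_gc

candidate = "CG" * 150      # 300 bp, all G/C
at_rich = "AT" * 150        # 300 bp, no G/C
print(meets_card_criteria(candidate), cpg_count(candidate))
print(meets_card_criteria(at_rich))
```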


In up to three sentences, how does systems biology differ from reductionist approaches?

A systems biology approach differs from a reductionist approach by considering all players in a pathway and all pathways involved in a phenotype or trait of interest, whereas a reductionist approach focuses on single-gene/single-pathway methods to study biological phenomena.


During analysis of a library of cDNA clones made from RNA extracted from a normal human skin sample, you discover a small subset of clones (about 0.1%) that do not match the human reference genome sequence by BLAST analysis of their sequences. What are 4 possible reasons for this unexpected result? How would you test whether your hypothesis in (b) is correct?

RNA editing (C→U or A→I(G) by deamination)
Insertions, deletions, or other structural variations between the sample and reference
Technical error in the base call
Incorrect mapping of the sequencing read

Technical error in the base calls. Next-generation sequencing has a higher error rate per base call than Sanger sequencing, although higher read depth partially compensates. Typically, targeted Sanger sequencing of the discordant base is done for validation by an independent technology, and some discordant bases are found to in fact be concordant with the reference genome.
Check whether the mismatches tend to fall toward the ends of sequencing reads, where the error rate tends to be higher. Do Sanger sequencing on the same samples where the discordance was observed.


We wish to test whether various cis elements are effective in acting as transcriptional activators. We decide to clone the cis elements into a luciferase reporter vector and transfect into a primary cell culture. Briefly describe controls that will be important to quantitatively assess the relative effectiveness of different cis elements.

Negative control: the same vector but without any cis elements
Positive control: a known cis transcriptional activator
An empty vector to control for transfection efficiency
A known repressor for comparison of expression levels


Describe in detail the methods you would use to characterize microRNAs in breast cancer samples.

Perform HITS-CLIP sequencing or small RNA sequencing. HITS-CLIP sequencing is done by UV cross-linking RNA to protein and immunoprecipitating Argonaute proteins. After purification and reversing the cross-links, make libraries out of the extracted RNA molecules, which should have come from mRNA-microRNA duplexes. After sequencing these, map them to the human genome and look specifically for matches to the 3′ UTR regions of known genes. Look for differential gene expression between the normal and breast tumor tissues.
Small RNA sequencing is done by isolating the population of small RNAs in the cell by gel electrophoresis and cutting out a band around 18-22 nucleotides. After sequencing, map the reads to the reference human genome and look for differential expression between normal and breast tumor samples.


Provide an analysis of the quality of quantification as a function of the number of reads sequenced for an RNAseq experiment. Based on your analysis, what is your recommended sequencing depth for a mammalian transcriptome assay?

FPKM or RPKM: Fragments (or Reads) Per Kilobase of transcript per Million fragments/reads mapped. This normalizes the read/fragment counts by the total read/fragment count and by gene/exon/fragment length. Generally, people recommend 100 million reads.
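The normalization can be written out directly from the definition. A minimal sketch with made-up counts; the function name `fpkm` and its parameter names are illustrative and do not correspond to any specific tool's API.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    kilobases = transcript_length_bp / 1_000
    millions_mapped = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions_mapped)

# Hypothetical gene: 500 fragments on a 2 kb transcript, 20 million mapped fragments
print(fpkm(500, 2_000, 20_000_000))
```

Dividing by length keeps long transcripts from looking more highly expressed simply because they produce more fragments, and dividing by library size makes samples sequenced to different depths comparable.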


Describe in detail the methods you would use to characterize mRNAs and long noncoding RNAs in these samples. Include how you would identify and measure transcript levels, gene fusions, alternative splice forms, alternative promoters, and alternative polyA addition sites in the breast cancer genome from the RNAseq data.

To characterize mRNAs and long noncoding RNAs, isolate the respective populations (by poly(A) tail selection and size selection, respectively) and create libraries, then amplify and subject them to high-throughput paired-end sequencing to allow identification of gene fusions. After mapping the paired-end sequences to the human genome, measure/analyze:
Transcript levels: Measure FPKM (number of fragments per kilobase of exon model per million mapped fragments) to quantify the expression levels of individual gene transcripts.
Gene fusions: Look for reads whose paired ends map to different genes to identify possible gene fusions.
Alternative splice forms: Align reads using a splice-aware aligner such as TopHat to look for reads that span different exon-exon junctions.
Alternative promoters: Look for peaks or regions of high-density coverage upstream of known promoters to detect alternative promoter usage. Use Cufflinks.
Alternative poly(A) addition sites: Look for extension of the 3′ UTR relative to known 3′ UTR ends, indicated by reads downstream of known 3′ UTR sites. Use Cufflinks.


Describe in detail at least two approaches you would use to discriminate between transcripts with driver mutations likely to be important for the cancer vs those bearing passenger mutations with no likely function, using these datasets.

Look for mutations that are conserved across all samples; conserved mutations are likely to be driver mutations. These mutations should have low levels of heterozygosity if they are recessive, and if they are dominant, they should have low minor allele frequencies. Another approach is to evaluate the dn/ds ratio of mutations and look for those under purifying/negative selection (dn/ds


Using the RNAseq datasets described above and existing public databases, describe in detail how you would identify and validate microRNAs likely to be important for the formation and/or maintenance of the breast tumors.

Look for microRNAs that are differentially expressed in the tumor tissue vs. normal samples and then, using public databases, see if they have any known gene targets. If not, leverage the mRNA-seq datasets and see if any genes have high sequence complementarity in their 3′ UTR regions to the miRNA seed sequence. Further, see if expression of the miRNA is negatively correlated with the mRNA across the 20 samples.
Validate the miRNAs by rescue with the WT gene or with a version of the gene resistant to the identified miRNA, and see if the phenotype is rescued. For miRNAs found to be more highly expressed in tumors, test whether transfection of the WT gene prevents metastasis; the genes targeted by the miRNAs in this case are expected to be tumor suppressor genes. For miRNAs found to be more highly expressed in normal samples, see if transfection of the WT gene induces metastasis; the miRNA in this case is expected to be targeting a proto-oncogene.
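The seed-complementarity check mentioned above can be sketched as a simple string search. This is a simplified illustration, not a full target-prediction method: it assumes the seed is nucleotides 2-8 of the miRNA, requires a perfect match, and ignores site context and conservation. The function names are made up, and the miRNA and UTR strings are example inputs.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def seed_target_site(mirna, start=2, end=8):
    """Sequence an mRNA must contain to pair with the miRNA seed
    (reverse complement of miRNA nucleotides start..end, 1-based)."""
    seed = mirna.upper().replace("T", "U")[start - 1:end]
    return "".join(COMPLEMENT[base] for base in reversed(seed))

def find_seed_sites(utr, mirna):
    """0-based start positions of perfect seed matches in a 3' UTR."""
    site = seed_target_site(mirna)
    utr = utr.upper().replace("T", "U")
    positions = []
    index = utr.find(site)
    while index != -1:
        positions.append(index)
        index = utr.find(site, index + 1)
    return positions

mirna = "UGAGGUAGUAGGUUGUAUAGUU"       # example miRNA sequence
utr = "AAACUACCUCAGGGCUACCUCUU"        # hypothetical 3' UTR fragment
print(seed_target_site(mirna), find_seed_sites(utr, mirna))
```

Genes whose 3′ UTRs contain such sites and whose expression is negatively correlated with the miRNA are the candidates worth carrying forward to validation.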


Suppose we were to design 50-mer probes for a microarray for yeast genome. List five important factors to consider in the probe design.

annealing temp
GC content (stability)
folding over on itself (palindromic sequences): hairpins
location of probes in genome: coverage
uniquely mapping to genome
perfect match and imperfect match (to test for background binding) need multiple spots for each
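Several of these factors can be checked computationally before ordering probes. The sketch below is illustrative only: the thresholds (stem length, loop length) and function names are assumptions for this example, and melting temperature is left out because it depends on the salt conditions and the model chosen.

```python
def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def revcomp(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq.upper()))


def has_hairpin(seq, stem=6, min_loop=3):
    """True if a stem-length window can pair with a downstream reverse complement."""
    seq = seq.upper()
    for i in range(len(seq) - 2 * stem - min_loop + 1):
        arm = seq[i:i + stem]
        if revcomp(arm) in seq[i + stem + min_loop:]:
            return True
    return False


def count_matches(probe, genome):
    """Count exact occurrences of the probe on both strands (overlaps included)."""
    probe, genome = probe.upper(), genome.upper()
    patterns = {probe, revcomp(probe)}   # a palindromic probe is counted once
    total = 0
    for pattern in patterns:
        start = genome.find(pattern)
        while start != -1:
            total += 1
            start = genome.find(pattern, start + 1)
    return total
```

A probe would pass this filter if its GC content is in the desired range, `has_hairpin` is false, and `count_matches` against the genome is exactly 1.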


Briefly discuss the organization and significance of the top layer of the Gene Ontology hierarchy.

Top layer: Biological Process, Molecular Function and Cellular Component. This divides gene annotations into 3 categories. A gene can be annotated in 1-3 categories.


In 1 to 2 sentences, what is the transcriptome?

The set of all RNA molecules, including mRNA, rRNA, tRNA, and other non-coding RNA produced in one or a population of cells.


Why is the number of reads in RNA seq important for RNA expression analysis?

Read counts quantify RNA molecules: the number of reads mapping to a transcript indicates its expression level.


Explain what RNA-seq is. For RNA expression analysis, what are the advantages of RNA seq over traditional DNA arrays? The disadvantages?

“Whole Transcriptome Shotgun Sequencing” refers to the use of high-throughput sequencing technologies to sequence cDNA in order to get information about a sample’s RNA content.
Advantages:
1. identify RNA sequence, can discover novel transcripts
2. get the sequence read directly
3. higher coverage
Disadvantages:
1. expensive
2. biased to high abundance transcripts
3. need to align


Using RNA expression analysis, how would you analyze transcriptional output of the cell cycle? What are the key considerations for this analysis to focus on truly cycling genes?

Knock out checkpoint genes, and analyze RNA expression in cells that are blocked in a certain phase. Synchronize the cell population?
Use FISH for chromosomes to identify the cell cycle phase, stall the cell cycle at a given phase, and analyze the gene expression.


Describe how you would:
Subdivide the tumors into categories
Identify candidate biological pathways within each tumor category that are aberrantly regulated

Hierarchical clustering

Pathway calling for the clustered gene list. (gene list enrichment for pathway)
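The gene list enrichment step is commonly scored with a hypergeometric (one-sided Fisher) test. The sketch below is a minimal version under that assumption; the function and parameter names are illustrative, and a real analysis would also correct for testing many pathways.

```python
from math import comb


def enrichment_p(total_genes, pathway_genes, list_size, overlap):
    """P(X >= overlap) when drawing list_size genes without replacement.

    total_genes   : number of genes in the background (e.g. all genes on the array)
    pathway_genes : number of background genes annotated to the pathway
    list_size     : number of genes in the cluster-derived gene list
    overlap       : number of list genes that are in the pathway
    """
    upper = min(pathway_genes, list_size)
    favorable = sum(
        comb(pathway_genes, k) * comb(total_genes - pathway_genes, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return favorable / comb(total_genes, list_size)
```

A small p-value suggests that the pathway is over-represented in the gene list for that tumor category.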


What is tumor heterogeneity, and how might it affect the exome mutational data?

Cancer cell diversity. The millions of cells that make up the lump have become distant relatives.


How would you try to distinguish potential functionally relevant driver mutations from passenger mutations, using only the sequencing data?

Passenger mutations do not have any effect on the cancer cell, but driver mutations will cause a clonal expansion.
A driver mutation is causally implicated in oncogenesis.
A passenger mutation has not been selected, has not conferred clonal growth advantage and has therefore not contributed to cancer development.
Driver mutations cluster in the subset of genes that are cancer genes whereas passenger mutations are more or less randomly distributed.
- “The cancer genome,” Nature 458, 719-724 (9 April 2009)
1. Find the gene sequence that is consistent but random among the tumor population
2. the driver gene sequence should be consensus within tumor, but across tumor (?)
3. We can distinguish by looking for orthologous genes or genomic regions in tumors from multiple species with the same type of cancer.


How would you combine the mutational data with the expression profiling data to distinguish potential functionally relevant driver mutations from passenger mutations?



One fundamental difficulty in interpreting microarray data from tumors or organs is that measured changes in gene expression may reflect either intracellular changes or shifts in the relative proportions of distinct cell populations. Describe an experimental methodology to overcome this challenge. Next, design a computationally-based approach to address the same challenge. What other data sets will be required to implement your computationally-based approach?

Do single cell analysis
Do a controlled experiment in which you array known mixtures of cell populations as well as homogeneous cell populations, then use those data to computationally fit mixture models to the tumor samples.
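For the simplest case of two cell populations, the mixture model can be fitted in closed form. The sketch below assumes that the mixed signal is a linear combination of two reference profiles measured on pure populations; the function names are illustrative, and more than two populations would need a constrained regression instead.

```python
def estimate_fraction(mixed, pure_a, pure_b):
    """Least-squares estimate of p in mixed ~ p * pure_a + (1 - p) * pure_b.

    All arguments are per-gene expression values in the same gene order.
    The estimate is clipped to the valid range [0, 1].
    """
    numerator = sum((m - b) * (a - b) for m, a, b in zip(mixed, pure_a, pure_b))
    denominator = sum((a - b) ** 2 for a, b in zip(pure_a, pure_b))
    if denominator == 0:
        raise ValueError("reference profiles are identical")
    return min(1.0, max(0.0, numerator / denominator))


def residuals(mixed, pure_a, pure_b, p):
    """Per-gene difference between the observed and the mixture-predicted signal."""
    return [m - (p * a + (1 - p) * b) for m, a, b in zip(mixed, pure_a, pure_b)]
```

Genes with large residuals after fitting are candidates for true intracellular expression changes rather than shifts in cell proportions.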


Describe an experimental strategy to identify the community flora of human wound lesion, including type of sequencing, genes targeted and rationale the experimental design

16S rRNA sequencing with primers on the conserved regions extending into the variable regions, which identify the microbial species. The 454 sequencing platform is preferable because of its longer read length compared to the Illumina sequencing platform. The 16S rRNA gene is highly conserved but also varies among microbial species, so sequencing it makes it easy to identify the microbes and quantify their abundance in a single assay.


Describe how one might numerically characterize the human oral microbiome including possible measures of diversity.

We can calculate the relative abundance of each organism by normalizing the read counts by gene length and sequencing depth. The Shannon index can be used to measure the diversity.
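A minimal sketch of both calculations, assuming read counts per taxon have already been obtained (the function names and the input format are assumptions for this example):

```python
from math import log


def relative_abundance(counts):
    """Convert a {taxon: read_count} mapping into fractions that sum to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over taxa with nonzero counts."""
    fractions = relative_abundance(counts).values()
    return -sum(p * log(p) for p in fractions if p > 0)
```

H is 0 when a single taxon accounts for all reads and reaches ln(number of taxa) when all taxa are equally abundant.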


Describe how one might identify novel organisms whose sequence data is not found in existing databases.

We can do whole genome shotgun sequencing, and then de novo assemble the genomes. By comparing to the existing databases, we can identify novel species.


Name two common types of mass spectrometric assays.

MALDI-TOF: combine a matrix-assisted laser desorption/ionization source with a time-of-flight mass analyzer.
LC MS/MS: Similar to gas chromatography MS (GC/MS), liquid chromatography mass spectrometry (LC/MS or LC-MS) separates compounds chromatographically before they are introduced to the ion source and mass spectrometer. Tandem mass spectrometry (MS/MS).
A mass spectrometer consists of three components: ion source, mass analyzer, and detector. The ionizer converts some portion of the sample into ions. An extraction system removes ions from the sample and gives them a trajectory that allows the mass analyzer to sort the ions by mass-to-charge ratio. The detector measures the value of an indicator quantity and thus provides data for calculating the abundance of each ion present.


The yeast two hybrid and IP mass spectrometry are two methods that are used to generate genome wide protein protein interaction networks. Describe how one of these two works:

IP mass spectrometry pulls down a protein of interest by immunoprecipitation, using an antibody coupled to a bead or column. Next, mass spectrometry is performed on the pulled-down protein complex, which likely contains proteins interacting with the protein of interest. IP mass spec is more likely to capture true biological binding partners that act in vivo, whereas yeast two-hybrid is subject to possible false positives arising from artificial interactions that do not happen in vivo, especially if the protein is not endogenously a yeast protein interacting in the nucleus.


How does yeast two hybrid work?



Describe the notion of centrality and lethality in a protein protein interaction network.

The idea of centrality is that there are certain hubs in protein protein interaction networks that have many interacting partners and deletion or knockdown of these central genes is often lethal.
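The simplest measure of centrality is degree, the number of interaction partners. A minimal sketch, assuming the network is given as a list of protein pairs (the function names and the cutoff are illustrative):

```python
from collections import defaultdict


def degrees(edges):
    """Number of distinct interaction partners for each protein."""
    partners = defaultdict(set)
    for a, b in edges:
        if a != b:                 # ignore self-interactions
            partners[a].add(b)
            partners[b].add(a)
    return {protein: len(others) for protein, others in partners.items()}


def hubs(edges, min_degree):
    """Proteins with at least min_degree partners, sorted by name."""
    return sorted(p for p, d in degrees(edges).items() if d >= min_degree)
```

The centrality-lethality idea can then be tested by comparing the fraction of essential genes among the hubs with the fraction among the remaining proteins.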


Briefly explain the principles behind MALDI-TOF proteomics technology.

Matrix-assisted laser desorption ionization - time of flight

The matrix protects the protein from direct contact with the laser. The proteins are ionized by transferring energy from the laser to the matrix and then to the protein. The time of flight is the amount of time it takes for an ion to fly from point A to point B (determined by mass/charge).
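The dependence of flight time on mass/charge follows from energy conservation: an ion of charge z accelerated through a voltage U gains kinetic energy zeU = mv^2/2, so over a drift length L the flight time is t = L * sqrt(m / (2zeU)). The sketch below assumes an idealized linear instrument with no initial velocity spread; the function name and example values are illustrative.

```python
from math import sqrt

ELEMENTARY_CHARGE = 1.602176634e-19   # coulombs
DALTON = 1.66053906660e-27            # kilograms


def flight_time(mass_da, charge, voltage, length_m):
    """Idealized time of flight in seconds.

    mass_da  : ion mass in daltons
    charge   : number of elementary charges on the ion
    voltage  : accelerating potential in volts
    length_m : field-free drift length in meters
    """
    mass_kg = mass_da * DALTON
    return length_m * sqrt(mass_kg / (2 * charge * ELEMENTARY_CHARGE * voltage))
```

Because t scales with the square root of m/z, an ion with four times the mass at the same charge takes twice as long to reach the detector.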


Describe two different experimental methods for high-throughput measurement of protein-protein interactions.

Yeast two-hybrid with a library of domains
Co-immunoprecipitation and mass spec


The yeast-two-hybrid assay is based on what general feature of eukaryotic transcription factors?

They are modular


Compare quantitative protein profiling with nucleotide microarrays. What does each of them measure, and what are two advantages and two disadvantages of protein profiling?

Protein profiles give you actual protein expression levels and nucleotide microarrays give you transcript levels.
Advantages of protein profiling:
Gives you protein levels instead of transcripts (b/c they don’t always correspond)
Can look at PTMs
Disadvantages of protein profiling:
You can’t amplify proteins, so you can generally only detect abundant proteins
Expensive, and you can only look at things with existing antibodies
Can only probe known proteins.


Describe a human shRNA library. What are its key features?

Many short hairpin RNA molecules that target many genes in the human genome. To ensure specificity, a given gene may have many individual shRNA target sites. In addition, the shRNAs may target different regions of a gene, such as different sequences in the 3' UTR. Includes an efficient delivery system such as a lentiviral vector.


Explain how you would use such a library for identifying new potential drug targets for BRAF inhibitor-resistant melanomas, using the cell lines described above.

Infect each of the cell lines with the shRNA library and monitor cell death after introduction of BRAF inhibitor therapy. Cells that die are those in which the introduced shRNA targets a gene needed for resistance, so they are no longer able to resist BRAF inhibitor therapy. Recover the responsible shRNA by testing and narrowing down subpools of the shRNA library, or assay each shRNA in a separate well of a plate. To ensure that genes identified in the primary screen are real, repeat the experiments on all 20 cell lines with additional shRNAs that target the gene in different regions of the transcript. See if the wild type phenotype (resistance to BRAF inhibitor therapy) can be rescued by transfection of the WT version of the gene without infection of shRNA, or by transfection of a version of the gene resistant to the shRNA identified.


Explain how you would validate candidate targets arising from this functional genomics screen.

Try to rescue the WT phenotype by introducing the WT version of the gene or a version of the gene resistant to the shRNA identified. Disrupt the gene by a more stable method, such as site-specific homologous recombination, to see if the phenotype identified in the screen is consistent and persistent.


Suppose you get 30 candidate proteins that cluster into 6 discrete biological pathways. What might this result be telling you about BRAF inhibitor resistance in melanoma? How would you approach developing drugs that could be effective for BRAF inhibitor-resistant melanoma?

BRAF inhibitor resistance in melanoma involves a central or hub gene that implicates many pathways. Thus, there is a lot of cross talk between different pathways. This also suggests that this pathway is important and that knockdown of too many genes may be lethal, depending on the specific nature of the pathways. It is critical that the effect is lethal for melanoma specifically and not lethal for normal cells.

Target combinations of the 30 target genes and observe which one gives the most effective phenotype of selective death of melanoma but not normal cells. Be careful not to target too many genes in the same pathway or genes that have similar functions in different pathways, as this might induce synthetic lethality.


Your lab is interested in hedgehog pathway signaling. Conveniently, you have a reporter cell line with integrated hedgehog responsive promoter driving luciferase. When you add hedgehog ligand to the cells you get 100 fold induction of reporter signal. Design a genome wide siRNA screen to find components that are necessary for response to hedgehog ligand.

Use an siRNA library targeting human genes to screen for the responsible genes. Make sure each gene has several siRNAs targeting it. Transfect the hedgehog reporter cell line with the siRNA library. We also need some controls, such as reporter cell line only and reporter cell line with ligand. Find the genes whose knockdown decreases the fold change after adding the hedgehog ligand.
Positive Controls
PC-S: This is the positive control for silencing. The PC-Ss are siRNAs that induce a high level of gene knockdown, they are NOT involved in the pathway you are studying and should not target genes that affect cell proliferation or survival (e.g., GAPDH or beta-actin). The PC-S will simply provide information on the efficiency of the positive knockdown in the screen and will NOT be used in the statistical analysis of your data.
PC-A: This is the positive control for your assay. The PC-As are siRNAs that should induce your screening phenotype and CAN BE used in the statistical analysis of your data for evaluating hits. The PC-As should target known genes in your pathway and it is very important to test several potential PC-As to find one that produces the desired phenotypic change at the levels you require.

PC-A2: It is often very helpful if a second PC-A (we’ll call it PC-A2) is used that induces a moderate phenotypic change in the assay. This control will not be used in the statistical analysis of the data but can be used as a phenotypic marker to evaluate the results.
Negative Controls
NC-NT: This is the non-targeting negative control and establishes the baseline for your assay. The NC-NT measures the changes siRNA delivery can make on gene expression. In most cases, this should be a nonsense sequence with no complementarity to known genes and should have no effect on your assay results.
NC-NsiRNA: This is the nontransfected negative control and contains only seeded cells with no transfection reagent/siRNAs. The NC-NsiRNA can be used in conjunction with the NC-NT to determine if siRNA delivery affects assay results.
NC-T: This is the treated negative control and is only used in experiments having additional treatments (drugs, chemicals, etc.). The NC-T serves as the baseline for effects of treatment alone on cells.
NC-NC: This is the no-cell negative control. The assay wells are treated with all reagents used during the experiment, and the control measures non-specific signals from these reagents. This control is considered unnecessary for most screens.
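One way the assay controls can enter the statistical analysis is sketched below: the Z'-factor uses the PC-A and NC-NT wells to judge assay quality, and each siRNA well is then scored against the NC-NT baseline. The function names, the cutoff of -3 and the use of raw luciferase signal are assumptions for this example; real screens often log-transform and normalize per plate first.

```python
from statistics import mean, stdev


def z_prime(positive, negative):
    """Z'-factor: 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    spread = stdev(positive) + stdev(negative)
    return 1 - 3 * spread / abs(mean(positive) - mean(negative))


def call_hits(signal_by_sirna, negative, cutoff=-3.0):
    """Return siRNAs whose signal lies at or below the cutoff in z-score units.

    signal_by_sirna : {siRNA name: reporter signal after ligand addition}
    negative        : reporter signals of the non-targeting (NC-NT) wells
    """
    baseline, spread = mean(negative), stdev(negative)
    return sorted(
        name for name, signal in signal_by_sirna.items()
        if (signal - baseline) / spread <= cutoff
    )
```

A Z'-factor above roughly 0.5 is commonly taken to mean that the positive and negative controls are well enough separated for screening.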


Briefly explain the principles of the ChIP-chip assay. Describe the general features of ChIP assays and what they are used for.

Chromatin immunoprecipitation on an array. Antibodies are used to pull down a protein, and the array is used to look for enrichment of the regions of the genome bound to that protein. Often used for identifying transcription factor binding sites, regions with certain chromatin modifications, etc.


You are setting up a genome-wide siRNA screen to investigate host cell genes affecting replication of an intracellular bacterium. You have available siRNAs for 20,000 human genes arrayed 90 siRNAs per 96 well plate. You thus have six wells free for controls. Name three types of controls you would include in the six free wells.

One well with no siRNA
An siRNA with a known knockdown phenotype (positive control)
An siRNA that does not target anything (non-targeting negative control)


Name as many uses of DNA array technology as you can.

ChIP-chip, DNA methylation detection (bisulfite conversion), mRNA expression arrays, SNP calling


Name a technique that allows one to identify nucleotide sequences bound by a transcription factor in vivo at the genomic level.

ChIP-chip, ChIP-seq


In 1 to 2 sentences each, explain advantages/disadvantages between cDNA, long oligonucleotide, and short oligonucleotide DNA arrays.

cDNA (and promoter) arrays (e.g. Brown/Botstein):
pros: Prevalent infrastructure, cheap to produce in your own lab, sensitive
cons: Variable quality, cross hybridization
Long oligonucleotide array (e.g. Agilent):
pros: Sensitive, commercially sourced, high density
cons: Cross hybridization, density, cost
Small oligonucleotide arrays (e.g. Affymetrix):
pros: Extremely high density, multiple independent measures, open-source analysis algorithms, base-level discrimination
cons: Cost, sensitivity
*Exon arrays
The longer the probe, the higher the probability that another gene will have similar sequences that could cross-hybridize; e.g., think about a genome-length probe: almost any sequence would cross-hybridize somewhere.
The shorter the probe, the fewer base pairs that hybridize and the lower the sensitivity of detection. Affy tries to get around this by having many of them.


Your lab is interested in hedgehog pathway signaling. Conveniently, you have a reporter cell line with integrated hedgehog responsive promoter driving luciferase. When you add hedgehog ligand to the cells you get 100 fold induction of reporter signal. Design a genome wide siRNA screen to find components that are necessary for response to hedgehog ligand.

siRNA screen for cell line with and without hedgehog


Describe the key steps in a ChIP-seq experiment you might carry out to localize CTCF binding sites

Compare with a background (input) sequencing
Experiment: pull down CTCF-bound chromatin genome-wide, digest away the unbound regions, and do next-generation sequencing to identify the bound sequences.

*cross-link DNA
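The comparison with the input background can be illustrated by a simple fold-enrichment calculation over genomic bins. This is only a sketch under stated assumptions: reads are already counted per fixed-size bin, the input is scaled to the ChIP library size, and the function name, pseudocount and fold cutoff are made up for this example. Dedicated peak callers use more careful statistical models.

```python
def enriched_bins(chip_counts, input_counts, min_fold=4.0, pseudocount=1.0):
    """Return indices of bins where ChIP signal exceeds the scaled input.

    chip_counts  : reads per bin from the CTCF pull-down
    input_counts : reads per bin from the input (background) library
    """
    # Scale the input so that both libraries have the same total depth.
    scale = sum(chip_counts) / sum(input_counts)
    hits = []
    for index, (chip, background) in enumerate(zip(chip_counts, input_counts)):
        fold = (chip + pseudocount) / (background * scale + pseudocount)
        if fold >= min_fold:
            hits.append(index)
    return hits
```

Bins that pass the cutoff are candidate CTCF binding sites, to be confirmed by, for example, the presence of the CTCF motif.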


Can you be confident that you have identified the genomic locations of all of the CTCF binding sites for these two cell types? If not, what types of sites do you think may have been missed?

Sites that CTCF binds only under some other stimulation or together with co-activators.


What do you think the cohesin ChIP-seq data mean? Give at least three possible interpretations.

1. Cohesin itself is regulating some genes by binding the chromatin
2. Combinatorial effect by cohesin
3. Cohesin has a regulatory function together with other transcription factors


You’re a CEO of a biotech company that specializes in protease inhibitors. There are about 500 proteases in humans, and you would like to discover which ones are strong contributors to the survival of a particular form of osteosarcoma that can be modeled as a cell. How would you screen these proteases to find ones that regulate this process?

High throughput screen of all the protease inhibitors that your company makes