Genomics - Lecture Flashcards

(64 cards)

1
Q

What is Sanger sequencing?

A

Method to determine the DNA sequence –> exploits the fact that dideoxynucleotides terminate replication –> Hence, if we add a specific dideoxynucleotide (i.e. ddATP, ddCTP, ddTTP or ddGTP), it is incorporated opposite its complementary base (A-T and C-G) and terminates replication –> the point of termination corresponds to the location of the complementary base.

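Hence, when we run the products on a gel (in practice a high-resolution polyacrylamide gel rather than agarose, since the fragments differ by a single base) we get DNA fragments of different lengths –> each corresponding to the location of a particular base.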

2
Q

What does a tube for Sanger sequencing contain?

A

DNA replication is performed in four separate tubes, each containing:

  1. Single-stranded DNA to be sequenced
  2. DNA polymerase
  3. Primers
  4. The four dNTPs (dATP, dCTP, dTTP and dGTP)
  5. Small amount of one of the four 2’,3’-dideoxy analogs (ddATP, ddCTP, ddTTP or ddGTP)

Note - Either the primers or the dNTPs are labelled (radioactively with 32P, or with fluorescent labels) –> allows us to visualize the fragments on the gel.

3
Q

Outline the procedure of Sanger Sequencing.

A
  1. A short piece of DNA called a primer is added –> the primer will bind specifically to a DNA sequence –> serves as a starting point for DNA replication.
  2. Primers are elongated using DNA polymerase.

If this were all, the reaction would copy a new chain until it stopped. However, this is not the case as we have dideoxynucleotides.

Dideoxynucleotides –> Lack an -OH on both the 2’ and 3’ carbons –> This means that when a dideoxy base is incorporated into a DNA molecule, the chain terminates –> with no 3’-OH, the phosphodiester backbone cannot be extended.

  3. At any position, either a normal base is added, so the chain can continue to grow, or a dideoxy base is added, so the chain terminates –> After many cycles of replication –> we get many DNA chains of different lengths –> each ending at a position of a particular base.
  4. The base sequence can then be read off the gel.
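
The chain-termination idea can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the template is written 3'->5' so its copy reads left to right, primer length is ignored, and the function names are made up for illustration.

```python
# Toy model of Sanger chain termination (illustrative names, not a real tool).
# Assumption: template is written 3'->5', so the copied strand reads 5'->3'.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def termination_lengths(template, dd_base):
    """Lengths of copied chains that end with the given dideoxy base."""
    copied = [COMPLEMENT[base] for base in template]
    return [i + 1 for i, base in enumerate(copied) if base == dd_base]

def read_gel(template):
    """Run the four 'tubes', then read bases off in order of fragment length."""
    base_at_length = {}
    for dd_base in "ACGT":
        for length in termination_lengths(template, dd_base):
            base_at_length[length] = dd_base
    # Shortest fragment first, exactly as the bands are read up the gel.
    return "".join(base_at_length[n] for n in sorted(base_at_length))
```

Each tube contributes the fragment lengths for one base; sorting all lengths together reconstructs the copied strand.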
4
Q

How has Sanger sequencing been improved?

A

Modern DNA sequencing.

The DNA fragments are separated by size in a gel-filled capillary tube –> DNA is added to one end of the gel, a charge is applied to the tube and the DNA moves through according to size, smallest first –> electrophoresis.

Reading the sequence is done by illuminating the DNA, just before it emerges, with a laser to detect the ‘coloured’ tag on the dideoxy base (each dideoxy-base has a coloured tag) at the end of the DNA copy –> The colour of the emitted fluorescence is read by the detector and a base is assigned.

The result is stored and assessed by software designed to test how reliable the base assignment is.

Note –> The peaks created are not perfect –> Hence, we need a score in order to gauge validity.

5
Q

What is the Phred quality score (Q-value) used for in modern sequencing?

A

Each nucleotide read is assigned a quality score based on how confident the read prediction is –> Generally called the Q value.

A commonly used approach is to only count the bases with a quality score of 20 and above (99% accuracy, i.e. a 1 in 100 chance that the base call is wrong)

All sequencing technologies (Illumina etc) provide a quality score for each nucleotide in their sequences
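
The link between a Q value and accuracy follows from the standard Phred definition, Q = -10 x log10(P), where P is the probability that the base call is wrong. A minimal sketch (function names are illustrative; the ASCII offset of 33 is the common FASTQ convention and is an assumption about the input file):

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call with quality q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred quality for a given error probability."""
    return -10 * math.log10(p)

def decode_quality_string(qual, offset=33):
    """Convert a FASTQ quality string to Q values (assumes Phred+33 encoding)."""
    return [ord(char) - offset for char in qual]
```

So Q20 corresponds to a 1% error probability and Q30 to 0.1%.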

6
Q

Examples of next gen sequencing methodologies?

A
  1. Illumina
  2. Life Technologies –> Ion Torrent
  3. Pacific Biosciences (PacBio)
  4. Oxford Nanopore Technologies
7
Q

Outline how Illumina performs DNA sequencing?

A

Illumina –> most widely used.

Difference between Sanger and Illumina –> Sanger sequences a specific DNA sequence, whereas with Illumina you take the entire genome (e.g. the human genome) and break it apart into small DNA fragments.

  1. Fragment the entire genome into small DNA fragments.
  2. DNA is made single-stranded
  3. Add Adaptors to the end of all the DNA fragments
  4. Add the DNA to a flow cell –> the flow cell has probes fixed to it which match the adaptors.
  5. As each DNA fragment has two adaptors it bridges across to bind the other adaptor to another probe.
  6. Add nucleotides and polymerases –> builds double-stranded sequence
  7. Denature the dsDNA and repeat process to produce several million sequence clusters.

Now that we have amplified the DNA we can start sequencing it.

8
Q

Outline how we are able to read the DNA sequence using Illumina.

A

Add four labelled reversible terminators, primers and polymerase.

When the reversible terminator is incorporated –> no other nucleotides can be added –> but the block is reversible so it can be removed –> allowing the next nucleotide to be added.

This process repeats itself –> allows us to read one base at a time.

So the process is as follows…

First reversible terminator is incorporated –> Read the base with a laser –> remove the fluorescent label and the terminating group –> repeat.

Note –> each nucleotide is labelled with a different colour –> allowing the laser to distinguish between them.

Once again, each base is given a quality score for the probability it is incorrect.

9
Q

Problem with Illumina?

A

Can only read relatively short sequences at a time –> sometimes only 150-200 nucleotides.

Problematic for genome assembly because you need to find the position of each sequence relative to each other (examining overlap)

Longer reads would be more useful.

10
Q

What is Pac Bio?

A

Produces the longest reads – approx 20,000 bases

Raw reads have a high error rate of approx 13% (i.e. approx 87% accuracy) –> however, errors are random so they are compensated for by multiple reads and creating a consensus –> Consensus method gives 99.999% accuracy.

Basically, when you sequence a genome multiple times –> the errors are random (all over the place) –> we can align the sequences and get an overall consensus (chance of an error in the same position in multiple reads is very unlikely) –> to generate an accurate read.

Would not be possible if errors were focussed on particular areas or sequence patterns
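
The consensus idea can be illustrated with a simple majority vote per position. This is a sketch, assuming the reads are already aligned and of equal length; real consensus callers are more sophisticated, and the function name is illustrative.

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority base at each column of equal-length, pre-aligned reads."""
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*aligned_reads)
    )

# Each read carries at most one random error, at a different position.
reads = ["ACGTAC", "ACGAAC", "TCGTAC", "ACGTAC", "ACGTCC"]
result = consensus(reads)
```

Because the errors fall at different positions, every column still has a clear majority and the consensus recovers the true sequence.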

11
Q

Why can PacBio produce longer reads than Illumina?

A
  • Polymerase used by Illumina falls off the DNA quite easily –> it is a struggle to keep the DNA polymerase on.

However…

PacBio uses a highly processive polymerase (an engineered version of the polymerase from bacteriophage phi29) –> it binds tightly to the DNA and can copy very long stretches without falling off.

This is an advantage for sequencing –> allows us to create longer reads.

12
Q

How do we go about sequencing a genome?

A

After sequencing a genome (Sanger or next-gen sequencing) we end up with many sequenced fragments of DNA which need to be put together –> like a puzzle –> Known as de novo assembly

Note –> this is not as problematic now as we have already sequenced the human genome which we can use as a reference.

Method - Shotgun sequencing

  1. The genome sequence is shredded into pieces and inserted into plasmids
  2. We sequence each fragment from both ends –> get reads –> this process is repeated for all fragments of the genome
  3. The sequence is then assembled de novo or against a reference for comparison
13
Q

The basic idea behind De Novo sequencing?

A

Get genome –> convert it into a lot of fragments –> sequence –> examine for overlaps –> if there is sufficient overlap we know that those two fragments are adjacent to each other –> over time build up the genome.

In principle this seems easy, but in reality it is not because…

We have millions of fragments + we have a lot of repetitive sequences.
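
The overlap step can be sketched as a suffix/prefix comparison between two reads. This is a minimal illustration, assuming error-free reads and exact matching; real assemblers use overlap or de Bruijn graphs, and the names and `min_len` threshold here are illustrative.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that matches a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def merge(a, b, min_len=3):
    """Join two reads if they overlap sufficiently, otherwise return None."""
    n = overlap(a, b, min_len)
    return a + b[n:] if n else None
```

With repetitive sequence, several different reads can satisfy the same overlap, which is exactly why repeats make the puzzle ambiguous.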

14
Q

Outline De novo sequencing.

A

De novo sequencing

We get fragments and search for overlaps to build up a consensus.

But at some point you will reach a mismatch between overlapping reads –> this is where we use the quality score to gauge whether we should consider this difference or not –> if the quality score is low, we disregard the differing base as an error.

15
Q

What does the ‘Depth of Coverage’ refer to?

A

Sequencing errors are eliminated by the depth of coverage of overlapping sequence fragments.

For the Human Genome Project, most of the genome was sequenced at 12X or greater coverage.

This means that each base was present in 12 reads on average –> the more times we sequence and check the overlap –> the higher the probability that the overlap is correct.

However…

Even with 12x coverage, approximately 1% of the genome was not accurately assembled

General rule of thumb - More complex genomes need more depth of coverage.
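
Average depth of coverage is usually estimated as (number of reads x read length) / genome size. A small sketch of that arithmetic follows; the function names and the example numbers are illustrative, not figures from the lecture.

```python
import math

def mean_coverage(n_reads, read_length, genome_size):
    """Average number of reads covering each base."""
    return n_reads * read_length / genome_size

def reads_needed(target_coverage, genome_size, read_length):
    """Reads required to reach a target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)

# Hypothetical example: 1 million 150 bp reads over a 5 Mb genome.
example = mean_coverage(1_000_000, 150, 5_000_000)
```

This is an average only: some regions will end up with far fewer reads than the mean, which is one reason parts of a genome stay poorly assembled.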

16
Q

What is one of the challenges of De novo sequencing?

A

Highly repetitive DNA –> you don’t know whether this repeat fragment is from one part of the genome or the other.

If you don’t assemble correctly –> you can lose a large portion of the sequenced DNA between the two repeats (as shown in the picture).

17
Q

How can paired-end reads be used to deal with repetitive DNA?

A

Paired End Reads

Paired-end reads are sequences from both ends of a DNA fragment. We know the paired ends because we sequence from both sides of each fragment –> so we know what’s on each end but not necessarily what’s in the middle.

If the fragment is 700bp and the reads 100bp they provide 3 pieces of information:

  • the tag 1 sequence
  • the tag 2 sequence
  • that they were 500bp apart in your genome (distance: 700 - 2 x 100)

This gives you the ability to map to a reference (or assemble de novo) using that distance information. It helps resolve structural rearrangements (insertions, deletions, inversions), as well as helping to assemble across repetitive regions.
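
The distance arithmetic for a read pair can be written out as follows. A minimal sketch, assuming 0-based coordinates on the forward strand and an exactly known fragment length; the function names are illustrative.

```python
def inner_distance(fragment_length, read_length):
    """Unsequenced gap between the two reads of a pair."""
    return fragment_length - 2 * read_length

def expected_mate_start(anchor_start, fragment_length, read_length):
    """Where the second read should start, given where the first one maps."""
    return anchor_start + fragment_length - read_length

gap = inner_distance(700, 100)
```

In practice fragment lengths vary around a mean, so the expected position is a window rather than a single coordinate.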

18
Q

How exactly are paired-end reads used?

A

Since we know the paired-end read sequences and the distance between them we can start building up the genome –> as we know that they must be a specific distance apart.

If one read is unmappable because it falls in a very repetitive region but the other end is unique, you can use that distance information to map both reads.

Basically, we know one end, which acts as an anchor –> whereas the other end falls in a repetitive region –> normally we wouldn’t be able to tell where this repetitive sequence is from, but since we know that it must be a certain distance from the anchor we know its position.

Normally works for short repeats not longer repeats.

19
Q

What do we use for large repeats?

A

Use mate pairs –> same approach as paired ends –> used when the repeat is larger (a few kb) than the distance paired-end reads can span.

We sequence either end of a very large fragment –> we know that the two ends go together + we know the distance between them.

This allows us to bridge across large regions, and the sequence between the two ends can then be filled in using paired-end reads.

20
Q

How can paired-end reads be used to understand cancer genomes?

A

Cancer genomes have a lot of DNA rearrangements –> paired-end reads can be used to detect them.

Sequence the cancer genome and map it back to the reference genome.

What we want to find out is how the cancer genome varies from the reference.

We get paired-end reads which are X distance apart in the cancer genome –> when mapped back to the reference, if they are less than X apart this means that we have an insertion in the cancer genome. Likewise, if the distance on the reference is more than X then we have had a deletion in the cancer genome.

We can also identify inversions (the sequence is inverted) or translocation (movement of sections to a different chromosome)
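
The insertion/deletion logic can be sketched as a comparison between the expected pair distance (X) and the distance observed once both reads are mapped to the reference. The tolerance value and function name are assumptions for illustration; real callers model the fragment-size distribution.

```python
def classify_pair(expected, observed, tolerance=50):
    """Classify a read pair by its mapped distance on the reference.

    expected: distance between the reads in the sequenced (cancer) genome.
    observed: distance between the reads when mapped to the reference.
    """
    if observed < expected - tolerance:
        # Sample has extra sequence that the reference lacks.
        return "insertion"
    if observed > expected + tolerance:
        # Sample is missing sequence that the reference has.
        return "deletion"
    return "concordant"
```

Inversions and translocations show up differently: as pairs with unexpected orientation, or with the two reads mapping to different chromosomes.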

21
Q

What is scaffolding?

A

When sequencing we create contigs –> Contiguous sequence where base order is known –> Assembled from sequence reads.

But we will always get gaps –> use mate pairs to figure out the order of contigs.

Note - The approx length of the fragments is known, so we know roughly the number of base pairs between contigs.

Hence, a Scaffold is a…

Genome sequence reconstructed from contigs and gaps.

To fill the gap we need to manually sequence using chromosome walking.

22
Q

What are some assembly limitations?

A

De novo assembly of complex genomes is still problematic –> Even greater problem for next-gen sequencers with small read lengths –> Will improve with longer reads and 3rd-gen sequencers.

Examples of difficult genomes

Entamoeba histolytica

  • Very AT rich genome
  • Ploidy unknown
  • Over 1500 contigs
  • Genome size approx 20Mb

Blumeria graminis

  • Repeat rich
  • Approx 7000 contigs
  • Genome size approx 120Mb
23
Q

Benefits of Next-Gen sequencing?

A

Next generation sequencers can be used for:

  1. Genome sequencing/re-sequencing –> looking for variants
  2. Targeted resequencing
  3. SNP detection
  4. Transcriptome sequencing for expression analysis, and splice variant detection (RNA-Seq)
  5. Protein-DNA/RNA interactions (ChIP-Seq) –> protein binding sites
  6. DNA Methylation (MeDIP-Seq) –> methylation patterns.
    - Seq –> means next-gen sequencing.
24
Q

What is genome annotation?

A

Going from the basic DNA sequence –> to the fully annotated version –> introns, exons, transcription factor binding sites, repeats, etc.

25
Q

What is the objective of genome annotation?

A

The objective of genome annotation:

Identify all of the possible genes and features within the genome + assign functions to as many genes as possible.

This should include...

  • Variable splice sites that may mean the gene has multiple products
  • Regulatory features and their functions

As well as...

Identify all other features, including repeats

The better the picture we have –> the better we can perform comparative studies between organisms of interest.
26
Q

How do we identify genes?

A

Gene Hunting By Sequence Homology

Many useful tools for gene identification are based on sequence identity –> the assumption is that if 2 genes are (very) similar in sequence they will encode proteins with similar structure/function.

Hence...

Compare the unknown sequence to sequences of known (or guessed) function by sequence alignment methods

Performing BLAST searches against:

  • EST (expressed sequence tag) databases - an EST is a short sub-sequence of a cDNA sequence, which corresponds to mRNA.
  • SwissProt database

Note –> Even if our sequence matches a protein of unknown function –> the existence of similarity itself is strong evidence that the sequence is protein-encoding.

**Major problem** – finding genes **without** a known homologue, so additional methods are required
27
Q

Why is the first exon hard to identify in a BLAST search?

A

Only a few of the bases in the first exon may actually be coding (the rest could be UTR - untranslated region), so the match is quite often reported as insignificant in the BLAST search –> making it hard to identify.
28
Q

Prokaryotic gene structure?

A

If we can't identify the coding sequence via a BLAST search –> we need to identify other relevant sequences –> ribosome binding site, promoter region, -10 and -35 sequences etc. –> in order to help us identify the gene.

We are looking for the usual patterns that are found in these specific sequences (consensus - average)
29
Q

What are the two ways genes are identified?

A
  1. Straightforward homology
  2. Examining patterns of sequences associated with the gene.
30
Q

What is an open reading frame?

A

Open Reading Frame (ORF) - The coding sequence of an mRNA is a set of in-frame triplet codons starting with AUG and ending with UAA, UAG or UGA –> starts with a start codon and ends with a stop codon.

Note - We need to consider all 6 possible reading frames: 3 forward and 3 reverse

In random sequence, 3 of the 64 possible codons are stop codons, so a stop codon turns up about 1 in every 21 codons by chance –> hence, only “real” ORFs tend to be long (no stop codon in the middle)

ORFs in bacteria are easy to predict as there are no introns.

Whereas...

Eukaryotic ORFs are much harder to predict due to the intron/exon structure, and some peptides are small
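
A six-frame ORF scan can be sketched as below. This is a toy version working on DNA (ATG / TAA, TAG, TGA rather than the mRNA codons), with a tiny illustrative length cut-off; real gene finders use a much larger minimum length, and the function names are made up for this example.

```python
STOPS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Reverse complement, used to scan the three reverse frames."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def orfs_in_strand(seq, min_codons=2):
    """ORFs (ATG ... stop, inclusive) in the three frames of one strand."""
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    found.append(seq[start:i + 3])
                start = None
    return found

def find_orfs(seq, min_codons=2):
    """Scan all 6 reading frames: 3 forward and 3 reverse."""
    return orfs_in_strand(seq, min_codons) + orfs_in_strand(
        reverse_complement(seq), min_codons
    )
```

For example, `find_orfs("CCATGAAATAGCC")` finds the single ORF `ATGAAATAG` in forward frame 3; the reverse strand contains an ATG but no in-frame stop.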
31
Q

Why is gene prediction in prokaryotes easier than eukaryotes?

A

Gene prediction in bacteria is easier than in eukaryotes:

  • No introns
  • Smaller intergenic regions –> microbial genomes tend to be gene rich (80%-90% of the sequence is coding), whereas in eukaryotes genes tend to be further apart.
32
Q

What do bacterial genefinders normally do?

A

Bacterial genefinders often just look for the largest open reading frames (ORFs), those above a certain size (approx. 300bp), and consider them to be real genes.

This is a reasonable assumption for genomes with a low GC content.

However...

It is a problem for those with a high GC content –> stop codons (TAA, TAG, TGA) are AT-rich, so with a lack of A and T in these genomes there are far fewer stop codons.

Hence...

Long ORFs occur simply by chance in high GC genomes, many of which are not genuine genes

But...

Low GC content –> high accuracy.
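
The effect of GC content on chance ORF length can be estimated with a simple model. This sketch assumes bases are independent, with A and T equally frequent and G and C equally frequent; those assumptions and the function names are illustrative, not part of the lecture.

```python
def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def stop_codon_probability(gc):
    """Chance that a random codon is TAA, TAG or TGA at a given GC fraction."""
    p_at = (1 - gc) / 2  # probability of A (and of T)
    p_gc = gc / 2        # probability of G (and of C)
    # TAA contributes p_at^3; TAG and TGA each contribute p_at^2 * p_gc.
    return p_at ** 3 + 2 * p_at ** 2 * p_gc

def expected_codons_before_stop(gc):
    """Average spacing between chance stop codons, in codons."""
    return 1 / stop_codon_probability(gc)
```

At 50% GC this gives the familiar 3/64 (a stop about every 21 codons); at 70% GC, chance stops are roughly 2.5 times rarer, so long spurious ORFs become much more common.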
33
Q

How do you identify a genuine ORF?

A

We can use the longest ORF, but this becomes challenging as the GC content increases –> need to use additional information (different parts of the gene –> promoters, etc.)
34
Q

How is codon bias used to predict ORFs?

A

In every organism, 61 codons code for amino acids –> results in redundancy as we only have 20 amino acids –> Arg has 6 codons –> inefficient to use all 6.

Wobble base pairing allows one tRNA to bind several codons –> This means that tRNAs prefer some codons to others

So organisms have a bias towards specific codons –> higher frequency (e.g. Drosophila and E. coli)

Note - Bias can also be used to moderate gene expression –> to reduce gene expression –> increase the number of different codons used for a specific amino acid –> slows down the process –> as a new tRNA would have to bind every time.

Hence, examining a candidate gene's codon bias –> gauge how likely it is to be a gene.
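
Codon bias can be measured by counting how often each synonymous codon is used in a coding sequence. A minimal sketch using the six arginine codons; the example sequence is invented and the function names are illustrative.

```python
from collections import Counter

ARG_CODONS = ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"]

def codon_counts(cds):
    """Count in-frame codons, ignoring any trailing partial codon."""
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))

def relative_usage(cds, synonymous):
    """Share of each synonymous codon among all codons for that amino acid."""
    counts = codon_counts(cds)
    total = sum(counts[codon] for codon in synonymous)
    if total == 0:
        return {}
    return {codon: counts[codon] / total for codon in synonymous}

# Invented example: ATG CGT CGT CGC AGA TAA
usage = relative_usage("ATGCGTCGTCGCAGATAA", ARG_CODONS)
```

A candidate ORF whose codon usage matches the known preferences of the species is more likely to be a genuine gene than one with flat or atypical usage.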
35
Q

Summarize the different steps in bacterial gene prediction.

A
  1. Homology (BLAST) search against a good quality database (UniProt-SwissProt, RefSeqP)
  2. Identify ORFs – those above 300bp
  3. Use codon bias information for the species in question –> programs use Markov models to detect compositional biases and increase the reliability of gene detection
  4. Identify other gene-related structural features:
    - Promoter
    - -35 and -10 (Pribnow box) regions
    - Shine-Dalgarno sequence

Note - Steps 2-4 are ab initio gene prediction methods
36
Q

What does ab initio gene prediction mean?

A

Looking at the raw DNA sequence and looking for patterns –> not using homology.
37
Q

Why is eukaryotic gene prediction more difficult?

A

Far greater problem:

  1. Multi-exon genes – introns
  2. Multiple transcripts from a single gene depending on splicing.
  3. Large intergenic regions –> stretches of DNA between genes.

However, there are a lot of different patterns we can use.
38
Q

First thing you do to predict genes in eukaryotes?

A

As with prokaryotes, search for homology – BLAST

Search against a good quality protein database – UniProt-SwissProt, RefSeqP

Furthermore, look for identity to cDNA sequences from either:

  • cDNA library sequence (e.g. high throughput) –> identify exons –> compare to DNA
  • EST library sequence - ESTs are paired-end reads of cDNA clones and so define the 5’ and 3’ ends of mRNAs / transcripts

Problems:

  • Novel genes
  • No information about variable transcripts –> use RNA-Seq data.
39
Q

What do we use if we can't find a gene in a database? (Eukaryotes)

A

Ab initio gene prediction –> Looking for patterns

Many species have unique genes with limited or no representation in the sequence databases + non-coding genes will not be in the protein databases.

Ab initio gene prediction uses mathematical models and computer programs to predict gene locations based on sequence alone

Some prediction programs now have near 100% sensitivity –> However, as the sensitivity increases the overall accuracy falls due to increased false positives (predictions of genes that aren't correct)

But we can use a combination of ab initio and homology predictions to overcome this –> extra confidence.
40
Q

How does the ab initio approach work for eukaryotes?

A

Pattern recognition

Genes as a series of binding sites –> gene expression involves a series of protein / nucleic acid interactions –> these interactions can be seen in the DNA sequence as sequences related to a consensus binding site sequence (comparing to an average).

The problem –> the binding sequence can vary –> not always the same.

Computers can search for these binding site sequences in “raw” genome sequence and guess the functional elements of the genome.

Look for the correct spatial organisation of these elements –> if found, it’s likely to be a gene
41
Q

How do programs overcome the increase in complexity of eukaryotic DNA when it comes to gene prediction?

A

Patterns differ between species but there is a degree of conservation within a species.

Programs such as Genscan are trained using sets of known genes –> from the same species or the most closely related one available.

The training set acts as a model for intron/exon lengths, splice sites, GC content etc.

The program uses these models to predict genes

Basically, the program examines other known genes of the same (or a related) species to create a general model (training itself), which helps us identify genes.
42
Q

What are the things that programs such as Genscan look for?

A

The program teaches itself to look for these relationships/patterns.

  1. **GC content**:
    - Human approx 38% GC but, as with all species, this varies widely within the genome
    - Regions of high GC content (62-68%) have a higher relative gene density than regions of lower GC content –> regions where genes are located
    - Exon length is relatively uniform with respect to GC content –> relationship –> informative.
    - Intron length decreases dramatically in regions of high GC content
  2. **Patterns:**
    - PolyA tail region has a specific sequence pattern with consensus AATAAA
    - Translation start site has a methionine codon and a 12 nucleotide pattern
    - Translation stop has 1 of 3 stop codons, according to their observed frequency, and then a 3 nucleotide pattern
    - Conserved patterns at the donor and acceptor sites (splice sites) + we also get a dependency between adjacent nucleotides in these patterns –> the probability of a base at one position influences the probability of a base in the adjacent position.
43
Q

How do we model the different functional sites (promoter, exon, etc) of a gene?

A

Need a way to characterise the functional sites within the gene:

  • Promoter
  • Exon/intron splice sites

We use...

A position weight matrix (PWM) –> a score is given to each possible nucleotide at each possible position –> done by comparing our sequence to related sequences (multiple sequence alignment –> the more sequences we compare to, the more accurate)

Then, for any sequence, the scores are summed to give a score for that sequence as a potential site –> allows us to determine significance.

For all functional sites of a gene –> we build a PWM –> there is always a pattern around them (conserved + **surrounding pattern**)
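
A position weight matrix can be sketched as log-odds scores built from a set of aligned example sites. This is a minimal illustration: the pseudocount, the uniform background of 0.25 and the example sites (AATAAA plus one common variant) are assumptions chosen for the example, and the function names are made up.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds score for each base at each position of aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for position in range(len(sites[0])):
        column = [site[position] for site in sites]
        pwm.append({
            base: math.log2(
                ((column.count(base) + pseudocount) / total) / background
            )
            for base in "ACGT"
        })
    return pwm

def score(pwm, seq):
    """Sum the per-position scores for a candidate site."""
    return sum(column[base] for column, base in zip(pwm, seq))

# Illustrative training set based on the polyA signal consensus AATAAA.
pwm = build_pwm(["AATAAA", "AATAAA", "ATTAAA", "AATAAA"])
```

Sliding `score` along a genome and keeping windows above a threshold gives candidate sites; note that a plain PWM treats positions as independent, which is exactly what breaks down at splice sites.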
44
Q

Why are the surrounding patterns of the donor and acceptor splice sites difficult to predict?

A

  • The **acceptor splice site** (AG) has a consensus region from -20 to +3 (quite large) + some dependency between adjacent positions (conditional probability)
  • The **donor splice site** (GT) has a similar pattern, but the nucleotide dependencies are more complex, with dependencies between non-adjacent nucleotides
45
Q

Why does gene prediction software need all this information?

A

The more information we provide –> the better we can train the program / the more accurate the model –> results in better gene prediction.

In the end...

We use a combination of both ab initio and homology searches –> eventually combine all the information.

Once the gene has been identified it can be added to Ensembl –> aims to be a database for all sequenced genomes.

Once we have all the genes we can perform comparative studies.
46
Q

What is the Ensembl Annotation Procedure?

A
  1. Place known genes of the sample organism (e.g. human) onto the genome
  2. Place highly similar genes, e.g. from mouse (Mus), onto the genome
    - Perform homology search –> BLAST searches at the protein level –> DNA is too noisy
  3. Predict novel genes from ab initio methods backed up with supporting evidence from sequence similarity – only use ones confirmed by similarity to protein, cDNA or ESTs

Note –> for important species –> after all the computerized predictions are made –> everything gets manually checked.
47
Q

What about non-coding DNA? Is it annotated?

A

In eukaryotes up to 99% of the genome is non-coding –> 80% of the DNA has some purpose:

  • Repeats
  • Promoters
  • Cis/Trans regulatory elements (Cis control a nearby gene, Trans a distant one)
  • Enhancers
  • Etc.

Hence, the next challenge is to annotate these.
48
Q

Definition of a homolog, ortholog and paralog?

A

**Homolog** –> A gene related to a second gene by descent from a common ancestral DNA sequence.

Basically –> two genes are related and have a common ancestor

Within homologs, there are two types, which are...

  1. **Ortholog** –> Orthologs are genes in **different species** that evolved from a common ancestral gene by speciation
    - Orthologs usually retain the **same function**
    - Orthologs are essential for reliable prediction of gene function in a newly sequenced genome
    - Example - Insulin gene in many species
  2. **Paralog** –> Paralogs are genes related by duplication within the same genome (same species)
    - Unlike orthologs, paralogs often evolve **new functions**, even if these are related to the original one
    - Example –> Human serine proteases –> chymotrypsin, trypsin, elastase.
49
Q

What is unequal recombination?

A

Unequal recombination is one way gene duplication can occur.

Recombination duplications are due to unequal crossing-over that occurs during meiosis between misaligned homologous chromosomes –> happens because of repeat sequences –> results in misalignment –> two different repeats line up.

Results in duplication at the site of the exchange and a reciprocal deletion

Duplication can be beneficial or harmful

Beneficial –> Spare copy of the gene –> free to mutate

Harmful –> over-expression –> waste of resources.
50
Q

What is replication slippage?

A

Replication slippage is another way gene duplication can occur –> once again relies on a repeat (only a few bases of similarity).

A DNA replication error that can duplicate short genetic sequences –> during the replication process the polymerase dissociates from the DNA

When it re-attaches to the DNA strand it attaches to a different repeat that is close by –> results in DNA duplication.
51
Q

Can transposable elements cause gene duplication?

A

Yes! Transposable elements are repeats in the genome that jump around –> two main types, LINEs and SINEs –> LINEs can self-duplicate: a LINE copies itself (mRNA –> DNA with reverse transcriptase) and inserts the copy back into the genome –> original copy left behind

SINEs are similar but are not autonomous –> Use the LINE machinery.

Remember - LINEs and SINEs have repeats on either end

Transposable elements have been very significant in shaping the human genome

How gene duplication occurs –> Transposable elements provide a hot spot for unequal recombination events + DNA transposons provide a mechanism for gene and exon duplication (when they move they can take a gene with them).

Note –> Alu sequence, a SINE present in over 10^6 copies –> most abundant.
52
Q

What are the two classes of transposable elements?

A

Class 1 –> LINEs and SINEs –> create RNA –> reverse transcriptase –> the DNA copy is inserted somewhere else in the genome (copy and paste).

Class 2 –> DNA transposons –> don't replicate –> just move around the genome (cut and paste).
53
Q

How does exon shuffling via transposons occur?

A

Exon shuffling via transposons –> Requires the same transposon on either side of an exon.

If the transposase cleaves DNA at the left inverted repeat of the upstream transposon and the right inverted repeat of the downstream transposon, then both transposons and the exon will move together

Basically, it cleaves at one transposable element on one side of the exon and at the other transposable element on the other side.

This exon can now be inserted into another gene or not (non-coding) –> occasionally the insertion will be beneficial.

This mechanism is responsible for moving protein domains from one protein to another –> this is why we find proteins with very similar domains.
54
Q

Outline the evolution of the globin gene.

A
  1. Original ancestral gene
  2. Gene duplication
  3. Then transposition –> movement of a gene from one chromosome to another –> shows that exons, as well as entire genes, can be moved.
  4. Further duplication and mutations.
55
Q

What is the significance of repeat DNA in the human genome?

A

LINE, SINE and LTR elements comprise 37% of the rodent and 42% of the human genome –> make up a large part of our genome.

Exons of genes comprise only approximately 2% of the sequence.

Significance?

Evidence that the genome environment, including repeats, can be important for the regulation of gene expression
56
Q

Outline the role of LINE repeats in X-inactivation?

A

X-inactivation is the silencing of one of the two X chromosomes in the cells of female mammals –> Required for dosage compensation to avoid overexpression of X chromosome genes.

The inactivated X chromosome is packaged as compacted heterochromatin –> Carried out by Xist (gene), a long non-coding RNA (17 kb), which works its way along the X chromosome –> methylating –> switching it off.

LINE repeats are believed to facilitate the ability of Xist to traverse the chromosome

Evidence? –> The X chromosome has a much higher density of LINE repeats.
57
Q

What is the origin of pseudogenes?

A

Pseudogenes - ORFs that appear to be nonfunctional due to the accumulation of mutations –> 'Zombie genes'.

How many? - Estimates are of the order of 12,000

Arise in two ways:

  • Gene duplication and mutation –> gene was duplicated but the copy turned out non-functional.
  • Reverse transcription, integration and mutation –> sometimes the reverse transcriptase from LINEs reverse-transcribes another mRNA and inserts the copy into the genome –> re-inserting a spliced mRNA –> results in a copy with only exons.
58
Q

Are we able to easily categorize many genes?

A

Yes, human genes tend to be conserved –> due to gene duplication.

This allows us to group genes and deduce a common ancestor.

However, the single largest class of proteins is the unknown ones –> we do not know the functions of a lot of genes
59
Q

How can two species that are completely different have similar domain structure?

A

Exon shuffling

Domains are commonly coded for by a single exon –> homologous domains can be found between completely different species due to exon shuffling that occurred in a common ancestor.
60
Q

What is comparative genomics?

A

We have the genome of a species –> how does this compare to other genomes –> what can this tell us?

Example...

Mice and humans are believed to have diverged 85 million years ago

However...

Comparison between the mouse and human sequences shows that the sequences are homologous, not in small chunks but in large blocks

These large blocks are called syntenic blocks.

You can match these large blocks from one chromosome of one species to another chromosome of another species.
61
Q

Comparative genomics study between human and mouse? What can we take away from this?

A

Human and Mouse Synteny

About 180 homologous blocks –> ranging from 24kbp to 90.5Mbp –> average 17.5Mbp

Suggests a chromosomal basis to genome evolution...

Chromosomes are constantly being cut and rejoined, sometimes incorrectly.

How?

Incorrect end joining is promoted by the presence of repetitive sequences, which provide multiple alternative templates for homologous end joining
62
Q

Is it useful to perform genome comparison studies between individuals of the same species?

A

Yes, very useful.

Human genome comparative studies enable the identification of variability –> e.g. allow for the identification of SNPs.

For example, Genomics England, set up by the Department of Health, will sequence 100,000 genomes from NHS patients.

Investigate individuals with genetic diseases + their families, as well as cancer patients –> compare these to healthy individuals.

Allows us to identify SNPs that correlate with the occurrence of the condition.

Can be used for screening –> screen individuals –> look for SNPs –> can give us a probability that a specific individual will get a particular condition.
63
Q

What is copy number variation?

A

Copy Number Variations (CNVs) are large-scale changes in the genome –> found by comparison between individuals of the same species.

Average of 12 copy number variants per individual

  • Deleted regions - fewer than the normal number of copies
  • Duplicated regions - more than the normal number of copies

For example: ERBB2 – High copy numbers of this gene are associated with breast cancer.

Hence, when genotyping individuals we don't only look for SNPs but also CNVs.
64
Q

Difference between SNPs and SNVs?

A

If a single nucleotide change is present in at least 1% of the population –> we call it a single nucleotide polymorphism (SNP)

Whereas...

If the frequency is lower (or unknown) –> we call it a single nucleotide variant (SNV).