Genomics - Lecture Flashcards

Question

What is the objective of genome annotation?

Answer 1

The objective of genome annotation:

1. Identify all of the possible genes and features within the genome + assign functions to as many genes as possible.

This should include...
- Variable splice sites that may mean the gene has multiple products
- Identify and describe regulatory features and their functions

As well as... Identify all other features, including repeats.

The better the picture we have --\> the better we can perform comparative studies between organisms of interest.

Answer 2

Gene Hunting By Sequence Homology

Many useful tools for gene identification are based on sequence identity --\> the assumption is that if 2 genes are (very) similar in sequence, they will encode proteins with similar structure/function.

Hence... Compare the unknown sequence to sequences of known (or guessed) function by sequence alignment methods.

Performing BLAST searches against:
- EST (expressed sequence tag) databases - an EST is a short sub-sequence of a cDNA sequence, which corresponds to mRNA.
- SwissProt database

Note --\> Even if our sequence matches a protein of unknown function --\> the existence of similarity itself is strong evidence that the sequence is protein-encoding.

**Major problem** – finding genes **without** a known homologue, so additional methods are required.
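The "similar in sequence" idea can be made concrete with a percent-identity calculation over two sequences that have already been aligned. This is only a sketch of the underlying measure: BLAST itself performs local alignment with scoring matrices and significance statistics, none of which is reproduced here. The function name `percent_identity` and the use of `-` for gaps are assumptions for illustration.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of identical positions between two pre-aligned sequences.

    Assumes both strings have the same length and use '-' for gaps;
    columns containing a gap are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


print(percent_identity("ATGC", "ATGA"))
```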

Answer 3

Only a couple of the bases in the first exon are actually coding (rest could be UTR - untranslated region), which quite often is reported as insignificant in the BLAST search --\> making it hard to identify.

Answer 4

If we can't identify the coding sequence via a BLAST search --\> we need to identify other relevant sequences --\> ribosome binding site, promoter region, -10 and -35 sequences etc. --\> in order to help us identify the gene. We are looking for the usual patterns that are found in these specific sequences (consensus - average).

Answer 5

1. Straightforward homology 2. Examining patterns of sequences associated with the gene.

Answer 6

Open Reading Frame (ORF) - The coding sequence of an mRNA is a set of in-frame triplet codons starting with AUG and ending with UAA, UAG or UGA --\> starts with a start codon and ends with a stop codon.

Note - We need to consider all 6 possible reading frames: 3 forward and 3 reverse.

On average, a stop codon occurs in 3 of the 64 possible triplets (~1 in 21) --\> hence, only “real” ORFs tend to be long (they don't have a stop codon in the middle).

ORFs in bacteria are easy to predict as there are no introns. Whereas... Eukaryotic ORFs are much harder to predict due to intron/exon structure, and some peptides are small.
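A minimal sketch of a six-frame ORF scan, assuming the input is a DNA string (so ATG/TAA/TAG/TGA stand in for AUG/UAA/UAG/UGA) containing only A, C, G and T. The function names and the `min_len` parameter are illustrative choices, not part of any particular gene finder.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_len=0):
    """Yield (start, end) for ATG...stop ORFs in one reading frame.

    `end` is exclusive and includes the stop codon.
    """
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            if i + 3 - start >= min_len:
                yield (start, i + 3)
            start = None


def six_frame_orfs(seq, min_len=0):
    """Scan 3 forward and 3 reverse frames; return (strand, frame, orf)."""
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_len):
                results.append((strand, frame, s[start:end]))
    return results


print(six_frame_orfs("CCATGAAATAGCC"))
```

In practice `min_len` would be set to something like the 300bp cut-off used by bacterial gene finders; it defaults to 0 here so that short toy sequences produce output.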

Answer 7

Gene prediction in bacteria is easier than in eukaryotes:
- No introns
- Smaller intergenic regions (microbial genomes tend to be gene rich, with 80%-90% of the sequence coding, whereas in eukaryotes genes tend to be further apart).

Answer 8

Bacterial gene finders often just look for the largest open reading frames (ORFs), those above a certain size (approx. 300bp), and consider them to be real genes. This is a reasonable assumption for genomes with a low GC content. However... It is a problem for those with a high GC content: with a lack of A and T in these genomes, there are far fewer stop codons. Hence... Long ORFs occur simply by chance in high GC genomes, many of which are not genuine genes. But... Low GC content --\> high accuracy.
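The effect of GC content on chance ORFs can be sketched with a simple model. It assumes bases are independent, with P(G) = P(C) = GC/2 and P(A) = P(T) = (1 - GC)/2, which is a simplification of real genomes. Under that assumption the three stop codons (TAA, TAG, TGA) become rarer as GC rises, so long stop-free runs become more likely by chance.

```python
def stop_codon_probability(gc):
    """P(a random triplet is TAA, TAG or TGA) under the independence model."""
    at = (1 - gc) / 2  # P(A) = P(T)
    g = gc / 2         # P(G) = P(C)
    # TAA contributes at**3; TAG and TGA each contribute at * at * g
    return at ** 3 + 2 * at * at * g


def chance_open_run(gc, n_codons):
    """P(n_codons consecutive random codons contain no stop codon)."""
    return (1 - stop_codon_probability(gc)) ** n_codons


for gc in (0.3, 0.5, 0.7):
    # 100 codons corresponds to the approx. 300bp cut-off mentioned above
    print(gc, stop_codon_probability(gc), chance_open_run(gc, 100))
```

At GC = 0.5 the model reproduces the 3/64 figure quoted for stop codon frequency.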

Answer 9

We can use the longest ORF but eventually, it becomes challenging when the GC content increases --\> need to use additional information (different parts of the gene --\> promoters, etc.)

Answer 10

In every organism, there are 61 sense codons but only 20 amino acids --\> results in redundancy --\> Arg has 6 codons --\> inefficient to use all 6. Wobble base pairing allows one tRNA to bind several codons --\> this means that tRNAs prefer some codons to others. So organisms have a bias towards specific codons --\> higher frequency (e.g. Drosophila and E. coli). Note - Bias can also be used to moderate gene expression --\> to reduce gene expression --\> increase the number of different codons used for a specific amino acid --\> slows down translation --\> as a different tRNA would have to bind each time. Hence, examining the codon bias of a candidate sequence --\> gauge how likely it is to be a gene.
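Codon bias can be measured by counting how often each synonymous codon is used. The sketch below does this for the six arginine codons in a DNA coding sequence that is assumed to be in frame. The function names and the toy sequence are made up for illustration; a real analysis would compare the result against a usage table for the species in question.

```python
from collections import Counter

# The six arginine codons (DNA alphabet)
ARG_CODONS = ("CGT", "CGC", "CGA", "CGG", "AGA", "AGG")


def codon_counts(cds):
    """Count in-frame codons in a coding sequence (trailing bases ignored)."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def relative_usage(cds, synonymous=ARG_CODONS):
    """Fraction of each synonymous codon among all uses of that amino acid."""
    counts = codon_counts(cds)
    total = sum(counts[c] for c in synonymous)
    if total == 0:
        return {c: 0.0 for c in synonymous}
    return {c: counts[c] / total for c in synonymous}


# Toy sequence: ATG CGT CGT CGT AGA TAA
print(relative_usage("ATGCGTCGTCGTAGATAA"))
```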

Answer 11

1. Homology (BLAST) search against a good quality database (UniProt-SwissProt, RefSeqP)
2. Identify ORFs – those above 300bp
3. Use codon bias information for the species in question. Programs use Markov models to detect compositional biases and increase the reliability of gene detection
4. Identify other gene-related structural features:
   - Promoter
   - -35 and -10 (Pribnow box) regions
   - Shine-Dalgarno sequence

Note - 2-4 are ab initio gene prediction methods

Answer 12

- Looking at the raw DNA sequence and looking for patterns --\> not using homology.

Answer 13

Far greater problem: 1. Multi-exon genes – introns 2. Multiple transcripts from a single gene depending on splicing. 3. Large intergenic regions --\> stretch of DNA between genes. However, there are a lot of different patterns we can use.

Answer 14

As with prokaryotes, search for homology – BLAST search against a good quality protein database – UniProt-SwissProt, RefSeqP.

Furthermore, look for identity to cDNA sequences from either:
- cDNA library sequence (e.g. high throughput) --\> identify exons --\> compare to DNA
- EST library sequence - ESTs are paired-end reads of cDNA clones and so define the 5’ and 3’ ends of mRNAs / transcripts

Problems:
- Novel genes
- No information about variable transcripts --\> RNAseq data.

Answer 15

Ab initio Gene Prediction --\> Looking for patterns Many species have unique genes with limited or no representation in the sequence databases + Non-coding genes will not be in the protein databases. - Ab initio gene prediction uses mathematical models and computer programming to predict gene locations based on sequence alone Some prediction programs now have near 100% sensitivity --\> However, as the sensitivity increases the overall accuracy falls due to increased false positives (predictions of genes that aren't correct) But we can utilise a combination of ab initio and homology predictions --\> in order to overcome this --\> extra confidence.
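The trade-off between sensitivity and false positives can be illustrated with two standard measures. The counts below are invented purely for illustration, and "precision" is used here as one way to express the fall in overall accuracy caused by false positives.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real genes that the predictor finds."""
    return true_pos / (true_pos + false_neg)


def precision(true_pos, false_pos):
    """Fraction of predicted genes that are real."""
    return true_pos / (true_pos + false_pos)


# Hypothetical predictor: finds 95 of 100 real genes
# but also calls 60 genes that are not real.
print(sensitivity(95, 5))
print(precision(95, 60))
```

Requiring homology support for ab initio predictions is one way to remove false positives and raise precision.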

Answer 16

Pattern recognition Genes as a Series of Binding Sites --\> gene expression involves a series of protein / nucleic acid interactions --\> interactions can be seen in the DNA sequence as sequences related to a consensus binding site sequence (comparing to an average). The problem --\> binding sequence can vary --\> not always the same. Computers can search for these binding site sequences in “raw” genome sequence and guess the functional elements of the genome. Look for the correct spatial organisation of these elements and it’s likely to be a gene

Answer 17

Patterns differ between species but there is a degree of conservation within a species. Programs such as Genscan are trained using sets of known genes --\> same species or the most closely related available. The training set acts as a model for intron/exon lengths, splice sites, GC content etc. The program uses these models to predict genes. Basically, we examine other known genes of the same (or a related) species to create a general model (the program training itself), which will help us identify genes.

Answer 18

The program teaches itself to look for these relationships/patterns.

1. **GC content**:
   - Human approx 38% GC but, as with all species, this varies widely within the genome
   - Regions of high GC content (62-68%) have higher relative gene density than regions of lower GC content --\> region where genes are located
   - Exon length is relatively uniform with respect to GC content --\> relationship --\> informative.
   - Intron length decreases dramatically in regions of high GC content
2. **Patterns:**
   - PolyA tail region has a specific sequence pattern with consensus AATAAA
   - Translation start site has methionine and a 12 nucleotide pattern
   - Translation stop has 1 of 3 stop codons according to the observed frequency and then a 3 nucleotide pattern
   - Conserved patterns at the donor and acceptor sites (splice sites) + we also get a dependency between adjacent nucleotides in these patterns --\> the probability of a base at one position influences the probability of a base in the adjacent position.
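Because GC content varies along the genome, it is usually measured in windows rather than as a single number. A minimal sketch, with the window and step sizes chosen arbitrarily for the toy sequence:

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def sliding_gc(seq, window, step):
    """Return (start, gc_fraction) for each full window along the sequence."""
    return [(i, gc_content(seq[i:i + window]))
            for i in range(0, len(seq) - window + 1, step)]


print(sliding_gc("AAAAGGGG", window=4, step=4))
```

Windows with unusually high GC would then be candidates for gene-dense regions, in line with the first set of bullets above.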

Answer 19

Need a way to characterise the functional sites within the gene:
- Promoter
- Exon/intron splice sites

We use... A position weight matrix (PWM) – a score is given to each possible nucleotide at each possible position --\> done by comparing our sequence to related sequences (multiple sequence alignment --\> the more sequences we compare, the more accurate the matrix). Then, for any sequence, the scores are summed to give a score for that sequence as a potential site --\> allows us to determine significance. For all functional sites of a gene --\> we build a PWM --\> there is always a pattern around them (conserved + **surrounding pattern**).
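A minimal PWM sketch: build log-odds scores from a handful of aligned example sites, then sum the per-position scores to rate a candidate sequence. The example sites are invented variants of the TATAAT (-10) consensus, and the pseudocount and uniform background of 0.25 are assumptions made for illustration.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (list of per-position dicts) from aligned sites."""
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total)
                            / background)
            for base in BASES
        })
    return pwm


def score(pwm, seq):
    """Sum the per-position scores for a sequence of the PWM's width."""
    return sum(column[base] for column, base in zip(pwm, seq))


def best_hit(pwm, seq):
    """Return the start index of the highest-scoring window in seq."""
    width = len(pwm)
    return max(range(len(seq) - width + 1),
               key=lambda i: score(pwm, seq[i:i + width]))


# Hypothetical training sites, loosely based on the -10 consensus
sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
pwm = build_pwm(sites)
print(score(pwm, "TATAAT"), score(pwm, "GGGCCC"))
print(best_hit(pwm, "GGGCTATAATGC"))
```

A real tool would also compare each score against a threshold or background distribution to decide whether a hit is significant.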

Answer 20

- The **acceptor splice site** (AG) has consensus region from -20 to +3 (quite large) + some dependency between adjacent positions (Conditional probability) - **Donor splice site** (GT) has a similar pattern but nucleotide dependencies are more complex with dependencies between non-adjacent nucleotides
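A dependency between adjacent positions can be expressed as a conditional probability estimated from aligned example sites. The sketch below estimates P(base at position i+1 | base at position i) by simple counting; the two-base "sites" are toy data, and real splice-site models use many more sequences and also handle non-adjacent dependencies.

```python
from collections import Counter, defaultdict


def adjacent_conditionals(sites, pos):
    """Estimate P(base at pos+1 | base at pos) from aligned sites."""
    pair_counts = defaultdict(Counter)
    for site in sites:
        pair_counts[site[pos]][site[pos + 1]] += 1
    return {
        prev: {nxt: n / sum(counts.values()) for nxt, n in counts.items()}
        for prev, counts in pair_counts.items()
    }


# Toy aligned sites (hypothetical)
print(adjacent_conditionals(["AG", "AG", "AT", "CG"], pos=0))
```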

Answer 21

The more information we provide --\> the better we can train the program / the more accurate the model --\> results in better gene prediction. In the end... We use a combination of both ab initio and homology searches --\> eventually combine all the information. Once the gene has been identified it can be added to Ensembl --\> aims to be a database for all sequenced genomes. Once we have all the genes, we can perform comparative studies.

Answer 22

1. Place known sample organism (e.g. human) genes onto the genome
2. Place highly similar genes, e.g. mouse (Mus), on the genome
   - Perform homology search --\> BLAST searches on the protein level --\> DNA is too noisy
3. Predict novel genes from ab initio methods backed up with supporting evidence from sequence similarity – only use ones confirmed by similarity to protein, cDNA or ESTs

Note --\> for important species --\> after all the computerized predictions are made --\> everything gets manually checked.

Answer 23

In eukaryotes up to 99% of the genome is non-coding --\> 80% of the DNA has some purpose: - Repeats - Promoters - Cis/Trans Regulatory elements (Cis control nearby gene, Trans distant) - Enhancers Etc.... Hence, the next challenge is to annotate these.

Answer 24

**Homolog** --\> A gene related to a second gene by descent from a common ancestral DNA sequence. Basically --\> two genes are related and have a common ancestor.

Within homology, there are two types...

1. **Ortholog** --\> Orthologs are genes in **different species** that evolved from a common ancestral gene by speciation.
   - Orthologs usually retain the **same function**
   - Orthologs are essential for reliable prediction of gene function in a newly sequenced genome
   - Example - Insulin gene in many species
2. **Paralog** --\> Paralogs are genes related by duplication within the same genome (same species).
   - Unlike orthologs, paralogs evolve **new functions**, even if these are related to the original one
   - Example --\> Human serine proteases --\> chymotrypsin, trypsin, elastase.

Answer 25

Unequal recombination is one way gene duplication can occur. Recombination duplications are due to unequal crossing-over that occurs during meiosis between misaligned homologous chromosomes --\> happens because of repeat sequences --\> results in misalignment --\> two different repeats line up. Results in a duplication at the site of the exchange and a reciprocal deletion. Duplication can be beneficial or harmful: Beneficial --\> spare copy of the gene --\> free to mutate. Harmful --\> over-expression --\> waste of resources.

Answer 26

Replication slippage is another way gene duplication can occur --\> once again relies on a repeat (only a few bases of similarity). A DNA replication error that can duplicate short genetic sequences --\> during the replication process the polymerase dissociates from the DNA. When it re-attaches to the DNA strand, it attaches to a different repeat that is close by --\> results in DNA duplication.

Answer 27

Yes! Transposable elements are repeats in the genome that jump around --\> two main types, LINES and SINES --\> LINES can self-duplicate: a LINE copies itself (mRNA -\> DNA with reverse transcriptase) and inserts the copy back into the genome --\> original copy left behind. SINES are similar but are not autonomous --\> they use the LINE machinery. Remember - LINES and SINES have repeats on either end. Transposable elements have been very significant in shaping the human genome. How gene duplication occurs --\> Transposable elements provide a hot spot for unequal recombination events + DNA transposons provide a mechanism for gene and exon duplication (when they move they can take a gene with them). Note --\> Alu sequence, a SINE present in over 10⁶ copies --\> most abundant.

Answer 28

Class 1 --\> LINES and SINES --\> create RNA --\> reverse transcriptase --\> insert themselves somewhere else via homologous recombination. Class 2 --\> Don't replicate --\> just move around the genome --\>

Answer 29

Exon Shuffling via Transposons --\> Requires the same transposon either side of an exon. If the transposase cleaves DNA at the left inverted repeat of the upstream transposon and right inverted repeat of the downstream transposon then both transposons and the exon will move together Basically cleaves at one transposable element at one side and at the other transposable element on the other side of the exon. This exon can now be inserted into another gene or not (non-coding) --\> occasionally the insertion will be beneficial. This mechanism is responsible for moving around protein domains from one protein to another --\> this is why we find proteins with very similar domains.

Answer 30

1. Original ancestral gene 2. Gene duplication 3. Then transposition --\> movement of gene from one chromosome to another Shows that exons, as well as entire genes, can be moved. 4. Further duplication and mutations.

Answer 31

LINE, SINE and LTR elements comprise 37% of the rodent and 42% of the human genome --\> make up a large part of our genome. Exons of genes comprise only approximately 2% of the sequence. Significance? Evidence that the genome environment, including repeats, can be important for the regulation of gene expression

Answer 32

X-inactivation is the silencing of one of the X chromosomes in all female mammals --\> Required for dosage compensation to avoid overexpression of X chromosome genes. The inactivated X chromosome is packaged as compacted heterochromatin --\> Carried out by Xist (gene), a long non-coding RNA (17 kb) that works its way around the X-chromosome --\> methylating --\> switching it off. LINE repeats are believed to facilitate the ability of Xist to traverse the chromosome. Evidence? --\> The X-chromosome has a much higher density of LINE repeats.

Answer 33

Pseudogenes - ORFs that appear to be nonfunctional due to the accumulation of mutations --\> 'Zombie genes'. How many? - Estimates of the order of 12,000 Arise in two ways: - Gene duplication and mutation --\> gene was duplicated but turned out non-functional. - Reverse transcription, integration and mutation --\> sometimes the reverse transcriptase for LINES transcribes another mRNA and inserts it into the genome --\> re-inserting a spliced mRNA --\> results in only exons.

Answer 34

Yes, human genes tend to be conservative --\> due to gene duplication. This allows us to group genes and deduce a common ancestor. However, the single largest class of proteins is the unknown ones --\> we do not know the functions of a lot of genes.

Answer 35

Exon shuffling Domains are commonly coded for by a single exon --\> homologous domains can be found between completely different species due to exon shuffling that occurred in a common ancestor.

Answer 36

We have a genome of a species --\> how does this compare to other genomes --\> what can this tell us? Example... Mice and humans were believed to have diverged 85 million yrs ago However... Comparison between the mouse and human sequences shows that the sequences are homologous, not in small chunks but in large blocks Large blocks called syntenic blocks. You can match these large blocks from one chromosome of one species to another chromosome of another species.

Answer 37

Human and Mouse Synteny About 180 homologous blocks --\> Ranging From 24kbp to 90.5Mbp --\> average 17.5Mbp Suggests a chromosomal basis to genome evolution... Chromosomes constantly being cut and rejoined, sometimes incorrectly. How? Incorrect end joining will be promoted by the presence of repetitive sequences providing multiple alternative templates for homologous end joining

Answer 38

Yes, very useful. Human genome comparative studies enable the identification of variability --\> e.g. allow for the identification of SNPs. For example, Genomics England, set up by the Department of Health, will sequence 100,000 genomes from NHS patients. Investigate individuals with genetic diseases + their families, as well as cancer patients, and compare these to healthy individuals. This allows us to identify SNPs that correlate with the occurrence of the condition. Can be used for screening --\> screen individuals --\> look for SNPs --\> can give us a probability that this specific individual will get a particular condition.

Answer 39

Copy Number Variations (CNVs) are large scale changes in the genome --\> found by comparison between individuals of the same species. Average of 12 copy number variants per individual. Deleted regions - fewer than the normal number of copies. Duplicated regions - more than the normal number of copies. For example: ERBB2 – high copy numbers of this gene are associated with breast cancer. Hence, when genotyping individuals we don't only look for SNPs but also CNVs.

Answer 40

If a single nucleotide change is present in at least 1% of the population --\> SNP (single nucleotide polymorphism). Whereas... If the frequency is lower --\> we call it a single nucleotide variant.

(64 cards)