Duncan - variant nomenclature and analysis Flashcards
how much variation do we see in the average human genome?
compared to a reference human genome, a person’s ~6 billion-nucleotide genome sequence will have:
5,000,000 single nucleotide variants (SNVs) that involve ~5,000,000 nucleotides
600,000 insertion/deletion variants (2+ nucleotides) that involve ~2,000,000 nucleotides
25,000 structural variants (such as CNVs) that involve >20,000,000 nucleotides
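A quick back-of-the-envelope sketch of what those figures add up to, using only the approximate numbers quoted on this card (the structural variant figure is a lower bound, so the total is too). The variable names are just for illustration.

```python
# Approximate figures from the card above
GENOME_SIZE = 6_000_000_000  # ~6 billion nucleotides

nucleotides_affected = {
    "single nucleotide variants": 5_000_000,
    "insertions/deletions": 2_000_000,
    "structural variants": 20_000_000,  # card says ">20,000,000", so a lower bound
}

total_affected = sum(nucleotides_affected.values())
fraction = total_affected / GENOME_SIZE

print(f"{total_affected:,} nucleotides differ, roughly {fraction:.2%} of the genome")
```

So even with millions of variants, well under 1% of the sequence differs from the reference.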
what is the basic structure of a gene?
Start codon - ATG, codes for the amino acid methionine, sets the reading frame and initiates translation
Exons - code for the protein, contribute to the final mRNA molecule that determines the order of amino acids
Introns - non-coding, don’t contribute to the final mRNA molecule, removed by splicing
Stop codon - several options, e.g. UGA, UAG, UAA in the mRNA (TGA, TAG, TAA in the DNA)
one possible type of variant is a nonsense variant -
what is this?
how does the cell deal with it/the resultant mRNA?
is it likely to be disease causing?
Alter the amino acid code, resulting in a stop codon, so the protein ends prematurely.
These mRNAs are then targeted by NMD - nonsense-mediated decay - a quality control system that prevents production of faulty proteins (protects the cell from ‘aberrantly’ functioning proteins). The likelihood of NMD should be considered when assessing variants that result in shortened proteins
mRNA that escapes NMD may produce a protein that retains some functionality, potentially not causing disease. However, if the mRNA doesn’t escape NMD, you’re essentially losing that protein (it’s not being expressed), which is likely to be disease causing
how can you, generally, identify whether or not a nonsense mutation will result in mRNA that manages to escape NMD?
General rules to identify those aberrant mRNAs that may escape NMD:
if the DNA variant is present in last exon
if the DNA variant is located in last 50 nucleotides of the penultimate exon
(then the mRNA may escape NMD)
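The two rules above can be written as a small check. This is only a sketch of the rule of thumb as stated on this card; the function name and the way exons are represented (a list of exon lengths, a 0-based exon index and a 1-based position within that exon) are assumptions made for the example.

```python
def may_escape_nmd(exon_index, position_in_exon, exon_lengths):
    """Apply the general rule: a premature stop may escape NMD if it lies in the
    last exon, or in the last 50 nucleotides of the penultimate exon.

    exon_index       -- 0-based index of the exon containing the variant (assumed input format)
    position_in_exon -- 1-based position of the variant within that exon
    exon_lengths     -- list of exon lengths in nucleotides, in transcript order
    """
    last_exon = len(exon_lengths) - 1
    if exon_index == last_exon:
        return True
    if exon_index == last_exon - 1:
        nucleotides_after_variant = exon_lengths[exon_index] - position_in_exon
        return nucleotides_after_variant < 50
    return False


# Hypothetical four-exon gene
exons = [120, 200, 150, 90]
print(may_escape_nmd(3, 10, exons))   # last exon
print(may_escape_nmd(2, 101, exons))  # within last 50 nt of penultimate exon
print(may_escape_nmd(0, 5, exons))    # first exon
```

A True result only means the mRNA may escape NMD; it is a general rule, not a guarantee.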
name 6 types of variants, and the
consequences that can have
Stop and start variants -
Occur in the stop or start codons
In start codon: translation is not initiated, no protein product, probably disease causing
In stop codon: translation continues into the normally untranslated sequence 3’ of the stop codon, resulting in a protein with additional amino acids that are likely to interfere with structure + function and cause disease
Missense variants -
Most common, it’s just a substitution: a single nucleotide is swapped AND this changes the amino acid coded for
May or may not be pathogenic
Nonsense variants -
Alter the amino acid code, resulting in a stop codon, so the protein ends prematurely.
These mRNAs are then targeted by NMD - nonsense mediated decay (more later)
Deletion variants -
Result in a frameshift when the number of deleted nucleotides is not a multiple of 3, and are therefore almost always disease-causing
Can be just 1 nucleotide or an entire gene (entire gene = CNV)
After a frameshift the sequence codes for different amino acids than WT, so you’ve got a novel protein with new or lost functions…
But a reading frame shift often gives a new stop codon within the first 200 codons, so you get a truncated protein that may be targeted by NMD
Duplications -
Addition of nucleotides = frame shift = altered amino acid sequence = likely to be disease causing
Also often gives a premature stop codon and truncated protein. Same as deletions in terms of consequences
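To see why deletions and duplications so often end in a truncated protein, here is a minimal sketch. The coding sequence is made up for the example and the helper names are hypothetical; the only facts relied on are that codons are read in threes and that TAA, TAG and TGA are the stop codons in DNA.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def causes_frameshift(nucleotides_gained_or_lost):
    """A change in length shifts the reading frame unless it is a multiple of 3."""
    return nucleotides_gained_or_lost % 3 != 0


def first_stop_codon(cds):
    """Return the 1-based codon number of the first in-frame stop codon, or None."""
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return i // 3 + 1
    return None


wild_type = "ATGCTGACCTGTTAA"            # made-up CDS: ATG CTG ACC TGT TAA
mutant = wild_type[:3] + wild_type[4:]   # delete the 4th nucleotide -> ATG TGA CCT GTT AA

print(causes_frameshift(1))          # True
print(first_stop_codon(wild_type))   # 5
print(first_stop_codon(mutant))      # 2
```

In the mutant the frame shifts after the start codon and a stop codon appears at codon 2, well before the original one at codon 5.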
RNA splicing - what are donor and acceptor sites?
what are donor and acceptor variants like/are they likely to cause disease?
Donor = the exonic and intronic sequences flanking the 5’ end of an intron, typically GT
Acceptor = the exonic and intronic sequences flanking the 3’ end of an intron, typically AG
Donor splice site variants -
Change in donor splice site = not recognised by splice machinery so removal of the intronic DNA not initiated, it gets included in the mRNA, altering protein function and structure
Also causes a frameshift so you get an early stop codon and a truncated protein
Often disease causing
Acceptor splice site variants -
Results in exclusion of exon. Donor site is recognised and removal of intron initiated but acceptor site is never reached, so the exon is removed too
Very likely to be disease causing as the exon may encode vital parts of the protein (active sites/binding sites etc…)
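A small sketch of what a lost donor or acceptor site looks like at the sequence level. It only checks the typical GT at the start and AG at the end of an intron, as described on this card; the intron sequences and the function name are invented for the example, and real splice site recognition involves more of the flanking sequence than these two dinucleotides.

```python
def check_splice_sites(intron):
    """Report whether an intron starts with GT (donor) and ends with AG (acceptor)."""
    intron = intron.upper()
    return {
        "donor_ok": intron.startswith("GT"),
        "acceptor_ok": intron.endswith("AG"),
    }


wild_type_intron = "GTAAGTCTTTCAG"   # made-up intron sequence
donor_variant = "GCAAGTCTTTCAG"      # T>C at the +2 position
acceptor_variant = "GTAAGTCTTTCAC"   # G>C at the -1 position

for name, seq in [("WT", wild_type_intron),
                  ("donor variant", donor_variant),
                  ("acceptor variant", acceptor_variant)]:
    print(name, check_splice_sites(seq))
```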
how does the spliceosome work?
objective = removal of introns from pre mRNA
small nuclear RNA (snRNA) molecules bind to specific proteins, forming a sn-ribonucleoprotein complex (snRNP)
this combines with other snRNPs forming the spliceosome. snRNPs recognise and bind to the acceptor and donor sites, the intron is looped out and excised
are donor/acceptor splice site variants likely to be pathogenic?
Donor/acceptor site variants = 15% of recorded pathogenic variants, as they can lead to aberrant splicing, excluding exons or including intronic sequences
when naming a variant, what are the three components involved (and where does the second one come from)?
- gene name
- reference sequence (represents normal WT)
- Variant description
a human genome reference sequence is used as the WT reference. the human genome was 1st sequenced in 2003, but it had gaps and errors, so it is constantly updated. the latest version is GRCh38, ‘Genome Reference Consortium Human Build 38’
there are multiple versions of the human genome sequence.
These are combined to form consensus sequences and are updated as more data is gathered, so different references will differ slightly, which is why you must state which one you used
why must you know which gene reference sequence you are using?
As sequence data and knowledge increases, the gene DNA sequences are regularly updated.
These can include new exons, longer introns, additional nucleotides etc
Therefore for accurate variant naming you must know which gene reference sequence you are using
We normally use references supported by sequencing of the corresponding mRNA transcripts as this provides good knowledge of intron exon boundaries
in terms of the reference used in variant naming, what are the three options?
NM_xxx = based on mRNA transcripts, does not include introns
NG_xxx = genomic sequence of a gene, includes introns
NP_xxx = protein sequence based on NM_xxx sequence
how is a gene’s reference sequence annotated (as in each base is given an annotation to tell you where it is in the sequence, how does this system work)?
the c. (coding DNA) numbering
each nucleotide has a c. number, with c.1 being the A of the ATG start codon
nucleotides in the exons are then just numbered in order (c.1, c.2, c.3 etc…)
nucleotides in introns are numbered based on how far they are from the nearest coding nucleotide, so if the first exon in a gene ends at c.5, the first nucleotide of the intron is c.5+1
the last nucleotide of that intron would be c.6-1
you’d include the base too, so c.5A or c.5+3T etc…
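The numbering rules can be sketched as a tiny formatter. The function name and arguments are hypothetical; it just builds the position label from the nearest coding nucleotide and a signed offset into the intron (positive = counting on from the end of the upstream exon, negative = counting back from the start of the downstream exon).

```python
def c_position(nearest_coding, offset=0):
    """Build a c. position label.

    nearest_coding -- c. number of the nearest coding (exonic) nucleotide
    offset         -- 0 for an exonic nucleotide, +n / -n for an intronic one
    """
    if offset == 0:
        return f"c.{nearest_coding}"
    sign = "+" if offset > 0 else "-"
    return f"c.{nearest_coding}{sign}{abs(offset)}"


print(c_position(5))            # exonic nucleotide
print(c_position(5, 1))         # first nucleotide of the intron after c.5
print(c_position(6, -1))        # last nucleotide of the intron before c.6
print(c_position(5, 3) + "T")   # position plus the base found there
```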
how are amino acids labelled?
Named by position, start codon being 1, then followed by single letter or three letter code e.g. p.A23 or p.Ala23 = an alanine at position 23
what would c.11G>A mean?
say this change makes the OG glycine become an aspartic acid, how would you write this at the protein level?
Nucleotide 11 (coding nucleotide) was a G and has been changed to an A (>means change/substitution)
at protein level:
11th nucleotide is in the fourth codon (nucleotides 10-12), so it is the fourth amino acid that changes
p.Gly4Asp - same idea as above, you just don’t use the arrow (>) to indicate the change
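The step from coding nucleotide number to codon number is just integer arithmetic, since c.1 to c.3 make up codon 1, c.4 to c.6 codon 2, and so on. A minimal sketch (the function name is made up for the example):

```python
def codon_for_coding_position(c_number):
    """Return (codon number, position within the codon) for a coding nucleotide.

    Both values are 1-based: c.1 is position 1 of codon 1.
    """
    codon_number = (c_number - 1) // 3 + 1
    position_in_codon = (c_number - 1) % 3 + 1
    return codon_number, position_in_codon


codon, position = codon_for_coding_position(11)
print(f"c.11 is nucleotide {position} of codon {codon}")
```

So c.11 is the second nucleotide of codon 4, which is why the protein-level description refers to amino acid 4.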
in ‘p.(Gly4Asp)’ what do the brackets indicate?
Brackets indicate this protein change is a prediction based on the DNA sequence and has not been experimentally confirmed
for a gene we have two alleles.
how would I indicate:
- two WT alleles in terms of DNA sequence?
- two WT alleles in terms of protein/Aa sequence?
- one WT one variant?
- DNA: c.[=];[=]
the ‘=’ is for WT, the ‘[ ]’ show which allele you’re talking about, the ‘;’ separates the two alleles
- protein: p.[(=)];[(=)]
the extra bracket for the protein = only DNA sequencing has been done, so the protein result is predicted
- If there is a variant, just put its code in the brackets in place of the ‘=’. If it’s on both alleles, replace both ‘=’
if you’ve got two variants in the same gene, how do you indicate which allele each variant is on?
Can’t tell from your initial DNA sequencing, may need to look at the parents. Three situations: the two variants are on the same allele, on different alleles, or you don’t know.
Same: c.[20G>C;59T>A];[=] put them in the same square brackets because it’s the same allele, just separate the variants by a ‘;’ THIS IS KNOWN AS CIS
Different: c.[20G>C];[59T>A] easy, put each variant in different square brackets for different alleles (known as trans)
Unknown -
c.[20G>C(;)59T>A] put it all in the same square bracket, separating the two variants by (;) remember, these ( ) brackets kind of indicate ‘unconfirmed’
same rules apply to protein, don’t forget to use [( )] if only DNA sequencing has been done
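For the two cases where the phase is known (same allele or different alleles), the bracket rules can be sketched as a string builder. The function name and the list-per-allele input are assumptions for the example; an empty list stands for a wild-type allele.

```python
def format_alleles(allele_1, allele_2, prefix="c"):
    """Build an allele description such as c.[20G>C];[=].

    Each allele is a list of variant descriptions; an empty list means wild type.
    Variants on the same allele (cis) share one pair of square brackets.
    """
    def one_allele(variants):
        return "[" + (";".join(variants) if variants else "=") + "]"

    return f"{prefix}.{one_allele(allele_1)};{one_allele(allele_2)}"


print(format_alleles([], []))                    # two WT alleles
print(format_alleles(["20G>C", "59T>A"], []))    # both variants on the same allele (cis)
print(format_alleles(["20G>C"], ["59T>A"]))      # one variant on each allele (trans)
print(format_alleles(["(Val5fs)"], ["(=)"], prefix="p"))  # predicted protein level
```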
how do you describe a deletion? e.g. of the 13th and 14th coding nucleotides
This is described as c.13_14.
That it is a deletion of these nucleotides is described by the abbreviation del.
So it is described as c.13_14del
Alleles are again separated using square brackets and an equals symbol for wild type, e.g. c.[13_14del];[=]
If only one nucleotide is deleted then of course only that nucleotide is listed,
for example, c.14del
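A minimal sketch of the deletion naming rule, covering both the range form and the single-nucleotide form. The function name is made up, and the sketch does not attempt any of the finer points of choosing which position to report when the deleted base sits in a run of identical bases.

```python
def describe_deletion(first, last=None):
    """Describe a deletion of coding nucleotides first..last (inclusive).

    A single deleted nucleotide is written without a range.
    """
    if last is None or last == first:
        return f"c.{first}del"
    if last < first:
        raise ValueError("last must not be smaller than first")
    return f"c.{first}_{last}del"


print(describe_deletion(13, 14))   # deletion of the 13th and 14th coding nucleotides
print(describe_deletion(14))       # deletion of a single nucleotide
```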
if there has been a deletion, how do I write this at the protein level?
You need to indicate there has been a frameshift, so you simply put the amino acid from which the sequence changes (you don’t need to put what it changes to) followed by ‘fs’ for frameshift
E.g. if the deletion left the first four codons normal, and the first changed amino acid was amino acid 5, originally a valine, you’d put:
p.[(Val5fs)];[(=)]
for a nonsense variant, i.e. you’ve got a premature stop codon, how do you write this at DNA level vs protein level?
indicated in the DNA sequence as you would a regular substitution/single nucleotide change, c.[21T>A];[=]
At the protein level, you indicate which amino acid has changed to give a premature stop codon with a *
p.[(Gly7*)];[(=)]
Glycine at position 7 was replaced with a stop codon
what does ‘p.[(Arg12*)];[(=)]’ tell you?
arginine at position 12 is predicted to have been replaced by a stop codon, the other allele is presumed WT
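Reading a protein-level description back can be sketched with a regular expression. This is a deliberately small sketch, not a full parser: it handles one allele's description at a time (e.g. p.(Arg12*), not the whole two-allele string), only three-letter amino acid codes, and only the substitution, stop (*) and frameshift (fs) forms used on these cards. All names are hypothetical.

```python
import re

# p. + optional "(" + reference amino acid + position + new amino acid, "*" or "fs" + optional ")"
PROTEIN_CHANGE = re.compile(r"^p\.\(?([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2}|\*|fs)\)?$")


def parse_protein_change(description):
    """Split a simple p. description into its parts, or return None if it is not recognised."""
    match = PROTEIN_CHANGE.match(description)
    if match is None:
        return None
    ref, position, alt = match.groups()
    kind = {"*": "nonsense", "fs": "frameshift"}.get(alt, "missense")
    return {
        "ref": ref,
        "position": int(position),
        "alt": alt,
        "kind": kind,
        "predicted": description.startswith("p.("),  # brackets = predicted, not confirmed
    }


print(parse_protein_change("p.(Arg12*)"))
print(parse_protein_change("p.Gly4Asp"))
print(parse_protein_change("p.(?)"))
```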
are changes within splice sites and intronic variants treated the same?
what about if you’re writing at the protein level?
yes they are
just use the correct term to refer to the nucleotide position (as in with the + or -)
E.g. 1
G to C change 45 nucleotides away in the 3’ direction from the last coding nucleotide of the exon, c.548
c.[548+45G>C];[=]
at the protein level -
protein changes cannot be certain and so the nomenclature uses a ?
p.[(?)];[(=)]
define pathogenic, benign and VUS
Pathogenic - it has a negative effect on the production or function of the protein and so causes disease
Benign - the protein produced pretty much functions normally, does not cause a problem. Same as a polymorphism. These can be very common, seen in 50% of the population, or as rare as 0.0001% of the population; some benign polymorphisms may be confined to a single family
VUS - variant of uncertain significance, basically not sure if it’s a problem
when classifying a variant, the Association for Clinical Genomic Science (ACGS) guidelines are followed.
briefly describe how this system works?
Evidence on a variant is gathered and compared to the guidelines, using a points-based system. Based on how many benign or pathogenic criteria are met by the variant, it is classified from class 1 (benign) to class 5 (pathogenic)
Each criterion is given a code, beginning with P or B for whether meeting that criterion brings a variant closer to being classified as pathogenic or benign
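A rough sketch of how a points-based tally could work. Important: the point values and class thresholds below are NOT from this card. They are an assumption based on the commonly cited points adaptation of the ACMG framework (supporting = 1, moderate = 2, strong = 4, very strong = 8, with benign evidence counting negatively), so check the current ACGS guidance before relying on them. The function name and inputs are hypothetical, and stand-alone benign criteria are not handled.

```python
# ASSUMED point values per evidence strength (not stated on the card)
POINTS = {"supporting": 1, "moderate": 2, "strong": 4, "very_strong": 8}


def classify(pathogenic_evidence, benign_evidence):
    """Return (score, class) where class 5 = pathogenic ... class 1 = benign.

    Each argument is a list of evidence strengths for the criteria that were met.
    Thresholds are ASSUMED: >=10 class 5, 6..9 class 4, 0..5 class 3 (VUS),
    -6..-1 class 2, <=-7 class 1.
    """
    score = (sum(POINTS[s] for s in pathogenic_evidence)
             - sum(POINTS[s] for s in benign_evidence))
    if score >= 10:
        return score, 5
    if score >= 6:
        return score, 4
    if score >= 0:
        return score, 3
    if score >= -6:
        return score, 2
    return score, 1


print(classify(["very_strong", "moderate"], []))   # e.g. one very strong + one moderate P criterion
print(classify(["moderate"], ["supporting"]))      # conflicting weak evidence
print(classify([], ["strong", "strong"]))          # two strong B criteria
```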