10 Standardisation - Genomes, Genes and Nomenclature Flashcards
(80 cards)
What was the specific aim of the human genome project?
To sequence the euchromatic human genome: this is the lightly packed chromatin that is enriched in genes. It’s 92% of the genome.
How much of the genome is considered protein coding?
25%
How much of the genome is actually protein coding because it’s in exons?
1%
What surprised people about the content of the genome?
There’s more segmentally duplicated DNA than expected
What makes up 15% of the genome?
Short interspersed nuclear elements (SINEs), primarily Alu elements
Who contributed DNA to the human genome project?
13 or 30 people… From Buffalo NY. 66% is from one male donor
In 2001 how much of the genome had been sequenced? What was the error rate? And how many gaps were there?
90% coverage, high error rate of 1 in 1000, and there were 15,000 gaps in the euchromatic genome
How much of the genome had been sequenced in 2003 with reference hg16 (NCBI34), what was the error rate, and how many gaps were there?
99% coverage, 1 in 10,000 error rate, with 400 gaps in the euchromatic genome
In 2009 GRCh37 was published. How many gaps were there, and how many genes still had sequencing errors?
300 gaps in euchromatic genome in build 37, and 550 genes with sequencing errors
In 2013 what was published?
GRCh38
How many gaps were in GRCh38?
as few as 89 gaps
When was the Telomere to Telomere (T2T) genome published?
2022
Why is it hard to transition from GRCh38 to T2T?
Because the MANE project is still based on GRCh38. MANE (Matched Annotation from NCBI and EMBL-EBI) is trying to bring about consensus on a defined set of transcripts representative of the expression of the whole genome.
What are patches?
Additional sequences with their own identifier, adding info to the reference genome without disrupting it
What are the two types of genome patch?
Fix patches and Novel patches
What is a fix patch?
They correct gaps or sequence errors
Why are fix patches needed?
Because changing the reference genome by incorporating the patch would change downstream position numbers, so it can’t be done directly. A fix patch adds info without altering the number of bases.
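The coordinate problem can be shown with a toy example. This is only a sketch: the 10-base sequence, the position and the patch identifier `TOY_FIX_PATCH_1` are all made up, and real patches are distributed as separate scaffolds rather than a Python dictionary.

```python
def insert_bases(reference: str, position: int, new_bases: str) -> str:
    """Return a copy of reference with new_bases inserted before the 1-based position."""
    idx = position - 1
    return reference[:idx] + new_bases + reference[idx:]


reference = "ACGTACGTAC"   # toy chromosome, 10 bases (not real sequence)
variant_pos = 8            # a previously reported 1-based position
original_base = reference[variant_pos - 1]

# Option 1: edit the reference directly by inserting two bases at position 4.
# Everything downstream shifts, so position 8 now points at a different base.
edited = insert_bases(reference, 4, "GG")
base_after_edit = edited[variant_pos - 1]

# Option 2: leave the reference untouched and keep the correction as a
# separately identified sequence (hypothetical identifier and content).
patches = {"TOY_FIX_PATCH_1": "ACGGGTAC"}
base_with_patch = reference[variant_pos - 1]

print(original_base, base_after_edit, base_with_patch)
```

With the direct edit, the reported position no longer refers to the same base. With the patch kept alongside, existing coordinates stay valid.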
What are novel patches?
They are designed to provide an alternate structure for a chromosomal region, such as for CYP2D6 duplication
What happens to fix patches and novel patches when a new genome is released?
Fix patches are incorporated, but novel patch scaffolds remain as Alt loci, representing variation
What patches does GRCh38 have?
Patches covering some missing exons (e.g. SHANK3), some missing genes, and indel errors in a few hundred genes such as ABO, a few of which are clinically relevant
Where can the reference genome be downloaded from?
UCSC, EBI (Ensembl), or NCBI (RefSeq)
Why is NCBI RefSeq recommended for looking at the reference genome?
It has a more conventional numbering system that is used in analyses e.g. Chr1.
Files are in an unambiguous format.
IDs are uniquely labelled with an accession and version number.
What can you do if you have a reference position in UCSC or Ensembl format and you want it in RefSeq format?
The RefSeq site has a table for rapid conversion of IDs.
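A minimal sketch of what such a conversion looks like in practice, assuming GRCh38. The three rows are a hand-written excerpt, not the full table, and the function names `to_refseq` and `split_accession` are hypothetical helpers. A real workflow would load the complete mapping from the assembly's published table instead of hard-coding it.

```python
# (UCSC name, Ensembl name, RefSeq accession.version) for GRCh38.
# Hand-written excerpt for illustration only.
GRCH38_NAMES = [
    ("chr1", "1", "NC_000001.11"),
    ("chr2", "2", "NC_000002.12"),
    ("chrX", "X", "NC_000023.11"),
]

# Build one lookup that accepts either the UCSC or the Ensembl style name.
_LOOKUP = {}
for ucsc, ensembl, refseq in GRCH38_NAMES:
    _LOOKUP[ucsc] = refseq
    _LOOKUP[ensembl] = refseq


def to_refseq(name: str) -> str:
    """Translate a UCSC or Ensembl chromosome name to its RefSeq identifier."""
    return _LOOKUP[name]


def split_accession(identifier: str) -> tuple[str, int]:
    """Split 'NC_000001.11' into the accession and its integer version."""
    accession, version = identifier.rsplit(".", 1)
    return accession, int(version)


print(to_refseq("chr1"), split_accession(to_refseq("X")))
```

The version suffix is what makes the identifier unambiguous: the same accession with a different version number refers to a different sequence.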
What makes a Transcript Reference Sequence Record different from a Genomic Reference Sequence Record?
A Transcript Reference Sequence Record needs to contain functional information, e.g. exon boundaries, transcription start and stop sites etc.
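The difference can be sketched as two data structures. This is an illustrative assumption, not the layout of any real record format: the class names, field names, identifiers and toy sequence are all hypothetical, and coordinates are taken to be 1-based and inclusive.

```python
from dataclasses import dataclass, field


@dataclass
class GenomicRecord:
    """Sequence plus a versioned identifier; no functional annotation required."""
    identifier: str
    sequence: str


@dataclass
class TranscriptRecord:
    """Sequence plus the functional information needed to interpret it."""
    identifier: str
    sequence: str
    tx_start: int                               # transcription start (1-based)
    tx_end: int                                 # transcription stop (1-based, inclusive)
    exons: list = field(default_factory=list)   # (start, end) pairs, 1-based inclusive

    def spliced_length(self) -> int:
        """Total number of bases covered by the exons."""
        return sum(end - start + 1 for start, end in self.exons)

    def spliced_sequence(self) -> str:
        """Concatenate the exon sequences in order."""
        return "".join(self.sequence[start - 1:end] for start, end in self.exons)


toy = "ACGTACGTACGTACGTACGT"                    # made-up 20-base sequence
genomic = GenomicRecord("TOY_GENOMIC.1", toy)
transcript = TranscriptRecord("TOY_TX.1", toy, tx_start=3, tx_end=15,
                              exons=[(3, 6), (11, 15)])

print(transcript.spliced_length(), transcript.spliced_sequence())
```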