MBB 267 Week 2: RMC 4 Flashcards

1
Q

How can scoring systems be used to align sequences?

A

Align the sequences and compare the amino acids at each position. A scoring system such as the PAM70 matrix assigns a number to every pair of aligned amino acids, here ranging from -10 up to 5. Identical or similar amino acids count as a match and receive a positive score. Mismatches, gap openings and gap extensions receive negative scores, and biochemically similar substitutions are penalised less than dissimilar ones. The alignment with the highest total score is preferred.
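
As a rough illustration of how such a score is added up, the sketch below scores an alignment that has already been made. The matrix values and gap penalties are made-up toy numbers, not the real PAM70 matrix, and the function names are hypothetical.

```python
# Toy substitution scores (hypothetical values, NOT the real PAM70 matrix).
TOY_SCORES = {
    ("L", "I"): 2,    # biochemically similar pair: small positive score
    ("L", "D"): -4,   # dissimilar pair: negative score
}
GAP_OPEN = -10    # assumed penalty for starting a gap
GAP_EXTEND = -1   # assumed penalty for each further gap position


def pair_score(a, b):
    """Score one aligned pair of amino acids using the toy table."""
    if (a, b) in TOY_SCORES:
        return TOY_SCORES[(a, b)]
    if (b, a) in TOY_SCORES:
        return TOY_SCORES[(b, a)]
    # Fallback for pairs not in the toy table: identical residues score 5,
    # any other mismatch scores -4.
    return 5 if a == b else -4


def score_alignment(seq1, seq2):
    """Sum the scores of two already-aligned sequences ('-' marks a gap)."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    in_gap = False
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            # Simplification: a run of gap columns counts as one gap,
            # regardless of which sequence the gap is in.
            total += GAP_EXTEND if in_gap else GAP_OPEN
            in_gap = True
        else:
            total += pair_score(a, b)
            in_gap = False
    return total


print(score_alignment("LLD", "LID"))      # 5 + 2 + 5 = 12
print(score_alignment("LL--D", "LLKKD"))  # 5 + 5 - 10 - 1 + 5 = 4
```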

2
Q

What are local and global alignments?

A

Types of alignments:
Global: assumes that the two sequences are equivalent across their whole length, so they are aligned from end to end.
Local: searches for subsequences of the full-length sequences that maximise the alignment score.
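
The difference can be sketched with one dynamic-programming function that switches between the two modes. This is a minimal sketch, assuming simple match/mismatch/gap scores (1, -1, -2) chosen only for illustration; the global mode follows the Needleman-Wunsch scheme and the local mode the Smith-Waterman scheme.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Best alignment score of a and b (global by default, local if local=True)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: leading gaps are penalised, so both sequences are
        # aligned across their whole length.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local: a score never drops below zero, so an alignment
                # can start and stop anywhere inside the sequences.
                cell = max(cell, 0)
            score[i][j] = cell
            best = max(best, cell)
    return best if local else score[-1][-1]


seq1, seq2 = "GGGACGTCCC", "TTACGTAA"
print(alignment_score(seq1, seq2, local=True))  # 4, from the shared "ACGT"
print(alignment_score(seq1, seq2))              # lower, the flanks do not match
```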

3
Q

What is BLAST?

A

BLAST stands for “Basic Local Alignment Search Tool”. BLAST is effectively a “Google for sequences”, and performs a local alignment search to allow us to rapidly identify potential homologues from an online database.

4
Q

What are the key features of a BLAST report?

A

Successful matches are called “hits” or “subject sequences”.

5
Q

Describe the Overlap-Layout-Consensus (OLC) approach to genome assembly.

A

The “Overlap-Layout-Consensus” method looks for overlaps between adjacent reads.
Overlapping reads are combined, and the consensus sequence is determined.
This would work well if genomes were non-repetitive and sequencing was error-free.
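
The overlap step can be sketched as a suffix-prefix comparison between two reads. This is a simplified illustration that assumes error-free reads and exact overlaps; the function names and the `min_length` parameter are hypothetical.

```python
def overlap_length(left, right, min_length=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    longest_possible = min(len(left), len(right))
    for length in range(longest_possible, min_length - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0  # no overlap of at least min_length bases


def merge_reads(left, right, min_length=3):
    """Combine two overlapping reads, or return None if they do not overlap."""
    length = overlap_length(left, right, min_length)
    if length == 0:
        return None
    return left + right[length:]


print(merge_reads("ATGCGT", "CGTTAC"))  # ATGCGTTAC (overlap "CGT")
```

If the shared bases come from a repeat rather than from a true neighbouring position, the same function would still merge the reads, which is where this approach can go wrong.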

6
Q

Why can the Overlap-Layout-Consensus approach lead to errors?

A

Because reads that overlap repeats can be joined by mistake. This gives an incorrect final assembly, and the sequence in between the repeats may be lost.

7
Q

Why is a de Bruijn graph a better alternative to OLC?

A

The de Bruijn graph is a repeat-aware approach to assembly that minimises errors in the assembled genome:
Kmers containing errors are present much less frequently in the dataset than real kmers that are part of the genome.
This is because the genome is sequenced many times. If it is sequenced 30 times, each base is covered by about 30 reads, so an error will typically appear once while the correct base appears in the other 29 reads.
Read pairs can provide information that spans repeat sequences, helping to resolve the order of the contigs and close the assembly.
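
The frequency argument can be sketched by counting kmers across all reads and flagging the rare ones. This is a toy illustration; the `min_count` threshold is an assumed value, and real assemblers choose it from the coverage.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how often each kmer of length k occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def likely_error_kmers(counts, min_count=2):
    """Kmers seen fewer than min_count times are treated as probable errors."""
    return {kmer for kmer, n in counts.items() if n < min_count}


# Five correct copies of a read plus one copy with a single-base error (T -> A).
reads = ["ACGTAC"] * 5 + ["ACGAAC"]
counts = count_kmers(reads, k=4)
print(sorted(likely_error_kmers(counts)))  # ['ACGA', 'CGAA', 'GAAC']
```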

8
Q

How does de Bruijn graph assembly work?

A

How:
The sequence is broken down into kmers of a fixed length.
We take one kmer and look for the adjacent kmer (here, adjacent means overlapping by all but one base). Each kmer is included only once, which helps us to identify repeats.
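
A minimal sketch of these two steps, assuming error-free reads and using kmers as the nodes of the graph as described above. The function name is hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each unique kmer to the kmers that overlap it by all but one base."""
    # Step 1: break the reads into kmers; a set keeps each kmer only once.
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])

    # Step 2: group kmers by their first k-1 bases, so that the kmers
    # following a given kmer can be looked up from its last k-1 bases.
    by_prefix = defaultdict(set)
    for kmer in kmers:
        by_prefix[kmer[:-1]].add(kmer)

    return {kmer: sorted(by_prefix.get(kmer[1:], ())) for kmer in sorted(kmers)}


print(de_bruijn_graph(["ATGGC", "GGCGT"], k=3))
# {'ATG': ['TGG'], 'CGT': [], 'GCG': ['CGT'], 'GGC': ['GCG'], 'TGG': ['GGC']}

print(de_bruijn_graph(["ATATAT"], k=3))
# {'ATA': ['TAT'], 'TAT': ['ATA']}
```

In the second example the repeated sequence collapses into two kmers that point at each other, so the repeat shows up as a cycle in the graph.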

9
Q

What do the circles tell us in a genomic de Bruijn graph?

A

They indicate a repeat in the genome sequence.

10
Q

Why is it helpful to resequence a genome?

A

The reasons:
Resequencing allows us to understand genetic variation within a population, as individuals of a species are not all identical.
For human populations, resequencing the genome helps us study single gene and complex genetic disorders.
Sequencing allows us to understand the genetic changes in cancer as it progresses.
Many functional genomics technologies also rely on resequencing.

11
Q

What is mapping?

A

This is the process of taking many sequence reads obtained from biological samples, and determining where in the reference genome they are likely to be derived from.

12
Q

What is the process of mapping?

A

The process:
Matches to a kmer from our read are identified using an index of all of the kmers in the genome.
These matches give seed alignments: matches from the start of the read that position it along the genome.
The seed alignments are extended using the rest of the read.
Extending alignments in this way does not require exact matches, but it is computationally slow.
By using an index, we reduce the number of alignments we need to perform to a minimum, which greatly reduces the computational cost.
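
The seed-and-extend idea can be sketched as follows. This is a simplified illustration: the extension step only counts mismatches without allowing gaps, and the function names and the `max_mismatches` parameter are hypothetical.

```python
from collections import defaultdict


def build_index(genome, k):
    """Index every kmer in the genome by the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def map_read(read, genome, index, k, max_mismatches=1):
    """Return (position, mismatches) for each acceptable placement of the read."""
    hits = []
    # Seed: look up the first k bases of the read in the index.
    for start in index.get(read[:k], []):
        window = genome[start:start + len(read)]
        if len(window) < len(read):
            continue  # the read would run off the end of the genome
        # Extend: compare the whole read with the genome at this position.
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


genome = "TTACGGATCCAGTA"
index = build_index(genome, k=4)
print(map_read("GGATCCAG", genome, index, k=4))  # [(4, 0)]
print(map_read("GGATCTAG", genome, index, k=4))  # [(4, 1)]
```

Only the positions returned by the index are compared in full, so most of the genome is never aligned against the read.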

13
Q

What happens if our read maps with the same alignment score in multiple places, such as within a repeat sequence?

A

Read pairs can be used (as in de novo assembly):
The two reads of a pair are mapped independently, but if one maps to a unique position and the other maps ambiguously, the uniquely mapped read can be used to determine which of the possibilities for its pair is the most likely.
If both reads are within a repeat, there is nothing we can do. Usually, one of the possible mapping locations will be selected at random.
For some applications, ambiguously-mapped reads would be excluded from the analysis.
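
The first case can be sketched as a distance check between the uniquely mapped read and each candidate position of its pair. The expected distance and tolerance below are assumed example values, and the function name is hypothetical.

```python
def resolve_mate(unique_pos, candidate_positions, expected_distance=300, tolerance=100):
    """Pick the candidate position that fits the expected distance from the unique read.

    Returns None when zero or several candidates fit, i.e. the read stays ambiguous.
    """
    plausible = [
        pos for pos in candidate_positions
        if abs(abs(pos - unique_pos) - expected_distance) <= tolerance
    ]
    return plausible[0] if len(plausible) == 1 else None


print(resolve_mate(1000, [1280, 50000, 91000]))  # 1280
print(resolve_mate(1000, [1280, 1320]))          # None, both candidates fit
```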

14
Q

How can we identify SNPs from resequencing data?

A

We typically sequence the genome many times (e.g. 50x coverage) and look for positions where all of the reads show the same difference from the reference. This assumes (broadly, but not always, true) that sequencing errors occur at random.
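
A minimal sketch of this check for a single position, given the bases that the mapped reads show there. The depth and fraction thresholds are assumed values for illustration, and the function name is hypothetical.

```python
from collections import Counter


def call_snp(ref_base, read_bases, min_depth=10, min_fraction=0.9):
    """Return the variant base if nearly all reads agree on a non-reference base."""
    depth = len(read_bases)
    if depth < min_depth:
        return None  # too few reads to tell an error from a real difference
    base, count = Counter(read_bases).most_common(1)[0]
    # Assumption: a random sequencing error affects only a few reads, so a
    # base shared by at least min_fraction of the reads is taken as real.
    # A heterozygous SNP, seen in roughly half of the reads, would need a
    # lower threshold than the one used here.
    if base != ref_base and count / depth >= min_fraction:
        return base
    return None


print(call_snp("A", "G" * 49 + "A"))  # G: 49 of 50 reads differ from the reference
print(call_snp("A", "A" * 49 + "G"))  # None: a single differing read looks like an error
```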

15
Q

How can we tell the difference between sequencing errors and SNPs in IGV?

A

The differences:
Sequencing errors: a single difference on its own (only one coloured line in that region).
SNPs: the change is highlighted in all of the reads (a real difference between the genome that we sequenced and the reference genome).
Deeper coverage (more sequence reads) makes it easier to tell the two apart.

16
Q

What are the types of SNP, and their effects?

A

(The first type is the least likely to be pathogenic and the last ones are the most likely.)
Intergenic: not within a gene.
Intronic: within a gene but in an intron.
Synonymous: within protein-coding sequence (exon). The change in the nucleotide sequence does not change the encoded protein sequence.
Regulatory: in a promoter or upstream of the gene. It could affect the expression and regulation of genes.
Non-synonymous: alters the encoded protein sequence.
Nonsense: introduces a premature stop codon.
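
For SNPs inside coding sequence, the last three categories can be told apart by translating the codon before and after the change. The sketch below uses the standard genetic code; the function name is hypothetical, and it only covers SNPs that fall within a codon.

```python
# Standard genetic code, built from the usual TCAG ordering ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_coding_snp(ref_codon, alt_codon):
    """Classify a codon change as synonymous, nonsense or non-synonymous."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        return "synonymous"      # same amino acid, protein unchanged
    if alt_aa == "*":
        return "nonsense"        # premature stop codon
    return "non-synonymous"      # different amino acid


print(classify_coding_snp("CTT", "CTC"))  # synonymous (Leu -> Leu)
print(classify_coding_snp("GAA", "GTA"))  # non-synonymous (Glu -> Val)
print(classify_coding_snp("TGG", "TGA"))  # nonsense (Trp -> stop)
```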

17
Q

What are heterozygous SNPs?

A

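Diploid organisms inherit one copy of each chromosome from each parent, so the two copies of a gene can carry different alleles. A heterozygous SNP is a position where the two copies differ, so only about half of the reads show the change.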

18
Q

Why are some SNPs more likely to be pathogenic?

A

Because different SNPs can have different effects:
Intergenic: away from the gene.
Intronic: in the gene but in an intron.
Synonymous: in a gene exon, but the new codon encodes the same amino acid.
Regulatory: in the promoter region of the gene.
Non-synonymous: in a gene exon and changes the amino acid encoded by the codon.
Nonsense: changes the codon to a stop codon.

19
Q

What is the concept of a Genome-Wide Association Study?

A

A Genome-Wide Association Study (GWAS) allows the identification of genomic regions associated with a particular phenotype.
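
One common way to test a single SNP for association is to compare allele counts between individuals with the phenotype (cases) and without it (controls). The sketch below computes a chi-square statistic for that 2x2 table; it is an illustration of the idea for one SNP only, with made-up counts and a hypothetical function name, and a real study would repeat the test across the genome and correct for multiple testing.

```python
def allele_association_chi2(case_alt, case_ref, control_alt, control_ref):
    """Chi-square statistic for a 2x2 table of allele counts (cases vs controls)."""
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    total = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # an empty row or column carries no information
    return total * (a * d - b * c) ** 2 / denominator


# Made-up counts: the variant allele is more common in cases than in controls.
print(allele_association_chi2(30, 70, 10, 90))  # 12.5
# Same allele frequency in both groups: no association.
print(allele_association_chi2(20, 80, 20, 80))  # 0.0
```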