final study Flashcards

(64 cards)

1
Q

What is the primary goal of genome assembly?

A

Reconstruct the original genome sequence

Genome assembly pieces together millions of DNA sequence reads to reconstruct a complete genome.

2
Q

What are the two main types of genome assembly methods?

A

Reference-based and De Novo

Reference-based assembly maps reads to a known reference genome, while De Novo assembly constructs a genome without a reference.

3
Q

What is the first step in the assembly workflow?

A

Start from genomic libraries

This involves obtaining raw long and/or short reads in fastq format.

4
Q

What is the purpose of quality trimming in genome assembly?

A

To get rid of low-quality sequences and prevent errors

Quality trimming ensures that only accurate sequences contribute to the assembly.
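
As a rough illustration of quality trimming, the Python sketch below removes low-quality bases from the 3' end of a single read. The function name `trim_3prime`, the threshold of Q20, and the Phred+33 offset are assumptions chosen for this example; dedicated trimming tools typically apply more robust rules, such as sliding windows.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Trim bases from the 3' end until one meets the quality threshold.

    seq and qual are the sequence and quality lines of one FASTQ record.
    min_q and offset (Phred+33) are assumptions for this sketch.
    """
    end = len(seq)
    # Walk backwards while the last remaining base is below the threshold.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIII###")
print(trimmed_seq, trimmed_qual)
```

Only the trailing low-quality bases are removed here; a low-quality base in the middle of the read is left in place.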

5
Q

What is a contig?

A

Contiguous sequences

Contigs are assembled from reads and represent continuous stretches of DNA.

6
Q

How are scaffolds formed in genome assembly?

A

By bridging contigs together

Scaffolds can contain gaps of unknown size and represent larger fragments of DNA.
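
A minimal Python sketch of how a scaffold can be written down once the order of the contigs is known. The function name `build_scaffold` and the fixed gap length are illustrative assumptions; because the true gap size is unknown, the run of `N` characters is only a placeholder.

```python
def build_scaffold(contigs, gap_size=100):
    """Join ordered contigs into one scaffold, separated by runs of N.

    gap_size is a placeholder length (an assumption for this sketch),
    not an estimate of the real distance between contigs.
    """
    gap = "N" * gap_size
    return gap.join(contigs)


scaffold = build_scaffold(["ACGTAC", "GGCTTA", "TTAGC"], gap_size=5)
print(scaffold)
```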

7
Q

What is N50 in the context of evaluating assembly quality?

A

The length of the shortest contig in the set of longest contigs that together cover at least half of the total assembly length

A higher N50 value indicates better contiguity in the assembly.
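
The N50 calculation can be sketched in a few lines of Python. The function name `n50` and the example contig lengths are made up for illustration.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    # Add contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


contig_lengths = [80, 70, 50, 40, 30, 20, 10]  # total = 300
print(n50(contig_lengths))  # 70, because 80 + 70 = 150 reaches half of 300
```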

8
Q

What does BUSCO assess in genome assemblies?

A

Completeness by evaluating single-copy genes

BUSCO uses a curated set of genes expected to be present in a given clade to benchmark assembly quality.

9
Q

What is a Hi-C library used for?

A

Capturing chromosome conformation for scaffolding the genome

Hi-C libraries help determine how different parts of chromosomes are physically close to one another.

10
Q

What is the significance of paired-end sequencing in Hi-C?

A

To map reads to the assembly and determine proximal locations of contigs

This step does not construct the genome but helps visualize the spatial organization of the genome.

11
Q

Fill in the blank: The _______ is a high throughput method for obtaining conformational information of chromosomes.

A

Hi-C library

12
Q

What challenge does the assembly strategy using short reads face?

A

Complex or repetitive regions that are longer than the reads

Reads shorter than such a region cannot span it, which can lead to a fragmented, low-quality assembly.

13
Q

What is a De Bruijn graph used for in genome assembly?

A

To build a graph from k-mers derived from short reads

It helps identify overlaps and connections between short sequence fragments.
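
A simplified Python sketch of building a De Bruijn graph: every k-mer in a read becomes an edge from its first k-1 bases to its last k-1 bases. The function name `de_bruijn` and the toy reads are assumptions for this example, and storing edges in a set drops k-mer counts, which real assemblers keep track of.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix node to its suffix node.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


graph = de_bruijn(["ACGTC", "GTCA"], k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

The two toy reads overlap on `GTC`, so they share the edge from `GT` to `TC` and the graph links them into one path.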

14
Q

True or False: Hybrid assembly combines short and long reads to improve coverage.

A

True

15
Q

What does the diagonal in a Hi-C interaction heatmap represent?

A

Sequences that are next to each other

The intensity of color in the heatmap indicates the frequency of interactions between regions.

16
Q

What is a chromosome-level assembly?

A

An assembly that represents a full set of chromosomes

This may include phased assembly, capturing genetic information from both parents.

17
Q

What is the format of the first line in a FastQ file?

A

header (@)

The header line typically starts with ‘@’ followed by a sequence identifier.

18
Q

What does the second line in a FastQ file represent?

A

sequence

This line contains the nucleotide sequence of the DNA.

19
Q

What is found on the third line of a FastQ file?

A

separator (+)

This line serves as a separator between the sequence and the quality score.

20
Q

What information is provided in the fourth line of a FastQ file?

A

Quality Score

This line contains a code that represents the quality score of the sequence.
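
The four-line FastQ record layout (header, sequence, separator, quality) can be sketched as a small Python reader. The function name `read_fastq` and the example record are assumptions for this illustration, and the sketch assumes plain four-line records with no blank lines.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) for each four-line FastQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FastQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, quality


example = io.StringIO("@read1\nACGT\n+\nII?#\n")
records = list(read_fastq(example))
print(records)
```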

21
Q

What is the desired quality score for most reads in a FastQ file?

A

30 or over

A quality score of 30 indicates a 99.9% chance that the base is correct.

22
Q

What does a quality score of 30 indicate about the base?

A

0.1% chance it’s the wrong base (1 error in 1,000)

This score reflects a high probability of accuracy in base calling.

23
Q

What character in a FastQ file indicates a quality score of 30?

A

?

In the Phred+33 encoding, the ‘?’ symbol (ASCII code 63) corresponds to a quality score of 30.
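
The link between a quality character, its Phred score, and the error probability can be checked with a short Python sketch. It assumes the Phred+33 encoding, and the function names are made up for this example. The error probability follows P = 10^(-Q/10).

```python
def phred_from_char(char, offset=33):
    """Convert one quality character to its Phred score (Phred+33 assumed)."""
    return ord(char) - offset


def error_probability(q):
    """Probability that a base with Phred score q was called incorrectly."""
    return 10 ** (-q / 10)


q = phred_from_char("?")  # ord("?") is 63, so the score is 30
p = error_probability(q)  # about 0.001, i.e. 1 error in 1,000 bases
print(q, p)
```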

24
Q

What is the primary purpose of a FASTA file?

A

store DNA sequences

FASTA files are used to represent nucleotide or protein sequences.

25
What does the first line in a FASTA file begin with?
header (>) ## Footnote The header line starts with '>' followed by a sequence identifier.
26
What is contained in the second line of a FASTA file?
the sequence ## Footnote This line contains the actual sequence data, which may continue over several lines until the next header.
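A minimal Python sketch of reading a FASTA file, in which each '>' header starts a new record and the sequence lines that follow are joined together. The function name `read_fasta` and the example records are assumptions for this illustration.

```python
import io


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first word after '>' as the sequence identifier.
            name = (line[1:].split() or [""])[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


example = io.StringIO(">chr1 toy example\nACGT\nGGCC\n>chr2\nTTAA\n")
sequences = dict(read_fasta(example))
print(sequences)
```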
27
What technology uses hairpin adapters to make DNA circular?
PacBio ## Footnote The circular template lets the polymerase read the same DNA fragment many times.
28
How does PacBio improve the accuracy of DNA reading compared to Illumina?
It reads the circularized fragment repeatedly, trims out the adapter sequences, and combines the copies into a more accurate consensus read ## Footnote This process enhances the quality of the sequencing results.
29
What is the main mechanism used by Oxford Nanopore for sequencing DNA?
DNA fragments are loaded onto a flow cell with ~2 thousand distributed nanopores ## Footnote Electrical current runs through the membrane of the flow cell to facilitate sequencing.
30
What challenge does Oxford Nanopore face with repetitive sequences?
It struggles with runs of the same nucleotide (homopolymers) ## Footnote This makes it difficult to determine how many times a base is repeated in the run.
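To make the homopolymer problem concrete, the Python sketch below splits a sequence into runs of identical bases. It only illustrates which part of the sequence is ambiguous (the run length) and does not model how a basecaller works; the function name `homopolymer_runs` is an assumption.

```python
from itertools import groupby


def homopolymer_runs(seq):
    """Return (base, run_length) pairs for consecutive identical bases."""
    return [(base, len(list(group))) for base, group in groupby(seq)]


runs_a = homopolymer_runs("ACAAAAAG")
runs_b = homopolymer_runs("ACAAAAG")
print(runs_a)
print(runs_b)
```

The two sequences have the same order of bases (A, C, A, G) and differ only in the length of one run of A, which is the quantity that is hard to measure.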
31
What role does the motor protein play in Oxford Nanopore sequencing?
It controls the speed at which the DNA strand passes through the nanopore ## Footnote The motor protein is added during the library prep stage.
32
True or False: The motor protein is present on the nanopore from the beginning of the sequencing process.
False ## Footnote The motor protein is added during the adapter step of library preparation.
33
What does the nanopore capture during the sequencing process?
Changes in ion current ## Footnote This change in current is used to infer the sequence of the DNA.
34
What is a GFF file?
A plain-text, tab-delimited annotation file in General Feature Format.
35
What information does a GFF file provide regarding genes/elements?
It gives the coordinates of those genes/elements on the genome sequence named in the file.
36
What are the key columns in a GFF file?
seq name, source, feature, start, end, score, strand, phase, attribute.
37
What does the 'feature' column in a GFF file represent?
The type of feature the line describes, such as gene, mRNA, exon, or CDS.
38
What does the 'start' column indicate in a GFF file?
The location on the genome where the feature begins.
39
What does the 'end' column indicate in a GFF file?
The location on the genome where the feature ends.
40
What does the 'score' column represent in a GFF file?
A numeric score for the feature; a dot (.) is used when there is no score.
41
What does the 'strand' column indicate?
+ (positive strand, forward) or - (negative strand, reverse complement).
42
What does the 'phase' column indicate in a GFF file?
For coding sequences (CDS), it gives the number of bases (0, 1, or 2) to skip from the start of the feature to reach the first base of the next complete codon.
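As a worked example of phase, the Python sketch below computes the phase of the next CDS segment from the length and phase of the current one. The function name `next_cds_phase` is an assumption, and the sketch treats phase as the number of bases to skip before the first complete codon.

```python
def next_cds_phase(cds_length, phase):
    """Phase of the following CDS segment, given this segment's length and phase."""
    if phase not in (0, 1, 2):
        raise ValueError("phase must be 0, 1, or 2")
    # Bases left over at the end of this segment that start an unfinished codon.
    leftover = (cds_length - phase) % 3
    # The next segment must skip the bases that complete that codon.
    return (3 - leftover) % 3


# A 10-base segment with phase 0 holds 3 codons plus 1 leftover base,
# so the next segment starts with 2 bases that finish the split codon.
print(next_cds_phase(10, 0))  # 2
```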
43
What does the 'attribute' column typically include?
A description/ID for the feature or any other relevant information for each line.
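The nine columns can be read with a short Python sketch. It assumes tab-separated GFF3 lines whose attribute column uses `key=value` pairs separated by semicolons; the names `GffFeature` and `parse_gff_line` and the example line are made up for illustration.

```python
from typing import Dict, NamedTuple, Optional


class GffFeature(NamedTuple):
    seqid: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    phase: Optional[int]
    attributes: Dict[str, str]


def parse_gff_line(line):
    """Parse one tab-separated GFF3 feature line into a GffFeature."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns")
    seqid, source, feature, start, end, score, strand, phase, attrs = fields
    attributes = {}
    for pair in attrs.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = value.strip()
    return GffFeature(
        seqid, source, feature, int(start), int(end),
        None if score == "." else float(score),  # "." means no score
        strand,
        None if phase == "." else int(phase),    # "." means no phase
        attributes,
    )


line = "chr1\texample\tCDS\t101\t250\t.\t+\t0\tID=cds1;Parent=mrna1"
feat = parse_gff_line(line)
print(feat)
```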
44
What is the hierarchical structure of features in a GFF file?
A gene contains one or more mRNAs (transcripts), and each mRNA contains exons and CDS segments; the CDS segments encode the protein.
45
How are exons identified in relation to mRNA in a GFF file?
In GFF3, each exon names its mRNA in the Parent attribute; exon lines usually follow that mRNA line, and their start and end positions fall within its coordinates.
46
What is the purpose of using a GFF file with a fasta file?
To extract sequences of interest from the fasta file using the feature coordinates.
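A minimal Python sketch of pulling one feature's sequence out of a genome using GFF coordinates, which are 1-based and inclusive. The function name `extract_feature` and the toy genome are assumptions for this example; features on the minus strand are reverse-complemented.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_feature(sequence, start, end, strand):
    """Return the feature sequence for 1-based, inclusive GFF coordinates."""
    sub = sequence[start - 1:end]
    if strand == "-":
        # Complement each base, then reverse the result.
        sub = sub.translate(COMPLEMENT)[::-1]
    return sub


genome = {"chr1": "AAACCCGGGTTT"}
forward = extract_feature(genome["chr1"], 4, 6, "+")
reverse = extract_feature(genome["chr1"], 1, 6, "-")
print(forward, reverse)
```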
47
What is Illumina sequencing characterized by?
High coverage; reference-based assembly ## Footnote Cost-effective for many individuals, lots of data with high accuracy.
48
What is a key advantage of Illumina sequencing regarding reference maps?
Reads don’t need to span long stretches ## Footnote They are mapped to an existing reference.
49
What is essential to resolve in Illumina sequencing?
Allelic variants ## Footnote Distinguish true variation from sequencing errors.
50
What type of coverage does PacBio/ONT provide?
Medium-high coverage ## Footnote Used alongside Illumina and Hi-C for de novo assembly.
51
What is the benefit of long reads in PacBio/ONT sequencing?
Resolve complex regions ## Footnote High accuracy is needed to get a good reference genome.
52
What is the starting point for PacBio/ONT sequencing?
Starting from scratch ## Footnote Requires high accuracy to minimize errors.
53
What is Synteny?
Linkage between genomic regions that can be conserved across species ## Footnote Synteny can refer to any given region of the genome and can be observed at both the chromosomal and gene order levels.
54
What are the two levels at which Synteny can be observed?
Chromosomal level (synteny, macrosynteny) and gene-order level (collinearity, microsynteny)
55
What is Structural variation in the context of genomics?
Rearrangement of large fragments of DNA ## Footnote This variation can include large insertions, deletions, duplications, and can occur within populations of the same species.
56
What are examples of Structural variation?
Inversions, translocations, and copy-number variation
57
What can Inversions lead to?
Diseases, effects on phenotypes, speciation events, or no effect at all
58
What is the main cause of Inversions and Translocations?
Errors in repairing double-strand breaks
59
What happens during a reciprocal translocation?
Two non-homologous chromosomes swap segments, typically from their ends, so the loci on those segments trade places
60
What is a nonreciprocal translocation?
A one-way transfer of a segment from one chromosome to another, with nothing transferred in return
61
What characterizes Robertsonian translocation?
Two acrocentric chromosomes fuse at or near their centromeres, keeping the long arms (most of their DNA) and usually losing the short arms
62
Fill in the blank: Synteny can include _______ between genomic regions.
linkage
63
True or False: Structural variation can only occur between different species.
False
64
What is the difference between macrosynteny and microsynteny?
Macrosynteny refers to synteny at the large, chromosomal scale, while microsynteny refers to fine-scale conservation of gene order.
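As a toy illustration of gene-order conservation (collinearity), the Python sketch below checks whether the genes shared by two regions appear in the same order, or in fully reversed order as in an inverted block. The function name `is_collinear` and the gene names are assumptions; real synteny analyses identify shared genes by homology and tolerate gaps and local rearrangements.

```python
def is_collinear(order_a, order_b):
    """True if genes shared by both regions keep the same (or fully reversed) order."""
    position_b = {gene: i for i, gene in enumerate(order_b)}
    # Positions in region B of the shared genes, listed in region A's order.
    shared = [position_b[gene] for gene in order_a if gene in position_b]
    pairs = list(zip(shared, shared[1:]))
    increasing = all(x < y for x, y in pairs)
    decreasing = all(x > y for x, y in pairs)
    return increasing or decreasing


species_a = ["g1", "g2", "g3", "g4"]
species_b = ["g1", "g2", "x1", "g3", "g4"]  # extra gene, same order
species_c = ["g1", "g3", "g2", "g4"]        # g2 and g3 swapped

print(is_collinear(species_a, species_b))
print(is_collinear(species_a, species_c))
```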