1. Navigating Information About Nucleotide and Protein Sequences Flashcards

1
Q

What is the difference between Bioinformatics and Computational Biology?

A

Bioinformatics: the goal is to build useful tools to analyze biological data - engineering
Computational biology: the goal is to learn new biology by using computational techniques - science

2
Q

What does “demultiplex” mean?

A

When many samples are sequenced together in a single run, the signals from all of them are obtained at the same time. A computer separates these signals so that we know which sequence comes from which sample. This process is demultiplexing.
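A minimal sketch of the idea, assuming each read begins with a short sample-specific barcode. The barcodes, sample names, and the `demultiplex` function are hypothetical; real pipelines usually take the index from a separate index read and tolerate mismatches.

```python
from collections import defaultdict

# Hypothetical barcode -> sample table (assumption: all barcodes share one length).
BARCODES = {"ACGT": "sample_A", "TGCA": "sample_B"}


def demultiplex(reads, barcodes=BARCODES):
    """Group reads by the sample whose barcode they start with."""
    length = len(next(iter(barcodes)))
    bins = defaultdict(list)
    for read in reads:
        tag, insert = read[:length], read[length:]
        # Reads with an unknown barcode go to a catch-all bin.
        sample = barcodes.get(tag, "undetermined")
        bins[sample].append(insert)
    return dict(bins)
```

Calling `demultiplex(["ACGTAAAA", "TGCACCCC"])` puts `AAAA` under `sample_A` and `CCCC` under `sample_B`.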

3
Q

A higher Q score indicates a ( greater / smaller ) probability of error.

A

smaller

4
Q

What is a FASTA file?

A

Text-based format for representing either nucleotide sequences or amino acid sequences in which nucleotides or amino acids are represented using single-letter codes.

5
Q

Components of a FASTA file (each entry in a FASTA file consists of…)

A
  1. “>” followed by a sequence identifier and the definition or description of the sequence
  2. Lines of sequence data
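
The two components above can be read with a few lines of code. This is a minimal sketch, not a full parser: `parse_fasta` and `_entry` are hypothetical names, and splitting the header at the first space into identifier and description is an assumption about the header layout.

```python
def parse_fasta(text):
    """Yield (identifier, description, sequence) for each FASTA entry."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous entry, if any.
            if header is not None:
                yield _entry(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _entry(header, chunks)


def _entry(header, chunks):
    # Assumption: the identifier ends at the first space of the header line.
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)
```

Sequence lines are joined, so an entry wrapped over several lines comes back as one string.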
6
Q

What is an accession number?

A

Unique identifier for a sequence record
Format: 1 letter + 5 digits, or 2 letters + 6 digits
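
The two formats on this card can be checked with a regular expression. A minimal sketch, assuming uppercase letters and no version suffix; `looks_like_accession` is a hypothetical helper, and other accession formats (for example RefSeq accessions with an underscore) are deliberately not covered.

```python
import re

# 1 letter + 5 digits, or 2 letters + 6 digits (the two formats on this card).
ACCESSION = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")


def looks_like_accession(text):
    """Return True if text matches one of the two formats above."""
    return bool(ACCESSION.match(text))
```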

7
Q

What is a flat file?

A

During sequence submission, the submitter has to provide the name of the sequence, the source, annotation, ORF, sequence and translation product. All of this is displayed in a flat file.

8
Q

Types of motif representations

A

Regular expressions (regex)
Profiles (matrices)
Logos

9
Q

Regular expression for motifs (definition)

A

Defines a sequence pattern using the standard IUPAC one-letter amino acid code, allowing for ambiguities at individual positions
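
As an illustration, the PROSITE pattern for N-glycosylation sites, N-{P}-[ST]-{P}, can be written as a Python regular expression. This is a sketch only: `find_motif` is a hypothetical helper, and `finditer` reports non-overlapping matches, so overlapping sites would need a lookahead.

```python
import re

# N, then any residue except P, then S or T, then any residue except P.
MOTIF = re.compile(r"N[^P][ST][^P]")


def find_motif(protein):
    """Return (0-based start, matched text) for each non-overlapping hit."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(protein)]
```

Here `[ST]` expresses an ambiguity (either residue is allowed) and `[^P]` an exclusion.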

10
Q

Profile for motifs (definition)

A

Consists of a position weight matrix in which each position of the motif receives a score (probability) for each amino acid
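
A minimal sketch of such a matrix, built from a small set of aligned sequences of equal length. `build_profile` and `score` are hypothetical names, and using raw frequencies is a simplification: profiles used in practice typically add pseudocounts and work with log-odds scores.

```python
from collections import Counter


def build_profile(aligned):
    """Return one {residue: frequency} dict per alignment column."""
    n = len(aligned)
    return [
        {residue: count / n for residue, count in Counter(column).items()}
        for column in zip(*aligned)
    ]


def score(profile, candidate):
    """Multiply per-position frequencies; residues never seen score 0."""
    probability = 1.0
    for column, residue in zip(profile, candidate):
        probability *= column.get(residue, 0.0)
    return probability
```

A candidate that matches the most frequent residue at every position gets the highest score.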

11
Q

Sequence logo for motifs (definition)

A

Graphical display of a multiple sequence alignment consisting of color-coded stacks of letters representing amino acids at successive points

12
Q

What is a RefSeq?

A

It is the reference sequence for a given gene or protein

13
Q

What do each of the following accession numbers correspond to? (molecule types)
NC_123456
NG_123456
NM_123456
NR_123456
NP_123456

A

NC_: complete genomic molecules (genomes, chromosomes…)
NG_: incomplete genomic region (gDNA for a gene)
NM_: mRNA
NR_: non-coding RNA
NP_: protein
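
The prefix-to-molecule mapping on this card can be expressed as a small lookup. A sketch only: `refseq_molecule_type` is a hypothetical helper and covers just the five prefixes listed here, returning "unknown" for anything else.

```python
# The five prefixes from this card; other prefixes are not covered.
REFSEQ_PREFIXES = {
    "NC": "complete genomic molecule",
    "NG": "incomplete genomic region",
    "NM": "mRNA",
    "NR": "non-coding RNA",
    "NP": "protein",
}


def refseq_molecule_type(accession):
    """Map an accession such as NM_123456 to its molecule type."""
    prefix, separator, _ = accession.partition("_")
    if not separator:
        return "unknown"
    return REFSEQ_PREFIXES.get(prefix, "unknown")
```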

14
Q

What is a genome browser?

A

A program that provides a graphical interface for users to browse, search, retrieve and analyze genomic sequence and annotation data

15
Q

Most commonly used genome browsers (list)

A

UCSC Genome Browser
NCBI Genome Data Viewer

16
Q

What is a BCL file?

A

It’s the file in which base calls are stored for each cluster during a sequencing-by-synthesis run

17
Q

What is a FASTQ file?

A

Text file that contains the sequence data from the clusters that pass the filter in a flow cell.
It contains a sequence identifier, the sequence, a separator (+ sign) and the base call quality score (Q)
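
The four parts of a record can be read in groups of four lines. A minimal sketch, assuming each record occupies exactly four lines and the identifier line starts with "@"; `parse_fastq` is a hypothetical helper, not a replacement for a tested library.

```python
def parse_fastq(text):
    """Yield (identifier, sequence, quality string) per four-line record."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record at line {i + 1}")
        # Every base must have exactly one quality character.
        if len(sequence) != len(quality):
            raise ValueError(f"length mismatch in record {header}")
        yield header[1:], sequence, quality
```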

18
Q

Base call quality score (Q) (definition)

A

Often encoded as an ASCII character, it provides information about the reliability of a base call
Q = -10 log10(e), where e is the estimated probability of the base call being wrong
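
The formula can be applied in both directions, and a quality string can be decoded character by character. A minimal sketch: the function names are hypothetical, and the ASCII offset of 33 (the Phred+33 encoding) is an assumption about how the file was written.

```python
import math


def error_probability(q):
    """Invert Q = -10 log10(e) to get e from a Q score."""
    return 10 ** (-q / 10)


def phred_score(e):
    """Compute Q = -10 log10(e) from an error probability e."""
    return -10 * math.log10(e)


def decode_quality(quality, offset=33):
    """Turn a quality string into Q scores (assumes Phred+33 by default)."""
    return [ord(character) - offset for character in quality]
```

Under the Phred+33 assumption, the character `I` decodes to Q40 and `!` to Q0.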

19
Q

What error rate does a Q score of Q10 represent? What about Q20? Q30?

A

Q10: error rate of 1 in 10
Q20: error rate of 1 in 100
Q30: error rate of 1 in 1000

20
Q

What are the primary sequence databases?

A

GenBank (NCBI)
European Nucleotide Archive (ENA), hosted by EMBL-EBI
DNA Data Bank of Japan (DDBJ)

21
Q

What is the sequence read archive (SRA)?

A

Database that contains raw sequence data; it is not curated and contains redundancies.

22
Q

What is a GI number?

A

It is the GeneInfo Identifier, the number assigned to a unique GenBank entry

23
Q

What is UniProtKB?

A

A database formed by Swiss-Prot and TrEMBL where we can find information about proteins and related sequences

24
Q

What are the main protein domain databases?

A

Prosite
Pfam
InterPro