Week 2 Flashcards
(34 cards)
Information
the letters used to write the sequences
Genomes exist in a double-stranded or single-stranded state
What is information used for?
We use the information in DNA sequences to align sequences
Contig
Summary sequence of overlapping DNA fragments
Why can we align DNA sequences?
How long does a sequence have to be for us to expect it to be unique?
Information theory and Genetics
The restriction enzyme HaeIII cuts the sequence GGCC. What is the probability that any random four base pair sequence is a HaeIII site?
1/4 chance of getting each base at each position
1/4 x 1/4 x 1/4 x 1/4 = 1/256
In general: (1/4)^n
n = length of the sequence (n = 4 for GGCC)
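The HaeIII arithmetic can be checked with a short sketch. It assumes each base occurs with probability 1/4 and takes n as the length of the recognition site; the function name site_probability is made up for this illustration.

```python
from fractions import Fraction

def site_probability(site: str) -> Fraction:
    """Chance that a random stretch of len(site) bases equals `site`,
    assuming each of the four bases is equally likely (1/4 each)."""
    return Fraction(1, 4) ** len(site)

print(site_probability("GGCC"))  # 1/256
```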
How many distinct permutations can a stretch of n bases have?
How many sequences (permutations) are possible for a given length of DNA strand (n bases long) if the bases occur at equal frequency in the genome?
How many pieces of distinct information can a stretch of n bases potentially have?
4^n
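For small n the 4^n count can be confirmed by listing every possible sequence. This is only a sketch; all_sequences is an illustrative name, and enumeration stops being practical once n grows past a dozen or so.

```python
from itertools import product

def all_sequences(n):
    """Return every DNA sequence of length n over the alphabet A, C, G, T."""
    return ["".join(bases) for bases in product("ACGT", repeat=n)]

for n in (1, 2, 3):
    # The number of listed sequences matches 4^n each time.
    print(n, len(all_sequences(n)), 4 ** n)
```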
We can expect a sequence to be unique in a genome when the # of permutations is greater than the # of sequences of n bases in the genome
Bacterial genome = 4x10^6 bases
Human genome = 3x10^9 bases (single stranded)
What is the # of sequences of n bases in the genome?
What is the number of sequences of n bases in the genome?
Because genomes are so long, the number of sites is approximately the length of the genome if the genome is single stranded
The number of sites in double stranded DNA = 2x the length of the genome
Palindromic sequences are an exception.
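A sketch of why palindromic sequences are an exception: a palindromic site reads the same on both strands, so counting both strands counts that site twice. The helper names below are made up for this illustration.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def is_palindromic(seq):
    """True when the sequence equals its own reverse complement."""
    return seq == reverse_complement(seq)

def number_of_sites(genome_length, double_stranded=True):
    """Approximate number of sites, ignoring the few positions lost at the ends."""
    return 2 * genome_length if double_stranded else genome_length

print(is_palindromic("GGCC"))      # True
print(number_of_sites(4 * 10**6))  # 8000000
```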
We can expect a sequence to be unique in a genome when the # of permutations is greater than the number of sequences of n bases in the genome.
When is 4^n > 8x10^6 (bacteria)?
A sequence of n >= 12 bases is expected to be unique in bacteria
When is 4^n > 6x10^9 (humans)?
A sequence of n >= 17 bases is expected to be unique in humans
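The two thresholds can be found by raising n until 4^n exceeds the number of sites. This is a minimal sketch; min_unique_length is an illustrative name, and the site counts assume double stranded genomes as in the cards above.

```python
def min_unique_length(sites):
    """Smallest n for which 4**n is greater than the number of sites."""
    n = 1
    while 4 ** n <= sites:
        n += 1
    return n

print(min_unique_length(8 * 10**6))  # 12 (bacteria)
print(min_unique_length(6 * 10**9))  # 17 (humans)
```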
Say I have 9 genomes, each made of a random sequence of 3 X 10^9 bases, and I search each of them with ATAGACATAGGATACAT (17 bases). How many would you expect to have the sequence?
6 X 10^9 sites / 4^17 (about 17 X 10^9) permutations = about 1/3
Therefore, 1/3 X 9 = 3.
I expect about 3 of these random genomes to have this sequence
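The same estimate as a sketch in code. It follows the card's approximation of treating sites divided by permutations as the chance that one genome contains the query, and it assumes double stranded random genomes; the function name is made up for illustration.

```python
def expected_genomes_with_query(n_genomes, genome_length, query):
    """Expected number of random genomes that contain `query`."""
    sites = 2 * genome_length          # both strands are searched
    permutations = 4 ** len(query)     # 4^17 for a 17 base query
    per_genome = sites / permutations  # about 1/3 for the numbers on the card
    return n_genomes * per_genome

result = expected_genomes_with_query(9, 3 * 10**9, "ATAGACATAGGATACAT")
print(round(result, 2))  # 3.14
```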
How much information is there in DNA?
A, T, G, C: each letter can be arbitrarily represented as:
A=00
G=01
C=10
T=11
the information content at any position in a DNA sequence is 2 bits
smallest unit of information is a bit
The total number of bits in a sequence of n bases is the sum of the bits at each position (2n).
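A sketch of the two-bit encoding, using the arbitrary assignment from the card (A=00, G=01, C=10, T=11). The names BITS, encode and information_bits are illustrative.

```python
# Arbitrary two-bit code for each base, as on the card.
BITS = {"A": "00", "G": "01", "C": "10", "T": "11"}

def encode(seq):
    """Write a DNA sequence as a string of bits, two per base."""
    return "".join(BITS[base] for base in seq)

def information_bits(seq):
    """Total information content: 2 bits at each position."""
    return 2 * len(seq)

print(encode("GATC"))            # 01001110
print(information_bits("GATC"))  # 8
```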
Information theory
Trying to pick out one specific object from a set
We have eight objects; each object is assigned a distinct binary number that identifies it
When do I have sufficient information to identify the object?
If you have only one bit of information, a search through the objects narrows the candidates to half of them
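The halving can be seen in a small sketch: eight objects get three-bit labels, and each known bit cuts the candidate list in half. The function name and the choice of "101" as the target are made up for this illustration.

```python
def remaining_candidates(n_objects, known_bits):
    """Labels of the objects still consistent with the bits known so far.

    Objects are numbered 0..n_objects-1 and labelled in fixed-width binary.
    """
    width = max(1, (n_objects - 1).bit_length())
    labels = [format(i, f"0{width}b") for i in range(n_objects)]
    return [label for label in labels if label.startswith(known_bits)]

for known in ("", "1", "10", "101"):
    # Candidate counts go 8, 4, 2, 1 as bits are added.
    print(repr(known), len(remaining_candidates(8, known)))
```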
Minimum amount of information
Imin = log2 N
I >= Imin for the information to be sufficient
N is the number of objects (for a genome, the number of sites)
12 base sequence = 24 bits > log2(8 X 10^6) = 22.9
17 base sequence = 34 bits > log2(6 X 10^9) = 32.5
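A sketch of the Imin = log2 N check for the two genomes, assuming 2 bits per base and N equal to the number of sites in a double stranded genome. The function names are illustrative.

```python
import math

def min_information_bits(n_objects):
    """Imin = log2(N): bits needed to single out one of N objects."""
    return math.log2(n_objects)

def is_sufficient(seq_length, n_objects):
    """True when a sequence of seq_length bases (2 bits each) carries I >= Imin."""
    return 2 * seq_length >= min_information_bits(n_objects)

print(round(min_information_bits(8 * 10**6), 1))  # 22.9
print(round(min_information_bits(6 * 10**9), 1))  # 32.5
print(is_sufficient(12, 8 * 10**6), is_sufficient(11, 8 * 10**6))  # True False
```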
BLAST: Basic Local Alignment Search Tool
GenBank has about 10^13 bases of sequence in the database. Therefore, Imin = 44, and a sequence of 23 bases or more could be expected to be unique, more often than not, in the database.
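A sketch of where 44 bits and 23 bases can come from. It assumes both strands are counted, so N = 2 X 10^13 sites; that assumption is not stated on the card, but it reproduces the card's numbers. The function name is illustrative.

```python
import math

def min_unique_bases(n_sites):
    """Smallest number of bases (2 bits each) with at least log2(n_sites) bits."""
    return math.ceil(math.log2(n_sites) / 2)

sites = 2 * 10**13                  # assumed: 10^13 bases, both strands
print(round(math.log2(sites), 1))   # 44.2
print(min_unique_bases(sites))      # 23
```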
BLAST takes your query and breaks it into short words (5-base words). It breaks the database into words as well (5-base words), and these words are indexed with position information.
If a query word matches a database word, the BLAST program creates an alignment of the query to the matching sequences in the database
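The word-indexing idea can be sketched as follows. This is a toy illustration of the step described on the card, not the actual BLAST implementation: it only finds exact word matches and leaves out scoring and alignment extension. The function names and the tiny example database are made up, and the word size of 5 is taken from the card.

```python
from collections import defaultdict

def build_word_index(database, word_size=5):
    """Map every word in the database to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((name, pos))
    return index

def find_word_matches(query, index, word_size=5):
    """Return (query position, sequence name, database position) for each matching word."""
    matches = []
    for qpos in range(len(query) - word_size + 1):
        word = query[qpos:qpos + word_size]
        for name, dpos in index.get(word, []):
            matches.append((qpos, name, dpos))
    return matches

database = {"seq1": "GGATAGACATCC"}  # hypothetical one-entry database
index = build_word_index(database)
print(find_word_matches("ATAGACA", index))
# [(0, 'seq1', 2), (1, 'seq1', 3), (2, 'seq1', 4)]
```

The three matches fall on one diagonal (database position minus query position is always 2), which is the kind of evidence an alignment step could then build on.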
Two statistics of a BLAST
Score: the higher the score, the better.
E value: the lower the E value, the better.
The E value tells you the chance of an alignment like yours occurring at random when searching the database. A low E value means a low chance.
Cost of DNA sequencing
Has fallen over the past 20 years
A large drop occurs after massively parallel sequencing is introduced
Sanger
reads per run: 1
Length of read: 1,000 bases
basic method: termination
expensive/accurate
Illumina
length of read: 100 bases
# of reads per run: 10^9-10^10
basic method: real time incorporation
cheap/accurate
Pacific Biosciences (PacBio)
length of read: 15,000 bases
reads/run: 10^6
real time incorporation
cheap/low accuracy
nanopore
read length: 100,000
#reads per run: 10^6
basic method: current changes
cheap/low accuracy
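One way to compare the four platforms is the rough number of bases per run, read length times reads per run. The sketch below uses only the figures on the cards above, taking the lower end of the Illumina range; the dictionary layout and function name are illustrative.

```python
# (read length in bases, reads per run), taken from the cards above.
PLATFORMS = {
    "Sanger": (1_000, 1),
    "Illumina": (100, 10**9),      # lower end of the 10^9-10^10 range
    "PacBio": (15_000, 10**6),
    "Nanopore": (100_000, 10**6),
}

def bases_per_run(read_length, reads_per_run):
    """Rough sequencing output of one run, in bases."""
    return read_length * reads_per_run

for name, (length, reads) in PLATFORMS.items():
    print(f"{name}: {bases_per_run(length, reads):.1e} bases per run")
```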
Real time incorporation: Illumina
A series of rounds of synthesis; the nucleotides are modified to have a fluorescent group and a blocked 3’OH
After cycle one, the block on the 3’OH and the fluorescent group are removed
The cycle is repeated again and again, taking successive pictures to determine the sequence of the DNA
Real time incorporation: Pacbio
The cell is set up so that you can see the DNA polymerase as it incorporates labeled dNTPs. The label is at the very end of the phosphate chain. As each labeled dNTP is incorporated into the growing chain, its fluorescence is detected; the label then diffuses away shortly after incorporation. The series of fluorescent signals you observe is the read output.
Nanopore
the sequence that is in the pore alters the amount of current that can flow through the nanopore in a specific manner depending on the sequence.
As each base goes through, we see a current change
A set of bases sits in the pore while it is being read; the current change we’re looking at reflects the G coming out and the T going in
Illumina basic steps
DNA molecule has adaptors added.
Sticks to oligos on a glass slide.
Amplified to make a microdot on the slide.
First sequence read.
Reorientation of the DNA molecule.
Second paired sequence read.
Bridge amplification to close fragments