Bioinformatics Flashcards
Flat file
Term for data stored in a plain, ordinary file on the hard disk. Example: RefSeq.
Bioinformatics
Application of information technology to the storage, management and analysis of biological information (Facilitated by the use of computers)
Nanopore seq
A molecule is measured as it passes through the pore. Proteins in the pore pull it through at 800 nucleotides per minute. Read length is up to 300 000, which makes phasing/haplotyping possible, for example when an individual is heterozygous at two sites in the genome.
Examples of location descriptors
Location        Description
476             Points to a single base in the presented sequence
340..565        Points to a continuous range of bases bounded by and including the starting and ending bases
<345..500       The exact lower boundary point of a feature is unknown.
(102.110)       Indicates that the exact location is unknown but that it is one of the bases between bases 102 and 110.
(23.45)..600    Specifies that the starting point is one of the bases between bases 23 and 45, inclusive, and the end base 600
123^124         Points to a site between bases 123 and 124
145^177         Points to a site anywhere between bases 145 and 177
J00193:hladr    Points to a feature whose location is described in another entry: the feature labeled ‘hladr’ in the entry (in this database) with primary accession ‘J00193’
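The descriptor forms listed on this card can be told apart mechanically. The following is a minimal sketch in Python that covers only the forms shown here. The function name `classify_location` and its labels are hypothetical, chosen for illustration, and are not part of any GenBank or EMBL tool.

```python
import re

# One (pattern, label) pair per descriptor form on the card.
# Patterns are tried with fullmatch, so order does not cause false hits.
PATTERNS = [
    (r"\d+", "single base"),
    (r"\d+\.\.\d+", "continuous range"),
    (r"<\d+\.\.\d+", "range with unknown lower boundary"),
    (r"\(\d+\.\d+\)", "one base somewhere within the span"),
    (r"\(\d+\.\d+\)\.\.\d+", "range with uncertain start"),
    (r"\d+\^\d+", "site between bases"),
    (r"[A-Z]+\d+:\w+", "feature described in another entry"),
]


def classify_location(descriptor: str) -> str:
    """Return a label for the descriptor form, or 'unrecognised'."""
    for pattern, label in PATTERNS:
        if re.fullmatch(pattern, descriptor):
            return label
    return "unrecognised"


for example in ["476", "340..565", "<345..500", "(102.110)",
                "(23.45)..600", "123^124", "J00193:hladr"]:
    print(example, "->", classify_location(example))
```

Real feature tables also contain operators such as complement and join, which this sketch deliberately ignores.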
Sequencing file format tips
a) When saving a sequence for use in an email message or pasting into a web page…
b) When retrieving from a database or exchanging between programs…
c) When using the sequence again with the same program…
a) …use an unannotated text format such as FASTA
b) …use an annotated text format such as Genbank
c) …use that program’s annotated binary format (or annotated text if binary not available)
ASN.1 (NCBI)
GBFF (Sanger)
XML
Phred
*base calling
*vector trimming
*end of sequence read trimming
*assigns quality values (QV) to the bases in the sequence
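Phred quality values are conventionally defined from the estimated probability P that a base call is wrong: Q = -10 * log10(P). So Q20 corresponds to a 1 in 100 error probability and Q30 to 1 in 1000. A small Python sketch of the conversion and of a naive end trim follows. The function names and the default threshold of 20 are assumptions for illustration, and the trimming rule is a simplification, not Phred's own procedure.

```python
import math


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability P into a quality value Q = -10 * log10(P)."""
    return -10.0 * math.log10(p)


def phred_to_error_prob(q: float) -> float:
    """Convert a quality value Q back into an error probability."""
    return 10.0 ** (-q / 10.0)


def trim_low_quality_end(seq, quals, min_q=20):
    """Drop trailing bases whose quality is below min_q (naive end trimming)."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


trimmed_seq, trimmed_quals = trim_low_quality_end(
    "ACGTAC", [40, 38, 35, 30, 12, 8]
)
print(trimmed_seq, trimmed_quals)
```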
Phrap
*Phrap uses Phred’s base calling scores to determine the consensus sequences.
*Examines all individual sequences at a given position, and uses the highest scoring sequence (if it exists) to extend the consensus sequence
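The idea of letting the highest-scoring read decide each position can be sketched as follows. This is a toy illustration in Python, not Phrap's actual algorithm: it assumes the reads are already placed on a shared coordinate axis (a hypothetical `offset` per read) and simply keeps the base with the highest quality value in each column.

```python
def consensus_by_quality(reads):
    """Build a consensus from pre-placed reads.

    reads: list of (offset, sequence, qualities) tuples, where offset is the
    position of the read's first base on the shared coordinate axis.
    Positions covered by no read are reported as 'N'.
    """
    length = max(offset + len(seq) for offset, seq, _ in reads)
    best_base = ["N"] * length
    best_qual = [-1] * length
    for offset, seq, quals in reads:
        for i, (base, q) in enumerate(zip(seq, quals)):
            pos = offset + i
            # Keep the call with the highest quality seen so far at this position.
            if q > best_qual[pos]:
                best_base[pos] = base
                best_qual[pos] = q
    return "".join(best_base)


reads = [
    (0, "ACGTT", [30, 30, 30, 10, 5]),   # quality drops toward the read end
    (3, "TAGG", [40, 35, 30, 30]),       # overlapping read with better calls
]
print(consensus_by_quality(reads))
```

In the overlap, the second read's high-quality calls override the first read's low-quality end.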
Consed
graphical interface extension that controls both Phred and Phrap
Poor data at seq end
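This is due to the difficulty of resolving larger fragments (~1 kb): it is easier to resolve 21 bp from 20 bp than it is to resolve 1001 bp from 1000 bp, because one extra base is a 5% size difference in the first case but only 0.1% in the second.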
Cis- and transsplicing for ORF
Cis-splicing - splices out an intron and joins exons from the same site
Trans-splicing - splices and joins exons from different sites; can also occur between the sense and antisense strands.
Swissprot
SWISS-PROT is an annotated protein sequence database. Continuously updated (daily).
Format follows that of EMBL as closely as possible
Curated protein sequence database
Three differences:
- Strives to provide a high level of annotations
- Minimal level of redundancy
- High level of integration with other databases
Behind a paywall.
TREMBL
Translated EMBL sequences not (yet) in Swissprot. Updated faster than SWISS-PROT.
TREMBL - two parts
1. SP-TREMBL
Will eventually be incorporated into Swissprot
Divided into FUN, HUM, INV, MAM, MHC, ORG, PHG, PLN, PRO, ROD, UNC, VRL and VRT.
2. REM-TREMBL (remaining)
Will NOT be incorporated into Swissprot
Divided into: immunoglobulins and T-cell receptors, synthetic sequences, patent application sequences, small fragments, CDS not coding for real proteins
Protein searching
3 levels
1. Swissprot - Little noise, annotated entries
2. Swissprot + TREMBL - More noise, all probable entries
3. Translated EMBL (BLAST or TFASTA) - Most noisy, all possible entries
PDB
3D structures of proteins. AI can use the amino acid (AA) sequence to predict the structural model.
>10 000 structures of proteins
Also contains structures of DNA, carbohydrates and protein-DNA complexes
Structures are determined principally by X-ray crystallography; other methods are electron microscopy and NMR.
Each entry is identified by a unique 4-character code
4 most used databanks in bioinformatics
Gene Ontology - defines the terms
Pfam - protein families, identifies functional parts in proteins
SMART - visual presentation of protein families
KEGG - pathway database, shows which enzymes work together in a biosynthesis pathway
Problems with flat files:
Wasted storage space
Wasted processing time
Data control problems
Problems caused by changes to data structures
Access to data difficult
Data out of date
Constraints are system based
Limited querying, e.g. finding all single-exon GPCRs (<1000 bp)
Relational databases
A set of tables and links. A language to query the database. A program to manage the data.
Has existed for 50 years. Mainstream in bioinformatics.
Built on a well-known, proven and simple underlying mathematical theory. The relational model is very mature, with strong accumulated knowledge of how to make a relational back-end fast and reliable and how to exploit different technologies.
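The "tables, links and a query language" idea can be illustrated with Python's built-in sqlite3 module. The sketch below answers the kind of question that is hard to ask of a flat file: all single-exon GPCRs shorter than 1000 bp. The schema, table names and rows are invented toy data for illustration only.

```python
import sqlite3

# In-memory database: two tables linked by gene.family_id -> family.id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE family (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE gene (
    id         INTEGER PRIMARY KEY,
    symbol     TEXT NOT NULL,
    length_bp  INTEGER NOT NULL,
    exon_count INTEGER NOT NULL,
    family_id  INTEGER REFERENCES family(id)
);
""")

# Toy rows (hypothetical genes, not real annotations).
conn.executemany("INSERT INTO family VALUES (?, ?)",
                 [(1, "GPCR"), (2, "kinase")])
conn.executemany("INSERT INTO gene VALUES (?, ?, ?, ?, ?)", [
    (1, "geneA", 950, 1, 1),
    (2, "geneB", 2400, 1, 1),
    (3, "geneC", 800, 3, 1),
    (4, "geneD", 700, 1, 2),
])

# The query language expresses the question declaratively; the database
# manager decides how to execute it.
rows = conn.execute("""
    SELECT gene.symbol
    FROM gene JOIN family ON gene.family_id = family.id
    WHERE family.name = 'GPCR'
      AND gene.exon_count = 1
      AND gene.length_bp < 1000
    ORDER BY gene.symbol
""").fetchall()

single_exon_gpcrs = [symbol for (symbol,) in rows]
print(single_exon_gpcrs)
conn.close()
```

Storing the family name once and linking to it by id is also a small example of how redundancy is reduced.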
Pros with databases
+Redundancy can be reduced
+Inconsistency can be avoided
+Conflicting requirements can be balanced
+Standards can be enforced
+Data can be shared
+Data independence
+Integrity can be maintained
+Security restrictions can be applied
Cons with databases
-Size
-Complexity
-Cost
-Additional hardware costs
-Higher impact of failure
-Recovery more difficult
Identity
Extent to which two (nucleotide or amino acid) sequences are invariant
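Identity is usually reported as a percentage over an alignment. A minimal Python sketch follows, assuming the two sequences are already aligned to equal length with "-" as the gap character. The function name `percent_identity` is hypothetical, and the choice of denominator (all alignment columns except gap-gap columns) is one of several conventions in use.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of alignment columns in which both sequences carry the same residue."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore columns where both sequences have a gap.
    columns = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not columns:
        raise ValueError("no comparable columns")
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


print(percent_identity("ACGTACGT", "ACGTTCGT"))
print(percent_identity("AC-T", "ACGT"))
```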
Homology
Similarity attributed to descent from common ancestor
Orthologous
Homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function
Paralogous
Homologous sequences within a single species that arose by gene duplication.
Empirical finding
If two biological sequences are sufficiently similar, almost invariably they have similar biological functions and will be descended from a common ancestor.