DBMS endsem Flashcards by Era Todkar

Relational database

Method of organizing and managing data through relations, attributes, and tuples
Ex.
Components (8) :
Relation, Attribute, Tuple, Constraint, Domain (allowable values of an attribute), PK (primary key), FK (foreign key), Normalization

Advantages:
Scalability
SQL support
User friendly
Data integrity
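
A minimal sketch of how the components above (PK, FK, domain, constraint) protect data integrity, using Python's built-in sqlite3 module. The department/student tables, column names, and sample rows are hypothetical, chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: two relations linked by a foreign key.
conn.executescript("""
CREATE TABLE department (
    dept_id INTEGER PRIMARY KEY,                  -- PK
    name    TEXT NOT NULL UNIQUE                  -- constraint
);
CREATE TABLE student (
    roll_no INTEGER PRIMARY KEY,                  -- PK
    name    TEXT NOT NULL,
    year    INTEGER CHECK (year BETWEEN 1 AND 4), -- domain of allowable values
    dept_id INTEGER NOT NULL REFERENCES department(dept_id)  -- FK
);
""")
conn.execute("INSERT INTO department VALUES (1, 'Computer')")
conn.execute("INSERT INTO student VALUES (101, 'Asha', 2, 1)")


def try_insert(row):
    """Insert one student tuple and report whether the DBMS accepted it."""
    try:
        conn.execute("INSERT INTO student VALUES (?, ?, ?, ?)", row)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


results = {
    "valid": try_insert((102, "Ravi", 3, 1)),
    "duplicate_pk": try_insert((101, "Copy", 1, 1)),  # roll_no 101 already exists
    "bad_domain": try_insert((103, "Meera", 9, 1)),   # year outside 1..4
    "bad_fk": try_insert((104, "Kabir", 1, 99)),      # no department 99
}
for case, outcome in results.items():
    print(case, "->", outcome)
```

Only the first insert should be accepted; the other three break a primary key, domain, or foreign key rule and are rejected by the database itself rather than by application code.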

Relation | Table (6)

Candidate key and superkey

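Super key: set of attributes that uniquely identifies a tuple; it may contain unnecessary attributes (no minimality requirement)

Candidate key is a minimal super key (no attribute can be removed without losing uniqueness)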

Every candidate key is super key but not every..
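
The difference can be checked mechanically with attribute closure. The sketch below is one common textbook approach, not the only one; the Student relation and its functional dependencies are assumed for illustration.

```python
from itertools import combinations


def closure(attrs, fds):
    """All attributes functionally determined by `attrs` under `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_superkey(attrs, all_attrs, fds):
    """A super key determines every attribute of the relation."""
    return closure(attrs, fds) == set(all_attrs)


def candidate_keys(all_attrs, fds):
    """Minimal super keys, found by trying smaller attribute sets first."""
    keys = []
    for size in range(1, len(all_attrs) + 1):
        for combo in combinations(sorted(all_attrs), size):
            attrs = set(combo)
            # Skip any set that contains an already-found key: it is not minimal.
            if is_superkey(attrs, all_attrs, fds) and not any(k <= attrs for k in keys):
                keys.append(attrs)
    return keys


# Hypothetical relation Student(roll_no, email, name)
ATTRS = {"roll_no", "email", "name"}
FDS = [
    (frozenset({"roll_no"}), frozenset({"email", "name"})),
    (frozenset({"email"}), frozenset({"roll_no"})),
]

keys = candidate_keys(ATTRS, FDS)
print(keys)
print(is_superkey({"roll_no", "name"}, ATTRS, FDS))
```

Here {roll_no, name} is a super key but not a candidate key, because name can be dropped and roll_no alone still identifies the tuple.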

Codd’s rules

0: Foundation rule
1: Information rule
2: Guaranteed access
3: Systematic treatment of null values
4: Active online catalog
5: Comprehensive data sub-language rule
6: View updating rule
7: High-level insert, update, delete
8: Physical data independence
9: Logical data independence
10: Integrity independence
11: Distribution independence
12: Non subversion rule

Normalization

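1NF:
Atomicity (each attribute holds a single, indivisible value)
Uniqueness (each row can be uniquely identified)
No repeating groups

2NF:
In 1NF, with no partial dependence
Every non-key attribute depends on the whole primary key, not just part of it

3NF:
In 2NF, with no transitive dependence (a non-key attribute does not depend on another non-key attribute)

BCNF:
Stronger form of 3NF
Ensures that in every non-trivial functional dependency X -> Y, the determinant X is a super key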
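
A small sketch of removing a transitive dependence (the 3NF step), in plain Python. The rows and the dependency chain roll_no -> dept -> dept_head are assumed for illustration.

```python
# Hypothetical unnormalized relation with roll_no -> dept -> dept_head,
# so dept_head is repeated for every student of the same department.
rows = [
    {"roll_no": 101, "name": "Asha", "dept": "CS", "dept_head": "Dr. Rao"},
    {"roll_no": 102, "name": "Ravi", "dept": "CS", "dept_head": "Dr. Rao"},
    {"roll_no": 103, "name": "Meera", "dept": "IT", "dept_head": "Dr. Shah"},
]

# Decompose: Student(roll_no, name, dept) and Department(dept, dept_head).
students = [{k: r[k] for k in ("roll_no", "name", "dept")} for r in rows]
departments = {r["dept"]: r["dept_head"] for r in rows}

# The decomposition is lossless: joining on dept rebuilds the original rows.
rejoined = [dict(s, dept_head=departments[s["dept"]]) for s in students]
print(rejoined == rows)

# A change of department head is now a single update instead of one per student.
departments["CS"] = "Dr. Iyer"
cs_heads = {departments[s["dept"]] for s in students if s["dept"] == "CS"}
print(cs_heads)
```

Each department head is stored once after the split, which is where the reduction in redundancy and update anomalies comes from.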

Characteristics of a good relational database

1 Normalization
2 Data integrity and consistency
3 Scalability
4 Referential integrity
5 Proper data types
6 Minimal data redundancy
7 Security and access control

Why is normalization needed?

1 Reduces data redundancy
2 Scalability support
3 Can improve query performance (smaller tables, though some queries need more joins)
4 Improves data integrity

Explain Query optimization wrt SQL database

-selecting efficient query plan
-goals: minimize execution time, reduce resource usage, overall performance improvement

Steps:
1 Parsing the query (syntax, identifier, keywords)
2 Semantic analysis
3 Query rewrite
4 Plan generation
5 Plan selection (estimated cost)
6 Plan execution
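
Plan selection can be observed with SQLite's EXPLAIN QUERY PLAN. This is a sketch only: the orders table, the index name, and the data are hypothetical, and the exact wording of the plan text differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)

QUERY = "SELECT * FROM orders WHERE customer_id = 42"


def plan(sql):
    """Return the plan the optimizer chose; the last column is the description."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_before = plan(QUERY)  # no index available: expect a full table scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan(QUERY)   # index available: expect an index search

print("before:", plan_before)
print("after: ", plan_after)
```

The SQL text is identical in both runs; only the set of available access paths changed, so the optimizer picks a cheaper plan the second time.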

Distributed database

Database spread across multiple physical locations connected by communication links

Components:
1 Data nodes (physical locations)
2 Connection network (connects data nodes)
3 Distributed DBMS (software that manages the distributed database)

Types:
Homogeneous (all nodes run the same DBMS software)
Heterogeneous (nodes may run different DBMS software)
Hybrid (combination)

Explain architecture of parallel databases

Designed for simultaneous execution of tasks
1 Shared memory arch
2 Shared disk arch
3 Distributed (shared nothing) arch
4 Hybrid (Mem+Disk+Distributed)

Key elements of parallel database processing

1 Parallelism
2 Multiprocessing
3 Data partitioning
4 Task parallelism
5 Data parallelism
6 Inter query
7 Intra query
8 Synchronization
9 Communication
10 Load balancing
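
A sketch of data partitioning plus intra-query parallelism: one aggregate query is split across hash partitions and the partial results are combined. The rows, the partition count, and the function names are hypothetical. Threads are used only to keep the example short; in CPython they do not give true CPU parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def hash_partition(rows, key, n_parts):
    """Data partitioning: route each row to a partition by hashing its key."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)
    return parts


def partial_sum(part):
    """Work done independently by each worker on its own partition."""
    return sum(row["amount"] for row in part)


def parallel_total(rows, n_parts=4):
    """Intra-query parallelism: one SUM evaluated as several partial SUMs."""
    parts = hash_partition(rows, "customer_id", n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(partial_sum, parts))
    # Combining the partial results is the synchronization/communication step.
    return sum(partials), [len(p) for p in parts]


rows = [{"customer_id": i % 10, "amount": i} for i in range(100)]
total, sizes = parallel_total(rows)
print(total, sizes)
```

The partition sizes come out uneven because ten customer ids are hashed into four partitions; that kind of skew is what load balancing tries to even out.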

2 tier | 3 tier

layers
names
complexity
scalability
security
cost
flexibility
performance
example (simple web app, mobile apps | enterprise software systems)

OLAP

Online analytical processing
Technology that allows users to analyze large amounts of data quickly from multiple perspectives
Slice and dice through data for reporting and decision making

key feature:
Multidimensional view (organized in cubes)
Advanced analysis (comparisons, forecasting)
Interactive (users can slice, dice, roll-up)
Roll up (aggregates to summary)
Drill down (to more detailed data)
Slice (selects a single value for 1 dimension)
Dice (selects multiple values across multiple dimensions)
Pivot (reorients the multidimensional view)
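
The operations above can be sketched on a tiny fact table in plain Python. The sales figures, dimension names, and helper functions are hypothetical; a real OLAP engine would precompute and index these aggregates.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, region, product, sales)
facts = [
    (2023, "North", "Laptop", 100),
    (2023, "North", "Phone", 80),
    (2023, "South", "Laptop", 60),
    (2024, "North", "Laptop", 120),
    (2024, "South", "Phone", 90),
]
DIMS = {"year": 0, "region": 1, "product": 2}


def roll_up(rows, *dims):
    """Aggregate sales over the listed dimensions only."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[DIMS[d]] for d in dims)] += row[3]
    return dict(totals)


def slice_(rows, dim, value):
    """Slice: fix a single value for one dimension."""
    return [r for r in rows if r[DIMS[dim]] == value]


def dice(rows, **allowed):
    """Dice: keep rows whose values fall in the allowed set for each dimension."""
    return [r for r in rows if all(r[DIMS[d]] in vals for d, vals in allowed.items())]


by_year = roll_up(facts, "year")                    # roll up to yearly summary
by_year_region = roll_up(facts, "year", "region")   # drill down one level
only_2023 = slice_(facts, "year", 2023)
north_subset = dice(facts, region={"North"}, product={"Laptop", "Phone"})

print(by_year)
print(by_year_region)
```

Drill down is simply a roll up with more dimensions kept, which is why both are served by the same function here.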

Data warehouse architecture and its components

Architecture:
Bottom tier (data sources: flat files and operational databases)
Middle tier (staging and storage, OLAP cubes)
Top tier (presentation)

Components:
Data source (logs, spreadsheets, API, cloud storage)
ELT tools (Extract, Load, Transform)
Stage area
Storage area (organized by subject)
Metadata repository (usage, source, structure)
OLAP engine
Query tools (run query, reports, analyze trends)

What is KDD

overall process of discovering useful, valid, and understandable patterns or knowledge from large volumes of data.
Data selection (sources)
Data preprocessing (Clean data)
Data transformation (Convert to suitable form)
Data mining (Apply algorithms to extract patterns)
Pattern Evaluation (valid, novel, useful, and understandable)
Knowledge presentation

Goal of data mining

Main goal - discover hidden patterns and analyze data to gain insights
1 Forecasting and prediction
2 Classification
3 Clustering
4 Anomaly detection

What is big data?

Datasets so large that traditional data processing cannot handle them
Volume
Velocity (real time processing and streaming)
Value (insights that bring business value)
Variety

Explain data mining task

Descriptive tasks
Purpose (describe general properties or patterns in the data)
Focus (understand what's happening)
Ex - Clustering, Summarization

Predictive tasks
Purpose (predict future values based on current data)
Focus (make predictions or classifications)
Ex - Classification, regression

NoSQL database

For handling unstructured or semi-structured data with a flexible schema
Compensate for limitations of relational databases

Characteristics:
Schema less (flexible data modelling)
Non relational (key-value, column family, document)
High performance (optimized for fast data retrieval)
Distributed (scale horizontally across multiple servers)

Types:
Key value (Riak or Redis)
Document oriented (self describing docs like JSON or XML - MongoDB)
Column family (Cassandra)
Graph based (complex relations - Amazon Neptune)
Multimodel (ArangoDB)
Adv: Flexible, High scalability
Disadv: Lack of standardization (different query language and models), Limited transaction support (no roll back)

Internet databases

Web based
Provide remote access
Applications like e-commerce, social media

Traits:
Remote access
Scalability
Security
Web-based architecture

Cloud databases

Hosted and managed by cloud services like Amazon Web Services, Google Cloud Platform
Wide range of applications, from small scale to large-scale enterprise

Traits:
Cloud based architecture
Managed by cloud services
Scalability
Security

Adv: Cost effective, Scalability
Disadv: Dependence on service provider, security risks

SQLite

Self-contained, file-based RDBMS
Used in web, mobile and desktop applications

Traits:
Relational
Self contained (no server)
File-based (easy sharing)
Zero configuration
SQL support

Adv: Easy to use, Flexible
Disadv: Limited scalability, limited concurrency
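
A short sketch of the self-contained, file-based, zero-configuration traits using Python's built-in sqlite3 module. The file name notes.db and the note table are hypothetical.

```python
import os
import sqlite3
import tempfile

# The whole database is one ordinary file; no server process is started.
db_path = os.path.join(tempfile.mkdtemp(), "notes.db")

conn = sqlite3.connect(db_path)  # creates the file if it does not exist
with conn:  # commits on success, rolls back on error
    conn.execute("CREATE TABLE note (id INTEGER PRIMARY KEY, body TEXT)")
    conn.execute("INSERT INTO note (body) VALUES (?)", ("revise DBMS",))
conn.close()

# Reopening the same file later (or after copying it elsewhere) sees the data.
conn = sqlite3.connect(db_path)
notes = conn.execute("SELECT id, body FROM note").fetchall()
conn.close()

file_exists = os.path.isfile(db_path)
print(notes, file_exists)
```

Sharing the database amounts to copying that one file, which is also why concurrent writers are limited: they all contend for the same file.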

XML database

Stores, manages, and queries data in XML format
For large unstructured or semi-structured datasets
Traits:
1 XML model (hierarchical elements and att)
2 Schema less
3 Indexing (high query performance)
4 Query support (XPath and XQuery)
adv: flexibility, scalability
disadv: complexity, performance
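
A sketch of the hierarchical XML model and an XPath-style query, using Python's standard xml.etree.ElementTree, which supports only a limited XPath subset. The library document is hypothetical, and a real XML database would add indexing and full XPath/XQuery on top.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: elements nest hierarchically and carry attributes.
doc = """
<library>
  <book id="b1" year="2009"><title>Database Systems</title><author>Rao</author></book>
  <book id="b2" year="2015"><title>NoSQL Basics</title><author>Shah</author></book>
  <book id="b3" year="2015"><title>XML in Practice</title><author>Rao</author></book>
</library>
"""
root = ET.fromstring(doc)

# Select by attribute value.
titles_2015 = [b.findtext("title") for b in root.findall("./book[@year='2015']")]

# Select by the text of a child element.
rao_ids = [b.get("id") for b in root.findall("./book[author='Rao']")]

print(titles_2015)
print(rao_ids)
```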

MongoDB

provides flexible and scalable way of organizing large amounts of data
Released in 2009

Traits:
Document oriented
Schema less
Scalable
High performance
adv:
disadv: less support for transactions, no complex queries, data redundancy

JSON

Lightweight, text-based data interchange format
Data exchange between web servers and web applications

Data types:
Number, string, boolean, array, object (unordered collection of key-value pairs), null

Syntax:
Values: can be number, string, boolean, array, object, or null

adv: text based, lightweight
disadv: schema less, limited set of data types
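
A sketch of the JSON data types listed above, round-tripped through Python's standard json module. The record contents are hypothetical.

```python
import json

# Hypothetical record using every JSON value type.
record = {
    "name": "Asha",                # string
    "roll_no": 101,                # number (integer)
    "cgpa": 8.7,                   # number (fraction)
    "hosteller": True,             # boolean
    "courses": ["DBMS", "OS"],     # array
    "address": {"city": "Pune"},   # object (key-value pairs)
    "backlog": None,               # null
}

text = json.dumps(record)    # serialize to a plain text string
restored = json.loads(text)  # parse the text back into Python values

python_types = {key: type(value).__name__ for key, value in restored.items()}
print(text)
print(python_types)
```

Because the format carries no schema, nothing in the text says that roll_no must be a number; any such checks have to be done by the application or by a separate schema tool.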

HDFS

Hadoop Distributed File System: used to store and manage large amounts of data across a cluster of machines
Integral component of the Apache Hadoop ecosystem, used for big data analytics

1 Distributed arch
2 Block based storage (64 or 128 MB blocks)
3 Replication (across multiple systems in case of h/w failure)
4 High throughput access (fast data processing)

adv: scalability, fault tolerance
disadv: complexity, less support for small files
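
A back-of-the-envelope sketch of block-based storage and replication. The replication factor of 3 is an assumption (a commonly used setting, not stated in the card), and hdfs_footprint is a hypothetical helper, not part of any Hadoop API.

```python
import math


def hdfs_footprint(file_mb, block_mb=128, replication=3):
    """Estimate blocks and raw storage for one file.

    Assumes the last block occupies only as much space as the data it holds,
    so raw storage is file size times the replication factor.
    """
    blocks = math.ceil(file_mb / block_mb)
    return {
        "blocks": blocks,
        "block_replicas": blocks * replication,
        "raw_storage_mb": file_mb * replication,
    }


large = hdfs_footprint(1000)  # one 1000 MB file
small = hdfs_footprint(1)     # one 1 MB file
print(large)
print(small)
```

Under these assumptions, a thousand 1 MB files would need a thousand blocks to track, while a single 1000 MB file needs only eight, which is one way to see why many small files are a poor fit.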

MapReduce and Hadoop

MapReduce: programming model used to process large datasets across a cluster of machines
Hadoop: open source software framework used to store and process large datasets

Hadoop:
1 HDFS
2 MapReduce
3 YARN (resource management layer)
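
A single-machine sketch of the MapReduce programming model (map, shuffle, reduce) using word count. It illustrates the model only and does not use Hadoop; the function names and input lines are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = word_count(["big data big cluster", "data node"])
print(counts)
```

In a real cluster each map call would run on the node that holds the input block, and the shuffle would move data over the network between nodes.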

DBMS endsem Flashcards

(27 cards)