Flashcards

(57 cards)

1
Q

What does ACID stand for?

A

ATOMICITY
CONSISTENCY
ISOLATION
DURABILITY

2
Q

A in ACID

A

ATOMICITY
All-or-nothing rule
The entire transaction takes place at once or doesn't happen at all
Two possible outcomes: abort or commit

3
Q

C in ACID

A

CONSISTENCY
The database must be in a consistent state before and after the transaction (integrity constraints are preserved)

4
Q

I in ACID

A

ISOLATION
Multiple transactions occur independently without interference
Changes made by one transaction are only visible to others once written to main memory (committed)
Isolation prevents dirty reads, non-repeatable reads, and phantom reads

5
Q

D in ACID

A

DURABILITY
Changes made by a successful transaction persist even if a system failure occurs
Once a transaction is committed, its changes are permanent

6
Q

Advantages of ACID in DBMS

A

Data consistency - data remains consistent and accurate after any transaction executes

Data integrity - maintains the integrity of the data by ensuring that changes to the database are permanent and cannot be lost

Concurrency control - helps manage multiple transactions occurring concurrently by preventing interference between them

Recovery - in case of a failure or crash, the system can recover data up to the point of failure

7
Q

Disadvantages of ACID in DBMS

A

Performance - can cause performance overhead in the system, as ACID guarantees require additional processing to ensure data consistency and integrity

Scalability - may cause scalability issues in large distributed systems where many transactions occur concurrently

Complexity - can increase the complexity of the system and require significant expertise and resources

8
Q

What is the importance of ACID properties?

A

ACID properties ensure data consistency, integrity, and reliability in a DBMS. They manage multiple transactions occurring concurrently and help recover data after system failures, which makes them central to reliable transaction processing.

9
Q

How can we achieve Atomicity in transactions?

A

To achieve atomicity, a transaction is treated as a single unit. If execution fails at any point, the whole transaction is rolled back, undoing any changes already made.
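
A minimal sketch of this in Python with sqlite3 (the bank.db file, accounts table, and transfer amounts are all hypothetical): both updates succeed together or neither does.

import sqlite3

conn = sqlite3.connect("bank.db")  # hypothetical database
try:
    cur = conn.cursor()
    # Both statements belong to one transaction (a single unit of work)
    cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
    cur.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 2")
    conn.commit()    # commit: make both changes permanent together
except Exception:
    conn.rollback()  # abort: undo every change made so far
    raise
finally:
    conn.close()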

10
Q

What is a data warehouse

A

Centralized system designed to store, process, and analyze large volumes of data from multiple sources
Optimized for queries and reporting rather than transactions

11
Q

What are some drawbacks of ACID Properties in DBMS?

A

ACID properties cause performance overhead because of the additional processing required to maintain data integrity and consistency. They also increase system complexity, demanding significant expertise and resources.

12
Q

OLTP

A

Fast reads / writes
Day to day transactions
Real-time, current data

13
Q

OLAP

A

Complex queries
Analytical queries and reporting
Historical, aggregated data
Read-optimized for querying

14
Q

Data warehousing architectures

A

Traditional data warehouse
* Centralized storage; data goes through ETL before loading
Cloud data warehouse
* Scalable, pay-as-you-go
Data lakehouse
* Combines a data warehouse with a data lake (e.g., the Databricks lakehouse)

15
Q

Database

A

OLTP (online transaction processing)
- Structured system for storing and managing real-time transactions, handling reads/writes for day-to-day operations
- Example: SQL Server
- Use case: maintaining inventory records in a store

16
Q

Data warehouse

A

OLAP (online analytical processing)
- Central repository optimized for storing historical data and performing complex analytical queries; holds structured data
- Examples: Redshift, Snowflake
- Use cases: trends over the last 5 years, an HR report across departments

17
Q

Data lake

A

- Massive storage holding raw unstructured, semi-structured, and structured data
- Used for data processing and ML
- Examples: S3, Azure Data Lake, Hadoop
- Use case: storing raw sensor data from IoT devices
- An unfiltered ocean of information

18
Q

Data mart

A

Subset of a warehouse focused on a specific department or business unit (sales, HR, marketing)
- Provides faster access to relevant insights
- Structured data
- Use case: HR analyzing employee retention

19
Q

Star schema

A

o One fact table connected to multiple dimension tables
o Optimized for fast querying
o Example: a sales fact table
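
A small PySpark sketch of querying a star schema (all table and column names here are illustrative): the fact table is joined to its dimension tables, then aggregated.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical warehouse tables
sales_fact = spark.table("sales_fact")
dim_product = spark.table("dim_product")
dim_date = spark.table("dim_date")

# One fact table joined to multiple dimension tables, then aggregated
result = (sales_fact
          .join(dim_product, "product_key")
          .join(dim_date, "date_key")
          .groupBy("category", "year")
          .sum("sales_amount"))
result.show()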

20
Q

Snowflake schema

A

o Normalizes dimensions into sub-dimensions
o Reduces data redundancy but increases query complexity

21
Q

Galaxy schema (fact constellation)

A

o Multiple fact tables share common dimension tables
o Used for complex business scenarios (e.g., sales and returns data)

22
Q

Slowly changing dimensions (SCD) type 1

A

o Overwrites old data (no history is kept)

23
Q

Slowly changing dimensions (SCD) type 2

A

o A new row is added for every change, with start and end dates (e.g., tracking each address change)
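
A sketch of an SCD type 2 update in Python with sqlite3 (the dim_customer schema, customer id, and address are hypothetical): the current row is closed out and a new current row is inserted.

import sqlite3
from datetime import date

conn = sqlite3.connect("warehouse.db")  # hypothetical warehouse
cur = conn.cursor()
today = date.today().isoformat()

# Close the currently active row for this customer
cur.execute(
    "UPDATE dim_customer SET end_date = ?, is_current = 0 "
    "WHERE customer_id = 42 AND is_current = 1",
    (today,),
)
# Insert a new current row carrying the changed address
cur.execute(
    "INSERT INTO dim_customer "
    "(customer_id, address, start_date, end_date, is_current) "
    "VALUES (42, '12 New Street', ?, NULL, 1)",
    (today,),
)
conn.commit()
conn.close()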

24
Q

Slowly changing dimensions (SCD) type 3

A

o Limited history (a prev_addr column is added to track only the last change)

25
Q

FACT table

A

* Stores measurable business events

26
Q

DIM table

A

* Stores descriptive attributes

27
Q

Landing vs staging

A

Landing holds data as-is; staging holds modified, structured data, often in multiple layers

28
Q

Parallel processing

A

Distributes ETL workloads across multiple nodes (e.g., Spark, Hadoop)

29
Q

Incremental loading

A

Processes only new or changed data instead of reloading the full dataset
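
A minimal PySpark sketch of the idea (the table names, updated_at column, and watermark value are illustrative); in practice the watermark would be read from and written back to a control table.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

last_run_ts = "2024-01-01 00:00:00"    # hypothetical watermark
source = spark.table("source_orders")  # hypothetical source table

# Only rows added or changed since the last run are processed
delta = source.filter(F.col("updated_at") > F.lit(last_run_ts))
delta.write.mode("append").saveAsTable("target_orders")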

30
Q

Data partitioning

A

Splits large datasets into smaller, manageable chunks
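
A minimal PySpark sketch (table name, column, and path are illustrative): writing partitioned by date means queries filtering on that column scan only the matching directories.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("orders")  # hypothetical table

# Each distinct order_date value becomes its own directory on disk
df.write.partitionBy("order_date").mode("overwrite").parquet("/data/orders")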

31
Q

How do you approach ETL design?

A

* Gathering stakeholder needs
* Analyzing data sources
* Researching architecture and processes
* Proposing a solution
* Fine-tuning the solution based on feedback
* Launching the solution and user onboarding

32
Q

Typical ETL sources?

A

- Databases
- Cloud storage (e.g., S3, Blob Storage)
- Flat files (CSV, JSON)
- APIs, e.g., for web scraping

33
Q

What documentation have you made for ETL processes and data flows?

A

* High-level system design
* Data models
* Metadata definitions
* Step-by-step ETL workflows
* Dependencies and scheduling details
* Data validation rules, error handling, retry logic
* Data dictionary - data catalog, schema documentation
* Playbook for production support - troubleshooting guides, resolving schema changes, performance issues

34
Q

How do you monitor and maintain ETL systems to ensure smooth operation and minimal downtime?

A

* Automated monitoring and logging
o Spark UI - pipeline performance, execution times, failures
o Custom job alerts
* Error handling and retry mechanisms
o Failure recovery logic
o Checkpointing in Spark Streaming
* Performance optimization
o Review Spark execution plans to optimize joins, partitions, caching
* Regular maintenance and load testing
o Scheduled load tests
o Maintain historical logs and audit trails

35
Q

What data validation techniques would you use?

A

* Schema validation
o PySpark schema enforcement and SQL constraints to detect missing or incorrect fields
* Data type & format checks
o Date formats, numerical precision, text standardization
* Business rule validation
o Custom business logic (e.g., a mortgage balance can't be negative)
* Duplicate detection
o Window functions and deduplication rules
* Range and anomaly detection
o Flag outliers using aggregate and statistical checks
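
Two of these checks sketched in PySpark (the mortgage_loans table and its columns are illustrative):

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
loans = spark.table("mortgage_loans")  # hypothetical table

# Business rule check: a mortgage balance can't be negative
bad_rows = loans.filter(F.col("balance") < 0)

# Duplicate detection: keep only the latest row per loan_id
w = Window.partitionBy("loan_id").orderBy(F.col("updated_at").desc())
deduped = (loans.withColumn("rn", F.row_number().over(w))
                .filter(F.col("rn") == 1)
                .drop("rn"))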

36
Q

How have you optimized ETL workflows in the past?

A

* Refactored SQL queries for performance
o Reviewed EXPLAIN plans
o Optimized joins and aggregations
o Removed unnecessary subqueries and used CTEs instead
* Partitioning and indexing (faster data retrieval)
o Partitioned tables by date and location to improve query speed
o Created indexes on frequently queried columns, reducing scan time
* Incremental loading
o Change Data Capture (CDC) - only process new or updated records
* Parallel processing and workflow automation
o Automated error handling and retry mechanisms to prevent workflow failures

37
Q

Tokenization

A

* Replaces sensitive data with tokens
* The real values are stored securely in a separate table
* Example: Azure Key Vault
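
A toy illustration of the idea in Python (a real system would keep the mapping in a secured vault such as Azure Key Vault, not an in-memory dict):

import uuid

token_vault = {}  # stands in for the separate, secured mapping table

def tokenize(value: str) -> str:
    token = uuid.uuid4().hex    # random token with no relation to the value
    token_vault[token] = value  # real value kept separately
    return token

def detokenize(token: str) -> str:
    return token_vault[token]

t = tokenize("123-45-6789")  # e.g., a social security number
print(t, "->", detokenize(t))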

38
Q

Materialized view

A

A precomputed result of a query, stored on disk

39
Q

Normal view

A

Dynamically fetches data on each query

40
Q

OLAP cubes and structures

A

An OLAP cube is a multi-dimensional data structure that enables fast analytical queries by pre-aggregating data across different dimensions.
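
PySpark exposes this idea directly via cube() (the sales table and columns are illustrative): it pre-aggregates every combination of the listed dimensions, including subtotals and a grand total.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sales = spark.table("sales")  # hypothetical table

# Aggregates for (region, product), (region), (product), and overall
cube = sales.cube("region", "product").agg(F.sum("amount").alias("total"))
cube.show()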

41
Q

Types of OLAP

A

MOLAP (Multidimensional OLAP)
ROLAP (Relational OLAP)
HOLAP (Hybrid OLAP)

42
Q

MOLAP (Multidimensional OLAP)

A

* Stores precomputed OLAP cubes in a specialized database

43
Q

ROLAP (Relational OLAP)

A

* Stores data in relational databases (e.g., SQL Server, Snowflake, Redshift)

44
Q

HOLAP (Hybrid OLAP)

A

* Combines MOLAP & ROLAP - some data is precomputed, some is queried dynamically

45
Q

What is a cluster?

A

A pool of computers working together but viewed as a single system

46
Q

Cluster technologies for Spark

A

Hadoop YARN, Kubernetes

47
Q

Running a Spark application on a cluster

A

- Use the spark-submit command
- The request goes to the YARN Resource Manager
- The YARN RM creates one Application Master container on a worker node and starts the application's main() method in that container

48
Q

What is a container?

A

- A container is an isolated virtual runtime environment
- It comes with a CPU and memory allocation

49
Q

SPARK DEPLOY MODES and differences

A

Cluster mode and client mode. There is only one difference between the two:
- In cluster mode, the driver runs in the cluster
- In client mode, the driver runs on your client machine
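
An illustrative submission (the application file name is hypothetical; the flags are standard Spark options):

spark-submit --master yarn --deploy-mode cluster my_app.py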

50
Q

Application Master Container

A

- This container runs the main() method of the application (PySpark or Scala)
- A PySpark application will start a JVM application
- Once there is a connection, the PySpark wrapper calls the Java wrapper using the Py4J connection
- Py4J allows a Python application to call a Java application

51
Q

SPARK SQL ENGINE & QUERY PLANNING - logical optimization

A

2. Logical optimization - applies standard rule-based optimizations to the logical plan

52
Q

Spark job transformations

A

- Narrow dependency: can run in parallel on each data partition without grouping data from multiple partitions
  - select, filter, withColumn, drop
- Wide dependency: requires some kind of grouping (a shuffle) before it can be applied
  - groupBy, join, cube, rollup, agg, repartition

53
Q

SPARK SQL ENGINE & QUERY PLANNING - analysis

A

1. The analysis phase parses your code and creates a fully resolved logical plan

54
Q

Spark actions

A

- Used to trigger work: every Spark action triggers one or more Spark jobs
- read, write, collect, take, count
- Code blocks are delimited by Spark actions; each code block runs as one Spark job
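
A small sketch tying transformations and actions together (synthetic data, illustrative logic): the transformations only build the plan; show() is the action that triggers the job.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)  # synthetic DataFrame with an id column

# Transformations: nothing executes yet
narrow = df.filter(F.col("id") % 2 == 0)  # narrow: runs per partition
wide = narrow.groupBy((F.col("id") % 10).alias("bucket")).count()  # wide: shuffle

# Action: triggers the actual Spark job
wide.show()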

55
Q

SPARK SQL ENGINE & QUERY PLANNING

A

1. Analysis - parses your code and creates a fully resolved logical plan
2. Logical optimization - applies standard rule-based optimizations to the logical plan
3. Physical planning - Spark SQL takes the logical plan and generates one or more physical plans, applying cost-based optimizations
4. Code generation - the engine generates Java bytecode for the RDD operations in the physical plan
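
You can inspect these phases with DataFrame.explain (the DataFrame here is synthetic):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).withColumn("doubled", F.col("id") * 2)

# extended=True prints the parsed and analyzed logical plans, the optimized
# logical plan, and the selected physical plan
df.explain(extended=True)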

56
Q

SPARK SQL ENGINE & QUERY PLANNING - physical plan

A

3. Physical planning - Spark SQL takes the logical plan and generates one or more physical plans, applying cost-based optimizations

57
Q

SPARK SQL ENGINE & QUERY PLANNING - code gen

A

4. Code generation - the engine generates Java bytecode for the RDD operations in the physical plan