tmlsn Flashcards

(39 cards)

1
Q

What is a data warehouse?

A

Centralized system designed to store, process, and analyze large volumes of data from multiple sources.
Optimized for queries and reporting rather than transactions.

2
Q

OLTP

A

Fast reads/writes
Day-to-day transactions
Real-time, current data

3
Q

OLAP

A

Complex queries
Analytical queries and reporting
Historical, aggregated data
Read-optimized for querying

4
Q

Database

A

OLTP (online transaction processing)
- Structured system for storing and managing real-time transactions, handling reads/writes for day-to-day operations
- Example system: SQL Server
- Example use: maintaining inventory records in a store

5
Q

Data warehouse

A

OLAP (online analytical processing)
- Central repository optimized for storing historical data and performing complex analytical queries; holds structured data
- Example systems: Redshift, Snowflake
- Example uses: trends over the last 5 years, an HR report across departments

6
Q

Data lake

A

- Massive storage holding raw unstructured, semi-structured, and structured data
- Used for data processing and ML
- Example systems: S3, Azure Data Lake, Hadoop
- Example use: storing raw sensor data from IoT devices
- An unfiltered ocean of information

7
Q

Data mart

A

Subset of a warehouse focused on a specific department or business unit (sales, HR, marketing)
- Provides faster access to relevant insights
- Structured data
- Example use: HR analyzing employee retention

8
Q

Star schema

A

- One fact table connected to multiple dimension tables
- Optimized for fast querying
- Example: a sales fact table

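The star layout above can be sketched with SQLite; the `sales_fact`, `dim_product`, and `dim_date` names are illustrative, not from the card:

```python
import sqlite3

# Minimal star schema: one fact table joined to two hypothetical
# dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, day  TEXT);
CREATE TABLE sales_fact  (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO dim_date    VALUES (10, '2024-01-01');
INSERT INTO sales_fact  VALUES (1, 10, 5.0), (2, 10, 7.5), (1, 10, 2.5);
""")

# A typical star-schema query: aggregate the fact, describe with dims.
rows = conn.execute("""
    SELECT p.name, SUM(f.amount)
    FROM sales_fact f
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
print(rows)  # [('gadget', 7.5), ('widget', 7.5)]
```

Every dimension is one join away from the fact table, which is what keeps star-schema queries fast.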
9
Q

Snowflake schema

A

- Normalizes dimensions into sub-dimensions
- Reduces data redundancy but increases query complexity

10
Q

Galaxy schema (fact constellation)

A

- Multiple fact tables share common dimension tables
- Used for complex business scenarios (e.g., sales and returns data)

10
Q

Slowly changing dimensions (SCD) type 1

A

- Overwrites old data, keeping no history

11
Q

Slowly changing dimensions (SCD) type 2

A

- A new row is added for each change (e.g., an address change), with start and end dates

12
Q

Slowly changing dimensions (SCD) type 3

A

- Limited history: a column such as prev_addr is added to track only the last change

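The three SCD types above can be contrasted in one sketch; the `scd1`/`scd2`/`scd3` tables and the customer columns are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Type 1: overwrite in place -- no history survives.
conn.execute("CREATE TABLE scd1 (cust_id INTEGER PRIMARY KEY, addr TEXT)")
conn.execute("INSERT INTO scd1 VALUES (1, 'Old St')")
conn.execute("UPDATE scd1 SET addr = 'New Ave' WHERE cust_id = 1")

# Type 2: add a new row per change, bounded by start/end dates.
conn.execute("""CREATE TABLE scd2 (
    cust_id INTEGER, addr TEXT, start_date TEXT, end_date TEXT)""")
conn.execute("INSERT INTO scd2 VALUES (1, 'Old St',  '2020-01-01', '2023-06-30')")
conn.execute("INSERT INTO scd2 VALUES (1, 'New Ave', '2023-07-01', NULL)")

# Type 3: a single prev_addr column keeps only the latest prior value.
conn.execute("CREATE TABLE scd3 (cust_id INTEGER PRIMARY KEY, addr TEXT, prev_addr TEXT)")
conn.execute("INSERT INTO scd3 VALUES (1, 'New Ave', 'Old St')")

type1_rows = conn.execute("SELECT addr FROM scd1").fetchall()
type2_count = conn.execute("SELECT COUNT(*) FROM scd2").fetchone()[0]
print(type1_rows, type2_count)  # [('New Ave',)] 2
```

Type 2 is the only variant that can answer "what was the address on 2022-05-01?", at the cost of one extra row per change.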
13
Q

FACT table

A
- Stores measurable business events
13
Q

DIM table

A
- Stores descriptive attributes
14
Q

Landing vs staging

A

Landing holds data as-is from the source; staging holds modified data, often in multiple structured layers.

15
Q

How do you approach ETL design?

A

* Gathering stakeholder needs.
* Analyzing data sources.
* Researching architecture and processes.
* Proposing a solution.
* Fine-tuning the solution based on feedback.
* Launching the solution and user onboarding.

16
Q

Typical ETL sources?

A

- Databases
- Cloud storage (e.g., S3, Blob Storage)
- Flat files (CSV, JSON)
- APIs (e.g., for web scraping)

17
Q

What data validation techniques would you use?

A
- Schema validation
  - PySpark schema enforcement and SQL constraints to detect missing or incorrect fields
- Data type & format checks
  - Date formats, numerical precision, text standardization
- Business rule validation
  - Custom business logic (e.g., a mortgage balance can't be negative)
- Duplicate detection
  - Window functions and deduplication rules
- Range and anomaly detection
  - Flag outliers using aggregate and statistical checks
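Two of the checks above (business rules and duplicate detection) can be sketched in plain Python; the record fields and the negative-balance rule are illustrative assumptions:

```python
# Hypothetical records; the field names are illustrative.
records = [
    {"id": 1, "balance": 1200.0, "date": "2024-01-15"},
    {"id": 2, "balance": -50.0,  "date": "2024-01-16"},   # violates the rule
    {"id": 1, "balance": 1200.0, "date": "2024-01-15"},   # duplicate of id 1
]

# Business rule validation: a balance must never be negative.
rule_violations = [r for r in records if r["balance"] < 0]

# Duplicate detection: keep the first row per id, flag the rest
# (the same effect as ROW_NUMBER() OVER (PARTITION BY id) > 1 in SQL).
seen, duplicates = set(), []
for r in records:
    if r["id"] in seen:
        duplicates.append(r)
    else:
        seen.add(r["id"])

print(len(rule_violations), len(duplicates))  # 1 1
```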
18
Q

How have you optimized etl workflows in the past?

A
- Refactored SQL queries for performance
  - Reviewed EXPLAIN plans
  - Optimized joins and aggregations
  - Removed unnecessary subqueries and used CTEs instead
- Partitioning and indexing (faster data retrieval)
  - Partitioned tables by date and location to improve query speed
  - Created indexes on frequently queried columns, reducing scan time
- Incremental loading
  - Change Data Capture (CDC): only process new or updated records
- Parallel processing and workflow automation
  - Automated error handling and retry mechanisms to prevent workflow failures
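One common way to realize the incremental-loading point above is a high-watermark filter; a minimal sketch with hypothetical source rows and column names:

```python
# High-watermark incremental load: each run picks up only rows newer
# than the timestamp recorded by the previous run (a simple CDC-style
# pattern; a full CDC feed would also capture deletes).
source = [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-01-05"},
    {"id": 3, "updated_at": "2024-01-09"},
]

last_watermark = "2024-01-03"  # stored from the previous run

new_rows = [r for r in source if r["updated_at"] > last_watermark]
print([r["id"] for r in new_rows])  # [2, 3]

# Advance the watermark so the next run skips these rows.
last_watermark = max(r["updated_at"] for r in new_rows)
```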
19
Q

Normal view

A

Dynamically fetches data on each query

19
Q

Materialized view

A

A precomputed, stored result of a query; it must be refreshed to reflect new data
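The difference between the two view types can be simulated in SQLite, which has no MATERIALIZED VIEW, so a plain snapshot table stands in for one; all names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, amount REAL);
INSERT INTO sales VALUES ('east', 10), ('east', 5), ('west', 3);

-- Normal view: re-runs the query every time it is selected from.
CREATE VIEW v_totals AS
    SELECT region, SUM(amount) AS total FROM sales GROUP BY region;

-- "Materialized" view simulated as a snapshot table.
CREATE TABLE mv_totals AS SELECT * FROM v_totals;
""")

conn.execute("INSERT INTO sales VALUES ('east', 100)")

# The live view reflects the new row; the snapshot does not until
# it is explicitly rebuilt/refreshed.
live = conn.execute("SELECT total FROM v_totals  WHERE region='east'").fetchone()[0]
snap = conn.execute("SELECT total FROM mv_totals WHERE region='east'").fetchone()[0]
print(live, snap)  # 115.0 15.0
```

Engines such as PostgreSQL support this natively via CREATE MATERIALIZED VIEW and REFRESH MATERIALIZED VIEW.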

20
Q

OLAP cubes and structures

A

An OLAP cube is a multi-dimensional data structure that enables fast analytical queries by pre-aggregating data across its dimensions.
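The pre-aggregation idea behind a cube can be sketched in a few lines; the region/year dimensions and the fact rows are hypothetical:

```python
from collections import defaultdict

# Tiny fact set over two dimensions (region, year).
facts = [
    ("east", 2023, 10), ("east", 2024, 20),
    ("west", 2023, 5),  ("west", 2024, 15),
]

# Build every aggregate up front, including the ALL rollups -- this is
# what lets a cube answer slice/dice queries without rescanning facts.
cube = defaultdict(int)
for region, year, amount in facts:
    for r in (region, "ALL"):
        for y in (year, "ALL"):
            cube[(r, y)] += amount

print(cube[("east", "ALL")], cube[("ALL", 2024)], cube[("ALL", "ALL")])  # 30 35 50
```

A real MOLAP engine does the same thing at scale, trading storage and build time for near-instant lookups.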

20
Q

Types of OLAP

A

MOLAP (Multidimensional OLAP)
ROLAP (Relational OLAP)
HOLAP (Hybrid OLAP)

21
Q

MOLAP (Multidimensional OLAP)

A

Stores precomputed OLAP cubes in a specialized database

22
Q

ROLAP (Relational OLAP)

A

Stores data in relational databases (e.g., SQL Server, Snowflake, Redshift)

23
Q

HOLAP (Hybrid OLAP)

A

Combines MOLAP & ROLAP: some data is precomputed, and some is queried dynamically
24
Q

What is a cluster?

A

A pool of computers working together but viewed as a single system
25
Q

What is a container?

A

An isolated, virtual runtime environment that comes with a CPU and memory allocation
26
Q

First normal form (1NF)

A

Each row in the table should have a unique identifier, and each value in the table should be indivisible (atomic).
27
Q

Second normal form (2NF)

A

Requires that each non-key column is fully dependent on the entire primary key. In other words, there should be no partial dependencies in the table.
28
Q

Third normal form (3NF)

A

Requires that all non-key attributes are independent of each other: each column should depend directly on the primary key, and not on any other column in the same table (no transitive dependencies).
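The transitive-dependency fix that 3NF requires can be sketched as a decomposition; the employee/department tables are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# 3NF violation: dept_name depends on dept_id (a non-key column) and
# only transitively on the key emp_id. Shown for contrast, not used.
conn.execute("""CREATE TABLE employee_bad (
    emp_id INTEGER PRIMARY KEY, dept_id INTEGER, dept_name TEXT)""")

# 3NF decomposition: every non-key column now depends directly on its key.
conn.executescript("""
CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
CREATE TABLE employee   (emp_id  INTEGER PRIMARY KEY,
                         dept_id INTEGER REFERENCES department(dept_id));
INSERT INTO department VALUES (10, 'HR');
INSERT INTO employee   VALUES (1, 10), (2, 10);
""")

# The department name is stored once; a rename touches a single row.
conn.execute("UPDATE department SET dept_name = 'People Ops' WHERE dept_id = 10")
rows = conn.execute("""
    SELECT e.emp_id, d.dept_name
    FROM employee e JOIN department d ON d.dept_id = e.dept_id
    ORDER BY e.emp_id
""").fetchall()
print(rows)  # [(1, 'People Ops'), (2, 'People Ops')]
```

In the unnormalized shape, the same rename would have to update every employee row and risks inconsistencies if one is missed.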
29
Q

BCNF (Boyce-Codd Normal Form)

A

Ensures that every determinant in a table is a candidate key. In other words, each attribute must depend only on a candidate key.
30
Q

Fourth normal form (4NF)

A

Used to eliminate multi-valued dependencies in a table. A multi-valued dependency occurs when one attribute determines independent sets of values for other attributes, forcing redundant row combinations.
31
Q

Fifth normal form (5NF)

A

Also called Project-Join Normal Form (PJNF). Used to handle complex many-to-many relationships by decomposing tables so they can be rejoined without loss.
32
Q

Star Schema: Pros

A

1. Simplicity: with denormalized dimension tables, the structure is easier to understand and navigate.
2. Query performance: fewer joins are required, which can improve the speed of querying large datasets.
3. Adaptability: it's relatively easy to add new dimensions without altering the existing structure.
33
Q

Star Schema: Cons

A

1. Redundancy: denormalization can lead to data redundancy in dimension tables.
2. Storage: the redundant data may require more storage space.
3. Maintenance: redundancy can make updates and inserts more complex, potentially leading to inconsistencies.
34
Q

Snowflake Schema: Pros

A

1. Normalization: reduced data redundancy due to the normalization of dimension tables.
2. Storage efficiency: uses less storage space than the star schema because of reduced redundancy.
3. Clear structure: the normalized structure can be clearer for those familiar with relational database design.
35
Q

Snowflake Schema: Cons