Flashcards

(57 cards)

1
Q

What does ACID stand for?

A

ATOMICITY
CONSISTENCY
ISOLATION
DURABILITY

2
Q

A in ACID

A

ATOMICITY
All-or-nothing rule
The entire transaction takes place at once or doesn't happen at all
Two possible outcomes: abort or commit

3
Q

C in ACID

A

CONSISTENCY
The database must be in a consistent state before and after the transaction (integrity constraints are preserved)

4
Q

I in ACID

A

ISOLATION
Multiple transactions occur independently without interference
Changes made by one transaction are only visible to others once written to main memory (committed)
Isolation prevents dirty reads, non-repeatable reads, and phantom reads

5
Q

D in ACID

A

DURABILITY
Changes made by a successful transaction persist even if a system failure occurs
Once a transaction is committed, its changes are permanent

6
Q

Advantages of ACID in DBMS

A

Data consistency - data remains consistent and accurate after any transaction executes

Data integrity - maintains the integrity of the data by ensuring that changes to the database are permanent and cannot be lost

Concurrency control - helps manage multiple transactions occurring concurrently by preventing interference between them

Recovery - in case of a failure or crash, the system can recover data up to the point of failure

7
Q

Disadvantages of ACID in DBMS

A

Performance - can cause performance overhead in the system, as ACID guarantees require additional processing to ensure data consistency and integrity

Scalability - may cause scalability issues in large distributed systems where many transactions occur concurrently

Complexity - can increase the complexity of the system and require significant expertise and resources

8
Q

What is the importance of ACID properties?

A

ACID properties ensure data consistency, integrity, and reliability in a DBMS. They manage multiple transactions occurring concurrently and help recover data after system failures, which makes them central to reliable transaction processing.

9
Q

How can we achieve Atomicity in transactions?

A

To achieve atomicity, a transaction is treated as a single unit. If execution fails at any point, the whole transaction is rolled back, undoing any changes already made.
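
A minimal sketch of this in Python with sqlite3 (the bank.db file, accounts table, and transfer amounts are all hypothetical): both updates succeed together or neither does.

import sqlite3

conn = sqlite3.connect("bank.db")  # hypothetical database
try:
    cur = conn.cursor()
    # Both statements belong to one transaction (a single unit of work)
    cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
    cur.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 2")
    conn.commit()    # commit: make both changes permanent together
except Exception:
    conn.rollback()  # abort: undo every change made so far
    raise
finally:
    conn.close()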

10
Q

What is a data warehouse

A

Centralized system designed to store, process, and analyze large volumes of data from multiple sources
Optimized for queries and reporting rather than transactions

11
Q

What are some drawbacks of ACID Properties in DBMS?

A

ACID properties cause performance overhead because of the additional processing required to maintain data integrity and consistency. They also increase system complexity, demanding significant expertise and resources.

12
Q

OLTP

A

Fast reads / writes
Day to day transactions
Real-time, current data

13
Q

OLAP

A

Complex queries
Analytical queries and reporting
Historical, aggregated data
Read-optimized for querying

14
Q

Data warehousing architectures

A

Traditional data warehouse
* Centralized storage; data goes through ETL before loading
Cloud data warehouse
* Scalable, pay-as-you-go
Data lakehouse
* Combines a data warehouse with a data lake (e.g., the Databricks lakehouse)

15
Q

Database

A

OLTP (online transaction processing)
- Structured system for storing and managing real-time transactions, handling reads/writes for day-to-day operations
- Example: SQL Server
- Use case: maintaining inventory records in a store

16
Q

Data warehouse

A

OLAP (online analytical processing)
- Central repository optimized for storing historical data and performing complex analytical queries; holds structured data
- Examples: Redshift, Snowflake
- Use cases: trends over the last 5 years, an HR report across departments

17
Q

Data lake

A

- Massive storage holding raw unstructured, semi-structured, and structured data
- Used for data processing and ML
- Examples: S3, Azure Data Lake, Hadoop
- Use case: storing raw sensor data from IoT devices
- An unfiltered ocean of information

18
Q

Data mart

A

Subset of a warehouse focused on a specific department or business unit (sales, HR, marketing)
- Provides faster access to relevant insights
- Structured data
- Use case: HR analyzing employee retention

19
Q

Star schema

A

o One fact table connected to multiple dimension tables
o Optimized for fast querying
o Example: a sales fact table
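
A small PySpark sketch of querying a star schema (all table and column names here are illustrative): the fact table is joined to its dimension tables, then aggregated.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical warehouse tables
sales_fact = spark.table("sales_fact")
dim_product = spark.table("dim_product")
dim_date = spark.table("dim_date")

# One fact table joined to multiple dimension tables, then aggregated
result = (sales_fact
          .join(dim_product, "product_key")
          .join(dim_date, "date_key")
          .groupBy("category", "year")
          .sum("sales_amount"))
result.show()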

20
Q

Snowflake schema

A

o Normalizes dimensions into sub-dimensions
o Reduces data redundancy but increases query complexity

21
Q

Galaxy schema (fact constellation)

A

o Multiple fact tables share common dimension tables
o Used for complex business scenarios (e.g., sales and returns data)

22
Q

Slowly changing dimensions (SCD) type 1

A

o Overwrites old data (no history is kept)

23
Q

Slowly changing dimensions (SCD) type 2

A

o A new row is added for every change, with start and end dates (e.g., tracking each address change)
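
A sketch of an SCD type 2 update in Python with sqlite3 (the dim_customer schema, customer id, and address are hypothetical): the current row is closed out and a new current row is inserted.

import sqlite3
from datetime import date

conn = sqlite3.connect("warehouse.db")  # hypothetical warehouse
cur = conn.cursor()
today = date.today().isoformat()

# Close the currently active row for this customer
cur.execute(
    "UPDATE dim_customer SET end_date = ?, is_current = 0 "
    "WHERE customer_id = 42 AND is_current = 1",
    (today,),
)
# Insert a new current row carrying the changed address
cur.execute(
    "INSERT INTO dim_customer "
    "(customer_id, address, start_date, end_date, is_current) "
    "VALUES (42, '12 New Street', ?, NULL, 1)",
    (today,),
)
conn.commit()
conn.close()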

24
Q

Slowly changing dimensions (SCD) type 3

A

o Limited history (a prev_addr column is added to track only the last change)

25
Q

FACT table

A

* Stores measurable business events

26
Q

DIM table

A

* Stores descriptive attributes

27
Q

Landing vs staging

A

Landing holds data as-is; staging holds modified, structured data, often in multiple layers

28
Q

Parallel processing

A

Distributes ETL workloads across multiple nodes (e.g., Spark, Hadoop)

29
Q

Incremental loading

A

Processes only new or changed data instead of reloading the full dataset
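
A minimal PySpark sketch of the idea (the table names, updated_at column, and watermark value are illustrative); in practice the watermark would be read from and written back to a control table.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

last_run_ts = "2024-01-01 00:00:00"    # hypothetical watermark
source = spark.table("source_orders")  # hypothetical source table

# Only rows added or changed since the last run are processed
delta = source.filter(F.col("updated_at") > F.lit(last_run_ts))
delta.write.mode("append").saveAsTable("target_orders")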

30
Q

Data partitioning

A

Splits large datasets into smaller, manageable chunks
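
A minimal PySpark sketch (table name, column, and path are illustrative): writing partitioned by date means queries filtering on that column scan only the matching directories.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("orders")  # hypothetical table

# Each distinct order_date value becomes its own directory on disk
df.write.partitionBy("order_date").mode("overwrite").parquet("/data/orders")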

31
Q

How do you approach ETL design?

A

* Gathering stakeholder needs
* Analyzing data sources
* Researching architecture and processes
* Proposing a solution
* Fine-tuning the solution based on feedback
* Launching the solution and user onboarding

32
Q

Typical ETL sources?

A

- Databases
- Cloud storage (e.g., S3, Blob Storage)
- Flat files (CSV, JSON)
- APIs, e.g., for web scraping

33
Q

What documentation have you made for ETL processes and data flows?

A

* High-level system design
* Data models
* Metadata definitions
* Step-by-step ETL workflows
* Dependencies and scheduling details
* Data validation rules, error handling, retry logic
* Data dictionary - data catalog, schema documentation
* Playbook for production support - troubleshooting guides, resolving schema changes, performance issues

34
Q

How do you monitor and maintain ETL systems to ensure smooth operation and minimal downtime?

A

* Automated monitoring and logging
o Spark UI - pipeline performance, execution times, failures
o Custom job alerts
* Error handling and retry mechanisms
o Failure recovery logic
o Checkpointing in Spark Streaming
* Performance optimization
o Review Spark execution plans to optimize joins, partitions, caching
* Regular maintenance and load testing
o Scheduled load tests
o Maintain historical logs and audit trails

35
Q

What data validation techniques would you use?

A

* Schema validation
o PySpark schema enforcement and SQL constraints to detect missing or incorrect fields
* Data type & format checks
o Date formats, numerical precision, text standardization
* Business rule validation
o Custom business logic (e.g., a mortgage balance can't be negative)
* Duplicate detection
o Window functions and deduplication rules
* Range and anomaly detection
o Flag outliers using aggregate and statistical checks
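
Two of these checks sketched in PySpark (the mortgage_loans table and its columns are illustrative):

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
loans = spark.table("mortgage_loans")  # hypothetical table

# Business rule check: a mortgage balance can't be negative
bad_rows = loans.filter(F.col("balance") < 0)

# Duplicate detection: keep only the latest row per loan_id
w = Window.partitionBy("loan_id").orderBy(F.col("updated_at").desc())
deduped = (loans.withColumn("rn", F.row_number().over(w))
                .filter(F.col("rn") == 1)
                .drop("rn"))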

36
Q

How have you optimized ETL workflows in the past?

A

* Refactored SQL queries for performance
o Reviewed EXPLAIN plans
o Optimized joins and aggregations
o Removed unnecessary subqueries and used CTEs instead
* Partitioning and indexing (faster data retrieval)
o Partitioned tables by date and location to improve query speed
o Created indexes on frequently queried columns, reducing scan time
* Incremental loading
o Change Data Capture (CDC) - only process new or updated records
* Parallel processing and workflow automation
o Automated error handling and retry mechanisms to prevent workflow failures

37
Q

Tokenization

A

* Replaces sensitive data with tokens
* The real values are stored securely in a separate table
* Example: Azure Key Vault
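
A toy illustration of the idea in Python (a real system would keep the mapping in a secured vault such as Azure Key Vault, not an in-memory dict):

import uuid

token_vault = {}  # stands in for the separate, secured mapping table

def tokenize(value: str) -> str:
    token = uuid.uuid4().hex    # random token with no relation to the value
    token_vault[token] = value  # real value kept separately
    return token

def detokenize(token: str) -> str:
    return token_vault[token]

t = tokenize("123-45-6789")  # e.g., a social security number
print(t, "->", detokenize(t))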

38
Q

Materialized view

A

A precomputed result of a query, stored on disk

39
Q

Normal view

A

Dynamically fetches data on each query

40
Q

OLAP cubes and structures

A

An OLAP cube is a multi-dimensional data structure that enables fast analytical queries by pre-aggregating data across different dimensions.
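
PySpark exposes this idea directly via cube() (the sales table and columns are illustrative): it pre-aggregates every combination of the listed dimensions, including subtotals and a grand total.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sales = spark.table("sales")  # hypothetical table

# Aggregates for (region, product), (region), (product), and overall
cube = sales.cube("region", "product").agg(F.sum("amount").alias("total"))
cube.show()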

41
Q

Types of OLAP

A

MOLAP (Multidimensional OLAP)
ROLAP (Relational OLAP)
HOLAP (Hybrid OLAP)

42
Q

MOLAP (Multidimensional OLAP)

A

* Stores precomputed OLAP cubes in a specialized database

43
Q

ROLAP (Relational OLAP)

A

* Stores data in relational databases (e.g., SQL Server, Snowflake, Redshift)

44
Q

HOLAP (Hybrid OLAP)

A

* Combines MOLAP & ROLAP - some data is precomputed, some is queried dynamically

45
Q

What is a cluster?

A

A pool of computers working together but viewed as a single system

46
Q

Cluster technologies for Spark

A

Hadoop YARN, Kubernetes

47
Q

Running a Spark application on a cluster

A

- Use the spark-submit command
- The request goes to the YARN Resource Manager
- The YARN RM creates one Application Master container on a worker node and starts the application's main() method in that container

48
Q

What is a container?

A

- A container is an isolated virtual runtime environment
- It comes with a CPU and memory allocation

49
Q

SPARK DEPLOY MODES and differences

A

Cluster mode and client mode. There is only one difference between the two:
- In cluster mode, the driver runs in the cluster
- In client mode, the driver runs on your client machine
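
An illustrative submission (the application file name is hypothetical; the flags are standard Spark options):

spark-submit --master yarn --deploy-mode cluster my_app.py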

50
Q

Application Master Container

A

- This container runs the main() method of the application (PySpark or Scala)
- A PySpark application will start a JVM application
- Once there is a connection, the PySpark wrapper calls the Java wrapper using the Py4J connection
- Py4J allows a Python application to call a Java application

51
Q

SPARK SQL ENGINE & QUERY PLANNING - logical optimization

A

2. Logical optimization - applies standard rule-based optimizations to the logical plan

52
Q

Spark job transformations

A

- Narrow dependency: can run in parallel on each data partition without grouping data from multiple partitions
  - select, filter, withColumn, drop
- Wide dependency: requires some kind of grouping (a shuffle) before it can be applied
  - groupBy, join, cube, rollup, agg, repartition

53
Q

SPARK SQL ENGINE & QUERY PLANNING - analysis

A

1. The analysis phase parses your code and creates a fully resolved logical plan

54
Q

Spark actions

A

- Used to trigger work: every Spark action triggers one or more Spark jobs
- read, write, collect, take, count
- Code blocks are delimited by Spark actions; each code block runs as one Spark job
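
A small sketch tying transformations and actions together (synthetic data, illustrative logic): the transformations only build the plan; show() is the action that triggers the job.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)  # synthetic DataFrame with an id column

# Transformations: nothing executes yet
narrow = df.filter(F.col("id") % 2 == 0)  # narrow: runs per partition
wide = narrow.groupBy((F.col("id") % 10).alias("bucket")).count()  # wide: shuffle

# Action: triggers the actual Spark job
wide.show()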

55
Q

SPARK SQL ENGINE & QUERY PLANNING

A

1. Analysis - parses your code and creates a fully resolved logical plan
2. Logical optimization - applies standard rule-based optimizations to the logical plan
3. Physical planning - Spark SQL takes the logical plan and generates one or more physical plans, applying cost-based optimizations
4. Code generation - the engine generates Java bytecode for the RDD operations in the physical plan
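
You can inspect these phases with DataFrame.explain (the DataFrame here is synthetic):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).withColumn("doubled", F.col("id") * 2)

# extended=True prints the parsed and analyzed logical plans, the optimized
# logical plan, and the selected physical plan
df.explain(extended=True)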

56
Q

SPARK SQL ENGINE & QUERY PLANNING - physical plan

A

3. Physical planning - Spark SQL takes the logical plan and generates one or more physical plans, applying cost-based optimizations

57
Q

SPARK SQL ENGINE & QUERY PLANNING - code gen

A

4. Code generation - the engine generates Java bytecode for the RDD operations in the physical plan