All in One Flashcards

Question

❓ **Question:** Can you explain lazy evaluation in Spark?

Answer 1

✅ **Answer:** * Yes. **Lazy evaluation** means that in Spark, when we write transformations like `filter()`, `map()`, or even `groupBy()`, they **don’t run immediately**. * Spark **builds a DAG (Directed Acyclic Graph)** of all operations first. * The actual execution only starts **when we run an action**, like `collect()`, `count()`, or `show()`. This allows Spark to **optimize the entire pipeline** before running — for example, combining filters or deciding the best join strategy.

Answer 2

“Spark uses lineage info to recompute lost RDDs — that’s how it achieves fault tolerance.”

Answer 3

✅ **Answer:** * Since both DataFrames are large and broadcast join won’t work, I would focus on **reducing shuffle** by ensuring the **data is partitioned properly** before the join. Here’s my approach: 1. **Repartition both DataFrames by the join key** using `repartition()` ``` df1 = df1.repartition("join_key") df2 = df2.repartition("join_key") ``` This ensures that records with the same key go to the same partition, reducing **cross-node shuffling**. 2. Optionally, I’d use **`sortMergeJoin`** strategy if the data is already **sorted** and big enough. Spark usually chooses this for large datasets. 3. I’d also **cache** one or both DataFrames if reused multiple times after the join. 4. If possible, I’d **filter** both DataFrames first to reduce data size **before the join**. --- 🧠 One-liner summary: * “Use `repartition()` on join keys to align partitions and reduce shuffle during large joins.”

Answer 4

✅ **Answer:** * When a **Spark job is submitted**, the data is split into **partitions** based on: * **Input file size** * **File format** * And the **default parallelism** or **block size** (like 128MB or 256MB depending on config) By default, Spark will: * Use **HDFS block size** (e.g., 128MB) to decide number of partitions * Or fall back to **spark.sql.shuffle.partitions** (default 200 for shuffle operations) If we don't specify partition logic (like with `.repartition()`), Spark uses **hash partitioning** internally.

Answer 5

✅ **Answer:** * To reduce partitions, **`coalesce()`** is more efficient than `repartition()`. * ✅ **`coalesce(n)`** → **No full shuffle**, just merges nearby partitions → Best when reducing partitions (e.g., from 200 → 50) * ⚠️ **`repartition(n)`** → **Triggers full shuffle** → More expensive → Use when you need even distribution or when **increasing** partitions --- 🧠 One-liner to remember: * “**Use `coalesce()` to reduce**, `repartition()` to reshuffle or increase.”

Answer 6

✅ **Answer:** > We are using the **Medallion Architecture**: * 🟫 **Bronze Layer** – Raw data from source systems (CSV, Excel, APIs…) * 🪙 **Silver Layer** – Cleansed, validated, and transformed data * 🥇 **Gold Layer** – Aggregated data, KPIs, ready for analytics/reporting The **final Gold data** is stored in a **data warehouse**, and connected to tools like **Power BI** for reporting.

Answer 7

✅ **Answer:** * Yes, we use **surrogate keys** and **Slowly Changing Dimensions (SCD)**, but **not in the Bronze (raw) layer**. * They are applied in the **Silver or Gold layers**, where data is cleaned and structured. * 🆔 **Surrogate keys** help assign unique IDs, especially when source systems don’t have stable IDs. * 🔄 **SCD (usually Type 2)** is used to **track changes** over time — for example, in **advisor or contact details**, where we need to maintain historical records.

Answer 8

✅ **Answer:** * Yes, we use **partitioning and folder structure** across all layers — but with different purposes: * 🟫 **Bronze Layer**: Partitioned by **ingestion date** or **source type** to manage raw files efficiently. * 🪙 **Silver Layer**: Partitioned by **business keys** like `region`, `department`, or `event_date` to optimize joins and queries. * 🥇 **Gold Layer**: Partitioning depends on reporting needs — usually by **time (e.g., year/month)** for faster aggregation. Folder structure follows a **clear hierarchy** (`/bronze/table_name/yyyy/mm/dd/`, etc.) to support **data discovery and lineage**.

Answer 9

✅ **A:** All raw files are **converted to Parquet** in the **Bronze layer**, using **Snappy compression** for better **performance and storage**.

Answer 10

✅ **A:** Yes. **Partitioning** by keys like **date** or **bank name** reduces **data scanned** during queries → improves speed for analysts.

Answer 11

✅ **A:** Yes. Regardless of input format, we **standardize everything to Parquet** before moving to **Silver/Gold**.

Answer 12

✅ **A:** * Use **parameterized pipelines** * Maintain **control tables** for metadata (file type, path, client…) → Enables **dynamic ingestion** and reuse

Answer 13

✅ **A:** * Use **Databricks notebooks** * For nested JSON: use **recursive logic** or **`explode()`** in Spark * KPIs are based on **business definitions**

Answer 14

✅ **A:** * A **control table flag** tells ADF the file type * Based on type: use **direct copy** or send to **Databricks for transformation**

Answer 15

✅ **A:** * Use **Delta Lake** for **schema evolution** + **time travel** * Apply **SCD with surrogate keys** to track changes * Monitor changes via **ADF logs + alerts**

Answer 16

> It’s the process of **removing repetition** in data. > Example: Instead of repeating a department name for every employee, split into separate **employee** and **department** tables → link by ID.

Answer 17

✅ **A:** It depends on the use case: * **1NF or lower forms** are better for **analytical databases** (OLAP) → Fewer joins, better read performance * **4NF or higher forms** are better for **transactional systems** (OLTP) → Reduces redundancy, supports consistent updates/inserts

Answer 18

✅ **A:** Snowflake schema is better for **OLTP** systems: * Supports **faster updates/inserts** * But not ideal for reporting — too many joins → **slower queries** For reporting, a **Star schema** or **OBT** is preferred.

Answer 19

✅ **A:** Yes, I’ve used OBT for **RFM segmentation** and reporting use cases. * ✅ **Pros:** * No joins → simple for analysts * Fast read performance * ❌ **Cons:** * Hard to update * High redundancy But with **Parquet + cheap cloud storage**, redundancy is less of a concern today.

Answer 20

✅ **A:** Redshift is used for: * **Columnar storage** * **Fast SQL querying** * Hosting **final curated data** (from S3) for **reporting tools** like Power BI or Tableau.

Answer 21

✅ **A:** * Used **broadcast joins** for small dimension tables * Avoided `collect()` to prevent **OutOfMemory** → Reduced job time by \~**30%**

Answer 22

✅ **A:** SCD tracks changes in dimension data over time: * **Type 1:** Overwrites old data * **Type 2:** Stores **full history** (with version columns or flags) * **Type 3:** Keeps only **current + previous version**

Answer 23

✅ **A:** * Use **broadcast join** (if possible) * Use **`repartition()` or `salting`** techniques * Enable **dynamic resource allocation** * Apply **hash partitioning** on skewed keys

Answer 24

✅ **A:** * Use `dropDuplicates(["col1", "col2"])` to remove based on specific columns * Use `distinct()` to remove fully duplicate rows

Answer 25

✅ **Answer:** We process around **200–300 GB of data per day** using an **autoscaling Spark cluster**. **Cluster Setup:** * 🔹 **1 driver node** * 🔹 **Up to 8 worker nodes** * 🔹 Each node: **16 vCPU, 64 GB RAM** * 🔹 Total cluster memory: **512 GB** **Autoscaling:** * Enabled using **dynamic resource allocation** * Cluster **scales up** automatically for large data loads and **scales down** during idle times **Cost Optimization:** * 🟢 **1 driver + 1 worker** are **on-demand** * 💸 **Remaining 7 workers** are **spot instances** to reduce cost

Answer 26

**✅ A:** Use `DENSE_RANK()` or `ROW_NUMBER()` in a subquery: ```SELECT salary FROM ( SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) as rnk FROM employees ) t WHERE rnk = 5; ```

Answer 27

**✅ A:** Ranks rows without skipping ranks on ties 📊 Example: Salaries 100, 100, 90 → Ranks: 1, 1, 2

Answer 28

**✅ A:** Group by column(s), then filter: ```sql SELECT col, COUNT(*) FROM table GROUP BY col HAVING COUNT(*) > 1; ```

Answer 29

**✅ A:** They return the **next (`LEAD`)** or **previous (`LAG`)** row value Used for time-based comparisons and trend analysis 📈

Answer 30

**✅ A:** A small, **anonymous function** Syntax: ```lambda x: x * 2 ``` Used in `map()`, `filter()`, etc. — lightweight and inline ⚡

Answer 31

✅ **A:** * ✅ **ACID support** * ⚡ **Faster reads** (columnar) * ⏳ **Time travel** * 🔄 **Schema evolution** * 🛠️ **Recovery from bad writes**

Answer 32

Partition by columns used in filters (e.g., `order_date`) to enable **data pruning** → **faster queries** 📊

Answer 33

✅ **A:** * 🪓 **Partitioning**: Use when column has **few distinct values** (e.g., country, status) * 🧩 **Bucketing**: Use for **high-cardinality** columns (e.g., `customer_id`) — data is **hashed into fixed buckets**

Answer 34

✅ **A:** 1. 🏷️ `df.createOrReplaceTempView("my_table")` 2. 🧠 `spark.sql("SELECT * FROM my_table")`

Answer 35

✅ **A:** * **DataFrame**: 🧾 Schema-aware, ❌ No compile-time type safety (Python/Scala) * **Dataset**: ✅ Type-safe (Scala/Java only) * Both are built on top of **RDD**

Answer 36

**❓ Q: What is normalization in SQL?** ✅ **A:** Normalization = Split big table into smaller related tables to: * 🧹 **Reduce redundancy** * 🧠 **Improve clarity** Example: Split `employees` and `departments` into separate tables

Answer 37

✅ **A:** Window functions do **calculations row by row**, but can still **see other rows** around it. They **don’t remove rows** like `GROUP BY`. 📌 Common window functions: * `ROW_NUMBER()` → gives number to each row * `RANK()`, `DENSE_RANK()` → ranking * `LEAD()`, `LAG()` → look at next or previous row * `SUM() OVER(...)` → running total

Answer 38

✅ **A:** * Use **`ROW_NUMBER()`** if every row needs a unique number * Use **`RANK()`** if ties are okay, but you want **gaps** * Use **`DENSE_RANK()`** if ties are okay, and you want **no gaps**

Answer 39

✅ **A:** --- ⚙️ **1. Use `cache()` or `persist()`** * Saves reused DataFrames in **memory** * Helps avoid recomputing again and again --- ⚡ **2. Use `select()` to pick only needed columns** * Don’t use `*` — it loads everything * Select only what you really need --- 🪄 **3. Filter early** (a.k.a. predicate pushdown) * Add `filter()` before joins or heavy logic * Reduces data size early → faster processing --- 📦 **4. Use `broadcast join` for small tables** * Sends small table to all nodes → avoids shuffle * Faster joins when one side is small --- 🔄 **5. Use `repartition()` or `coalesce()` smartly** * Use `repartition()` to **increase** partitions * Use `coalesce()` to **reduce** partitions without shuffle --- 🧠 **6. Avoid `collect()` unless really needed** * `collect()` pulls all data to driver → can crash if data is too big --- 🧮 **7. Use built-in Spark SQL functions** * Built-in functions (like `withColumn()`, `when()`, etc.) are faster than UDFs * Avoid using Python `lambda` in transformations --- 🧠 Easy tip to remember: > "**Filter early, join smart, cache wisely, and avoid collect.**"

Answer 40

✅ **A:** --- 🔍 **1. Check if `collect()` is used** * ❌ `collect()` pulls all data to driver — avoid it unless the dataset is very small * ✅ Use `show()`, `take()`, or save to storage instead --- 🧠 **2. Tune Spark memory settings** * Increase: * `spark.executor.memory` * `spark.driver.memory` * `spark.memory.fraction` * Example: ```bash --executor-memory 8g --driver-memory 4g ``` --- ⚡ **3. Reduce data size early** * Apply **`filter()`** and **`select()`** early in the pipeline * Avoid wide rows with many unused columns --- 🔄 **4. Optimize joins** * Use **`broadcast join`** if one table is small * Repartition by join key to reduce shuffle --- 📊 **5. Cache smartly** * Cache only when reused * Use `persist(StorageLevel.DISK_ONLY)` if memory is tight --- 🔄 **6. Monitor with Spark UI / logs** * Check which stage is failing * See executor memory usage * Look for skewed partitions or expensive shuffles --- 🧠 Quick tip to remember: > "**Filter early, tune memory, avoid collect, join smart.**" Let me know if you want a sample config or YAML tuning template!

Answer 41

✅ **A:** 🕒 Use **timestamps**, **last updated column**, or **change flags** like: ```WHERE updated_at > 'last_run_time' ``` Other techniques: * ✅ Use **CDC (Change Data Capture)** tools like Debezium * ✅ Use **watermarking** in ADF or Spark for streaming/batch

Answer 42

✅ **A:** 📊 **Data skew** means some partitions have **a lot more data** than others. ❌ This causes one executor to do most of the work → **slow jobs** or OOM. ✅ It usually happens with **joins** or **groupBy** on **unbalanced keys** (e.g., same customer ID repeated too much).

Answer 43

✅ **A:** 💥 The job will **fail with an Out of Memory (OOM)** or **disk spill error**. Spark **can’t cache, shuffle, or persist data** anymore. You may see: * `ExecutorLostFailure` * `ShuffleBlockFetcherIterator` errors ✅ Fix: free up resources, scale the cluster, or reduce data.

Answer 44

✅ **A:** 🧠 **Syntax:** ```sql WITH cte_name AS ( SELECT ... ) SELECT * FROM cte_name WHERE ... ``` --- 🎯 **Why use CTEs?** * ✅ Makes SQL **easier to read** * ♻️ Can **reuse logic** in the same query * 🔁 Supports **recursion** (e.g., tree structures) 🧠 One-liner to remember: * “CTE = mini table inside a query for clarity and reuse.” ✅ Let me know if you want recursive CTE examples too!

Answer 45

✅ **A:** **ACID** = 4 key rules that make sure database transactions are **safe and reliable**. --- 🔹 **A – Atomicity** > 👉 All steps in a transaction **must succeed**, or **none at all**. > Example: If you send money, it must leave your account **and** arrive at the other — not just one. 📌 **Think:** “All or nothing.” --- 🔹 **C – Consistency** > 👉 Data must follow the **rules of the database** before and after the transaction. > Example: You can't have negative balance if your system doesn’t allow it. 📌 **Think:** “Follow the rules.” --- 🔹 **I – Isolation** > 👉 Multiple users can run transactions at the same time, but they **won’t mess each other up**. > Example: If two people buy the same product, each order is handled **separately and safely**. 📌 **Think:** “No interference.” --- 🔹 **D – Durability** > 👉 Once a transaction is done, the data is **saved forever**, even if system crashes. > Example: You book a ticket → system crashes → your booking is still there. 📌 **Think:** “Saved for sure.” --- 🧠 Easy line to learn by heart: > "**ACID = All or nothing, Consistent, Isolated, Durable.**"

All in One Flashcards

(70 cards)