All in One Flashcards

(70 cards)

1
Q

❓ Question: Have you been involved in any kind of PySpark job optimization? ⚡

A
  • Broadcast Join: Used for joining huge daily transaction data with small metadata to avoid shuffling.
  • Partition By: Stored data partitioned by year and month (querying by month was frequent) for both unprocessed and processed data.
  • Delta Tables: Used instead of plain Spark tables for efficient updates/merges and handling restatements (upsert operations).
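A minimal PySpark sketch of the first two points, assuming hypothetical table paths and column names (product_id, year, month):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("daily-load").getOrCreate()

# Hypothetical inputs: a large daily transaction table and a small metadata table
transactions = spark.read.parquet("/data/raw/transactions")
metadata = spark.read.parquet("/data/raw/metadata")

# Broadcast the small table so the join avoids shuffling the large one
enriched = transactions.join(broadcast(metadata), on="product_id", how="left")

# Write partitioned by year and month, since queries usually filter on month
(enriched.write
    .mode("overwrite")
    .partitionBy("year", "month")
    .parquet("/data/processed/transactions"))
```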
2
Q

❓ Question: Suppose one day you log in and see that your Spark jobs are failing with an out-of-memory issue… what steps would you follow to troubleshoot this kind of error? 🧠

A
  • Answer (Ro):
    1. Go to the Spark UI and see what is going on.
    2. Determine the root cause (e.g., larger-than-usual dataset, data skew, inefficient join strategy).
    3. Consider solutions:
      • Increase cluster size (if appropriate).
      • Optimize the code: Filter records earlier.
      • If data skew: Try to repartition.
      • If caching causes issues: Consider persist() with a disk-backed storage level (e.g., MEMORY_AND_DISK) instead of cache().
3
Q

❓ Question: You talked about data skewness. If you find that for particular partitions you’re getting a huge amount of data suddenly, how will you mitigate this kind of problem? ⚖️

A
  • Answer (Ro):
    • Enable AQE (Adaptive Query Execution) in Spark 3.0+ as a first step, as it can detect and handle skewness at runtime.
    • If AQE isn’t enough, repartition the DataFrame.
    • Salting: As an advanced technique, append a random suffix (salt) to the skewed key so a single hot key is spread across several partitions, then join/aggregate on the salted key.
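A minimal salting sketch, assuming a skewed join key named customer_id and hypothetical DataFrames skewed_df (large, skewed) and dim_df (the other side):

```python
from pyspark.sql import functions as F

NUM_SALTS = 10  # how many pieces to split each hot key into

# Add a random salt to the skewed side
skewed_salted = skewed_df.withColumn(
    "salted_key",
    F.concat_ws("_", F.col("customer_id"), (F.rand() * NUM_SALTS).cast("int"))
)

# Replicate the other side once per salt value so every salted key finds a match
salts = spark.range(NUM_SALTS).withColumnRenamed("id", "salt")
dim_salted = dim_df.crossJoin(salts).withColumn(
    "salted_key", F.concat_ws("_", F.col("customer_id"), F.col("salt"))
)

# Join on the salted key; the hot key is now spread over NUM_SALTS partitions
joined = skewed_salted.join(dim_salted, on="salted_key", how="inner")
```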
4
Q

❓ Question: If you have to handle joining two large tables (can’t broadcast either), how do you optimize? 🔗

A
  • Answer (Ro): Use bucketing. Bucket both tables on the join column. Choose the number of buckets properly. This will help because data will be pre-sorted and partitioned, making the join faster.
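A minimal bucketing sketch; the table names, join column, and bucket count are hypothetical:

```python
# Write both large tables bucketed (and sorted) on the join column.
# Bucketed tables must be saved as tables (saveAsTable), not plain files.
(orders_df.write
    .bucketBy(64, "customer_id")
    .sortBy("customer_id")
    .mode("overwrite")
    .saveAsTable("orders_bucketed"))

(payments_df.write
    .bucketBy(64, "customer_id")
    .sortBy("customer_id")
    .mode("overwrite")
    .saveAsTable("payments_bucketed"))

# With matching bucket counts on the join key, Spark can run a sort-merge join
# without a full shuffle of either table.
joined = spark.table("orders_bucketed").join(
    spark.table("payments_bucketed"), on="customer_id"
)
```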
5
Q
  • Question: Do you have any high-level understanding of Hadoop, its working, and components?
A
  • Answer: Hadoop is a framework for Big Data problems. Core components:
    • HDFS (Hadoop Distributed File System): For distributed storage. 📁
    • MapReduce: For distributed processing. 🔄
    • YARN (Yet Another Resource Negotiator): Provides resources for Hadoop jobs. 🛠️
    • Interviewer’s Input (Ganesh): Added that apart from core components, there’s an ecosystem with tools like Sqoop, Oozie, HBase, Hive, and Pig. 🧩
6
Q
  • Question: Does Spark offer storage?
A
  • Answer (Priti): No, it’s a plug-and-play compute engine. Can be used with storage like HDFS, Google Cloud Storage, Amazon S3. 🗂️
    • Interviewer’s Confirmation (Ganesh): Correct, it’s a plain compute engine usable with any resource manager and distributed storage. ✅
7
Q
  • Question: What are the levels/APIs we can work with in Apache Spark?
A
  • Answer (Priti): Two kinds of layers:
    • Spark Core APIs: Work with RDDs. 🧠
    • Higher-Level APIs: Deal with Spark DataFrames and Spark SQL. 📊
8
Q
  • Question: What’s the difference/advantage of DataFrames and Spark SQL over RDDs? Why are they recommended?
A

DataFrame / Spark SQL vs RDD – Why better?

1. Faster – Spark can optimize DataFrames using Catalyst engine. RDDs can’t.

> → DataFrames = smart, RDD = manual work.

2. Less code – DataFrames use SQL or simple functions.

> → Easier to write than low-level RDD logic.

3. Better for big data – Spark does auto tuning (like memory, joins) with DataFrames.

> → You don’t need to fine-tune everything.

4. Easy to use – DataFrames feel like tables. You can write SQL!

> → RDD is like coding everything from scratch.

🔁 Summary to remember:

> “DataFrames are faster, shorter, smarter, and feel like SQL.
RDDs are low-level, harder, slower, more code to write.”


9
Q
  • Question: What is an RDD in Spark? What are its key features and why is it important?
A

🧠 Key features:

  1. Resilient – Fault-tolerant. If data is lost, Spark can rebuild it.
  2. Distributed – Data is split across many machines.
  3. Immutable – Once created, it can’t change (only transformed).
  4. Lazy – Transformations are not run until an action is called.

Why important?

  • Base of all Spark APIs (DataFrame is built on top of RDD).
  • You can control everything (partition, caching, logic).
  • Useful when you need custom transformations or low-level control.
10
Q
  • Question: What are transformations and actions in Spark? What happens when they are executed?
A

🔄 Transformations

  • Create new RDD/DataFrame from existing one.
  • Lazy: Spark does not run them immediately.
  • Spark builds a DAG (like a plan).

🧠 Example: filter(), map(), select()

Actions

  • Start execution of the DAG.
  • Spark runs all needed transformations and returns result.

🧠 Example: collect(), count(), show()

💥 What happens?

  • Spark builds DAG from transformations → waits
  • When action is called → Spark optimizes + runs the DAG.

🧠 Easy summary to learn:

  • Transformations build the plan, Actions run the plan.
  • DAG is the plan. Spark waits until action to execute it.
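A minimal sketch of this laziness, assuming an existing DataFrame `df` with hypothetical columns:

```python
from pyspark.sql import functions as F

# Transformations only: nothing runs yet, Spark just records them in the DAG
filtered = df.filter(F.col("amount") > 100)
projected = filtered.select("customer_id", "amount")

# Action: now Spark optimizes the DAG and actually executes it
print(projected.count())
```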
11
Q
  • Question: Briefly explain Spark architecture.
A

🏗️ Spark = Master–Slave Architecture

🧑‍💼 Driver (Master)

  • Starts the app (via SparkContext)
  • Builds the plan (DAG)
  • Talks to Cluster Manager to ask for resources
  • Sends tasks to Executors
  • Collects results

👷 Executors (Workers / Slaves)

  • Run the code
  • Store data (in memory or disk)
  • Send results back to Driver

🔁 Life of a Spark Job:

  • Driver → asks for Executors → sends tasks → Executors run → results → Driver ends.

🧠 Summary to memorize:

  • Driver plans, Executors work. Spark spreads the job across many machines.
12
Q
  • Question: What are the different plans you mentioned in Spark architecture (parsed logical plan, etc.)?
A

📊 Spark Query Plans (via Catalyst Optimizer)

  1. Parsed Logical Plan 🔤
    → Checks for syntax errors (like SQL grammar).
  2. Analyzed Logical Plan 🧠
    → Checks metadata: Are table/column names valid?
  3. Optimized Logical Plan ⚙️
    → Applies rules (e.g., push filters early, remove extra steps).
    → This is where Catalyst Optimizer runs.
  4. Physical Plan
    → Chooses how to run:
    • Use broadcast join or sort-merge join
    • Use hash aggregate or sort aggregate

🧠 Summary to memorize:

  • Parse → Analyze → Optimize → Execute
  • Think: What to run → How to run it best
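You can inspect these plans yourself with explain(); a minimal sketch, assuming an existing DataFrame `df` with hypothetical columns:

```python
from pyspark.sql import functions as F

query = df.filter(F.col("country") == "VN").select("name", "country")

# Prints the parsed, analyzed, and optimized logical plans plus the physical plan
query.explain(extended=True)
```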
13
Q

🔍 What is Predicate Pushdown?

A
  • Spark pushes filters down to the earliest point (like reading from file or DB)
  • → So less data is loaded or processed.

🧠 Example:

```sql
SELECT * FROM big_table WHERE country = 'VN'
```

With pushdown → Spark asks only rows with country = 'VN' from storage.

✅ Why important?

  • Reduces I/O (reads less data)
  • Speeds up query
  • Works best with Parquet, ORC, JDBC, etc.

🧠 Easy to remember:

  • Push filter early → process less → run faster. 🏎️
14
Q

Question: What are jobs, stages, and tasks in Spark UI?

A

🧠 Answer:

  • Job 🏃
    → Created when an action (like count(), show()) is triggered.
    One job = one action
  • Stage 🎬
    → A job is split into stages based on shuffle boundaries.
    #Stages = wide transformations + 1
  • Task 🧩
    → Smallest unit of work.
    #Tasks = #Partitions of the DataFrame/RDD

Interviewer’s Note (Ganesh):

> “Number of jobs = number of actions performed.”

🧠 Quick Summary:

  • Action → Job → Stages → Tasks
  • Like breaking a race into checkpoints and runners.
15
Q

Question: What are the types of transformations in Spark?

A

🧠 Answer:

  • Narrow Transformations 🔄
    → Data stays on same partition, no shuffle.
    Fast & local.
    🧪 Examples: map(), filter(), flatMap()
  • Wide Transformations 🔀
    → Data moves between partitions (shuffle).
    → Used for grouping or joining.
    🧪 Examples: groupByKey(), reduceByKey(), join()

🧠 Easy to remember:

  • Narrow = no shuffle, Wide = shuffle happens.
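A minimal sketch contrasting the two, assuming an existing DataFrame `df` with hypothetical columns:

```python
from pyspark.sql import functions as F

# Narrow transformations: each output partition depends on one input partition
narrow = df.filter(F.col("status") == "active") \
           .withColumn("amount_usd", F.col("amount") * 0.92)

# Wide transformation: grouping forces a shuffle across partitions
wide = narrow.groupBy("country").agg(F.sum("amount_usd").alias("total_usd"))
```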
16
Q

Question: What’s the difference between repartition and coalesce?

A

🧠 Answer:

  • repartition() 🔄
    → Can increase or decrease partitions
    → Always reshuffles all data → more cost
    → Good when you want balanced partitions
  • coalesce() 🛠️
    → Used to decrease partitions only
    No full shuffle → just merges nearby partitions
    → Faster, but partitions may be uneven

🗣️ Interviewer’s Follow-up:

> “So, best for decreasing partitions is coalesce?”
Yes!

🧠 Easy to remember:

  • Repartition = full shuffle, Coalesce = merge smart & cheap
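A minimal sketch, assuming an existing DataFrame `df`:

```python
# Full shuffle: redistribute into 200 evenly sized partitions
balanced = df.repartition(200)

# Or repartition by a key so related rows land together before a join
by_key = df.repartition(200, "customer_id")

# Cheap reduction: merge down to 10 partitions without a full shuffle,
# e.g. before writing, to avoid many small output files
compact = df.coalesce(10)
```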
17
Q

Question: When creating a DataFrame from a data lake, should I enforce the schema or infer it (inferring is less code)? What’s recommended?

A

🧠 Answer:

  • Enforce schema explicitly (recommended)
    → More control, faster, and safe
  • ⚠️ Inferring schema
    → Spark scans full data to guess types
    → Can be slow and guess wrong types

🧠 Easy to remember:

> “Infer = easy but risky ❌,
Enforce = safe and fast ✅”

18
Q

Question: What are the ways/methods for schema enforcement?

A

🧠 Answer:

There are 2 main ways to define a schema in Spark:

  1. Schema DDL (String Format) 📜
    → Easy and short.
    🧪 Example:
      schema = "name STRING, age INT, salary DOUBLE"
  2. StructType with StructField 🛠️
    → More control, used for complex or nested schemas.
    🧪 Example:
    ```python
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True)
    ])
    ```

🧠 Easy to remember:

  • DDL = short & simple,
  • StructType = detailed & powerful
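A minimal usage sketch applying either schema definition when reading a file (the file path is hypothetical):

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Option 1: DDL string
ddl_schema = "name STRING, age INT, salary DOUBLE"
df1 = spark.read.csv("/data/people.csv", header=True, schema=ddl_schema)

# Option 2: StructType
struct_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("salary", DoubleType(), True),
])
df2 = spark.read.csv("/data/people.csv", header=True, schema=struct_schema)
```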
19
Q

Question: What are Data Lake and Delta Lake? Major differences, advantages/disadvantages?

A

🧠 Answer:

🏞️ Data Lake

  • Big storage for structured, semi-structured, unstructured data
  • Cheap and scalable 💸
  • Stores data in raw formats (e.g., Parquet, CSV)

🔻 Disadvantages:

  • ❌ No ACID guarantees
  • ❌ Not good for direct analytics/reporting
  • ❌ Hard to track data changes or rollback

🔺 Delta Lake

  • Storage layer built on top of Data Lake
  • Adds features for data reliability and management
  • Uses Parquet + _delta_log (stores transaction history) 📑

Advantages:

  • ✔️ ACID transactions
  • ✔️ Time travel (see old versions) ⏳
  • ✔️ Schema enforcement & evolution
  • ✔️ Better for streaming + batch

🧠 Easy to remember:

> “Data Lake = raw & cheap,
Delta Lake = reliable & smart (ACID + time travel)” ⏳📦
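A minimal Delta Lake sketch showing the upsert and time travel features mentioned above; the path, key column, and incoming `updates_df` are hypothetical:

```python
from delta.tables import DeltaTable

# Upsert (MERGE) new daily records into an existing Delta table
target = DeltaTable.forPath(spark, "/delta/transactions")
(target.alias("t")
    .merge(updates_df.alias("u"), "t.txn_id = u.txn_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Time travel: read the table as it looked at an earlier version
old_snapshot = (spark.read.format("delta")
    .option("versionAsOf", 3)
    .load("/delta/transactions"))
```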

20
Q

Question: What is data profiling?

A

I haven’t directly used a data profiling tool yet, but I understand the concept now.
It’s about analyzing the quality and structure of data – like checking for missing values, wrong data types, or unexpected values.
For example, if a column called name has numbers in it, profiling helps catch that early. Tools like Great Expectations or Deequ can be used for this kind of automated check.

21
Q

Question: What is Data Governance?

A

🧠 Answer:

Data governance is the set of rules and processes to manage, protect, and ensure the quality of data in an organization.

🔐 Main Goals:

  1. Data Quality – Make sure data is accurate, complete, and consistent
  2. 🔒 Data Security – Control who can access or change data
  3. 📜 Compliance – Follow laws (like GDPR, HIPAA…)
  4. 🧭 Data Lineage – Track where data comes from and how it’s changed
  5. 👥 Ownership & Roles – Know who owns the data, who can update it

🧰 Common Tools:

  • Unity Catalog (Databricks)
  • Apache Atlas, Collibra, Alation, Amundsen
  • Great Expectations (for data quality rules)

🧠 Easy to remember:

  • Right data, right people, right rules. ✅🔒📊
22
Q

Question: What kind of file format is Parquet? How is it better than CSV or row-based formats?

A

Answer:

> Parquet is a columnar file format, unlike CSV which is row-based.

Advantages over CSV / row-based formats:

  1. 📊 Faster for analytics
    → When doing filtering or aggregation, Parquet only reads the needed columns, not full rows → more efficient.
  2. 📦 Better compression
    → Columnar format allows high compression, so file size is much smaller than CSV.
  3. Optimized for big data
    → Especially useful when dealing with wide tables (many columns), as it avoids scanning unused data.

🧠 Easy to remember:

> “Parquet = column-based → faster reads, smaller size, better for big data”

23
Q

Question: Do you know about predicate pushdown?

A

Answer:

  • Yes. Predicate pushdown means Spark applies filters as early as possible, directly while reading data from source (like Parquet, JDBC).
  • This helps reduce the amount of data loaded, so queries run faster.

🧠 Example:

```sql
SELECT * FROM users WHERE age > 30
```

With pushdown → Only rows with age > 30 are read from disk.

🧠 One-liner:

  • Push filter early → read less → run faster ✅📉
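A minimal PySpark sketch (path and column are hypothetical); with a columnar source like Parquet, the pushed filter shows up under PushedFilters in the physical plan:

```python
from pyspark.sql import functions as F

users = spark.read.parquet("/data/users")
adults = users.filter(F.col("age") > 30)

# The physical plan should list the filter under PushedFilters for the scan node
adults.explain()
```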
24
Q

❓ Question: In Python, do you get errors at compile time or only after running the code?

A

In Python, we get errors at runtime, not at compile time, because Python is a dynamically typed and interpreted language.
So, if there’s a bug in a function, we’ll only see it when that function runs, not just by writing it.

25
❓ **Question:** Can you explain lazy evaluation in Spark?
✅ **Answer:**
* Yes. **Lazy evaluation** means that in Spark, when we write transformations like `filter()`, `map()`, or even `groupBy()`, they **don’t run immediately**.
* Spark **builds a DAG (Directed Acyclic Graph)** of all operations first.
* The actual execution only starts **when we run an action**, like `collect()`, `count()`, or `show()`.

This allows Spark to **optimize the entire pipeline** before running — for example, combining filters or deciding the best join strategy.
26
❓ **Question:** If RDD3 is lost, how can we restore it?
✅ **Answer:** “Spark uses lineage info to recompute lost RDDs — that’s how it achieves fault tolerance.”
27
❓Question: You have two large DataFrames that must be joined. Broadcast join isn't possible. How do you reduce shuffle and optimize the join?
✅ **Answer:**
* Since both DataFrames are large and broadcast join won’t work, I would focus on **reducing shuffle** by ensuring the **data is partitioned properly** before the join.

Here’s my approach:

1. **Repartition both DataFrames by the join key** using `repartition()`
   ```python
   df1 = df1.repartition("join_key")
   df2 = df2.repartition("join_key")
   ```
   This ensures that records with the same key go to the same partition, reducing **cross-node shuffling**.
2. Optionally, I’d rely on the **sort-merge join** strategy if the data is already **sorted** and big enough. Spark usually chooses this for large datasets.
3. I’d also **cache** one or both DataFrames if reused multiple times after the join.
4. If possible, I’d **filter** both DataFrames first to reduce data size **before the join**.

---

🧠 One-liner summary:
* “Use `repartition()` on join keys to align partitions and reduce shuffle during large joins.”
28
❓**Question:** When are Spark partitions created, and how is the number of partitions decided?
✅ **Answer:**
* When a **Spark job is submitted**, the data is split into **partitions** based on:
  * **Input file size**
  * **File format**
  * And the **default parallelism** or **block size** (like 128MB or 256MB depending on config)

By default, Spark will:
* Use the **HDFS block size** (e.g., 128MB) to decide the number of partitions
* Or fall back to **spark.sql.shuffle.partitions** (default 200 for shuffle operations)

If we don't specify partition logic (like with `.repartition()`), Spark uses **hash partitioning** internally.
29
❓**Question:** If I need to **decrease** number of partitions, should I use `coalesce()` or `repartition()`?
✅ **Answer:**
* To reduce partitions, **`coalesce()`** is more efficient than `repartition()`.
* ✅ **`coalesce(n)`** → **No full shuffle**, just merges nearby partitions → Best when reducing partitions (e.g., from 200 → 50)
* ⚠️ **`repartition(n)`** → **Triggers full shuffle** → More expensive → Use when you need even distribution or when **increasing** partitions

---

🧠 One-liner to remember:
* “**Use `coalesce()` to reduce**, `repartition()` to reshuffle or increase.”
30
❓**Question:** What architecture are you using?
✅ **Answer:**
> We are using the **Medallion Architecture**:
* 🟫 **Bronze Layer** – Raw data from source systems (CSV, Excel, APIs…)
* 🪙 **Silver Layer** – Cleansed, validated, and transformed data
* 🥇 **Gold Layer** – Aggregated data, KPIs, ready for analytics/reporting

The **final Gold data** is stored in a **data warehouse**, and connected to tools like **Power BI** for reporting.
31
❓**Question:** Are you using surrogate keys and SCD? If yes, in which layers?
✅ **Answer:**
* Yes, we use **surrogate keys** and **Slowly Changing Dimensions (SCD)**, but **not in the Bronze (raw) layer**.
* They are applied in the **Silver or Gold layers**, where data is cleaned and structured.
* 🆔 **Surrogate keys** help assign unique IDs, especially when source systems don’t have stable IDs.
* 🔄 **SCD (usually Type 2)** is used to **track changes** over time — for example, in **advisor or contact details**, where we need to maintain historical records.
32
❓**Question:** Are partitioning and folder structure used in all layers?
✅ **Answer:**
* Yes, we use **partitioning and folder structure** across all layers — but with different purposes:
  * 🟫 **Bronze Layer**: Partitioned by **ingestion date** or **source type** to manage raw files efficiently.
  * 🪙 **Silver Layer**: Partitioned by **business keys** like `region`, `department`, or `event_date` to optimize joins and queries.
  * 🥇 **Gold Layer**: Partitioning depends on reporting needs — usually by **time (e.g., year/month)** for faster aggregation.

Folder structure follows a **clear hierarchy** (`/bronze/table_name/yyyy/mm/dd/`, etc.) to support **data discovery and lineage**.
33
❓ **Q: How do you handle different file formats (CSV, JSON, XML)?**
✅ **A:** All raw files are **converted to Parquet** in the **Bronze layer**, using **Snappy compression** for better **performance and storage**.
34
❓ **Q: Will partitioning also help query performance?**
✅ **A:** Yes. **Partitioning** by keys like **date** or **bank name** reduces **data scanned** during queries → improves speed for analysts.
35
❓ **Q: Do you convert all files to Parquet in the Bronze layer?**
✅ **A:** Yes. Regardless of input format, we **standardize everything to Parquet** before moving to **Silver/Gold**.
36
❓ **Q: Key considerations for ETL pipeline scalability & adaptability?**
✅ **A:** * Use **parameterized pipelines** * Maintain **control tables** for metadata (file type, path, client…) → Enables **dynamic ingestion** and reuse
37
❓ **Q: How do you transform complex/nested data into KPI-ready structure?**
✅ **A:** * Use **Databricks notebooks** * For nested JSON: use **recursive logic** or **`explode()`** in Spark * KPIs are based on **business definitions**
38
❓ **Q: How does your pipeline detect and handle file types dynamically?**
✅ **A:** * A **control table flag** tells ADF the file type * Based on type: use **direct copy** or send to **Databricks for transformation**
39
❓ **Q: How does your architecture adapt to new data types or relationships?**
✅ **A:** * Use **Delta Lake** for **schema evolution** + **time travel** * Apply **SCD with surrogate keys** to track changes * Monitor changes via **ADF logs + alerts**
40
❓ **Q: What is normalization (layman’s terms)?**
> It’s the process of **removing repetition** in data. > Example: Instead of repeating a department name for every employee, split into separate **employee** and **department** tables → link by ID.
41
❓ **Q: Which is better: 1NF or 4NF?**
✅ **A:** It depends on the use case: * **1NF or lower forms** are better for **analytical databases** (OLAP) → Fewer joins, better read performance * **4NF or higher forms** are better for **transactional systems** (OLTP) → Reduces redundancy, supports consistent updates/inserts
42
❓ **Q: Is Snowflake schema better for OLTP or reporting?**
✅ **A:** Snowflake schema is better for **OLTP** systems: * Supports **faster updates/inserts** * But not ideal for reporting — too many joins → **slower queries** For reporting, a **Star schema** or **OBT** is preferred.
43
❓ **Q: Have you used OBT (One Big Table)? What are the pros and cons?**
✅ **A:** Yes, I’ve used OBT for **RFM segmentation** and reporting use cases. * ✅ **Pros:** * No joins → simple for analysts * Fast read performance * ❌ **Cons:** * Hard to update * High redundancy But with **Parquet + cheap cloud storage**, redundancy is less of a concern today.
44
❓ **Q: What is the use case for Redshift?**
✅ **A:** Redshift is used for: * **Columnar storage** * **Fast SQL querying** * Hosting **final curated data** (from S3) for **reporting tools** like Power BI or Tableau.
45
❓ **Q: Example of optimizing a slow Spark job?**
✅ **A:** * Used **broadcast joins** for small dimension tables * Avoided `collect()` to prevent **OutOfMemory** → Reduced job time by \~**30%**
46
❓ **Q: What is Slowly Changing Dimension (SCD)?**
✅ **A:** SCD tracks changes in dimension data over time: * **Type 1:** Overwrites old data * **Type 2:** Stores **full history** (with version columns or flags) * **Type 3:** Keeps only **current + previous version**
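A minimal SCD Type 2 sketch using a Delta MERGE, assuming a hypothetical dimension at /delta/dim_customer with columns customer_id, address, is_current, start_date, end_date, and an incoming `updates_df` carrying customer_id and address:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

dim = DeltaTable.forPath(spark, "/delta/dim_customer")
current = dim.toDF().filter("is_current = true")

# Keep only incoming rows that are new customers or have a changed address
changed = (updates_df.alias("u")
    .join(current.alias("d"), F.col("u.customer_id") == F.col("d.customer_id"), "left")
    .filter(F.col("d.customer_id").isNull() | (F.col("u.address") != F.col("d.address")))
    .select("u.*"))

# Step 1: expire the current version of the changed customers
(dim.alias("d")
    .merge(changed.alias("c"), "d.customer_id = c.customer_id AND d.is_current = true")
    .whenMatchedUpdate(set={"is_current": "false", "end_date": "current_date()"})
    .execute())

# Step 2: append the new versions as the current rows
(changed
    .withColumn("is_current", F.lit(True))
    .withColumn("start_date", F.current_date())
    .withColumn("end_date", F.lit(None).cast("date"))
    .write.format("delta").mode("append").save("/delta/dim_customer"))
```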
47
❓ **Q: How do you handle data skew in PySpark?**
✅ **A:** * Use **broadcast join** (if possible) * Use **`repartition()` or `salting`** techniques * Enable **dynamic resource allocation** * Apply **hash partitioning** on skewed keys
48
❓ **Q: How to remove duplicate rows in PySpark?**
✅ **A:** * Use `dropDuplicates(["col1", "col2"])` to remove based on specific columns * Use `distinct()` to remove fully duplicate rows
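A minimal sketch, assuming an existing DataFrame `df` with hypothetical columns:

```python
# Keep one row per (customer_id, order_date) combination
deduped_by_keys = df.dropDuplicates(["customer_id", "order_date"])

# Remove rows that are identical across every column
fully_deduped = df.distinct()
```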
49
❓ **Q: What is your Spark cluster configuration and how do you handle high data volume?**
✅ **Answer:**
We process around **200–300 GB of data per day** using an **autoscaling Spark cluster**.

**Cluster Setup:**
* 🔹 **1 driver node**
* 🔹 **Up to 8 worker nodes**
* 🔹 Each node: **16 vCPU, 64 GB RAM**
* 🔹 Total cluster memory: **512 GB**

**Autoscaling:**
* Enabled using **dynamic resource allocation**
* Cluster **scales up** automatically for large data loads and **scales down** during idle times

**Cost Optimization:**
* 🟢 **1 driver + 1 worker** are **on-demand**
* 💸 **Remaining 7 workers** are **spot instances** to reduce cost
50
**❓ Q: How to find the 5th highest salary in a SQL table?**
**✅ A:** Use `DENSE_RANK()` or `ROW_NUMBER()` in a subquery:
```sql
SELECT salary
FROM (
  SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
  FROM employees
) t
WHERE rnk = 5;
```
51
**❓ Q: What does `DENSE_RANK()` do?**
**✅ A:** Ranks rows without skipping ranks on ties 📊 Example: Salaries 100, 100, 90 → Ranks: 1, 1, 2
52
**❓ Q: How to find duplicates in a SQL table?**
**✅ A:** Group by column(s), then filter:
```sql
SELECT col, COUNT(*)
FROM table
GROUP BY col
HAVING COUNT(*) > 1;
```
53
**❓ Q: What are `LEAD()` and `LAG()` in SQL?**
**✅ A:** They return the **next (`LEAD`)** or **previous (`LAG`)** row value Used for time-based comparisons and trend analysis 📈
54
**❓ Q: What is a lambda function in Python?**
**✅ A:** A small, **anonymous function**. Syntax:
```python
lambda x: x * 2
```
Used in `map()`, `filter()`, etc. — lightweight and inline ⚡
55
**❓ Q: Why use Delta file format?**
✅ **A:** * ✅ **ACID support** * ⚡ **Faster reads** (columnar) * ⏳ **Time travel** * 🔄 **Schema evolution** * 🛠️ **Recovery from bad writes**
56
**❓ Q: When and why do you use partitioning?**
Partition by columns used in filters (e.g., `order_date`) to enable **data pruning** → **faster queries** 📊
57
**❓ Q: When do you use bucketing over partitioning?**
✅ **A:** * 🪓 **Partitioning**: Use when column has **few distinct values** (e.g., country, status) * 🧩 **Bucketing**: Use for **high-cardinality** columns (e.g., `customer_id`) — data is **hashed into fixed buckets**
58
**❓ Q: In Spark SQL, how to run SQL on a DataFrame?**
✅ **A:** 1. 🏷️ `df.createOrReplaceTempView("my_table")` 2. 🧠 `spark.sql("SELECT * FROM my_table")`
59
**❓ Q: Difference between DataFrame and Dataset?**
✅ **A:** * **DataFrame**: 🧾 Schema-aware, ❌ No compile-time type safety (Python/Scala) * **Dataset**: ✅ Type-safe (Scala/Java only) * Both are built on top of **RDD**
60
**❓ Q: What is normalization in SQL?**
✅ **A:** Normalization = Split a big table into smaller related tables to:
* 🧹 **Reduce redundancy**
* 🧠 **Improve clarity**

Example: Split `employees` and `departments` into separate tables
61
**❓ Q: What are window functions in SQL?**
✅ **A:** Window functions do **calculations row by row**, but can still **see other rows** around it. They **don’t remove rows** like `GROUP BY`.

📌 Common window functions:
* `ROW_NUMBER()` → gives a number to each row
* `RANK()`, `DENSE_RANK()` → ranking
* `LEAD()`, `LAG()` → look at the next or previous row
* `SUM() OVER(...)` → running total
62
**❓ Q: What’s the difference between `ROW_NUMBER()`, `RANK()`, and `DENSE_RANK()`?**
✅ **A:**
* Use **`ROW_NUMBER()`** if every row needs a unique number
* Use **`RANK()`** if ties are okay, but you want **gaps**
* Use **`DENSE_RANK()`** if ties are okay, and you want **no gaps**
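A small sketch showing the difference on tied salaries (the table and values are hypothetical), run through Spark SQL:

```python
# Salaries 300, 200, 200, 100 → ROW_NUMBER: 1,2,3,4  RANK: 1,2,2,4  DENSE_RANK: 1,2,2,3
df = spark.createDataFrame([("a", 300), ("b", 200), ("c", 200), ("d", 100)], ["emp", "salary"])
df.createOrReplaceTempView("emp")

spark.sql("""
    SELECT emp, salary,
           ROW_NUMBER() OVER (ORDER BY salary DESC) AS rn,
           RANK()       OVER (ORDER BY salary DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS drnk
    FROM emp
""").show()
```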
63
**❓ Q: What are some common Spark optimization techniques?**
✅ **A:**

⚙️ **1. Use `cache()` or `persist()`**
* Saves reused DataFrames in **memory**
* Helps avoid recomputing again and again

⚡ **2. Use `select()` to pick only needed columns**
* Don’t use `*` — it loads everything
* Select only what you really need

🪄 **3. Filter early** (a.k.a. predicate pushdown)
* Add `filter()` before joins or heavy logic
* Reduces data size early → faster processing

📦 **4. Use `broadcast join` for small tables**
* Sends the small table to all nodes → avoids shuffle
* Faster joins when one side is small

🔄 **5. Use `repartition()` or `coalesce()` smartly**
* Use `repartition()` to **increase** partitions
* Use `coalesce()` to **reduce** partitions without shuffle

🧠 **6. Avoid `collect()` unless really needed**
* `collect()` pulls all data to the driver → can crash if data is too big

🧮 **7. Use built-in Spark SQL functions**
* Built-in functions (like `withColumn()`, `when()`, etc.) are faster than UDFs
* Avoid using Python `lambda` in transformations

🧠 Easy tip to remember:
> "**Filter early, join smart, cache wisely, and avoid collect.**"
64
**❓ Q: How to troubleshoot Out of Memory (OOM) errors in Spark production?**
✅ **A:**

🔍 **1. Check if `collect()` is used**
* ❌ `collect()` pulls all data to the driver — avoid it unless the dataset is very small
* ✅ Use `show()`, `take()`, or save to storage instead

🧠 **2. Tune Spark memory settings**
* Increase:
  * `spark.executor.memory`
  * `spark.driver.memory`
  * `spark.memory.fraction`
* Example:
```bash
--executor-memory 8g --driver-memory 4g
```

⚡ **3. Reduce data size early**
* Apply **`filter()`** and **`select()`** early in the pipeline
* Avoid wide rows with many unused columns

🔄 **4. Optimize joins**
* Use a **`broadcast join`** if one table is small
* Repartition by the join key to reduce shuffle

📊 **5. Cache smartly**
* Cache only when reused
* Use `persist(StorageLevel.DISK_ONLY)` if memory is tight

🔄 **6. Monitor with Spark UI / logs**
* Check which stage is failing
* See executor memory usage
* Look for skewed partitions or expensive shuffles

🧠 Quick tip to remember:
> "**Filter early, tune memory, avoid collect, join smart.**"
65
**❓ Q: How do you capture incremental data in OLTP systems?**
✅ **A:** 🕒 Use **timestamps**, a **last updated column**, or **change flags**, e.g.:
```sql
WHERE updated_at > 'last_run_time'
```
Other techniques:
* ✅ Use **CDC (Change Data Capture)** tools like Debezium
* ✅ Use **watermarking** in ADF or Spark for streaming/batch
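A minimal incremental-load sketch in PySpark, assuming a JDBC source table with an `updated_at` column and a stored `last_run_time` watermark (connection details, table, and paths are hypothetical):

```python
from pyspark.sql import functions as F

last_run_time = "2024-01-01 00:00:00"  # normally read from a control/watermark table

incremental = (spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/sales")
    .option("dbtable", "public.orders")
    .option("user", "etl_user")
    .option("password", "***")
    .load()
    .filter(F.col("updated_at") > F.lit(last_run_time)))

# Append only the new/changed rows to the bronze layer
incremental.write.mode("append").parquet("/bronze/orders")

# New watermark to store for the next run
new_watermark = incremental.agg(F.max("updated_at")).collect()[0][0]
```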
66
**❓ Q: What is data skewness in Spark?**
✅ **A:** 📊 **Data skew** means some partitions have **a lot more data** than others. ❌ This causes one executor to do most of the work → **slow jobs** or OOM. ✅ It usually happens with **joins** or **groupBy** on **unbalanced keys** (e.g., same customer ID repeated too much).
67
**❓ Q: If memory and disk are full in Spark, what happens?**
✅ **A:** 💥 The job will **fail with an Out of Memory (OOM)** or **disk spill error**. Spark **can’t cache, shuffle, or persist data** anymore. You may see: * `ExecutorLostFailure` * `ShuffleBlockFetcherIterator` errors ✅ Fix: free up resources, scale the cluster, or reduce data.
68
**❓ Q: What is a Common Table Expression (CTE)?**
✅ **A:** A CTE is a **named temporary result set** defined with `WITH` and used inside a single query.

🧠 **Syntax:**
```sql
WITH cte_name AS (
  SELECT ...
)
SELECT * FROM cte_name WHERE ...
```

🎯 **Why use CTEs?**
* ✅ Makes SQL **easier to read**
* ♻️ Can **reuse logic** in the same query
* 🔁 Supports **recursion** (e.g., tree structures)

🧠 One-liner to remember:
* “CTE = mini table inside a query for clarity and reuse.” ✅
69
**❓ Q: What does ACID mean in databases? What are its types?**
✅ **A:** **ACID** = 4 key rules that make sure database transactions are **safe and reliable**.

🔹 **A – Atomicity**
> 👉 All steps in a transaction **must succeed**, or **none at all**.
> Example: If you send money, it must leave your account **and** arrive at the other — not just one.

📌 **Think:** “All or nothing.”

🔹 **C – Consistency**
> 👉 Data must follow the **rules of the database** before and after the transaction.
> Example: You can't have a negative balance if your system doesn’t allow it.

📌 **Think:** “Follow the rules.”

🔹 **I – Isolation**
> 👉 Multiple users can run transactions at the same time, but they **won’t mess each other up**.
> Example: If two people buy the same product, each order is handled **separately and safely**.

📌 **Think:** “No interference.”

🔹 **D – Durability**
> 👉 Once a transaction is done, the data is **saved forever**, even if the system crashes.
> Example: You book a ticket → system crashes → your booking is still there.

📌 **Think:** “Saved for sure.”

🧠 Easy line to learn by heart:
> "**ACID = All or nothing, Consistent, Isolated, Durable.**"
70