All in One Flashcards
(70 cards)
❓ Question: Have you been involved in any kind of PySpark job optimization? ⚡
- Broadcast Join: Used for joining huge daily transaction data with small metadata to avoid shuffling (see the sketch below).
- Partition By: Stored data partitioned by year and month (as querying by month was frequent) for both unprocessed and processed data.
- Delta Tables: Used instead of normal Spark tables for efficient updates/merges and handling restatements (upsert operations).
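🧪 A minimal PySpark sketch of the first two points. The paths, the `product_id` join column, and the year/month columns are hypothetical, not from the original answer:
```
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("optimization-sketch").getOrCreate()

# Hypothetical inputs: a large daily transactions table and a small metadata table.
transactions = spark.read.parquet("/data/raw/transactions")
metadata = spark.read.parquet("/data/raw/product_metadata")

# Broadcast join: ship the small table to every executor so the big table is not shuffled.
enriched = transactions.join(broadcast(metadata), on="product_id", how="left")

# Partitioned write: assumes year/month columns exist; monthly queries then prune partitions.
(enriched.write
 .mode("overwrite")
 .partitionBy("year", "month")
 .parquet("/data/processed/transactions"))
```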
❓ Question: Suppose one day you log in and see that you’re getting an out-of-memory issue for the Spark jobs… what steps would you follow to troubleshoot this kind of error? 🧠
- Answer (Ro):
  - Start in the Spark UI to find out what is going on.
  - Determine the root cause (e.g., a larger-than-usual dataset, data skew, an inefficient join strategy).
  - Consider solutions:
    - Increase cluster size (if appropriate).
    - Optimize the code: filter records earlier.
    - If data skew: try to repartition.
    - If caching causes issues: consider `persist()` instead of `cache()` (see the sketch below).
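🧪 A minimal sketch of the caching point, assuming a hypothetical events dataset and date column; `persist()` with an explicit storage level such as `MEMORY_AND_DISK` lets partitions spill to disk instead of holding everything in memory:
```
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oom-troubleshooting-sketch").getOrCreate()

# Hypothetical input; filter early so less data flows through later stages.
df = spark.read.parquet("/data/raw/events")
filtered = df.filter(df.event_date >= "2024-01-01")

# Explicit storage level: partitions that don't fit in memory spill to disk
# instead of blowing up executor memory.
filtered.persist(StorageLevel.MEMORY_AND_DISK)
filtered.count()  # an action materializes the persisted data
```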
❓ Question: You talked about data skewness. If you find that for particular partitions you’re getting a huge amount of data suddenly, how will you mitigate this kind of problem? ⚖️
- Answer (Ro):
- Enable AQE (Adaptive Query Execution) in Spark 3.0+ as a first step, as it can detect and handle skewness at runtime.
- If AQE isn’t enough, repartition the DataFrame.
- Salting: As an advanced technique, add a random salt to the skewed key so one hot partition is split into several smaller ones, then join/aggregate on the key plus the salt (see the sketch below).
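🧪 A minimal sketch of both mitigations. The paths and the `join_key` column are hypothetical:
```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("skew-sketch").getOrCreate()

# 1) AQE (Spark 3.0+): detect and split skewed partitions at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# 2) Salting (manual fallback): spread a hot key across N sub-keys.
N = 10  # number of salt buckets, tuned per workload
large = spark.read.parquet("/data/raw/large_table")
small = spark.read.parquet("/data/raw/lookup_table")

# Add a random salt on the skewed side...
large_salted = large.withColumn("salt", (F.rand() * N).cast("int"))

# ...and replicate the other side once per salt value so every row can still match.
small_salted = small.crossJoin(
    spark.range(N).select(F.col("id").cast("int").alias("salt"))
)

joined = large_salted.join(small_salted, on=["join_key", "salt"], how="inner")
```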
❓ Question: If you have to handle joining two large tables (can’t broadcast either), how do you optimize? 🔗
- Answer (Ro): Use bucketing. Bucket both tables on the join column with the same, well-chosen number of buckets. This helps because the data is pre-partitioned (and can be pre-sorted) by the join key, so the join avoids a full shuffle and runs faster (see the sketch below).
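🧪 A minimal bucketing sketch, assuming hypothetical orders/customers tables joined on `customer_id`; bucketed tables have to be saved with `saveAsTable`:
```
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("bucketing-sketch")
         .enableHiveSupport()  # assumes a metastore is available for managed tables
         .getOrCreate())

orders = spark.read.parquet("/data/raw/orders")
customers = spark.read.parquet("/data/raw/customers")

# Bucket both tables on the join column with the SAME number of buckets.
(orders.write.bucketBy(64, "customer_id").sortBy("customer_id")
 .mode("overwrite").saveAsTable("orders_bucketed"))
(customers.write.bucketBy(64, "customer_id").sortBy("customer_id")
 .mode("overwrite").saveAsTable("customers_bucketed"))

# The join can now skip the full shuffle because matching buckets line up.
result = spark.table("orders_bucketed").join(
    spark.table("customers_bucketed"), on="customer_id")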
❓ Question: Do you have any high-level understanding of Hadoop, its working, and components?
- Answer: Hadoop is a framework for Big Data problems. Core components:
- HDFS (Hadoop Distributed File System): For distributed storage. 📁
- MapReduce: For distributed processing. 🔄
- YARN (Yet Another Resource Negotiator): Provides resources for Hadoop jobs. 🛠️
- Interviewer’s Input (Ganesh): Added that apart from the core components, there’s an ecosystem with tools like Sqoop, Oozie, HBase, Hive, and Pig. 🧩
❓ Question: Does Spark offer storage?
- Answer (Priti): No, it’s a plug-and-play compute engine. It can be used with storage like HDFS, Google Cloud Storage, or Amazon S3. 🗂️
- Interviewer’s Confirmation (Ganesh): Correct, it’s a plain compute engine usable with any resource manager and distributed storage. ✅
❓ Question: What are the levels/APIs we can work with in Apache Spark?
- Answer (Priti): Two kinds of layers:
- Spark Core APIs: Work with RDDs. 🧠
- Higher-Level APIs: Deal with Spark DataFrames and Spark SQL. 📊
❓ Question: What’s the difference/advantage of DataFrames and Spark SQL over RDDs? Why are they recommended?
✅ DataFrame / Spark SQL vs RDD – Why better?
1. Faster – Spark can optimize DataFrames using Catalyst engine. RDDs can’t.
> → DataFrames = smart, RDD = manual work.
2. Less code – DataFrames use SQL or simple functions.
> → Easier to write than low-level RDD logic.
3. Better for big data – Spark does auto tuning (like memory, joins) with DataFrames.
> → You don’t need to fine-tune everything.
4. Easy to use – DataFrames feel like tables. You can write SQL!
> → RDD is like coding everything from scratch.
🔁 Summary to remember:
> “DataFrames are faster, shorter, smarter, and feel like SQL.
RDDs are low-level, harder, slower, more code to write.”
❓ Question: What is an RDD (Resilient Distributed Dataset) in Spark, and why is it important?
🧠 Key features:
- Resilient – Fault-tolerant. If data is lost, Spark can rebuild it.
- Distributed – Data is split across many machines.
- Immutable – Once created, it can’t change (only transformed).
- Lazy – Transformations are not run until an action is called.
⭐ Why important?
- Base of all Spark APIs (DataFrame is built on top of RDD).
- You can control everything (partition, caching, logic).
- Useful when you need custom transformations or low-level control.
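🧪 A tiny RDD sketch illustrating these properties:
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext

# Distributed: the list is split across 2 partitions.
rdd = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)

# Immutable + lazy: map() returns a NEW RDD and nothing runs yet.
squared = rdd.map(lambda x: x * x)

# Only the action triggers execution (lost partitions can be rebuilt from the lineage).
print(squared.collect())  # [1, 4, 9, 16, 25]
```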
❓ Question: What are transformations and actions in Spark? What happens when they are executed?
🔄 Transformations
- Create new RDD/DataFrame from existing one.
- Lazy: Spark does not run them immediately.
- Spark builds a DAG (like a plan).
🧠 Examples: `filter()`, `map()`, `select()`
⚡ Actions
- Start execution of the DAG.
- Spark runs all needed transformations and returns result.
🧠 Examples: `collect()`, `count()`, `show()`
💥 What happens?
- Spark builds DAG from transformations → waits
- When action is called → Spark optimizes + runs the DAG.
🧠 Easy summary to learn:
- “Transformations build the plan, Actions run the plan.”
- DAG is the plan. Spark waits until action to execute it.
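🧪 A minimal DataFrame sketch of the plan-then-run behaviour:
```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-sketch").getOrCreate()

df = spark.range(1_000_000)  # small demo DataFrame

# Transformations: only the DAG/plan is built here, nothing executes.
plan = (df.withColumn("bucket", F.col("id") % 10)
          .filter(F.col("bucket") == 3)
          .select("id"))

# Action: Spark optimizes the DAG and actually runs it now.
print(plan.count())
```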
❓ Question: Briefly explain the Spark architecture.
🏗️ Spark = Master–Slave Architecture
🧑💼 Driver (Master)
- Starts the app (via `SparkContext`)
- Builds the plan (DAG)
- Talks to Cluster Manager to ask for resources
- Sends tasks to Executors
- Collects results
👷 Executors (Workers / Slaves)
- Run the code
- Store data (in memory or disk)
- Send results back to Driver
🔁 Life of a Spark Job:
- Driver → asks for Executors → sends tasks → Executors run → results → Driver ends.
🧠 Summary to memorize:
- “Driver plans, Executors work. Spark spreads the job across many machines.”
❓ Question: What are the different plans you mentioned in Spark architecture (parsed logical plan, etc.)?
📊 Spark Query Plans (via Catalyst Optimizer)
1. Parsed Logical Plan 🔤
   → Checks for syntax errors (like SQL grammar).
2. Analyzed Logical Plan 🧠
   → Checks metadata: Are table/column names valid?
3. Optimized Logical Plan ⚙️
   → Applies rules (e.g., push filters early, remove extra steps).
   → This is where the Catalyst Optimizer runs.
4. Physical Plan ⚡
   → Chooses how to run:
   - Use broadcast join or sort-merge join
   - Use hash aggregate or sort aggregate
🧠 Summary to memorize:
- “Parse → Analyze → Optimize → Execute”
- Think: What to run → How to run it best
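🧪 You can print all of these plans for any DataFrame with `explain`; a minimal sketch:
```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("explain-sketch").getOrCreate()

df = spark.range(100).withColumn("even", F.col("id") % 2 == 0)

# Prints the Parsed, Analyzed and Optimized Logical Plans plus the Physical Plan.
df.filter("even").explain(extended=True)
```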
🔍 What is Predicate Pushdown?
- Spark pushes filters down to the earliest point (like reading from file or DB)
- → So less data is loaded or processed.
🧠 Example:
```
SELECT * FROM big_table WHERE country = 'VN'
```
With pushdown → Spark asks the storage for only the rows with country = 'VN'.
✅ Why important?
- Reduces I/O (reads less data)
- Speeds up query
- Works best with Parquet, ORC, JDBC, etc.
🧠 Easy to remember:
- “Push filter early → process less → run faster.” 🏎️
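🧪 A minimal pushdown sketch on a hypothetical Parquet dataset; when pushdown applies, the scan node in the physical plan typically lists the condition under PushedFilters:
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pushdown-sketch").getOrCreate()

users = spark.read.parquet("/data/processed/users")  # hypothetical path

# The filter is pushed down to the Parquet reader, so only matching
# row groups / rows are read from storage.
vn_users = users.filter(users.country == "VN")

vn_users.explain()  # inspect the scan for PushedFilters
```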
❓ Question: What are jobs, stages, and tasks in Spark UI?
🧠 Answer:
- Job 🏃
  → Created when an action (like `count()`, `show()`) is triggered.
  → One job = one action
- Stage 🎬
  → A job is split into stages based on shuffle boundaries.
  → #Stages = wide transformations + 1
- Task 🧩
  → Smallest unit of work.
  → #Tasks = #Partitions of the DataFrame/RDD
✅ Interviewer’s Note (Ganesh):
> “Number of jobs = number of actions performed.”
🧠 Quick Summary:
- “Action → Job → Stages → Tasks”
- Like breaking a race into checkpoints and runners.
❓ Question: What are the types of transformations in Spark?
🧠 Answer:
- Narrow Transformations 🔄
  → Data stays on the same partition, no shuffle.
  → Fast & local.
  🧪 Examples: `map()`, `filter()`, `flatMap()`
- Wide Transformations 🔀
  → Data moves between partitions (shuffle).
  → Used for grouping or joining.
  🧪 Examples: `groupByKey()`, `reduceByKey()`, `join()`
🧠 Easy to remember:
- “Narrow = no shuffle, Wide = shuffle happens.”
❓ Question: What’s the difference between `repartition()` and `coalesce()`?
🧠 Answer:
- `repartition()` 🔄
  → Can increase or decrease partitions
  → Always reshuffles all data → more cost
  → Good when you want balanced partitions
- `coalesce()` 🛠️
  → Used to decrease partitions only
  → No full shuffle → just merges nearby partitions
  → Faster, but partitions may be uneven
🗣️ Interviewer’s Follow-up:
> “So, best for decreasing partitions is `coalesce()`?”
✅ Yes!
🧠 Easy to remember:
- “Repartition = full shuffle, Coalesce = merge smart & cheap”
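🧪 A minimal sketch comparing the two:
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitions-sketch").getOrCreate()

df = spark.range(1_000_000)
print(df.rdd.getNumPartitions())        # current partition count

# repartition: full shuffle, can go up or down, evenly sized partitions.
balanced = df.repartition(200)
print(balanced.rdd.getNumPartitions())  # 200

# coalesce: decrease only, merges neighbouring partitions without a full shuffle.
merged = df.coalesce(4)
print(merged.rdd.getNumPartitions())    # 4
```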
❓ Question: When creating a DataFrame from a data lake, should I enforce the schema or infer it (inferring is less code)? What’s recommended?
🧠 Answer:
- ✅ Enforce schema explicitly (recommended)
  → More control, faster, and safe
- ⚠️ Inferring schema
  → Spark scans the full data to guess types
  → Can be slow and guess wrong types
🧠 Easy to remember:
> “Infer = easy but risky ❌,
Enforce = safe and fast ✅”
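🧪 A minimal sketch of both approaches on a hypothetical CSV file:
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-sketch").getOrCreate()

path = "/data/raw/people.csv"  # hypothetical path

# Inferring: Spark scans the data to guess types (extra pass, can guess wrong).
inferred = (spark.read.option("header", True)
            .option("inferSchema", True).csv(path))

# Enforcing: we declare the types up front (no extra scan, predictable types).
schema = "name STRING, age INT, salary DOUBLE"
enforced = spark.read.option("header", True).schema(schema).csv(path)

enforced.printSchema()
```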
❓ Question: What are the ways/methods for schema enforcement?
🧠 Answer:
There are 2 main ways to define a schema in Spark:
- Schema DDL (String Format) 📜
  → Easy and short.
  🧪 Example: `schema = "name STRING, age INT, salary DOUBLE"`
- `StructType` with `StructField` 🛠️
  → More control, used for complex or nested schemas.
  🧪 Example:
```
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
```
🧠 Easy to remember:
- “DDL = short & simple,
- StructType = detailed & powerful”
❓ Question: What are Data Lake and Delta Lake? Major differences, advantages/disadvantages?
🧠 Answer:
🏞️ Data Lake
- Big storage for structured, semi-structured, unstructured data
- Cheap and scalable 💸
- Stores data in raw formats (e.g., Parquet, CSV)
🔻 Disadvantages:
- ❌ No ACID guarantees
- ❌ Not good for direct analytics/reporting
- ❌ Hard to track data changes or rollback
🔺 Delta Lake
- Storage layer built on top of Data Lake
- Adds features for data reliability and management
- Uses Parquet + _delta_log (stores transaction history) 📑
✅ Advantages:
- ✔️ ACID transactions
- ✔️ Time travel (see old versions) ⏳
- ✔️ Schema enforcement & evolution
- ✔️ Better for streaming + batch
🧠 Easy to remember:
> “Data Lake = raw & cheap,
Delta Lake = reliable & smart (ACID + time travel)” ⏳📦
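🧪 A minimal Delta sketch. It assumes the delta-spark package and a Delta-enabled session; the path and column names are hypothetical:
```
from pyspark.sql import SparkSession
from delta.tables import DeltaTable  # requires the delta-spark package

spark = (SparkSession.builder
         .appName("delta-sketch")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "/data/delta/transactions"  # hypothetical path

# Write a Delta table: Parquet files + _delta_log transaction log.
spark.range(5).withColumnRenamed("id", "txn_id") \
     .write.format("delta").mode("overwrite").save(path)

# Upsert / merge: the pattern used for restatements.
updates = spark.range(3, 8).withColumnRenamed("id", "txn_id")
target = DeltaTable.forPath(spark, path)
(target.alias("t")
 .merge(updates.alias("u"), "t.txn_id = u.txn_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Time travel: read an earlier version of the table.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
```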
❓ Question: Data profiling?
I haven’t directly used a data profiling tool yet, but I understand the concept now.
It’s about analyzing the quality and structure of data – like checking for missing values, wrong data types, or unexpected values.
For example, if a column called name has numbers in it, profiling helps catch that early. Tools like Great Expectations or Deequ can be used for this kind of automated check.
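🧪 Even without a dedicated tool, a few plain PySpark checks give a basic profile (the dataset and column names are hypothetical):
```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("profiling-sketch").getOrCreate()

df = spark.read.parquet("/data/raw/customers")  # hypothetical path

# Null count per column.
df.select([F.count(F.when(F.col(c).isNull(), 1)).alias(c) for c in df.columns]).show()

# Basic statistics (count, mean, stddev, min, max).
df.describe().show()

# Rule in the spirit of "a name column should not contain numbers".
print("numeric-looking names:", df.filter(F.col("name").rlike(r"^\d+$")).count())
```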
❓ Question: What is Data Governance?
🧠 Answer:
Data governance is the set of rules and processes to manage, protect, and ensure the quality of data in an organization.
🔐 Main Goals:
- ✅ Data Quality – Make sure data is accurate, complete, and consistent
- 🔒 Data Security – Control who can access or change data
- 📜 Compliance – Follow laws (like GDPR, HIPAA…)
- 🧭 Data Lineage – Track where data comes from and how it’s changed
- 👥 Ownership & Roles – Know who owns the data, who can update it
🧰 Common Tools:
- Unity Catalog (Databricks)
- Apache Atlas, Collibra, Alation, Amundsen
- Great Expectations (for data quality rules)
🧠 Easy to remember:
- “Right data, right people, right rules.” ✅🔒📊
❓ Question: What kind of file format is Parquet? How is it better than CSV or row-based formats?
✅ Answer:
> Parquet is a columnar file format, unlike CSV which is row-based.
Advantages over CSV / row-based formats:
- 📊 Faster for analytics
  → When doing filtering or aggregation, Parquet only reads the needed columns, not full rows → more efficient.
- 📦 Better compression
  → Columnar format allows high compression, so file size is much smaller than CSV.
- ⚡ Optimized for big data
  → Especially useful when dealing with wide tables (many columns), as it avoids scanning unused data.
🧠 Easy to remember:
> “Parquet = column-based → faster reads, smaller size, better for big data”
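🧪 A minimal sketch of column pruning on a hypothetical wide Parquet dataset (assumes a numeric `amount` column):
```
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("parquet-sketch").getOrCreate()

sales = spark.read.parquet("/data/processed/sales")  # hypothetical wide table

# Only the two selected columns are read from disk (column pruning),
# instead of scanning full rows as a CSV read would.
totals = (sales.select("region", "amount")
          .groupBy("region")
          .agg(F.sum("amount").alias("total_amount")))
totals.show()
```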
❓ Question: Do you know about predicate pushdown?
✅ Answer:
- Yes. Predicate pushdown means Spark applies filters as early as possible, directly while reading data from source (like Parquet, JDBC).
- This helps reduce the amount of data loaded, so queries run faster.
🧠 Example:
```
SELECT * FROM users WHERE age > 30
```
With pushdown → Only rows with age > 30 are read from disk.
🧠 One-liner:
- “Push filter early → read less → run faster” ✅📉
❓ Question: In Python, do you get errors at compile time or only after running the code?
✅ Answer: In Python, we get errors at runtime, not at compile time, because Python is a dynamically typed and interpreted language.
So, if there’s a bug in a function, we’ll only see it when that function runs, not just by writing it.
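🧪 A tiny example of the difference:
```
def divide(a, b):
    return a / b  # the bug only surfaces when this line actually runs

# No error here: Python accepts the definition without checking the logic.
print("function defined, no error yet")

# The ZeroDivisionError appears only at runtime, when the call executes.
print(divide(10, 0))
```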