How do you optimize performance in a Databricks pipeline?
- Delta tables are optimized for Spark (ACID transactions, schema enforcement)
- Partitioning on common filter columns (like date)
- Z-order indexing – improves read performance by clustering data files based on access patterns (see the sketch after this list)
- Spark UI to find memory issues or skewed tasks
- Job clusters that spin up and down for the duration of the job
- Wide transformations (groupBy, distinct, join) can cause excessive shuffles, so minimize or restructure them where possible
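A minimal sketch of the partitioning and Z-ordering points, assuming a Delta table on Databricks; the table and column names (raw.events, event_date, customer_id) are illustrative only:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rewrite a table as Delta, partitioned by a commonly filtered date column
(spark.table("raw.events")                     # hypothetical source table
      .write.format("delta")
      .partitionBy("event_date")               # hypothetical partition column
      .mode("overwrite")
      .saveAsTable("analytics.events"))

# Compact small files and Z-order by a frequently filtered column
spark.sql("OPTIMIZE analytics.events ZORDER BY (customer_id)")
```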
How do you approach data modeling for a large-scale warehouse (e.g., star vs. snowflake schemas)?
- Hybrid approach
- Star schema for speed on the analytics layer (a sample layout is sketched below)
- Snowflake-like normalization for master data and staging layers
- SCD Type 2 where needed
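A hedged sketch of a star-schema analytics layer as Delta tables; the tables and columns (fact_sales, dim_product, date_key, etc.) are illustrative, not a prescribed model:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Denormalized dimension table (star schema keeps attributes in one place)
spark.sql("""
CREATE TABLE IF NOT EXISTS dim_product (
    product_key  BIGINT,
    product_name STRING,
    category     STRING,
    brand        STRING
) USING DELTA
""")

# Fact table referencing the dimensions by surrogate keys
spark.sql("""
CREATE TABLE IF NOT EXISTS fact_sales (
    sales_key    BIGINT,
    product_key  BIGINT,  -- joins to dim_product
    date_key     INT,     -- joins to a dim_date table
    quantity     INT,
    amount       DECIMAL(18,2)
) USING DELTA
""")
```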
How do you handle slowly changing dimensions in your data warehouse design?
- Most enterprise-scale warehouses use Type 2
- Use MERGE statements in SQL Server and Databricks (see the sketch below)
- Track start and end dates, a version or current_flag, created_by, updated_at
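A hedged sketch of an SCD Type 2 upsert in Databricks, assuming a hypothetical dim_customer target and staging_customer source; step 1 closes out changed current rows, step 2 inserts the new versions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Step 1: close out the current row when a tracked attribute has changed
spark.sql("""
MERGE INTO dim_customer AS tgt
USING staging_customer AS src
  ON tgt.customer_id = src.customer_id AND tgt.current_flag = true
WHEN MATCHED AND tgt.customer_name <> src.customer_name THEN
  UPDATE SET current_flag = false,
             end_date = current_date()
""")

# Step 2: insert a new current version for customers with no open row
spark.sql("""
INSERT INTO dim_customer
SELECT src.customer_id,
       src.customer_name,
       current_date()      AS start_date,
       CAST(NULL AS DATE)  AS end_date,
       true                AS current_flag
FROM staging_customer AS src
LEFT JOIN dim_customer AS tgt
  ON tgt.customer_id = src.customer_id AND tgt.current_flag = true
WHERE tgt.customer_id IS NULL
""")
```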
How do you ensure your data solutions stay aligned with evolving enterprise architecture?
- I focus on solving the current business need first without over-engineering
- But I make sure my solution follows enterprise standards
- My work aligns with current architecture while staying modular and easy to evolve if things shift
How have you contributed to a technology roadmap in the past?
- YES – Data strategy (current state, future state, business requirements, stakeholder assessments)
- Timelines, smart goals, resource requirements
What are some indicators that a data architecture needs rethinking?
- When data volume grows faster than expected, performance degrades, manual workarounds increase, or tech debt blocks new features.
How do you gather data requirements from both technical and non-technical stakeholders?
- Talk to stakeholders to understand the business goals behind the data need
- For non-technical stakeholders, focus on what questions they want answered – keep it outcome-focused
- For technical stakeholders, focus on data sources, schemas, and technical constraints
- Clarify data freshness, granularity, and access requirements (how often data updated, level of detail, who needs access)
- If anything is unclear, I prototype queries or mock reports to help visualize the data flow
How do you approach requirement gathering for a system you’ve never worked on before?
- High-level questions to understand the purpose of the system and business outcomes
- Key stakeholder sessions – find pain points and current workflows, what's working and what's not
- Review available documentation
- Map out the data flow and draft assumptions and gaps
What types of documentation do you typically produce for a data warehouse project?
- Data models
- Source to target mapping
- ETL pipeline documentation
- Data dictionary
- Access and permission guide
- Operational runbooks – how to troubleshoot, restart jobs, respond to failures in pipeline
How do you ensure stakeholder expectations are aligned with what’s technically feasible?
- Active listening
- Must-haves vs. nice-to-haves
- Technical constraints are known early
- Suggest alternatives that meet the same goal when an option isn't feasible
- Trade-offs are known (e.g., real-time updates trade off against cost and complexity)
Describe your process for managing ongoing support and maintenance of a data system.
- Hypercare period with fast feedback loop (quick triage of issues, communication to stakeholders)
- Post-hypercare, transition to a support model with knowledge transfer to ops or support teams
Have you worked in agile environments? How do you manage data engineering tasks in sprint cycles?
- Yes, we followed sprints, daily standups, and backlog grooming
- Break down engineering work into small, testable tasks that fit into a sprint
- Break subtasks into roughly 4-hour chunks within a sprint
- For larger features like an ETL pipeline, the work gets split into phases (schema design, source ingestion, etc.)
Your Databricks job requires frequent joins between a large fact table and several dimension tables. How would you optimize the join operations to improve performance?
- Broadcast joins for smaller dimension tables (see the sketch below)
- Partition on the join key
- Cache dimension tables in memory to reduce repeated I/O
- Bucket the tables on the join key to reduce shuffle overhead
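A minimal sketch of the broadcast and caching points, assuming hypothetical fact/dimension tables (fact_sales, dim_customer, dim_product) joined on surrogate keys:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

fact = spark.table("fact_sales")                       # large fact table
dim_customer = spark.table("dim_customer").cache()     # small dimension reused across joins
dim_product = spark.table("dim_product")

enriched = (fact
    .join(broadcast(dim_customer), "customer_key")     # broadcast avoids shuffling the fact table
    .join(broadcast(dim_product), "product_key"))

enriched.write.format("delta").mode("overwrite").saveAsTable("analytics.sales_enriched")
```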
You need to create a Databricks job that reads data from multiple sources (e.g., ADLS, Azure SQL Database, and Cosmos DB), processes it, and stores the results in a unified format. Describe your approach.
- Spark connectors to read each source
- Standardize the schema across the different data sources
- Apply transformations, aggregations, and joins to integrate the data
- Write to a unified storage format (e.g., Delta Lake or SQL Server) – see the sketch below
- Schedule the job using Databricks Jobs or Azure Data Factory for regular execution
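A hedged sketch of the approach for two of the sources (ADLS and Azure SQL via JDBC); Cosmos DB would be read similarly through its Spark connector. Paths, connection values, and table/column names are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# ADLS (Parquet files) – path is a placeholder
adls_df = spark.read.parquet("abfss://container@account.dfs.core.windows.net/orders/")

# Azure SQL Database via the JDBC connector – connection values are placeholders
sql_df = (spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://server.database.windows.net:1433;database=sales")
    .option("dbtable", "dbo.orders")
    .option("user", "<user>")
    .option("password", "<password>")
    .load())

# Standardize both sources to a common schema, then union
common_cols = ["order_id", "customer_id", "amount", "order_date"]
unified = (adls_df.select(*[col(c) for c in common_cols])
           .unionByName(sql_df.select(*[col(c) for c in common_cols])))

# Land the unified result in Delta Lake
unified.write.format("delta").mode("append").saveAsTable("curated.orders")
```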
Your organization needs to implement a data quality framework in Azure Databricks to ensure the accuracy and consistency of the data. What approach would you take?
- Validation rules to check for data consistency, completeness, and accuracy (see the sketch below)
- Spark transformations to clean the data
- Monitoring to track quality and anomalies
- Generate reports
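A minimal sketch of rule-based validation; the rules, table names (curated.orders, quality.*), and columns are illustrative assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.table("curated.orders")   # hypothetical input table

# Completeness/accuracy rules expressed as a failure predicate
fail_cond = (F.col("order_id").isNull()
             | F.col("amount").isNull()
             | (F.col("amount") < 0))

invalid = df.filter(fail_cond)
valid = df.filter(~fail_cond)

# Route failures to a quarantine table for monitoring and reporting
invalid.write.format("delta").mode("append").saveAsTable("quality.orders_rejected")
valid.write.format("delta").mode("overwrite").saveAsTable("quality.orders_clean")

# Simple metric that could feed a quality dashboard or alert
print(f"rejected rows: {invalid.count()}")
```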
Scenario: You are experiencing intermittent network issues causing your Databricks job to fail. How would you ensure that the job completes successfully despite these issues?
- Retry logic with backoff around the fragile steps (see the sketch below)
- Checkpointing so the job can resume from the last successful state instead of restarting from scratch
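A hedged sketch of retry-with-backoff around a flaky network-bound step; the attempt counts, wait times, and the wrapped step are illustrative:

```python
import time

def run_with_retries(step, max_attempts=3, base_wait_seconds=30):
    """Run a callable, retrying with exponential backoff on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception as exc:          # narrow this to network/transient errors in practice
            if attempt == max_attempts:
                raise
            wait = base_wait_seconds * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {wait}s")
            time.sleep(wait)

# Example: wrap only the fragile network-bound step of the job
# run_with_retries(lambda: spark.read.parquet("abfss://<container>@<account>.dfs.core.windows.net/input/")
#                              .write.format("delta").mode("append").saveAsTable("curated.events"))
```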
You need to integrate Azure Databricks with Azure DevOps for continuous integration and continuous deployment (CI/CD) of your data pipelines. What steps would you follow?
- Version control
- CI pipeline to automatically test and validate changes to notebooks
- CD pipeline to deploy validated notebooks to the Databricks workspace
- Databricks CLI for integration with Azure DevOps
Databricks management plane?
- Set of tools and services used to manage and control the Databricks environment; it includes the Databricks workspace
Difference between instance and cluster?
- Instance – a single virtual machine used to run an app or service
- Cluster – a set of instances that work together to provide higher performance
Serverless data processing?
- Process data without needing to worry about the underlying infrastructure
- Databricks manages the infrastructure and allocates resources as needed
What are the main components of Databricks?
- Workspace (organize projects)
- Clusters (executing code)
- Notebooks (interactive development)
- Jobs (scheduling automated workflows)
What is Apache Spark and how does it integrate with Databricks?
- Spark – open-source distributed computing system
- Databricks provides a managed Spark environment that simplifies cluster management
RDDs in Spark?
- RDDs (Resilient Distributed Datasets)
- DataFrames are built on top of RDDs
- Low level, no schema
- No query optimization (not handled by the Catalyst optimizer)
DataFrames in Spark?
- DataFrames are distributed collections of data organized into named columns
- This is the primary API used in PySpark (see the sketch below)
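A tiny sketch contrasting the two APIs on the same sample data; the rows and column names are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rows = [("alice", 34), ("bob", 45)]

# RDD: low level, no schema, positional access, no Catalyst optimization
rdd = spark.sparkContext.parallelize(rows)
over_40_rdd = rdd.filter(lambda r: r[1] > 40)

# DataFrame: named columns, built on top of RDDs, optimized by Catalyst
df = spark.createDataFrame(rows, ["name", "age"])
df.filter(df.age > 40).show()
```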