vaa Flashcards

(56 cards)

1
Q

How do you optimize performance in a Databricks pipeline?

A
  • Delta tables are optimized for Spark (ACID transactions, schema enforcement)
  • Partition on commonly filtered columns (like date)
  • Z-order indexing – improves read performance by clustering data files based on access patterns
  • Use the Spark UI to find memory issues or skewed tasks
  • Job clusters that spin up and down for the duration of a job
  • Minimize wide transformations (groupBy, distinct, join) since they can cause excessive shuffles (see the sketch below)
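A minimal PySpark sketch of the Delta, partitioning, and Z-order points above; the table name `sales` and its columns are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame standing in for a large fact table (columns are made up)
df = spark.createDataFrame(
    [(1, "2024-01-01", 100.0), (2, "2024-01-02", 250.0)],
    ["customer_id", "event_date", "amount"],
)

# Delta table partitioned on a commonly filtered column (enables partition pruning)
(df.write
   .format("delta")
   .partitionBy("event_date")
   .mode("overwrite")
   .saveAsTable("sales"))

# Compact small files and Z-order on a frequent filter/join column
spark.sql("OPTIMIZE sales ZORDER BY (customer_id)")
```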
2
Q

How do you approach data modeling for a large-scale warehouse (e.g., star vs. snowflake schemas)?

A
  • Hybrid approach
  • Star schema for speed on the analytics layer
  • Snowflake-like normalization for master data and staging layers
  • SCD type 2 where needed
3
Q

How do you handle slowly changing dimensions in your data warehouse design?

A
  • Most enterprise-scale warehouses use Type 2
  • Use MERGE statements in SQL Server and Databricks
  • Track start and end dates, a version or current_flag, created by, and updated at (see the sketch below)
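A minimal sketch of a Type 2 merge in Databricks SQL run from PySpark; `dim_customer`, `stg_customer`, and their columns are hypothetical, and only `address` changes are tracked:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Step 1: expire the current row when a tracked attribute changed
spark.sql("""
    MERGE INTO dim_customer AS tgt
    USING stg_customer AS src
      ON tgt.customer_id = src.customer_id AND tgt.current_flag = true
    WHEN MATCHED AND tgt.address <> src.address THEN
      UPDATE SET current_flag = false, end_date = current_date()
""")

# Step 2: insert a new current version for new or changed customers
spark.sql("""
    INSERT INTO dim_customer
    SELECT src.customer_id, src.address,
           current_date() AS start_date,
           CAST(NULL AS DATE) AS end_date,
           true AS current_flag
    FROM stg_customer AS src
    LEFT JOIN dim_customer AS tgt
      ON src.customer_id = tgt.customer_id AND tgt.current_flag = true
    WHERE tgt.customer_id IS NULL
""")
```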
4
Q

How do you ensure your data solutions stay aligned with evolving enterprise architecture?

A
  • I focus on solving the current business need first without over-engineering
  • But I make sure my solution follows enterprise standards
  • My work aligns with current architecture while staying modular and easy to evolve if things shift
5
Q

How have you contributed to a technology roadmap in the past?

A
  • Yes – data strategy (current state, future state, business requirements, stakeholder assessments)
  • Timelines, SMART goals, resource requirements
6
Q

What are some indicators that a data architecture needs rethinking?

A
  • When data volume grows faster than expected, performance degrades, manual workarounds increase, or tech debt blocks new features.
7
Q

How do you gather data requirements from both technical and non-technical stakeholders?

A
  • Start with stakeholders to understand the business goals behind the data need
  • For non-technical stakeholders, focus on what questions they want answered – keep it outcome focused
  • For technical stakeholders, cover data sources, schemas, and technical constraints
  • Clarify data freshness, granularity, and access requirements (how often data is updated, level of detail, who needs access)
  • If anything is unclear, I prototype queries or mock reports to help visualize the data flow
8
Q

How do you approach requirement gathering for a system you’ve never worked on before?

A
  • High-level questions to understand the purpose of the system and the business outcomes
  • Key stakeholder sessions – find pain points and current workflows, what's working and what's not
  • Review available documentation
  • Map out the data flow and draft assumptions and gaps
8
Q

What types of documentation do you typically produce for a data warehouse project?

A
  • Data models
  • Source-to-target mapping
  • ETL pipeline documentation
  • Data dictionary
  • Access and permission guide
  • Operational runbooks – how to troubleshoot, restart jobs, and respond to pipeline failures
8
Q

How do you ensure stakeholder expectations are aligned with what’s technically feasible?

A
  • Active listening
  • Must-haves vs nice-to-haves
  • Technical constraints are surfaced early
  • Suggest alternatives that meet the same goal when an option isn't feasible
  • Trade-offs are made explicit (e.g., real-time updates trade off against cost and complexity)
9
Q

Describe your process for managing ongoing support and maintenance of a data system.

A
  • Hypercare period with a fast feedback loop (quick triage of issues, communication to stakeholders)
  • Post-hypercare, transition to a support model with knowledge transfer to ops or support teams
10
Q

Have you worked in agile environments? How do you manage data engineering tasks in sprint cycles?

A
  • Yes – we followed sprints, daily standups, and backlog grooming
  • Break engineering work down into small, testable tasks that fit into a sprint
  • Roughly 4-hour chunks for subtasks within a sprint
  • Larger features like an ETL pipeline get split into phases such as schema design, source ingestion, etc.
11
Q

Your Databricks job requires frequent joins between a large fact table and several dimension tables. How would you optimize the join operations to improve performance?

A
  • Broadcast joins for smaller dim tables
  • Partition on the join key
  • Cache dim tables in memory to reduce repeated I/O operations
  • Bucket the tables on the join key to reduce shuffle overhead (see the sketch below)
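A minimal PySpark sketch of the broadcast and caching points; `fact_sales`, `dim_product`, and `product_id` are hypothetical names:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

fact = spark.table("fact_sales")            # large fact table
dim = spark.table("dim_product").cache()    # small dimension, kept in memory for reuse

# Broadcast the small side so the join avoids shuffling the fact table
joined = fact.join(broadcast(dim), on="product_id", how="left")
```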
12
Q

You need to create a Databricks job that reads data from multiple sources (e.g., ADLS, Azure SQL Database, and Cosmos DB), processes it, and stores the results in a unified format. Describe your approach.

A
  • Use Spark connectors to read from each source
  • Standardize the schema across the different sources
  • Apply transformations, aggregations, and joins to integrate the data
  • Write to a unified storage format such as Delta Lake or SQL Server (see the sketch below)
  • Schedule the job using Databricks Jobs or Azure Data Factory for regular execution
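A hedged sketch covering two of the sources (ADLS and Azure SQL over JDBC); every path, connection string, credential, and table name below is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Source 1: Parquet files landed in ADLS
orders = spark.read.parquet("abfss://raw@<storageaccount>.dfs.core.windows.net/orders/")

# Source 2: Azure SQL Database over JDBC
customers = (spark.read.format("jdbc")
             .option("url", "jdbc:sqlserver://<server>.database.windows.net;database=sales")
             .option("dbtable", "dbo.customers")
             .option("user", "<user>")
             .option("password", "<secret>")
             .load())

# Standardize schemas, then integrate with a join
unified = (orders
           .withColumnRenamed("cust_id", "customer_id")
           .join(customers, "customer_id", "left"))

# Write the unified result as a Delta table for downstream use
unified.write.format("delta").mode("overwrite").saveAsTable("curated.orders_unified")
```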
13
Q

Your organization needs to implement a data quality framework in Azure Databricks to ensure the accuracy and consistency of the data. What approach would you take?

A
  • Validation rules to check for data consistency, completeness, and accuracy (see the sketch below)
  • Spark transformations to clean the data
  • Monitoring to track quality and flag anomalies
  • Generate data quality reports
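A minimal sketch of rule-based validation; the table and column names are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("curated.orders_unified")

# Simple completeness/accuracy rules expressed as row counts
checks = {
    "null_customer_id": df.filter(F.col("customer_id").isNull()).count(),
    "negative_amount": df.filter(F.col("amount") < 0).count(),
    "duplicate_order_id": df.count() - df.dropDuplicates(["order_id"]).count(),
}

# Fail fast (or route bad rows to a quarantine table) when any rule is violated
failed = {rule: n for rule, n in checks.items() if n > 0}
if failed:
    raise ValueError(f"Data quality checks failed: {failed}")
```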
14
Q

Scenario: You are experiencing intermittent network issues causing your Databricks job to fail. How would you ensure that the job completes successfully despite these issues?

A
  • Retry logic with backoff for transient failures (see the sketch below)
  • Checkpointing so a restarted job can resume instead of reprocessing from scratch
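A minimal sketch of retry logic around a flaky read; the path is a placeholder, and the broad `except` would normally be narrowed to the specific network error you expect:

```python
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def read_with_retry(path, attempts=3, backoff_seconds=30):
    """Retry a transient failure a few times with growing backoff before giving up."""
    for attempt in range(1, attempts + 1):
        try:
            return spark.read.parquet(path)
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(backoff_seconds * attempt)

df = read_with_retry("abfss://raw@<storageaccount>.dfs.core.windows.net/events/")
```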
15
Q

You need to integrate Azure Databricks with Azure DevOps for continuous integration and continuous deployment (CI/CD) of your data pipelines. What steps would you follow?

A
  • Version control for notebooks and pipeline code
  • CI pipeline to automatically test and validate changes to notebooks
  • CD pipeline to deploy validated notebooks to the Databricks workspace
  • Databricks CLI for integration with Azure DevOps
16
Q

Databricks management plane?

A
  • The set of tools and services used to manage and control the Databricks environment; it includes the Databricks workspace
17
Q

Difference between instance and cluster?

A
  • Instance – a single virtual machine used to run an app or service
  • Cluster – a set of instances that work together to provide higher performance
18
Q

Serverless data processing?

A
  • Process data without needing to worry about the underlying infrastructure
  • Databricks manages the infrastructure and allocates resources as needed
19
Q

What are the main components of Databricks?

A
  • Workspace (organize projects)
  • Clusters (executing code)
  • Notebooks (interactive development)
  • Jobs (scheduling automated workflows)
20
Q

What is apache spark and how does it integrate with databricks

A
  • Spark – an open source distributed computing system
  • Databricks provides a managed Spark environment that simplifies cluster management
21
Q

RDDs in spark

A
  • RDDs (Resilient Distributed Datasets) are Spark's low-level distributed collections
  • DataFrames are built on top of RDDs
  • Low level, no schema
  • No query optimization
22
Q

Dataframes in spark

A
  • DataFrames are distributed collections of data organized into named columns
  • This is what PySpark uses
23
Q

Datasets in Spark

A
  • Datasets are typed, distributed collections of data that provide the benefits of RDDs with the convenience of DataFrames
  • Java and Scala only
24
Q

Data transformations in Spark

A
  • Operations: map, filter, reduce, groupBy, join
25
Q

Lazy evaluation in Spark

A
  • Spark does not immediately execute transformations on an RDD, DataFrame, or Dataset
  • Instead it builds a logical plan of the transformations and only executes when an action is called (see the sketch below)
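A minimal PySpark sketch of lazy evaluation; the transformations only build a plan, and nothing runs until the action at the end (the data is made up):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])

# Transformations: recorded in the logical plan, no job runs yet
filtered = df.filter(F.col("value") > 1)
grouped = filtered.groupBy("key").agg(F.sum("value").alias("total"))

# Action: Catalyst optimizes the plan and Spark actually executes the job
grouped.show()
```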
25
Q

What is the Catalyst optimizer in Spark?

A
  • A query optimization framework in Spark SQL that automatically optimizes the logical and physical execution plans to improve query performance
26
Q

How do you manage Spark applications on Databricks clusters?

A
  • Configure clusters appropriately for the workload
  • Use the Databricks job scheduler to run and monitor jobs
27
Q

How do you create and manage notebooks in Databricks?

A
  • Create them directly in the workspace
  • Key features include cell execution, rich visualizations, version control, and multiple languages
28
Q

What are Delta Lakes, and why are they important?

A
  • Delta Lake is an open source storage layer that brings ACID transactions to Spark
29
Q

How do you handle data partitioning in Spark?

A
  • Use repartition or coalesce to adjust the number of partitions (see the sketch below)
  • Understand the data and the frequent filters before partitioning
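A minimal sketch contrasting repartition and coalesce (the DataFrame is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)

# repartition does a full shuffle and can increase or decrease the partition count
by_key = df.repartition(200, "id")

# coalesce only merges existing partitions (no full shuffle) – useful to cut
# the number of output files before a write
fewer = by_key.coalesce(20)
print(fewer.rdd.getNumPartitions())
```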
30
Q

What is the difference between wide and narrow transformations in Spark?

A
  • Narrow transformations (like map and filter) require no shuffling – each output partition depends on a single input partition
  • Wide transformations (like groupByKey and join) shuffle data across multiple partitions, which is more resource-intensive
31
Q

What is the role of Databricks Runtime?

A
  • Databricks Runtime is the set of core components that run on Databricks clusters, including an optimized version of Apache Spark, libraries, and integrations
  • It improves performance and compatibility with Databricks features
32
Q

How do you secure data and manage permissions in Databricks?

A
  • Role-based access control (RBAC)
  • Secure cluster configurations – Azure Active Directory integration
33
Q

How do you use Databricks to process real-time data?

A
  • Spark Structured Streaming, with sources like Kafka and Event Hubs (see the sketch below)
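A hedged sketch of a Structured Streaming read from Kafka into a Delta table; the broker, topic, checkpoint path, and table name are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "<broker>:9092")
          .option("subscribe", "events")
          .load())

# Kafka delivers key/value as binary, so cast before writing downstream
parsed = events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")

(parsed.writeStream
   .format("delta")
   .option("checkpointLocation", "/tmp/checkpoints/events")
   .toTable("bronze_events"))
```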
34
Q

Method for stakeholder meetings

A
  • MoSCoW method (or weighted scoring):
  • Must have
  • Should have
  • Could have
  • Won't have
35
Q

Managing unrealistic stakeholder expectations

A
  • Show empathy by acknowledging the stakeholder's desires and concerns before explaining why the expectations are unrealistic
  • Address it ASAP – don't let it build up
36
Q

SMART

A
  • Specific, Measurable, Attainable, Relevant, Time-Bound
37
Q

What to ask stakeholders

A
  • Ask them a short questionnaire:
  • What's the actual goal here?
  • Who are the primary users?
  • What's the most annoying thing about x?
  • W6H pattern: What, Where, When, Who, Why, Which (prompts consideration of alternative solutions or options to meet stakeholder needs), and How
38
Q

Two planes in Azure Databricks

A
  • Control plane
  • Compute plane
39
Q

Control plane

A
  • Includes the backend services that Azure Databricks manages in your Azure Databricks account
  • The web application is in the control plane
40
Q

Compute plane

A
  • Where your data is processed
  • There are two types of compute plane: serverless compute and classic Azure Databricks compute
41
Q

Databricks architecture: execution jobs

A
  • Jobs are triggered when you perform an action (collect, write, show)
  • Top-level unit of execution
42
Q

Databricks architecture: clusters

A
  • Clusters are groups of compute resources that hold nodes
43
Q

Databricks architecture: node

A
  • Nodes are the individual compute resources in a cluster
  • There is a Spark driver (master) node and worker nodes
44
Q

Databricks architecture: executor

A
  • Each worker node runs one executor
45
Q

Databricks architecture: driver

A
  • Splits the work into jobs, further divides them into stages and then tasks, and assigns tasks to executors (worker nodes)
46
Q

Databricks architecture: stage

A
  • A chunk of a job that can be executed without needing to shuffle the data
47
Q

Databricks architecture: task

A
  • A task is a single unit of work within a job; tasks run on clusters
  • The smallest unit of execution in Spark, performed on one partition of the data
48
Q

Databricks architecture: orchestration job

A
  • Automates the execution of a notebook
  • A job in Databricks is the automation unit that lets you run your data pipeline or script on a schedule, with parameters, on a cluster, and with logging, retries, and alerts
49
Q

What is a Delta table in Databricks?

A
  • Stored in the Delta format
  • A Delta table is a storage layer built on top of Apache Parquet that brings ACID transactions, versioning, and schema enforcement to data lakes in Databricks using Delta Lake
  • A version-controlled file system backed on disk
50
Q

When to use a DataFrame

A
  • Loading data
  • Temporarily transforming data in a pipeline
  • When you don't need to save the result long term
51
Q

When to use a Delta table

A
  • You want to persist the output of your transformations
  • You can enforce rules automatically (schema enforcement)
  • Time travel – end users can do point-in-time (PIT) queries easily (see the sketch below)
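A minimal sketch of persisting to Delta and doing a point-in-time read; the table and columns are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 500.0)], ["customer_id", "lifetime_value"])

# Persist the transformation result as a Delta table (this becomes version 0)
df.write.format("delta").mode("overwrite").saveAsTable("gold_customer_summary")

# End users can later query the table as of an earlier version or timestamp
v0 = spark.sql("SELECT * FROM gold_customer_summary VERSION AS OF 0")
# spark.sql("SELECT * FROM gold_customer_summary TIMESTAMP AS OF '2024-01-01'")
```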
52
Q

Parquet

A
  • Fine for raw data, but no governance or rollback
53
Q

How would you store your staging files in Databricks?

A
  • Parquet over CSV or JSON
  • Parquet is columnar – it only reads the columns you query
  • CSV and JSON are row based, so Spark has to read everything
  • Parquet is highly compressed
  • Parquet has schema support
  • You can easily convert Parquet to Delta later (see the sketch below)
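A minimal sketch of staging as Parquet and converting to Delta later; the ADLS path is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "2024-01-01")], ["order_id", "order_date"])

staging_path = "abfss://staging@<storageaccount>.dfs.core.windows.net/orders/"
df.write.mode("overwrite").parquet(staging_path)

# Later, promote the staged Parquet data to a governed Delta table in place
spark.sql(f"CONVERT TO DELTA parquet.`{staging_path}`")
```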