How do you optimize performance in a Databricks pipeline?
- Delta tables are optimized for Spark (ACID transactions, schema enforcement)
- Partitioning on common filter columns (like date)
- Z-order indexing – improves read performance by clustering data files based on access patterns (see the sketch after this list)
- Spark UI to find memory issues or skewed tasks
- Job clusters that spin up and down for the duration of the job
- Wide transformations (groupBy, distinct, join) can cause excessive shuffles, so minimize or restructure them where possible
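A minimal sketch of the partitioning and Z-ordering points, assuming a Delta table on Databricks; the table and column names (raw.events, event_date, customer_id) are illustrative only:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rewrite a table as Delta, partitioned by a commonly filtered date column
(spark.table("raw.events")                     # hypothetical source table
      .write.format("delta")
      .partitionBy("event_date")               # hypothetical partition column
      .mode("overwrite")
      .saveAsTable("analytics.events"))

# Compact small files and Z-order by a frequently filtered column
spark.sql("OPTIMIZE analytics.events ZORDER BY (customer_id)")
```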
How do you approach data modeling for a large-scale warehouse (e.g., star vs. snowflake schemas)?
- Hybrid approach
- Star schema for speed on the analytics layer (a sample layout is sketched below)
- Snowflake-like normalization for master data and staging layers
- SCD Type 2 where needed
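A hedged sketch of a star-schema analytics layer as Delta tables; the tables and columns (fact_sales, dim_product, date_key, etc.) are illustrative, not a prescribed model:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Denormalized dimension table (star schema keeps attributes in one place)
spark.sql("""
CREATE TABLE IF NOT EXISTS dim_product (
    product_key  BIGINT,
    product_name STRING,
    category     STRING,
    brand        STRING
) USING DELTA
""")

# Fact table referencing the dimensions by surrogate keys
spark.sql("""
CREATE TABLE IF NOT EXISTS fact_sales (
    sales_key    BIGINT,
    product_key  BIGINT,  -- joins to dim_product
    date_key     INT,     -- joins to a dim_date table
    quantity     INT,
    amount       DECIMAL(18,2)
) USING DELTA
""")
```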
How do you handle slowly changing dimensions in your data warehouse design?
- Most enterprise-scale warehouses use Type 2
- Use MERGE statements in SQL Server and Databricks (see the sketch below)
- Track start and end dates, a version or current_flag, created_by, updated_at
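A hedged sketch of an SCD Type 2 upsert in Databricks, assuming a hypothetical dim_customer target and staging_customer source; step 1 closes out changed current rows, step 2 inserts the new versions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Step 1: close out the current row when a tracked attribute has changed
spark.sql("""
MERGE INTO dim_customer AS tgt
USING staging_customer AS src
  ON tgt.customer_id = src.customer_id AND tgt.current_flag = true
WHEN MATCHED AND tgt.customer_name <> src.customer_name THEN
  UPDATE SET current_flag = false,
             end_date = current_date()
""")

# Step 2: insert a new current version for customers with no open row
spark.sql("""
INSERT INTO dim_customer
SELECT src.customer_id,
       src.customer_name,
       current_date()      AS start_date,
       CAST(NULL AS DATE)  AS end_date,
       true                AS current_flag
FROM staging_customer AS src
LEFT JOIN dim_customer AS tgt
  ON tgt.customer_id = src.customer_id AND tgt.current_flag = true
WHERE tgt.customer_id IS NULL
""")
```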
How do you ensure your data solutions stay aligned with evolving enterprise architecture?
- I focus on solving the current business need first without over-engineering
- But I make sure my solution follows enterprise standards
- My work aligns with current architecture while staying modular and easy to evolve if things shift
How have you contributed to a technology roadmap in the past?
- YES – Data strategy (current state, future state, business requirements, stakeholder assessments)
- Timelines, smart goals, resource requirements
What are some indicators that a data architecture needs rethinking?
- When data volume grows faster than expected, performance degrades, manual workarounds increase, or tech debt blocks new features.
How do you gather data requirements from both technical and non-technical stakeholders?
- Talk to stakeholders to understand the business goals behind the data need
- For non-technical stakeholders, focus on what questions they want answered – keep it outcome-focused
- For technical stakeholders, focus on data sources, schemas, and technical constraints
- Clarify data freshness, granularity, and access requirements (how often data updated, level of detail, who needs access)
- If anything is unclear, I prototype queries or mock reports to help visualize the data flow
How do you approach requirement gathering for a system you’ve never worked on before?
- High-level questions to understand the purpose of the system and business outcomes
- Key stakeholder sessions – find pain points and current workflows, what's working and what's not
- Review available documentation
- Map out the data flow and draft assumptions and gaps
What types of documentation do you typically produce for a data warehouse project?
- Data models
- Source to target mapping
- ETL pipeline documentation
- Data dictionary
- Access and permission guide
- Operational runbooks – how to troubleshoot, restart jobs, respond to failures in pipeline
How do you ensure stakeholder expectations are aligned with what’s technically feasible?
- Active listening
- Must-haves vs. nice-to-haves
- Technical constraints are known early
- Suggest alternatives that meet the same goal when an option isn't feasible
- Trade-offs are known (e.g., real-time updates trade off against cost and complexity)
Describe your process for managing ongoing support and maintenance of a data system.
- Hypercare period with fast feedback loop (quick triage of issues, communication to stakeholders)
- Post-hypercare, transition to a support model with knowledge transfer to ops or support teams
Have you worked in agile environments? How do you manage data engineering tasks in sprint cycles?
- Yes, we followed sprints, daily standups, and backlog grooming
- Break down engineering work into small, testable tasks that fit into a sprint
- Break subtasks into roughly 4-hour chunks within a sprint
- For larger features like an ETL pipeline, the work gets split into phases (schema design, source ingestion, etc.)
Your Databricks job requires frequent joins between a large fact table and several dimension tables. How would you optimize the join operations to improve performance?
- Broadcast joins for smaller dimension tables (see the sketch below)
- Partition on the join key
- Cache dimension tables in memory to reduce repeated I/O
- Bucket the tables on the join key to reduce shuffle overhead
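A minimal sketch of the broadcast and caching points, assuming hypothetical fact/dimension tables (fact_sales, dim_customer, dim_product) joined on surrogate keys:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

fact = spark.table("fact_sales")                       # large fact table
dim_customer = spark.table("dim_customer").cache()     # small dimension reused across joins
dim_product = spark.table("dim_product")

enriched = (fact
    .join(broadcast(dim_customer), "customer_key")     # broadcast avoids shuffling the fact table
    .join(broadcast(dim_product), "product_key"))

enriched.write.format("delta").mode("overwrite").saveAsTable("analytics.sales_enriched")
```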
You need to create a Databricks job that reads data from multiple sources (e.g., ADLS, Azure SQL Database, and Cosmos DB), processes it, and stores the results in a unified format. Describe your approach.
- Spark connectors to read each source
- Standardize the schema across the different data sources
- Apply transformations, aggregations, and joins to integrate the data
- Write to a unified storage format (e.g., Delta Lake or SQL Server) – see the sketch below
- Schedule the job using Databricks Jobs or Azure Data Factory for regular execution
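A hedged sketch of the approach for two of the sources (ADLS and Azure SQL via JDBC); Cosmos DB would be read similarly through its Spark connector. Paths, connection values, and table/column names are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# ADLS (Parquet files) – path is a placeholder
adls_df = spark.read.parquet("abfss://container@account.dfs.core.windows.net/orders/")

# Azure SQL Database via the JDBC connector – connection values are placeholders
sql_df = (spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://server.database.windows.net:1433;database=sales")
    .option("dbtable", "dbo.orders")
    .option("user", "<user>")
    .option("password", "<password>")
    .load())

# Standardize both sources to a common schema, then union
common_cols = ["order_id", "customer_id", "amount", "order_date"]
unified = (adls_df.select(*[col(c) for c in common_cols])
           .unionByName(sql_df.select(*[col(c) for c in common_cols])))

# Land the unified result in Delta Lake
unified.write.format("delta").mode("append").saveAsTable("curated.orders")
```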
Your organization needs to implement a data quality framework in Azure Databricks to ensure the accuracy and consistency of the data. What approach would you take?
- Validation rules to check for data consistency, completeness, and accuracy (see the sketch below)
- Spark transformations to clean the data
- Monitoring to track quality and anomalies
- Generate reports
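A minimal sketch of rule-based validation; the rules, table names (curated.orders, quality.*), and columns are illustrative assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.table("curated.orders")   # hypothetical input table

# Completeness/accuracy rules expressed as a failure predicate
fail_cond = (F.col("order_id").isNull()
             | F.col("amount").isNull()
             | (F.col("amount") < 0))

invalid = df.filter(fail_cond)
valid = df.filter(~fail_cond)

# Route failures to a quarantine table for monitoring and reporting
invalid.write.format("delta").mode("append").saveAsTable("quality.orders_rejected")
valid.write.format("delta").mode("overwrite").saveAsTable("quality.orders_clean")

# Simple metric that could feed a quality dashboard or alert
print(f"rejected rows: {invalid.count()}")
```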
Scenario: You are experiencing intermittent network issues causing your Databricks job to fail. How would you ensure that the job completes successfully despite these issues?
- Retry logic with backoff around the fragile steps (see the sketch below)
- Checkpointing so the job can resume from the last successful state instead of restarting from scratch
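A hedged sketch of retry-with-backoff around a flaky network-bound step; the attempt counts, wait times, and the wrapped step are illustrative:

```python
import time

def run_with_retries(step, max_attempts=3, base_wait_seconds=30):
    """Run a callable, retrying with exponential backoff on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception as exc:          # narrow this to network/transient errors in practice
            if attempt == max_attempts:
                raise
            wait = base_wait_seconds * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {wait}s")
            time.sleep(wait)

# Example: wrap only the fragile network-bound step of the job
# run_with_retries(lambda: spark.read.parquet("abfss://<container>@<account>.dfs.core.windows.net/input/")
#                              .write.format("delta").mode("append").saveAsTable("curated.events"))
```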
You need to integrate Azure Databricks with Azure DevOps for continuous integration and continuous deployment (CI/CD) of your data pipelines. What steps would you follow?
- Version control
- CI pipeline to automatically test and validate changes to notebooks
- CD pipeline to deploy validated notebooks to the Databricks workspace
- Databricks CLI for integration with Azure DevOps
Databricks management plane?
- Set of tools and services used to manage and control the Databricks environment; it includes the Databricks workspace
Difference between instance and cluster?
- Instance – a single virtual machine used to run an app or service
- Cluster – a set of instances that work together to provide higher performance
Serverless data processing?
- Process data without needing to worry about the underlying infrastructure
- Databricks manages the infrastructure and allocates resources as needed
What are the main components of Databricks?
- Workspace (organize projects)
- Clusters (executing code)
- Notebooks (interactive development)
- Jobs (scheduling automated workflows)
What is Apache Spark and how does it integrate with Databricks?
- Spark – open-source distributed computing system
- Databricks provides a managed Spark environment that simplifies cluster management
RDDs in Spark?
- RDDs (Resilient Distributed Datasets)
- DataFrames are built on top of RDDs
- Low level, no schema
- No query optimization (not handled by the Catalyst optimizer)
DataFrames in Spark?
- DataFrames are distributed collections of data organized into named columns
- This is the primary API used in PySpark (see the sketch below)
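A tiny sketch contrasting the two APIs on the same sample data; the rows and column names are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rows = [("alice", 34), ("bob", 45)]

# RDD: low level, no schema, positional access, no Catalyst optimization
rdd = spark.sparkContext.parallelize(rows)
over_40_rdd = rdd.filter(lambda r: r[1] > 40)

# DataFrame: named columns, built on top of RDDs, optimized by Catalyst
df = spark.createDataFrame(rows, ["name", "age"])
df.filter(df.age > 40).show()
```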