BPS Flashcards

(8 cards)

1
Q
  • Engineered ETL pipelines in Python to ingest and transform over 500K daily transactions, orchestrated via Airflow and loaded into Snowflake to power downstream analytics and reporting.
A

In my most recent role, I was responsible for building and maintaining ETL pipelines that handled our company’s daily transaction data. We were processing around 500,000 transactions every day, which included everything from customer purchases to refunds and adjustments. The data came from multiple sources: our main ERP, JDA (Blue Yonder), our Canadian ERP, SAP, and flat files supplied by our call centre.

In terms of transformations and orchestration, I used Airflow in combination with dbt. I set up DAGs that would first extract the raw data and handle the initial ingestion into Snowflake’s staging tables. Then I used dbt for the data transformations.

The dbt models would handle things like applying business logic and creating the final tables the business teams needed, so we ended up with clean, transformed data loaded into Snowflake for our downstream analytics and reporting.

One thing I learned was the importance of ensuring reliability and performance. I broke the DAG into discrete tasks so that each step had to complete successfully before the next one started.
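To make that concrete, here is a minimal sketch of that kind of DAG. The DAG id, task names, and dbt command are illustrative, not the production code:

```python
# Minimal sketch of a daily extract -> stage -> dbt DAG; names are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def extract_raw_transactions():
    """Pull the day's transaction extracts from the source systems."""
    ...


def load_to_staging():
    """Land the raw extracts in Snowflake staging tables."""
    ...


with DAG(
    dag_id="daily_transactions",            # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_raw", python_callable=extract_raw_transactions)
    stage = PythonOperator(task_id="load_staging", python_callable=load_to_staging)
    transform = BashOperator(task_id="dbt_run", bash_command="dbt run")

    # Each task must succeed before the next one starts.
    extract >> stage >> transform
```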

I also used tools and strategies to speed up performance, like hashing rows for upserts. Comparing a single hash per row replaced the many-to-many column comparisons we had before, cutting the job from over 3 hours to about 15 minutes.
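As a rough illustration of the hashing idea (the column names and delimiter are made up), the pipeline can store one hash per row and compare only that hash during the merge:

```python
import hashlib

import pandas as pd


def add_row_hash(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """Derive a single hash column from the tracked business columns."""
    concatenated = df[cols].astype(str).agg("|".join, axis=1)
    df["row_hash"] = concatenated.map(lambda s: hashlib.md5(s.encode()).hexdigest())
    return df


# During the upsert, rows are matched on their natural key and updated only
# when source.row_hash != target.row_hash: one comparison per row instead of
# comparing every column of every candidate row.
```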

The challenging part, however, was understanding the needs of the end users. It was a lengthy process to comprehensively capture the business requirements from the finance, business intelligence, inventory, and other teams. That said, I think this was the most significant part, because it allowed me to engineer and implement a solution that truly fit our organization.

2
Q
  • Led the end-to-end design and implementation of a robust data infrastructure using Airflow and dbt, powering executive dashboards and reporting systems across the organization. Reduced manual reporting cycles by 70%.
A

One of the projects I’m most proud of was completely redesigning our data infrastructure from the ground up. When I started, the organization was running everything through Windows Task Scheduler on a Virtual Desktop Infrastructure (VDI), which was really limiting our capabilities and causing a lot of headaches.

The biggest pain point was that our reporting was largely manual. Teams across the organization - finance, operations, executives - were spending hours each week pulling data, manipulating it in Excel, and creating reports. It was not only time-consuming but also prone to errors and inconsistencies.

I proposed moving to a modern data stack built around Airflow and dbt running on a Linux server. The implementation took several months because I wanted to make sure we got it right. I set up Airflow to handle all the orchestration - scheduling jobs, managing dependencies, and monitoring pipeline health. Then I used dbt for the actual data transformations.
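For a sense of what the orchestration side involves, here is a hedged sketch of the kind of retry and alerting defaults I mean; the owner, email address, retry counts, and DAG name are placeholders rather than the actual production settings:

```python
# Illustrative Airflow defaults for retries and failure alerting.
from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    "owner": "data-engineering",             # placeholder owner
    "retries": 2,                            # retry a flaky task before failing the run
    "retry_delay": timedelta(minutes=10),
    "email_on_failure": True,                # alert the team instead of failing silently
    "email": ["data-team@example.com"],      # placeholder address
}

with DAG(
    dag_id="nightly_reporting",              # hypothetical DAG name
    default_args=default_args,
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ...  # extraction, dbt runs, and dashboard refresh tasks would go here
```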

The data flowed into Snowflake, and from there, I worked closely with different teams to understand their specific reporting needs. I collaborated with them to build Power BI dashboards that pulled directly from our clean, transformed data in Snowflake.

The impact was significant - we reduced manual reporting cycles by 70%. What used to take someone hours every week now happened automatically overnight.

The transition wasn’t without challenges though. There was definitely a learning curve for some team members, and I spent a lot of time training people and documenting processes to make sure everyone felt comfortable with the new system.

3
Q
  • Spearheaded a cloud migration initiative to move over 10 years of historical sales data from SAP to Snowflake on Google Cloud Platform (GCP), designing ingestion pipelines and scalable data models that laid the foundation for analytics modernization and cloud-based reporting.
A

One of the most complex projects I led was migrating over 10 years of historical sales data from our legacy SAP system to Snowflake. This was a critical initiative because our analytics capabilities were really limited by the old system, and leadership wanted to modernize our entire reporting infrastructure.

The scope was millions of sales records spanning a decade. The data structures in SAP were incredibly unintuitive and difficult to work with, unlike any system I had used before.

I started by doing a comprehensive data audit to understand exactly what we were working with. I worked closely with our SAP administrators and business users to map out all the critical tables and understand the relationships between them.

To optimize the bulk loading process, I converted the extracted data to Parquet format and leveraged Snowflake’s vectorized scanner. This approach allowed the COPY commands to only load relevant columns into memory, significantly reducing compute resources and speeding up the ingestion process. The performance gains from using vectorized Parquet ingestion were substantial - what we initially estimated would take several months of weekend bulk loads ended up completing much faster, allowing us to finish the entire data migration project 1 month ahead of our proposed schedule.
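A rough sketch of that bulk-load pattern using snowflake-connector-python is below. The file, stage, table, and credential values are placeholders, and the real pipeline handled batching and error handling that is omitted here:

```python
import pandas as pd
import snowflake.connector

# 1. Convert an extracted batch to Parquet (columnar and compressed);
#    pandas uses pyarrow under the hood for this.
batch = pd.read_csv("sap_sales_extract_2014.csv")          # hypothetical extract file
batch.to_parquet("sap_sales_extract_2014.parquet", index=False)

conn = snowflake.connector.connect(
    account="...", user="...", password="...",             # pulled from a secrets store
    warehouse="LOAD_WH", database="ANALYTICS", schema="STAGING",
)
cur = conn.cursor()

# 2. Upload to the table's internal stage, then bulk load with the
#    vectorized Parquet scanner so only referenced columns are read.
cur.execute("PUT file://sap_sales_extract_2014.parquet @%HISTORICAL_SALES")
cur.execute("""
    COPY INTO HISTORICAL_SALES
    FROM @%HISTORICAL_SALES
    FILE_FORMAT = (TYPE = PARQUET USE_VECTORIZED_SCANNER = TRUE)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""")
```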

The biggest challenge was ensuring data integrity throughout the process. I implemented validation checks at multiple stages and worked with business users to verify that key metrics matched between the old and new systems. We ran parallel systems for a few weeks to build confidence before fully cutting over.

The end result was that we moved from a system where historical reports had to be prepared days in advance to one where the same queries run in minutes, and it opened up possibilities for advanced analytics that just weren’t feasible before.

4
Q
  • Built and deployed an ML-based Random Forest model using scikit-learn to impute missing retail traffic data, automating historical backfill and daily updates within a scheduled pipeline.
A

We had a pretty frustrating data quality issue that was affecting our retail analytics. We were tracking daily foot traffic across all our stores - basically how many people entered each location each day.

The problem was that our traffic counting system wasn’t 100% reliable. We didn’t know exactly why or how data wasn’t captured, and we’d end up with missing data for certain days at certain stores. This was really problematic because our analytics teams needed complete datasets to do meaningful trend analysis and forecasting.

Initially, our analytics team suggested manually estimating the missing values or excluding those days entirely, but neither option was feasible.

I decided to build a machine learning model to predict what the traffic should have been on those missing days. I used a random forest regressor, which trains many decision trees on different subsets of our existing data and averages their predictions to produce a value we could use to fill each gap.

I implemented it using scikit-learn, which made the development process pretty straightforward.

The trickier part was integration. I needed this to work both for historical backfill and for ongoing daily updates when new gaps appeared.

So I built it into our existing data pipeline using Airflow. Whenever the retail traffic data was due to update, the pipeline would automatically detect missing values, run the model to generate predictions, and flag the imputed data so analysts knew which values were estimates and which were actual measurements.
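Put together, the imputation step looks roughly like the sketch below. The column names, features, and flag column are illustrative, and in practice the data would come from and return to Snowflake rather than a CSV:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("store_traffic.csv")        # assumed columns: store_id, date, traffic

# Simple calendar features the model can learn seasonality from;
# a categorical store_id would need encoding in a real pipeline.
dates = pd.to_datetime(df["date"])
df["day_of_week"] = dates.dt.dayofweek
df["month"] = dates.dt.month
features = ["store_id", "day_of_week", "month"]

# Flag the gaps before filling them so analysts can tell estimates from actuals.
df["is_imputed"] = df["traffic"].isna()
known = df[~df["is_imputed"]]
missing = df[df["is_imputed"]]

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(known[features], known["traffic"])

if not missing.empty:
    df.loc[missing.index, "traffic"] = model.predict(missing[features])
```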

The business impact was significant because now our analytics teams could run reports with confidence, knowing they had complete datasets to work with. It also freed up time that people were spending manually dealing with these data gaps.

5
Q
  • Designed and implemented the team’s first dedicated testing and development environment for ETL pipelines, enabling safe experimentation, faster debugging, and more stable deployments.
A

One of the key initiatives I led was setting up our team’s first dedicated development and testing environment for our ETL pipelines. Before this, we were doing a lot of work directly against production or in a very ad hoc way, which was risky and made it hard to safely experiment or test changes.

To address that, I created a development database that was essentially a sanitized copy of our production environment. I also deployed a separate instance of Airflow configured to run our DAGs specifically against that dev database. That setup allowed us to test new pipelines, debug issues, and validate changes end-to-end without worrying about affecting live data or production workflows.

Beyond the infrastructure, I wanted to make sure it was usable and scalable for the whole team, so I also developed internal libraries that allowed our code to dynamically switch connections based on the environment — dev or prod — which made our deployments more reliable and reduced the risk of misconfigurations.
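The helper boiled down to something like the sketch below; the environment variable and connection IDs are made-up names for illustration:

```python
import os

# Map each environment to its Airflow connection (names are illustrative).
SNOWFLAKE_CONN_IDS = {
    "dev": "snowflake_dev",      # points at the sanitized development database
    "prod": "snowflake_prod",
}


def get_snowflake_conn_id() -> str:
    """Pick the connection for the current environment, defaulting to dev."""
    env = os.environ.get("ETL_ENV", "dev")
    return SNOWFLAKE_CONN_IDS[env]
```

Defaulting to dev meant a misconfigured job could never accidentally point at production.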

To help onboard others and standardize our process, I also wrote a detailed document outlining the ETL developer framework we were adopting. It covers the structure we follow, why we follow it, and best practices for engineering maintainable, testable pipelines.

This setup ended up being a big improvement for the team — it gave us confidence in our changes, significantly cut down the time we spent debugging issues in production, and laid the groundwork for more consistent engineering practices going forward.

6
Q
  • Assisted with configuring and implementing a DevSecOps-enabled development and testing environment for the ETL ecosystem, integrating Aqua, CyberArk, and protocols for secure code reviews to ensure the protection of sensitive user data.
A

I was involved in configuring and implementing a DevSecOps-ready development and testing environment for our ETL ecosystem, with a strong focus on security and compliance. A major part of that work was integrating Aqua Security and CyberArk into our pipeline.

For Aqua, we used it across our multi-cloud environment — mainly Azure and GCP — to scan our container images and dependencies during build time. We also implemented Aqua’s supply chain scanning features, which helped us catch vulnerabilities in our base images and third-party packages before they ever reached deployment. That gave us much better visibility and control over the security posture of our ETL workloads, which are often containerized and cloud-deployed.

On the secrets side, we integrated CyberArk into our Linux environments, specifically for managing and rotating access keys used in our jobs. These keys were stored and rotated in CyberArk’s Cloud, and our workloads interfaced with them via their connector management portal.

We also aligned this setup with secure code review protocols — making sure all changes went through peer-reviewed pipelines with security scanning hooks before merging. The overall goal was to ensure that the ETL development lifecycle was not just fast and functional, but also secure by design, especially given the sensitivity of the data we handled.

7
Q
  • Recognized by senior leadership for creating comprehensive documentation on ETL scripts, pipeline architecture, and data models, significantly improving cross-team collaboration.
A

One initiative I’m really proud of was improving the visibility and understanding of our ETL processes and data models through comprehensive documentation.

This came out of a pretty common pain point: teams across the org were working with data they didn’t fully understand (what pipelines were doing, what certain fields meant, where the data was coming from), and it was creating friction, especially between engineering, analytics, and product.

To address this, I took the lead in documenting our ETL scripts, pipeline architecture, and data models.

I used Confluence for high-level architectural diagrams, pipeline overviews, and walkthroughs of common workflows — making it easy for non-engineers to navigate.

For the technical side, I leveraged dbt’s built-in doc generation. That let us publish a static website where stakeholders could browse through models, sources, columns, and even field-level metadata — complete with descriptions, data types, and lineage.
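The published site comes out of dbt’s standard doc commands; a minimal way to regenerate it from a scheduled job might look like this (the project path is a placeholder):

```python
import subprocess

# Rebuild the dbt docs artifacts (manifest.json, catalog.json, and the static
# index.html under target/) so the published site stays in sync with the models.
subprocess.run(["dbt", "docs", "generate"], cwd="/opt/dbt/project", check=True)
```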

This effort was recognized by senior leadership because it really improved cross-team communication and self-sufficiency. Analysts no longer had to ask engineering what a field meant, and product managers could better understand how data flowed through the system.

It also helped reinforce our company’s goal of becoming more data-driven — making sure that not only was the data accessible, but also that people trusted and understood it.

The most difficult aspect of this was that I had to get buy-in from different teams, standardize how we describe our models, and even add some CI steps to make sure the dbt docs stayed up to date. But it’s been rewarding to see how it’s helped bridge gaps between teams.

8
Q

Tell me about your experience with Bass Pro Shops

A

I had a really rewarding experience at Bass Pro Shops as an ETL Developer. When I joined, they were dealing with some significant data infrastructure challenges that were affecting the entire organization.

The core of my role was building and maintaining ETL pipelines for transaction data, around 500,000 events a day, from multiple sources - their main ERP system, JDA (Blue Yonder), their Canadian SAP system, and flat files from the call center.

I used Airflow for orchestration and dbt for transformations, loading everything into Snowflake to power their analytics and reporting.

But the bigger challenge was that their entire data infrastructure was outdated. They were running everything through Windows Task Scheduler on a virtual desktop, and teams across the organization - ETL, finance, operations, executives - were spending hours each week resolving issues caused by this unreliable setup.

I ended up leading a complete redesign of their data infrastructure, moving to a modern stack with Airflow, dbt, and a dedicated test environment. This in combination with new pipelines reduced manual reporting cycles by 70%, and what used to take people hours every week now happened automatically overnight.
