Interview questions Flashcards by Emilia van der Kooy

What is ETL?

Extract data from source(s), Transform the data (apply changes), Load into a central database. This instead of having to combine the data every time you want to load it and use it.

How well did you know this?

Not at all

Perfectly

What are ways of transforming the data?

Apply data cleansing
Removing duplicates
Adding keys and IDs
Combining and merging data from different sources

How well did you know this?

Not at all

Perfectly

What ETL tools are there?

SQL
Oracle Database Integrator (ELT!)
Oracle Warehouse Builder

How well did you know this?

Not at all

Perfectly

What is ODI?

It uses ELT instead of ETL, where data is first loaded and then transformed => more efficient for large datasets. Also more flexible as it is compatible with multiple data sources and technologies, and has good integration with Oracle databases.

How well did you know this?

Not at all

Perfectly

How do you automate the ETL process?

ETL scheduling tools: eg. ODI to automate jobs
Scripting: SQL or Python scripts to run processing automatically
Monitoring: get an alert when processes fail

How well did you know this?

Not at all

Perfectly

What is the difference between ODI and OWB?

ODI uses ELT, OWB uses ETL
ODI pushes transformations to the database, OWB performs transformations in a separate ETL engine before loading data into the warehouse
ODI supports multiple databases, OWB is designed mainly for Oracle-based data warehouses
ODI is more scalable than OWB

How well did you know this?

Not at all

Perfectly

How to ensure privacy of your data during ETL?

encrypt data at every stage
redaction (hiding sensitive parts of data)
anonymization (modify data) or tokenization (replace values with random ones)
logging and monitoring (unauthorized access attempts, anomalies)

How well did you know this?

Not at all

Perfectly

What are best practices for data anonymization?

data masking: replace sensitive info with fictional values (XXX - 339)
pseudonimizing: replace identified data with pseudonyms, eg. ID numbers
generalization: make data less specific (birth year instead of birth day)
adding noise

How well did you know this?

Not at all

Perfectly

How to optimize a slow SQL query?

indexing: make sure columns that you use often (WHERE, JOIN) have indices
only use the columns you need instead of SELECT *
use JOINS instead of a subquery
avoid unnessecary joins and use INNER JOIN where possible
use query analysis (SQL trace) to find problems

How well did you know this?

Not at all

Perfectly

What is INNER JOIN?

Only returns the rows that are in both tables

How well did you know this?

Not at all

Perfectly

What is LEFT JOIN?

Returns all rows that are in the left table, and a NULL value if there are no matches for the right table

How well did you know this?

Not at all

Perfectly

What is FULL JOIN?

Combines all rows of the left and right table, and a NULL value if there are no matches in both tables.

How well did you know this?

Not at all

Perfectly

How does indexing work in databases?

It speeds up queries in large tables.
For a certain column (eg. names) you create a sorted list with the block and index number of each name (B-tree). You can then look up the exact data block that contains the record you want and only load that into memory, and you don’t need a full table scan.

Only do this for columns you use a lot and if the data is not updated regularly!

How well did you know this?

Not at all

Perfectly

What is data modelling?

Designing and structuring data to define how it’s stored, organized and accessed in the database system.
- conceptual model (what data, table names)
- logical model (how data is structured: attributes, primary keys, relationships)
- physical model (exact database structure: data types, indexes, constraints)

How well did you know this?

Not at all

Perfectly

What is the difference between structured and unstructured data?

Structured data: organized in predefined format (table, row, column), stored in relational databses, eg. transaction data, customer records, sales reports

Unstructured data: without fixed format, stored in data lakes or file systems, eg. emails, images, social media posts, requires processing tools

How well did you know this?

Not at all

Perfectly

What is the difference between a data warehouse and an operational database?

Study These Flashcards

Data warehouse: optimized for analytics, storing large volumes of historical data, complex queries, rarely updated

Operational database: handles real-time updates and updates, fast read/write transactions, frequently updated

What does a data engineer do?

Study These Flashcards

Building, maintaining and optimizing the data infrastructure that enables data collection, storage and analysis.

What is the difference between SQL and NoSQL?

Study These Flashcards

NoSQL is for non-relational databases. It is flexible, instead of structured (SQL) and the schema can evolve over time. It is used for semi-structured or unstructured data, for example storing real-time chat messages or user activity logs.

How do you replace null values with 0 in SQL?

Study These Flashcards

Use the COALESCE function

How to check if a variable is uneven in SQL?

Study These Flashcards

WHERE MOD(variable, 2) = 1

How do you count the number of rows that have a certain value in a certain column in SQL?

Study These Flashcards

COUNT(CASE WHEN state = ‘approved’ THEN id END)
or you could do
SUM(CASE WHEN state = ‘approved’ THEN 1 ELSE 0 END)
or if you use MySQL you can just do
SUM(state = ‘approved)

How do you format a date in SQL?

Study These Flashcards

MySQL: DATE_FORMAT(date, ‘%Y-%m’)
Oracle: TO_CHAR(date, ‘YYYY-MM’)

Waarom is ETL belangrijk?

Study These Flashcards

Het zorgt ervoor dat de ruwe data bruikbaar en betrouwbaar wordt voor data analyse

Hoe optimaliseer je een groot ETL-proces?

Study These Flashcards

gebruik bulk-inserts ipv row by row
meerdere threads gebruiken, parallelliseren
indexeren van kolommen
ODI gebruiken voor efficiëntere migratie

Wat is het verschil tussen batch ETL en streaming ETL?

Batch: data wordt periodiek verwerkt (bv dagelijks of wekelijks) Streaming: data wordt real-time verwerkt

Wat is normalisatie in databases?

Data opsplitsen in meerdere tabellen om overbodige data te verminderen. - minimize redundancy - easier to understand - imrove data integrity

Wat is denormalisatie in databases?

Data combineren in 1 tabel om query performance te verbeteren

Hoe ga je om met ontbrekende waarden?

- Verwijderen (als het weinig is) - Vervangen door de mediaan of het gemiddelde - Machine learning gebruiken (KNN)

Hoe zou je een ETL pipeline ontwerpen voor een verzekeringsmaatschappij?

Data bronnen: claims, polisgegevens, klantendossiers Exctract: gebruik ODI om data uit verschillende bronnen te halen Transform: data schoonmaken, duplicaten verwijderen, standaardisatie Load: geoptimaliseerd opslaan in Data Warehouse

Hoe ga je om met miljoenen dagelijkse updates in een database?

- partitioning toepassen om queries sneller te maken - batch updates ipv row by row - incremental updates - caching gebruiken voor veelgevraagde data

Wat weet je over Oracle Health Insurance?

Het is een aparte afdeling die systemen (core administration software) ontwikkelt voor zorgverzekeraars, waar ze hun administratie, declaraties en financiën efficiënt kunnen beheren. Focus ligt op werken met veel gevoelige data en performance, betrouwbaarheid en regelgeving.

What is 1NF?

- each column contains atomic values (not lists) - each row has a unique identifier (primary key)

Interview questions Flashcards

(32 cards)