Interview questions Flashcards

(32 cards)

1
Q

What is ETL?

A

Extract data from source(s), Transform the data (apply changes), Load into a central database. This instead of having to combine the data every time you want to load it and use it.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What are ways of transforming the data?

A

Apply data cleansing
Removing duplicates
Adding keys and IDs
Combining and merging data from different sources

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What ETL tools are there?

A

SQL
Oracle Database Integrator (ELT!)
Oracle Warehouse Builder

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is ODI?

A

It uses ELT instead of ETL, where data is first loaded and then transformed => more efficient for large datasets. Also more flexible as it is compatible with multiple data sources and technologies, and has good integration with Oracle databases.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

How do you automate the ETL process?

A
  • ETL scheduling tools: eg. ODI to automate jobs
  • Scripting: SQL or Python scripts to run processing automatically
  • Monitoring: get an alert when processes fail
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What is the difference between ODI and OWB?

A
  • ODI uses ELT, OWB uses ETL
  • ODI pushes transformations to the database, OWB performs transformations in a separate ETL engine before loading data into the warehouse
  • ODI supports multiple databases, OWB is designed mainly for Oracle-based data warehouses
  • ODI is more scalable than OWB
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

How to ensure privacy of your data during ETL?

A
  • encrypt data at every stage
  • redaction (hiding sensitive parts of data)
  • anonymization (modify data) or tokenization (replace values with random ones)
  • logging and monitoring (unauthorized access attempts, anomalies)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What are best practices for data anonymization?

A
  • data masking: replace sensitive info with fictional values (XXX - 339)
  • pseudonimizing: replace identified data with pseudonyms, eg. ID numbers
  • generalization: make data less specific (birth year instead of birth day)
  • adding noise
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

How to optimize a slow SQL query?

A
  • indexing: make sure columns that you use often (WHERE, JOIN) have indices
  • only use the columns you need instead of SELECT *
  • use JOINS instead of a subquery
  • avoid unnessecary joins and use INNER JOIN where possible
  • use query analysis (SQL trace) to find problems
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What is INNER JOIN?

A

Only returns the rows that are in both tables

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is LEFT JOIN?

A

Returns all rows that are in the left table, and a NULL value if there are no matches for the right table

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What is FULL JOIN?

A

Combines all rows of the left and right table, and a NULL value if there are no matches in both tables.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

How does indexing work in databases?

A

It speeds up queries in large tables.
For a certain column (eg. names) you create a sorted list with the block and index number of each name (B-tree). You can then look up the exact data block that contains the record you want and only load that into memory, and you don’t need a full table scan.

Only do this for columns you use a lot and if the data is not updated regularly!

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What is data modelling?

A

Designing and structuring data to define how it’s stored, organized and accessed in the database system.
- conceptual model (what data, table names)
- logical model (how data is structured: attributes, primary keys, relationships)
- physical model (exact database structure: data types, indexes, constraints)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What is the difference between structured and unstructured data?

A

Structured data: organized in predefined format (table, row, column), stored in relational databses, eg. transaction data, customer records, sales reports

Unstructured data: without fixed format, stored in data lakes or file systems, eg. emails, images, social media posts, requires processing tools

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What is the difference between a data warehouse and an operational database?

A

Data warehouse: optimized for analytics, storing large volumes of historical data, complex queries, rarely updated

Operational database: handles real-time updates and updates, fast read/write transactions, frequently updated

17
Q

What does a data engineer do?

A

Building, maintaining and optimizing the data infrastructure that enables data collection, storage and analysis.

18
Q

What is the difference between SQL and NoSQL?

A

NoSQL is for non-relational databases. It is flexible, instead of structured (SQL) and the schema can evolve over time. It is used for semi-structured or unstructured data, for example storing real-time chat messages or user activity logs.

19
Q

How do you replace null values with 0 in SQL?

A

Use the COALESCE function

20
Q

How to check if a variable is uneven in SQL?

A

WHERE MOD(variable, 2) = 1

21
Q

How do you count the number of rows that have a certain value in a certain column in SQL?

A

COUNT(CASE WHEN state = ‘approved’ THEN id END)
or you could do
SUM(CASE WHEN state = ‘approved’ THEN 1 ELSE 0 END)
or if you use MySQL you can just do
SUM(state = ‘approved)

22
Q

How do you format a date in SQL?

A

MySQL: DATE_FORMAT(date, ‘%Y-%m’)
Oracle: TO_CHAR(date, ‘YYYY-MM’)

23
Q

Waarom is ETL belangrijk?

A

Het zorgt ervoor dat de ruwe data bruikbaar en betrouwbaar wordt voor data analyse

24
Q

Hoe optimaliseer je een groot ETL-proces?

A
  • gebruik bulk-inserts ipv row by row
  • meerdere threads gebruiken, parallelliseren
  • indexeren van kolommen
  • ODI gebruiken voor efficiëntere migratie
25
Wat is het verschil tussen batch ETL en streaming ETL?
Batch: data wordt periodiek verwerkt (bv dagelijks of wekelijks) Streaming: data wordt real-time verwerkt
26
Wat is normalisatie in databases?
Data opsplitsen in meerdere tabellen om overbodige data te verminderen. - minimize redundancy - easier to understand - imrove data integrity
27
Wat is denormalisatie in databases?
Data combineren in 1 tabel om query performance te verbeteren
28
Hoe ga je om met ontbrekende waarden?
- Verwijderen (als het weinig is) - Vervangen door de mediaan of het gemiddelde - Machine learning gebruiken (KNN)
29
Hoe zou je een ETL pipeline ontwerpen voor een verzekeringsmaatschappij?
Data bronnen: claims, polisgegevens, klantendossiers Exctract: gebruik ODI om data uit verschillende bronnen te halen Transform: data schoonmaken, duplicaten verwijderen, standaardisatie Load: geoptimaliseerd opslaan in Data Warehouse
30
Hoe ga je om met miljoenen dagelijkse updates in een database?
- partitioning toepassen om queries sneller te maken - batch updates ipv row by row - incremental updates - caching gebruiken voor veelgevraagde data
31
Wat weet je over Oracle Health Insurance?
Het is een aparte afdeling die systemen (core administration software) ontwikkelt voor zorgverzekeraars, waar ze hun administratie, declaraties en financiën efficiënt kunnen beheren. Focus ligt op werken met veel gevoelige data en performance, betrouwbaarheid en regelgeving.
32
What is 1NF?
- each column contains atomic values (not lists) - each row has a unique identifier (primary key)