Interview questions Flashcards
(32 cards)
What is ETL?
Extract data from source(s), Transform the data (apply changes), Load into a central database. This instead of having to combine the data every time you want to load it and use it.
What are ways of transforming the data?
Apply data cleansing
Removing duplicates
Adding keys and IDs
Combining and merging data from different sources
What ETL tools are there?
SQL
Oracle Database Integrator (ELT!)
Oracle Warehouse Builder
What is ODI?
It uses ELT instead of ETL, where data is first loaded and then transformed => more efficient for large datasets. Also more flexible as it is compatible with multiple data sources and technologies, and has good integration with Oracle databases.
How do you automate the ETL process?
- ETL scheduling tools: eg. ODI to automate jobs
- Scripting: SQL or Python scripts to run processing automatically
- Monitoring: get an alert when processes fail
What is the difference between ODI and OWB?
- ODI uses ELT, OWB uses ETL
- ODI pushes transformations to the database, OWB performs transformations in a separate ETL engine before loading data into the warehouse
- ODI supports multiple databases, OWB is designed mainly for Oracle-based data warehouses
- ODI is more scalable than OWB
How to ensure privacy of your data during ETL?
- encrypt data at every stage
- redaction (hiding sensitive parts of data)
- anonymization (modify data) or tokenization (replace values with random ones)
- logging and monitoring (unauthorized access attempts, anomalies)
What are best practices for data anonymization?
- data masking: replace sensitive info with fictional values (XXX - 339)
- pseudonimizing: replace identified data with pseudonyms, eg. ID numbers
- generalization: make data less specific (birth year instead of birth day)
- adding noise
How to optimize a slow SQL query?
- indexing: make sure columns that you use often (WHERE, JOIN) have indices
- only use the columns you need instead of SELECT *
- use JOINS instead of a subquery
- avoid unnessecary joins and use INNER JOIN where possible
- use query analysis (SQL trace) to find problems
What is INNER JOIN?
Only returns the rows that are in both tables
What is LEFT JOIN?
Returns all rows that are in the left table, and a NULL value if there are no matches for the right table
What is FULL JOIN?
Combines all rows of the left and right table, and a NULL value if there are no matches in both tables.
How does indexing work in databases?
It speeds up queries in large tables.
For a certain column (eg. names) you create a sorted list with the block and index number of each name (B-tree). You can then look up the exact data block that contains the record you want and only load that into memory, and you don’t need a full table scan.
Only do this for columns you use a lot and if the data is not updated regularly!
What is data modelling?
Designing and structuring data to define how it’s stored, organized and accessed in the database system.
- conceptual model (what data, table names)
- logical model (how data is structured: attributes, primary keys, relationships)
- physical model (exact database structure: data types, indexes, constraints)
What is the difference between structured and unstructured data?
Structured data: organized in predefined format (table, row, column), stored in relational databses, eg. transaction data, customer records, sales reports
Unstructured data: without fixed format, stored in data lakes or file systems, eg. emails, images, social media posts, requires processing tools
What is the difference between a data warehouse and an operational database?
Data warehouse: optimized for analytics, storing large volumes of historical data, complex queries, rarely updated
Operational database: handles real-time updates and updates, fast read/write transactions, frequently updated
What does a data engineer do?
Building, maintaining and optimizing the data infrastructure that enables data collection, storage and analysis.
What is the difference between SQL and NoSQL?
NoSQL is for non-relational databases. It is flexible, instead of structured (SQL) and the schema can evolve over time. It is used for semi-structured or unstructured data, for example storing real-time chat messages or user activity logs.
How do you replace null values with 0 in SQL?
Use the COALESCE function
How to check if a variable is uneven in SQL?
WHERE MOD(variable, 2) = 1
How do you count the number of rows that have a certain value in a certain column in SQL?
COUNT(CASE WHEN state = ‘approved’ THEN id END)
or you could do
SUM(CASE WHEN state = ‘approved’ THEN 1 ELSE 0 END)
or if you use MySQL you can just do
SUM(state = ‘approved)
How do you format a date in SQL?
MySQL: DATE_FORMAT(date, ‘%Y-%m’)
Oracle: TO_CHAR(date, ‘YYYY-MM’)
Waarom is ETL belangrijk?
Het zorgt ervoor dat de ruwe data bruikbaar en betrouwbaar wordt voor data analyse
Hoe optimaliseer je een groot ETL-proces?
- gebruik bulk-inserts ipv row by row
- meerdere threads gebruiken, parallelliseren
- indexeren van kolommen
- ODI gebruiken voor efficiëntere migratie