ETL Flashcards

1
Q
  1. You have two ETL jobs that feed into a single production table each day. What problems might this cause?
A

Off the top of my head the first one would be whichever one is secondary could be overwriting the initial ETL job.

Also you can have write skew where you are updating old data. So if you have a query that inserts something into a table but that table is not the current version then you will have write skew.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q
  1. What’s the difference between ETL and ELT?
A

So ETL transforms data on a seperate processing server. While ELT process the data in the dara warehouse. ELT sends raw daya right into a datawarehouse.

ETL is slower.

Generally organization should be using ELT now which is more efficient and sclaable. Also you get to store the raw data which we need.

So the one thing I would say to this and maybe you guys run into this. Is data compliance. So sometimes you have to transform the data to protect customer information. If thats so you are going to have to do some kind of Transform before you load the data or you’ll break compliance.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What is an initial load in ETL? What about full load

A

Full load - All the data dumps when the source loads into the warehouse.

Initial load - Data is dumped between the source and target at regular intervals. Lastly, extract dates are stored; only records are added for the extract date load. This load can be either streaming (better for small volume) or batch (better for large volume).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

With what ETL tools are you most familiar?

A

SQL, Python, Airflow, AWS Glue, I understand you guys use informatica

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What are partitions? Why might you increase the number of partitions?

A

partitions are subdividers that improve data perfomance.

You can think of them like signs in Home depot

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What are database snapshots? What’s their importance?

A

read only static views of a database at a certain point in time. I use snapshots to make sure we have something to fall back on if there is a data error.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What are views in ETL? What is used to build them?

A

Views are a SQL query that is a step in a transformation proess. You can store them in DB management ools.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What could be potential bottlenecks in the ETL process?

A

Very long inefficient query’s with lots of joins. Make sure you are loading incrementally and not fully updating. Large tables that are not partitioned. Something airflow is reat at is processing in parallel.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

How would you triage an ETL failure?

A

First thing is can I replicate the ETL failure? So figure out if it is replicable or not replicable.

If it is replicable then I can narrow it down. If not replicable that might take some time. But I would check the logs of the tools in the ETL process and see where it is breaking to narrow that down.

Also keeping edge cases in mind that usually break software. Null values, zeros. End of the month. Special characters in a string. Capitlised and non capitalised strings. extra spaces in strings and lists. things like that can help.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Describe how to use an operational data store.

A

An operational data store, or ODS, is a database that provides interim storage for data before it’s sent to a warehouse

AN ODS typically integrates data from multiple sources and provides an area for efficient data processing activities like operational reporting

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

. Create an ETL query for an aggregate table called lifetime_plays that records each user’s song count by date.

A

okay so how often do you want this to update?

Great so I would do an insert into to add rows to this table and set it to update every second. I would say though that it might be pointless to do it by second because a song is like a couple minutes. So we are running alot of queries to not pull anything. So maybe i would do minimum like 2 minute time. Cut donw on queries maybe similair efificiency.

So i would be accessing like maybe a songs_played table for created_at, date, user_ud, song_id

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Give some examples of uses for linked lists.

A

a linked list consists of nodes where each node has a head with the data and a refenrecre to the next node in the list. The end is null.

If you want to able to insert items in the middle of the list, memory is not a concern, and you arent iterating on it then linkedlists are prefferable.

If the list is absolutely massive use a linked list.

so maybe if you hit shuffle songs on spotify it creates a linked list becase it is essentially an endless list of shuffled songs.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

How would you implement a queue using a stack?

A
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What is a dequeue?

A

Dequeue is a queue operation to remove items from the front of a queue.

Equeue adds items to the back of a list.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What are the assumptions of linear regression?

A

that there is a relationship between the feautes and the response variable.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What algorithms support missing values?

A

KNN - This algorithm uses K-nearest values to predict the missing value.
Random Forest - Random forests work on non-linear and categorical data and are valid for large datasets.

17
Q

What is the difference between linear and non-linear data structures?

A

Linear data structures are elements that attach to the previous and adjacent elements and only involve a single level. In non-linear structures, data elements attach hierarchically, and multiple levels are involved. With linear data structures, elements can only be traversed in a single run, whereas in non-linear structures, they cannot.

Examples of linear data structures include queue, stack, array, and linked list. Non-linear data structures include graphs and trees.