Batch Data Processing Flashcards

1
Q

What should data pipelines support?

A

Interpretability and observability

2
Q

What are the most important pipeline features?

A

1 - immutable data
2 - data lineage
3 - test-running features

3
Q

Why is immutable data important?

A

To make reproducible outcomes possible

4
Q

Why is data lineage important?

A

For diagnostics: lineage shows where each piece of data came from and which steps transformed it

5
Q

Why are test running features important?

A

To validate assumptions that have been made

6
Q

What kinds of tests should the testing step include?

A

1 - health check
2 - integration test
3 - latency test

7
Q

Health check

A

Checks if the job has succeeded
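
A health check can be sketched as follows. This is a minimal illustration, assuming the job is a command-line process and that an exit code of 0 means success; `job_succeeded` and both example commands are hypothetical and not part of any particular scheduler.

```python
import subprocess
import sys


def job_succeeded(command):
    """Run the job and report success, assuming exit code 0 means success."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode == 0


# Hypothetical jobs: one that finishes normally and one that fails.
ok_job = [sys.executable, "-c", "print('batch done')"]
bad_job = [sys.executable, "-c", "raise SystemExit(1)"]

healthy = job_succeeded(ok_job)
unhealthy = job_succeeded(bad_job)
```

A real pipeline would more likely read the job status from its scheduler or a status table, but the idea is the same: a single yes/no answer on whether the run completed.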

8
Q

Integration test

A

Verifies that mock data makes its way through the data transformation as expected
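
As a rough sketch of such a test: the `transform` function and the mock records below are hypothetical stand-ins for a real pipeline step, chosen only to show mock data going in and a known result coming out.

```python
def transform(records):
    """Hypothetical transformation: drop rows without an id, normalise names."""
    return [
        {"id": r["id"], "name": r["name"].strip().lower()}
        for r in records
        if r.get("id") is not None
    ]


def test_mock_data_flows_through():
    """Feed mock records through the transformation and check the result."""
    mock_input = [
        {"id": 1, "name": "  Alice "},
        {"id": None, "name": "ghost"},  # should be filtered out
        {"id": 2, "name": "BOB"},
    ]
    output = transform(mock_input)
    assert output == [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
    return output


result = test_mock_data_flows_through()
```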

9
Q

Latency test

A

Measures the time it takes for the data pipeline to complete
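
One way to sketch this, assuming the pipeline can be called as a plain function: `run_pipeline`, `measure_latency`, and the `MAX_SECONDS` budget are all hypothetical names and values used for illustration.

```python
import time


def run_pipeline(records):
    """Hypothetical stand-in for a batch job."""
    return [r * 2 for r in records]


def measure_latency(job, *args):
    """Return (result, elapsed_seconds) for one run of the job."""
    start = time.perf_counter()
    result = job(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


MAX_SECONDS = 5.0  # hypothetical latency budget for this job

result, elapsed = measure_latency(run_pipeline, list(range(1000)))
latency_ok = elapsed <= MAX_SECONDS
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock, so the measurement is not affected by system clock adjustments.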

10
Q

Benefits of batch processing

A

1 - Load balancing (shifting job processing to times when computing resources are less busy)
2 - Reducing manual intervention and supervision
3 - Overall high rate of utilisation
4 - Allowing priority differences

11
Q

Combiner function

A

An extra layer which pre-aggregates values in the mapper itself. This can only be done if the reduce function is commutative and associative.
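
The effect of a combiner can be illustrated with a small word count. This is a plain-Python sketch of the idea, not the API of any MapReduce framework; `map_words`, `combine`, `reduce_counts`, and the input splits are hypothetical. Summation is commutative and associative, so pre-aggregating per mapper does not change the final result.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Mapper: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: pre-aggregate counts per key on the mapper side."""
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())


def reduce_counts(pairs):
    """Reducer: sum all values per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


splits = ["a b a", "b a c"]  # hypothetical input splits, one per mapper

without_combiner = reduce_counts(
    chain.from_iterable(map_words(s) for s in splits)
)
with_combiner = reduce_counts(
    chain.from_iterable(combine(map_words(s)) for s in splits)
)

# Number of pairs that would be sent from mappers to the reducer.
shuffled_without = sum(len(map_words(s)) for s in splits)
shuffled_with = sum(len(combine(map_words(s))) for s in splits)
```

Both runs give the same totals, but the combined run sends fewer pairs to the reducer. A function such as the mean would not work this way, because the mean of partial means is not in general the overall mean.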
