Data Intensive Ch12 - Future of Data Systems Flashcards

1
Q

What is the final chapter about?

A

Previous chapters described how things are at present; the final chapter discusses how things SHOULD be

2
Q

The goal of the book was to…

A

…explore how to create apps and systems that are

  • RELIABLE
  • SCALABLE
  • MAINTAINABLE
3
Q

Recurring theme in the book

A

For any given problem, there are several solutions, each with its own pros, cons, and trade-offs
Storage engines - log-structured, B-tree-based, column-oriented
Replication - single-leader, multi-leader, leaderless

4
Q

The goal of data integration is…

A

…to make sure the data ends up in the right form in all the right places

Example: maintaining a search index from a change data capture (CDC) / total order broadcast stream of the database
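A minimal sketch of this idea in Python, using a plain dict as a toy "search index" and hypothetical change events (the op/key/value shapes are illustrative, not a real CDC API):

# Minimal sketch: keep a derived search index in sync by replaying an
# ordered change log (CDC / total order broadcast) from the database.
def apply_change(index, change):
    """Apply a single change event to the derived index."""
    if change["op"] == "upsert":
        index[change["key"]] = change["value"]
    elif change["op"] == "delete":
        index.pop(change["key"], None)

def rebuild_index(change_log):
    """Because the log is totally ordered, every consumer that replays
    it deterministically arrives at the same index state."""
    index = {}
    for change in change_log:
        apply_change(index, change)
    return index

index = rebuild_index([
    {"op": "upsert", "key": "doc1", "value": "hello world"},
    {"op": "upsert", "key": "doc2", "value": "goodbye"},
    {"op": "delete", "key": "doc1"},
])
# index == {"doc2": "goodbye"}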

5
Q

Why is asynchrony what makes systems based on event logs robust?

A

It allows a fault in one part of the system to be contained locally.
With synchronous distributed transactions, all participants must abort if any one of them fails, so failures are amplified and spread to the rest of the system.
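A minimal sketch of the containment property (hypothetical names; real brokers like Kafka track per-consumer offsets in a similar spirit):

# Minimal sketch: a log consumer owns its offset, so its failures stay
# local. The producer appends regardless of consumer health, and a
# failed event is simply retried from the same offset later.
log = []      # shared, append-only event log
offset = 0    # this consumer's private read position

def produce(event):
    log.append(event)          # never blocks on slow or failed consumers

def consume_available(handler):
    global offset
    while offset < len(log):
        try:
            handler(log[offset])
        except Exception:
            break              # fault contained: retry from same offset later
        offset += 1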

6
Q

Batch vs stream processing for maintaining derived state

A

Stream processing allows changes in the input to be reflected in derived views with LOW DELAY

Batch processing allows LARGE AMOUNTS of accumulated historical data to be reprocessed in order to derive new views onto an existing dataset
This supports the evolvability of the system (new features can easily derive the view they require)
Without the ability to fully reprocess data, schema evolution is limited to adding a new optional field to a record or a new type of record (for both schema-on-write and schema-on-read)
With reprocessing it is possible to restructure the dataset into whatever model serves the new requirements best (a minimal sketch follows)
Users can be gradually routed to the new view
The old view still exists and can be switched back to.
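A minimal sketch of reprocessing, with toy events (a real system would run this as a batch job over the full log while the old view keeps serving traffic):

# Minimal sketch: the same event log can be reprocessed into a new view
# with a different structure, while the old view remains available.
events = [
    {"user": "alice", "action": "click"},
    {"user": "bob",   "action": "click"},
    {"user": "alice", "action": "buy"},
]

def old_view(log):
    # original requirement: total actions per user
    counts = {}
    for e in log:
        counts[e["user"]] = counts.get(e["user"], 0) + 1
    return counts

def new_view(log):
    # new requirement: per-user counts broken down by action type
    counts = {}
    for e in log:
        per_user = counts.setdefault(e["user"], {})
        per_user[e["action"]] = per_user.get(e["action"], 0) + 1
    return counts

# Both views are derived from the full history; users can be routed to
# new_view gradually, with old_view kept as a fallback.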

7
Q

Schema migrations on railways

A

In early 19th-century England there were various competing standards for the gauge (the distance between the two rails)
Trains had to be built for one particular gauge and were incompatible with any other
Once a single standard was decided on in 1846, existing tracks of a different gauge had to be converted.
Shutting down an entire train line for months or more was not an option.
The solution was to convert the track to dual gauge by adding a third rail, which could be done gradually.
Eventually the nonstandard rail could be removed.
Nevertheless, it is an expensive undertaking, and non-standard gauges are still in use (e.g., BART in San Francisco vs. the rest of the USA)

8
Q

Lambda architecture

A

Running a batch and a stream system in parallel, e.g. Hadoop MapReduce and Storm
The stream processor consumes the events and quickly produces an APPROXIMATE update to the read view.
The batch processor later consumes the same set of events and produces a corrected version of the derived view.

In short: stream = fast and approximate; batch = slow and exact (a simplified merge sketch follows at the end of this card).

Cons:

  • the same logic is duplicated in 2 places
  • the 2 processors produce separate outputs that need to be merged (which can be complex if the view is derived using joins or sessionization, or if the output is not a time series, etc.)
  • running a full reprocess is expensive on large datasets; incremental batches can be used instead, but this raises time-related problems such as stragglers and how to handle windows that cross boundaries between batches

Recent work unifies batch and stream processing in one system. This requires capabilities like:

  • replaying historical events through the same processing engine (e.g., log-based message brokers)
  • exactly-once semantics for stream processors
  • tools for windowing by EVENT time, not processing time (processing time is meaningless for historical events)
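A deliberately simplified sketch of the Lambda-style split; the merge rule here ("prefer the batch value when present") glosses over exactly the hard cases listed above:

# Minimal sketch: a speed layer keeps a fast, possibly approximate
# count; a later batch pass over the full log produces the exact value.
stream_counts = {}   # speed layer: updated per event, low latency
batch_counts = {}    # batch layer: recomputed from the full log, exact

def on_event(key):
    stream_counts[key] = stream_counts.get(key, 0) + 1

def run_batch(full_log):
    batch_counts.clear()
    for key in full_log:
        batch_counts[key] = batch_counts.get(key, 0) + 1

def read(key):
    # Simplified merge: trust the batch result where it exists, fall
    # back to the streaming approximation for keys it hasn't covered.
    return batch_counts.get(key, stream_counts.get(key, 0))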
9
Q

What’s wrong with using processing time for windowing when processing events in a stream processor?

A

Replaying makes processing time meaningless. Using processing time assumes an event is processed shortly after it was published, but a replay can take place long after that.
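A minimal sketch of event-time windowing: hourly windows are keyed by a timestamp carried inside each event (the event_ts field is illustrative), so replays land in the same windows as the original run:

WINDOW = 3600  # one-hour windows, in seconds

def window_of(event):
    # Depends only on the timestamp inside the event, not on the wall
    # clock at processing time, so replays are deterministic.
    return event["event_ts"] - event["event_ts"] % WINDOW

def count_by_window(events):
    counts = {}
    for e in events:
        w = window_of(e)
        counts[w] = counts.get(w, 0) + 1
    return counts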
