11 - Big Data Flashcards
(31 cards)
What is Big Data?
Big Data is a catch-all term for datasets that are too large, too fast-changing, or too varied to be stored and processed by conventional systems.
What are the three Vs that define Big Data?
Volume, Velocity, and Variety.
What does Volume mean in the context of Big Data?
Volume means there is too much data for it all to fit on a conventional hard drive or even a server. Data has to be stored over multiple servers, each of which is composed of many hard drives.
What does Velocity mean in the context of Big Data?
Velocity means data on the servers is created and modified rapidly. The servers must respond to frequently changing data within a matter of milliseconds.
What does Variety mean in the context of Big Data?
Variety means the data held on the servers consists of many different types of data from binary files to multimedia files like photos and videos.
Which attribute of Big Data causes the most trouble - Volume, Velocity, or Variety?
Big Data’s lack of structure (related to Variety) causes the most trouble, not Volume as might be expected.
Why does Big Data’s unstructured nature make it difficult to analyse?
Big Data’s unstructured nature makes it difficult to analyse because conventional databases are not suited to storing Big Data: they require the data to conform to a row and column structure, and they do not scale well across multiple servers.
What techniques must be used to extract useful information from Big Data?
Machine learning techniques must be used to discern patterns in the data.
Give two examples of Big Data.
Continuously monitored banking interactions and data from surveillance systems.
When data is stored over multiple servers, what must happen to the processing?
The processing associated with using the data must also be split across multiple machines.
Why would distributed processing be difficult with conventional programming paradigms?
It would be incredibly difficult because all the machines would have to be synchronised to ensure that no data is overwritten or otherwise damaged.
What is the solution to the problem of processing data over multiple machines?
Functional programming is a solution to the problem of processing data over multiple machines.
What are the key attributes of functional programs?
Functional programs are stateless (meaning that they have no side effects), make use of immutable data structures, and support higher-order functions.
What makes functional programming easier for writing correct, efficient, distributed code?
The attributes of being stateless (no side effects), using immutable data structures, and supporting higher-order functions make it easier to write correct, efficient, distributed code with functional programming than with procedural programming techniques.
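The attributes above can be sketched in a few lines of Python. This is a minimal illustrative example (the data is made up): the functions are pure (no side effects), the input is an immutable tuple, and `map` and `reduce` are higher-order functions that take other functions as arguments.

```python
from functools import reduce

readings = (3, 7, 2, 9)  # immutable tuple: the data cannot be modified in place

def square(x):
    # Pure function: output depends only on the input, no side effects
    return x * x

# Higher-order functions: map and reduce each take a function as an argument
squares = tuple(map(square, readings))
total = reduce(lambda a, b: a + b, squares)

print(squares)  # (9, 49, 4, 81)
print(total)    # 143
```

Because `square` has no side effects and the data is never mutated, each element could safely be processed on a different machine with no synchronisation, which is why this style suits distributed processing.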
Why doesn’t Big Data conform to typical data representation methods?
Because big data doesn’t conform to the row and column format typically used to represent data, it must be represented in a different manner.
What is the fact-based model for representing data?
In the fact-based model, each individual piece of information is stored as a fact. Facts are immutable (meaning that they never change once created) and can’t be overwritten.
What is stored with each fact in the fact-based model?
Stored with each fact is a timestamp, indicating the date and time at which a piece of information was recorded.
Why are timestamps important in the fact-based model?
Since facts are never deleted or overwritten, multiple different values could be held for the same attribute. Timestamps allow a computer to discern which value is the most recent.
In the house example given, what were the two facts and their timestamps?
The house was green in 2010 but re-painted white in 2016. If the colour of house number 42 was queried, the timestamps would be compared and the most recent information (Colour: White) would be returned.
What are two benefits of using the fact-based model for storing Big Data?
1) Using the fact-based model reduces the risk of accidentally losing data due to human error (thanks to facts being immutable and not overwritable). 2) The model does away with an index for the data and instead simply appends new data to the dataset as it is created.
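The fact-based model and the house example above can be sketched as follows. This is an illustrative sketch, not a real Big Data store; the attribute names and helper functions are assumptions made for the example.

```python
facts = []  # append-only list: facts are never overwritten or deleted

def record_fact(entity, attribute, value, timestamp):
    # Each fact is stored with a timestamp and simply appended to the dataset
    facts.append({"entity": entity, "attribute": attribute,
                  "value": value, "timestamp": timestamp})

def query(entity, attribute):
    # Multiple values may exist for the same attribute;
    # compare timestamps and return the most recent value
    matches = [f for f in facts
               if f["entity"] == entity and f["attribute"] == attribute]
    return max(matches, key=lambda f: f["timestamp"])["value"]

record_fact("House 42", "Colour", "Green", 2010)
record_fact("House 42", "Colour", "White", 2016)

print(query("House 42", "Colour"))  # White
```

Note that the 2010 fact is still present after the repaint; the query resolves the conflict by timestamp, which is exactly why timestamps are essential in this model.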
What is graph schema used for?
Graph schema uses graphs consisting of nodes and edges to graphically represent the structure of a dataset.
What do nodes represent in graph schema?
Nodes in a graph represent entities and can contain the properties of the entity.
What do edges represent in graph schema?
Edges are used to represent relationships between entities and are labelled with a brief description of the relationship.
In the graph schema example, what entities are shown and how are they represented?
The example shows three entities represented in a graph schema as three circles. The properties of each entity (breed and name) are listed inside the circles. Arrows linking the circles represent the relationships between the nodes.
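A graph schema like the one described can be sketched with two simple structures: a dictionary of nodes holding each entity's properties, and a list of labelled edges for the relationships. The entity names, properties, and relationship label below are illustrative assumptions, not taken from the original diagram.

```python
# Nodes represent entities and contain their properties
nodes = {
    "dog1": {"breed": "Labrador", "name": "Rex"},
    "dog2": {"breed": "Beagle", "name": "Bella"},
}

# Edges represent relationships, labelled with a brief description
edges = [
    ("dog1", "sibling of", "dog2"),  # (node, relationship label, node)
]

# Follow the edges from one node to describe its relationships
relationships = [
    f"{nodes[s]['name']} is {label} {nodes[t]['name']}"
    for s, label, t in edges if s == "dog1"
]
print(relationships)  # ['Rex is sibling of Bella']
```

Dedicated graph databases store data this way natively, but the node/edge split shown here is the core idea of the schema.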