11 - Big Data Flashcards
(31 cards)
What is Big Data?
Big Data is a catch-all term for datasets that are too large, too fast-changing, or too varied to be stored and processed by conventional systems.
What are the three Vs that define Big Data?
Volume, Velocity, and Variety.
What does Volume mean in the context of Big Data?
Volume means there is too much data for it all to fit on a conventional hard drive or even a server. Data has to be stored over multiple servers, each of which is composed of many hard drives.
What does Velocity mean in the context of Big Data?
Velocity means data on the servers is created and modified rapidly. The servers must respond to frequently changing data within a matter of milliseconds.
What does Variety mean in the context of Big Data?
Variety means the data held on the servers consists of many different types of data from binary files to multimedia files like photos and videos.
Which attribute of Big Data causes the most trouble - Volume, Velocity, or Variety?
Big Data’s lack of structure (related to Variety) causes the most trouble, not Volume as might be expected.
Why does Big Data’s unstructured nature make it difficult to analyse?
Big Data’s unstructured nature makes it difficult to analyse because conventional databases are not suited to storing Big Data: they require the data to conform to a row and column structure, and they do not scale well across multiple servers.
What techniques must be used to extract useful information from Big Data?
Machine learning techniques must be used to discern patterns in the data.
Give two examples of Big Data.
Continuously monitored banking interactions and data from surveillance systems.
When data is stored over multiple servers, what must happen to the processing?
The processing associated with using the data must also be split across multiple machines.
Why would distributed processing be difficult with conventional programming paradigms?
It would be incredibly difficult because all the machines would have to be synchronised to ensure that no data is overwritten or otherwise damaged.
What is the solution to the problem of processing data over multiple machines?
Functional programming is a solution to the problem of processing data over multiple machines.
What are the key attributes of functional programs?
Functional programs are stateless (meaning that they have no side effects), make use of immutable data structures, and support higher-order functions.
What makes functional programming easier for writing correct, efficient, distributed code?
The attributes of being stateless (no side effects), using immutable data structures, and supporting higher-order functions make it easier to write correct, efficient, distributed code with functional programming than with procedural programming techniques.
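The attributes above can be sketched in a few lines of Python. This is a minimal illustrative example (the data is made up): the functions are pure (no side effects), the input is an immutable tuple, and `map` and `reduce` are higher-order functions that take other functions as arguments.

```python
from functools import reduce

readings = (3, 7, 2, 9)  # immutable tuple: the data cannot be modified in place

def square(x):
    # Pure function: output depends only on the input, no side effects
    return x * x

# Higher-order functions: map and reduce each take a function as an argument
squares = tuple(map(square, readings))
total = reduce(lambda a, b: a + b, squares)

print(squares)  # (9, 49, 4, 81)
print(total)    # 143
```

Because `square` has no side effects and the data is never mutated, each element could safely be processed on a different machine with no synchronisation, which is why this style suits distributed processing.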
Why doesn’t Big Data conform to typical data representation methods?
Because big data doesn’t conform to the row and column format typically used to represent data, it must be represented in a different manner.
What is the fact-based model for representing data?
In the fact-based model, each individual piece of information is stored as a fact. Facts are immutable (meaning that they never change once created) and can’t be overwritten.
What is stored with each fact in the fact-based model?
Stored with each fact is a timestamp, indicating the date and time at which a piece of information was recorded.
Why are timestamps important in the fact-based model?
Since facts are never deleted or overwritten, multiple different values could be held for the same attribute. Timestamps allow a computer to discern which value is the most recent.
In the house example given, what were the two facts and their timestamps?
The house was green in 2010 but re-painted white in 2016. If the colour of house number 42 was queried, the timestamps would be compared and the most recent information (Colour: White) would be returned.
What are two benefits of using the fact-based model for storing Big Data?
1) Using the fact-based model reduces the risk of accidentally losing data due to human error (thanks to facts being immutable and not overwritable). 2) The model does away with an index for the data and instead simply appends new data to the dataset as it is created.
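The fact-based model and the house example above can be sketched as follows. This is an illustrative sketch, not a real Big Data store; the attribute names and helper functions are assumptions made for the example.

```python
facts = []  # append-only list: facts are never overwritten or deleted

def record_fact(entity, attribute, value, timestamp):
    # Each fact is stored with a timestamp and simply appended to the dataset
    facts.append({"entity": entity, "attribute": attribute,
                  "value": value, "timestamp": timestamp})

def query(entity, attribute):
    # Multiple values may exist for the same attribute;
    # compare timestamps and return the most recent value
    matches = [f for f in facts
               if f["entity"] == entity and f["attribute"] == attribute]
    return max(matches, key=lambda f: f["timestamp"])["value"]

record_fact("House 42", "Colour", "Green", 2010)
record_fact("House 42", "Colour", "White", 2016)

print(query("House 42", "Colour"))  # White
```

Note that the 2010 fact is still present after the repaint; the query resolves the conflict by timestamp, which is exactly why timestamps are essential in this model.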
What is graph schema used for?
Graph schema uses graphs consisting of nodes and edges to graphically represent the structure of a dataset.
What do nodes represent in graph schema?
Nodes in a graph represent entities and can contain the properties of the entity.
What do edges represent in graph schema?
Edges are used to represent relationships between entities and are labelled with a brief description of the relationship.
In the graph schema example, what entities are shown and how are they represented?
The example shows three entities represented in a graph schema as three circles. The properties of each entity (breed and name) are listed inside the circles. Arrows linking the circles represent the relationships between the nodes.
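A graph schema like the one described can be sketched with two simple structures: a dictionary of nodes holding each entity's properties, and a list of labelled edges for the relationships. The entity names, properties, and relationship label below are illustrative assumptions, not taken from the original diagram.

```python
# Nodes represent entities and contain their properties
nodes = {
    "dog1": {"breed": "Labrador", "name": "Rex"},
    "dog2": {"breed": "Beagle", "name": "Bella"},
}

# Edges represent relationships, labelled with a brief description
edges = [
    ("dog1", "sibling of", "dog2"),  # (node, relationship label, node)
]

# Follow the edges from one node to describe its relationships
relationships = [
    f"{nodes[s]['name']} is {label} {nodes[t]['name']}"
    for s, label, t in edges if s == "dog1"
]
print(relationships)  # ['Rex is sibling of Bella']
```

Dedicated graph databases store data this way natively, but the node/edge split shown here is the core idea of the schema.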