4.11 - Big Data Flashcards
(17 cards)
What is Big Data in one sentence?
A catch-all term for data sets so large or complex that they no longer fit into the usual storage or processing “containers.”
Why is volume a challenge in Big Data?
The data is too big to fit on a single server.
What does velocity refer to in Big Data?
Data arrives as a continuous stream, often at high speed, and may need a response within milliseconds to seconds.
What does variety mean in the context of Big Data?
The data comes in many forms—structured, unstructured, text, images, audio, video, etc.
What is usually the hardest aspect of Big Data and why?
Its lack of structure; without rows-and-columns it is much harder to analyse and cannot be stored efficiently in relational databases.
Why don’t traditional relational databases suit Big Data?
They require tidy row-and-column structure and don’t scale well over multiple machines.
How do we extract patterns and useful information from unstructured Big Data?
By applying machine-learning techniques.
When does size become the real issue in Big Data?
Once the data set no longer fits on a single server and relational databases won’t scale horizontally.
What must happen when data no longer fits on one machine?
Processing has to be distributed (shared) across multiple machines.
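Example (illustrative sketch only): one common way to share data across machines is to assign each record to a node by hashing its key. The names `assign_node`, `NODE_COUNT` and the sample keys are hypothetical, and hash partitioning is just one possible scheme, not one the card prescribes.

```python
import zlib

NODE_COUNT = 3  # assumed number of machines in the cluster

def assign_node(key: str, node_count: int) -> int:
    """Pick a node for a record from a deterministic hash of its key."""
    # zlib.crc32 is stable between runs, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % node_count

records = ["user1", "user2", "user3", "user4", "user5", "user6"]

# Build one shard (list of keys) per node.
shards = {node: [] for node in range(NODE_COUNT)}
for key in records:
    shards[assign_node(key, NODE_COUNT)].append(key)
```

Each shard could then be stored and processed on its own machine, with only the partial results sent back to be combined.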
Why is functional programming useful for distributed Big Data processing?
Immutable data, stateless (pure) functions and higher-order functions make it easier to write correct, efficient code that can be spread across servers.
How do immutable data structures help in distributed systems?
No in-place updates → no race conditions on shared data, so parallel code stays correct.
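Example (a minimal Python sketch; the `Reading` class and its fields are hypothetical): a frozen dataclass cannot be changed after creation, so an "update" has to build a new object and the original stays safe to share.

```python
from dataclasses import FrozenInstanceError, dataclass, replace

@dataclass(frozen=True)
class Reading:
    sensor_id: str
    value: float

original = Reading("s1", 20.5)

# An "update" creates a new object; the original is left untouched.
corrected = replace(original, value=21.0)

# Attempting an in-place update raises an error instead of changing the data.
try:
    original.value = 99.0
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Because `original` can never change, any number of workers can read it at the same time without locking.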
What does statelessness mean, and why is it good for distribution?
Functions depend only on their inputs, not shared state → easy to reproduce work on any node.
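Example (illustrative sketch; both functions are made up for contrast): the stateful function gives different answers for the same input, while the stateless one can be re-run on any node and always agrees with itself.

```python
# Stateful: the result depends on a hidden running total.
running_total = 0

def add_to_total(x: int) -> int:
    global running_total
    running_total += x
    return running_total

# Stateless (pure): the result depends only on the argument.
def square(x: int) -> int:
    return x * x

first = square(7)
second = square(7)  # same input, same output, on any machine
```

If a node fails halfway through, a call such as `square(7)` can simply be repeated elsewhere, which is not safe for `add_to_total`.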
What are higher-order functions, and why are they handy for Big Data?
Functions that take/return other functions (e.g., map, filter) let you express large-scale data transformations concisely.
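Example (a short sketch using Python's built-in `filter` and `map` plus `functools.reduce`; the data and the helper `at_least` are hypothetical):

```python
from functools import reduce

readings = [12, 47, 3, 88, 51]

# filter and map each take a function as an argument.
large = list(filter(lambda r: r > 40, readings))    # [47, 88, 51]
doubled = list(map(lambda r: r * 2, large))         # [94, 176, 102]

# reduce combines the values into one result using the function it is given.
total = reduce(lambda acc, r: acc + r, doubled, 0)  # 372

# A function that returns another function.
def at_least(threshold):
    return lambda r: r >= threshold

big = list(filter(at_least(50), readings))          # [88, 51]
```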
In a fact-based model, what is a fact?
A single, atomic piece of information.
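Example (illustrative sketch only): the `Fact` layout below, including the `timestamp` field, is an assumption for demonstration and is not stated on the card. It shows facts being appended rather than overwritten, with the newest fact taken as the current value.

```python
from typing import NamedTuple, Optional

class Fact(NamedTuple):
    entity: str
    attribute: str
    value: str
    timestamp: int  # assumed: records when the fact was captured

# New information is appended as a new fact; old facts are never edited.
facts = [
    Fact("user42", "city", "Leeds", 1),
    Fact("user42", "city", "York", 2),
]

def current_value(facts, entity: str, attribute: str) -> Optional[str]:
    """Return the most recent value for one attribute, or None if unknown."""
    matching = [f for f in facts if f.entity == entity and f.attribute == attribute]
    if not matching:
        return None
    return max(matching, key=lambda f: f.timestamp).value
```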
What three parts make up a graph schema?
Nodes (entities), edges (relationships), and properties (attributes on nodes/edges).
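Example (a minimal sketch using plain Python dictionaries and tuples; the people, company and relationship names are made up): nodes carry properties, and each edge links two nodes with a relationship name and its own properties.

```python
# Nodes: entity id -> properties
nodes = {
    "alice": {"label": "Person", "age": 34},
    "bob": {"label": "Person", "age": 29},
    "acme": {"label": "Company", "sector": "retail"},
}

# Edges: (source, relationship, destination, properties)
edges = [
    ("alice", "KNOWS", "bob", {"since": 2019}),
    ("alice", "WORKS_FOR", "acme", {"role": "analyst"}),
    ("bob", "WORKS_FOR", "acme", {"role": "engineer"}),
]

def neighbours(node_id, relationship):
    """Follow edges of one relationship type outward from a node."""
    return [dst for src, rel, dst, _props in edges
            if src == node_id and rel == relationship]

def employees_of(company_id):
    """Follow WORKS_FOR edges backwards to find who works for a company."""
    return sorted(src for src, rel, dst, _props in edges
                  if rel == "WORKS_FOR" and dst == company_id)
```

For instance, `neighbours("alice", "KNOWS")` follows one edge to reach `"bob"`, and `employees_of("acme")` answers a relationship query without any table joins.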
Why use a graph schema for Big Data?
It captures the structure of a large, often irregular data set in a way that can be traversed and queried efficiently.
State two features of functional programming languages that make it easier to write code that can be distributed to run across more than one server.
- Immutable data structures // the state of a data structure cannot be changed
- Statelessness // functions do not have side-effects // all functions are pure
- Functions can be distributed to servers and executed on data sets, then the results can be combined // map-reduce
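Example (illustrative sketch of the map-reduce idea in the last bullet): everything here runs on one machine, with Python's built-in `map` standing in for sending a function to each server. The partitions and the helper names `count_words` and `merge` are hypothetical.

```python
from collections import Counter
from functools import reduce

# Assume each inner list is the slice of the data held on one server.
partitions = [
    ["big data big", "data stream"],
    ["big graph"],
]

def count_words(lines):
    """Map step: a pure function that counts the words in one partition."""
    return Counter(word for line in lines for word in line.split())

def merge(a, b):
    """Reduce step: combine two partial counts into a new Counter."""
    return a + b

# The same function is applied to every partition independently...
partial_counts = list(map(count_words, partitions))

# ...and the partial results are then combined into one answer.
totals = reduce(merge, partial_counts, Counter())
```

Because `count_words` has no side effects and never alters its input, the partitions can be processed in any order, or at the same time, without changing `totals`.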