Big Data Flashcards
(23 cards)
Define
Big Data
A broad term for datasets so large or complex that traditional data processing applications are inadequate; the data typically has to be stored across multiple servers.
Define
Volume
The Three V’s
The capacity required to store the data exceeds that of a single server.
Define
Velocity
The Three V’s
The data is produced and/or processed at very high speed.
Define
Variety
The Three V’s
The data is very diverse; it can appear in different types (e.g. text, video, images) and forms (e.g. structured, unstructured, semi-structured).
Define
Structured Data
Data that can be stored in a traditional system such as a relational database or spreadsheet, because it can be defined using fields and records.
Define
Unstructured Data
Data that cannot be defined in columns or rows (e.g. text documents, PDFs, voice messages, emails). This lack of structure makes the data difficult to analyse.
Identify
Issues With Big Data
- Data sets so large they are difficult to store and analyse.
- Data is constantly changing, so it is difficult to keep track of changes.
- Massive storage and processing power required.
- Specialised software required to manage and extract meaningful information from the data.
- Much of the data is unstructured, which makes it very difficult to analyse.
Describe
Data Mining
The use of a variety of statistical analysis tools to uncover previously unknown patterns, or relationships among variables, in the data stored in databases.
Describe
Predictive Analysis
The use of data warehouses and complex algorithms to forecast future events, based on historical trends and calculated probabilities.
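As a rough illustration of forecasting from a historical trend, the sketch below fits a straight line to past values by least squares and extends it one step ahead. The sales figures and the names `fit_line` and `forecast` are made up for the example; real predictive analysis would use far more data and more sophisticated models.

```python
# Made-up historical data: units sold in months 1 to 4.
months = [1, 2, 3, 4]
sales = [10, 12, 14, 16]

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(months, sales)

# Extend the fitted trend to a month that has not happened yet.
forecast = intercept + slope * 5
print(forecast)
```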
Describe
Data Warehousing
The process of bringing together data from various sources into one place so that meaningful data analysis can take place, such as data mining and predictive analytics.
Describe
Fact-Based Model
Used to represent, model, and query data sets at the scale of Big Data. It is similar to entity relationship models used in databases.
Define
Fact
Fact-Based Model
A piece of data that cannot be decomposed any further, and is forever true. The data:
- Must not include redundant information.
- Must be specific to a particular point in time.
- Cannot be changed or deleted.
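The three rules above can be sketched with a frozen dataclass. This is only an illustration: the `Fact` class, its field names, and the sample values are assumptions, not part of any particular Big Data system. A change in the real world is recorded as a new timestamped fact rather than by editing an old one.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import datetime

@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Fact:
    entity_id: str
    attribute: str
    value: str
    recorded_at: datetime  # ties the fact to a specific point in time

# Made-up facts: the user moved, so a second fact is added; the first is kept.
facts = (
    Fact("user-1", "city", "Leeds", datetime(2020, 1, 1)),
    Fact("user-1", "city", "York", datetime(2022, 6, 1)),
)

def current_value(all_facts, entity_id, attribute):
    """Return the value of the most recently recorded matching fact."""
    matching = [f for f in all_facts
                if f.entity_id == entity_id and f.attribute == attribute]
    return max(matching, key=lambda f: f.recorded_at).value

# Attempting to change an existing fact is rejected.
try:
    facts[0].value = "London"
    mutated = True
except FrozenInstanceError:
    mutated = False

print(current_value(facts, "user-1", "city"), mutated)
```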
Describe
Graph Schema
A method of defining the structure of a big dataset as a graph, using the fact-based model.
Describe
Node
Represents a core entity in a data set. Depicted with an oval.
Describe
Edge
Represents the relationships between entities (nodes). Depicted using solid lines linking nodes together.
Define
Property
Defines information about a node. Depicted with a rectangular box.
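Nodes, edges, and properties can be written down as plain data. The schema below is a hypothetical example for a tiny social dataset; the entity, property, and relationship names are invented for illustration.

```python
# Nodes: the core entities in the dataset.
nodes = {"Person", "City"}

# Properties: information attached to each node.
properties = {
    "Person": ["name", "date_of_birth"],
    "City": ["name"],
}

# Edges: (source node, relationship label, target node).
edges = [
    ("Person", "lives_in", "City"),
    ("Person", "friend_of", "Person"),
]

def relationships_of(node):
    """List the labels of every edge that touches the given node."""
    return [label for source, label, target in edges
            if node in (source, target)]

# Sanity check: every edge must link two declared nodes.
schema_is_valid = all(s in nodes and t in nodes for s, _, t in edges)

print(relationships_of("Person"), schema_is_valid)
```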
Define
Distributed Processing
The principle of dividing processing work between two or more computers, linked together in a network.
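The idea of splitting work, processing the parts separately, and combining the results can be sketched on one machine. In this illustration a thread pool stands in for the networked computers, which is an assumption made purely to keep the example runnable; the sample text and function names are also made up.

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, parts):
    """Divide a list into at most `parts` chunks of roughly equal size."""
    size = max(1, -(-len(data) // parts))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]

def count_words(lines):
    """The work each worker does on its own chunk."""
    return sum(len(line.split()) for line in lines)

lines = [
    "big data is big",
    "data moves fast",
    "data is varied",
    "store it on many servers",
]

chunks = split(lines, 2)

# Each worker processes one chunk independently of the others.
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_counts = list(pool.map(count_words, chunks))

# The partial results are combined into the final answer.
total = sum(partial_counts)
print(partial_counts, total)
```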
Define
Functional Programming
A programming paradigm that is mainly used for calculations and distributed processing, as the code can be proven correct and can be distributed across multiple devices without fear of erroneous results. Some characteristics are:
- Immutable Data Structures
- Statelessness
- Higher-Order Functions
Define
Immutable Data Structure
A data structure in which one cannot insert, remove, or replace the values contained therein.
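Python's built-in tuple is one example of an immutable data structure. In the sketch below, an attempt to replace a value fails, and "adding" an item produces a new tuple while the original is left untouched. The variable names are illustrative.

```python
readings = (3, 1, 2)

# Replacing a value in place is not allowed.
try:
    readings[0] = 99
except TypeError:
    changed = False
else:
    changed = True

# Building a new tuple leaves the original as it was.
extended = readings + (4,)

print(changed, readings, extended)
```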
Define
Statelessness
A given program does not change its state during execution: data structures do not change and variables are not reassigned once set.
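A small contrast may help. The first function below depends on and changes a value outside itself, so the same call can give different answers; the second is stateless, so its result depends only on its arguments. Both functions are hypothetical examples.

```python
running_total = 0  # state that lives outside the function

def add_stateful(x):
    """Stateful: changes running_total every time it is called."""
    global running_total
    running_total += x
    return running_total

def add_stateless(total, x):
    """Stateless: reads only its arguments and changes nothing."""
    return total + x

first_call = add_stateful(5)
second_call = add_stateful(5)     # same input, different result

pure_first = add_stateless(0, 5)
pure_second = add_stateless(0, 5)  # same input, same result

print(first_call, second_call, pure_first, pure_second)
```

Because the stateless version never touches shared data, copies of it can run on different machines without interfering with each other.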
Define
Higher-Order Function
A function that takes one or more functions as parameters, returns a function as its result, or both.
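Both directions can be shown in a few lines. In this sketch, `apply_twice` takes a function as a parameter and `make_multiplier` returns a function as its result; the names are made up for illustration.

```python
def apply_twice(f, x):
    """Takes a function as a parameter and applies it two times."""
    return f(f(x))

def make_multiplier(n):
    """Returns a new function that multiplies its input by n."""
    def multiply(x):
        return n * x
    return multiply

double = make_multiplier(2)
result = apply_twice(double, 5)  # double(double(5))
print(result)
```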
Define
Map Function
A higher-order function which applies a function to each element of a list and returns a new list.
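As a minimal example, Python's built-in `map` applies a function to every element; wrapping it in `list` gives the new list. The `square` function and the sample numbers are illustrative.

```python
def square(n):
    return n * n

numbers = [1, 2, 3, 4]

# map applies square to each element; the original list is not modified.
squares = list(map(square, numbers))

print(numbers, squares)
```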
Define
Fold/Reduce Function
Reduces a list to a single value by recursively applying a combining function to each element and the result accumulated so far.
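The sketch below shows the same fold two ways: with `functools.reduce` from the Python standard library, and with a hand-written recursive `fold` (a hypothetical helper) that makes the recursion visible.

```python
from functools import reduce

def add(accumulator, x):
    return accumulator + x

numbers = [1, 2, 3, 4]

# Library version: start from 0 and combine each element in turn.
total = reduce(add, numbers, 0)

def fold(f, accumulator, items):
    """Recursive version: combine the first item, then fold the rest."""
    if not items:
        return accumulator
    return fold(f, f(accumulator, items[0]), items[1:])

recursive_total = fold(add, 0, numbers)

print(total, recursive_total)
```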