Big Data Flashcards

1
Q

Big Data

A

Big Data is a term that refers to data that can’t be processed or analysed using traditional methods

There are 3 features that make data Big Data:

  • Volume; the amount of data to be processed is too big to fit on a single server - the dataset is typically terabytes or petabytes in size
  • Velocity; the data is generated very quickly and must be processed very quickly.
    If the data is at rest it can be batch processed, but if the data is in motion (streaming systems) it must be processed in real time
  • Variety; the data comes in many forms, including structured, semi-structured and unstructured data such as text and multimedia
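
The batch versus real-time distinction under Velocity can be sketched as follows. This is a minimal illustration, not a real streaming system; the function names and the sample readings are hypothetical.

```python
from typing import Iterable, Iterator


def batch_average(readings: list[float]) -> float:
    """Data at rest: the whole dataset is available, so process it in one go."""
    return sum(readings) / len(readings)


def streaming_average(readings: Iterable[float]) -> Iterator[float]:
    """Data in motion: update the result as each new value arrives."""
    total = 0.0
    count = 0
    for value in readings:
        total += value
        count += 1
        yield total / count  # an up-to-date answer after every reading


stored = [10.0, 20.0, 30.0]  # hypothetical sample data
batch_result = batch_average(stored)
stream_results = list(streaming_average(iter(stored)))
```

The batch function needs all the data before it can answer, while the streaming function produces a running answer without ever holding the full dataset.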
2
Q

Structured, semi-structured and unstructured data

A

Structured data = data that can be represented in table form because it has a clear, identifiable structure

Semi-structured data = data such as XML or JSON formatted files. It has no fixed, table-like structure, but tags or keys identify its elements and the structure can vary between records

Unstructured data = data whose content is so variable that it can’t be modelled in advance: it can’t be fitted into the table structure required by relational database modelling, and its elements are not identifiable with tags
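
The three kinds of data might look like this in practice. The records below are made-up examples chosen only to show the difference in structure.

```python
import json

# Structured: every record has the same fields, so it fits a table.
structured_rows = [
    ("S001", "Ada", 17),
    ("S002", "Alan", 18),
]

# Semi-structured: keys identify the elements, but the fields vary per record.
semi_structured = json.loads(
    '[{"id": "S001", "name": "Ada", "clubs": ["chess"]},'
    ' {"id": "S002", "name": "Alan", "email": "alan@example.com"}]'
)

# Unstructured: free text with no tags and no fixed fields.
unstructured = "Ada said the chess club meets after school on Tuesdays."

# The two JSON records do not share the same set of fields.
fields_per_record = [sorted(record.keys()) for record in semi_structured]
```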

3
Q

Distributed Processing

A

In systems involved in the processing of big data, the data has to be distributed across multiple servers because there’s too much data to fit on one server

The program written to process Big Data must be able to execute on more than one machine at a time - this is called distributed code
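
A rough sketch of the idea, simulated on one machine: the data is split into parts, each part is processed independently (as a separate server would), and the partial results are combined. The function names and the word-count task are assumptions for illustration; a real system would run each part on a different machine.

```python
from collections import Counter
from functools import reduce


def partition(records, n_servers):
    """Split the dataset into n_servers roughly equal parts."""
    return [records[i::n_servers] for i in range(n_servers)]


def count_words(chunk):
    """Work done independently on one server's share of the data."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts


def combine(a, b):
    """Merge the partial results from two servers."""
    return a + b


lines = ["big data", "big servers", "data data"]  # hypothetical dataset
chunks = partition(lines, 2)
# In a real system these calls would run at the same time on separate machines.
partial_counts = [count_words(chunk) for chunk in chunks]
word_totals = reduce(combine, partial_counts, Counter())
```

Because each call to `count_words` only touches its own chunk, the order in which the servers finish does not change the combined result.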

4
Q

Functional programming and Big Data

A

Functional languages are a solution to Big Data problems as:

  • Functional languages have immutable data structures - an immutable object is one whose state cannot be changed after it’s been created
  • Functional programs are stateless; the program’s behaviour doesn’t depend on how often the function is called or in what order different functions are called
  • Functional languages support higher order functions; they are functions that take at least one function as a parameter or return a function as a result or both

Higher order functions can be easily parallelised so that many processors can work at the same time without affecting other parts of the data
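
These ideas can be illustrated in Python, which is not a purely functional language but supports the same style. This is a sketch with made-up values; `scale` is a hypothetical helper.

```python
from functools import reduce

readings = (3, 8, 5, 12)  # a tuple is immutable: it cannot be changed in place


def scale(factor):
    """Higher-order function: returns a new function as its result."""
    return lambda x: x * factor


# map and filter are higher-order functions: they take a function as a parameter.
doubled = tuple(map(scale(2), readings))
large = tuple(filter(lambda x: x > 10, doubled))

# reduce combines the values into a single result.
sum_of_large = reduce(lambda acc, x: acc + x, large, 0)
```

Each element is transformed independently and `readings` is never modified, which is why a `map` like this could be split across many processors.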

5
Q

Fact based models

A

Each fact in a fact-based model captures a single piece of information

The data in a fact-based model is immutable and cannot be altered except to delete any data that has been entered incorrectly

When a change in circumstance is to be recorded it’s recorded as a new fact rather than an update - this means the dataset grows continuously with the addition of time-stamped immutable data

Each fact is:

  • Atomic; stores a single piece of info
  • Time-stamped
  • Immutable; never updated once recorded
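
A minimal sketch of a fact-based store, assuming a simple in-memory list. The `Fact` fields, the helper names and the sample data are hypothetical. A change of circumstance is appended as a new fact, and the current value is the one with the latest timestamp.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)  # frozen makes each fact immutable
class Fact:
    entity: str
    attribute: str
    value: str
    timestamp: int


facts = []  # append-only: facts are added, never updated


def record(entity, attribute, value, timestamp):
    """Record a change of circumstance as a new fact."""
    facts.append(Fact(entity, attribute, value, timestamp))


def current_value(entity, attribute):
    """The current value is the matching fact with the latest timestamp."""
    matching = [f for f in facts if f.entity == entity and f.attribute == attribute]
    return max(matching, key=lambda f: f.timestamp).value


record("user42", "city", "Leeds", 1)
record("user42", "city", "York", 2)  # a move is a new fact, not an update

try:
    facts[0].value = "Hull"  # attempting to alter an existing fact
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Both facts stay in the dataset, so the history of where `user42` lived is preserved.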
6
Q

Graph schema

A

A graph schema captures the structure of a dataset stored using the fact-based model

It shows entities in the dataset, properties of the entities and relationships between entities
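
One way to write such a schema down in code is shown below. The entity names, properties and relationships are invented for illustration, and the dictionary layout is just one possible representation.

```python
# Entities in the dataset.
entities = {"Person", "City"}

# Properties of each entity.
properties = {
    "Person": ["name", "date_of_birth"],
    "City": ["name", "population"],
}

# Relationships between entities, as (source, label, target).
relationships = [
    ("Person", "lives_in", "City"),
    ("Person", "friend_of", "Person"),
]


def relationships_from(entity):
    """List the relationships that start at the given entity."""
    return [(label, target) for source, label, target in relationships if source == entity]


person_links = relationships_from("Person")
```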
