Big Data Overview Flashcards

1
Q

Big Data

A

refers to non-conventional strategies and
innovative technologies used by businesses and
organizations to capture, manage, process, and make
sense of a large volume of data

2
Q

challenges of big data

A

*Capturing, transporting, and moving the data

*Managing - the data, the hardware involved, and the software

*Processing - to provide insight

*Storing - safeguarding and securing

3
Q

conventional BI & DWH architecture

A

Components:
App servers
Network switches
Database servers
SAN switch
Storage array

Properties:
SQL based
High availability
Enterprise database
Right design for structured data

4
Q

Analytics Architecture

A

Components:
Edge node
Network switches
Data nodes

Properties:
Not only SQL (NoSQL) based
High scalability, availability, and flexibility
Compute and storage in the same box to reduce network latency
Right design for semi-structured and unstructured data
Data and application are on the same machine (data nodes)

5
Q

The Vs of Big Data

A

Volume
Variety
Velocity: the speed at which vast amounts of data are generated, collected, and analyzed
Veracity: the quality or trustworthiness of the data
Value

6
Q

Volume

A

how much data is there?

7
Q

Variety

A
  • how many different types of sources are there?
8
Q

Velocity

A
  • how quickly is the data being created, moved, or
    accessed?
8
Q

Veracity

A

can we trust the data?

9
Q

Validity

A
  • is the data accurate and correct?
9
Q

Viability

A
  • is the data relevant to the use case at hand?
10
Q

Volatility

A
  • how often does the data change?
11
Q

Vulnerability

A

can we keep the data secure?

12
Q

Visualization

A
  • how can the data be presented to the user?
13
Q

Value

A
  • can this data produce a meaningful return on
    investment?
14
Q

Types of Big Data

A

Structured, semi-structured, and unstructured

15
Q

Structured

A

Data that can be stored
and processed in a
fixed format, i.e. a schema

16
Q

Semi-structured

A

Data that has no formal data model (such as a table definition in a
relational DBMS) but still has organizational properties, such as tags
and other markers, that separate semantic elements and make it easier
to analyze, e.g. XML or JSON
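
As a minimal sketch of what "organizational properties" can mean in practice, the Python snippet below parses two JSON records that do not share a fixed schema and flattens them into rows. The records, field names, and variable names are hypothetical and only for illustration.

```python
import json

# Hypothetical records: both are valid JSON, but they do not share a fixed schema.
raw_records = [
    '{"id": 1, "name": "sensor-a", "reading": 21.5}',
    '{"id": 2, "name": "sensor-b", "tags": ["outdoor", "north"]}',
]

records = [json.loads(line) for line in raw_records]

# The keys act as markers for the semantic elements, so fields can be
# looked up by name even though no table definition exists.
all_fields = sorted({key for record in records for key in record})

# Flatten into a fixed set of columns, filling missing fields with None.
rows = [[record.get(field) for field in all_fields] for record in records]
```

Each row now has the same columns, which is roughly the step needed before loading such data into a structured store.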

17
Q

Unstructured

A

Data that has no predefined form or data model, so it does not fit
naturally into RDBMS tables and usually has to be transformed into a
structured format before it can be analyzed

18
Q

The 5 Vs and data: Volume, Velocity, Variety, Veracity, Value

A

Volume: data at rest (not in use)
Velocity: data in motion (analyzing data on the fly)
Variety: data in many forms
Veracity: data in doubt
Value: data into money

19
Q

Hadoop

A

Apache open-source software framework for reliable,
scalable, distributed computing over massive amounts of data

20
Q

What Hadoop is good for

A

Massive amounts of data through
parallelism

A variety of data (structured, unstructured,
semi-structured)

Inexpensive commodity hardware
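
To illustrate the parallelism point, here is a small word-count sketch in the MapReduce style, the programming model commonly associated with Hadoop. It is plain Python that runs sequentially on one machine, not Hadoop code; the function names and sample chunks are hypothetical. On a cluster, each chunk would be mapped independently on a different data node.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit (word, 1) pairs for one chunk of input; chunks are independent."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input, already split into chunks as a distributed file system would.
chunks = ["big data big insight", "data in motion", "big value"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
```

Because no chunk depends on another during the map step, adding machines lets more chunks be processed at once, which is why the approach suits large volumes and fits poorly when work cannot be parallelized.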

21
Q

Hadoop is not good for

A

Not good for processing transactions (random access)

Not good when work cannot be parallelized

Not good for low latency data access

Not good for processing lots of small files

Not good for intensive calculations with little data

22
Q

Data Lake

A

a large storage repository and processing engine

23
Q

Data munging/Data wrangling

A

is the process of
transforming and mapping data from one
“raw” data form into another format with the intent of
making it more appropriate and valuable for analytics
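
A minimal sketch of the idea in Python, assuming hypothetical comma-separated input with inconsistent spacing, casing, and a missing value. The field names and the `wrangle` helper are made up for illustration.

```python
from datetime import datetime

# Hypothetical raw rows: inconsistent whitespace, casing, and a missing amount.
raw_rows = [
    " 2024-01-05 , Alice ,  12.50 ",
    "2024-01-06,BOB,7",
    "2024-01-07,carol,",
]


def wrangle(row):
    """Map one raw text row to a typed, consistently formatted record."""
    date_text, name, amount_text = [field.strip() for field in row.split(",")]
    return {
        "date": datetime.strptime(date_text, "%Y-%m-%d").date(),
        "customer": name.title(),
        "amount": float(amount_text) if amount_text else None,
    }


clean_rows = [wrangle(row) for row in raw_rows]
```

The cleaned records carry real dates and numbers instead of strings, so they can be filtered, sorted, and aggregated directly.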

24
Q

Oceans of data

A

data at rest

25
Q

Streams of data

A

data in motion
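
The contrast between data at rest and data in motion can be sketched in Python as follows. The readings and function names are hypothetical: the list stands in for stored data that is analyzed all at once, while the generator stands in for a stream whose values are analyzed as they arrive.

```python
def running_mean(stream):
    """Yield an updated mean after each new value, without storing the stream."""
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count


def sensor_readings():
    """Hypothetical source that produces values one at a time."""
    for value in [10.0, 20.0, 30.0]:
        yield value


# Data at rest: the full data set is available, so it is analyzed in one pass.
at_rest = [10.0, 20.0, 30.0]
batch_mean = sum(at_rest) / len(at_rest)

# Data in motion: a result is available after every incoming value.
stream_means = list(running_mean(sensor_readings()))
```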

26
Q

The main categories of data are

A
  • Structured
  • Unstructured
  • Natural language
  • Machine-generated
  • Graph-based
  • Audio, video, and image
  • Streaming
27
Q

The six design principles in Industry 4.0

A

Interoperability
Virtualization
Decentralization
Real-time Capability
Service Orientation
Modularity