Lecture 4 – Data quality Flashcards

1
Q

What are the 4 Vs of big data?

A

Volume
Velocity
Variety
Veracity

2
Q

4Vs: What does Velocity refer to?

A

The speed at which data is generated
- sensors generate data every second
- interactions on a website create data every second
High velocity -> requires analysis of streaming data

3
Q

4Vs: What does Volume refer to?

A

Scale of data

Datasets in terabytes or petabytes -> too big for a single computer to process -> new data storage and processing technology is needed

4
Q

4Vs: What does Veracity refer to?

A

Uncertainty of data
Quality of the data (high veracity = valuable to analyze, contributes in a meaningful way; low veracity = not valuable, inaccurate, does not contribute)

5
Q

4Vs: What does Variety refer to?

A

Different forms of data
(sources, formats, structured/unstructured)

6
Q

Which of the 4Vs are related to each growth law?

A

Moore’s: Velocity & Variety
Koomey’s: Variety
Bell’s: Variety & Veracity
Zimmerman’s: all of them

7
Q

Elements of the NIST Framework

A

Data sources
Data volume
Data velocity
Data variety
Data veracity
Software
Analytics
Processing
Capabilities
Security/Privacy
Lifecycle
Other

8
Q

What is data wrangling?

A

Cleaning and transforming raw data so it can be used for analysis

9
Q

Which issues related to the 4Vs may data wrangling need to correct for?

A

Volume: with a lot of data, irregularities creep in

Velocity: data can become out of date quickly

Variety: data can come in different formats and types

Veracity: the accuracy or consistency of data can differ between sources or sets
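
A minimal Python sketch of one possible correction per V, not a definitive wrangling pipeline. The records, field names, the `wrangle` helper, the one-year staleness cutoff, and the 1.0 tolerance are all hypothetical and chosen only for illustration.

```python
from datetime import date

# Hypothetical readings from two sources; all values are made up.
records = [
    {"id": "a", "temp": "21.5", "day": "2024-03-01", "source": "s1"},
    {"id": "a", "temp": "21.5", "day": "2024-03-01", "source": "s1"},  # exact duplicate
    {"id": "b", "temp": 19, "day": "2023-01-15", "source": "s1"},      # out of date
    {"id": "c", "temp": "22.0", "day": "2024-03-02", "source": "s1"},
    {"id": "c", "temp": "27.0", "day": "2024-03-02", "source": "s2"},  # disagrees with s1
]


def wrangle(rows, today, max_age_days=365, tolerance=1.0):
    """Return (cleaned rows, ids whose sources disagree)."""
    seen, clean = set(), []
    for row in rows:
        # Variety: coerce mixed types (str/int, text dates) to one representation.
        fixed = {
            "id": row["id"],
            "temp": float(row["temp"]),
            "day": date.fromisoformat(row["day"]),
            "source": row["source"],
        }
        # Volume: irregularities such as exact duplicates are dropped.
        key = tuple(fixed.values())
        if key in seen:
            continue
        seen.add(key)
        # Velocity: records older than the cutoff are treated as out of date.
        if (today - fixed["day"]).days > max_age_days:
            continue
        clean.append(fixed)

    # Veracity: flag ids whose values differ by more than the tolerance.
    temps_by_id = {}
    for row in clean:
        temps_by_id.setdefault(row["id"], []).append(row["temp"])
    conflicts = sorted(
        i for i, temps in temps_by_id.items() if max(temps) - min(temps) > tolerance
    )
    return clean, conflicts


clean, conflicts = wrangle(records, today=date(2024, 3, 10))
```

Here the conflicting id is only flagged, not resolved; deciding which source to trust is a separate judgment call.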

10
Q

R: What is the difference between NA and NaN?

A

NA = not available, a missing data point
NaN = not a number, an undefined or unrepresentable value (e.g. the result of 0/0; note that in R a non-zero number divided by 0 gives Inf, not NaN)
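
The card is about R. As a rough analogy only, the Python sketch below uses `None` to stand in for NA and a float NaN for NaN. This mapping is an assumption made for illustration, not how R represents these values. Python raises an error on `0.0 / 0.0`, so the sketch produces NaN from `inf - inf` instead.

```python
import math

missing = None                             # analogue of NA: the value was never recorded
undefined = float("inf") - float("inf")    # analogue of NaN: the result is not defined

values = [1.0, missing, undefined, 4.0]

# Separate checks are needed for the two kinds of "no usable value".
is_missing = [v is None for v in values]
is_nan = [isinstance(v, float) and math.isnan(v) for v in values]

# NaN is not equal to itself, so equality tests cannot be used to find it.
nan_self_equal = undefined == undefined
```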

11
Q

What are possible strategies to deal with missing data?

A

- omit them from the data (drop the whole column, or drop the affected observations)
- replace them with an estimated value (impute)
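
A minimal Python sketch of the strategies above, assuming a small hypothetical table stored as a list of dicts with `None` marking a missing value. The helper names are illustrative, and mean imputation is just one possible way to fill in a value.

```python
from statistics import mean

# Hypothetical table; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "score": 0.8},
    {"age": None, "income": 48000, "score": 0.6},
    {"age": 29, "income": None, "score": 0.9},
    {"age": 41, "income": 61000, "score": 0.7},
]


def drop_observations(rows):
    """Keep only rows that have no missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def drop_columns(rows):
    """Remove every column that has at least one missing value."""
    incomplete = {k for r in rows for k, v in r.items() if v is None}
    return [{k: v for k, v in r.items() if k not in incomplete} for r in rows]


def impute_mean(rows):
    """Replace each missing value with the mean of its column's observed values."""
    cols = list(rows[0].keys())
    means = {k: mean([r[k] for r in rows if r[k] is not None]) for k in cols}
    return [{k: (means[k] if r[k] is None else r[k]) for k in cols} for r in rows]


complete_rows = drop_observations(rows)
complete_cols = drop_columns(rows)
imputed = impute_mean(rows)
```

Dropping loses information (two of four rows, or two of three columns here), while imputing keeps the table's shape at the cost of inventing values.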
12
Q

What is the purpose of the shadow matrix in R?

A

It has the same dimensions as the data and marks each cell as missing or not missing, so you can see how missing values relate to other variables in the table
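
The card refers to R; the Python sketch below only illustrates the idea. It assumes a hypothetical table with `None` for missing values, and the helper names `shadow_matrix` and `mean_by_missingness` are made up for this example.

```python
from statistics import mean

# Hypothetical table; None marks a missing value.
rows = [
    {"age": 25, "income": None},
    {"age": 27, "income": None},
    {"age": 45, "income": 60000},
    {"age": 51, "income": 72000},
    {"age": None, "income": 55000},
]


def shadow_matrix(rows):
    """Same shape as the data; True where the original cell is missing."""
    return [{k: v is None for k, v in r.items()} for r in rows]


def mean_by_missingness(rows, shadow, value_col, shadow_col):
    """Mean of value_col, split by whether shadow_col is missing in that row."""
    groups = {True: [], False: []}
    for row, flags in zip(rows, shadow):
        if row[value_col] is not None:
            groups[flags[shadow_col]].append(row[value_col])
    return {flag: mean(vals) for flag, vals in groups.items() if vals}


shadow = shadow_matrix(rows)
age_by_income_missing = mean_by_missingness(rows, shadow, "age", "income")
```

In this made-up data, rows with missing income have a much lower mean age (26) than rows with observed income (48), which would suggest the values are not missing at random.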

13
Q

Different methods for imputation?

A

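Simple parametric:
replace missing values with the mean/median of the observed values

Simple non-parametric:
find the k nearest neighbors and average their values

Multiple imputation:
draw simulated values for the missing entries from a statistical distribution, repeating this to create several completed datasets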
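
A minimal Python sketch of the three methods, assuming hypothetical `(x, y)` pairs where `y` is sometimes missing (`None`). The function names are illustrative. Drawing from a normal distribution fitted to the observed `y` values is an assumption made to keep the example short; real multiple imputation usually models the missing values conditional on other variables and pools the analyses afterwards.

```python
import random
from statistics import mean, median, stdev

# Hypothetical data: y is missing for x = 3.0 and x = 6.0.
data = [(1.0, 10.0), (2.0, 12.0), (3.0, None), (4.0, 18.0), (5.0, 20.0), (6.0, None)]


def observed_pairs(data):
    return [(x, y) for x, y in data if y is not None]


def impute_stat(data, stat=mean):
    """Simple parametric: fill with the mean (or median) of the observed y values."""
    fill = stat([y for _, y in observed_pairs(data)])
    return [(x, fill if y is None else y) for x, y in data]


def impute_knn(data, k=2):
    """Simple non-parametric: average y over the k observed points nearest in x."""
    obs = observed_pairs(data)
    out = []
    for x, y in data:
        if y is None:
            nearest = sorted(obs, key=lambda p: abs(p[0] - x))[:k]
            y = mean([p[1] for p in nearest])
        out.append((x, y))
    return out


def impute_multiple(data, m=5, seed=0):
    """Multiple imputation sketch: m completed datasets with simulated draws.

    Assumption: missing y values are drawn from a normal distribution
    fitted to the observed y values.
    """
    rng = random.Random(seed)
    ys = [y for _, y in observed_pairs(data)]
    mu, sigma = mean(ys), stdev(ys)
    return [
        [(x, rng.gauss(mu, sigma) if y is None else y) for x, y in data]
        for _ in range(m)
    ]


mean_filled = impute_stat(data, mean)
median_filled = impute_stat(data, median)
knn_filled = impute_knn(data, k=2)
completed_sets = impute_multiple(data, m=5)
```

The mean and median fill every gap with the same value, kNN adapts to the local neighborhood, and multiple imputation keeps the uncertainty visible because each completed dataset gets different draws.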