L1: Course Introduction Flashcards

1
Q

What are the three Vs that distinguish big data from just data?

A

Volume
Variety
Velocity

2
Q

One of the three Vs that distinguish big data from just data is Volume. What does this mean?

A

Big data is defined by its volume. Owing to the digitalisation of life as we know it, immense amounts of data are now being captured.

3
Q

One of the three Vs that distinguish big data from just data is Variety. What does this mean?

A

Big data comes in a variety of forms captured by a multitude of different sensors and stored in a variety of formats. Big data goes beyond numbers, and also comprises images, videos and more

4
Q

One of the three Vs that distinguish big data from just data is Velocity. What does this mean?

A

Velocity refers to the speed at which data is generated and transmitted; i.e., big data is not just big sets of various types of data, it is often in motion and constantly changing

5
Q

Some critics argue that two additional Vs characterise big data. Which?

A) Veracity: difficulty of assessing the data’s reliability, completeness, or trustworthiness
B) Visualisability: ability to visualise meaningful information
D) Value: data’s ability to fuel business applications

A

A) Veracity: difficulty of assessing the data’s reliability, completeness, or trustworthiness

D) Value: data’s ability to fuel business applications

6
Q

Although BDA is increasingly adopted in organisations, some limitations hinder this adoption. Give some examples.

A

Budget: expensive to implement

Data security concerns: how to store it responsibly

Integration challenges: shortage of technical expertise

7
Q

Generally, when working with and analysing big data, the following methods can be considered. Which? (Select all correct)

A) Chunk and pull
B) Split and search
C) Push compute to data
D) Sample and model

A

A) Chunk and pull
C) Push compute to data
D) Sample and model

8
Q

One of the methods that can be applied when working with big data is CHUNK AND PULL. Which statements are NOT true?
A) The method suggests splitting the dataset into smaller chunks, allowing a local device to handle them
B) Chunks are typically logical and structured rather than based on randomized separation
C) After data split, each chunk can be pulled individually to conduct analysis
D) When all chunks are analyzed, the results are aggregated to reach a conclusion
E) Poorly suited for parallelization
F) Not all data is appropriate for chunking logically, posing a limitation to the method

A

FALSE: E) Poorly suited for parallelization

Chunk and pull is well suited to parallelization and ultimately allows you to analyze large datasets with less computational power
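
Statements A) to D) can be illustrated with a minimal Python sketch (Python is used here only for illustration). The `sales` records, the `pull_chunk` helper and the choice of region as the logical chunking key are all hypothetical; in practice each chunk would be pulled from wherever the full dataset is stored.

```python
# Hypothetical dataset: (region, price) records standing in for a large table.
sales = [
    ("north", 120.0),
    ("south", 80.0),
    ("north", 100.0),
    ("east", 90.0),
    ("south", 110.0),
]


def pull_chunk(records, region):
    """Stand-in for pulling one logical chunk (here: one region) to the local device."""
    return [price for r, price in records if r == region]


def summarise(chunk):
    """Analyse a single chunk: return its partial sum and its row count."""
    return sum(chunk), len(chunk)


# Split logically by region, analyse each chunk on its own, then aggregate.
regions = sorted({r for r, _ in sales})
partials = [summarise(pull_chunk(sales, region)) for region in regions]

total = sum(s for s, _ in partials)
count = sum(n for _, n in partials)
overall_mean = total / count
```

Because each chunk is summarised independently, the per-chunk step could in principle be run in parallel before the final aggregation.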

9
Q

One of the methods that can be applied when working with big data is PUSH COMPUTE TO DATA. Which statements are NOT true?
A) Entails compressing the big dataset in the database where it is stored
B) Once compression is complete, the data can be pulled onto a local device to analyze the compressed dataset
C) The disadvantage is that it relies on database speed and functionalities
D) The advantage is that the entire dataset is used at once and that it can be faster than CHUNK AND PULL

A

All are correct
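
A minimal Python sketch of the idea, assuming that "compressing" means aggregating the data inside the database before pulling it. It uses an in-memory SQLite database from the standard library as a stand-in for a real data store; the `sales` table and its columns are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the remote store (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("north", 120.0),
        ("south", 80.0),
        ("north", 100.0),
        ("east", 90.0),
        ("south", 110.0),
    ],
)

# The database does the heavy lifting: it reduces all rows to one row per region.
rows = conn.execute(
    "SELECT region, COUNT(*), AVG(price) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()

# Only the small, compressed result reaches the local device.
for region, n, mean_price in rows:
    print(region, n, mean_price)
```

The whole table contributes to the result, but the speed and the available aggregations depend on the database, which matches statements C) and D).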

10
Q

One of the methods that can be applied when working with big data is SAMPLE AND MODEL. Which statements are NOT true?
A) Entails taking a sample from a big dataset (a volume that can be handled by a local device)
B) Essentially we downsample the dataset to a more convenient size
C) Advantages: data can be modelled with standard software packages; allows for rapid prototyping with different techniques
D) Disadvantages: must ensure that sample is valid and representative; potential scalability issues
E) Not the focus of the course

A

WRONG: E) Not the focus of the course

SAMPLE AND MODEL is the method used in the course
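
A minimal Python sketch of the downsampling step, for illustration only. The simulated `population`, the sample size of 1,000 and the use of a simple mean as the "model" are all assumptions made to keep the example short.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical "big" dataset: 100,000 simulated measurements.
population = [rng.gauss(100, 15) for _ in range(100_000)]

# Downsample to a size a local device can handle comfortably.
sample = rng.sample(population, 1_000)

# "Model" the sample; here the model is just an estimate of the mean.
sample_mean = sum(sample) / len(sample)

# Computed only to check the sample; with real big data this may be infeasible.
population_mean = sum(population) / len(population)
```

A simple random sample like this is representative on average, but the validity check in D) still matters: a sample drawn from only part of the data (for example, the first rows of a sorted table) could give a biased estimate.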

11
Q

You want to estimate the price at which a given property will sell under given conditions.

Do you need a predictive or explanatory model for this purpose?

A

Predictive - you want to know if, when, where, and how much of something will happen.

For this purpose, we are only interested in the predictive accuracy of the model - not causal effects

12
Q

You want to investigate why properties sell for more or less; e.g., the causal effect of the size of the property, the location, material design etc. on the price

Do you need a predictive or explanatory model for this purpose?

A

An explanatory model identifies cause-and-effect relationships: If you want to know WHY or HOW something will happen, you are interested in explanation

13
Q

Predictive models explain correlations between the independent variables and the dependent variable

TRUE/FALSE

A

TRUE

In predictive models, you will get correlation coefficients estimating the association between the independent and dependent variables

14
Q

Explanatory models explain causations between the independent variables and the dependent variable

TRUE/FALSE

A

TRUE

15
Q

What is true about overfitting? (Select all correct)
A) One of the most common pitfalls in BDA and ML
B) Means that the model has been fitted too tightly to the training dataset and fails to be generalisable
C) It can be mitigated if the dataset is sufficiently large and by using cross-validation
D) It is a particularly big problem in very large datasets

A

A) One of the most common pitfalls in BDA and ML
B) Means that the model has been fitted too tightly to the training dataset and fails to be generalisable
C) It can be mitigated if the dataset is sufficiently large and by using cross-validation

WRONG: D) Very large datasets allow for cross-validation with fewer folds (faster computation), since each fold still includes an ample number of observations
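
A minimal Python sketch of k-fold cross-validation, the mitigation named in C), for illustration only. The simulated data, the helper names and the choice of a straight-line model are assumptions; the point is that every error is measured on observations the model was not fitted to.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the row indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def cross_validated_mse(xs, ys, k=5, seed=0):
    """Average held-out mean squared error over k folds."""
    errors = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Score only on the fold that was left out of the fit.
        mse = sum((ys[i] - (slope * xs[i] + intercept)) ** 2 for i in held_out)
        errors.append(mse / len(held_out))
    return sum(errors) / k


# Simulated data (assumption): a linear signal plus noise with variance 1.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1) for x in xs]

cv_mse = cross_validated_mse(xs, ys, k=5)
```

With more observations, each fold holds more rows, so a smaller `k` can still give a stable error estimate.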

16
Q

What does garbage in, garbage out refer to?

A

Whatever junk there is in the data will be propagated into the analytical results; you cannot model your way around poor-quality data

17
Q

In R, the inputs to a function are also called _____

A) base
B) factorial
C) argument

A

In R, the inputs to a function are also called ARGUMENTS

18
Q

To save the results of a function call, you need to create an ____

A) object
B) factorial
C) base

A

To save the results of a function call, you need to create an OBJECT

19
Q

In R, there are different types of data. Fill in the blanks:

____ represents text or string data
____ represents categorical data type and contains a finite set of unique values or levels
_____ represents continuous data and allows for decimal points
_____ represents whole numbers without decimal places
_____ represents Boolean values (TRUE or FALSE)

A

CHARACTER represents text or string data

FACTOR represents categorical data type and contains a finite set of unique values or levels

NUMERIC represents continuous data and allows for decimal points

INTEGER represents whole numbers without decimal places

LOGICAL represents Boolean values (TRUE or FALSE)