Lecture 3 Flashcards

Data Preparation

1
Q

How is data analytics performed in practice?

A

Data analytics goes through 4 phases in practice:
1. Preparation
2. Preprocessing
3. Analysis
4. Postprocessing

2
Q

What are the tasks performed during the preparation phase?

A

The tasks performed during preparation are:
* Planning
* Data collection
* Feature generation
* Data selection

3
Q

What are the tasks performed during the preprocessing phase?

A

The tasks performed during preprocessing are:
* cleaning
* filtering
* completion
* correction
* standardization
* transformation

4
Q

What are the tasks performed during the analysis phase?

A

The tasks performed during the analysis are:
* visualisation
* correlation
* regression
* forecasting
* classification
* clustering

5
Q

What are the tasks performed during the postprocessing phase?

A

The tasks performed during postprocessing are:
* interpretation
* documentation
* evaluation

6
Q

This card is about data preparation!

Why is merging often necessary in data analytics?

A

It is often the case that a single dataset is insufficient to perform the whole analysis, hence datasets have to be merged. The appropriate merging technique depends on the number of datasets and the types of variables involved.

7
Q

What are the possible types of merging?

A

There are 4 types of merging:
* Appending
* Horizontal stacking
* Join family
* Variable selection

8
Q

When is appending used?

A

Appending is used for two datasets with the same variables: the rows of one are stacked vertically below the rows of the other.
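A minimal pandas sketch of appending, assuming the datasets are DataFrames with identical columns (the data and column names below are made up):

```python
import pandas as pd

# Two datasets with the same variables (hypothetical example data)
sales_q1 = pd.DataFrame({"store": ["A", "B"], "revenue": [100, 150]})
sales_q2 = pd.DataFrame({"store": ["C", "D"], "revenue": [120, 90]})

# Appending = vertical stacking: rows of the second dataset go below the first
appended = pd.concat([sales_q1, sales_q2], ignore_index=True)
print(appended)  # 4 rows, same 2 columns
```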

9
Q

When is horizontal stacking used?

A

Horizontal stacking is used for two datasets that describe the same observations (rows): their columns are placed side by side (horizontally concatenated).
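A minimal pandas sketch of horizontal stacking, assuming the two DataFrames describe the same rows in the same order (made-up data):

```python
import pandas as pd

# Two datasets describing the same observations (hypothetical example data)
demographics = pd.DataFrame({"customer": ["A", "B"], "age": [34, 29]})
purchases = pd.DataFrame({"n_orders": [5, 2], "total": [310.0, 85.5]})

# Horizontal stacking: the columns of the second dataset are placed next to the first
stacked = pd.concat([demographics, purchases], axis=1)
print(stacked)  # 2 rows, 4 columns
```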

10
Q

When is the join family used?

A

It is used for two datasets with different variables. There are six types of joins:
* inner_join: keeps everything that is in one dataset AND the other
* full_join: keeps everything that is in one dataset OR the other
* left_join: keeps every row of the left dataset, adding the matching columns from the right
* right_join: keeps every row of the right dataset, adding the matching columns from the left
* semi_join: keeps only the rows of the first dataset that have a match in the second, with only the first dataset's variables
* anti_join: keeps only the rows of the first dataset that have no match in the second, with only the first dataset's variables
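The join names above follow dplyr (R) conventions; below is a rough pandas sketch of the same ideas (the key column `id` and the data are made up, and semi/anti joins are emulated since pandas has no direct keyword for them):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "x": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "y": [10, 20, 30]})

inner = left.merge(right, on="id", how="inner")    # ids present in both: 2, 3
full = left.merge(right, on="id", how="outer")     # all ids 1-4, NaN where missing
left_j = left.merge(right, on="id", how="left")    # every row of left
right_j = left.merge(right, on="id", how="right")  # every row of right

semi = left[left["id"].isin(right["id"])]    # rows of left with a match, left's columns only
anti = left[~left["id"].isin(right["id"])]   # rows of left without a match, left's columns only
```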

11
Q

What is common variable selection?

A

With more than two datasets, it consists of selecting the features common to all of them and joining the datasets on those features.

12
Q

What is creation of multiple subdatasets?

A

It is a second technique for merging more than two datasets: rather than merging everything at once, subsets of the datasets are merged separately.

13
Q

When does duplicate data occur ?

A

It occurs from data entry:
* lack of unique identifiers
* lack of integrity or validation checks
* data errors

Or from data merging:
* structural heterogeneity
* lexical heterogeneity

14
Q

What can you do to limit duplicates ?

A

Duplicates can be prevented by design with:
* use of standards
* integrity rules
* validation checks

15
Q

What is structural heterogeneity?

A

Fields of different databases represent the same information in a structurally different manner.

EX:
DB1: Contact Name
DB2: Salutation, First Name, Last Name

16
Q

What is Lexical heterogeneity?

A

Fields of different databases are structurally the same, but they represent the same information in a different manner.

DB1 - Address: 32 E St. 4.
DB2 - Address: 32 East, 4th Street

17
Q

What is a unique identifier?

A

A unique identifier is a data field that is always unique for an entity

Ex: Social Security Number for customer data, Manufacturer Part Number, etc.

18
Q

What are validation checks ?

A

Validation check: Do the unique identifiers conform to valid patterns (for example, AAA-GG-SSSS for SSN)?
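A minimal sketch of such a check in Python, assuming the AAA-GG-SSSS pattern stands for three digits, two digits and four digits separated by dashes:

```python
import re

# Hypothetical pattern for AAA-GG-SSSS (each letter standing for one digit)
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")

def is_valid_ssn(value: str) -> bool:
    """Validation check: does the identifier conform to the expected pattern?"""
    return bool(SSN_PATTERN.match(value))

print(is_valid_ssn("123-45-6789"))  # True
print(is_valid_ssn("123456789"))    # False: digits only, no dashes
```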

19
Q

What are integrity constraints?

A

Integrity constraint: Do the identifiers comply with the standard length (for example, the 11-character limit for an SSN)?

20
Q

What can you do when faced with duplicate data?

A

You can use deduplication, which is the process of removing duplicates from a dataset. Deduplication techniques for a single dataset are:
* Identify duplicate rows
* Find duplicates in one column
* Find duplicates in multiple columns
* Drop duplicates in one column
* Drop duplicates in multiple columns
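A minimal pandas sketch of these operations (the columns are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "email": ["a@x.com", "a@x.com", "b@x.com", "b@x.com"],
})

dup_rows = df[df.duplicated()]                            # identify fully duplicate rows
dup_one_col = df[df.duplicated(subset=["email"])]         # find duplicates in one column
dedup_one_col = df.drop_duplicates(subset=["email"])      # drop duplicates in one column
dedup_multi = df.drop_duplicates(subset=["id", "email"])  # drop duplicates in multiple columns
```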

21
Q

What is preprocessing?

A

Preprocessing is often used as an umbrella term for all the operations that are performed prior to starting the analysis.
This includes:
* General purpose preprocessing
* Preprocessing for diagnostic analytics

22
Q

We are now moving on to the data preprocessing section.

Why is preprocessing needed?

A

The data quality strongly influences the analysis quality (Garbage In => Garbage Out).
Preprocessing provides solutions to improve the data quality of our dataset.

23
Q

How is data quality measured?

A

There are 5 measures of data quality:
* Validity
* Accuracy
* Consistency
* Uniformity
* Completeness

24
Q

What is validity?

A

Validity is the degree to which the data conform to defined business rules or constraints.

25
Q

What is accuracy?

A

Accuracy: The degree to which the data is close to the true values.

Note! This is different from precision.

Precision: How consistent results are when measurements are repeated.

Data might be valid (i.e. respecting validity constraints) but not accurate.

26
Q

What is consistency?

A

Consistency: The degree to which the data is consistent, within the same data set or across multiple data sets.

27
Q

What is uniformity?

A

Uniformity: The degree to which the data is specified using the same unit of measure.
In terms of probability distributions, uniformity can also refer to the data having a uniform distribution.
Here we consider uniformity in terms of measurement units (the first sense).

28
Q

What is completeness?

A

Completeness: The degree to which all required measures are known.
Several reasons could exist for incomplete data:
* Human-intensive workflow
* Multiple data sources
* Optional data

29
Q

What are some constraints of data validity?

A

Some constraints are:
* Data-type constraints: Values in a particular column must be of a particular datatype, e.g., boolean, numeric, date, etc.
* Range constraints: Numbers or dates should fall within a certain range.
* Mandatory constraints: Certain columns cannot be empty.
* Unique constraints: A field, or a combination of fields, must be unique across a dataset.
* Set-membership constraints: Values of a column come from a set of discrete values, e.g., enum values.
* Foreign-key constraints: As in relational databases, a foreign-key column can't have a value that does not exist in the referenced primary key.
* Cross-field validation: Certain conditions that span multiple fields must hold. For example, a patient's date of discharge from the hospital cannot be earlier than the date of admission.
* Regular expression patterns: Text fields must match a certain pattern. For example, phone numbers may be required to have the pattern (999) 999-9999.
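A minimal pandas sketch checking a few of these constraints on a hypothetical patient table (column names and rules are invented for illustration):

```python
import pandas as pd

patients = pd.DataFrame({
    "age": [34, -2, 51],          # range constraint: 0 <= age <= 120
    "name": ["Ada", None, "Bo"],  # mandatory constraint: cannot be empty
    "admission": pd.to_datetime(["2024-01-02", "2024-01-05", "2024-01-10"]),
    "discharge": pd.to_datetime(["2024-01-04", "2024-01-03", "2024-01-12"]),
})

range_ok = patients["age"].between(0, 120)
mandatory_ok = patients["name"].notna()
cross_field_ok = patients["discharge"] >= patients["admission"]  # cross-field validation

violations = patients[~(range_ok & mandatory_ok & cross_field_ok)]
print(violations)  # rows breaking at least one constraint
```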

30
Q

What are the types of missing data?

A

* Structurally missing data
* Missing completely at random (MCAR)
* Missing at random (MAR)
* Missing not at random (MNAR)
31
Q

What is structurally missing data?

And how can you deal with it?

A

Structurally missing data is data that is missing for a logical reason.
Solutions:
* Exclude from the analysis the variables with the structurally missing values.
* If we can logically deduce the correct value, use that value in place of the missing values in our analysis (imputation).

Imputation is the assignment of a value to something by inference from the values of the products or processes to which it contributes.

32
Q

What is data missing completely at random?

MCAR = missing completely at random

A

It is relatively easy to check the assumption that data is missing completely at random. If you can predict which units have missing data (e.g., using common sense, regression, or some other method), then the data is not MCAR.
A more formal way of testing is to use Little's MCAR test.
Solution: perform the analysis using only the observations that have complete data (provided we have enough of such observations).

N.B. The MCAR assumption is rarely a good assumption. It is only likely to be true when the data is missing due to some truly random phenomenon (e.g., if people were randomly asked 10 of 15 questions in a questionnaire).

33
Q

What is data missing at random (MAR)?

A

In the case of missing completely at random, the assumption was that there was no pattern. An alternative assumption, known somewhat confusingly as missing at random (MAR), instead assumes that whether a value is missing can be predicted from the other observed data (the missingness depends on the observed data, not on the missing value itself).
Solution: When data is missing at random, we need to either use an advanced imputation method, such as multiple imputation, or an analysis method specifically designed for missing-at-random data.

34
Q

What is data missing not at random (MNAR)?

A

Data is missing not at random (MNAR) when the probability of a value being missing depends on the unobserved value itself.
Solution: When data is missing not at random, we cannot use any of the standard methods for dealing with missing data (e.g., imputation, or algorithms specifically designed for missing values). If the data is missing not at random, standard calculations give the wrong answer.

35
Q

What is diagnostic analytics?

A

Diagnostic analytics aims to look at a phenomenon and understand the reasons behind it and the relationships between the observed variables.
In this course, the focus is on analysing the relationships among variables:
* **Correlation analysis** (~ Is there a relationship between a pair of variables?)
* **Regression analysis** (~ Is there a relationship between one (dependent) variable and one or more (independent) variables?)

36
Q

We are now talking about diagnostic analytics.

What is feature engineering and why is it applied?

Diagnostic analytics aims to look at a phenomenon and understand “why did it happen”?

A

Feature engineering is *transforming and combining* existing data to provide new information and extract more information from the dataset.
This is done because the raw data is sometimes not sufficient to provide all the necessary information.

37
Q

How can you create more data?

A

You can create more data through:
* Feature extraction
* Feature combination

38
Q

How can you transform existing data?

A

Depending on whether the data is missing or whether it has to be scaled, we need to use different techniques.
For Missing data :
* Numerical imputation
* Categorical imputation
For Scaling:
* Log-transform
* Normalization
* Standardization
Otherwise, we can work with outlier removal.

39
Q

What is numerical imputation?

A

Numerical imputation is about "filling the gaps": transforming a variable to remove missing values.
Imputation techniques include:
* Replacement with statistically neutral values (mean, median)
* Replacement with null values
* Replacement with model-generated values
* Exclusion of missing values
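A minimal pandas sketch of these options (the variable is made up; model-generated replacement would require an additional model and is not shown):

```python
import pandas as pd

income = pd.Series([42_000, None, 55_000, 61_000, None])

mean_filled = income.fillna(income.mean())      # replacement with the mean
median_filled = income.fillna(income.median())  # replacement with the median
zero_filled = income.fillna(0)                  # replacement with a null value
dropped = income.dropna()                       # exclusion of missing values
```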

40
Q

What is categorical imputation?

A

It is also about "filling the gaps": transforming a variable to remove missing values.
A common technique is hot-deck imputation: replace the missing value with the most similar value from the existing dataset.
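A simplified hot-deck sketch in pandas, assuming "most similar" means a donor row from the same group (the region/city columns are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "city": ["Oslo", None, "Rome", None],
})

def hotdeck_fill(s: pd.Series) -> pd.Series:
    """Fill missing values with the most frequent value among the group's donors."""
    donors = s.dropna()
    return s.fillna(donors.mode().iloc[0]) if not donors.empty else s

df["city"] = df.groupby("region")["city"].transform(hotdeck_fill)
```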

41
Q

What is log-transformation?

A

It is taking the logarithm of a given feature.
It allows turning a skewed distribution into a more or less normal one.
It is a useful feature transformation technique for tools that make an assumption of normality.
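A minimal numpy sketch (log1p, i.e. log(1 + x), is used so that zero values do not cause problems; the data is made up):

```python
import numpy as np

skewed = np.array([1, 2, 2, 3, 5, 8, 50, 400])  # right-skewed values
log_transformed = np.log1p(skewed)              # much less skewed, closer to normal
```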

42
Q

What is standardisation?

A

Standardising is rescaling a dataset so that its mean is 0 and its standard deviation is 1.
There is no specific upper or lower bound for the maximum and minimum values.
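A minimal numpy sketch (using numpy's default population standard deviation; the values are made up):

```python
import numpy as np

x = np.array([10.0, 12.0, 15.0, 20.0, 23.0])
standardized = (x - x.mean()) / x.std()  # result has mean 0 and standard deviation 1
```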

43
Q

What is normalisation?

A

It is rescaling a variable so that the range of the variable is [0,1]
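A minimal min-max scaling sketch (assuming the variable is not constant, otherwise the denominator would be zero):

```python
import numpy as np

x = np.array([10.0, 12.0, 15.0, 20.0, 23.0])
normalized = (x - x.min()) / (x.max() - x.min())  # rescaled to the range [0, 1]
```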

44
Q

What are outliers?

A

Outliers are values that are situated in the tails of the distribution, outside the “normal” range for the values.
Can be caused by different reasons:
* Measurement errors
* Statistically relevant anomalies (rare events)
* Missing values
Can be tackled by
* Removal
* Imputation
* Capping
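One common (but not the only) way to detect and then remove or cap outliers is the interquartile-range rule, sketched below on made-up data:

```python
import numpy as np

x = np.array([12, 13, 14, 15, 15, 16, 17, 18, 95])  # 95 looks suspicious

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # "normal" range under the IQR rule

removed = x[(x >= lower) & (x <= upper)]  # removal: drop values outside the range
capped = np.clip(x, lower, upper)         # capping: clip values to the range bounds
```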

45
Q

How does outlier removal work?

A

The outlier removal process goes through a few steps:
1. Visual inspection/EDA
2. Human-in-the-loop procedure to assess validity of outlier
3. Outlier/anomaly removal:
A. Via training of a dedicated algorithm
B. Via human inspection

46
Q

What is feature extraction?

A

Feature extraction is the process of creating (extracting) new variables from the existing ones.
It includes:
* Creating derived variables (e.g. aggregations, time since)
* Discretizing existing variables
* Encoding variables in different formats (e.g. one-hot)
* Applying transformations on existing variables (logarithm, exponentiation, standardization, normalization)
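A minimal pandas sketch of a few of these operations (column names and the reference date are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "signup": pd.to_datetime(["2023-01-01", "2024-06-15"]),
    "age": [23, 67],
    "country": ["IT", "DE"],
})

# Derived variable: time since signup, in days, relative to a fixed reference date
df["days_since_signup"] = (pd.Timestamp("2025-01-01") - df["signup"]).dt.days

# Discretization: turn a numerical variable into categories
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 60, 120], labels=["young", "adult", "senior"])

# One-hot encoding of a categorical variable
df = pd.get_dummies(df, columns=["country"])
```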

47
Q

What is feature combination?

A

Feature combination is the process of creating (extracting) new variables by combining existing ones.
It includes:
* Applying operations (sums, counts, ratios) among numerical variables
* Polynomial/spline/trigonometric combinations
* Combining categorical variables (e.g. outer product)
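A minimal pandas sketch (the variables are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"revenue": [100.0, 250.0], "n_orders": [4, 10],
                   "height": [1.70, 1.80], "weight": [65, 90]})

df["revenue_per_order"] = df["revenue"] / df["n_orders"]  # ratio of two numerical variables
df["bmi"] = df["weight"] / df["height"] ** 2              # nonlinear combination of variables
df["orders_squared"] = df["n_orders"] ** 2                # simple polynomial term
```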

48
Q

What is causation?

A

Causation is when one thing (a cause) causes another thing to happen (an effect).

49
Q

What is correlation?

A

Correlation is when two or more things appear to be related.

Correlation doesn’t always mean causation!

50
Q

What is the difference between correlational and experimental research?

A

Correlational (C) and experimental (E) research both use quantitative methods to investigate relationships between variables, but there are important differences in the data collection methods and the types of conclusions you can draw. They differ in:
* Purpose:
  * C - used to test the strength of association between variables
  * E - used to test cause-and-effect relationships between variables
* Variables:
  * C - variables are only observed, with no manipulation or intervention by researchers
  * E - an independent variable is manipulated and a dependent variable is observed
* Control:
  * C - limited control is used, so other variables may play a role in the relationship
  * E - extraneous variables are controlled so that they can't impact your variables of interest
* Validity:
  * C - high external validity: you can confidently generalize your conclusions to other populations or settings
  * E - high internal validity: you can confidently draw conclusions about causation

51
Q

What is correlation analysis?

A

Using a correlation analysis, you can summarize the relationship between variables into a correlation coefficient: a single number that describes the strength and direction of the relationship between variables.
With this number, you’ll quantify the degree of the relationship between variables.
The Pearson product-moment correlation coefficient, also known as Pearson’s r, is commonly used for assessing a linear relationship between two quantitative variables.
Correlation coefficients are usually found for two variables at a time, but you can use a multiple correlation coefficient for three or more variables.
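A minimal sketch computing Pearson's r with scipy (the data is made up):

```python
from scipy.stats import pearsonr

hours_studied = [1, 2, 3, 4, 5, 6]
exam_score = [52, 55, 61, 64, 70, 74]

r, p_value = pearsonr(hours_studied, exam_score)
print(f"Pearson's r = {r:.2f}, p-value = {p_value:.4f}")  # r near +1: strong positive linear relationship
```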

52
Q

What is regression analysis?

A

Simple linear regression is a parametric test, meaning that it makes certain assumptions about the data. These assumptions are:
Homogeneity of variance (homoscedasticity): the size of the error in our prediction doesn’t change significantly across the values of the independent variable.
Independence of observations: the observations in the dataset were collected using statistically valid sampling methods, and there are no hidden relationships among observations.
Normality: The data follows a normal distribution.
Simple linear regression is a parametric test (with all the limitations of statistical testing):
* Statistical power/sample size
* Dependency on assumptions (if assumptions are violated, conclusions are unreliable)
When all the assumptions are verified, conclusions can be drawn only on the observed data sample, not extrapolated to the entire population.

The formula for a simple linear regression is:
y = β0 + β1·X + ε
where β0 is the intercept, β1 the slope coefficient, and ε the error term.
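A minimal sketch fitting a simple linear regression with scipy on made-up data (the intercept estimates β0 and the slope estimates β1):

```python
from scipy.stats import linregress

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 4.3, 5.9, 8.2, 9.8, 12.1]

fit = linregress(x, y)
print(fit.intercept, fit.slope)  # estimates of beta_0 and beta_1
print(fit.rvalue ** 2)           # R^2: share of variance explained by the model
```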

53
Q

How can you characterise distributions?

A

Several statistics can be computed about a probability distribution:
* Mean
* Mode
* Median
* Standard deviation/Variance
* Skewness
* Kurtosis
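A minimal sketch computing these statistics in Python (made-up data; scipy provides skewness and kurtosis):

```python
import numpy as np
from scipy import stats

x = np.array([2, 3, 3, 4, 5, 5, 5, 6, 9])

print(np.mean(x), np.median(x), np.std(x), np.var(x))  # central tendency and spread

values, counts = np.unique(x, return_counts=True)
print(values[np.argmax(counts)])         # mode: the most frequent value (5)

print(stats.skew(x), stats.kurtosis(x))  # asymmetry and tailedness of the distribution
```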

54
Q

What are the steps for statistical testing?

A
  1. Experimental design and Hypothesis definition (Null hypothesis and alternative hypothesis)
  2. Data collection (Probability sampling, Selective sampling, Sample size determination)
  3. Data summarization (Central tendency measure, Variability measure, Test statistic)
  4. Hypothesis testing
  5. Result interpretation (One-sample test, Paired/Unpaired samples test (Two-sample), One-tailed/Two-tailed test (Directional vs Non-directional), Correlation testing, Regression testing)
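As an illustration of steps 3-5, a minimal unpaired two-sample t-test with scipy (the groups and values are made up):

```python
from scipy.stats import ttest_ind

group_a = [12.1, 13.4, 11.8, 12.9, 13.1]  # e.g. a control group
group_b = [14.2, 15.1, 13.8, 14.9, 15.4]  # e.g. a treatment group

t_stat, p_value = ttest_ind(group_a, group_b)  # unpaired (two-sample), two-tailed test
print(t_stat, p_value)  # reject the null hypothesis if p_value < the chosen alpha (e.g. 0.05)
```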