Lecture 3 Flashcards

Data Preparation

1
Q

How is data analytics performed in practice?

A

Data analytics goes through 4 phases in practice:
1. Preparation
2. Preprocessing
3. Analysis
4. Postprocessing

2
Q

What are the tasks performed during the preparation phase?

A

The tasks performed during preparation are:
* Planning
* Data collection
* Feature generation
* Data selection

3
Q

What are the tasks performed during the preprocessing phase?

A

The tasks performed during preprocessing are:
* cleaning
* filtering
* completion
* correction
* standardization
* transformation

4
Q

What are the tasks performed during the analysis phase?

A

The tasks performed during the analysis are:
* visualisation
* correlation
* regression
* forecasting
* classification
* clustering

5
Q

What are the tasks performed during the postprocessing phase?

A

The tasks performed during postprocessing are:
* interpretation
* documentation
* evaluation

6
Q

This card is about data preparation!

Why is merging often necessary in data analytics?

A

It is often the case that a single dataset is insufficient to perform the whole analysis, hence datasets have to be merged. The appropriate merging technique depends on the number of datasets and the types of variables involved.

7
Q

What are the possible types of merging?

A

There are 4 types of merging:
* Appending
* Horizontal stacking
* Join family
* Variable selection

8
Q

When is appending used?

A

Appending is used for two datasets with the same variables: the rows of one are stacked vertically below the rows of the other.
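A minimal pandas sketch of appending, assuming the datasets are DataFrames with identical columns (the data and column names below are made up):

```python
import pandas as pd

# Two datasets with the same variables (hypothetical example data)
sales_q1 = pd.DataFrame({"store": ["A", "B"], "revenue": [100, 150]})
sales_q2 = pd.DataFrame({"store": ["C", "D"], "revenue": [120, 90]})

# Appending = vertical stacking: rows of the second dataset go below the first
appended = pd.concat([sales_q1, sales_q2], ignore_index=True)
print(appended)  # 4 rows, same 2 columns
```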

9
Q

When is horizontal stacking used?

A

Horizontal stacking is used for two datasets that describe the same observations (rows): their columns are placed side by side (horizontally concatenated).
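A minimal pandas sketch of horizontal stacking, assuming the two DataFrames describe the same rows in the same order (made-up data):

```python
import pandas as pd

# Two datasets describing the same observations (hypothetical example data)
demographics = pd.DataFrame({"customer": ["A", "B"], "age": [34, 29]})
purchases = pd.DataFrame({"n_orders": [5, 2], "total": [310.0, 85.5]})

# Horizontal stacking: the columns of the second dataset are placed next to the first
stacked = pd.concat([demographics, purchases], axis=1)
print(stacked)  # 2 rows, 4 columns
```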

10
Q

When is the join family used?

A

It is used for two datasets with different variables. There are six types of joins:
* inner_join: keeps everything that is in one dataset AND the other
* full_join: keeps everything that is in one dataset OR the other
* left_join: keeps every row of the left dataset, adding the matching columns from the right
* right_join: keeps every row of the right dataset, adding the matching columns from the left
* semi_join: keeps only the rows of the first dataset that have a match in the second, with only the first dataset's variables
* anti_join: keeps only the rows of the first dataset that have no match in the second, with only the first dataset's variables
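The join names above follow dplyr (R) conventions; below is a rough pandas sketch of the same ideas (the key column `id` and the data are made up, and semi/anti joins are emulated since pandas has no direct keyword for them):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "x": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "y": [10, 20, 30]})

inner = left.merge(right, on="id", how="inner")    # ids present in both: 2, 3
full = left.merge(right, on="id", how="outer")     # all ids 1-4, NaN where missing
left_j = left.merge(right, on="id", how="left")    # every row of left
right_j = left.merge(right, on="id", how="right")  # every row of right

semi = left[left["id"].isin(right["id"])]    # rows of left with a match, left's columns only
anti = left[~left["id"].isin(right["id"])]   # rows of left without a match, left's columns only
```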

11
Q

What is common variable selection?

A

With more than two datasets, it consists of selecting the features common to all of them and joining the datasets on those features.

12
Q

What is creation of multiple subdatasets?

A

It is a second technique for merging more than two datasets: rather than merging everything at once, subsets of the datasets are merged separately.

13
Q

When does duplicate data occur ?

A

It occurs from data entry:
* lack of unique identifiers
* lack of integrity or validation checks
* data errors

Or from data merging:
* structural heterogeneity
* lexical heterogeneity

14
Q

What can you do to limit duplicates ?

A

Duplicates can be prevented by design with:
* use of standards
* integrity rules
* validation checks

15
Q

What is structural heterogeneity?

A

Fields of different databases represent the same information in a structurally different manner.

EX:
DB1: Contact Name
DB2: Salutation, First Name, Last Name

16
Q

What is Lexical heterogeneity?

A

Fields of different databases are structurally the same, but they represent the same information in a different manner.

DB1 - Address: 32 E St. 4.
DB2 - Address: 32 East, 4th Street

17
Q

What is a unique identifier?

A

A unique identifier is a data field that is always unique for an entity

Ex: Social Security Number for customer data, Manufacturer Part Number, etc.

18
Q

What are validation checks ?

A

Validation check: Do the unique identifiers conform to valid patterns (for example, AAA-GG-SSSS for SSN)?
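A minimal sketch of such a check in Python, assuming the AAA-GG-SSSS pattern stands for three digits, two digits and four digits separated by dashes:

```python
import re

# Hypothetical pattern for AAA-GG-SSSS (each letter standing for one digit)
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")

def is_valid_ssn(value: str) -> bool:
    """Validation check: does the identifier conform to the expected pattern?"""
    return bool(SSN_PATTERN.match(value))

print(is_valid_ssn("123-45-6789"))  # True
print(is_valid_ssn("123456789"))    # False: digits only, no dashes
```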

19
Q

What are integrity constraints?

A

Integrity constraint: Do the identifiers comply with the standard length (for example, the 11-character limit for an SSN)?

20
Q

What can you do when faced with duplicate data?

A

You can use deduplication, which is the process of removing duplicates from a dataset. Deduplication techniques for a single dataset are:
* Identify duplicate rows
* Find duplicates in one column
* Find duplicates in multiple columns
* Drop duplicates in one column
* Drop duplicates in multiple columns
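A minimal pandas sketch of these operations (the columns are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "email": ["a@x.com", "a@x.com", "b@x.com", "b@x.com"],
})

dup_rows = df[df.duplicated()]                            # identify fully duplicate rows
dup_one_col = df[df.duplicated(subset=["email"])]         # find duplicates in one column
dedup_one_col = df.drop_duplicates(subset=["email"])      # drop duplicates in one column
dedup_multi = df.drop_duplicates(subset=["id", "email"])  # drop duplicates in multiple columns
```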

21
Q

What is preprocessing?

A

Preprocessing is often used as an umbrella term for all the operations that are performed prior to starting the analysis.
This includes:
* General purpose preprocessing
* Preprocessing for diagnostic analytics

22
Q

We are now moving on to the data preprocessing section.

Why is preprocessing needed?

A

The data quality strongly influences the analysis quality (Garbage In => Garbage Out).
Preprocessing provides solutions to improve the data quality of our dataset.

23
Q

How is data quality measured?

A

There are 5 measures of data quality:
* Validity
* Accuracy
* Consistency
* Uniformity
* Completeness

24
Q

What is validity?

A

Validity is the degree to which the data conform to defined business rules or constraints.

25
Q

What is accuracy?

A

Accuracy: The degree to which the data is close to the true values.

Note! This is different from precision.

Precision: How consistent results are when measurements are repeated.

Data might be valid (i.e. respecting validity constraints) but not accurate.

26
Q

What is consistency?

A

Consistency: The degree to which the data is consistent, within the same data set or across multiple data sets.

27
Q

What is uniformity?

A

Uniformity: The degree to which the data is specified using the same unit of measure.
In terms of probability distributions, uniformity can also refer to the data having a uniform distribution.
Here we consider uniformity in terms of measurement units (the first sense).

28
Q

What is completeness?

A

Completeness: The degree to which all required measures are known.
Several reasons could exist for incomplete data:
* Human-intensive workflow
* Multiple data sources
* Optional data

29
Q

What are some constraints of data validity?

A

Some constraints are:
* Data-type constraints: Values in a particular column must be of a particular datatype, e.g., boolean, numeric, date, etc.
* Range constraints: Numbers or dates should fall within a certain range.
* Mandatory constraints: Certain columns cannot be empty.
* Unique constraints: A field, or a combination of fields, must be unique across a dataset.
* Set-membership constraints: Values of a column come from a set of discrete values, e.g., enum values.
* Foreign-key constraints: As in relational databases, a foreign-key column can't have a value that does not exist in the referenced primary key.
* Cross-field validation: Certain conditions that span multiple fields must hold. For example, a patient's date of discharge from the hospital cannot be earlier than the date of admission.
* Regular expression patterns: Text fields must match a certain pattern. For example, phone numbers may be required to have the pattern (999) 999-9999.
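A minimal pandas sketch checking a few of these constraints on a hypothetical patient table (column names and rules are invented for illustration):

```python
import pandas as pd

patients = pd.DataFrame({
    "age": [34, -2, 51],          # range constraint: 0 <= age <= 120
    "name": ["Ada", None, "Bo"],  # mandatory constraint: cannot be empty
    "admission": pd.to_datetime(["2024-01-02", "2024-01-05", "2024-01-10"]),
    "discharge": pd.to_datetime(["2024-01-04", "2024-01-03", "2024-01-12"]),
})

range_ok = patients["age"].between(0, 120)
mandatory_ok = patients["name"].notna()
cross_field_ok = patients["discharge"] >= patients["admission"]  # cross-field validation

violations = patients[~(range_ok & mandatory_ok & cross_field_ok)]
print(violations)  # rows breaking at least one constraint
```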

30
Q

What are the types of missing data?

A

* Structurally missing data
* Missing completely at random (MCAR)
* Missing at random (MAR)
* Missing not at random (MNAR)
31
Q

What is structurally missing data?

And how can you deal with it?

A

Structurally missing data is data that is missing for a logical reason.
Solutions:
* Exclude from the analysis the variables with the structurally missing values.
* If we can logically deduce the correct value, use that value in place of the missing values in our analysis (imputation).

Imputation is the assignment of a value to something by inference from the values of the products or processes to which it contributes.

32
Q

What is data missing completely at random?

MCAR = missing completely at random

A

It is relatively easy to check the assumption that data is missing completely at random. If you can predict which units have missing data (e.g., using common sense, regression, or some other method), then the data is not MCAR.
A more formal way of testing is to use Little's MCAR test.
Solution: perform the analysis using only the observations that have complete data (provided we have enough of such observations).

N.B. The MCAR assumption is rarely a good assumption. It is only likely to be true when the data is missing due to some truly random phenomenon (e.g., if people were randomly asked 10 of 15 questions in a questionnaire).

33
Q

What is data missing at random (MAR)?

A

In the case of missing completely at random, the assumption was that there was no pattern. An alternative assumption, known somewhat confusingly as missing at random (MAR), instead assumes that whether a value is missing can be predicted from the other observed data (the missingness depends on the observed data, not on the missing value itself).
Solution: When data is missing at random, we need to either use an advanced imputation method, such as multiple imputation, or an analysis method specifically designed for missing-at-random data.

34
Q

What is data missing not at random (MNAR)?

A

Data is missing not at random (MNAR) when the probability of a value being missing depends on the unobserved value itself.
Solution: When data is missing not at random, we cannot use any of the standard methods for dealing with missing data (e.g., imputation, or algorithms specifically designed for missing values). If the data is missing not at random, standard calculations give the wrong answer.

35
Q

What is diagnostic analytics?

A

Diagnostic analytics aims to look at a phenomenon and understand the reasons behind it and the relationships between the observed variables.
In this course, the focus is on analysing the relationships among variables:
* **Correlation analysis** (~ Is there a relationship between a pair of variables?)
* **Regression analysis** (~ Is there a relationship between one (dependent) variable and one or more (independent) variables?)

36
Q

We are now talking about diagnostic analytics.

What is feature engineering and why is it applied?

Diagnostic analytics aims to look at a phenomenon and understand “why did it happen”?

A

Feature engineering is *transforming and combining* existing data to provide new information and extract more information from the dataset.
This is done because the raw data is sometimes not sufficient to provide all the necessary information.

37
Q

How can you create more data?

A

You can create more data through:
* Feature extraction
* Feature combination

38
Q

How can you transform existing data?

A

Depending on whether the data is missing or whether it has to be scaled, we need to use different techniques.
For Missing data :
* Numerical imputation
* Categorical imputation
For Scaling:
* Log-transform
* Normalization
* Standardization
Otherwise, we can work with outlier removal.

39
Q

What is numerical imputation?

A

Numerical imputation is about "filling the gaps": transforming a variable to remove missing values.
Imputation techniques include:
* Replacement with statistically neutral values (mean, median)
* Replacement with null values
* Replacement with model-generated values
* Exclusion of missing values
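A minimal pandas sketch of these options (the variable is made up; model-generated replacement would require an additional model and is not shown):

```python
import pandas as pd

income = pd.Series([42_000, None, 55_000, 61_000, None])

mean_filled = income.fillna(income.mean())      # replacement with the mean
median_filled = income.fillna(income.median())  # replacement with the median
zero_filled = income.fillna(0)                  # replacement with a null value
dropped = income.dropna()                       # exclusion of missing values
```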

40
Q

What is categorical imputation?

A

It is also about "filling the gaps": transforming a variable to remove missing values.
A common technique is hot-deck imputation: replace the missing value with the most similar value from the existing dataset.
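A simplified hot-deck sketch in pandas, assuming "most similar" means a donor row from the same group (the region/city columns are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "city": ["Oslo", None, "Rome", None],
})

def hotdeck_fill(s: pd.Series) -> pd.Series:
    """Fill missing values with the most frequent value among the group's donors."""
    donors = s.dropna()
    return s.fillna(donors.mode().iloc[0]) if not donors.empty else s

df["city"] = df.groupby("region")["city"].transform(hotdeck_fill)
```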

41
Q

What is log-transformation?

A

It is taking the logarithm of a given feature.
It allows turning a skewed distribution into a more or less normal one.
It is a useful feature transformation technique for tools that make an assumption of normality.
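A minimal numpy sketch (log1p, i.e. log(1 + x), is used so that zero values do not cause problems; the data is made up):

```python
import numpy as np

skewed = np.array([1, 2, 2, 3, 5, 8, 50, 400])  # right-skewed values
log_transformed = np.log1p(skewed)              # much less skewed, closer to normal
```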

42
Q

What is standardisation?

A

Standardising is rescaling a dataset so that its mean is 0 and its standard deviation is 1.
There is no specific upper or lower bound for the maximum and minimum values.
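A minimal numpy sketch (using numpy's default population standard deviation; the values are made up):

```python
import numpy as np

x = np.array([10.0, 12.0, 15.0, 20.0, 23.0])
standardized = (x - x.mean()) / x.std()  # result has mean 0 and standard deviation 1
```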

43
Q

What is normalisation?

A

It is rescaling a variable so that the range of the variable is [0,1]
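A minimal min-max scaling sketch (assuming the variable is not constant, otherwise the denominator would be zero):

```python
import numpy as np

x = np.array([10.0, 12.0, 15.0, 20.0, 23.0])
normalized = (x - x.min()) / (x.max() - x.min())  # rescaled to the range [0, 1]
```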

44
Q

What are outliers?

A

Outliers are values that are situated in the tails of the distribution, outside the “normal” range for the values.
Can be caused by different reasons:
* Measurement errors
* Statistically relevant anomalies (rare events)
* Missing values
Can be tackled by
* Removal
* Imputation
* Capping
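One common (but not the only) way to detect and then remove or cap outliers is the interquartile-range rule, sketched below on made-up data:

```python
import numpy as np

x = np.array([12, 13, 14, 15, 15, 16, 17, 18, 95])  # 95 looks suspicious

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # "normal" range under the IQR rule

removed = x[(x >= lower) & (x <= upper)]  # removal: drop values outside the range
capped = np.clip(x, lower, upper)         # capping: clip values to the range bounds
```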

45
Q

How does outlier removal work?

A

The outlier removal process goes through a few steps:
1. Visual inspection/EDA
2. Human-in-the-loop procedure to assess validity of outlier
3. Outlier/anomaly removal:
A. Via training of a dedicated algorithm
B. Via human inspection

46
Q

What is feature extraction?

A

Feature extraction is the process of creating (extracting) new variables from the existing ones.
It includes:
* Creating derived variables (e.g. aggregations, time since)
* Discretizing existing variables
* Encoding variables in different formats (e.g. one-hot)
* Applying transformations on existing variables (logarithm, exponentiation, standardization, normalization)
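A minimal pandas sketch of a few of these operations (column names and the reference date are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "signup": pd.to_datetime(["2023-01-01", "2024-06-15"]),
    "age": [23, 67],
    "country": ["IT", "DE"],
})

# Derived variable: time since signup, in days, relative to a fixed reference date
df["days_since_signup"] = (pd.Timestamp("2025-01-01") - df["signup"]).dt.days

# Discretization: turn a numerical variable into categories
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 60, 120], labels=["young", "adult", "senior"])

# One-hot encoding of a categorical variable
df = pd.get_dummies(df, columns=["country"])
```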

47
Q

What is feature combination?

A

Feature combination is the process of creating (extracting) new variables by combining existing ones.
It includes:
* Applying operations (sums, counts, ratios) among numerical variables
* Polynomial/spline/trigonometric combinations
* Combining categorical variables (e.g. outer product)
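A minimal pandas sketch (the variables are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"revenue": [100.0, 250.0], "n_orders": [4, 10],
                   "height": [1.70, 1.80], "weight": [65, 90]})

df["revenue_per_order"] = df["revenue"] / df["n_orders"]  # ratio of two numerical variables
df["bmi"] = df["weight"] / df["height"] ** 2              # nonlinear combination of variables
df["orders_squared"] = df["n_orders"] ** 2                # simple polynomial term
```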

48
Q

What is causation?

A

Causation is when one thing (a cause) causes another thing to happen (an effect).

49
Q

What is correlation?

A

Correlation is when two or more things appear to be related.

Correlation doesn’t always mean causation!

50
Q

What is the difference between correlational and experimental research?

A

Correlational (C) and experimental (E) research both use quantitative methods to investigate relationships between variables, but there are important differences in the data collection methods and the types of conclusions you can draw. They differ in:
* Purpose:
  * C - used to test the strength of association between variables
  * E - used to test cause-and-effect relationships between variables
* Variables:
  * C - variables are only observed, with no manipulation or intervention by researchers
  * E - an independent variable is manipulated and a dependent variable is observed
* Control:
  * C - limited control is used, so other variables may play a role in the relationship
  * E - extraneous variables are controlled so that they can't impact your variables of interest
* Validity:
  * C - high external validity: you can confidently generalize your conclusions to other populations or settings
  * E - high internal validity: you can confidently draw conclusions about causation

51
Q

What is correlation analysis?

A

Using a correlation analysis, you can summarize the relationship between variables into a correlation coefficient: a single number that describes the strength and direction of the relationship between variables.
With this number, you’ll quantify the degree of the relationship between variables.
The Pearson product-moment correlation coefficient, also known as Pearson’s r, is commonly used for assessing a linear relationship between two quantitative variables.
Correlation coefficients are usually found for two variables at a time, but you can use a multiple correlation coefficient for three or more variables.
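A minimal sketch computing Pearson's r with scipy (the data is made up):

```python
from scipy.stats import pearsonr

hours_studied = [1, 2, 3, 4, 5, 6]
exam_score = [52, 55, 61, 64, 70, 74]

r, p_value = pearsonr(hours_studied, exam_score)
print(f"Pearson's r = {r:.2f}, p-value = {p_value:.4f}")  # r near +1: strong positive linear relationship
```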

52
Q

What is regression analysis?

A

Simple linear regression is a parametric test, meaning that it makes certain assumptions about the data. These assumptions are:
Homogeneity of variance (homoscedasticity): the size of the error in our prediction doesn’t change significantly across the values of the independent variable.
Independence of observations: the observations in the dataset were collected using statistically valid sampling methods, and there are no hidden relationships among observations.
Normality: The data follows a normal distribution.
Simple linear regression is a parametric test (with all the limitations of statistical testing):
* Statistical power/sample size
* Dependency on assumptions (if assumptions are violated, conclusions are unreliable)
When all the assumptions are verified, conclusions can be drawn only on the observed data sample, not extrapolated to the entire population.

The formula for a simple linear regression is:
y = β0 + β1·X + ε
where β0 is the intercept, β1 the slope coefficient, and ε the error term.
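A minimal sketch fitting a simple linear regression with scipy on made-up data (the intercept estimates β0 and the slope estimates β1):

```python
from scipy.stats import linregress

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 4.3, 5.9, 8.2, 9.8, 12.1]

fit = linregress(x, y)
print(fit.intercept, fit.slope)  # estimates of beta_0 and beta_1
print(fit.rvalue ** 2)           # R^2: share of variance explained by the model
```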

53
Q

How can you characterise distributions?

A

Several statistics can be computed about a probability distribution:
* Mean
* Mode
* Median
* Standard deviation/Variance
* Skewness
* Kurtosis
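A minimal sketch computing these statistics in Python (made-up data; scipy provides skewness and kurtosis):

```python
import numpy as np
from scipy import stats

x = np.array([2, 3, 3, 4, 5, 5, 5, 6, 9])

print(np.mean(x), np.median(x), np.std(x), np.var(x))  # central tendency and spread

values, counts = np.unique(x, return_counts=True)
print(values[np.argmax(counts)])         # mode: the most frequent value (5)

print(stats.skew(x), stats.kurtosis(x))  # asymmetry and tailedness of the distribution
```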

54
Q

What are the steps for statistical testing?

A
  1. Experimental design and Hypothesis definition (Null hypothesis and alternative hypothesis)
  2. Data collection (Probability sampling, Selective sampling, Sample size determination)
  3. Data summarization (Central tendency measure, Variability measure, Test statistic)
  4. Hypothesis testing
  5. Result interpretation (One-sample test, Paired/Unpaired samples test (Two-sample), One-tailed/Two-tailed test (Directional vs Non-directional), Correlation testing, Regression testing)
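As an illustration of steps 3-5, a minimal unpaired two-sample t-test with scipy (the groups and values are made up):

```python
from scipy.stats import ttest_ind

group_a = [12.1, 13.4, 11.8, 12.9, 13.1]  # e.g. a control group
group_b = [14.2, 15.1, 13.8, 14.9, 15.4]  # e.g. a treatment group

t_stat, p_value = ttest_ind(group_a, group_b)  # unpaired (two-sample), two-tailed test
print(t_stat, p_value)  # reject the null hypothesis if p_value < the chosen alpha (e.g. 0.05)
```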