Chapter 3 - Data Exploration Flashcards

1
Q

What is in a data quality report?

A

A tabular report that describes the characteristics of each feature in an ABT (analytics base table) using standard measures of central tendency and variation
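A minimal sketch of how such a report might be assembled, assuming pandas and a hypothetical ABT file abt.csv:

```python
import pandas as pd

abt = pd.read_csv("abt.csv")  # hypothetical ABT

# Continuous features: count, mean, sd, min, quartiles, max,
# plus % missing and cardinality
cont = abt.select_dtypes(include="number")
report = cont.describe().T
report["% missing"] = cont.isna().mean() * 100
report["cardinality"] = cont.nunique()
print(report)
```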

2
Q

Which data visualizations accompany tabular reports?

A
  • A histogram for each continuous feature, to which we can apply quantitative scales
  • A bar plot for each categorical feature, to which we can apply qualitative scales
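For illustration, a sketch using pandas plotting; the DataFrame abt and the columns age and marital_status are hypothetical:

```python
import matplotlib.pyplot as plt

abt["age"].plot(kind="hist", bins=20)                  # continuous feature
plt.show()
abt["marital_status"].value_counts().plot(kind="bar")  # categorical feature
plt.show()
```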
3
Q

What are frequency histograms?

A

They consist of a set of rectangles (bars) whose heights reflect the counts or frequencies of the classes present in the given data

4
Q

How do you get to know categorical features?

A
  • Examine the mode, 2nd mode, mode %, and 2nd mode %
  • These tell us the most common levels within the feature and will identify whether any levels dominate the dataset
5
Q

How do you get to know continuous features?

A
  • Examine the mean and standard deviation
  • These give us a sense of the central tendency and variation of the values within the dataset for the feature
  • Examine the minimum and maximum (and the quartiles in between)
  • These help us understand the range that is possible for each feature
6
Q

What are the types of histogram characteristics?

A
  • Uniform
  • Normal (unimodal)
  • Unimodal (skewed right)
  • Unimodal (skewed left)
  • Exponential
  • Multimodal
7
Q

What is a uniform distribution?

A
  • It indicates that a feature is equally likely to take a value in any of the ranges present
  • Sometimes it shows that a descriptive feature contains an ID rather than a measure of something more interesting
8
Q

What is Normal (unimodal) distribution?

A
  • It has a strong tendency toward a central value and symmetrical variation on either side of that central tendency
  • It is called unimodal because of the single peak around the central tendency
  • Many naturally occurring phenomena follow a normal distribution
9
Q

What is the skew when the data contains some very high values?

A

Right skew (positive skew)

10
Q

What is the skew when the data contains some very low values?

A

Left skew (negative skew)

11
Q

What is the mode and median relationship during skews?

A
  • Right skewed: mode < median
  • Left skewed: mode > median
12
Q

What is exponential distribution?

A
  • The likelihood of low values occurring is very high but diminishes rapidly for higher values
  • It is a clear warning sign that outliers are likely
13
Q

What is multimodal distribution?

A
  • It has two or more very commonly occurring ranges of values that are clearly separated
  • A bimodal distribution can be thought of as two normal distributions pushed together
  • It occurs when a feature contains measurements made across a number of distinct groups
14
Q

Why is a multimodal distribution a cause for both caution and optimism?

A
  • Caution because measures of central tendency and variation tend to break down for multimodal data
  • Optimism because if we are lucky, the separate peaks in the distribution will be associated with the different target levels we are trying to predict
15
Q

What happens when a distribution has different means but identical standard deviations?

A

The curves keep the same shape but are shifted left or right along the horizontal axis

16
Q

What happens when a distribution has identical means but different standard deviations?

A

The curves stay centered at the same point but change in height and spread: a larger standard deviation gives a flatter, wider curve

17
Q

What does the 68-95-99.7 rule state?

A
  • 68% of observations will fall within one standard deviation of the mean (mean ± sd)
  • 95% of observations will fall within two standard deviations of the mean (mean ± 2sd)
  • 99.7% of observations will fall within three standard deviations of the mean (mean ± 3sd)
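The rule is easy to verify empirically; a small NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
for k in (1, 2, 3):
    share = np.mean(np.abs(x - x.mean()) <= k * x.std())
    print(f"within {k} sd: {share:.3f}")  # ~0.683, ~0.954, ~0.997
```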
18
Q

What is a data quality issue?

A

It is loosely defined as anything unusual about the data in an ABT

19
Q

What are the most common data quality issues?

A
  • Missing values
  • Irregular cardinality
  • Outliers
20
Q

What are the data quality issues we identify from a data quality report?

A
  • Issues due to invalid data (e.g., syntax errors)
  • Issues due to valid data (e.g., human error during data entry)
21
Q

How do you handle missing values?

A
  • Approach 1: Drop any features that have missing values
  • Approach 2: Apply complete case analysis (delete records)
  • Approach 3: Derive a missing indicator feature from features with missing values
  • Approach 4: Impute the missing values
22
Q

What are the effects of approach 1?

A
  • It can result in a massive, and frequently needless, loss of data
  • Only features missing in excess of 60% of their values should be considered for removal
  • An alternative is to derive a missing indicator feature for them, which could be categorical
23
Q

What is approach 2?

A
  • We delete instances that are missing one or more feature values
  • This results in significant amounts of data loss and can introduce bias into the dataset
  • It should rarely be used, and only when an instance is missing values for multiple features
  • It is, however, recommended for removing instances that are missing the value of the target feature
24
Q

Explain approach 3

A
  • This could be a categorical feature that flags the missing data as a new level (e.g., unknown in marital status)
  • Or a binary feature that flags whether the value was present or missing (T/F)
  • When missing indicator features are used, the original feature is usually discarded
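A sketch of both variants in pandas; abt and the feature names are hypothetical:

```python
# abt: hypothetical pandas DataFrame
# Binary indicator: was the value present or missing?
abt["income_missing"] = abt["income"].isna()

# Categorical flag: missing becomes its own level
abt["marital_status"] = abt["marital_status"].fillna("unknown")

# The original feature is usually discarded once an indicator exists
abt = abt.drop(columns=["income"])
```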
25
Q

Explain approach 4

A
  • Imputation replaces missing feature values with a plausible estimated value based on the feature values that are present
  • Most commonly you replace missing values with a measure of central tendency of that feature
  • We should be reluctant to use imputation on features missing in excess of 30% of their values
  • We strongly recommend against using imputation on features missing in excess of 50% of their values
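A sketch of central-tendency imputation in pandas; the feature names are hypothetical:

```python
# Continuous feature: impute with the median (robust to outliers)
abt["income"] = abt["income"].fillna(abt["income"].median())

# Categorical feature: impute with the mode
abt["occupation"] = abt["occupation"].fillna(abt["occupation"].mode()[0])
```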
26
Q

What is the easiest way to handle outliers?

A

Use the clamp transformation

27
Q

What does clamp transformation do?

A

It clamps all values above an upper threshold and below a lower threshold to those threshold values, thus removing the offending outliers
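In pandas/NumPy this is the clip operation; a sketch using the common 1.5 × IQR fences as thresholds (feature name hypothetical):

```python
q1, q3 = abt["income"].quantile([0.25, 0.75])
iqr = q3 - q1
abt["income"] = abt["income"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
```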

28
Q

How do you identify outliers?

A

Three popular rules of thumb:
- Values more than 1.5 × IQR above Q3 or below Q1; computing the quartiles requires sorting, so this is expensive
- Values more than 2 standard deviations from the mean; both the mean and sd can usually be computed cheaply
- The top and bottom 2% of the ordered data; trivial to implement but hard to defend as a scientific heuristic
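The three rules of thumb, sketched in pandas/NumPy on a hypothetical feature:

```python
import numpy as np

x = abt["income"]  # hypothetical continuous feature

# Rule 1: 1.5 * IQR fences
q1, q3 = x.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_out = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# Rule 2: more than 2 standard deviations from the mean
sd_out = (x - x.mean()).abs() > 2 * x.std()

# Rule 3: top and bottom 2% of the ordered data
pct_out = (x < x.quantile(0.02)) | (x > x.quantile(0.98))
```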

29
Q

What is a scatter plot?

A
  • It is based on two axes: a horizontal axis for one feature and a vertical axis for the other
  • Each instance is represented by a point on the plot, positioned by that instance's values for the two features involved
30
Q

What is a scatter plot matrix?

A
  • Scatter plot matrix (SPLOM) shows scatter plots for a whole collection of features arranged into a matrix
  • It is useful for exploring the relationship between groups of features
  • It is a visualization of the correlation matrix
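A sketch using pandas' built-in scatter-matrix helper on a hypothetical abt:

```python
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

scatter_matrix(abt.select_dtypes(include="number"), figsize=(8, 8))
plt.show()
```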
31
Q

What else can we do in addition to visually inspecting scatter plots?

A

Calculate formal measures of the relationship between two continuous features using covariance and correlation

32
Q

What range does covariance fall into?

A
  • (-∞, +∞)
  • Negative values indicate a negative relationship
  • Positive values indicate a positive relationship
  • Values near zero indicate that there is little to no relationship
33
Q

What is correlation?

A
  • A normalized form of covariance
  • Range: [-1, +1]
  • Correlation values close to -1 indicate a strong negative correlation
  • Values close to +1 indicate a strong positive correlation
  • Values around 0 indicate no correlation
  • Features that have no correlation have no linear relationship, though they may still be dependent in other ways
34
Q

What tools are useful for exploring relationships between multiple continuous features?

A
  • Covariance matrix
  • Correlation matrix (normalized version of covariance matrix)
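A sketch of both the pairwise measures and the full matrices; the feature names are hypothetical:

```python
import numpy as np

cov_xy = np.cov(abt["age"], abt["income"])[0, 1]        # covariance
corr_xy = np.corrcoef(abt["age"], abt["income"])[0, 1]  # Pearson correlation

num = abt.select_dtypes(include="number")
cov_matrix = num.cov()    # covariance matrix
corr_matrix = num.corr()  # correlation matrix (normalized covariances)
```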
35
Q

What are the three types of measures for correlation tests?

A
  • Distributive
  • Algebraic
  • Holistic (non-algebraic)
36
Q

Explain the distributive measure

A
  • A measure is distributive if the result derived by applying the function to n aggregate values (one per data partition) is the same as the result derived by applying the function to all the data without partitioning
  • Examples: count(), sum(), min(), max()
37
Q

Explain algebraic measure

A
  • A measure is algebraic if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function
  • Examples: average(), standardDeviation(), top-k() (the k largest values), centerOfMass()
38
Q

Explain holistic (non-algebraic measure)

A
  • A measure is holistic if there is no constant bound on the storage size needed to describe a sub-aggregate
  • Examples: median(), rank()
  • Sometimes we may be able to approximate holistic measures using non-holistic ones
39
Q

How does correlation help with measuring relationships between features?

A
  • Correlation is a good measure of the linear relationship/dependency between two continuous features, but it is by no means perfect
  • If we focus only on correlation values without visualizing the data, we may miss other dependencies between our features
  • Correlation does not necessarily imply causation
40
Q

What are the reasons for mistakenly assumed causation?

A
  • Mistaking the order of a causal relationship
  • For example, assuming that playing basketball makes people tall, when people actually choose to play basketball because of their height
  • Unawareness of a third (confounding) feature
  • A third feature influences the other two, making them appear highly correlated
  • For example, assuming that drinking coffee causes cardiovascular disease, when smoking actually causes it and most smokers happen to be coffee drinkers
41
Q

What is the simplest way to visualize the relationship between two categorical variables?

A

Using a collection of small multiple bar plots

42
Q

What situation can you use stacked bar plots as an alternative to the small multiples approach?

A

If the number of levels of one of the features being compared is no more than three

43
Q

What do you use to visualize the relationship between a continuous feature and categorical feature?

A

A small multiples approach that draws a histogram of the values of the continuous feature for each level of the categorical feature

44
Q

What other way can you visualize the relationship between categorical and continuous features?

A

Using a box plot

45
Q

What does a boxplot do?

A
  • It lets us visually show the five-number summary of a distribution
  • Trimmed minimum, Q1, Median, Q3, Trimmed maximum
46
Q

What is a violin plot?

A
  • A combination of a box plot and density plot
  • It provides even more detailed information about our data distribution
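A sketch of both plots using seaborn; the feature names are hypothetical:

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(data=abt, x="marital_status", y="age")     # box plot per level
plt.show()
sns.violinplot(data=abt, x="marital_status", y="age")  # violin plot per level
plt.show()
```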
47
Q

What do data preparation techniques do?

A

They change the way data is represented in order to make it more compatible with certain machine learning algorithms

48
Q

What are the data preparation techniques?

A
  • Normalization (feature scaling)
  • Binning
  • Sampling
49
Q

Explain normalization

A
  • These techniques can be used to change a continuous feature to fall within a specified range while maintaining the relative differences between the values for the feature
50
Q

Why must you remember to scale back when presenting your findings?

A

So that findings are reported in the original units; this requires retaining all information about the normalization procedure so it can be reversed

51
Q

Why is normalization done?

A
  • To ensure that no feature with a larger range misleads our ML algorithm
  • To safely fit the original data into a range desired by our computer/ML software
52
Q

What are the common types of data normalization?

A
  • Min-Max normalization (Range normalization)
  • Z-score normalization (Standardization)
  • Decimal scaling
53
Q

Explain Min-Max normalization

A
  • It is used to linearly convert all feature values into the new range
  • It allows us to maintain all relative differences between the values for the feature
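A sketch of the transformation; the target range defaults to [0, 1] here, and the feature name is hypothetical:

```python
def min_max(col, low=0.0, high=1.0):
    # Linearly map col into [low, high], preserving relative differences
    return low + (col - col.min()) * (high - low) / (col.max() - col.min())

abt["age_norm"] = min_max(abt["age"])
```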
54
Q

What is the benefit of min-max normalization?

A
  • It is convenient and intuitive
  • It lets us fit the data into any range we want easily while the original proportions between data points are guaranteed to be perfectly preserved
55
Q

What is the drawback of range normalization?

A
  • It is sensitive to the presence of outliers in a dataset
  • It risks out-of-bound errors when new entries fall outside the original minimum and maximum
56
Q

Explain Z-score normalization

A
  • It is the process of transforming data to standard scores
  • A standard score measures how many standard deviations a feature value is from the mean for that feature
57
Q

How does z-score normalization work?

A
  • It squashes the values of the feature so that the feature values have a mean of 0 and standard deviation of 1
  • This results in the majority (68%) of the feature values being in a range of [-1, +1]
  • If the data in all features follow a Gaussian/Normal distribution they will be converted to standard normal and comparisons between different features become easier.
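A sketch of the standard-score computation on a hypothetical feature:

```python
# (value - mean) / sd gives mean 0 and standard deviation 1
abt["age_std"] = (abt["age"] - abt["age"].mean()) / abt["age"].std()
```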
58
Q

Compare Min-Max and Z-score normalization

A
  • It is easy to specify the output range for min-max normalization (often [0, 1], [0.1, 0.9], or [1, 10] to remain positive)
  • It is more of a challenge to guarantee an exact range from Z-score normalization
  • Standardization is slightly more resistant than min-max to out-of-bound errors
  • Standardization will introduce distortions if the data is not normally distributed
59
Q

What is robust scaling?

A
  • A scaling method that is less affected by outliers
  • It is useful because outliers affect the sample mean and standard deviation very significantly
  • It removes the median and scales by the IQR
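A sketch, similar in spirit to scikit-learn's RobustScaler; the feature name is hypothetical:

```python
x = abt["income"]
q1, q3 = x.quantile([0.25, 0.75])
abt["income_robust"] = (x - x.median()) / (q3 - q1)  # remove median, scale by IQR
```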
60
Q

Explain Normalization by Decimal Scaling

A
  • It transforms data by moving the decimal point of the feature values
  • The number of places the decimal point moves depends on the maximum absolute value of the feature
  • It is trivial to implement
  • It is highly out-of-bounds resistant
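A sketch, choosing j as the smallest power of ten that brings every absolute value below 1; the feature name is hypothetical:

```python
import numpy as np

x = abt["income"]
j = int(np.floor(np.log10(x.abs().max()))) + 1
abt["income_scaled"] = x / 10**j  # all values now fall in (-1, 1)
```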
61
Q

When is Decimal Scaling used?

A
  • In the signal-processing community, when we want to separate noise from data easily
  • Recommended for dealing with left-skewed unimodal or exponential data distributions
62
Q

Explain binning

A

This involves converting a continuous feature into an ordinal feature

63
Q

How do you perform binning?

A

Define a series of ranges, called bins, for the continuous feature that correspond to the levels of the new categorical feature we are creating

64
Q

What are the popular ways of defining bins?

A
  • Equal-width binning
  • Equal-frequency binning
65
Q

What are the trade-off for deciding on the number of bins?

A
  • If we set the number of bins too low, we may lose a lot of information
  • If the number of bins is too high, we may have very few instances in each bin or even empty bins
  • Even so, the latter is usually preferable to losing information
66
Q

Explain equal-width binning

A
  • It splits the range of the feature values into b bins, each of size range/b
  • We store the bucket boundaries and the sum of the frequencies of the values within each bucket
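In pandas this is pd.cut; a minimal sketch with b = 4 bins on a hypothetical feature:

```python
import pandas as pd

abt["age_bin"] = pd.cut(abt["age"], bins=4)  # 4 bins of equal width
```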
67
Q

Explain the construction of equal-width binning

A
  • Use one pass over the ABT to construct an accurate equal-width histogram
  • Keep a running count for each bucket
  • If a full scan is not acceptable, use sampling
  • Construct a histogram on an ABT sample and scale the frequencies by |ABT|/|ABT sample|
68
Q

Explain the maintenance of equal-width binning

A
  • Incremental maintenance: for each update in ABT increment/decrement the corresponding bucket frequencies
  • Periodical re-computation: because distribution changes slowly
69
Q

Explain equal-frequency binning

A
  • It first sorts the feature values in ascending order and then places an equal number of instances into each bin starting with bin 1
  • The number of instances in each bin is (total number of instances)/ (number of bins)
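In pandas this is pd.qcut; a minimal sketch with 4 bins on a hypothetical feature:

```python
import pandas as pd

abt["age_bin"] = pd.qcut(abt["age"], q=4)  # roughly n/4 instances per bin
```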
70
Q

Explain the construction of equal-frequency binning

A
  • Sort all data entries, then take equally spaced slices
  • Sampling also works
71
Q

Explain the maintenance of equal-frequency binning

A
  • Incremental maintenance
    • Merge adjacent buckets with small counts
    • Split any bucket with large count
      • Select the median value to split
      • Need a sample of the values within this bucket to work well
  • Periodic re-computation also works
72
Q

Explain sampling

A
  • It is used when the dataset is so large that we do not use all the data available to us in an ABT
  • We must be careful to ensure that the resulting datasets are still representative of the original dataset and that no unintended bias is introduced in the process
73
Q

What are the common forms of sampling?

A
  • Top sampling
  • Random sampling
  • Stratified sampling
  • Under-sampling
  • Over-sampling
74
Q

Explain top sampling

A
  • Selects the top s% of instances from the dataset to create a sample
  • Runs the risk of introducing bias, since the sample will be affected by any ordering of the original dataset
  • It is recommended to avoid top sampling
75
Q

Explain random sampling

A
  • Randomly selects s% of the instances from the large dataset to create a smaller set
  • It is a good choice in most cases, since the randomness should avoid introducing bias
76
Q

Explain stratified sampling

A
  • It ensures that the relative frequencies of the levels of a specific stratification feature are maintained in the sampled dataset
77
Q

How do you perform stratified sampling?

A
  • Divide the instances in the dataset into groups (strata)
  • Each group contains only instances that have a particular level for the stratification feature
  • s% of the instances in each stratum are randomly selected
  • These selections are combined to give an overall sample of s% of the original dataset
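A sketch in pandas, stratifying on a hypothetical feature and sampling 10% of each stratum:

```python
sample = abt.groupby("marital_status").sample(frac=0.10, random_state=42)
```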
78
Q

When is over-sampling and under-sampling used?

A

When we would like a sample to contain different relative frequencies of the levels of a particular feature from those in the original dataset

79
Q

Explain under-sampling

A
  • Begins by dividing a dataset into groups where each group contains only instances that have a particular level for the feature to be under-sampled
  • The number of instances in the smallest group is the under-sampling target size
  • Each group that has more instances than the smallest is randomly sampled by the appropriate percentage to create a subset that is the under-sampling target size
  • The under-sampled groups are combined to create the overall under-sampled dataset
80
Q

Explain over-sampling

A
  • It addresses the same issue as under-sampling but in the opposite way
  • After dividing the dataset into groups, the number of instances in the largest group becomes the over-sampling target size
  • From each smaller group, we create a sample containing that number of instances using random sampling with replacement
  • These larger samples are combined to form the overall over-sampled dataset
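A sketch of both under- and over-sampling in pandas, assuming a hypothetical target feature to balance on:

```python
import pandas as pd

groups = abt.groupby("target")

# Under-sampling: randomly shrink every group to the size of the smallest
under = groups.sample(n=groups.size().min(), random_state=42)

# Over-sampling: grow each smaller group to the size of the largest
# using random sampling with replacement; the largest group is kept as-is
n_max = groups.size().max()
over = pd.concat(
    g if len(g) == n_max else g.sample(n=n_max, replace=True, random_state=42)
    for _, g in groups
)
```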