Chapter 3 - Data Exploration Flashcards

1
Q

What is in a data quality report?

A

A tabular report that describes the characteristics of each feature in an ABT (analytics base table) using standard measures of central tendency and variation
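A minimal sketch of how such a report might be assembled, assuming pandas and a hypothetical ABT file abt.csv:

```python
import pandas as pd

abt = pd.read_csv("abt.csv")  # hypothetical ABT

# Continuous features: count, mean, sd, min, quartiles, max,
# plus % missing and cardinality
cont = abt.select_dtypes(include="number")
report = cont.describe().T
report["% missing"] = cont.isna().mean() * 100
report["cardinality"] = cont.nunique()
print(report)
```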

2
Q

Which data visualizations accompany tabular reports?

A
  • A histogram for each continuous feature, to which we can apply quantitative scales
  • A bar plot for each categorical feature, to which we can apply qualitative scales
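For illustration, a sketch using pandas plotting; the DataFrame abt and the columns age and marital_status are hypothetical:

```python
import matplotlib.pyplot as plt

abt["age"].plot(kind="hist", bins=20)                  # continuous feature
plt.show()
abt["marital_status"].value_counts().plot(kind="bar")  # categorical feature
plt.show()
```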
3
Q

What are frequency histograms?

A

They consist of a set of rectangles (bars) whose heights reflect the counts or frequencies of the classes present in the given data

4
Q

How do you get to know categorical features?

A
  • Examine the mode, 2nd mode, mode %, and 2nd mode %
  • These tell us the most common levels within the feature and will identify whether any levels dominate the dataset
5
Q

How do you get to know continuous features?

A
  • Examine the mean and standard deviation
  • These give us a sense of the central tendency and variation of the values within the dataset for the feature
  • Examine the minimum and maximum (and the quartiles in between)
  • These help us understand the range that is possible for each feature
6
Q

What are the types of histogram characteristics?

A
  • Uniform
  • Normal (unimodal)
  • Unimodal (skewed right)
  • Unimodal (skewed left)
  • Exponential
  • Multimodal
7
Q

What is a uniform distribution?

A
  • It indicates that a feature is equally likely to take a value in any of the ranges present
  • Sometimes it shows that a descriptive feature contains an ID rather than a measure of something more interesting
8
Q

What is Normal (unimodal) distribution?

A
  • It has a strong tendency toward a central value and symmetrical variation on either side of that central tendency
  • It is called unimodal because of the single peak around the central tendency
  • Many naturally occurring phenomena follow a normal distribution
9
Q

What is the skew when the data contains some very high values?

A

Right skew (positive skew)

10
Q

What is the skew when the data contains some very low values?

A

Left skew (negative skew)

11
Q

What is the mode and median relationship during skews?

A
  • Right skewed: mode < median
  • Left skewed: mode > median
12
Q

What is exponential distribution?

A
  • The likelihood of low values occurring is very high but diminishes rapidly for higher values
  • It is a clear warning sign that outliers are likely
13
Q

What is multimodal distribution?

A
  • It has two or more very commonly occurring ranges of values that are clearly separated
  • A bimodal distribution can be thought of as two normal distributions pushed together
  • It occurs when a feature contains measurements made across a number of distinct groups
14
Q

Why is a multimodal distribution a cause for both caution and optimism?

A
  • Caution because measures of central tendency and variation tend to break down for multimodal data
  • Optimism because if we are lucky, the separate peaks in the distribution will be associated with the different target levels we are trying to predict
15
Q

What happens when a distribution has different means but identical standard deviations?

A

The curves keep the same shape but are shifted left or right along the horizontal axis

16
Q

What happens when a distribution has identical means but different standard deviations?

A

The curves stay centered at the same point but change in height and spread: a larger standard deviation gives a flatter, wider curve

17
Q

What does the 68-95-99.7 rule state?

A
  • 68% of observations will fall within one standard deviation of the mean (mean ± sd)
  • 95% of observations will fall within two standard deviations of the mean (mean ± 2sd)
  • 99.7% of observations will fall within three standard deviations of the mean (mean ± 3sd)
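The rule is easy to verify empirically; a small NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
for k in (1, 2, 3):
    share = np.mean(np.abs(x - x.mean()) <= k * x.std())
    print(f"within {k} sd: {share:.3f}")  # ~0.683, ~0.954, ~0.997
```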
18
Q

What is a data quality issue?

A

It is loosely defined as anything unusual about the data in an ABT

19
Q

What are the most common data quality issues?

A
  • Missing values
  • Irregular cardinality
  • Outliers
20
Q

What are the data quality issues we identify from a data quality report?

A
  • Issues due to invalid data (e.g., syntax errors)
  • Issues due to valid data (e.g., human error during data entry)
21
Q

How do you handle missing values?

A
  • Approach 1: Drop any features that have missing values
  • Approach 2: Apply complete case analysis (delete records)
  • Approach 3: Derive a missing indicator feature from features with missing values
  • Approach 4: Impute the missing values
22
Q

What are the effects of approach 1?

A
  • It can result in a massive, and frequently needless, loss of data
  • Only features missing in excess of 60% of their values should be considered for removal
  • An alternative is to derive a missing indicator feature for them, which could be categorical
23
Q

What is approach 2?

A
  • We delete instances that are missing one or more feature values
  • This results in significant amounts of data loss and can introduce bias into the dataset
  • It should rarely be used, and only when an instance is missing values for multiple features
  • It is, however, recommended for removing instances that are missing the value of the target feature
24
Q

Explain approach 3

A
  • This could be a categorical feature that flags the missing data as a new level (e.g., unknown in marital status)
  • Or a binary feature that flags whether the value was present or missing (T/F)
  • When missing indicator features are used, the original feature is usually discarded
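A sketch of both variants in pandas; abt and the feature names are hypothetical:

```python
# abt: hypothetical pandas DataFrame
# Binary indicator: was the value present or missing?
abt["income_missing"] = abt["income"].isna()

# Categorical flag: missing becomes its own level
abt["marital_status"] = abt["marital_status"].fillna("unknown")

# The original feature is usually discarded once an indicator exists
abt = abt.drop(columns=["income"])
```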
25
Q

Explain approach 4

A
  • Imputation replaces missing feature values with a plausible estimated value based on the feature values that are present
  • Most commonly you replace missing values with a measure of central tendency of that feature
  • We should be reluctant to use imputation on features missing in excess of 30% of their values
  • We strongly recommend against using imputation on features missing in excess of 50% of their values
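A sketch of central-tendency imputation in pandas; the feature names are hypothetical:

```python
# Continuous feature: impute with the median (robust to outliers)
abt["income"] = abt["income"].fillna(abt["income"].median())

# Categorical feature: impute with the mode
abt["occupation"] = abt["occupation"].fillna(abt["occupation"].mode()[0])
```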
26
Q

What is the easiest way to handle outliers?

A

Use the clamp transformation

27
Q

What does clamp transformation do?

A

It clamps all values above an upper threshold and below a lower threshold to those threshold values, thus removing the offending outliers
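In pandas/NumPy this is the clip operation; a sketch using the common 1.5 × IQR fences as thresholds (feature name hypothetical):

```python
q1, q3 = abt["income"].quantile([0.25, 0.75])
iqr = q3 - q1
abt["income"] = abt["income"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
```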

28
Q

How do you identify outliers?

A

Three popular rules of thumb:
- Values more than 1.5 × IQR above Q3 or below Q1; computing the quartiles requires sorting, so this is expensive
- Values more than 2 standard deviations from the mean; both the mean and sd can usually be computed cheaply
- The top and bottom 2% of the ordered data; trivial to implement but hard to defend as a scientific heuristic
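The three rules of thumb, sketched in pandas/NumPy on a hypothetical feature:

```python
import numpy as np

x = abt["income"]  # hypothetical continuous feature

# Rule 1: 1.5 * IQR fences
q1, q3 = x.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_out = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# Rule 2: more than 2 standard deviations from the mean
sd_out = (x - x.mean()).abs() > 2 * x.std()

# Rule 3: top and bottom 2% of the ordered data
pct_out = (x < x.quantile(0.02)) | (x > x.quantile(0.98))
```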

29
Q

What is a scatter plot?

A
  • It is based on two axes: a horizontal axis for one feature and a vertical axis for the other
  • Each instance is represented by a point on the plot, positioned by that instance's values for the two features involved
30
Q

What is a scatter plot matrix?

A
  • Scatter plot matrix (SPLOM) shows scatter plots for a whole collection of features arranged into a matrix
  • It is useful for exploring the relationship between groups of features
  • It is a visualization of the correlation matrix
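A sketch using pandas' built-in scatter-matrix helper on a hypothetical abt:

```python
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

scatter_matrix(abt.select_dtypes(include="number"), figsize=(8, 8))
plt.show()
```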
31
Q

What else can we do in addition to visually inspecting scatter plots?

A

Calculate formal measures of the relationship between two continuous features using covariance and correlation

32
Q

What range does covariance fall into?

A
  • (-∞, +∞)
  • Negative values indicate a negative relationship
  • Positive values indicate a positive relationship
  • Values near zero indicate that there is little to no relationship
33
Q

What is correlation?

A
  • A normalized form of covariance
  • Range: [-1, +1]
  • Correlation values close to -1 indicate a strong negative correlation
  • Values close to +1 indicate a strong positive correlation
  • Values around 0 indicate no correlation
  • Features that have no correlation have no linear relationship, though they may still be dependent in other ways
34
Q

What tools are useful for exploring relationships between multiple continuous features?

A
  • Covariance matrix
  • Correlation matrix (normalized version of covariance matrix)
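A sketch of both the pairwise measures and the full matrices; the feature names are hypothetical:

```python
import numpy as np

cov_xy = np.cov(abt["age"], abt["income"])[0, 1]        # covariance
corr_xy = np.corrcoef(abt["age"], abt["income"])[0, 1]  # Pearson correlation

num = abt.select_dtypes(include="number")
cov_matrix = num.cov()    # covariance matrix
corr_matrix = num.corr()  # correlation matrix (normalized covariances)
```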
35
Q

What are the three types of measures for correlation tests?

A
  • Distributive
  • Algebraic
  • Holistic (non-algebraic)
36
Q

Explain the distributive measure

A
  • A measure is distributive if the result derived by applying the function to n aggregate values (one per data partition) is the same as the result derived by applying the function to all the data without partitioning
  • Examples: count(), sum(), min(), max()
37
Q

Explain algebraic measure

A
  • A measure is algebraic if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function
  • Examples: average(), standardDeviation(), top-k() (the k largest values), centerOfMass()
38
Q

Explain holistic (non-algebraic measure)

A
  • A measure is holistic if there is no constant bound on the storage size needed to describe a sub-aggregate
  • Examples: median(), rank()
  • Sometimes we may be able to approximate holistic measures using non-holistic ones
39
Q

How does correlation help with measuring relationships between features?

A
  • Correlation is a good measure of the linear relationship/dependency between two continuous features, but it is by no means perfect
  • If we focus only on correlation values without visualizing the data, we may miss other dependencies between our features
  • Correlation does not necessarily imply causation
40
Q

What are the reasons for mistakenly assumed causation?

A
  • Mistaking the order of a causal relationship
  • For example, assuming that playing basketball makes people tall, when people actually choose to play basketball because of their height
  • Unawareness of a third (confounding) feature
  • A third feature influences the other two, making them appear highly correlated
  • For example, assuming that drinking coffee causes cardiovascular disease, when smoking actually causes it and most smokers happen to be coffee drinkers
41
Q

What is the simplest way to visualize the relationship between two categorical variables?

A

Using a collection of small multiple bar plots

42
Q

What situation can you use stacked bar plots as an alternative to the small multiples approach?

A

If the number of levels of one of the features being compared is no more than three

43
Q

What do you use to visualize the relationship between a continuous feature and categorical feature?

A

A small multiples approach that draws a histogram of the values of the continuous feature for each level of the categorical feature

44
Q

What other way can you visualize the relationship between categorical and continuous features?

A

Using a box plot

45
Q

What does a boxplot do?

A
  • It lets us visually show the five-number summary of a distribution
  • Trimmed minimum, Q1, Median, Q3, Trimmed maximum
46
Q

What is a violin plot?

A
  • A combination of a box plot and density plot
  • It provides even more detailed information about our data distribution
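A sketch of both plots using seaborn; the feature names are hypothetical:

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(data=abt, x="marital_status", y="age")     # box plot per level
plt.show()
sns.violinplot(data=abt, x="marital_status", y="age")  # violin plot per level
plt.show()
```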
47
Q

What do data preparation techniques do?

A

They change the way data is represented in order to make it more compatible with certain machine learning algorithms

48
Q

What are the data preparation techniques?

A
  • Normalization (feature scaling)
  • Binning
  • Sampling
49
Q

Explain normalization

A
  • These techniques can be used to change a continuous feature to fall within a specified range while maintaining the relative differences between the values for the feature
50
Q

Why must you remember to scale back when presenting your findings?

A

So that findings are reported in the original units; this requires retaining all information about the normalization procedure so it can be reversed

51
Q

Why is normalization done?

A
  • To ensure that no feature with a larger range misleads our ML algorithm
  • To safely fit the original data into a range desired by our computer/ML software
52
Q

What are the common types of data normalization?

A
  • Min-Max normalization (Range normalization)
  • Z-score normalization (Standardization)
  • Decimal scaling
53
Q

Explain Min-Max normalization

A
  • It is used to linearly convert all feature values into the new range
  • It allows us to maintain all relative differences between the values for the feature
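A sketch of the transformation; the target range defaults to [0, 1] here, and the feature name is hypothetical:

```python
def min_max(col, low=0.0, high=1.0):
    # Linearly map col into [low, high], preserving relative differences
    return low + (col - col.min()) * (high - low) / (col.max() - col.min())

abt["age_norm"] = min_max(abt["age"])
```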
54
Q

What is the benefit of min-max normalization?

A
  • It is convenient and intuitive
  • It lets us fit the data into any range we want easily while the original proportions between data points are guaranteed to be perfectly preserved
55
Q

What is the drawback of range normalization?

A
  • It is sensitive to the presence of outliers in a dataset
  • It risks out-of-bound errors when new entries fall outside the original minimum and maximum
56
Q

Explain Z-score normalization

A
  • It is the process of transforming data to standard scores
  • A standard score measures how many standard deviations a feature value is from the mean for that feature
57
Q

How does z-score normalization work?

A
  • It squashes the values of the feature so that the feature values have a mean of 0 and standard deviation of 1
  • This results in the majority (68%) of the feature values being in a range of [-1, +1]
  • If the data in all features follow a Gaussian/Normal distribution they will be converted to standard normal and comparisons between different features become easier.
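A sketch of the standard-score computation on a hypothetical feature:

```python
# (value - mean) / sd gives mean 0 and standard deviation 1
abt["age_std"] = (abt["age"] - abt["age"].mean()) / abt["age"].std()
```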
58
Q

Compare Min-Max and Z-score normalization

A
  • It is easy to specify the output range for min-max normalization (often [0, 1], [0.1, 0.9], or [1, 10] to remain positive)
  • It is more of a challenge to guarantee an exact range from Z-score normalization
  • Standardization is slightly more resistant than min-max to out-of-bound errors
  • Standardization will introduce distortions if the data is not normally distributed
59
Q

What is robust scaling?

A
  • A scaling method that is less affected by outliers
  • It is useful because outliers affect the sample mean and standard deviation very significantly
  • It removes the median and scales by the IQR
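A sketch, similar in spirit to scikit-learn's RobustScaler; the feature name is hypothetical:

```python
x = abt["income"]
q1, q3 = x.quantile([0.25, 0.75])
abt["income_robust"] = (x - x.median()) / (q3 - q1)  # remove median, scale by IQR
```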
60
Q

Explain Normalization by Decimal Scaling

A
  • It transforms data by moving the decimal point of the feature values
  • The number of places the decimal point moves depends on the maximum absolute value of the feature
  • It is trivial to implement
  • It is highly out-of-bounds resistant
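A sketch, choosing j as the smallest power of ten that brings every absolute value below 1; the feature name is hypothetical:

```python
import numpy as np

x = abt["income"]
j = int(np.floor(np.log10(x.abs().max()))) + 1
abt["income_scaled"] = x / 10**j  # all values now fall in (-1, 1)
```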
61
Q

When is Decimal Scaling used?

A
  • In the signal-processing community, when we want to separate noise from data easily
  • Recommended for dealing with left-skewed unimodal or exponential data distributions
62
Q

Explain binning

A

This involves converting a continuous feature into an ordinal feature

63
Q

How do you perform binning?

A

Define a series of ranges, called bins, for the continuous feature that correspond to the levels of the new categorical feature we are creating

64
Q

What are the popular ways of defining bins?

A
  • Equal-width binning
  • Equal-frequency binning
65
Q

What are the trade-off for deciding on the number of bins?

A
  • If we set the number of bins too low, we may lose a lot of information
  • If the number of bins is too high, we may have very few instances in each bin or even empty bins
  • Even so, the latter is usually preferable to losing information
66
Q

Explain equal-width binning

A
  • It splits the range of the feature values into b bins, each of size range/b
  • We store the bucket boundaries and the sum of the frequencies of the values within each bucket
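In pandas this is pd.cut; a minimal sketch with b = 4 bins on a hypothetical feature:

```python
import pandas as pd

abt["age_bin"] = pd.cut(abt["age"], bins=4)  # 4 bins of equal width
```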
67
Q

Explain the construction of equal-width binning

A
  • Use one pass over the ABT to construct an accurate equal-width histogram
  • Keep a running count for each bucket
  • If a full scan is not acceptable, use sampling
  • Construct a histogram on an ABT sample and scale the frequencies by |ABT|/|ABT sample|
68
Q

Explain the maintenance of equal-width binning

A
  • Incremental maintenance: for each update in ABT increment/decrement the corresponding bucket frequencies
  • Periodical re-computation: because distribution changes slowly
69
Q

Explain equal-frequency binning

A
  • It first sorts the feature values in ascending order and then places an equal number of instances into each bin starting with bin 1
  • The number of instances in each bin is (total number of instances)/ (number of bins)
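In pandas this is pd.qcut; a minimal sketch with 4 bins on a hypothetical feature:

```python
import pandas as pd

abt["age_bin"] = pd.qcut(abt["age"], q=4)  # roughly n/4 instances per bin
```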
70
Q

Explain the construction of equal-frequency binning

A
  • Sort all data entries, then take equally spaced slices
  • Sampling also works
71
Q

Explain the maintenance of equal-frequency binning

A
  • Incremental maintenance
    • Merge adjacent buckets with small counts
    • Split any bucket with large count
      • Select the median value to split
      • Need a sample of the values within this bucket to work well
  • Periodic re-computation also works
72
Q

Explain sampling

A
  • It is used when the dataset is so large that we do not use all the data available to us in an ABT
  • We must be careful to ensure that the resulting datasets are still representative of the original dataset and that no unintended bias is introduced in the process
73
Q

What are the common forms of sampling?

A
  • Top sampling
  • Random sampling
  • Stratified sampling
  • Under-sampling
  • Over-sampling
74
Q

Explain top sampling

A
  • Selects the top s% of instances from the dataset to create a sample
  • Runs the risk of introducing bias, since the sample will be affected by any ordering of the original dataset
  • It is recommended to avoid top sampling
75
Q

Explain random sampling

A
  • Randomly selects s% of the instances from the large dataset to create a smaller set
  • It is a good choice in most cases, since the randomness should avoid introducing bias
76
Q

Explain stratified sampling

A
  • It ensures that the relative frequencies of the levels of a specific stratification feature are maintained in the sampled dataset
77
Q

How do you perform stratified sampling?

A
  • Divide the instances in the dataset into groups (strata)
  • Each group contains only instances that have a particular level for the stratification feature
  • s% of the instances in each stratum are randomly selected
  • These selections are combined to give an overall sample of s% of the original dataset
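A sketch in pandas, stratifying on a hypothetical feature and sampling 10% of each stratum:

```python
sample = abt.groupby("marital_status").sample(frac=0.10, random_state=42)
```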
78
Q

When is over-sampling and under-sampling used?

A

When we would like a sample to contain different relative frequencies of the levels of a particular feature from those in the original dataset

79
Q

Explain under-sampling

A
  • Begins by dividing a dataset into groups where each group contains only instances that have a particular level for the feature to be under-sampled
  • The number of instances in the smallest group is the under-sampling target size
  • Each group that has more instances than the smallest is randomly sampled by the appropriate percentage to create a subset that is the under-sampling target size
  • The under-sampled groups are combined to create the overall under-sampled dataset
80
Q

Explain over-sampling

A
  • It addresses the same issue as under-sampling but in the opposite way
  • After dividing the dataset into groups, the number of instances in the largest group becomes the over-sampling target size
  • From each smaller group, we create a sample containing that number of instances using random sampling with replacement
  • These larger samples are combined to form the overall over-sampled dataset
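A sketch of both under- and over-sampling in pandas, assuming a hypothetical target feature to balance on:

```python
import pandas as pd

groups = abt.groupby("target")

# Under-sampling: randomly shrink every group to the size of the smallest
under = groups.sample(n=groups.size().min(), random_state=42)

# Over-sampling: grow each smaller group to the size of the largest
# using random sampling with replacement; the largest group is kept as-is
n_max = groups.size().max()
over = pd.concat(
    g if len(g) == n_max else g.sample(n=n_max, replace=True, random_state=42)
    for _, g in groups
)
```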