Questions set 1 Flashcards
(144 cards)
[EXAM- UDEMY] You are asked to solve a classification task.
You must evaluate your model on a limited data sample by using k-fold cross-validation. You start by configuring a k parameter as the number of splits.
You need to configure the k parameter for the cross-validation.
Which value should you use?
- k = 10
- k = 0.9
- k = 0.5
- k = 1
Leave One Out (LOO) cross-validation
Setting k = n (the number of observations) yields n-fold cross-validation and is called leave-one-out cross-validation (LOO); this is a special case of the k-fold approach.
LOO CV is sometimes useful but typically doesn’t shake up the data enough. The estimates from each fold are highly correlated and hence their average can have high variance.
This is why the usual choice is k = 5 or k = 10, which provides a good compromise for the bias-variance trade-off. Of the options given, only k = 10 is a valid number of splits.
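A minimal sketch of 10-fold cross-validation with scikit-learn, assuming a toy dataset and classifier (the model and data are illustrative, not part of the exam question):

```python
# Minimal 10-fold cross-validation sketch (assumes scikit-learn is available).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

cv = KFold(n_splits=10, shuffle=True, random_state=0)  # k = 10 splits
scores = cross_val_score(model, X, y, cv=cv)           # one accuracy score per fold
print(scores.mean(), scores.std())
```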
[PERSONAL] what is the purpose of K-fold cross validation
- maximize the use of the available data for training and then testing a model.
- assess model performance, as it provides a range of accuracy scores across (somewhat) different data sets.
[PERSONAL] what is the purpose of cross-validation?
Cross-validation (CV) is one of the techniques used to test the effectiveness of a machine learning model; it is also a re-sampling procedure used to evaluate a model when we have limited data. To perform CV we keep aside a sample/portion of the data that is not used to train the model, and later use this sample for testing/validation.
[PERSONAL] Give the variations on cross-validation
Train/Test Split: Taken to one extreme, k may be set to 2 (not 1) so that a single train/test split is created to evaluate the model.
LOOCV: Taken to the other extreme, k may be set to the total number of observations in the dataset so that each observation is given a chance to be held out of the dataset. This is called leave-one-out cross-validation, or LOOCV for short.
Stratified: The splitting of data into folds may be governed by criteria such as ensuring that each fold has the same proportion of observations with a given categorical value, such as the class outcome value. This is called stratified cross-validation.
Repeated: The k-fold cross-validation procedure is repeated n times, where, importantly, the data sample is shuffled prior to each repetition, resulting in a different split of the sample.
Nested: k-fold cross-validation is performed within each fold of cross-validation, often to perform hyperparameter tuning during model evaluation. This is called nested cross-validation or double cross-validation.
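A quick sketch of how these variations map onto scikit-learn splitters (this mapping is my assumption, not something the flashcard specifies):

```python
# Cross-validation variations sketched with scikit-learn splitter classes.
from sklearn.model_selection import LeaveOneOut, RepeatedKFold, StratifiedKFold

loo = LeaveOneOut()                                              # LOOCV: k = n
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # keeps class proportions per fold
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)     # reshuffles before each repetition
# Nested CV: run a hyperparameter search (e.g. GridSearchCV) inside each outer fold.
```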
[EXAM- UDEMY] Your manager asked you to analyze a numerical dataset which contains missing values in several columns.
You must clean the missing values using an appropriate operation without affecting the dimensionality of the feature set.
You need to analyze a full dataset to include all values.
Solution:
Use the Last Observation Carried Forward (LOCF) method to impute the missing data points.
Does the solution meet the goal?
Explanation
No. Instead of using the Last Observation Carried Forward method, you need to use the Multiple Imputation by Chained Equations (MICE) method.
Replace using MICE: For each missing value, this option assigns a new value, which is calculated by using a method described in the statistical literature as “Multivariate Imputation using Chained Equations” or “Multiple Imputation by Chained Equations”. With a multiple imputation method, each variable with missing data is modeled conditionally using the other variables in the data before filling in the missing values.
Note:
Last observation carried forward (LOCF) is a method of imputing missing data in longitudinal studies. If a person drops out of a study before it ends, then his or her last observed score on the dependent variable is used for all subsequent (i.e., missing) observation points. LOCF is used to maintain the sample size and to reduce the bias caused by the attrition of participants in a study.
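To illustrate what LOCF actually does (not the recommended MICE solution), a forward fill in pandas carries the last observed value over the gaps; the series below is made up:

```python
import numpy as np
import pandas as pd

# LOCF: carry the last observed value forward over missing points.
s = pd.Series([1.0, 2.0, np.nan, np.nan, 5.0])
print(s.ffill())  # -> 1.0, 2.0, 2.0, 2.0, 5.0
```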
[PERSONAL] Pros and Cons of mean/median imputation
Pros:
Easy and fast.
Works well with small numerical datasets.
Cons:
Doesn’t factor in the correlations between features; it only works at the column level.
Will give poor results on encoded categorical features (do NOT use it on categorical features).
Not very accurate.
Doesn’t account for the uncertainty in the imputations.
[PERSONAL] Pros and Cons of Most Frequent or Zero/Constant values
Pros:
Works well with categorical features.
Cons:
It also doesn’t factor in the correlations between features.
It can introduce bias in the data.
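A small sketch of the simple strategies from the two cards above, using scikit-learn's SimpleImputer on made-up data (mean/median for numeric columns, most-frequent or a constant for categorical ones):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_num = np.array([[1.0, 10.0], [np.nan, 12.0], [3.0, np.nan]])
X_cat = np.array([["red"], ["blue"], [np.nan]], dtype=object)

print(SimpleImputer(strategy="mean").fit_transform(X_num))           # column means
print(SimpleImputer(strategy="most_frequent").fit_transform(X_cat))  # mode per column
print(SimpleImputer(strategy="constant", fill_value="missing").fit_transform(X_cat))
```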
[PERSONAL] Pros and Cons of Imputation Using k-NN
Pros:
Can be much more accurate than the mean, median or most frequent imputation methods (It depends on the dataset).
Cons:
Computationally expensive.
KNN works by storing the whole training dataset in memory.
K-NN is quite sensitive to outliers in the data (unlike SVM)
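A hedged sketch of k-NN imputation with scikit-learn's KNNImputer, where each missing entry is filled from the average of its nearest neighbours (toy data):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [3.0, 4.0], [np.nan, 6.0], [8.0, 8.0]])
imputer = KNNImputer(n_neighbors=2)  # average the feature over the 2 closest rows
print(imputer.fit_transform(X))
```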
[PERSONAL] Pros and Cons of Imputation Using Multivariate Imputation by Chained Equations (MICE)
Pros:
- More accurate than a single imputation.
- Flexible: can handle variables of different data types.
- Can handle complexities such as bounds or survey skip patterns.
This type of imputation works by filling in the missing data multiple times. Multiple imputations (MIs) are much better than a single imputation because they capture the uncertainty of the missing values in a better way. The chained-equations approach is also very flexible and can handle variables of different data types (i.e., continuous or binary) as well as complexities such as bounds or survey skip patterns.
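scikit-learn's IterativeImputer is explicitly modelled on the MICE idea (each feature with missing values is modelled as a function of the others, in rounds); a minimal sketch on made-up data:

```python
import numpy as np
# IterativeImputer is still experimental, so this enabling import is required first.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [2.0, 4.0], [np.nan, 6.0], [4.0, np.nan]])
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))  # each column is regressed on the others, iteratively
```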
[PERSONAL] what is Hot-Deck imputation
Works by randomly choosing the replacement for a missing value from a set of related and similar records (donors).
[PERSONAL] what is Extrapolation and Interpolation imputation?
Interpolation tries to estimate missing values from other observations within the range of a discrete set of known data points; extrapolation estimates values beyond that range.
[PERSONAL] what is Stochastic regression imputation
It is quite similar to regression imputation, which predicts the missing values by regressing them on other related variables in the same dataset, but it adds some random residual value to each prediction.
[EXAM - UDEMY]
You are a senior data scientist at your company and you use Azure Machine Learning Studio.
You are asked to normalize values to produce an output column into bins to predict a target column.
Solution:
Apply a Quantiles normalization with a QuantileIndex normalization.
Does the solution meet the goal?
Quantile normalization (summary of the YouTube video below): take the highest value of each distribution, calculate their mean, and set those elements of the different distributions to that mean; repeat for each rank going down.
https://www.youtube.com/watch?reload=9&v=ecjN6Xpv6SE
In statistics, quantile normalization is a technique for making two distributions identical in statistical properties. To quantile-normalize a test distribution to a reference distribution of the same length, sort both distributions, then assign each entry of the test distribution the value of the reference entry with the same rank.
This has nothing to do with bins.
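A rough sketch of the quantile-normalization idea described above, assuming numeric columns and ignoring tie handling (the data is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [5, 2, 3], "b": [4, 1, 9], "c": [3, 4, 8]})

# Mean of the sorted values at each rank, pooled across columns.
rank_means = np.sort(df.to_numpy(), axis=0).mean(axis=1)

# Replace each value by the pooled mean for its within-column rank.
ranks = df.rank(method="first").astype(int) - 1
qn = pd.DataFrame(rank_means[ranks.to_numpy()], columns=df.columns)
print(qn)  # every column now has the same set of values, i.e. the same distribution
```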
Entropy MDL: This method requires that you select the column you want to predict and the column or columns that you want to group into bins. It then makes a pass over the data and attempts to determine the number of bins that minimizes the entropy. In other words, it chooses a number of bins that allows the data column to best predict the target column. It then returns the bin number associated with each row of your data in a new output column.
[EXAM - UDEMY]
You are analyzing a raw dataset that requires cleaning.
You must perform transformations and manipulations by using Azure Machine Learning Studio.
You need to identify the correct module to perform the below transformation.
Which module should you choose?
Scenario:
Remove potential duplicates from a dataset
- remove duplicate rows
- SMOTE
- Convert to indicator values
- Clean missing data
- Threshold filter
Use the Remove Duplicate Rows module in Azure Machine Learning Studio (classic) to remove potential duplicates from a dataset.
[PERSONAL]
What are all the categories in the data transformation category?
Data Transformation - Filter
Learning with Counts
Data Transformation - Manipulation
Data Transformation - Sample and Split
Data Transformation - Scale and Reduce
[PERSONAL] Data Transformation - Filter
Give all the types of filters and what they do.
Apply Filter: Applies a filter to specified columns of a dataset.
FIR Filter: Creates an FIR filter for signal processing.
IIR Filter: Creates an IIR filter for signal processing.
Median Filter: Creates a median filter that’s used to smooth data for trend analysis.
Moving Average Filter: Creates a moving average filter that smooths data for trend analysis.
Threshold Filter: Creates a threshold filter that constrains values.
User-Defined Filter: Creates a custom FIR or IIR filter.
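The median, moving-average and threshold filters above are essentially rolling statistics and clipping; a hedged pandas sketch on a made-up series:

```python
import pandas as pd

signal = pd.Series([1, 9, 2, 3, 8, 4, 5])
print(signal.rolling(window=3, center=True).median())  # median filter: robust smoothing
print(signal.rolling(window=3, center=True).mean())    # moving-average filter
print(signal.clip(upper=6))                            # threshold filter: constrain values
```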
[PERSONAL] Data Transformation - Learning with Counts
The basic idea of count-based featurization is that by calculating counts, you can quickly and easily get a summary of what columns contain the most important information. The module counts the number of times a value appears, and then provides that information as a feature for input to a model.
Build Counting Transform: Creates a count table and count-based features from a dataset, and then saves the table and features as a transformation.
Export Count Table: Exports a count table from a counting transform. This module supports backward compatibility with experiments that create count-based features by using Build Count Table (deprecated) and Count Featurizer (deprecated).
Import Count Table: Imports an existing count table. This module supports backward compatibility with experiments that create count-based features by using Build Count Table (deprecated) and Count Featurizer (deprecated). The module supports conversion of count tables to count transformations.
Merge Count Transform: Merges two sets of count-based features.
Modify Count Table Parameters: Modifies count-based features that are derived from an existing count table.
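A rough pandas sketch of the count-based featurization idea (count how often each categorical value occurs per class, then feed those counts to the model); the data and column names are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "device": ["phone", "phone", "tablet", "phone", "tablet"],
    "label":  [1, 0, 0, 1, 1],
})

# Count table: occurrences of each category value per class label.
count_table = pd.crosstab(df["device"], df["label"])
print(count_table)

# Count-based features: map each row's category to its per-class counts.
features = count_table.loc[df["device"]].reset_index(drop=True)
print(features)
```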
[PERSONAL]
Data Transformation - Manipulation
Give some modules in this category.
Add Columns: Adds a set of columns from one dataset to another.
Add Rows: Appends a set of rows from an input dataset to the end of another dataset.
Apply SQL Transformation: Runs a SQLite query on input datasets to transform the data.
Clean Missing Data: Specifies how to handle values that are missing from a dataset. This module replaces Missing Values Scrubber, which has been deprecated.
Convert to Indicator Values: Converts categorical values in columns to indicator values.
Edit Metadata: Edits metadata that’s associated with columns in a dataset.
Group Categorical Values:
Groups data from multiple categories into a new category.
Join Data: Joins two datasets.
Remove Duplicate Rows: Removes duplicate rows from a dataset.
Select Columns in Dataset: Selects columns to include in or exclude from a dataset in an operation.
Select Columns Transform: Creates a transformation that selects the same subset of columns as in a specified dataset.
SMOTE: Increases the number of low-incidence examples in a dataset by using synthetic minority oversampling.
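Several of these manipulations have close pandas counterparts; a quick illustrative sketch (the data and column names are made up):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 2, 3], "x": [10, 20, 20, 30]})
right = pd.DataFrame({"id": [1, 2, 3], "colour": ["red", "blue", "red"]})

deduped = left.drop_duplicates()                         # Remove Duplicate Rows
joined = deduped.merge(right, on="id", how="inner")      # Join Data
indicators = pd.get_dummies(joined, columns=["colour"])  # Convert to Indicator Values
subset = indicators[["id", "x"]]                         # Select Columns in Dataset
print(subset)
```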
[PERSONAL] Data Transformation - Sample and Split
Give the two modules and what they do.
Partition and Sample: Creates multiple partitions of a dataset based on sampling.
Split Data: Partitions the rows of a dataset into two distinct sets.
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/data-transformation-sample-and-split
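Split Data roughly corresponds to a single hold-out split and Partition and Sample to random sampling; a hedged sketch with scikit-learn and pandas:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(10)})
train, test = train_test_split(df, test_size=0.3, random_state=0)  # Split Data: two distinct sets
sample = df.sample(frac=0.5, random_state=0)                       # Partition and Sample: sampling
print(len(train), len(test), len(sample))
```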
[PERSONAL] Data Transformation - Scale and Reduce
Clip Values: Detects outliers, and then clips or replaces their values.
Group Data into Bins: Puts numerical data into bins.
Normalize Data: Rescales numeric data to constrain dataset values to a standard range.
Principal Component Analysis: Computes a set of features that have reduced dimensionality for more efficient learning.
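A hedged sketch of rough equivalents for these modules using numpy, pandas and scikit-learn (toy data):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0], [4.0, 800.0]])

print(np.clip(X, 1.5, 700.0))                 # Clip Values: cap outlying values
print(pd.cut(X[:, 1], bins=2, labels=False))  # Group Data into Bins
print(MinMaxScaler().fit_transform(X))        # Normalize Data: rescale to [0, 1]
print(PCA(n_components=1).fit_transform(X))   # Principal Component Analysis
```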
[EXAM - UDEMY]
You are a data scientist using Azure Machine Learning Studio.
You are performing a filter-based feature selection for a dataset to build a multi-class classifier by using Azure Machine Learning Studio.
The dataset contains categorical features that are highly correlated to the output label column.
You need to select the appropriate feature scoring statistical method to identify the key predictors.
Which method should you use?
- spearman correlation
- Kendall correlation
- Chi-squared
- Pearson correlation
Explanation
The chi-square statistic is used to show whether or not there is a relationship between two categorical variables.
Incorrect Answer:
Pearson’s correlation coefficient (r) is used to demonstrate whether two variables are correlated or related to each other.
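In scikit-learn terms, filter-based selection with a chi-squared score looks roughly like SelectKBest with the chi2 statistic (features must be non-negative, e.g. counts or one-hot encodings); a minimal sketch on a toy dataset:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)             # non-negative features, multi-class label
selector = SelectKBest(score_func=chi2, k=2)  # keep the 2 highest-scoring features
X_selected = selector.fit_transform(X, y)
print(selector.scores_)                       # chi-squared score per feature
```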
[PERSONAL]
Explain the chi-squared test. What is it used for?
It is a statistical test applied to sets of categorical data to evaluate how likely it is that any observed difference between the sets arose by chance.
How likely is it that two sets of observations arose from the same distribution?
YT: https://www.youtube.com/watch?v=2QeDRsxSF9M
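A small SciPy sketch of the chi-squared test of independence on a made-up contingency table:

```python
from scipy.stats import chi2_contingency

# Rows: two groups; columns: counts of a categorical outcome (made-up numbers).
table = [[30, 10],
         [20, 40]]
chi2_stat, p_value, dof, expected = chi2_contingency(table)
print(chi2_stat, p_value)  # a small p-value suggests the two variables are related
```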
[PERSONAL]
Spearman correlation
Spearman correlation is often used to evaluate relationships involving ordinal variables. For example, you might use a Spearman correlation to evaluate whether the order in which employees complete a test exercise is related to the number of months they have been employed.
Spearman’s rank correlation coefficient is a technique that can be used to summarise the strength and direction (negative or positive) of a relationship between two variables. The result will always be between +1 and −1.
A Spearman correlation of 1 results when the two variables being compared are monotonically related, even if their relationship is not linear. This means that all data points with greater x values than that of a given data point will have greater y values as well. In contrast, this does not give a perfect Pearson correlation.
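The monotonic-but-nonlinear point is easy to check numerically with SciPy (made-up data):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1, 8)
y = x ** 3  # monotonically increasing, but not linear

rho, _ = spearmanr(x, y)
r, _ = pearsonr(x, y)
print(rho)  # 1.0: the ranks agree perfectly
print(r)    # < 1.0: the relationship is not linear
```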
[PERSONAL]
Kendall correlation
In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall’s τ coefficient (after the Greek letter τ, tau), is a statistic used to measure the ordinal association between two measured quantities. Both Kendall’s τ and Spearman’s ρ can be formulated as special cases of a more general correlation coefficient.
In the normal case, the Kendall correlation is preferred to the Spearman correlation because of a smaller gross error sensitivity (GES) (more robust) and a smaller asymptotic variance (AV) (more efficient).
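And Kendall’s τ with SciPy on the same kind of ranked data as the Spearman example (the numbers are made up):

```python
from scipy.stats import kendalltau

exam_order = [1, 2, 3, 4, 5, 6]         # order in which employees finished a test
months_employed = [2, 4, 7, 5, 10, 12]  # mostly, but not perfectly, the same ordering

tau, p_value = kendalltau(exam_order, months_employed)
print(tau, p_value)
```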