Chapter 3 - Linear Models Flashcards
(33 cards)
Types of variables (2; 3.1.1)
1) Numeric - take the form of numbers with a well-defined order and associated range. Leads to supervised regression problems.
a) Discrete - restricted to only certain numeric values in that range
b) Continuous - can assume any value in a continuum
2) Categorical - take the form of predefined values in a countable collection of categories (called levels or classes). Leads to supervised classification problems.
a) Binary - can only take two possible levels (Y/N, etc.)
b) Multi-level - can take more than two possible levels (State, etc.)
Supervised vs. unsupervised problems (2; 3.1.1)
1) Supervised learning problems - target variable “supervises” the analysis; goal is to understand the relationship between the target variable and the predictors and/or to make accurate predictions for the target based on the predictors
2) Unsupervised learning problems - No target variable supervising the analysis. Goal is to extract relationships and structures between different variables in the data.
The model building process (6; 3.1.2)
1) Problem Definition
2) Data Collection and Validation
3) Exploratory Data Analysis - with the use of descriptive statistics and graphical displays, clean the data for incorrect, unreasonable, and inconsistent entries, and understand the characteristics of and key relationships among variables in the data.
4) Model Construction, Evaluation, and Selection
5) Model Validation
6) Model Maintenance
Characteristics of predictive modeling problems (6; 3.1.2)
1) Issue - there is a clearly defined business issue that needs to be addressed
2) Questions - the issue can be addressed with a few well-defined questions
a) What data do we need?
b) What is the target or outcome?
c) What are the success criteria (i.e., how will model performance be evaluated)?
3) Data - Good and useful data is available for answering the questions above
4) Impact - the predictions will likely drive actions or increase understanding
5) Better solution - predictive analytics likely produces a solution better than any existing approach
6) Update - We can continue to monitor and update the models when new data becomes available
Defining the problem (2; 3.1.2)
1) Hypotheses - use prior knowledge of the business problem to ask questions and develop hypotheses to guide analysis efforts in a clearly defined way.
2) Key performance indicators - to provide a quantitative basis to measure the success of the project
Data design considerations (3; 3.1.2)
1) Relevance - representative of the environment where our predictive model will operate
a) Population - data source aligns with the true population of interest
b) Time frame - should best reflect the business environment in which the model will be implemented
2) Sampling - process of taking a manageable subset of observations from the data source. Methods include:
a) Random sampling
b) Stratified sampling - dividing the population into strata and randomly sampling a set number of observations from each stratum (see the sketch after this card)
3) Granularity - refers to how precisely a variable in a dataset is measured / how detailed the information contained by the variable is
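Example (a minimal base-R sketch of stratified sampling; the data frame dat and stratum column region are assumed names, not from the source):
set.seed(1)
strata <- split(seq_len(nrow(dat)), dat$region)   # row indices grouped by stratum
idx <- unlist(lapply(strata, function(i) i[sample.int(length(i), min(100, length(i)))]))
samp <- dat[idx, ]                                # up to 100 randomly sampled rows per stratum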
Data quality considerations (3; 3.1.2)
1) Reasonableness - data values should be reasonable
2) Consistency - records in the data should be inputted consistently
3) Sufficient documentation - should at least include the following:
a) A description of the dataset overall, including the data source
b) A description of each variable in the data, including its name, definition, and format
c) Notes about any past updates or other irregularities of the dataset
d) A statement of accountability for the correctness of the dataset
e) A description of the governance processes used to manage the dataset
Other data issues (3; 3.1.2)
1) PII/PHI - Data with PII/PHI should be de-identified and should have sufficient data security protections
2) Variables with legal/ethical concerns - variables with sensitive information or of protected classes may lead to unfair discrimination and raise equity concerns. Care should also be taken with proxy variables of prohibited variables (e.g., occupation may be a proxy for gender)
3) Target leakage - when predictors in a model include information about the target variable that will not be available when the model is applied in practice (e.g., with inpatient length of stay (IP LOS) as the target variable, the number of lab procedures may be a predictor, but it will not be known in practice until the inpatient stay concludes)
a) When target leakage occurs, may develop a predictive model for the leaky predictor, then use the predicted value in a separate model to predict the target variable
Training/test dataset split (3.1.2)
1) Training set (typically 70-80% of the full data) - used for training or developing the predictive model to estimate the signal function and model parameters
2) Test set (typically 20-30% of the full data) - apply the trained model to make a prediction on each observation in the test set to assess the prediction performance
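Example (a minimal base-R sketch of a 75/25 random split; the data frame dat is an assumed name):
set.seed(42)                                      # reproducible split
train_idx <- sample(nrow(dat), floor(0.75 * nrow(dat)))
train <- dat[train_idx, ]                         # ~75% training set
test  <- dat[-train_idx, ]                        # ~25% test set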
Common performance metrics (3; 3.1.2)
1) Square loss - squared difference between observed and predicted values
a) Root Mean Squared Error (RMSE) - square root of the average of all observations’ square losses
2) Absolute loss - absolute difference between observed and predicted values (less commonly used than square loss because the absolute value is not differentiable at zero)
a) Mean Absolute Error (MAE) - average of all observations’ absolute losses
3) Zero-one loss - 1 if the predicted and observed values are not equal, 0 if equal (commonly used in classification problems)
a) Classification error rate = proportion of misclassified observations (the average of the zero-one losses)
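Example (computing the three metrics in R; the vectors obs, pred, obs_cl, and pred_cl are hypothetical):
rmse <- sqrt(mean((obs - pred)^2))    # root mean squared error
mae  <- mean(abs(obs - pred))         # mean absolute error
err  <- mean(obs_cl != pred_cl)       # classification error rate = average zero-one loss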
Cross-validation (3.1.2)
Method for tuning hyperparameters (parameters that control some aspect of the fitting process itself) without having to further divide the training set
1) Randomly split the training data into k folds of approximately equal size (k = 10 is the default in many model-fitting functions in R)
2) One of the k folds is left out and the predictive model is fitted to the remaining k-1 folds. The fitted model then predicts each observation in the left-out fold, and a performance metric is computed on that fold.
3) Repeat the process for all folds, resulting in k performance metric values.
4) Overall prediction performance of the model can be estimated as the average of the k performance values, known as the CV error.
This technique can be used on each set of hyperparameter values under consideration to select the combination that produces the best model performance.
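Example (a hand-rolled k-fold CV sketch in base R; the data frame train with numeric target y is assumed):
set.seed(1)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(train)))    # random fold labels
cv_rmse <- sapply(1:k, function(j) {
  fit <- lm(y ~ ., data = train[folds != j, ])         # fit on the other k-1 folds
  pred <- predict(fit, newdata = train[folds == j, ])  # predict the left-out fold
  sqrt(mean((train$y[folds == j] - pred)^2))           # fold-level RMSE
})
mean(cv_rmse)                                          # CV error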
Considerations for selecting the best model (3; 3.1.2)
1) Prediction performance
2) Interpretability - model predictions should be easily explained in terms of the predictors and lead to specific actions or insights
3) Ease of implementation - models should not require prohibitive resources to construct and maintain
Model validation techniques (3; 3.1.2)
1) Training set - for GLMs, there is a set of model diagnostic tools designed to check the model assumptions based on the training set
2) Test set - compare predicted values and the observed values of the target variable on the test set
3) Compare to an existing, baseline model - use a primitive model to provide a benchmark which any selected model should beat as a minimum requirement
Model maintenance steps (5; 3.1.2)
1) Adjust the business problem to account for new information
2) Consult with subject matter experts - if there are new findings that don’t fit the current understanding of the business problem, modeling issues that cannot be easily resolved, or a need to understand limitations on what can reasonably be implemented
3) Gather additional data - gather new observations and retrain model or gather new variables
4) Apply new types of models - when new technology or implementation possibilities are available
5) Refine existing models - try new combinations of predictors, alternative hyperparameter values, alternative performance metrics, etc.
Bias/variance trade-off (4; 3.1.3)
1) Bias - the difference between the expected value of the estimated signal function (averaged over all possible training sets) and the true signal function
a) the more complex/flexible a model, the lower the bias due to its higher ability to capture the signal in the data
b) Corresponds to accuracy
2) Variance - the amount by which the estimated signal function would change if it were estimated using a different training set
a) the more flexible a model, the higher the variance due to its attunement to the training set
b) Corresponds to precision
3) Irreducible error - variance of noise, which is independent of the predictive model but inherent in the random nature of the target variable
4) Bias/variance trade-off - a more flexible model generally has a lower bias but a higher variance than a less flexible model
a) Underfitting - while a model is underfitted, adding flexibility reduces bias faster than it increases variance, so test error falls
b) Overfitting - once a model becomes overfitted, variance increases faster than bias drops, so test error rises
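c) Putting the pieces together (the standard decomposition for squared-error loss): expected test error = Bias^2 + Variance + Irreducible error, so minimizing test error means balancing the first two terms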
Feature generation and feature selection (2; 3.1.4)
1) Feature generation is the process of generating new features (i.e., derivatives of raw variables) based on existing variables in the data.
a) Predictive power - transform the data so that a predictive model can better “absorb” the information
b) Interpretability - a new feature can also make a model easier to interpret by transforming the original variables into something more meaningful or interpretable
2) Feature selection is the process of dropping features or variables with limited predictive power and therefore reducing the dimension of the data
a) Predictive power - feature selection is an attempt to control model complexity and prevent overfitting
b) Interpretability - preference for simpler, cleaner (parsimonious) models
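Example (a minimal illustration of both ideas in R; all variable names are hypothetical):
dat$bmi <- dat$weight_kg / dat$height_m^2    # feature generation: a more meaningful derived feature
dat$weight_kg <- dat$height_m <- NULL        # feature selection: drop the raw variables if now redundant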
Common strategies for reducing the dimensionality of a categorical predictor (3; 3.1.4)
1) Combining sparse categories with others - categories with very few observations should be folded into more populous categories in which the target variable exhibits a similar behavior
2) Combining similar categories - if the target variable behaves similarly in two categories of a predictor, these categories can be consolidated without losing much information
3) Using the prior knowledge of a categorical variable - e.g., reducing day of week variable into weekday/weekend
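Example (base-R sketches of strategies 1 and 3; the factors occ and day are hypothetical, and folding sparse levels into a catch-all "Other" is one common variant of strategy 1):
levels(dat$occ)[table(dat$occ) < 30] <- "Other"    # merge levels with fewer than 30 observations
levels(dat$day) <- ifelse(levels(dat$day) %in% c("Sat", "Sun"), "weekend", "weekday")  # prior knowledge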
Differences between granularity and dimensionality (2; 3.1.4)
1) Applicability - dimensionality is a concept specific to categorical variables, while granularity applies to both numerical and categorical variables
2) Comparability - Categorical variables can always be ordered by dimension (# of levels), but variables can’t always be ordered by granularity
Linear Model Formulation (2; 3.2.1)
1) Model equation: Y = B0 + B1X1 + B2X2 + … + BpXp + E
Y = target variable
X1 - Xp = p predictors
B0 = intercept, EV of Y when all predictors are zero
B1 - Bp = unknown regression coefficients
E = unobservable random error term, assumed to follow a normal distribution with zero mean and a common variance
2) Model fitting - the most popular approach is to estimate the regression coefficients by ordinary least squares (OLS): select the Bj’s to minimize the sum of the squared differences between the observed target values and the fitted values.
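Example (fitting by OLS in R; train, y, x1, and x2 are assumed names, and lm() fits by least squares by default):
fit <- lm(y ~ x1 + x2, data = train)   # estimates B0, B1, B2
coef(fit)                              # the fitted coefficients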
Linear model goodness-of-fit measures (2; 3.2.2)
1) Residual sum of squares (RSS) - Sum of squares of residuals (from the training set); residual = difference between observed target and fitted value.
a) Absolute goodness-of-fit measure with no upper bound
b) The smaller the RSS, the better the fit of the linear model to the training set
2) Coefficient of determination (R^2)
a) R^2 = 1 - RSS/TSS
b) Total sum of squares (TSS) = sum of squared differences between each observation’s target value and the mean of the target values
c) Range of 0 to 1. The higher the R^2, the better the fit of the model to the training set
On the training set, adding predictors never increases RSS or decreases R^2, so both will always favor more complex models
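Example (computing both measures for the fit object above):
rss <- sum(residuals(fit)^2)               # residual sum of squares
tss <- sum((train$y - mean(train$y))^2)    # total sum of squares
1 - rss / tss                              # R^2; matches summary(fit)$r.squared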
Traditional model selection methods: Hypothesis testing (2; 3.2.2)
1) t-test - the t-statistic for a particular predictor is the ratio of its OLS estimate to the estimated standard deviation (or standard error) of that estimate
a) Measure of the effect of adding the predictor to the model after accounting for the effects of other variables
b) The larger the t-statistic in absolute value, the stronger the linear association between the predictor and the target variable
2) F-test - assesses the joint significance of the entire set of predictors: the null hypothesis is that all slope coefficients are zero, against the alternative that at least one is non-zero
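Example (in R, both tests appear in the summary of a fitted lm object):
summary(fit)   # per-coefficient t-statistics and p-values; the overall F-statistic on the last line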
General model selection measures (2; 3.2.2)
1) Akaike Information Criterion (AIC) - defined as -2l + 2*(p+1)
l = maximized loglikelihood of the linear model on the training set
p = number of predictors
a) Goodness of fit to the training data is measured by -2l (the higher the l, and hence the lower the -2l, the better the fit)
b) Complexity measured by 2(p+1) - the more parameters, the more complex
c) Goal is to minimize the AIC
2) Bayesian Information Criterion (BIC) - defined as -2l + ln(size of training dataset) * (p+1)
a) Different metric for complexity, penalty for parameters is higher in BIC
b) Same goal, to minimize BIC
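Example (R computes both directly from a fitted lm object; note that R’s log-likelihood also counts the error variance as a parameter, so values can differ from the formulas above by a constant that does not affect model rankings):
AIC(fit)   # smaller is better
BIC(fit)   # penalizes extra parameters more heavily once ln(n) > 2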
Properties of a well-defined linear model (3; 3.2.2)
1) No special patterns - the residuals should cluster around zero in a random fashion, both on their own and when plotted against the fitted values
2) Homoscedasticity (constant variance) - The residuals should possess approximately the same variance
3) Normality - The residuals should be approximately normally distributed
Linear model plots and interpretation (2; 3.2.2)
1) Residuals vs Fitted plot - plots the residuals of the model against fitted values (with a smooth curve superimposed)
a) Residuals should display no prominent patterns and spread symmetrically in either direction
b) Systematic patterns (e.g., a U shape, suggesting a missing quadratic term in the model) and non-uniform spread in the residuals (e.g., a funnel shape, implying variance that increases or decreases with the fitted values) are symptomatic of an inadequate model specification and of heteroscedasticity (non-constant residual variance), respectively
c) Variance-stabilizing transformations can be applied, such as the log transformation (requires all target values to be positive; a constant can be added first to ensure this) or the square root transformation (works well if the target variable is non-negative)
2) Normal Q-Q plot - graphs the empirical quantiles of the standardized residuals (residuals divided by their standard error) against the theoretical standard normal quantiles. Can be used for checking the normality of the random errors.
a) Points on the plot are expected to lie closely on the 45 degree line passing through the origin if residuals are normally distributed
b) Systematic departures from that line suggest that the normality assumption is not fulfilled; e.g., if both ends of the plot drift away from the line, a heavier-tailed distribution for the target variable may be warranted.
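Example (in R, the first two default diagnostic plots of an lm object are exactly these):
plot(fit, which = 1)   # Residuals vs Fitted, with smoothed curve
plot(fit, which = 2)   # Normal Q-Q of the standardized residuals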