Statistical Learning Flashcards
Technical Definition: Statistical Learning
A set of approaches for estimating the relationship f between predictors (X) and an output (Y) using data.
Synonyms for X (input)
- Predictors
- Independent variables
- Features
- Variables
Synonyms for Y (output)
- Response
- Dependent variable
General Form for Y
Y = f(X) + ε, where f is a fixed but unknown function of the predictors and ε is a random error term that is independent of X and has mean zero.
What is systematic information in the context of Y = f(X) + ε?
The portion of Y explained by f(X), i.e., the non-random component driven by the predictors.
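A minimal simulation sketch of Y = f(X) + ε, using only the Python standard library. The linear f, the noise level, and the names simulate, noise_sd, and seed are assumptions made for illustration; in a real problem f is unknown and only X and Y are observed.

```python
import random
import statistics


def f(x):
    # Hypothetical "true" systematic component. Unknown in practice.
    return 2.0 + 3.0 * x


def simulate(n, noise_sd=1.0, seed=0):
    """Draw n observations from Y = f(X) + eps with mean-zero eps."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    # eps is drawn independently of X, with mean zero.
    eps = [rng.gauss(0.0, noise_sd) for _ in range(n)]
    ys = [f(x) + e for x, e in zip(xs, eps)]
    return xs, eps, ys


xs, eps, ys = simulate(10_000)

# The sample mean of eps should be close to zero.
mean_eps = statistics.fmean(eps)

# Subtracting the systematic part f(X) from Y leaves only the random error.
residuals = [y - f(x) for x, y in zip(xs, ys)]

print(f"mean of eps: {mean_eps:.4f}")
```

Here f(x) is the systematic information and residuals recovers ε exactly, which is only possible because the simulation knows the true f.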
Why Estimate f?
Prediction and inference
Why Estimate f?
What does prediction focus on?
Obtaining an accurate Ŷ for new observations.
Why Estimate f?
What does inference aim to achieve?
Understand how each predictor impacts Y.
Definition: Reducible Error
Error introduced because our estimate of f, f_hat, is not perfect. It can potentially be reduced by improving the model.
Equation for Reducible Error
[f(X) - f_hat(X)]^2
Definition: Irreducible Error
Error that cannot be eliminated even if f were estimated perfectly, because Y also depends on ε, which cannot be predicted from X.
Equation for Irreducible Error
Var(ε)
Why is the irreducible error larger than zero?
ε may contain unmeasured variables that are useful in predicting Y, and it may also contain inherent, unmeasurable variation in Y.
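The two error cards can be checked numerically. The sketch below holds X fixed at a single point x0 and compares the average squared prediction error with [f(X) - f_hat(X)]^2 + Var(ε). The specific f, f_hat, x0, and noise_sd are hypothetical values chosen for illustration only.

```python
import random
import statistics


def f(x):
    # Hypothetical true function (unknown in practice).
    return 2.0 + 3.0 * x


def f_hat(x):
    # Hypothetical imperfect estimate of f.
    return 2.5 + 2.8 * x


x0 = 4.0          # fixed predictor value
noise_sd = 1.5    # standard deviation of eps, so Var(eps) = noise_sd ** 2

rng = random.Random(1)
# Many draws of Y at the same x0: Y = f(x0) + eps.
ys = [f(x0) + rng.gauss(0.0, noise_sd) for _ in range(200_000)]

# Average squared prediction error when predicting Y with f_hat(x0).
mse = statistics.fmean((y - f_hat(x0)) ** 2 for y in ys)

reducible = (f(x0) - f_hat(x0)) ** 2   # shrinks as f_hat approaches f
irreducible = noise_sd ** 2            # Var(eps), unaffected by f_hat

print(f"simulated MSE:            {mse:.3f}")
print(f"reducible + irreducible:  {reducible + irreducible:.3f}")
```

Replacing f_hat with f itself would drive the reducible term to zero, while the simulated error would stay near Var(ε).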
What are some example questions one may be interested in answering in the case of inference?
3 questions
- Which predictors are associated with the response?
- What is the relationship between the response and each predictor?
- Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?
Questions Related to Inference
Which predictors are associated with the response?
Please explain.
Identifying the few important predictors among a large set of possible variables.
Questions Related to Inference
What is the relationship between the response and each predictor?
What is the overall goal?
Evaluating each predictor’s effect on Y.
Questions Related to Inference
What is the relationship between the response and each predictor?
What are some examples of the types of relationships?
- Positive
- Negative
- More complex (e.g., it may depend on other variables via interactions)
Questions Related to Inference
Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?
Please explain.
2 points
- In some situations, assuming a linear relationship is reasonable or even desirable.
- However, the true relationship can be non-linear or more complex, in which case a linear model may not provide an accurate representation of the relationship between the input and output variables.
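The two points above can be illustrated with a small sketch: the same straight-line fit is applied to data whose true relationship is linear and to data whose true relationship is a sine curve. Both data sets are synthetic, and the helper names fit_line and mse_of_line are made up for this example.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for a single predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mse_of_line(xs, ys):
    """Training mean squared error of the least-squares line."""
    b0, b1 = fit_line(xs, ys)
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(2)
xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(2000)]
noise = [rng.gauss(0.0, 0.1) for _ in xs]

# Case 1: the true relationship is linear, so a line is a reasonable summary.
y_linear = [1.0 + 0.5 * x + e for x, e in zip(xs, noise)]
# Case 2: the true relationship is non-linear, so a line misses the shape.
y_curved = [math.sin(x) + e for x, e in zip(xs, noise)]

mse_linear = mse_of_line(xs, y_linear)
mse_curved = mse_of_line(xs, y_curved)

print(f"line fit, linear truth: MSE = {mse_linear:.3f}")
print(f"line fit, curved truth: MSE = {mse_curved:.3f}")
```

With the same noise in both cases, the much larger error in the second case comes from the linear form itself, not from the data being noisier.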
Is this a prediction or inference problem?
Consider a company that is interested in conducting a direct-marketing campaign. The goal is to identify individuals who will respond positively to a mailing, based on observations of demographic variables measured on each individual. In this case, the demographic variables serve as predictors, and response to the marketing campaign (either positive or negative) serves as the outcome.
Prediction
Prediction problem
Consider a company that is interested in conducting a direct-marketing campaign. The goal is to identify individuals who will respond positively to a mailing, based on observations of demographic variables measured on each individual. In this case, the demographic variables serve as predictors, and response to the marketing campaign (either positive or negative) serves as the outcome.
Why is this a prediction problem?
The company wants an accurate model that predicts the response from the predictors.
Prediction problem
Consider a company that is interested in conducting a direct-marketing campaign. The goal is to identify individuals who will respond positively to a mailing, based on observations of demographic variables measured on each individual. In this case, the demographic variables serve as predictors, and response to the marketing campaign (either positive or negative) serves as the outcome.
Why is this not an inference problem?
The company is not interested in obtaining a deep understanding of the relationship between each individual predictor and the response.
Is this a prediction or inference problem?
Inference
Inference problem
Why is this an inference problem?
3 points
The company may want to understand:
- Which media contribute to sales
- Which media generate the biggest boost in sales
- How much increase in sales is associated with a given increase in TV advertising
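The last point (how much sales change for a given change in TV advertising) is an inference question because the quantity of interest is a fitted coefficient, not a prediction. A minimal sketch follows, using synthetic data only. The variable names tv and sales, the units, and true_slope are assumptions for illustration and are not taken from any real data set.

```python
import random

rng = random.Random(3)

# Synthetic data: the "true" slope is an assumption of this sketch.
true_slope = 0.05
tv = [rng.uniform(0.0, 300.0) for _ in range(500)]            # hypothetical TV budget
sales = [7.0 + true_slope * t + rng.gauss(0.0, 1.0) for t in tv]

# Least-squares fit of sales on tv.
n = len(tv)
mean_tv = sum(tv) / n
mean_sales = sum(sales) / n
sxx = sum((t - mean_tv) ** 2 for t in tv)
sxy = sum((t - mean_tv) * (s - mean_sales) for t, s in zip(tv, sales))
slope = sxy / sxx
intercept = mean_sales - slope * mean_tv

# Inference reads the coefficient itself: the estimated change in sales
# associated with a one-unit increase in TV advertising.
print(f"estimated slope: {slope:.4f}")
print(f"estimated change in sales per 100 extra units of TV: {100 * slope:.2f}")
```

A prediction task would only care that intercept + slope * tv is close to sales. Here the focus is on the size and sign of slope.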
Is this a prediction or inference problem?
Modeling the brand of a product that a customer might purchase based on variables such as price, store location, discount levels, competition price, and so forth. In this situation one might really be most interested in how each of the individual variables affects the probability of purchase.
Inference