10 Statistical Techniques Flashcards
(39 cards)
What is a Random forest?
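- very similar to bagging: each tree is grown on a bootstrap sample of your training set.
- unlike bagging, each split considers only a random subset of the features, which decorrelates the trees.
- this also makes each tree faster to grow.
- in bagging, every tree can use the full set of features at every split.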
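A minimal sketch of the difference in feature sampling. The function name `candidate_features` and the size rule (about the square root of the number of features, a common default for classification) are assumptions for illustration, not something the card specifies.

```python
import math
import random

def candidate_features(n_features, method, rng):
    """Feature indices a tree may consider at one split.

    Bagging considers every feature; a random forest draws a fresh
    random subset (here of size about sqrt(n_features)).
    """
    if method == "bagging":
        return list(range(n_features))
    m = max(1, round(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), m))

rng = random.Random(0)
all_features = candidate_features(16, "bagging", rng)        # all 16 indices
some_features = candidate_features(16, "random_forest", rng)  # 4 of the 16 indices
```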
“A data scientist is a person who is better at statistics than any ______ and better at programming than any ______.”
programmer, statistician
The two best-known techniques for shrinking the coefficient estimates towards zero
ridge regression and the lasso
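For reference, the two penalized least-squares objectives can be written as below. The symbols (n observations, p predictors, tuning parameter λ ≥ 0) follow the usual textbook notation and are assumed here, since the card does not define them.

```latex
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}

\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\}
```

Larger λ means more shrinkage. The lasso penalty can set some coefficients exactly to zero, whereas ridge only shrinks them towards zero.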
What is Bootstrapping in Resampling?
Sampling with replacement from the original data and using the “not chosen” data points as test cases. We can repeat this several times and take the average score as an estimate of our model's performance.
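The procedure can be sketched in plain Python as follows. The callables `fit` and `score` are hypothetical placeholders for whatever model and metric you use; only the resampling logic is the point here.

```python
import random

def bootstrap_split(n, rng):
    """Return (in_bag, out_of_bag) index lists for one bootstrap resample of n points."""
    in_bag = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]  # never-drawn points are the test cases
    return in_bag, out_of_bag

def bootstrap_score(data, fit, score, n_rounds=100, seed=0):
    """Average a score over several bootstrap rounds.

    `fit(train)` returns a model and `score(model, test)` returns a number;
    both are hypothetical callables supplied by the caller.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        in_bag, oob = bootstrap_split(len(data), rng)
        if not oob:  # nothing left to test on in this round, so skip it
            continue
        model = fit([data[i] for i in in_bag])
        scores.append(score(model, [data[i] for i in oob]))
    if not scores:
        raise ValueError("no round produced any out-of-bag points")
    return sum(scores) / len(scores)
```

On average roughly a third of the points are left out of each resample, so every round has a reasonably sized test set.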
Techniques for non-linear models
- step function
- piecewise function
- spline
- generalized additive model
What is Classification?
a data mining technique that assigns categories to a collection of data in order to aid in more accurate predictions and analysis.
Do body weight, calorie intake, fat intake, and participant age have an influence on heart attacks (Yes vs No)?
Logistic Regression
Types of Dimensionality reduction?
1) Principal Components Regression (unsupervised)
2) Partial least squares (PLS) (supervised)
How does Discriminant Analysis work?
It uses Bayes’ theorem to estimate the probability that a new observation belongs to each class.
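Written out, the Bayes' theorem step looks like this. The notation is assumed, not taken from the card: π_k is the prior probability of class k, f_k(x) is the density of the features within class k, and K is the number of classes.

```latex
\Pr(Y = k \mid X = x) = \frac{\pi_k \, f_k(x)}{\sum_{l=1}^{K} \pi_l \, f_l(x)}
```

A new observation is assigned to the class with the largest posterior probability. Linear discriminant analysis models each f_k as Gaussian with a covariance matrix shared across classes; quadratic discriminant analysis allows a separate covariance matrix per class.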
Three Tree-Based Methods
1) Bagging
2) Boosting
3) Random forest
Why study Statistical Learning?
It is important to understand the ideas behind the various techniques, in order to know how and when to use them.
What is the advantage of Boosting?
Models are fitted sequentially, and each new model gives more weight to the observations the earlier ones predicted poorly. By varying the weighting formula and combining many narrowly tuned models, you can get good predictive power over a wider range of input data.
Types of Subset Selection?
1) Best-Subset Selection
2) Forward Stepwise Selection
3) Backward Stepwise Selection
4) Hybrid Methods
What is a Support Vector Machine (SVM)?
A classifier that finds the hyperplane (an (n-1)-dimensional subspace of an n-dimensional space) that best separates two classes of points with the maximum margin.
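To make "margin" concrete, here is a small sketch that measures it for a given hyperplane w·x + b = 0. It only evaluates candidate hyperplanes; it does not solve the optimization that an SVM performs. The function names and the toy data are illustrative assumptions.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin(w, b, points, labels):
    """Smallest label-adjusted distance; positive only if every point is on its correct side."""
    return min(y * signed_distance(w, b, x) for x, y in zip(points, labels))

# Toy data: class -1 sits at x = 0, class +1 sits at x = 3.
points = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
labels = [-1, -1, 1, 1]

centered = margin((1.0, 0.0), -1.5, points, labels)  # hyperplane x = 1.5
shifted = margin((1.0, 0.0), -1.0, points, labels)   # hyperplane x = 1.0
```

Both hyperplanes separate the classes, but the centered one has the larger margin, which is the one a maximum-margin classifier prefers.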
What is Principal Component Analysis (PCA)?
Producing a low-dimensional representation of the dataset by identifying a set of linear combinations of features which have maximum variance and are mutually uncorrelated.
* Useful for understanding latent interactions between the variables in an unsupervised setting
How to solve “How does the probability of getting lung cancer (Yes vs No) change for every additional pound of overweight and for every pack of cigarettes smoked per day?”
Logistic Regression
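A sketch of the model behind this answer, with assumed symbols: p is the probability of lung cancer, x_1 the pounds overweight, and x_2 the packs of cigarettes smoked per day.

```latex
\log \frac{p}{1 - p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2
\qquad \Longleftrightarrow \qquad
p = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}
```

Holding x_2 fixed, each additional pound multiplies the odds p / (1 - p) by e^{β_1}; likewise each additional pack per day multiplies the odds by e^{β_2}.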
What is Bagging?
A way to decrease the variance of your prediction by generating additional training data from your original dataset: sample with replacement (combinations with repetitions) to produce multisets of the same cardinality/size as your original data, fit a model on each, and average the predictions.
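A minimal sketch of the idea for regression. The base learner `fit_nearest_neighbor` is a deliberately simple, high-variance stand-in chosen for illustration; any `fit` callable that returns a predictor would do.

```python
import random

def bagged_predict(train, x, fit, n_models=50, seed=0):
    """Average the predictions of base models, each fit on a bootstrap resample of `train`.

    `fit(sample)` must return a callable model; it is a hypothetical
    base learner supplied by the caller.
    """
    rng = random.Random(seed)
    n = len(train)
    predictions = []
    for _ in range(n_models):
        # Same size as the original data, drawn with replacement.
        sample = [train[rng.randrange(n)] for _ in range(n)]
        model = fit(sample)
        predictions.append(model(x))
    return sum(predictions) / len(predictions)

def fit_nearest_neighbor(sample):
    """Toy base learner: predict the y of the closest x in the sample."""
    def predict(x):
        return min(sample, key=lambda pair: abs(pair[0] - x))[1]
    return predict

train = [(i, 2 * i) for i in range(10)]  # (x, y) pairs
prediction = bagged_predict(train, 4.2, fit_nearest_neighbor)
```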
What is Best-Subset Selection?
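Fit a model for every possible combination of predictors. For each model size, keep the one with the lowest RSS (equivalently, the highest R²); then choose among the sizes using an estimate of test error such as cross-validation.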
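The enumeration step can be sketched as below. The callable `rss` is a hypothetical placeholder that fits a model on the given features and returns its residual sum of squares; the toy version in the usage lines is made up purely to show the call pattern.

```python
from itertools import combinations

def best_subset_per_size(features, rss):
    """For each subset size k, return the feature subset with the lowest training RSS.

    `rss(subset)` is a hypothetical callable that fits a model on `subset`
    (a tuple of feature names) and returns its residual sum of squares.
    """
    best = {}
    for k in range(len(features) + 1):
        best[k] = min(combinations(features, k), key=rss)
    return best

# Toy stand-in for a real fit: each feature removes a fixed amount of RSS.
reduction = {"a": 5.0, "b": 3.0, "c": 0.1}
toy_rss = lambda subset: 10.0 - sum(reduction[f] for f in subset)
best = best_subset_per_size(["a", "b", "c"], toy_rss)
```

Training RSS always falls as features are added, so the final choice among sizes should rest on a test-error estimate (cross-validation, or a criterion such as adjusted R², AIC or BIC), not on RSS itself. The search visits 2^p subsets, which is why stepwise methods are used when p is large.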
Types of Discriminant Analysis
Linear and Quadratic.
Which technique answers: how are monthly income and trips per month correlated with monthly spending?
Multiple Linear Regression
What is Dimensionality reduction?
Dimension reduction reduces the problem of estimating p + 1 coefficients to the simpler problem of estimating M + 1 coefficients, where M < p.
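In symbols, under the usual textbook notation (assumed here): the p original predictors X_1, ..., X_p are first combined into M new variables Z_1, ..., Z_M, and the regression is then fitted on those.

```latex
Z_m = \sum_{j=1}^{p} \phi_{jm} X_j, \qquad m = 1, \dots, M

y_i = \theta_0 + \sum_{m=1}^{M} \theta_m z_{im} + \epsilon_i, \qquad i = 1, \dots, n
```

Only θ_0, θ_1, ..., θ_M are estimated by least squares, which gives the M + 1 coefficients. The methods differ in how the weights φ_jm are chosen: principal components regression picks them without looking at the response, partial least squares uses the response.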
2 major Classification techniques
1) Logistic Regression
2) Discriminant Analysis.
Types of Linear Regression.
1) Simple Linear Regression
2) Multiple Linear Regression
What are the main Unsupervised Learning techniques?
1) Principal Component Analysis
2) k-Means clustering
3) Hierarchical clustering
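As one example from this list, k-means can be sketched with Lloyd's algorithm in plain Python. The initialisation (k randomly chosen data points) and the iteration cap are assumptions of this sketch, not part of the card.

```python
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise with k of the data points
    assignment = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
            for point in points
        ]
        if new_assignment == assignment:  # no change means convergence
            break
        assignment = new_assignment
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids

points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
assignment, centroids = kmeans(points, k=2)
```

The result depends on the starting centroids, so in practice the algorithm is usually run from several random starts and the best clustering is kept.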