10 Statistical Techniques Flashcards by Ibrahim Abualhaol

What is a Random forest?

- very similar to bagging.
- each tree is built on a bootstrap sample of your training set.
- faster, because each split considers only a random subset of the features; this also decorrelates the trees.
- in bagging, every split can use the full set of features.

“A data scientist is a person who is better at statistics than any ____ and better at programming than any ____.”

programmer, statistician

The two best-known techniques for shrinking the coefficient estimates towards zero

ridge regression and the lasso
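
As a reminder of what is being shrunk, both methods add a penalty to the least squares objective. The notation here (responses y_i, predictors x_{ij}, coefficients \beta_j, and a tuning parameter \lambda \ge 0) is assumed for illustration and is not given on the card:

```latex
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}
  \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2
  + \lambda \sum_{j=1}^{p} \beta_j^2

\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}
  \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2
  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

The only difference is the penalty term: squared coefficients for ridge, absolute values for the lasso.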

What is Bootstrapping in Resampling?

Sampling with replacement from the original data, and taking the “not chosen” (out-of-bag) data points as test cases. We can repeat this several times and use the average score as an estimate of our model performance.
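
A minimal sketch of that procedure, using only the Python standard library. The helper name `bootstrap_oob_scores`, the toy "predict the training mean" model, and the data values are all made up for illustration:

```python
import random
import statistics

def bootstrap_oob_scores(data, fit, score, n_rounds=200, seed=0):
    """Return one out-of-bag score per bootstrap round."""
    rng = random.Random(seed)
    n = len(data)
    scores = []
    for _ in range(n_rounds):
        # Draw n indices with replacement: this is the bootstrap sample.
        drawn = [rng.randrange(n) for _ in range(n)]
        chosen = set(drawn)
        train = [data[i] for i in drawn]
        # Points never drawn are the "not chosen" (out-of-bag) test cases.
        test = [data[i] for i in range(n) if i not in chosen]
        if not test:  # every point happened to be drawn; skip this round
            continue
        scores.append(score(fit(train), test))
    return scores

# Toy model (an assumption): predict the training mean, score by MSE.
def fit_mean(train):
    return statistics.fmean(train)

def mse(prediction, test):
    return statistics.fmean((y - prediction) ** 2 for y in test)

data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 10.0, 12.0]
scores = bootstrap_oob_scores(data, fit_mean, mse)
estimate = statistics.fmean(scores)
print(f"average out-of-bag MSE over {len(scores)} rounds: {estimate:.2f}")
```

Any model can be plugged in through `fit` and `score`; the resampling loop stays the same.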

Techniques for nonlinear models

- step function
- piecewise function
- spline
- generalized additive model

What is Classification?

a data mining technique that assigns categories to a collection of data in order to aid in more accurate predictions and analysis.

Do body weight, calorie intake, fat intake, and participant age have an influence on heart attacks (Yes vs No)?

Logistic Regression

Types of Dimensionality reduction?

1) Principal Components Regression (unsupervised)

2) Partial least squares (PLS) (supervised)

How does Discriminant Analysis work?

It models the distribution of the predictors within each class, then uses Bayes’ theorem to estimate the probability that a new observation belongs to each class.
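
Written out, the Bayes’ theorem step looks like this. The symbols are assumptions for illustration: \pi_k is the prior probability of class k, f_k(x) is the density of the predictors within class k, and K is the number of classes:

```latex
\Pr(Y = k \mid X = x) = \frac{\pi_k \, f_k(x)}{\sum_{l=1}^{K} \pi_l \, f_l(x)}
```

The new observation is assigned to the class with the largest posterior probability.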

Three Tree-Based Methods

1) Bagging
2) Boosting
3) Random forest

Why study Statistical Learning?

It is important to understand the ideas behind the various techniques, in order to know how and when to use them.

What is the advantage of Boosting?

By varying the weighting formula, you can combine the strengths of different narrowly tuned models while offsetting their pitfalls, which gives good predictive power over a wider range of input data.

Types of Subset Selection?

1) Best-Subset Selection
2) Forward Stepwise Selection
3) Backward Stepwise Selection
4) Hybrid Methods

What is a Support Vector Machine (SVM)?

A classifier that finds the hyperplane (an (n-1)-dimensional subspace of an n-dimensional space) that best separates two classes of points with the maximum margin.

What is Principal Component Analysis (PCA)?

* Producing a low-dimensional representation of the dataset by identifying a set of linear combinations of the features that have maximum variance and are mutually uncorrelated.
* Understanding latent interactions between the variables in an unsupervised setting.

How to solve “How does the probability of getting lung cancer (Yes vs No) change for every additional pound of overweight and for every pack of cigarettes smoked per day?”

Logistic Regression
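
A small sketch of why logistic regression answers the "for every additional ..." question: each coefficient is a change in log-odds, so one extra unit multiplies the odds by exp(coefficient). The intercept and coefficients here are hypothetical values, not estimates from real data:

```python
import math

def predicted_probability(intercept, coefs, x):
    """P(Y = Yes | x) under a logistic regression model."""
    log_odds = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-log_odds))

def odds(p):
    return p / (1.0 - p)

# Hypothetical coefficients (assumptions, not fitted values):
intercept = -4.0
coefs = [0.02, 0.9]  # per pound overweight, per pack smoked per day

p_base = predicted_probability(intercept, coefs, [10, 1])
p_more = predicted_probability(intercept, coefs, [10, 2])  # one extra pack

# One extra pack multiplies the odds by exp(0.9), whatever the baseline.
odds_ratio = odds(p_more) / odds(p_base)
print(f"P(base) = {p_base:.3f}, P(+1 pack) = {p_more:.3f}, "
      f"odds ratio = {odds_ratio:.3f}")
```

The probability itself does not change by a fixed amount per unit; only the odds ratio is constant.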

What is Bagging?

A way to decrease the variance of your prediction by generating additional training data from your original dataset, using combinations with repetitions to produce multisets of the same cardinality/size as your original data.
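
A rough sketch of the idea, using the standard library only. The base learner (a 1-nearest-neighbour regressor), the helper names, and the data are assumptions chosen to keep the example short; any high-variance model could take its place:

```python
import random
import statistics

def fit_1nn(train):
    """High-variance base learner: return the y of the nearest training x."""
    def predict(x):
        return min(train, key=lambda pair: abs(pair[0] - x))[1]
    return predict

def bagged_predictor(data, fit, n_models=50, seed=0):
    rng = random.Random(seed)
    n = len(data)
    models = []
    for _ in range(n_models):
        # Bootstrap sample: same size as the data, drawn with repetition.
        sample = rng.choices(data, k=n)
        models.append(fit(sample))

    def predict(x):
        # Averaging over the resampled models is what reduces the variance.
        return statistics.fmean(m(x) for m in models)
    return predict

# Made-up (x, y) pairs for illustration.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
single = fit_1nn(data)
bagged = bagged_predictor(data, fit_1nn)
print("single model at x=2.6:", single(2.6))
print("bagged model at x=2.6:", round(bagged(2.6), 2))
```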

What is Best-Subset Selection?

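Fit a separate least squares model for every possible combination of the p predictors. For each model size, keep the model with the lowest RSS (equivalently, the highest R²), then choose among these using an estimate of test error such as cross-validation, Cp, AIC, BIC, or adjusted R².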

Types of Discriminant Analysis

Linear and Quadratic.

Which technique solves: How are monthly income and trips per month correlated with monthly spending?

Multiple Linear Regression

What is Dimensionality reduction?

Dimension reduction reduces the problem of estimating p + 1 coefficients to the simpler problem of estimating M + 1 coefficients, where M < p.
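
A sketch of what that reduction looks like, in assumed notation: each new predictor Z_m is a linear combination of the original predictors X_1, ..., X_p with weights \phi_{jm}, and the regression is then fit on Z_1, ..., Z_M with coefficients \theta_0, ..., \theta_M:

```latex
Z_m = \sum_{j=1}^{p} \phi_{jm} X_j, \qquad m = 1, \dots, M

y_i = \theta_0 + \sum_{m=1}^{M} \theta_m z_{im} + \epsilon_i, \qquad i = 1, \dots, n
```

The M + 1 coefficients are \theta_0, ..., \theta_M; methods differ only in how the weights \phi_{jm} are chosen.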

2 major Classification techniques

1) Logistic Regression

2) Discriminant Analysis.

Types of Linear Regression.

1) Simple Linear Regression

2) Multiple Linear Regression

What is Unsupervised Learning?

1) Principal Component Analysis
2) k-Means clustering
3) Hierarchical clustering

What is cross validation?

A technique for validating model performance. We split the training data into k equal parts, take k - 1 parts as our training set, and use the “held out” part as our test set. We repeat this k times, holding out a different part each time. Finally, we take the average of the k scores as our performance estimate.
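
A minimal k-fold sketch with the standard library. The function name `k_fold_scores`, the toy "predict the training mean" model, and the data are assumptions for illustration:

```python
import random
import statistics

def k_fold_scores(data, k, fit, score, seed=0):
    """Return one score per fold, each computed on the held-out part."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [data[i] for i in indices if i not in held]  # k - 1 parts
        test = [data[i] for i in held_out]                   # held-out part
        scores.append(score(fit(train), test))
    return scores

# Toy model (an assumption): predict the training mean, score by MSE.
def fit_mean(train):
    return statistics.fmean(train)

def mse(prediction, test):
    return statistics.fmean((y - prediction) ** 2 for y in test)

data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 10.0, 12.0, 13.0, 15.0]
scores = k_fold_scores(data, k=5, fit=fit_mean, score=mse)
cv_estimate = statistics.fmean(scores)
print(f"{len(scores)}-fold CV estimate of MSE: {cv_estimate:.2f}")
```

Every observation is used for testing exactly once and for training k - 1 times.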

What is Shrinkage (regularization)?

An approach that fits a model involving all p predictors, but with the estimated coefficients shrunken towards zero relative to the least squares estimates.

What is Subset Selection?

An approach that identifies a subset of the p predictors that we believe to be related to the response. We then fit a model using least squares on the reduced set of features.

What are Nonlinear Models?

a form of regression analysis in which observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables.

What is Boosting?

An approach that calculates the output using several different models and then combines the results using a weighted average.

Differences between statistical learning and machine learning

* ML is a subfield of AI.
* Statistical learning is a subfield of Statistics.
* ML places emphasis on large-scale applications and prediction accuracy.
* Statistical learning emphasizes models and their interpretability, precision, and uncertainty.

10 Statistical Techniques

1) Linear Regression
2) Classification
3) Resampling Methods
4) Subset Selection
5) Shrinkage
6) Dimension Reduction
7) Nonlinear Models
8) Tree-Based Methods
9) Support Vector Machines
10) Unsupervised Learning

What is Linear Regression?

A method to predict a target variable by fitting the best linear relationship between the dependent and independent variables.
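
For the one-predictor case, "best" usually means least squares, which has a closed-form solution. A short sketch with made-up data (the function name is hypothetical):

```python
import statistics

def fit_simple_linear_regression(xs, ys):
    """Least squares fit of y = intercept + slope * x."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up data for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

intercept, slope = fit_simple_linear_regression(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f} * x")  # y = 1.09 + 1.97 * x
```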

How do Tree-Based Methods work?

- can be used for both regression and classification problems.
- involve segmenting the predictor space into a number of simple regions.

Why can the Lasso be better than Ridge regression?

Lasso regression also performs variable selection: s = 1 gives ordinary least squares (OLS) regression, and as s gets small the coefficients shrink towards zero, with some forced to exactly zero.
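
The variable-selection effect is easiest to see in a textbook special case, assumed here: the design matrix is the identity and there is no intercept, so each coefficient can be shrunk on its own. The sketch uses a penalty weight `lam` rather than the budget s from the card (a larger `lam` plays the role of a smaller s), and the OLS values are made up:

```python
def ridge_shrink(beta_ols, lam):
    """Ridge in the identity-design case: proportional shrinkage."""
    return beta_ols / (1.0 + lam)

def lasso_shrink(beta_ols, lam):
    """Lasso in the identity-design case: soft-thresholding at lam / 2."""
    magnitude = max(abs(beta_ols) - lam / 2.0, 0.0)
    return magnitude if beta_ols >= 0 else -magnitude

ols = [3.0, -1.5, 0.4, -0.2]  # hypothetical least squares estimates
lam = 1.0

ridge = [ridge_shrink(b, lam) for b in ols]
lasso = [lasso_shrink(b, lam) for b in ols]

print("ridge:", ridge)  # every coefficient shrinks, none reaches zero
print("lasso:", lasso)  # small coefficients are set exactly to zero
```

Ridge keeps all four predictors in the model; the lasso drops the two whose least squares estimates are within `lam / 2` of zero.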

Data scientists live at the intersection of

coding, statistics, and critical thinking

What is Hierarchical clustering?

builds a multilevel hierarchy of clusters by creating a cluster tree.

What is k-Means clustering?

partitions data into k distinct clusters based on distance to the centroid of a cluster.
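
One common way to do this is Lloyd's algorithm: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes. The sketch assumes Euclidean distance, random data points as starting centroids, and made-up 2-D data:

```python
import math
import random

def k_means(points, k, n_iters=100, seed=0):
    """Lloyd's algorithm: returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    clusters = []
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = []
        for j, cluster in enumerate(clusters):
            if cluster:
                updated.append(
                    tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
                )
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid
        if updated == centroids:  # converged: assignments can no longer change
            break
        centroids = updated
    return centroids, clusters

# Made-up data: two loose groups of points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (8.0, 8.0), (8.3, 7.9), (7.9, 8.2)]
centroids, clusters = k_means(points, k=2)
for centre, members in zip(centroids, clusters):
    print(centre, "->", members)
```

The result depends on the starting centroids, so in practice the algorithm is often run from several random starts.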

What is Resampling?

A method that consists of drawing repeated samples from the original data sample.

Statistical Learning problems

* Identify the risk factors for prostate cancer.
* Establish the relationship between salary and demographic variables in population survey data.

10 Statistical Techniques Flashcards

(39 cards)