# 10 Statistical Techniques Flashcards

1
Q

What is a Random forest?

A
• Very similar to bagging: each tree is trained on a bootstrap sample of your training set.
• Each tree learns from only a random subset of the features, which decorrelates the trees and makes training faster.
• In bagging, by contrast, you give each tree the full set of features.
2
Q

“A data scientist is a person who is better at statistics than any ______ and better at programming than any ______.”

A

programmer, statistician

3
Q

The two best-known techniques for shrinking the coefficient estimates towards zero

A

ridge regression and the lasso

4
Q

What is bootstrapping in resampling?

A

Sampling with replacement from the original data and taking the “not chosen” data points as test cases. We can repeat this several times and use the average score as an estimate of our model performance.
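
The procedure in this answer can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `bootstrap_split`, `bootstrap_score`, and the toy "mean" model are hypothetical names chosen for the example.

```python
import random

def bootstrap_split(n, rng):
    """Return (in_bag, out_of_bag) index lists for one bootstrap draw."""
    in_bag = [rng.randrange(n) for _ in range(n)]   # sample n indices with replacement
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]  # the "not chosen" points
    return in_bag, out_of_bag

def bootstrap_score(data, fit, score, rounds=200, seed=0):
    """Average out-of-bag score over several bootstrap rounds."""
    rng = random.Random(seed)
    scores = []
    for _ in range(rounds):
        in_bag, oob = bootstrap_split(len(data), rng)
        if not oob:            # rare: every point was drawn, nothing to test on
            continue
        model = fit([data[i] for i in in_bag])
        scores.append(score(model, [data[i] for i in oob]))
    return sum(scores) / len(scores)

# Toy usage: the "model" is just the training mean, scored by mean squared error.
data = [2.0, 4.0, 4.5, 5.0, 7.0, 9.0]
fit_mean = lambda train: sum(train) / len(train)
mse = lambda m, test: sum((y - m) ** 2 for y in test) / len(test)
estimate = bootstrap_score(data, fit_mean, mse)
```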

5
Q

Techniques for nonlinear models

A
• Step functions
• Piecewise functions
• Splines
• Generalized additive models
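
As one concrete case from this list, a step function cuts the range of x into bins and fits a constant (here the mean of y) in each bin. A minimal sketch, assuming hand-picked cutpoints; `fit_step_function` is a hypothetical helper name.

```python
def fit_step_function(xs, ys, cutpoints):
    """Fit a piecewise-constant model: the mean of y within each bin of x."""
    edges = sorted(cutpoints)
    sums = [0.0] * (len(edges) + 1)
    counts = [0] * (len(edges) + 1)

    def bin_of(x):
        # Index of the first cutpoint that x falls below (last bin otherwise).
        for i, c in enumerate(edges):
            if x < c:
                return i
        return len(edges)

    for x, y in zip(xs, ys):
        b = bin_of(x)
        sums[b] += y
        counts[b] += 1
    # Empty bins get None because there is no data to estimate a level from.
    levels = [s / n if n else None for s, n in zip(sums, counts)]
    return lambda x: levels[bin_of(x)]

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 3.0, 3.2, 8.0, 8.4]
predict = fit_step_function(xs, ys, cutpoints=[2.5, 4.5])
```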
6
Q

What is Classification?

A

a data mining technique that assigns categories to a collection of data in order to aid in more accurate predictions and analysis.

7
Q

Which technique answers: do body weight, calorie intake, fat intake, and participant age have an influence on heart attacks (Yes vs No)?

A

Logistic Regression

8
Q

Types of Dimensionality reduction?

A

1) Principal Components Regression (unsupervised)

2) Partial least squares (PLS) (supervised)

9
Q

How does Discriminant Analysis work?

A

It models the distribution of the predictors within each class, then uses Bayes’ theorem to estimate the probability that a new observation belongs to each class.
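
A minimal one-dimensional sketch of the idea, assuming Gaussian class densities with a shared variance (the linear discriminant analysis setting). The function names and the toy data are hypothetical.

```python
import math

def fit_lda_1d(samples):
    """samples: dict mapping class label -> list of 1-D observations."""
    n_total = sum(len(v) for v in samples.values())
    means = {k: sum(v) / len(v) for k, v in samples.items()}
    priors = {k: len(v) / n_total for k, v in samples.items()}
    # LDA assumes one variance shared by all classes (pooled estimate).
    ss = sum((x - means[k]) ** 2 for k, v in samples.items() for x in v)
    var = ss / (n_total - len(samples))
    return means, priors, var

def posterior(x, means, priors, var):
    """Bayes' theorem: P(class | x) is proportional to prior * Gaussian density."""
    def density(mu):
        return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    joint = {k: priors[k] * density(means[k]) for k in means}
    total = sum(joint.values())
    return {k: j / total for k, j in joint.items()}

means, priors, var = fit_lda_1d({"a": [1.0, 1.5, 2.0, 2.5], "b": [4.0, 4.5, 5.0, 5.5]})
post = posterior(2.0, means, priors, var)   # class probabilities for a new point x = 2.0
```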

10
Q

Three Tree-Based Methods

A

1) Bagging
2) Boosting
3) Random forest

11
Q

Why study Statistical Learning?

A

It is important to understand the ideas behind the various techniques, in order to know how and when to use them.

12
Q

What is the advantage of Boosting?

A

By varying the weighting formula, you can combine the strengths and weaknesses of several narrowly tuned models and get good predictive power over a wider range of input data.

13
Q

Types of Subset Selection?

A

1) Best-Subset Selection
2) Forward Stepwise Selection
3) Backward Stepwise Selection
4) Hybrid Methods

14
Q

What is a Support Vector Machine (SVM)?

A

A classifier that finds the hyperplane (an (n-1)-dimensional subspace of an n-dimensional space) that best separates two classes of points with the maximum margin.
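
To make "margin" concrete, the sketch below measures the margin of two hand-picked separating lines on toy 2-D data. It does not fit an SVM; it only shows the quantity an SVM maximises. `geometric_margin` and the data are hypothetical.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the hyperplane w . x + b = 0.

    Labels are +1 or -1; a positive result means every point is on its correct side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

points = [(1.0, 1.0), (2.0, 1.0), (4.0, 4.0), (5.0, 4.0)]
labels = [-1, -1, +1, +1]

# Two hand-picked candidate hyperplanes (lines in 2-D); both separate the classes.
margin_a = geometric_margin((1.0, 0.0), -3.0, points, labels)   # vertical line x = 3
margin_b = geometric_margin((1.0, 1.0), -5.5, points, labels)   # line x + y = 5.5
```

Of these two candidates, the second has the larger margin, so it is the one a maximum-margin criterion would prefer.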

15
Q

What is Principal Component Analysis (PCA)?

A

• Producing a low-dimensional representation of the dataset by identifying a set of linear combinations of features that have maximum variance and are mutually uncorrelated.
• Useful for understanding latent interactions between the variables in an unsupervised setting.
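
For two features, the components can be computed in closed form from the 2x2 covariance matrix. A minimal sketch under that assumption; `pca_2d` and the toy points are hypothetical.

```python
import math

def pca_2d(points):
    """Principal components of 2-D data via the 2x2 sample covariance matrix."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)

    # Closed-form eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    var1, var2 = half_trace + root, half_trace - root

    # Direction of the first component; the second is perpendicular to it.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    pc1 = (math.cos(angle), math.sin(angle))
    pc2 = (-pc1[1], pc1[0])

    # Scores: centred data projected onto each component.
    scores = [((p[0] - mx) * pc1[0] + (p[1] - my) * pc1[1],
               (p[0] - mx) * pc2[0] + (p[1] - my) * pc2[1]) for p in points]
    return (var1, var2), (pc1, pc2), scores

points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
variances, components, scores = pca_2d(points)
```

Keeping only the first score of each point gives the one-dimensional representation with maximum variance.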

16
Q

How to solve “How does the probability of getting lung cancer (Yes vs No) change for every additional pound of overweight and for every pack of cigarettes smoked per day?”

A

Logistic Regression
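
In a logistic model, one extra unit of a predictor multiplies the odds by exp(coefficient). The sketch below checks this numerically; the coefficients are hypothetical values picked for illustration and are not fitted to any data.

```python
import math

def predicted_probability(intercept, coefs, x):
    """Logistic model: P(Y = 1 | x) = 1 / (1 + exp(-(b0 + b . x)))."""
    z = intercept + sum(b * v for b, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    return p / (1.0 - p)

# Hypothetical coefficients, chosen only for illustration (not fitted to data).
b0 = -4.0
b_pounds, b_packs = 0.02, 0.9

p_base = predicted_probability(b0, [b_pounds, b_packs], [10, 1])
p_more = predicted_probability(b0, [b_pounds, b_packs], [10, 2])  # one more pack per day

# One extra unit of a predictor multiplies the odds by exp(coefficient).
odds_ratio = odds(p_more) / odds(p_base)
```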

17
Q

What is Bagging?

A

A way to decrease the variance of your prediction by generating additional training data from your original dataset, using combinations with repetitions to produce multisets of the same cardinality/size as your original data.
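
A minimal sketch of the idea: draw several same-size resamples with repetition, fit one model per resample, and average the predictions. The base learner here is a least squares line, which is an assumption made for brevity (bagging is more commonly paired with trees); all names are hypothetical.

```python
import random

def fit_line(sample):
    """Least squares line through (x, y) pairs; returns a predict function."""
    n = len(sample)
    mx = sum(x for x, _ in sample) / n
    my = sum(y for _, y in sample) / n
    sxx = sum((x - mx) ** 2 for x, _ in sample)
    if sxx == 0:                      # all x equal in this resample: predict the mean
        return lambda x: my
    slope = sum((x - mx) * (y - my) for x, y in sample) / sxx
    return lambda x: my + slope * (x - mx)

def bagged_predictor(data, n_models=50, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Draw a multiset with the same size as the original data (with repetition).
        resample = [rng.choice(data) for _ in data]
        models.append(fit_line(resample))
    # Aggregate: average the predictions of all models.
    return lambda x: sum(m(x) for m in models) / len(models)

data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1), (6, 12.2)]
predict = bagged_predictor(data)
```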

18
Q

What is Best-Subset Selection?

A

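Fit a separate least squares regression for each possible combination of the p predictors, then choose the model with the highest R² and lowest RSS on testing (not training) error estimates.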

19
Q

Types of Discriminant Analysis

A

20
Q

Which technique solves: how are monthly income and trips per month correlated with monthly spending?

A

Multiple Linear Regression

21
Q

What is Dimensionality reduction?

A

Dimension reduction reduces the problem of estimating p + 1 coefficients to the simpler problem of estimating M + 1 coefficients, where M < p.

22
Q

2 major Classification techniques

A

1) Logistic Regression

2) Discriminant Analysis.

23
Q

Types of Linear Regression.

A

1) Simple Linear Regression

2) Multiple Linear Regression

24
Q

What are examples of Unsupervised Learning techniques?

A

1) Principal Component Analysis
2) k-Means clustering
3) Hierarchical clustering

25
Q

What is cross validation?

A

A technique for validating model performance. We split the training data into k parts, take k - 1 parts as our training set, and use the “held out” part as our test set. We repeat that k times, holding out a different part each time. Finally, we take the average of the k scores as our performance estimate.
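
The steps in this answer can be sketched as follows. This is a minimal illustration with contiguous, unshuffled folds (in practice the data is usually shuffled first); `k_fold_indices`, `cross_val_score`, and the toy "mean" model are hypothetical names.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_score(data, fit, score, k=5):
    """Average test score over k train/test splits."""
    folds = k_fold_indices(len(data), k)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [data[i] for i in range(len(data)) if i not in held]  # k - 1 parts
        test = [data[i] for i in held_out]                            # the held-out part
        scores.append(score(fit(train), test))
    return sum(scores) / k

# Toy usage: the "model" is the training mean, scored by mean squared error.
data = [3.0, 5.0, 4.0, 6.0, 5.0, 7.0, 4.0, 6.0, 5.0, 5.0]
fit_mean = lambda train: sum(train) / len(train)
mse = lambda m, test: sum((y - m) ** 2 for y in test) / len(test)
cv_mse = cross_val_score(data, fit_mean, mse, k=5)
```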

26
Q

What is Shrinkage (regularization)?

A

Fits a model involving all p predictors, but the estimated coefficients are shrunken towards zero relative to the least squares estimates.
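
Shrinkage is easy to see with ridge regression on a single centred predictor, where the slope has a closed form. A minimal sketch under that assumption; `ridge_slope`, the penalty values, and the data are hypothetical.

```python
def ridge_slope(xs, ys, lam):
    """Ridge estimate of the slope for one centred predictor (intercept unpenalised).

    Minimises sum((y - b*x)^2) + lam * b^2 on centred data, giving b = Sxy / (Sxx + lam).
    """
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 3.9, 6.1, 8.0, 9.8]

# lam = 0 reproduces ordinary least squares; larger lam shrinks the slope towards zero.
slopes = {lam: ridge_slope(xs, ys, lam) for lam in (0.0, 1.0, 10.0, 100.0)}
```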

27
Q

What is Subset Selection?

A

Identifies a subset of the p predictors that we believe to be related to the response, then fits a model using least squares on that subset of features.

28
Q

What are Nonlinear Models?

A

a form of regression analysis in which observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables.

29
Q

What is Boosting?

A

An approach that calculates the output using several different models and then combines the results using a weighted average.

30
Q

Differences between statistical learning and machine learning

A
• ML is a subfield of AI.
• Statistical learning is a subfield of Statistics.
• ML emphasizes large-scale applications and prediction accuracy.
• Statistical learning emphasizes models and their interpretability, precision, and uncertainty.
31
Q

10 Statistical Techniques

A
1) Linear Regression
2) Classification
3) Resampling Methods
4) Subset Selection
5) Shrinkage
6) Dimension Reduction
7) Nonlinear Models
8) Tree-Based Methods
9) Support Vector Machines
10) Unsupervised Learning
32
Q

What is Linear Regression?

A

A method to predict a target variable by fitting the best linear relationship between the dependent and independent variables.
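
For a single predictor, the "best" line in the least squares sense has a closed form. A minimal sketch; `fit_simple_linear_regression` is a hypothetical helper and the data is constructed to lie exactly on a line.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return intercept, slope

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]          # exactly y = 1 + 2x
intercept, slope = fit_simple_linear_regression(xs, ys)
```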

33
Q

How do Tree-Based Methods work?

A
• They can be used for both regression and classification problems.
• They work by segmenting the predictor space into a number of simple regions.
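
The simplest case is a single split on one predictor, which divides the predictor space into two regions and predicts the mean response in each. A minimal sketch assuming at least two distinct x values; `best_split` is a hypothetical name.

```python
def best_split(xs, ys):
    """Find the threshold on x that minimises the total squared error of two regions."""
    def sse(values):
        if not values:
            return 0.0
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values)

    best = None
    ordered = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct x values.
    for lo, hi in zip(ordered, ordered[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        err = sse(left) + sse(right)
        if best is None or err < best[0]:
            best = (err, t, sum(left) / len(left), sum(right) / len(right))
    return best[1:]   # (threshold, prediction for x < threshold, prediction for x >= threshold)

xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 5.5, 20.0, 21.0, 19.0]
threshold, left_mean, right_mean = best_split(xs, ys)
```

A full tree repeats this search inside each region until a stopping rule is met.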

34
Q

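Why can the lasso be better than ridge regression?

A

Lasso regression also performs variable selection: when s is small enough, it forces some coefficients exactly to zero, whereas ridge regression keeps all p predictors in the model (s = 1 corresponds to ordinary OLS regression).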

35
Q

Data scientists live at the intersection of

A

coding, statistics, and critical thinking

36
Q

What is Hierarchical clustering?

A

builds a multilevel hierarchy of clusters by creating a cluster tree.
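
A minimal sketch of the bottom-up (agglomerative) variant on 1-D points, assuming single linkage as the distance between clusters. The recorded merges are the levels of the cluster tree; `single_linkage` is a hypothetical name.

```python
def single_linkage(points):
    """Agglomerative clustering of 1-D points; returns the list of merges in order."""
    clusters = [[p] for p in points]          # start with every point on its own
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters whose closest members are nearest to each other.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

merges = single_linkage([1.0, 1.2, 5.0, 5.5, 9.0])
```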

37
Q

What is k-Means clustering?

A

partitions data into k distinct clusters based on distance to the centroid of a cluster.
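
A minimal sketch of the standard iteration (Lloyd's algorithm) on 1-D data, assuming the initial centroids are supplied by the caller; `k_means_1d` and the data are hypothetical.

```python
def k_means_1d(values, centroids, max_iter=100):
    """Lloyd's algorithm on 1-D data, starting from the given initial centroids."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:      # assignments are stable: converged
            break
        centroids = updated
    return centroids, clusters

values = [1.0, 1.5, 2.0, 8.0, 9.0, 10.0]
centroids, clusters = k_means_1d(values, centroids=[0.0, 5.0])
```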

38
Q

What is Resampling?

A

A method that consists of drawing repeated samples from the original data sample.

39
Q

Statistical Learning problems

A
• Identify the risk factors for prostate cancer.
• Establish the relationship between salary and demographic variables in population survey data.