Decision Tree Modeling Flashcards
(97 cards)
What is tree-based learning? What does it do and how?
Tree-based learning is a type of
- supervised machine learning
- that performs classification and regression tasks.
- It uses a decision tree as a predictive model to go from observations about an item (represented by the branches) to conclusions about the item's target value (represented by the leaves).
Ensemble Learning
A technique that enables you to use multiple decision trees simultaneously to produce very powerful models.
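A minimal sketch of the ensemble idea: several trees each make a prediction for the same sample, and the class predictions are combined by majority vote. The fruit labels are hypothetical, and this is the aggregation step only, not a full ensemble implementation.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine class predictions from several trees by majority vote."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from three trees for one sample:
votes = ["apple", "orange", "apple"]
print(majority_vote(votes))  # apple
```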
What’s the benefit of hyperparameter tuning?
Knowing how and when to tune a model can significantly increase its performance.
What is a Decision Tree?
- non-parametric supervised learning algorithm (not based on assumptions about distribution)
- for classification and regression tasks
- It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf nodes.
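The hierarchical structure above can be sketched as a nested dictionary: decision nodes hold a test, leaf nodes hold a prediction, and prediction follows branches from the root to a leaf. The fruit features here are made up for illustration; this is not how any library represents trees internally.

```python
# A toy decision tree: decision nodes hold a test, leaf nodes a prediction.
tree = {
    "question": ("diameter", 3.0),        # root node: is diameter < 3.0?
    "yes": {"predict": "mandarin"},       # leaf node
    "no": {                               # internal decision node
        "question": ("color", "yellow"),  # is color == "yellow"?
        "yes": {"predict": "lemon"},      # leaf node
        "no": {"predict": "orange"},      # leaf node
    },
}

def predict(node, sample):
    """Follow branches from the root until a leaf node is reached."""
    while "predict" not in node:
        feature, value = node["question"]
        if isinstance(value, (int, float)):
            node = node["yes"] if sample[feature] < value else node["no"]
        else:
            node = node["yes"] if sample[feature] == value else node["no"]
    return node["predict"]

print(predict(tree, {"diameter": 5.0, "color": "orange"}))  # orange
```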
How do data professionals use decision trees?
to make predictions about future events based on the information that is currently available.
Decision Tree PROs
- require no assumptions on data’s distribution
- handle collinearity easily.
- require little preprocessing to prepare data for training
Decision Tree CONs
- susceptible to overfitting.
- sensitive to variations in the training data.
The model might get extremely good at predicting the data it was trained on, but as soon as new data is introduced, it may not perform nearly as well.
What happens at each node?
A decision is made at each node.
Edges
The edges connect the nodes, directing the flow from one node to the next along the tree.
What is a Root Node?
- It’s the first node in the tree
- all decisions needed to make the prediction will stem from it
- It’s a special type of decision node because it has no predecessors.
What is a Decision Node?
- All the nodes above the leaf nodes.
- The nodes where a decision is made
- They always point to leaf nodes or other decision nodes within the tree.
Leaf Node
- where a final prediction is made.
- The whole process ends here as they do not split anymore
What are Child Nodes?
- Any node that results from a split.
- The nodes that are pointed to: either leaf nodes or other decision nodes.
What are Parent Nodes?
The node that a child node splits from.
What prediction outcomes types can decision tree be used for?
- classification: where a specific class or outcome is predicted
- regression: where a continuous variable is predicted—like the price of a car.
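The two outcome types differ in what a leaf node predicts: the majority class for classification, and typically the mean of the target values for regression. A toy sketch with made-up labels and prices:

```python
from collections import Counter

def classify_leaf(labels):
    """Classification leaf: predict the most common class."""
    return Counter(labels).most_common(1)[0][0]

def regress_leaf(values):
    """Regression leaf: predict the mean of the target values."""
    return sum(values) / len(values)

print(classify_leaf(["orange", "orange", "lemon"]))  # orange
print(regress_leaf([21000, 19000, 20000]))           # 20000.0 (e.g., car prices)
```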
What is the criteria to split a Decision node?
A decision node is split on the criterion that minimizes the impurity of the classes in the resulting child nodes.
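This criterion can be sketched as follows: for each candidate split, score the impurity of each child weighted by its share of the samples, and keep the split with the lowest score. Gini impurity is used here as the impurity measure; the labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a set of labels: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def weighted_impurity(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# A perfect split has zero impurity in both children:
print(weighted_impurity(["a", "a"], ["b", "b"]))  # 0.0
# A poor split leaves both children equally mixed:
print(weighted_impurity(["a", "b"], ["a", "b"]))  # 0.5
```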
What is Impurity?
- the degree of mixture with respect to class.
- A perfect split would have no impurity in the resulting child nodes; it would partition the data with each child containing only a single class.
Name 4 metrics to determine impurity
- Gini impurity
- entropy
- information gain
- log loss
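Two of these metrics are closely related: information gain is the reduction in entropy achieved by a split. A toy sketch (entropy in bits, working from class counts, which is an assumption of this example):

```python
from math import log2

def entropy(counts):
    """Entropy from class counts: -sum of p_i * log2(p_i)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Splitting a 50/50 parent into two pure children gains a full bit:
print(information_gain([2, 2], [[2, 0], [0, 2]]))  # 1.0
```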
What’s the requirement for choosing split points?
- identify what type of variable it is—categorical or continuous
- the range of values that exist for that variable
Choosing split for categorical predictor variable
consider splitting based on the categories of the variable, e.g., color.
Choosing split for continuous predictor variable
splits can be made anywhere along the range of numbers that exist in the data
E.g., sorting the fruit based on diameter: 2.25, 2.75, 3.25, 3.75, 5, and 6.5 centimeters.
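For a continuous variable like diameter, candidate split points are commonly taken midway between consecutive sorted values (a common convention in CART-style trees, assumed here):

```python
diameters = [2.25, 2.75, 3.25, 3.75, 5, 6.5]  # already sorted

# Candidate split points midway between consecutive values:
candidates = [(a + b) / 2 for a, b in zip(diameters, diameters[1:])]
print(candidates)  # [2.5, 3.0, 3.5, 4.375, 5.75]
```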
Describe Gini impurity score
- most straightforward
- the best scores are those closest to 0
- The worst score is 0.5 (for binary classification), which occurs when each child node contains an equal number of each class.
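The best and worst scores can be checked directly from the Gini formula, 1 minus the sum of squared class proportions (computed from class counts in this sketch):

```python
def gini(class_counts):
    """Gini impurity from class counts: 1 - sum of squared proportions."""
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)

print(gini([4, 0]))  # 0.0 -> pure node: the best possible score
print(gini([2, 2]))  # 0.5 -> equal mix of two classes: the worst score
```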
Classification trees PROs
- Require few pre-processing steps.
- Can work with all types of variables (continuous, categorical, discrete).
- No normalization or scaling required
- Decisions are transparent.
- Not affected by extreme univariate values
Name 2 disadvantages of classification trees
- Can be computationally expensive relative to other algorithms.
- Sensitive to data changes: small changes in data can result in significant changes in predictions.