How to Encode Categorical Data Flashcards

1
Q

WHAT ARE THE TWO MOST POPULAR TECHNIQUES FOR ENCODING CATEGORICAL DATA? P258

A

The two most popular techniques are an Ordinal encoding and a One Hot encoding.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

WHAT IS DISCRETIZATION? P259

A

Converting a numerical variable to an ordinal variable by dividing the range of the numerical variable into bins and assigning values to each bin.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

WHAT IS THE DIFFERENCE BETWEEN NOMINAL VARIABLE AND ORDINAL VARIABLES? P259

A

There’s no rank-order between the values in nominal, but there is a rank-order in ordinal.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

WHAT IS AN EXAMPLE OF AN ALGORITHM CAPABLE OF WORKING WITH CATEGORICAL DATA DIRECTLY? P259

A

Decision Trees; it can be learned directly from categorical data with no data transform required (this depends on the specific implementation)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

WHAT IS THE DIFFERENCE BETWEEN LABEL ENCODER AND ORDINAL ENCODER? P260

A

Label encoder expects 1-D input but OrdinalEncoder can receive a matrix, other than this, they do the same thing.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

WHY ORDINAL ENCODER CAN CAUSE PROBLEMS IF USED FOR NOMINAL VARIABLES? WHAT CAN BE USED INSTEAD? P260

A

An integer ordinal encoding is a natural encoding for ordinal variables. For categorical variables, it imposes an ordinal relationship where no such relationship may exist. This can cause problems and a one hot encoding may be used instead.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

WHAT IS THE USE OF PARAMETER “CATEGORIES” IN ONE HOT ENCODER? P262

A

If you know all of the labels to be expected in the data, they can be specified via the categories argument as a list.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

WHAT DOES ONE HOT ENCODER DO WHEN IT ENCOUNTERS UNKNOWN CATEGORIES IN NEW DATA? P262

A

If new data contains categories not seen in the training dataset, the handle unknown argument can be set to ‘ignore’ to not raise an error, which will result in a zero value for each label

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

WHAT’S ONE WAY TO REDUCE REDUNDANCY WHEN USING ONE HOT ENCODER? P262

A

Using dummy encoding

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

HOW DOES DUMMY ENCODER WORK? P262

A

When there are C categories, it creates C-1 column. One category gets all 0 values. . For example, if we know that [1, 0, 0] represents blue and [0, 1, 0] represents green we don’t need another binary variable to represent red, instead we could use 0 values alone, e.g. [0, 0, 0].

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

WHEN WORKING WITH TREE-BASED MODELS, IS FULL ONE HOT ENCODING BETTER OR IS DUMMY ENCODING BETTER? P262

A

We recommend using the full set of dummy variables when working with tree-based models.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

HOW CAN WE IMPLEMENT DUMMY ENCODING USING ONE HOT ENCODER CLASS? P262

A

The “drop” parameter can be set to indicate which category will become the one that’s assigned all zero values.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

WHAT IS THE CATEGORY THAT’S ASSIGNED ALL ZEROS CALLED? P262

A

Baseline

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

WHAT IF I HAVE A MIXTURE OF CATEGORICAL AND ORDINAL DATA? P270

A

You will need to prepare or encode each variable (column) in your dataset separately, then concatenate all of the prepared variables back together into a single array for fitting or evaluating the model. Alternately, you can use the ColumnTransformer to conditionally apply different data transforms to different input variables

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

WHAT IF I HAVE HUNDREDS OF CATEGORIES? P270

A

You can use a one hot encoding up to thousands and tens of thousands of categories. Also, having large vectors as input sounds intimidating, but the models can generally handle it.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly