Project Questions Flashcards

1
Q

What is the purpose of the “set.seed()” function in R?

A

It ensures that output involving randomness will be the same given the same inputs: with the same seed and the same code, the random number generator produces identical results on every run.

I.e., if the seed is omitted, running the same code will lead to different results every time because the random number generator starts from a different state.
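A minimal sketch (the seed value 123 is arbitrary):

set.seed(123)   # fix the starting state of the random number generator
runif(3)        # three "random" numbers

set.seed(123)   # reset to the same state
runif(3)        # exactly the same three numbers as above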

2
Q

What is the utility of the “caret” package?

A

The caret (Classification And REgression Training) package provides a unified interface for the modelling workflow: data partitioning, resampling and cross-validation, model training and hyperparameter tuning, and performance evaluation (e.g., confusion matrices).
3
Q

What is the utility of the “corrplot” package?

A

The corrplot package visualises correlation matrices, making it easy to spot strong associations between numerical variables during exploratory data analysis.
4
Q

What is the utility of the “tidyverse” package?

A

The tidyverse is a collection of packages (e.g., dplyr, ggplot2, tidyr, readr) that share a common grammar for importing, wrangling, and visualising data.
5
Q

What is the utility of the “pROC” package?

A

The pROC package builds and plots ROC curves and computes the area under the curve (AUC), allowing classifiers to be compared independently of a specific threshold.
6
Q

What is the utility of the “rpart.plot” package?

A

The rpart.plot package plots decision trees fitted with rpart, giving a readable visualisation of the splits and terminal (leaf) nodes.
7
Q

What is a factor?

A

In R, a factor is a categorical variable that represents distinct levels or categories
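A minimal illustration (made-up values):

meal <- factor(c("BB", "HB", "BB", "FB"))  # character values become a factor
levels(meal)                               # "BB" "FB" "HB" - the finite set of categories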

8
Q

What is an integer?

A

An integer is a whole number that does not have any fractional or decimal part.

9
Q

What is a character?

A

A character is a data type in programming that represents individual letters, numbers, symbols, or spaces

10
Q

To incorporate an interaction effect where one or both variables are categorical, dummy variables are used. Why?

A

In logistic regression, dummy variables provide a way to represent categorical variables numerically. Each dummy variable represents one category of the categorical variable and takes a value of 0 or 1, indicating the absence or presence of that category. If the category is absent, the entire term becomes 0; if it is present, the model multiplies the corresponding coefficient by 1.
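A hedged sketch of what this looks like in R; the data frame and column names (bookings, is_canceled, lead_time, deposit_type) are placeholders, not necessarily the names used in the project:

# glm() expands the factor into 0/1 dummy columns and builds the interaction terms
fit <- glm(is_canceled ~ lead_time * deposit_type,
           data = bookings, family = binomial)
head(model.matrix(~ lead_time * deposit_type, data = bookings))  # inspect the dummy and interaction columns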

11
Q

Which of the following additional use cases does BDA have in hospitality?
A) facilitates service innovation
B) insights into customer satisfaction through e.g., big data text analysis of customer reviews
C) creating client profiles and enhance customer relationship management
D) all of the above

A

D) All of the above

12
Q

What is the goal of data transformation? It involves modifying the structure or content of a dataset.

A

The goal is to ensure an appropriate fit between the type of data and the chosen statistical method

13
Q

Which data transformation steps did you perform?

A

1) Transformation from character to factor
2) Transformation from integer to factor
3) Combining #kids and #babies
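A sketch of these steps with dplyr, assuming column names such as meal, is_repeated_guest, children and babies (the exact names may differ from the project's):

library(dplyr)
bookings <- bookings %>%
  mutate(meal              = as.factor(meal),               # character -> factor
         is_repeated_guest = as.factor(is_repeated_guest),  # integer (0/1) -> factor
         kids              = children + babies)             # combine the two counts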

14
Q

Why did you transform characters to factors?

A

Factors allow the model to interpret and utilise categorical variables efficiently by providing a finite set of levels that the value can take

15
Q

Why did you transform integers to factors?

A

Factors allow the model to distinguish between numerical variables that represent categories and numerical variables that represent continuous quantities

16
Q

Give an example of an integer variable you transformed into a factor, and explain why it made sense

A

Is_repeat_guest and is_cancelled: in reality, these variables can take one of two values: yes (1) or no (0). Thus, it made sense to transform them from integers to factors to ensure that the model interpreted them as categorical variables

17
Q

Give an example of a character variable you transformed into a factor, and explain why it made sense

A

Meal: there are four meal types, each denoted by two capital letters and stored as characters. We wanted the model to interpret the variable as categorical, allowing it to map any associations between meal type and the output.

18
Q

As part of data cleaning, you removed some observations based on their deposit type. Elaborate on this

A

We removed the observations with deposit type = non-refundable: EDA showed that 99% of these bookings were cancelled, which is highly counter-intuitive (guests who have paid a non-refundable deposit would be expected to show up)

19
Q

As part of data cleaning, you removed observations with no adults recorded. Why? Is this necessarily an error?

A

Not necessarily an error. However, the bookings with no adults recorded were expected to be tied to (perfectly associated with) another booking. Thus, they are not representative as stand-alone observations to be considered in the model

20
Q

Why did you keep the remaining two deposit types rather than deleting the entire variable altogether?

A

The observations with a refundable deposit and those with no deposit behaved intuitively, and no abnormality was detected. E.g., the cancellation rate is slightly higher for no-deposit bookings, which makes sense

21
Q

Why did you remove observations with “undefined” meal and distribution channel? Were they missing completely at random?

A

This resulted in the deletion of ~1,000 rows, which we judged not to impact the model significantly (100k+ observations remained after all cleaning).
MCAR:

22
Q

How did you seek to tackle the bias-variance trade-off?

A

To ensure that the right balance was struck between generalisation (bias) and overfitting (variance), we partitioned the data into a training set (60%) and a test set (40%). Then, we applied 5-fold cross-validation.
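A sketch with caret, assuming the outcome column is called is_canceled; createDataPartition also stratifies the split by the outcome:

library(caret)
set.seed(123)
idx       <- createDataPartition(bookings$is_canceled, p = 0.6, list = FALSE)  # stratified 60/40 split
train_set <- bookings[idx, ]
test_set  <- bookings[-idx, ]
ctrl      <- trainControl(method = "cv", number = 5)  # 5-fold cross-validation on the training set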

23
Q

What does it mean that you applied 5-fold cross validation?

A

5-fold cross-validation is done within the TRAINING set, where the data is divided into 5 portions (folds). In each iteration, 4 of the 5 folds are used for training and the remaining fold is used for testing. This is repeated 5 times so that all of the training data has been used for both training AND testing.

24
Q

Why did you choose 5 folds?

A

It came down to a trade-off between the degree of cross-validation and computational time - each fold takes a long time to run.
Meanwhile, given the significant size of the dataset, each fold still contains a large amount of data - suggesting that the bias-variance trade-off was handled sufficiently.

25
Q

What did you do to deal with dataset imbalance?

A

We applied stratified random sampling, which ensures that the distribution of 0s and 1s is similar in the training and test sets. This mitigates the risk of drawing a "lucky" or "unlucky" train/test split, which would distort the measured predictive performance.

26
Q

How does the LR model estimate coefficients?

A

By maximum likelihood estimation, where the model applies an iterative process to estimate the coefficients that maximise the likelihood of the observed outcomes under the model.
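A minimal sketch (predictor names are placeholders); glm() finds the maximum-likelihood coefficients via iteratively reweighted least squares:

lr_fit <- glm(is_canceled ~ lead_time + deposit_type,
              data = train_set, family = binomial)
summary(lr_fit)  # coefficients are on the log-odds scale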

27
Q

Which of the following statements are TRUE about DT?
A) it operates by recursively partitioning a dataset into subsets based on the values of input features, which creates a hierarchical structure of decision nodes
B) Each intermediary node represents a decision point, wherein a given feature is evaluated
C) Each terminal leaf node reflects a final outcome or prediction
D) The recursive partitioning divides the dataset into subsets that minimise impurity (e.g., entropy or Gini impurity)
E) all of the above

A

E) all of the above

28
Q

Which of the following options are advantages of LR relative to DT?
A) it offers a precise interpretation of the association between each variable and the output through coefficients
B) it is more flexible in adapting to non-linear relations
C) it provides a visual, intuitive mapping of the decision process, revealing the interplay of various predictors
D) all of the above

A

LR > DT in the following aspect:
A) it offers a precise interpretation of the association between each variable and the output through coefficients

DT > LR in the following aspects:
B) it is more flexible in adapting to non-linear relations
C) it provides a visual, intuitive mapping of the decision process, revealing the interplay of various predictors

29
Q

Explain the process employed in your selection of variables in the model.

How did you determine the importance of numerical variables?

How did you determine the importance of categorical variables?

How did you determine the importance of different interactions?

A

For numerical variables, the importance of each predictor was evaluated based on its correlation with the cancellation outcome.

For categorical variables, the importance of each predictor was evaluated through graphical illustrations to illuminate any meaningful associations with the cancellation rate.

For interactions, a series of heatmaps was constructed to explore associations across categorical predictors; for numerical variables, those with the highest correlation were chosen.
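A sketch of the numerical screening with corrplot, assuming the outcome is still stored as a 0/1 numeric at this point:

library(corrplot)
num_vars <- bookings[sapply(bookings, is.numeric)]        # keep numerical columns only
corr_mat <- cor(num_vars, use = "pairwise.complete.obs")  # pairwise correlations
corrplot(corr_mat, method = "color")                      # visual check, incl. correlation with the outcome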

30
Q

You excluded the following explanatory variables. Why?
A) room type
B) agent ID
C) hotel type
D) reservation status
E) company
F) days_waiting_list and arrival_date_day_of_month

A

A) room type: denoted by letters -> unable to interpret
B) agent ID: denoted by numbers -> unable to interpret
C) hotel type: to enhance generalisability beyond the two hotel types
D) reservation status: for predicting future bookings, this information is not available prior to the expected guest arrival
E) company: 94% were NULL observations
F) days_waiting_list and arrival_date_day_of_month: negligible correlation with cancellation

31
Q

AUC is a threshold independent metric
TRUE/FALSE

A

TRUE
AUC provides a general evaluation of a model's performance across all possible thresholds, summarising the trade-off between sensitivity and 1 - specificity
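A sketch with pROC, assuming lr_fit is a fitted glm and test_set holds the hold-out data:

library(pROC)
pred_prob <- predict(lr_fit, newdata = test_set, type = "response")  # predicted probabilities
roc_obj   <- roc(test_set$is_canceled, pred_prob)                    # ROC across all thresholds
auc(roc_obj)                                                         # threshold-independent summary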

32
Q

Confusion-matrix-derived metrics are threshold-dependent
TRUE/FALSE

A

TRUE

33
Q

Proportion of correct predictions in all records is measured by ____
A) accuracy
B) sensitivity
C) specificity
D) precision
E) negative predictive value

A

Proportion of correct predictions in all records is measured by ACCURACY

Accuracy = (TP+TN)/(TP+TN+FP+FN)

34
Q

Ability to correctly predict the positive class is measured by ____
A) accuracy
B) sensitivity
C) specificity
D) precision
E) negative predictive value

A

Ability to correctly predict the positive class is measured by SENSITIVITY (= RECALL)

Sensitivity = TP/(TP+FN)

35
Q

Ability to correctly predict the negative class is measured by ____
A) accuracy
B) sensitivity
C) specificity
D) precision
E) negative predictive value

A

Ability to correctly predict the negative class is measured by SPECIFICITY

Specificity = TN/(TN+FP)

36
Q

Proportion of correct predictions among those the classifier predicted as positives is measured by ____
A) accuracy
B) sensitivity
C) specificity
D) precision
E) negative predictive value

A

Proportion of correct predictions among those the classifier predicted as positives is measured by PRECISION

Precision = TP/(TP+FP)

37
Q

Proportion of correct predictions among those the classifier predicted as negatives is measured by ____
A) accuracy
B) sensitivity
C) specificity
D) precision
E) negative predictive value

A

Proportion of correct predictions among those the classifier predicted as negatives is measured by NEGATIVE PREDICTIVE VALUE

NPV = TN/(TN+FN)
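A worked example with made-up counts, tying the five metrics together:

TP <- 40; FP <- 30; FN <- 20; TN <- 110          # hypothetical confusion matrix
accuracy    <- (TP + TN) / (TP + FP + FN + TN)   # 150/200  = 0.75
sensitivity <- TP / (TP + FN)                    # 40/60   ~= 0.67
specificity <- TN / (TN + FP)                    # 110/140 ~= 0.79
precision   <- TP / (TP + FP)                    # 40/70   ~= 0.57
npv         <- TN / (TN + FN)                    # 110/130 ~= 0.85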

38
Q

In the context of the core business objective being to improve operational performance, prioritising a slight overestimation of cancellations (with some false positives) is preferable to risking revenue loss due to underestimating cancellations (false negatives). Therefore, the primary focus in the model comparison centres on _____ vs. _____

A

In the context of the core business objective being to improve operational performance, prioritising a slight overestimation of cancellations (with some false positives) is preferable to risking revenue loss due to underestimating cancellations (false negatives). Therefore, the primary focus in the model comparison centres on SENSITIVITY vs. PRECISION

39
Q

Hyperparameters play a crucial role in the domain of machine learning, setting themselves apart from model parameters as they operate _______ to the learning process.
Fill in the blank

A

Hyperparameters play a crucial role in the domain of machine learning, setting themselves apart from model parameters as they operate EXTERNALLY to the learning process.

40
Q

What role does the cost complexity parameter play in decision tree modeling, particularly in relation to overfitting?

A

The cost complexity parameter in decision tree modeling balances the trade-off between the complexity of the tree and its predictive performance, preventing overfitting

41
Q

What is the purpose of the tune-length hyperparameter in the systematic tuning process for pruning decision trees?

A

The tune-length defines the number of complexity parameters to be evaluated, and in this study, a tune-length of 15 is chosen to strike a balance between performance, risk of overfitting, and computational demand.
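A sketch of the pruning search with caret, plus a plot of the final tree with rpart.plot; the object names train_set and ctrl are assumed from the earlier partitioning sketch:

library(caret)
library(rpart.plot)
dt_fit <- train(is_canceled ~ ., data = train_set, method = "rpart",
                trControl = ctrl, tuneLength = 15)  # evaluates 15 candidate cp values
rpart.plot(dt_fit$finalModel)                       # visualise the selected (pruned) tree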

42
Q

Why did you set the threshold at below 0.5?
A) asymmetric costs associated to overprediction vs. underprediction
B) the study defines a preference for slight overestimation of cancellations, leading to lower threshold choice
C) to ensure that the model overestimates cancellations rather than underestimating it
D) all of the above

A

D) all of the above

43
Q

You set different thresholds for each of the model versions. Why?
A) to ensure an unbiased evaluation of predictive performance across models that allows for comparability
B) each model possesses unique characteristics that impact its performance at various threshold levels
C) it ensures a more impartial performance assessment of each model rather than setting an “unfair” threshold for some of the models
D) all of the above

A

D) all of the above

44
Q

How did you set the individual thresholds?

A

An iterative process. Every model starts with a threshold of 0.25. The threshold is then gradually adjusted upward in steps of 0.01 (two decimal places) to the value that maximises the accuracy of the model, BUT under the condition that the model must predict more false positives than false negatives.
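A sketch of that search; p_hat (predicted probabilities) and y (observed 0/1 outcomes) are placeholders, and the upper bound of 0.50 is an assumption:

best_acc <- 0
best_thr <- 0.25
for (thr in seq(0.25, 0.50, by = 0.01)) {
  pred <- ifelse(p_hat >= thr, 1, 0)
  fp   <- sum(pred == 1 & y == 0)
  fn   <- sum(pred == 0 & y == 1)
  acc  <- mean(pred == y)
  if (acc > best_acc && fp > fn) {  # keep only thresholds that over-predict cancellations
    best_acc <- acc
    best_thr <- thr
  }
}
best_thr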

45
Q

A Sensitivity of 0.59 in Model 3 indicates that the model correctly predicts 59% of the actual positives, whereas a Precision of 0.57 indicates that 57% of the positive predictions are actually positive.

TRUE/FALSE

A

TRUE

46
Q

What does a no information rate of 0.7151 mean?

A

The NIR tells us the accuracy (predictive performance) of a naive benchmark.
The naive benchmark always predicts the majority class, i.e., whichever outcome makes up the largest proportion of the data.

I.e., an NIR of 0.7151 simply means that roughly 29% of the dataset has a cancelled outcome (1). By predicting "not cancelled" every time, the naive model will be right about 71.5% of the time.
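A quick way to compute it, assuming the outcome column is is_canceled:

nir <- max(prop.table(table(test_set$is_canceled)))  # proportion of the majority class, e.g. 0.7151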

47
Q

Which of the DT models was optimal? Why?

A

DT 4 was deemed the optimal DT model.
DT 3 and DT 4 are the only models whose accuracy exceeds the NIR. DT 3 had higher precision, but DT 4 had higher sensitivity -> DT 4 is best based on the business goal (preference for overprediction at the expense of more false positives).

48
Q

Which of the LR models was optimal? Why?

A

LR 3 was deemed the optimal LR model.
Both sensitivity and precision increase from LR 1 to LR 3.
Although precision is highest for LR 4, sensitivity is highest for LR 3 -> LR 3 is best based on the business goal (preference for overprediction at the expense of more false positives).

49
Q

The best DT model (4) outperforms the best LR model (3) across all evaluated metrics.
This leads to the conclusion that this DT emerges as the better choice based on predictive performance - both sensitivity and precision are higher for DT 4 than for LR 3.

TRUE/FALSE

A

TRUE

50
Q

Which advantages does LR have over DT wrt. practical adoption?

A

1) Lower computational demand and faster updating
2) Easier to interpret, with coefficients indicating the association with each predictive feature - easier to communicate

51
Q

LR provides a percentage likelihood for a particular booking being cancelled (probability) - do DTs give the same output?

A

No - a decision tree tells you whether a given booking is most likely to be cancelled or not, not a distinct probability. It does, however, tell you that, based on the training set, x% of the total data was captured in a given leaf node, and a proportion 0.Y of those observations turned out to be cancelled/not cancelled.
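A sketch contrasting the two kinds of output; dt_fit and lr_fit are the fitted models from the earlier sketches:

predict(lr_fit, newdata = test_set, type = "response")  # glm: probability of cancellation per booking
predict(dt_fit, newdata = test_set, type = "raw")       # caret/rpart: predicted class only
predict(dt_fit, newdata = test_set, type = "prob")      # caret/rpart: class proportions of the leaf each booking falls into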

52
Q

What is the business objective of the study?

A

Based on booking-cancellation predictions, hotels can leverage the insights to increase operational performance through more informed revenue-management initiatives (e.g., overbooking) and to minimise the cost base (e.g., staffing).