DS Technical Interview Questions Flashcards

1
Q

What are the differences between supervised and unsupervised learning?

A

Supervised Learning:
- Uses known and labeled data as input
- Supervised learning has a feedback mechanism
- The most commonly used supervised learning algorithms are decision trees, logistic regression, and support vector machines

Unsupervised Learning:
- Uses unlabeled data as input
- Unsupervised learning has no feedback mechanism
- The most commonly used unsupervised learning algorithms are k-means clustering, hierarchical clustering, and apriori algorithm
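
A minimal sketch of the contrast using scikit-learn (the dataset is synthetic and purely illustrative):

```python
# Supervised vs unsupervised learning on the same synthetic data (scikit-learn).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=200, centers=2, random_state=0)

# Supervised: the labels y provide the feedback mechanism during fitting.
clf = LogisticRegression().fit(X, y)
print("supervised accuracy:", clf.score(X, y))

# Unsupervised: only X is used; structure is inferred without labels.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster assignments:", km.labels_[:10])
```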

2
Q

How is logistic regression done?

A

Logistic regression measures the relationship between the dependent variable (our label of what we want to predict) and one or more independent variables (our features) by estimating probability using its underlying logistic function (sigmoid).

p(x) = 1 / (1 + exp(- \theta^T * x)) [w/ Bernoulli likelihood]

Sigmoid:
S(x) = 1 / (1 + exp(-x))
S’(x) = S(x) * (1 - S(x))

Sigmoid generalisation: Softmax function
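
A small NumPy sketch of these formulas (the weight vector is hypothetical):

```python
import numpy as np

def sigmoid(x):
    # S(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # S'(x) = S(x) * (1 - S(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def softmax(z):
    # Multi-class generalisation of the sigmoid; shift by max for stability.
    e = np.exp(z - np.max(z))
    return e / e.sum()

theta = np.array([0.5, -1.0])  # hypothetical weights
x = np.array([2.0, 1.0])
print(sigmoid(theta @ x))      # p(x) = 1 / (1 + exp(-theta^T x))
```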

3
Q

How is logistic regression done? [ICL notes]

A

ICL notes:
- Binary classification problems
- Linear model with non-Gaussian likelihood
- Implicit modeling assumptions
- Parameter estimation (MLE, MAP) no longer in closed form
- Bayesian logistic regression with Laplace approximation of the posterior
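
A sketch of why estimation is no longer in closed form: MAP for logistic regression with a Gaussian prior, done by gradient ascent on the log-posterior (learning rate, prior variance, and toy data are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_logistic_regression(X, y, prior_var=1.0, lr=0.5, n_iter=1000):
    # Log-posterior = Bernoulli log-likelihood + Gaussian log-prior.
    # No closed-form maximiser exists, hence the iterative update.
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p) - theta / prior_var
        theta += lr * grad / len(y)
    return theta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=100)])  # bias + 1 feature
y = (X[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(float)
print(map_logistic_regression(X, y))
```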

4
Q

Explain the steps in making a decision tree

A
  1. Take the entire data set as input
  2. Calculate entropy/loss of the target variable, as well as the predictor attributes/features
  3. Calculate your information gain of all attributes/features (we gain information on sorting different objects from each other)
  4. Choose the attribute/feature with the highest information gain as the root node
  5. Repeat the same procedure on every branch until the decision node of each branch is finalised
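
A short NumPy sketch of steps 2-3, assuming a categorical feature and class labels:

```python
import numpy as np

def entropy(labels):
    # H = -sum(p * log2(p)) over the class proportions.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature):
    # Gain = H(parent) - weighted sum of child entropies after the split.
    gain = entropy(labels)
    for value in np.unique(feature):
        mask = feature == value
        gain -= mask.mean() * entropy(labels[mask])
    return gain

y = np.array([0, 0, 1, 1, 1, 0])
f = np.array(["a", "a", "b", "b", "b", "a"])
print(information_gain(y, f))  # perfect split, so gain == entropy(y) == 1.0
```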
5
Q

Bias vs Variance

A
  • Straight line = high bias (potential underfitting)
  • Perfect fitting line = high variance
    —- low/zero train error
    —- high test error (overfitting)
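
A quick illustration, using polynomial regression as the model family: degree 1 underfits (high bias), degree 15 overfits (low train error, high test error):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 15):  # straight line vs near-perfect fitting curve
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree, "train R^2:", model.score(X_tr, y_tr),
          "test R^2:", model.score(X_te, y_te))
```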
6
Q

What are the three commonly used methods for finding the sweet spot between a simple and complicated model?

A
  • Regularisation, e.g. L1, L2
  • Boosting, e.g. AdaBoost, GradientBoost, XGBoost, LightGBM
  • Bagging, e.g. Random Forest
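
For reference, minimal scikit-learn constructors for each family (the hyperparameter values are placeholders):

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import Lasso, Ridge

l1 = Lasso(alpha=0.1)   # L1 regularisation: drives some coefficients to exactly zero
l2 = Ridge(alpha=1.0)   # L2 regularisation: shrinks coefficients towards zero

boost = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)  # boosting
bag = RandomForestClassifier(n_estimators=100)  # bagging
```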
7
Q

How does Random Forest work?

A
  • Bootstrap the data, i.e. sample with replacement from the original dataset
  • Create one decision tree per bootstrapped dataset, randomly subsampling the columns considered at each split
  • The final voting/averaging across all trees is called bagging (bootstrap + aggregating)
  • Use out-of-bag sample to estimate the RF accuracy, which also helps to choose the right columns for decisions
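
A usage sketch with scikit-learn, where oob_score=True surfaces the out-of-bag estimate mentioned above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Each tree sees a bootstrap sample and a random subset of features per split;
# the samples left out of each bootstrap (out-of-bag) give a free accuracy estimate.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            oob_score=True, random_state=0)
rf.fit(X, y)
print("OOB accuracy estimate:", rf.oob_score_)
```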
8
Q

How does AdaBoost work?

A
  • Stump: A tree with just one node and two leaves (weak learner)
  • E.g. “Stump” Forest = AdaBoost
  • Combine a lot of weak learners
  • Some stumps get more say in classification/regression than others (boosting weights of mis-classified samples)
  • Each stump is made by taking the previous stumps’ mistakes into account
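
A usage sketch with scikit-learn; its default base learner is already a depth-1 stump, and in versions before 1.2 the keyword is base_estimator rather than estimator:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Stumps are fit sequentially; misclassified samples are upweighted, and each
# stump's vote is weighted by how well it performed.
stump = DecisionTreeClassifier(max_depth=1)
ada = AdaBoostClassifier(estimator=stump, n_estimators=100, random_state=0)
print(ada.fit(X, y).score(X, y))
```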
9
Q

How does GradientBoost work?

A
  • Start with a leaf that is the average value of the variable we want to predict
  • Add a tree based on the residuals, the difference between the observed and predicted values
  • Scale the tree’s contribution to the final prediction with a learning rate
  • Repeat on the previous residuals until converged/done
  • This works for regression; for classification, use log-odds and the sigmoid
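
A from-scratch sketch of the regression variant (tree depth, learning rate, and toy data are arbitrary choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)

lr = 0.1
prediction = np.full_like(y, y.mean())   # start with the average-value leaf
for _ in range(100):
    residuals = y - prediction           # observed minus predicted
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    prediction += lr * tree.predict(X)   # scaled contribution of the new tree

print("train MSE:", np.mean((y - prediction) ** 2))
```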
10
Q

How does LightGBM work?

A
  • Faster training speed and higher efficiency:
    —- LightGBM uses a histogram-based algorithm, i.e. it buckets continuous feature values into discrete bins, which speeds up training
    —- Exclusive feature bundling
  • Lower memory usage:
    —- Replacing continuous values with discrete bins results in lower memory usage
  • Often better accuracy than other boosting algorithms:
    —- It produces more complex trees by following a leaf-wise rather than level-wise split approach, which is the main factor in achieving higher accuracy
    —- Can lead to overfitting if max_depth or the number of leaves is not restricted
    —- Gradient-based one-side sampling (GOSS): keep the samples with large gradients, randomly subsample those with small gradients
  • Compatibility with large datasets (reduced train time vs XGBoost)
  • Parallel learning supported
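
A minimal usage sketch, assuming the lightgbm package is installed; num_leaves and max_depth are the knobs that restrict the leaf-wise growth discussed above:

```python
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Cap num_leaves/max_depth to curb the overfitting risk of leaf-wise splits.
model = lgb.LGBMClassifier(num_leaves=31, max_depth=7,
                           learning_rate=0.1, n_estimators=200)
model.fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))
```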
11
Q

How do you build a random forest model?

A

Steps to build a random forest model:
(1) Randomly select ‘k’ features from a total of ‘m’ features where k << m
(2) Among the ‘k’ features, find the best split point and use it to create node ‘d’
(3) Split the node into daughter nodes using the best split
(4) Repeat steps two and three until leaf nodes are finalised
(5) Build forest by repeating steps one to four for ‘n’ times to create ‘n’ number of trees
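
A simplified from-scratch sketch of these steps; note that scikit-learn's random forest re-samples the feature subset at every split, whereas this version picks the ‘k’ features once per tree:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_forest(X, y, n_trees=25, k=None, seed=0):
    # Steps 1-5: per tree, bootstrap the rows and select k << m features.
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    k = k or max(1, int(np.sqrt(m)))  # sqrt(m) is a common default for k
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, len(X), size=len(X))  # bootstrap sample
        cols = rng.choice(m, size=k, replace=False)  # k of m features
        tree = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
        forest.append((tree, cols))
    return forest

def predict(forest, X):
    # Majority vote over the trees (assumes binary 0/1 labels).
    votes = np.array([tree.predict(X[:, cols]) for tree, cols in forest])
    return (votes.mean(axis=0) > 0.5).astype(int)
```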

12
Q

How can you avoid overfitting your model?

A

Overfitting refers to a model that is tuned to a very small amount of data and ignores the bigger picture. There are three main methods to avoid overfitting:
(1) Keep the model simple—take fewer variables into account, thereby removing some of the noise in the training data
(2) Use cross-validation techniques, such as k-fold cross-validation
(3) Use regularisation techniques, such as LASSO, that penalise certain model parameters if they’re likely to cause overfitting

[4] Bagging, Boosting
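
A short sketch combining (2) and (3): 5-fold cross-validation of a LASSO model (the alpha value is an arbitrary choice):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# k-fold CV estimates generalisation error; the L1 penalty zeroes out weak
# features, keeping the model simple.
scores = cross_val_score(Lasso(alpha=0.5), X, y, cv=5)
print("5-fold R^2 scores:", scores)
```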

13
Q

Differentiate between univariate, bivariate, and multivariate analysis

A

Univariate data contains only one variable. The purpose of the univariate analysis is to describe the data and find patterns that exist within it.

Example (height of students): patterns can be studied by drawing conclusions using the mean, median, mode, dispersion or range, minimum, maximum, etc.

Bivariate data involves two different variables. The analysis of this type of data deals with causes and relationships and the analysis is done to determine the relationship between the two variables.

Example (temperature and ice cream sales in the summer season): the analysis shows that temperature and sales are directly proportional to each other. The hotter the temperature, the better the sales.

Data involving three or more variables is categorised as multivariate. It is similar to bivariate analysis but contains more than one dependent variable.

Example (data for house price prediction): patterns can be studied by drawing conclusions using the mean, median, mode, dispersion or range, minimum, maximum, etc. You can start by describing the data and then use it to guess what the price of the house will be.
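
A pandas sketch of the three levels of analysis on a hypothetical data frame (column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "height": [160, 172, 181, 168, 175],
    "temperature": [25, 30, 35, 28, 32],
    "sales": [200, 350, 500, 280, 420],
})

print(df["height"].describe())               # univariate: one variable's distribution
print(df["temperature"].corr(df["sales"]))   # bivariate: relationship between two
print(df.corr())                             # multivariate: all pairwise relationships
```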

14
Q
What are the feature selection methods used to select the right variables?
A

There are two main methods for feature selection: filter methods and wrapper methods.

Filter Methods:
- Linear discriminant analysis (LDA)
- PCA [X^TX feature vs feature, added myself]
- ANOVA [analysis of variance, SST = SSB + SSW]
- Chi-Square [test for mutually independent features, e.g. reject if significance level is below 5%]

Wrapper Methods:
- Forward Selection: We test one feature at a time and keep adding them until we get a good fit
- Backward Selection: We test all the features and start removing them to see what works better
- Recursive Feature Elimination: Recursively looks through all the different features and how they pair together

Note: Wrapper methods are very labor-intensive, and high-end computers are needed if a lot of data analysis is performed with the wrapper method.
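
A sketch of one filter method (ANOVA F-test) and one wrapper method (recursive feature elimination) in scikit-learn; keeping 10 features is an arbitrary choice:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter: rank features by the ANOVA F-statistic, keep the top 10.
X_filtered = SelectKBest(f_classif, k=10).fit_transform(X, y)

# Wrapper: recursively drop the weakest features according to a model.
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10).fit(X, y)
print("RFE kept:", rfe.support_.sum(), "features")
```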

15
Q

Describe ANOVA

A

ANOVA = analysis of variance

SST = SSB + SSW,
where
SST = total sum of squares
SSB = sum of squares between groups
SSW = sum of squares within groups
X \in R^{m x n} (m groups, n samples each)

F-statistic = (SSB / (m-1)) / (SSW / (m(n-1))) >> 1 => highly different features
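
A worked sketch with SciPy on three hypothetical groups; f_oneway returns the F-statistic above together with its p-value:

```python
import numpy as np
from scipy import stats

# m = 3 groups with n = 5 samples each (values are made up).
g1 = np.array([4.9, 5.1, 5.0, 4.8, 5.2])
g2 = np.array([5.9, 6.1, 6.0, 5.8, 6.2])
g3 = np.array([7.0, 6.9, 7.1, 7.2, 6.8])

# F = (SSB/(m-1)) / (SSW/(m(n-1))); a large F means the group means differ.
f_stat, p_value = stats.f_oneway(g1, g2, g3)
print(f_stat, p_value)
```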

16
Q

Describe Linear Discriminant Analysis

A

d^2 / (s_1^2 + s_2^2) should ideally be large/small (large numerator, small denominator),
where d is the distance between the category means and s_i is the sample std (scatter) within category i

(1) Maximise the distance between means
(2) Minimise the variation (scatter) within each category

LDA vs PCA:
- both try to reduce dimensions
- PCA looks for the directions with the most variation, hence the eigenvectors of X^T X
- LDA tries to maximise the separation between categories while minimising the variation within each
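
A side-by-side sketch on the iris dataset; note that PCA ignores the labels y while LDA requires them:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA: unsupervised, keeps the directions of maximum variance.
X_pca = PCA(n_components=2).fit_transform(X)

# LDA: supervised, maximises between-class separation over within-class scatter.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
print(X_pca.shape, X_lda.shape)
```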