Machine Learning Flashcards

source: https://www.edureka.co/blog/interview-questions/machine-learning-interview-questions/

1
Q

A/B Testing

A

A/B testing is statistical hypothesis testing for a randomized experiment with two variants, A and B. It can be used to compare two models that use different predictor variables in order to check which model performs best on a given sample of data.

Consider a scenario where you’ve created two models (using different predictor variables) that can be used to recommend products for an e-commerce platform.

A/B Testing can be used to compare these two models to check which one best recommends products to a customer.
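
One common way to run such a comparison is a two-proportion z-test on the conversion counts of each variant. The sketch below assumes hypothetical counts (200 and 250 purchases out of 1000 recommendations each); the function name and the numbers are illustrative only and do not come from the source.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two proportions."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled proportion under the null hypothesis that A and B perform equally.
    p_pool = (success_a + success_b) / (n_a + n_b)
    standard_error = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: model A converts 200/1000 users, model B converts 250/1000.
z, p_value = two_proportion_z_test(200, 1000, 250, 1000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value (for example below 0.05) suggests the difference between the two models is unlikely to be due to chance alone.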

2
Q

Bagging vs Boosting

A
3
Q

Classification vs Regression

A

Classification:

  • Predicting discrete class/label
  • Binary and multi-class classification

Regression:

  • Predicting continuous quantity
  • Regression with multiple input variables is called multiple regression (multivariate regression refers to multiple output variables)
4
Q

Cluster Sampling

A

It is a process of randomly selecting intact groups within a defined population, sharing similar characteristics.

Cluster Sample is a probability sample where each sampling unit is a collection or cluster of elements.

5
Q

Collinearity and Multicollinearity

A

Collinearity occurs when two predictor variables (e.g., x1 and x2) in a multiple regression have some correlation.

Multicollinearity occurs when more than two predictor variables (e.g., x1, x2, and x3) are inter-correlated.

6
Q

Confusion Matrix

A

A confusion matrix or an error matrix is a table which is used for summarizing the performance of a classification algorithm.
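
For a binary classifier, the table has four cells: true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). A minimal sketch of counting them, assuming made-up labels where 1 is the positive class (the function name and data are illustrative):

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN and TN for a binary classification problem."""
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["TP" if actual == positive else "FP"] += 1
        else:
            counts["FN" if actual == positive else "TN"] += 1
    return counts


# Hypothetical labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
cm = confusion_counts(y_true, y_pred)
print(dict(cm))
```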

7
Q

Gini Impurity vs Entropy in a Decision Tree

A
  • Gini Impurity and Entropy are the metrics used for deciding how to split a Decision Tree.

Gini impurity is the probability of a random sample being classified incorrectly if you randomly pick a label according to the distribution in the branch.

Entropy measures the lack of information (uncertainty) in a node. You calculate the Information Gain (the difference in entropies) obtained by making a split. This measure helps to reduce the uncertainty about the output label.

8
Q

How Decision Tree node is split

A
  • Measures such as the Gini Index and Entropy can be used to decide which variable is best suited for splitting the Decision Tree at a node, starting with the root node.
  • We can calculate Gini as follows:
    • Calculate Gini for the sub-nodes using the formula: sum of the squares of the probabilities of success and failure (p^2+q^2). Under this convention a higher score means a purer node; Gini impurity is 1 - (p^2+q^2).
    • Calculate Gini for the split as the weighted Gini score of each node of that split.
  • Entropy is the measure of impurity or randomness in the data.
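
The Gini calculation above can be sketched as follows. It uses the p^2+q^2 score from this card, so the split with the higher weighted score is the purer one. The candidate splits and their counts are hypothetical.

```python
def gini_score(successes, failures):
    """Gini score of one sub-node: p^2 + q^2 (higher means purer)."""
    total = successes + failures
    p = successes / total
    q = failures / total
    return p ** 2 + q ** 2


def weighted_gini(sub_nodes):
    """Gini score of a split: sub-node scores weighted by sub-node size."""
    total = sum(s + f for s, f in sub_nodes)
    return sum((s + f) / total * gini_score(s, f) for s, f in sub_nodes)


# Each tuple is (successes, failures) in one sub-node; numbers are made up.
split_a = [(8, 2), (3, 7)]
split_b = [(6, 4), (5, 5)]

best = max([split_a, split_b], key=weighted_gini)
print(weighted_gini(split_a), weighted_gini(split_b))
```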
9
Q

Entropy Vs Information Gain

A

Entropy is an indicator of how messy (impure) your data is. It generally decreases as you get closer to the leaf nodes.

Information Gain is the decrease in entropy after a dataset is split on an attribute. At each node, the tree picks the split with the highest Information Gain.
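
A minimal sketch of both quantities, assuming a hypothetical parent node with 5 "yes" and 5 "no" labels that is split into two child nodes:

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    result = 0.0
    for label in set(labels):
        p = labels.count(label) / total
        result -= p * log2(p)
    return result


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical split, for illustration only.
parent = ["yes"] * 5 + ["no"] * 5
children = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]
gain = information_gain(parent, children)
print(f"parent entropy = {entropy(parent):.3f}, information gain = {gain:.3f}")
```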

10
Q

Eigenvectors and Eigenvalues

A

Eigenvectors: Eigenvectors are the non-zero vectors whose direction remains unchanged (or is exactly reversed) when a linear transformation is applied to them.

Eigenvalues: An eigenvalue is the scalar by which the corresponding eigenvector is scaled under that transformation.
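
A small check of the defining relation A·v = λ·v, assuming an illustrative 2x2 matrix whose eigenvectors are easy to see by hand (the matrix is not from the source):

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


A = [[2, 1],
     [1, 2]]

v1 = [1, 1]    # eigenvector with eigenvalue 3
v2 = [1, -1]   # eigenvector with eigenvalue 1

# Applying A only rescales each eigenvector; its direction is unchanged.
print(mat_vec(A, v1))
print(mat_vec(A, v2))
```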

11
Q

Ensemble learning

A

Ensemble learning is a technique that is used to create multiple Machine Learning models, which are then combined to produce more accurate results. A general Machine Learning model is built by using the entire training data set.

However, in Ensemble Learning the training data set is split into multiple subsets, wherein each subset is used to build a separate model. After the models are trained, they are then combined to predict an outcome in such a way that the variance in the output is reduced.

12
Q

True Positive

False Positive

False Negative

True Negative

A
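  • True Positive: the model predicts the positive class and the actual class is positive.
  • False Positive: the model predicts the positive class but the actual class is negative.
  • False Negative: the model predicts the negative class but the actual class is positive.
  • True Negative: the model predicts the negative class and the actual class is negative.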
13
Q

Better False Positives vs False Negatives?

A

It depends on the question as well as on the domain for which we are trying to solve the problem.

  • If you’re using Machine Learning in the domain of medical testing, then a false negative is very risky, since the report will not show any health problem when a person is actually unwell.
  • Similarly, if Machine Learning is used in spam detection, then a false positive is very risky because the algorithm may classify an important email as spam.
14
Q

Inductive vs Deductive learning

A

Inductive learning is the process of using observations to draw conclusions

Deductive learning is the process of using conclusions to form observations

15
Q

KNN vs K-Means

A

KNN:

  • Supervised Learning model/technique
  • Classification or regression
  • K is the number of nearest neighbors used to make a prediction

K-Means:

  • Unsupervised Learning model/technique
  • Clustering (or grouping)
  • K is the number of clusters to identify/learn from the data
16
Q

Python libraries for Data Analysis

A
  • NumPy
  • SciPy
  • Pandas
  • scikit-learn
  • Matplotlib
  • Seaborn
  • Bokeh
17
Q

Deep Learning vs Machine Learning

A
  • Machine Learning is all about algorithms that parse data, learn from that data, and then apply what they’ve learned to make informed decisions.
  • Deep Learning is a form of machine learning that is inspired by the structure of the human brain and is particularly effective in feature detection.
18
Q

Types of Machine Learning

A
  • Supervised Learning - uses labeled data
  • Unsupervised Learning - uses unlabeled data
  • Reinforcement Learning - action-oriented, uses a system of rewards and penalties
19
Q

Dealing with Missing Value

A

4 ways:

  • Drop feature/column - if 95% or more of the feature’s data is missing
  • Drop rows - if less than 5% of the data is missing, it is safer to drop the rows
  • Use flag indicator - if between 50% and 95% of the data is missing, replace the column/feature with a flag indicator (i.e., something that says missing or not missing)
  • Imputation - if less than 50% of the data is missing, impute using the method best suited to the project:
    • mean, mode, median
    • replace missing values based on ratio/distribution
    • replace missing values using model prediction
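
The rules of thumb above can be sketched as a small helper plus a median imputation. The function names and the sample column are hypothetical, and because the card’s "less than 5%" and "less than 50%" ranges overlap, the sketch assumes that dropping rows takes precedence below 5%.

```python
from statistics import median


def choose_strategy(missing_fraction):
    """Map the fraction of missing values in a column to a handling strategy."""
    if missing_fraction >= 0.95:
        return "drop column"
    if missing_fraction >= 0.50:
        return "flag indicator"
    if missing_fraction < 0.05:
        return "drop rows"   # assumption: this rule wins over imputation
    return "impute"


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical column with 2 of 6 values missing.
ages = [25, None, 31, 40, None, 28]
missing_fraction = ages.count(None) / len(ages)
strategy = choose_strategy(missing_fraction)
imputed = impute_median(ages)
print(strategy, imputed)
```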
20
Q

Suppose you are given a data set which has missing values spread along 1 standard deviation from the median. What percentage of data would remain unaffected and Why?

A

Since the data is spread across the median, let’s assume it’s a normal distribution.
As you know, in a normal distribution, ~68% of the data lies within 1 standard deviation of the mean (which equals the mode and median), which leaves ~32% of the data unaffected. Therefore, ~32% of the data would remain unaffected by missing values.
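
The ~68% / ~32% figures can be checked with the standard normal distribution. This is only a sanity check under the normality assumption made in the answer:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Probability mass within one standard deviation of the mean.
within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)
outside_one_sd = 1 - within_one_sd

print(f"within 1 SD: {within_one_sd:.4f}, outside: {outside_one_sd:.4f}")
```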

21
Q

You are given a cancer detection data set. Let’s suppose when you build a classification model you achieved an accuracy of 96%. Why shouldn’t you be happy with your model performance? What can you do about it?

A

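96% accuracy can be misleading: cancer data sets are usually imbalanced, so a model can score high accuracy by predicting the majority (healthy) class while missing most cancer cases. Check metrics such as sensitivity (recall), specificity and the F-measure as well. The model also might not have been trained properly.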

You can do the following:

  • Add more data
  • Treat missing and outlier values
  • Feature Engineering
  • Feature Selection
  • Multiple Algorithms
  • Algorithm Tuning
  • Ensemble Method
  • Cross-Validation
22
Q

Suppose you found that your model is suffering from low bias and high variance. Which algorithm you think could tackle this situation and Why?

A

Type 1: How to tackle high variance?

  • Low bias occurs when the model’s predicted values are near to actual values.
  • In this case, we can use a bagging algorithm (e.g., Random Forest) to tackle the high variance problem.
  • Bagging algorithm will divide the data set into its subsets with repeated randomized sampling.
  • Once divided, these samples can be used to generate a set of models using a single learning algorithm. Later, the model predictions are combined using voting (classification) or averaging (regression).

Type 2: How to tackle high variance?

  • Lower the model complexity by using regularization technique, where higher model coefficients get penalized.
  • You can also use the top n features from the variable importance chart. It might be that, with all the variables in the data set, the algorithm has difficulty finding the meaningful signal.
23
Q

How do you map nicknames (Pete, Andy, Nick, Rob, etc) to real names?

A
  • This problem can be solved in a number of ways. Let’s assume that you’re given a data set containing thousands of Twitter interactions. You would begin by studying the relationship between two people by carefully analyzing the words used in the tweets.
  • This kind of problem statement can be solved by implementing Text Mining using Natural Language Processing techniques, wherein each sentence is broken down into words and correlations between various words are found.
  • NLP is actively used in understanding customer feedback and performing sentiment analysis on Twitter and Facebook. Thus, one of the ways to solve this problem is through Text Mining and Natural Language Processing techniques.
24
Q

You’re asked to build a random forest model with 10000 trees. During its training, you got training error as 0.00. But, on testing the validation error was 34.23. What is going on? Haven’t you trained your model perfectly?

A
  • The model is overfitting the data.
  • A training error of 0.00 means that the classifier has mimicked the training data patterns so closely that they do not carry over to unseen data.
  • When this classifier runs on the unseen sample, it is not able to find those patterns and returns predictions with a higher number of errors.
  • In Random Forest, this is usually driven by fully grown, deep trees rather than by the number of trees alone. To avoid such situations, tune hyperparameters such as tree depth, minimum samples per leaf and the number of trees using cross-validation.
25
Q

You are given a data set. The data set contains many variables, some of which are highly correlated and you know about it. Your manager has asked you to run PCA. Would you remove correlated variables first? Why?

A

Yes. Discarding correlated variables has a substantial effect on PCA because, in the presence of correlated variables, the variance explained by a particular component gets inflated.

26
Q

How would you predict who will renew their subscription next month? What data would you need to solve this? What analysis would you do? Would you build predictive models? If so, which algorithms?

A
  • Let’s assume that we’re trying to predict renewal rate for Netflix subscription. So our problem statement is to predict which users will renew their subscription plan for the next month.
  • Next, we must understand the data that is needed to solve this problem. In this case, we need to check the number of hours the channel is active for each household, the number of adults in the household, number of kids, which channels are streamed the most, how much time is spent on each channel, how much has the watch rate varied from last month, etc. Such data is needed to predict whether or not a person will continue the subscription for the upcoming month.
  • After collecting this data, it is important that you find patterns and correlations. For example, we know that if a household has kids, then they are more likely to subscribe. Similarly, by studying the watch rate of the previous month, you can predict whether a person is still interested in a subscription. Such trends must be studied.
  • The next step is analysis. For this kind of problem statement, you must use a classification algorithm that classifies customers into 2 groups:
    • Customers who are likely to subscribe next month
    • Customers who are not likely to subscribe next month
  • Would you build predictive models? Yes, in order to achieve this you must build a predictive model that classifies the customers into 2 classes like mentioned above.
  • Which algorithms to choose? You can choose classification algorithms such as Logistic Regression, Random Forest, Support Vector Machine, etc.
  • Once you’ve chosen the right algorithm, you must perform model evaluation to measure how well it performs. This is followed by deployment.
27
Q

‘People who bought this also bought…’ recommendations seen on Amazon is based on which algorithm?

A

E-commerce websites like Amazon make use of Machine Learning to recommend products to their customers. The basic idea of this kind of recommendation comes from collaborative filtering. Collaborative filtering is the process of comparing users with similar shopping behaviors in order to recommend products to a new user with similar shopping behavior.

  • User Based filtering
  • Content Based filtering
28
Q

You are asked to build a multiple regression model but your model R² isn’t as good as you wanted. For improvement, you remove the intercept term now your model R² becomes 0.8 from 0.3. Is it possible? How?

A
  • The intercept term refers to the model prediction without any independent variable, in other words, the mean prediction.
    R² = 1 – ∑(Y – Y´)²/∑(Y – Ymean)² where Y´ is the predicted value.
  • In the presence of the intercept term, the R² value evaluates your model with respect to the mean model.
  • In the absence of the intercept term, the model can make no such evaluation, and R² is computed with ∑(Y)² in the denominator instead of ∑(Y – Ymean)².
  • With the larger denominator, the value of ∑(Y – Y´)²/∑(Y)² becomes smaller than the actual ratio, thereby resulting in a higher value of R². So yes, it is possible.
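
The effect can be reproduced on a tiny made-up data set. The sketch fits a one-variable least-squares line with and without an intercept and computes R² with the centered denominator in the first case and the uncentered denominator ∑(Y)² in the second. The data and variable names are illustrative assumptions.

```python
# Hypothetical data: y hovers around 11-12 and barely depends on x.
x = [1, 2, 3, 4, 5]
y = [10, 12, 11, 13, 12]
n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Least-squares fit with an intercept.
slope = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
         / sum((xi - x_mean) ** 2 for xi in x))
intercept = y_mean - slope * x_mean
ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
ss_tot_centered = sum((yi - y_mean) ** 2 for yi in y)
r2_with_intercept = 1 - ss_res / ss_tot_centered

# Least-squares fit through the origin (no intercept).
slope0 = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
ss_res0 = sum((yi - slope0 * xi) ** 2 for xi, yi in zip(x, y))
ss_tot_uncentered = sum(yi ** 2 for yi in y)
r2_no_intercept = 1 - ss_res0 / ss_tot_uncentered

print(f"with intercept:    R^2 = {r2_with_intercept:.3f}")
print(f"without intercept: R^2 = {r2_no_intercept:.3f}")
```

The no-intercept fit has a much larger residual sum of squares, yet its R² is higher because the denominator changed, so the two values are not comparable.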
29
Q

You are working on a time series data set. Your manager has asked you to build a high accuracy model. You start with the decision tree algorithm since you know it works fairly well on all kinds of data. Later, you tried a time series regression model and got higher accuracy than the decision tree model. Can this happen? Why?

A
  • Yes, this can happen. Time series data often follows a linear trend, while a decision tree algorithm is known to work best at detecting non-linear interactions.
  • The decision tree fails to provide robust predictions. Why?
    • The reason is that it couldn’t map the linear relationship as well as a regression model did.
    • We also know that a linear regression model can provide a robust prediction only if the data set satisfies its linearity assumptions.
30
Q

Model Accuracy vs Model Performance

A
  • Model accuracy is only a subset of model performance.
  • The accuracy and the performance of the model generally move together.
  • The better the performance of the model, the more accurate the predictions.
31
Q

Check accuracy of a model in python

A

Use the accuracy_score function from scikit-learn (y_test holds the true labels, y_pred the model’s predictions):

from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, y_pred))

32
Q

NumPy and SciPy

A

NumPy is part of the SciPy ecosystem; the SciPy library itself is built on top of NumPy.

NumPy defines arrays along with some basic numerical functions like indexing, sorting, reshaping, etc.

SciPy implements computations such as numerical integration, optimization and statistics using NumPy’s functionality.

33
Q

Dealing with Outliers

A

Methods that can be used to find outliers:

  • Boxplot: A box plot represents the distribution of the data and its variability. The box plot contains the upper and lower quartiles, so the box basically spans the Inter-Quartile Range (IQR). One of the main reasons why box plots are used is to detect outliers in the data. Data points that lie beyond the whiskers, commonly set at 1.5 times the IQR below the lower quartile or above the upper quartile, are flagged as outliers.
  • Probabilistic and statistical models: Statistical models such as normal distribution and exponential distribution can be used to detect any variations in the distribution of data points. If any data point is found outside the distribution range, it is rendered as an outlier.
  • Linear models: Linear models such as logistic regression can be trained to flag outliers. In this manner, the model picks up the next outlier it sees.
  • Proximity-based models: An example of this kind of model is the K-means clustering model wherein, data points form multiple or ‘k’ number of clusters based on features such as similarity or distance. Since similar data points form clusters, the outliers also form their own cluster. In this way, proximity-based models can easily help detect outliers.

How do you handle these outliers?

  • If your data set is huge and rich then you can risk dropping the outliers.
  • However, if your data set is small then you can cap the outliers by setting a threshold percentile. For example, values above the 95th percentile can be capped at the 95th percentile value.
  • Lastly, based on the data exploration stage, you can narrow down some rules and impute the outliers based on those business rules.
34
Q

Overfitting

A

Over-fitting occurs when a model studies the training data to such an extent that it negatively influences the performance of the model on new data.

This means that the noise in the training data is recorded and learned as concepts by the model. But the problem here is that these concepts do not apply to the testing data and negatively impact the model’s ability to classify the new data, hence reducing the accuracy on the testing data.

Three main methods to avoid overfitting:

  • Collect more data so that the model can be trained with varied samples.
  • Use ensembling methods, such as Random Forest. It is based on the idea of bagging, which is used to reduce the variation in the predictions by combining the result of multiple Decision trees on different samples of the data set.
  • Choose the right algorithm.
35
Q

Pandas series vs single-column DataFrame

A

A Pandas Series is a 1-dimensional array with an index

A single-column Pandas DataFrame is a 2-dimensional structure with an index and a column

36
Q

Precision vs Recall

A

Recall is the ratio of the number of events you correctly recall to the total number of actual events: TP / (TP + FN).

Precision is the ratio of the number of events you correctly recall to the total number of events you recall (a mix of correct and wrong recalls): TP / (TP + FP).
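
In terms of true positives (TP), false positives (FP) and false negatives (FN), the two ratios can be sketched as below. The counts are hypothetical: 8 events recalled correctly, 2 recalled wrongly and 4 missed.

```python
def precision(tp, fp):
    """Share of recalled (predicted positive) events that are correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual events that were recalled."""
    return tp / (tp + fn)


# Hypothetical counts, for illustration only.
tp, fp, fn = 8, 2, 4
print(f"precision = {precision(tp, fp):.2f}, recall = {recall(tp, fn):.2f}")
```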

37
Q

We have two options for serving ads within Newsfeed:
1 – out of every 25 stories, one will be an ad
2 – every story has a 4% chance of being an ad

For each option, what is the expected number of ads shown in 100 news stories?
If we go with option 2, what is the chance a user will be shown only a single ad in 100 stories? What about no ads at all?

A
  • For option 1, 100/25 = 4 ads will be shown.
  • For option 2, each story has a 4% chance of being an ad, so the expected number is 100 * 0.04 = 4 ads per 100 stories.
  • Chance of a single ad:
    • This can be solved using the Binomial distribution, which needs:
      • The probability of success, which in our case is 4% (so the probability of failure is 96%).
      • The total number of trials, which is 100 in our case.
      • The number of successes of interest, which is 1 (a single ad in 100 stories).
    • p(single ad in one given position) = (0.96)^99*(0.04)^1
      (note: 0.96 is the chance that a story is not an ad, 99 is the number of stories that are not ads, and 0.04 is the chance that a story is an ad)
    • In total, there are 100 positions for the ad. Therefore, 100 * p(single ad in one given position) = 7.03%
  • Chance of no ads at all: (0.96)^100 = 1.69%
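
The option 2 figures follow from the binomial probability mass function with n = 100 and p = 0.04. A short sketch (the helper name is illustrative):

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


n_stories = 100
p_ad = 0.04

expected_ads = n_stories * p_ad
p_single_ad = binomial_pmf(1, n_stories, p_ad)
p_no_ads = binomial_pmf(0, n_stories, p_ad)

print(f"expected ads: {expected_ads:.1f}")
print(f"P(exactly one ad) = {p_single_ad:.4f}")
print(f"P(no ads)         = {p_no_ads:.4f}")
```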
38
Q

A jar has 1000 coins, of which 999 are fair and 1 is double headed. Pick a coin at random, and toss it 10 times. Given that you see 10 heads, what is the probability that the next toss of that coin is also a head?

A
  • There are two ways of choosing a coin. One is to pick a fair coin and the other is to pick the one with two heads.
  • Probability of selecting fair coin = 999/1000 = 0.999
  • Probability of selecting unfair coin = 1/1000 = 0.001
  • Probability of 10 heads in a row = P(fair coin) * P(10 heads | fair coin) + P(unfair coin) * P(10 heads | unfair coin)
  • P(A) = P(fair coin and 10 heads) = 0.999 * (1/2)^10 = 0.999 * (1/1024) = 0.000976
  • P(B) = P(unfair coin and 10 heads) = 0.001 * 1 = 0.001
  • P(fair coin | 10 heads) = P(A) / (P(A) + P(B)) = 0.000976 / (0.000976 + 0.001) = 0.4939
  • P(unfair coin | 10 heads) = P(B) / (P(A) + P(B)) = 0.001 / 0.001976 = 0.5061
  • Probability of another head = P(fair coin | 10 heads) * 0.5 + P(unfair coin | 10 heads) * 1 = 0.4939 * 0.5 + 0.5061 = 0.7531
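
The same Bayes update can be checked with exact fractions. This is a sketch of the calculation in this card, with illustrative variable names:

```python
from fractions import Fraction

prior_fair = Fraction(999, 1000)
prior_double = Fraction(1, 1000)

# Likelihood of observing 10 heads in a row for each kind of coin.
likelihood_fair = Fraction(1, 2) ** 10
likelihood_double = Fraction(1)

evidence = prior_fair * likelihood_fair + prior_double * likelihood_double
posterior_fair = prior_fair * likelihood_fair / evidence
posterior_double = prior_double * likelihood_double / evidence

# The next toss is a head with probability 1/2 (fair) or 1 (double-headed).
p_next_head = posterior_fair * Fraction(1, 2) + posterior_double * 1

print(f"P(fair | 10 heads)   = {float(posterior_fair):.4f}")
print(f"P(double | 10 heads) = {float(posterior_double):.4f}")
print(f"P(next toss is head) = {float(p_next_head):.4f}")
```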
39
Q

There’s a game where you are asked to roll two fair six-sided dice. If the sum of the values on the dice equals seven, then you win $21. However, you must pay $5 to play each time you roll both dice. Do you play this game? And in the follow-up: If he plays 6 times what is the probability of making money from this game?

A

Given:

  • $5 to roll/play
  • win $21 if roll equal 7 (sum of both dice)
    (make $16 profit if win)

Probability of 7:

  • (1,6), (2,5), (3,4), (4,3), (5,2) and (6,1) -> 6/36 -> 1/6 (about 17%)
  • On average, you win 1 game out of every 6 but pay for all 6
  • $21 - 6*$5 = ($9) -> not worth playing
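
The expected value per play and the follow-up question can be sketched as below. The sketch assumes that "making money" over 6 plays means total winnings exceed the total entry fees; that interpretation and the helper name are assumptions, not part of the source answer.

```python
from fractions import Fraction
from math import comb

p_win = Fraction(6, 36)   # six of the 36 outcomes sum to seven
prize, cost = 21, 5

# Expected profit of a single play: win $21 with probability 1/6, always pay $5.
expected_profit_per_play = p_win * prize - cost


def prob_profit(n_plays):
    """Probability that total winnings exceed total fees over n_plays games."""
    total = Fraction(0)
    for wins in range(n_plays + 1):
        if wins * prize > n_plays * cost:
            total += comb(n_plays, wins) * p_win ** wins * (1 - p_win) ** (n_plays - wins)
    return total


p_profit_6 = prob_profit(6)
print(f"expected profit per play: {float(expected_profit_per_play):.2f}")
print(f"P(profit after 6 plays):  {float(p_profit_6):.4f}")
```

With 6 plays the fees total $30, so at least 2 wins are needed to come out ahead.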
40
Q

Remove duplicate from dataset

A

Removing Duplicates

If dataframe, use Pandas DataFrame method .drop_duplicates()

example:

bill_data_uniq = bill_data.drop_duplicates()

41
Q

ROC curve

A

Receiver Operating Characteristic curve (or ROC curve) is a fundamental tool for diagnostic test evaluation and is a plot of the true positive rate (Sensitivity) against the false positive rate (1 - Specificity) for the different possible cut-off points of a diagnostic test.

  • It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).
  • The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.
  • The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
  • The slope of the tangent line at a cutpoint gives the likelihood ratio (LR) for that value of the test.
  • The area under the curve is a measure of test accuracy.
42
Q

plotting in Python: Seaborn or Matplotlib or Bokeh

A
  • Matplotlib: Used for basic plotting like bars, pies, lines, scatter plots, etc
  • Seaborn: Is built on top of Matplotlib and Pandas to ease data plotting. It is used for statistical visualizations like creating heatmaps or showing the distribution of your data
  • Bokeh: Used for interactive visualization. In case your data is too complex and you haven’t found any “message” in the data, then use Bokeh to create interactive visualizations that will allow your viewers to explore the data themselves
43
Q

Selection Bias

A
  • It is a statistical error that causes a bias in the sampling portion of an experiment.
  • The error causes one sampling group to be selected more often than other groups included in the experiment.
  • Selection bias may produce an inaccurate conclusion if the selection bias is not identified.
44
Q

Write an SQL query that makes recommendations using the pages that your friends liked. Assume you have two tables: a two-column table of users and their friends, and a two-column table of users and the pages they liked. It should not recommend pages you already like.

A

SELECT DISTINCT page_liked
FROM Page_Liked_Table
WHERE userID IN (SELECT friends FROM Friends_Table WHERE userID = me)
AND page_liked NOT IN (SELECT page_liked FROM Page_Liked_Table WHERE userID = me)
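
To try the query end to end, it can be run against an in-memory SQLite database from Python. The sketch keeps the table and column names from the query; the sample rows and the `:me` parameter (standing in for the current user’s ID) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Friends_Table (userID INTEGER, friends INTEGER);
    CREATE TABLE Page_Liked_Table (userID INTEGER, page_liked TEXT);
    -- Hypothetical data: user 1 is friends with users 2 and 3.
    INSERT INTO Friends_Table VALUES (1, 2), (1, 3);
    INSERT INTO Page_Liked_Table VALUES
        (1, 'python'), (2, 'python'), (2, 'pandas'), (3, 'numpy'), (4, 'sql');
""")

query = """
    SELECT DISTINCT page_liked
    FROM Page_Liked_Table
    WHERE userID IN (SELECT friends FROM Friends_Table WHERE userID = :me)
      AND page_liked NOT IN (SELECT page_liked FROM Page_Liked_Table WHERE userID = :me)
"""

# Pages liked by friends of user 1, excluding pages user 1 already likes.
recommendations = sorted(row[0] for row in conn.execute(query, {"me": 1}))
print(recommendations)
conn.close()
```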

45
Q

Type I and Type II error

A

Type I = False Positive

Type II = False Negative