Machine Learning Flashcards

Question

You are given a data set with many variables, some of which you know to be highly correlated. Your manager has asked you to run PCA. Would you remove the correlated variables first? Why?

Answer 1

Discarding correlated variables will have a substantial effect on PCA because, in the presence of correlated variables, the variance explained by a particular component gets inflated.
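The inflation effect can be seen in a small experiment. The sketch below is illustrative only: it assumes NumPy is available, runs PCA on the correlation matrix, and uses synthetic data; the helper name `explained_variance_ratio` and the variables `a`, `b`, and `noise` are hypothetical.

```python
import numpy as np

def explained_variance_ratio(X):
    """Share of total variance captured by each principal component.

    PCA is done here on the correlation matrix (an assumption of this sketch),
    so the eigenvalues are the component variances of standardized data.
    """
    corr = np.corrcoef(X, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # sort descending
    return eigenvalues / eigenvalues.sum()

# Synthetic data: two independent variables, plus a near-copy of the first.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
noise = rng.normal(scale=0.1, size=500)

independent = np.column_stack([a, b])
with_correlated = np.column_stack([a, a + noise, b])

ratio_independent = explained_variance_ratio(independent)
ratio_correlated = explained_variance_ratio(with_correlated)

print("first component, independent variables:", ratio_independent[0])
print("first component, with a correlated copy:", ratio_correlated[0])
```

With the near-duplicate column included, the first component absorbs the shared variance of both copies, so its share of explained variance rises even though no new information was added.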

Answer 2

* Let’s assume that we’re trying to predict the renewal rate for Netflix subscriptions. The problem statement is to predict which users will renew their subscription plan for the next month.
* Next, we must understand the data needed to solve this problem. In this case, we need the number of hours the channel is active for each household, the number of adults in the household, the number of kids, which channels are streamed the most, how much time is spent on each channel, how much the watch rate has varied from last month, etc. Such data is needed to predict whether or not a person will continue the subscription for the upcoming month.
* After collecting this data, it is important to find patterns and correlations. For example, we know that if a household has kids, then it is more likely to subscribe. Similarly, by studying the watch rate of the previous month, you can predict whether a person is still interested in a subscription. Such trends must be studied.
* The next step is analysis. For this kind of problem statement, you must use a classification algorithm that classifies customers into 2 groups:
  * Customers who are likely to subscribe next month
  * Customers who are not likely to subscribe next month
* Would you build predictive models? Yes, in order to achieve this you must build a predictive model that classifies the customers into the 2 classes mentioned above.
* Which algorithms to choose? You can choose classification algorithms such as Logistic Regression, Random Forest, Support Vector Machine, etc.
* Once you’ve chosen the right algorithm, you must perform model evaluation to measure how well it performs. This is followed by deployment.

Answer 3

E-commerce websites like Amazon use Machine Learning to recommend products to their customers. The basic idea behind this kind of recommendation comes from collaborative filtering: comparing users with similar shopping behavior in order to recommend products to a new user whose behavior is similar. Common filtering approaches:
* User-based filtering
* Content-based filtering

Answer 4

* The intercept term corresponds to the model's prediction without any independent variable, in other words the mean prediction. R² = 1 – ∑(Y – Y´)²/∑(Y – Ymean)², where Y´ is the predicted value.
* In the presence of the intercept term, R² evaluates your model with respect to the mean model.
* In the absence of the intercept term, the model can make no such comparison, and Ymean is dropped from the denominator, which becomes ∑(Y)².
* With this larger denominator, the ratio ∑(Y – Y´)²/∑(Y)² becomes smaller than it should be, resulting in a higher value of R².
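The effect of the denominator can be checked numerically. The following is a minimal sketch in plain Python, assuming a least-squares fit through the origin on made-up data; the function `r_squared` and its `centered` flag are hypothetical names introduced for illustration.

```python
def r_squared(y, y_pred, centered=True):
    """R² = 1 - SS_res / SS_tot.

    centered=True  -> SS_tot = sum((Y - Ymean)²), the usual definition.
    centered=False -> SS_tot = sum(Y²), the form used when there is no intercept.
    """
    baseline = sum(y) / len(y) if centered else 0.0
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    ss_tot = sum((yi - baseline) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical data: a large mean and only a weak relationship with x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 12.0, 9.0, 11.0, 13.0]

# Least-squares fit through the origin (no intercept): slope = sum(x*y) / sum(x*x).
slope = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
y_pred = [slope * xi for xi in x]

r2_vs_mean = r_squared(y, y_pred, centered=True)
r2_no_intercept = r_squared(y, y_pred, centered=False)

print("R² against the mean model:", r2_vs_mean)
print("R² with ∑(Y)² in the denominator:", r2_no_intercept)
```

The predictions are identical in both calls; only the denominator changes, which is why the second value looks far better than the first.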

Answer 5

* The time series data in question follows a linear trend, while a decision tree algorithm is known to work best at detecting non-linear interactions.
* Why does the decision tree fail to provide robust predictions? Because it cannot map the linear relationship as well as a regression model does.
* A linear regression model, in turn, provides robust predictions only if the data set satisfies its linearity assumptions.

Answer 6

* Model accuracy is only a subset of model performance.
* The accuracy of the model and the performance of the model are directly proportional.
* The better the performance of the model, the more accurate its predictions.

Answer 7

Use the **accuracy\_score** function:

    from sklearn.metrics import accuracy_score

    # y_test: true labels, y_pred: labels predicted by the model
    print(accuracy_score(y_test, y_pred))

Answer 8

*SciPy is built on top of NumPy.* **NumPy** defines **arrays along with some basic numerical functions** like indexing, sorting, reshaping, etc. **SciPy** implements computations such as **numerical integration, optimization, and statistics using NumPy’s functionality**.

Answer 9

Methods that can be used to find outliers:
* **Boxplot**: A box plot represents the distribution of the data and its variability. The box spans the upper and lower quartiles, that is, the *Inter-Quartile Range (IQR)*. One of the main reasons box plots are used is to detect outliers: data points that lie far outside the box (by convention, more than 1.5 × IQR beyond either quartile) are flagged as outliers.
* **Probabilistic and statistical models**: Statistical models such as the normal distribution and the exponential distribution can be fitted to the data. A data point that falls in a very low-probability region of the fitted distribution is treated as an outlier.
* **Linear models**: Linear models such as logistic regression can be trained to flag outliers. In this manner, the model picks up the next outlier it sees.
* **Proximity-based models**: An example of this kind of model is K-means clustering, in which data points form ‘k’ clusters based on similarity or distance. Since similar data points cluster together, outliers end up far from the main clusters or in small clusters of their own. In this way, proximity-based models can help detect outliers.

How do you handle these outliers?
* If your data set is huge and rich, you can risk **dropping** the outliers.
* However, if your data set is small, you can cap the outliers by **setting a threshold percentile**. For example, data points above the 95th percentile can be capped at the 95th-percentile value.
* Lastly, based on the data exploration stage, you can narrow down some rules and **impute** the outliers based on those business rules.
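The box plot rule and the capping strategy can be sketched with the standard library alone. This is an illustration under assumptions: it uses the conventional 1.5 × IQR fences, caps values at those fences rather than at a fixed percentile, and the function names and sample data are hypothetical.

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences: quartiles extended by k * IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, k=1.5):
    """Values that a box plot would draw as individual outlier points."""
    low, high = iqr_fences(values, k)
    return [v for v in values if v < low or v > high]

def cap_outliers(values, k=1.5):
    """Cap (rather than drop) extreme values at the fences."""
    low, high = iqr_fences(values, k)
    return [min(max(v, low), high) for v in values]

# Hypothetical sample with one obviously extreme value.
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]

outliers = find_outliers(data)
capped = cap_outliers(data)

print("outliers:", outliers)
print("capped data:", capped)
```

Capping keeps the row count unchanged, which matters for small data sets where dropping rows is costly.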

Answer 10

**Over-fitting** occurs when a model learns the training data to such an extent that it negatively affects the model's performance on new data. The noise in the training data is recorded and learned as concepts by the model. These concepts do not apply to the testing data, so they hurt the model’s ability to classify new data and reduce accuracy on the testing data.

Three main methods to avoid overfitting:
* **Collect more data** so that the model can be trained with varied samples.
* Use **ensembling methods**, such as Random Forest. It is based on the idea of bagging, which reduces the variance of the predictions by combining the results of multiple decision trees trained on different samples of the data set.
* **Choose the right algorithm.**

Answer 11

A Pandas Series is a one-dimensional, single-column array with an index. A Pandas DataFrame is a two-dimensional structure with an index and columns.

Answer 12

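**Recall** is the ratio of the number of events you correctly recall to the total number of events that actually occurred: TP / (TP + FN). **Precision** is the ratio of the number of events you correctly recall to the total number of events you recall (a mix of correct and wrong recalls): TP / (TP + FP). Here TP, FP, and FN are the counts of true positives, false positives, and false negatives.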
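Both ratios can be computed directly from counts of correct and wrong recalls. The sketch below uses plain Python on made-up labels; the function name `precision_recall` and the sample lists are hypothetical.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute (precision, recall) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # correct recalls
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # wrong recalls
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # missed events
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical example: 4 real events, 3 events recalled, 2 of them correct.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print("precision:", precision)
print("recall:", recall)
```

In this example two of three recalled events are correct (precision 2/3), while only two of the four real events were recalled (recall 1/2).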

Answer 13

* For option 1, 100/25 = 4 ads will be shown.
* For option 2, each story has a 4% chance of being an ad, so on average there should be 4 ads per 100 stories.
* Chance of a single ad:
  * This can be solved using the Binomial distribution, which involves three quantities:
    * The probability of success on a single trial, which in our case is 4% (so the probability of failure is 96%).
    * The total number of trials, which is 100 in our case.
    * The outcome whose probability we want, which is that a user is shown exactly one ad in 100 stories.
  * p(ad at one specific position only) = (0.96)^99 \* (0.04)^1 (note: 0.96 is the chance that a single story is not an ad, 99 is the number of stories without an ad, and 0.04 is the chance that a single story is an ad)
  * In total, there are 100 positions for the ad. Therefore, p(single ad) = 100 \* (0.96)^99 \* (0.04) = 7.03%
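The 7.03% figure can be reproduced with the binomial probability mass function. This is a small sketch using only the standard library; `binomial_pmf` is a hypothetical helper name.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Exactly one ad among 100 stories, each story being an ad with probability 4%.
p_single_ad = binomial_pmf(k=1, n=100, p=0.04)
print(f"{p_single_ad:.2%}")  # prints 7.03%
```

Here `comb(100, 1)` supplies the factor of 100 for the possible positions of the ad.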

Answer 14

* There are two ways of choosing a coin. One is to pick a fair coin and the other is to pick the one with two heads.
* Probability of selecting a fair coin = 999/1000 = 0.999
* Probability of selecting the unfair coin = 1/1000 = 0.001
* P(10 heads in a row) = P(selecting a fair coin) \* P(10 heads with a fair coin) + P(selecting the unfair coin) \* 1
* P(A) = P(fair coin and 10 heads) = 0.999 \* (1/2)^10 = 0.999 \* (1/1024) = 0.000976
* P(B) = P(unfair coin and 10 heads) = 0.001 \* 1 = 0.001
* P(fair | 10 heads) = P(A) / (P(A) + P(B)) = 0.000976 / (0.000976 + 0.001) = 0.4939
* P(unfair | 10 heads) = P(B) / (P(A) + P(B)) = 0.001 / 0.001976 = 0.5061
* Probability of another head = P(fair | 10 heads) \* 0.5 + P(unfair | 10 heads) \* 1 = 0.4939 \* 0.5 + 0.5061 = 0.7531
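The same Bayes' rule calculation can be carried out with exact fractions to confirm the rounded figures. This is a sketch; the variable names are hypothetical.

```python
from fractions import Fraction

# Priors: 999 fair coins and 1 two-headed coin out of 1000.
prior_fair = Fraction(999, 1000)
prior_unfair = Fraction(1, 1000)

# Likelihood of observing 10 heads in a row with each kind of coin.
likelihood_fair = Fraction(1, 2) ** 10
likelihood_unfair = Fraction(1)

# Total probability of 10 heads, then the posterior for each coin.
evidence = prior_fair * likelihood_fair + prior_unfair * likelihood_unfair
posterior_fair = prior_fair * likelihood_fair / evidence
posterior_unfair = prior_unfair * likelihood_unfair / evidence

# Probability that the next toss is also a head.
p_next_head = posterior_fair * Fraction(1, 2) + posterior_unfair * 1

print(float(posterior_fair), float(posterior_unfair), float(p_next_head))
```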

Answer 15

Given:
* $5 to roll/play
* Win $21 if the roll equals 7 (sum of both dice), a $16 profit on a win

Probability of 7:
* (1,6), (2,5), (3,4), (4,3), (5,2) and (6,1) -\> 6/36 -\> 1/6 (about 17%)
* On average, 6 games means winning 1 game and paying for 6
* $21 - 6\*$5 = ($9) -\> not worth playing
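The expected value per game can be checked by enumerating all 36 rolls. A minimal sketch with the standard library; the names `COST`, `PRIZE`, and `expected_profit` are hypothetical.

```python
from fractions import Fraction
from itertools import product

COST = 5    # dollars paid per game
PRIZE = 21  # dollars received when the two dice sum to 7

# Enumerate every ordered outcome of two six-sided dice.
rolls = list(product(range(1, 7), repeat=2))
p_seven = Fraction(sum(1 for a, b in rolls if a + b == 7), len(rolls))

# Expected profit per game, and over 6 games as in the reasoning above.
expected_profit = p_seven * PRIZE - COST
expected_over_six = 6 * expected_profit

print(p_seven, float(expected_profit), float(expected_over_six))
```

A negative expected profit per game gives the same conclusion as the six-game argument: the game is not worth playing.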

Answer 16

If the data is in a DataFrame, use the Pandas DataFrame method **.drop\_duplicates()**. Example:

    # Removing duplicates
    bill_data_uniq = bill_data.drop_duplicates()

Answer 17

The **Receiver Operating Characteristic** curve (or ROC curve) is a fundamental tool for diagnostic test evaluation. It is a plot of the true positive rate (sensitivity) against the false positive rate (1 − specificity) for the different possible cut-off points of a diagnostic test.
* It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).
* The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.
* The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
* The slope of the tangent line at a cutpoint gives the likelihood ratio (LR) for that value of the test.
* The area under the curve is a measure of test accuracy.
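To make the construction concrete, the sketch below sweeps the cut-off over a set of scores, records one (false positive rate, true positive rate) point per cut-off, and estimates the area under the curve with the trapezoidal rule. It is plain Python on made-up data; `roc_points`, `auc`, and the sample labels and scores are hypothetical.

```python
def roc_points(labels, scores):
    """One (FPR, TPR) point per cut-off, from the strictest cut-off to the loosest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # cut-off above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        flagged = [y for y, s in zip(labels, scores) if s >= threshold]
        tp = sum(flagged)            # flagged and actually positive
        fp = len(flagged) - tp       # flagged but actually negative
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical test results: 1 = condition present, higher score = more suspicious.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
area = auc(points)
print(points)
print("AUC:", area)
```

A test that ranks every positive above every negative traces the left and top borders and has an area of 1, while a test no better than chance stays near the diagonal with an area close to 0.5.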

Answer 18

* **Matplotlib**: Used for **basic plotting** like bars, pies, lines, scatter plots, etc.
* **Seaborn**: Built on top of Matplotlib and integrated with Pandas to ease data plotting. It is used for **statistical visualizations** like creating heatmaps or showing the distribution of your data.
* **Bokeh**: Used for **interactive visualization**. If your data is too complex and you haven’t found any “message” in the data, use Bokeh to create interactive visualizations that will **allow your viewers to explore the data themselves**.

Answer 19

* Selection bias is a statistical error that causes a **bias in the sampling portion of an experiment.**
* The **error causes one sampling group to be selected more often than other** groups included in the experiment.
* Selection bias may **produce an inaccurate conclusion** if it is not identified.

Answer 20

    SELECT DISTINCT page_liked
    FROM Page_Liked_Table
    WHERE userID IN (SELECT friends FROM Friends_Table WHERE userID = me)
      AND page_liked NOT IN (SELECT page_liked FROM Page_Liked_Table WHERE userID = me)

Answer 21

Type I = False Positive; Type II = False Negative

source: https://www.edureka.co/blog/interview-questions/machine-learning-interview-questions/ (45 cards)