DS Interview Qs Flashcards
(99 cards)
List the differences between supervised and unsupervised learning
Supervised Learning: Uses known, labeled data as input and has a feedback mechanism. Common algorithms are decision trees, logistic regression, and support vector machines
Unsupervised Learning: Uses unlabeled data as input and has no feedback mechanism. Common algorithms are k-means clustering, hierarchical clustering, and the Apriori algorithm
How is logistic regression done?
Logistic regression measures the relationship between the dependent variable (our label, what we want to predict) and one or more independent variables (our features) by estimating probabilities using its underlying logistic function (sigmoid)
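A minimal sketch of the prediction step: the sigmoid maps a linear combination of the features to a probability. The function names, weights, and bias below are illustrative assumptions, not fitted values.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one observation."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients, chosen only for illustration
p = predict_proba(features=[2.0, 1.0], weights=[0.5, -1.0], bias=0.0)
# z = 0.5*2.0 - 1.0*1.0 = 0, so p is 0.5
```

In practice the weights and bias are estimated from the labeled training data, usually by maximizing the likelihood.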
Explain the steps in making a decision tree
- Take the entire dataset as input
- Calculate entropy of target variable as well as predictor attributes
- Calculate information gain of all attributes
- Choose the attribute with highest information gain as the root node
- Repeat process on every branch till the decision node of each branch is finalized.
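The entropy and information-gain steps above can be sketched as follows. The toy rows, labels, and function names are invented for illustration, and only the root-node choice is shown.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy data, invented for illustration
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

# The attribute with the highest information gain becomes the root node
root = max(["outlook", "windy"], key=lambda a: information_gain(rows, labels, a))
```

Here "outlook" separates the labels perfectly, so it is chosen as the root; a full tree would repeat the same calculation on each branch.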
How do you build a random forest model?
- Randomly select k features from total m features where k < m
- Among the k features, calculate the node d using the best split point
- Split the node into daughter nodes using the best split
- Repeat steps 2 and 3 until leaf nodes are finalized
- Build the forest by repeating steps 1 to 4 n times to create n trees, typically growing each tree on a bootstrap sample of the rows
How can you avoid overfitting your model?
Three main methods to avoid overfitting:
- Keep the model simple: take fewer variables into account, thereby removing some of the noise in the training data
- Use cross-validation techniques such as k-fold cross-validation (pre-data)
- Use regularization techniques such as LASSO that penalize certain model parameters if they’re likely to cause overfitting (during the process)
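A rough sketch of how k-fold cross-validation splits the data: each fold serves once as the validation set while the rest is used for training. The function name is hypothetical, and the sketch does not shuffle, which real use would typically do first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
```

A model would be fit on each train split and scored on the matching validation split, and the scores averaged.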
Differentiate between univariate, bivariate, and multivariate analysis
Univariate: contains only one variable. The purpose of univariate analysis is to describe the data and find patterns within it; conclusions can be drawn using mean, median, mode, min, max, etc.
Bivariate: contains two variables. Bivariate analysis deals with causes and relationships; its purpose is to find the relationship between the two variables, such as the proportion of one variable to another. Used for description and prediction
Multivariate: contains multiple variables. The purpose of multivariate analysis is the same as bivariate but with more variables, for example using data about a house to predict its price. It can be descriptive, predictive, or prescriptive (change the variables to guess what the outcome is)
What are the feature selection methods to select the right variables?
Two main methods for feature selection:
1. Filter Method (bad data in, bad answer out; cleaning the data, preprocessing)
- Linear Discriminant Analysis
- ANOVA
- Chi-Square (most common)
2. Wrapper Method (labor intensive)
- Forward Selection (start with the features off to the side; test one feature at a time, adding one in until we get a good fit)
- Backward Selection (start with all features; run the test and remove one at a time until we get a good fit)
- Recursive Feature Elimination (repeatedly fits a model and removes the least important features until the desired number remains)
Write a program that prints the numbers 1-50. For multiples of 3 print Fizz, for multiples of 5 print Buzz, and for multiples of both 3 and 5 print FizzBuzz
for i in range(1, 51):
    # Check the combined case first, since multiples of 15 also match 3 and 5
    if i % 3 == 0 and i % 5 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)
You are given a dataset consisting of variables having more than 30% missing values. How will you deal with them?
Ways to handle missing data:
1. If the dataset is huge, we can simply remove the rows with missing data values. It's the quickest way, and we can use the rest of the data to predict values
2. We can substitute missing values with the mean of the rest of the data using a pandas DataFrame in Python: df.fillna(df.mean())
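Both options can be sketched in plain Python. This is a stand-in for the pandas calls rather than the pandas API itself; the function names and the toy data are assumptions, with None marking a missing value.

```python
from statistics import mean

def drop_missing(rows):
    """Option 1: keep only rows with no missing (None) values."""
    return [row for row in rows if None not in row]

def fill_with_column_mean(rows):
    """Option 2: replace each None with the mean of its column's observed values."""
    columns = list(zip(*rows))
    column_means = [mean(v for v in col if v is not None) for col in columns]
    return [
        [column_means[j] if value is None else value
         for j, value in enumerate(row)]
        for row in rows
    ]

# Toy numeric data, invented for illustration
data = [[1.0, 10.0], [None, 20.0], [3.0, None]]
complete_rows = drop_missing(data)
imputed = fill_with_column_mean(data)
```

Dropping rows leaves only one of the three rows here, which shows why removal is only attractive when the dataset is huge.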
For the given points, how will you calculate the Euclidean distance in Python?
euclidean_distance = sqrt((plot1[0] - plot2[0])**2 + (plot1[1] - plot2[1])**2)
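A runnable version of the formula, generalized to points of any equal dimension. The sample points are invented for illustration.

```python
import math

def euclidean_distance(plot1, plot2):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(plot1, plot2)))

plot1 = [1, 3]
plot2 = [4, 7]
distance = euclidean_distance(plot1, plot2)  # sqrt(3**2 + 4**2) = 5.0
```

On Python 3.8 or later, math.dist(plot1, plot2) computes the same value.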
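What is the angle between the hour and minute hands of a clock when the time is half past 6?
At 6:30 the minute hand points at the 6 (180 degrees) and the hour hand is halfway between the 6 and the 7 (195 degrees). The hour hand moves 360 / 12 = 30 degrees per hour, so in half an hour it moves 360 / 24 = 15 degrees, which is the angle between the hands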
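The same reasoning works for any time. A small sketch, with a hypothetical function name:

```python
def clock_angle(hour, minute):
    """Smaller angle in degrees between the hour and minute hands."""
    minute_angle = minute * 6.0                     # 360 degrees / 60 minutes
    hour_angle = (hour % 12) * 30.0 + minute * 0.5  # 30 per hour, plus drift
    difference = abs(hour_angle - minute_angle)
    return min(difference, 360.0 - difference)

angle = clock_angle(6, 30)  # hour hand at 195, minute hand at 180, so 15.0
```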
Explain dimensionality reduction and list its benefits
Def: Dimensionality reduction refers to the process of converting a dataset with a vast number of dimensions (fields) into data with fewer dimensions that conveys similar information concisely
Benefits:
1. It helps with data compression and reduces storage space
2. It reduces computation time, as fewer dimensions lead to less computing
3. It removes redundant features; for example, there is no point in storing a value in two different units (inches and feet)
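How will you calculate the eigenvalues and eigenvectors of a 3x3 matrix?
Solve the characteristic equation det(A - λI) = 0 for the eigenvalues λ (a cubic for a 3x3 matrix), then for each λ solve (A - λI)v = 0 for the eigenvector v. Look up an eigenvalues and eigenvectors video and do a practice question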
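As a practice question, here is one worked example of the standard approach: find the eigenvalues from det(A - λI) = 0, then solve (A - λI)v = 0 for each one. The matrix is chosen only for illustration and is not from the original card.

```latex
A = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 3 & 4 \\ 0 & 4 & 9 \end{pmatrix}, \qquad
\det(A - \lambda I) = (2 - \lambda)\bigl[(3 - \lambda)(9 - \lambda) - 16\bigr]
= (2 - \lambda)(\lambda - 1)(\lambda - 11) = 0

\lambda_1 = 1, \quad \lambda_2 = 2, \quad \lambda_3 = 11

(A - \lambda_i I)\, v_i = 0 \;\Rightarrow\;
v_1 = \begin{pmatrix} 0 \\ -2 \\ 1 \end{pmatrix}, \quad
v_2 = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \quad
v_3 = \begin{pmatrix} 0 \\ 1 \\ 2 \end{pmatrix}
```

Each eigenvector is defined only up to a nonzero scale factor, and the result can be checked by confirming that A v_i = λ_i v_i.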
How should you maintain your deployed model?
Steps:
1. Monitor: constant monitoring of all models is needed to determine how accurately each one is performing
2. Evaluate: evaluation metrics of the current model are calculated to determine whether a new algorithm is needed
3. Compare: the new models are compared against each other to determine which performs best
4. Rebuild: the best-performing model is rebuilt on the current state of the data
What are recommender systems?
A recommender system predicts the “rating” or “preference” a user would give to a product
There are two types:
1. Collaborative Filtering: for example, Last.fm recommends tracks that are often played by other users with similar interests
2. Content-based Filtering: Pandora uses the properties of a song to recommend music with similar properties.
How to find RMSE and MSE in a linear regression model
MSE = E((Y-Y_hat)**2)
RMSE = sqrt(MSE)
Here E (expectation) means the sum of the squared errors over all observations, divided by the number of observations N
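A short sketch of both formulas, taking the expectation as a plain average over the observations. The sample values and function names are invented for illustration.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as Y."""
    return math.sqrt(mse(y_true, y_pred))

# Toy values, invented for illustration
y = [3.0, 5.0, 7.0]
y_hat = [2.0, 5.0, 9.0]
# Squared residuals are 1, 0, and 4, so MSE = 5 / 3
```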
If it rains on Saturday with probability 0.6 and it rains on Sunday with probability 0.2, what is the probability that it rains this weekend?
P(rain this weekend) = 1 - P(no rain on Saturday) * P(no rain on Sunday) = 1 - (1 - 0.6)(1 - 0.2) = 0.68, assuming the two days are independent
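The arithmetic can be checked two ways, both assuming rain on the two days is independent (an assumption the question does not state explicitly):

```python
p_saturday = 0.6
p_sunday = 0.2

# Complement rule: 1 minus the chance of a completely dry weekend
p_no_rain = (1 - p_saturday) * (1 - p_sunday)
p_rain_weekend = 1 - p_no_rain

# Inclusion-exclusion gives the same result under the same assumption
p_check = p_saturday + p_sunday - p_saturday * p_sunday
```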
How can you select k for k-means?
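We most commonly use the “Elbow Method”:
- The idea of the elbow method is to run k-means clustering on the dataset for a range of values of k, where k is the number of clusters
- The within-cluster sum of squares (WSS) is defined as the sum of the squared distances between each member of a cluster and its centroid
- Plot WSS against k and choose the k at the “elbow”, where adding more clusters stops reducing WSS by much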
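A minimal sketch of the WSS calculation itself. It assumes the cluster labels and centroids come from a k-means run done elsewhere; here they are assigned by hand on invented toy data, and the function name is hypothetical.

```python
def wss(points, labels, centroids):
    """Within-cluster sum of squares for 2D points with given cluster labels."""
    total = 0.0
    for (x, y), label in zip(points, labels):
        cx, cy = centroids[label]
        total += (x - cx) ** 2 + (y - cy) ** 2
    return total

# Toy data with two obvious groups, invented for illustration
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# k = 1: one centroid at the overall mean
wss_k1 = wss(points, [0, 0, 0, 0], [(5.0, 1.0)])
# k = 2: one centroid per group
wss_k2 = wss(points, [0, 0, 1, 1], [(0.0, 1.0), (10.0, 1.0)])
```

WSS falls from 104 at k = 1 to 4 at k = 2; further clusters could only remove the remaining 4, so the curve bends sharply at k = 2.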
What is the significance of the p-value?
p-value <= 0.05 (typically): indicates strong evidence against the null hypothesis, so you reject the null hypothesis
p-value > 0.05 (typically): indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis
p-value right at the 0.05 cut-off: considered to be marginal (could go either way)
How can outlier values be treated?
- You can drop outliers only if they are garbage values
  - ex: height of adult = ‘abc’
- If the outliers have extreme values, they can be removed
  - ex: most values are 0-10 but we have an outlier of 100
If you cannot drop outliers, try the following:
1. Try a different model: data detected as outliers by linear models can be fit by non-linear models
2. Try normalizing the data so that the extreme data points are pulled into a similar range
3. Use algorithms that are less affected by outliers, for example random forests
How can you say that a time series is stationary?
We can say that a time series is stationary when the mean and variance of the series are constant over time (imagine a consistent wavelength along the x-axis)
How can you calculate accuracy using a confusion matrix?
Accuracy = (True Positive + True Negative) / Total Observations
Write the equation and calculate precision and recall rate
Precision = True Positive / (True Positive + False Positive)
Recall Rate = True Positive / (True Positive + False Negative)
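All three metrics can be computed from the four confusion-matrix counts, with recall dividing by the actual positives (true positives plus false negatives). The counts and function name below are hypothetical, chosen only for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Hypothetical counts, chosen only for illustration
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
# accuracy = 70/100, precision = 40/50, recall = 40/60
```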
If a drawer contains 12 red socks, 16 blue socks, and 20 white socks, how many must you pull out to be sure of a matching pair?
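You must pull out 4. There are only three colors, so the first three socks could all be different, but the fourth must match one of them, giving a 100% chance of a matching pair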