DS Interview Qs Flashcards

1
Q

List the differences between supervised and unsupervised learning

A

Supervised Learning: uses known, labeled data as input and has a feedback mechanism; the most common algorithms are decision trees, logistic regression, and support vector machines

Unsupervised Learning: uses unlabeled data as input and has no feedback mechanism; the most common algorithms are k-means clustering, hierarchical clustering, and the apriori algorithm

2
Q

How is logistic regression done?

A

Logistic regression measures the relationship between the dependent variable (our label, what we want to predict) and one or more independent variables (our features) by estimating probabilities using its underlying logistic function (the sigmoid)

3
Q

Explain the steps in making a decision tree

A
  1. Take the entire dataset as input
  2. Calculate the entropy of the target variable, as well as of the predictor attributes
  3. Calculate the information gain of all attributes
  4. Choose the attribute with the highest information gain as the root node
  5. Repeat the process on every branch until the decision node of each branch is finalized (a short sketch of steps 2 and 3 follows)
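
As a rough illustration of steps 2 and 3, here is a minimal Python sketch (the tiny dataset and the split shown are hypothetical):

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy (base 2) of a list of class labels
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(parent, splits):
    # entropy reduction achieved by splitting `parent` into `splits`
    total = len(parent)
    weighted = sum(len(s) / total * entropy(s) for s in splits)
    return entropy(parent) - weighted

# a perfect split isolates each class, so the gain equals the parent entropy
print(information_gain([0, 0, 0, 1, 1, 1, 1, 1], [[0, 0, 0], [1, 1, 1, 1, 1]]))
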
4
Q

How do you build a random forest model?

A
  1. Randomly select k features from the total m features, where k < m
  2. Among the k features, calculate node D using the best split point
  3. Split the node into daughter nodes using that best split
  4. Repeat steps 2 and 3 until the leaf nodes are finalized
  5. Build the forest by repeating steps 1 to 4 n times to create n trees (a library sketch follows)
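
In practice a library handles these steps internally; a minimal sketch with scikit-learn (assuming it is installed; the dataset is only an example):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators is the number of trees (n); max_features="sqrt" draws
# a random subset of k features at each split (steps 1 and 2)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
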
5
Q

How can you avoid overfitting your model?

A

Three main methods to avoid overfitting:

  1. Keep the model simple: take fewer variables into account, thereby removing some of the noise in the training data
  2. Use cross-validation techniques, such as k-fold cross-validation (before fitting)
  3. Use regularization techniques, such as LASSO, that penalize certain model parameters if they're likely to cause overfitting (during training); a short sketch of methods 2 and 3 follows
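
A minimal sketch of methods 2 and 3 with scikit-learn (assumed installed; the dataset and alpha value are illustrative):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# LASSO (L1) regularization shrinks some coefficients toward zero, and
# 5-fold cross-validation checks that performance holds on held-out folds
model = Lasso(alpha=0.1)
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
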
6
Q

Differentiate between univariate, bivariate, and multivariate analysis

A

Univariate: contains only one variable; the purpose of univariate analysis is to describe the data and find patterns within it; conclusions can be drawn using the mean, median, mode, min, max, etc.
Bivariate: contains two variables; bivariate analysis deals with causes and relationships; its purpose is to find the relationship between the two variables and the proportion of one variable to another; used for description and prediction
Multivariate: contains more than two variables; the purpose of multivariate analysis is the same as bivariate analysis but with more variables (example: data about a house used to predict its price); it can be descriptive, predictive, or prescriptive (change the variables to see how the outcome changes)

7
Q

What are the feature selection methods to select the right variables?

A

Two main methods for feature selection:
1. Filter Method (bad data in, bad answers out; cleaning and preprocessing the data)
- Linear Discriminant Analysis
- ANOVA
- Chi-Square (most common)
2. Wrapper Method (labor intensive)
- Forward Selection (start with no features and add one at a time until a good fit is reached)
- Backward Selection (start with all features, run the test, and remove one at a time until a good fit is reached)
- Recursive Feature Elimination (recursively looks through all features and how they pair together; a sketch follows)
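
A minimal sketch of recursive feature elimination with scikit-learn (assumed installed; the dataset and the choice of 5 features are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# recursively drop the weakest feature until only 5 remain
selector = RFE(LogisticRegression(max_iter=5000), n_features_to_select=5)
selector.fit(X, y)
print(selector.support_)  # boolean mask over the original features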

8
Q

Write a program that prints the numbers 1 to 50. For multiples of 3 print "Fizz", for multiples of 5 print "Buzz", and for multiples of both 3 and 5 print "FizzBuzz"

A

for i in range(1, 51):
    if i % 3 == 0 and i % 5 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)

9
Q

You are given a dataset consisting of variables with more than 30% missing values. How will you deal with them?

A

Ways to handle missing data:
1. If the dataset is huge, we can simply remove the rows with missing values; it's the quickest way, and we can use the rest of the data to predict the missing values
2. We can substitute missing values with the mean of the rest of the data using a pandas DataFrame in Python, e.g. df.fillna(df.mean()); a short sketch follows
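
A short pandas sketch of both options (the column name and values are hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, np.nan, 31]})

df_dropped = df.dropna()           # option 1: drop rows with missing values
df_imputed = df.fillna(df.mean())  # option 2: impute with the column mean
print(df_imputed)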

10
Q

For two given points, how will you calculate the Euclidean distance in Python?

A

euclidean_distance = sqrt((plot1[0] - plot2[0])**2 + (plot1[1] - plot2[1])**2)
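
A runnable version (the example points are hypothetical; Python 3.8+ also provides math.dist for the same result):

from math import sqrt

plot1, plot2 = [1, 3], [2, 5]
euclidean_distance = sqrt((plot1[0] - plot2[0])**2 + (plot1[1] - plot2[1])**2)
print(euclidean_distance)  # sqrt(5) ≈ 2.236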

11
Q

What is the angle between the hour and minute hands of a clock when the time is half past 6

A

15 degrees. The hour hand moves 360/12 = 30 degrees per hour, i.e. 0.5 degrees per minute, so at half past 6 it sits 15 degrees past the 6; the minute hand points exactly at the 6, so the angle between them is 15 degrees (equivalently, 360/24 = 15).

12
Q

Explain dimensionality reduction and list its benefits

A

Def: dimensionality reduction refers to the process of converting a dataset with a vast number of dimensions (fields) into data with fewer dimensions while conveying similar information concisely

Benefits:
1. It helps compress the data and reduces the storage space needed
2. It reduces computation time, as fewer dimensions lead to less computing
3. It removes redundant features; for example, there is no point in storing the same value in two different units (inches and feet)

13
Q

How will you calculate Eigen values and Eigen vectors of a 3x3 matrix?

A

Solve the characteristic equation det(A - lambda*I) = 0; for a 3x3 matrix this is a cubic whose roots are the three eigenvalues (counted with multiplicity). For each eigenvalue lambda, solve (A - lambda*I)v = 0 for a nonzero vector v; those vectors are the corresponding eigenvectors.
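
A quick numerical check is possible with NumPy (the matrix is just an example):

import numpy as np

A = np.array([[2, 0, 0],
              [0, 3, 4],
              [0, 4, 9]])

# numpy solves det(A - lambda*I) = 0 numerically
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)         # 11, 1, and 2 for this matrix (order may vary)
print(eigenvectors[:, 0])  # the eigenvector paired with eigenvalues[0]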

14
Q

How should you maintain your deployed model?

A

Steps:
1. Monitor: constant monitoring of all models is needed to determine their performance accuracy
2. Evaluate: evaluation metrics of the current model are calculated to determine if a new algorithm is needed
3. Compare: the new models are compared against each other to determine which model performs best
4. Rebuild: the best-performing model is rebuilt on the current state of the data

15
Q

What are recommender systems?

A

A recommender system predicts the “rating” or “preference” a user would give to a product
There are two types:
1. Collaborative Filtering: for example, Last.fm recommends tracks that are often played by other users with similar interests
2. Content-based Filtering: for example, Pandora uses the properties of a song to recommend music with similar properties.

16
Q

How to find RMSE and MSE in linear regression model

A

MSE = E[(Y - Y_hat)**2], the mean squared difference between the actual and predicted values
RMSE = sqrt(MSE)
Here the expectation E[.] means the sum over all N observations divided by N, i.e. MSE = (1/N) * sum((Y - Y_hat)**2)
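
A numeric sketch with NumPy (the values are hypothetical):

import numpy as np

y = np.array([3.0, 5.0, 2.5, 7.0])      # actual values
y_hat = np.array([2.8, 5.4, 2.0, 6.5])  # model predictions

mse = np.mean((y - y_hat) ** 2)
rmse = np.sqrt(mse)
print(mse, rmse)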

17
Q

If it rains on Saturday with probability 0.6 and it rains on Sunday with probability 0.2 what is the probability that it rains this weekend

A

P(rain this weekend) = 1 - P(no rain on Saturday) * P(no rain on Sunday) = 1 - (1 - 0.6)(1 - 0.2) = 1 - 0.4 * 0.8 = 1 - 0.32 = 0.68 (assuming the two days are independent)

18
Q

How can you select k for k-means?

A

We most commonly use the "elbow method":
- The idea is to run k-means clustering on the dataset for a range of values of k and, for each k, compute the within-cluster sum of squares (WSS), defined as the sum of the squared distances between each member of a cluster and its centroid
- Plot WSS against k; the point where the curve bends (the "elbow") marks a good choice of k (a sketch follows)
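
A minimal sketch of the elbow plot with scikit-learn and matplotlib (both assumed installed; the data is synthetic):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# inertia_ is the within-cluster sum of squares (WSS) for each fitted k
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 10)]
plt.plot(range(1, 10), wss, marker="o")
plt.xlabel("k")
plt.ylabel("WSS")
plt.show()  # the bend (the "elbow") suggests a good k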

19
Q

What is the significance of p-value

A

A p-value <= 0.05 indicates strong evidence against the null hypothesis, so you reject the null hypothesis
A p-value > 0.05 indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis
A p-value right at the 0.05 cut-off is considered marginal (could go either way)

20
Q

How can outlier values be treated?

A
  1. You can drop outliers only if they are garbage values
    - ex: height of an adult = 'abc'
  2. If the outliers have extreme values, they can be removed
    - ex: most values are 0-10, but one point is 100

If you cannot drop outliers, try the following:
1. Try a different model; data detected as outliers by linear models can be fit by non-linear models
2. Try normalizing the data; this way, the extreme data points are pulled into a similar range
3. You can use algorithms that are less affected by outliers, for example random forest

21
Q

How can you say that a time series data is stationary?

A

We can say that a time series is stationary when its mean and variance are constant over time (imagine a consistent waveform along the x-axis)

22
Q

How can you calculate accuracy using confusion matrix?

A

Accuracy = (True positive + True Negative)/ Total Observations
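
A quick check with scikit-learn (assumed installed; the labels below are made up):

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# sklearn's confusion_matrix.ravel() returns tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print((tp + tn) / (tp + tn + fp + fn))  # 0.75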

23
Q

Write the equation and calculate precision and recall rate

A

Precision = True Positives / (True Positives + False Positives)
Recall Rate = True Positives / (True Positives + False Negatives)
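
The same quantities via scikit-learn, reusing the hypothetical labels from the previous card:

from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3/4
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3/4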

24
Q

If a drawer contains 12 red socks, 16 blue socks, and 20 white socks, how many must you pull out to be sure of a matching pair?

A

4. With only three colors in the drawer, any four socks must include two of the same color (pigeonhole principle), so pulling 4 guarantees a matching pair.

25
Q

"People who bought this also bought…" recommendations seen on Amazon are a result of which algorithm?

A

The recommendation engine is built using collaborative filtering, not content-based filtering.
Collaborative Filtering: exploits the behavior of other users and their purchase history in terms of ratings, selections, etc. It makes predictions about what might interest a person based on the preferences of many other users. In this algorithm, the features of the items are not known.

26
Q

Write a SQL query to list all orders with customer information

A

Given ORDERTABLE, which contains OrderId, CustomerId, OrderNumber, TotalAmount

Given CUSTOMERTABLE, which contains Id, FirstName, LastName, City, Country

SELECT OrderNumber, TotalAmount, FirstName, LastName, City, Country
FROM OrderTable
JOIN CustomerTable
ON OrderTable.CustomerId = CustomerTable.Id

27
Q

What is the SQL query order?

A

A SQL query is written in the order SELECT, FROM, JOIN...ON, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT, but it is logically processed in the order FROM (with JOINs), WHERE, GROUP BY, HAVING, SELECT, DISTINCT, ORDER BY, LIMIT.

28
Q

You are given a dataset on cancer detection. You've built a classification model and achieved an accuracy of 96%. Why shouldn't you be happy with your model performance? What can you do about it?

A

Cancer detection results in IMBALANCED DATA

In an imbalanced dataset, accuracy should not be used as a measure of performance, because it is important to focus on the remaining 4%, who are the people wrongly diagnosed. A wrong diagnosis is a major concern because some people may have cancer but not be predicted to. Better measures here are precision, recall, and the F1 score, together with techniques for rebalancing the classes.

29
Q

Which of the following machine learning algorithms can be used for imputing missing values of both categorical and continuous variables?
1. K-means clustering
2. Linear regression
3. K-NN
4. Decision trees

A

K-NN

30
Q

Given a box of matches and two ropes that each take one hour to burn, though not necessarily uniformly or identically, measure a period of 45 minutes

A

Light rope A from both ends and rope B from one end at the same time.
When A finishes burning, 30 minutes have elapsed and B has 30 minutes of burn time remaining. Light B from the other end as well; it will now take 15 minutes to finish, adding up to 45 minutes.

31
Q

Below are the 8 actual values of the target variable in the train file: [0,0,0,1,1,1,1,1]. What is the entropy of the target variable?

A

-(5/8 log2(5/8) + 3/8 log2(3/8)) ≈ 0.954
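
A one-line check in Python:

from math import log2

p = [5/8, 3/8]
print(-sum(x * log2(x) for x in p))  # ≈ 0.954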

32
Q

We want to predict the probability of death from heart disease based on three risk factors: age, gender, blood cholesterol. What is the most appropriate algorithm for this use case?

A

Logistic Regression

33
Q

After studying the behavior of a population, you have identified four specific individual types who are valuable to your study. You would like to find all users who are most similar to each individual type. What algorithm is most appropriate for this study?

A

K-means clustering
We are looking to group people by four specific types of similarity, which indicates the value of k (k = 4)

34
Q

You have run the association rules algorithm on your dataset and the two rules {banana, apple} => {grape} and {apple, orange} => {grape} have been found to be relevant. What else must be true?

A

{grape, apple} must be a frequent item set

35
Q

Your organization has a website where visitors randomly receive one of two coupons. It is also possible that visitors to the website will not receive a coupon. You have been asked to determine whether offering a coupon to website visitors has any impact on their purchase decisions. Which analysis method should you use?

A

One-way ANOVA

36
Q

What do you understand about true positive rate and false positive rate?

A

The True Positive Rate (TPR) is the probability that an actual positive is predicted positive; it is the ratio of true positives to all actual positives, i.e. TPR = TP / (TP + FN)
The False Positive Rate (FPR) is the probability that an actual negative is predicted positive (a false alarm); it is the ratio of false positives to all actual negatives, i.e. FPR = FP / (FP + TN)

37
Q

What is the ROC curve?

A

The graph of the True Positive Rate (y-axis) against the False Positive Rate (x-axis) is called the ROC curve, and it is used in binary classification. The area under the ROC curve (AUC) ranges between 0 and 1; a completely random model, represented by the diagonal straight line, has an AUC of 0.5. The amount the ROC curve deviates from this straight line denotes the effectiveness of the model.

38
Q

What is a Confusion Matrix?

A

The Confusion Matrix is a summary of the prediction results for a particular problem. It is a table used to describe the performance of a classification model: an n x n matrix comparing predicted labels against actual labels.

39
Q

What do you understand about the true positive rate and false positive rate?

A

The true positive rate gives the proportion of correct predictions of the positive class. It is also used to measure the percentage of actual positives that are accurately verified.

The false positive rate gives the proportion of incorrect predictions of the positive class; a false positive occurs when the model predicts positive for a case that is actually negative.

40
Q

How is data science different from traditional application programming?

A

The primary and vital difference between data science and traditional application programming is that in traditional programming, one has to create rules to translate the input to output, whereas in data science the rules are automatically produced from the data.

41
Q

What is the difference between the long format data and wide format data?

A

Long format data: contains values that repeat in the first column. In this format, each row represents one time point per subject

Wide format data: the data’s repeated responses will be in a single row, and each response can be recorded in separate columns

42
Q

Mention some techniques used for sampling. What is the main advantage of sampling?

A

Sampling is the selection of individual members or a subset of the population to estimate the characteristics of the whole population; its main advantage is that it allows conclusions to be drawn without the time and cost of examining every member. There are two types of sampling, namely probability sampling (e.g. simple random, systematic, stratified, cluster) and non-probability sampling (e.g. convenience, quota, snowball)

43
Q

Why is Python used for data cleaning in DS?

A

Data scientists and technical analysts must convert huge amounts of raw data into usable data. Data cleaning includes removing corrupted records, outliers, inconsistent values, redundant formatting, etc. Pandas, Matplotlib, etc., are the Python libraries most used in data cleaning

44
Q

What are the popular libraries used in data science?

A

TensorFlow, Pandas, NumPy, SciPy, Scrapy, Libra, Matplotlib

45
Q

What is variance in data science?

A

Variance describes how the individual figures in a set of data distribute themselves about the mean; it is the average of the squared differences of each value from the mean. Data scientists use variance to understand the spread of a data set

46
Q

What is pruning in a decision tree?

A

In data science and machine learning, pruning is a technique related to decision trees. Pruning simplifies a decision tree by reducing its rules, which helps avoid complexity and improves accuracy. Reduced-error pruning and cost-complexity pruning are among the different types of pruning.

47
Q

What is entropy in a decision tree algorithm?

A

Entropy is the measure of randomness or disorder in a group of observations; it determines where a decision tree chooses to split the data and is also used to check the homogeneity of the given data. If the entropy is zero, the sample is entirely homogeneous, and if the entropy is one, the sample is equally divided between the classes

48
Q

What is information gain in a decision tree algorithm?

A

Information gain is the expected reduction in entropy; it is what makes the decision tree smarter. For a parent node R and a set E of K training examples, it is calculated as the difference between the entropy before and after the split.

49
Q

What is k-fold cross-validation?

A

K-fold cross-validation is a procedure used to estimate a model's skill on new data. In k-fold cross-validation, every observation from the original dataset appears in the training set in some folds and in the test set in exactly one fold. K-fold cross-validation estimates accuracy but does not by itself improve it.
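
A tiny sketch showing the fold structure with scikit-learn (assumed installed; the data is just an index range):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)

# each observation lands in the test fold exactly once across the 5 splits
for train_idx, test_idx in KFold(n_splits=5).split(X):
    print(train_idx, test_idx)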

50
Q

What is a normal distribution?

A

Normal distribution is also known as the Gaussian distribution. It shows data concentrated near the mean, with the frequency of values falling off symmetrically on either side; represented graphically, it appears as a bell curve. It is described by its mean and standard deviation, and its mean, median, and mode are equal.

51
Q

What is deep learning?

A

Deep learning is one of the essential areas of data science, including statistics. It works with algorithms deliberately created to resemble the structure of the human brain. In deep learning, multiple layers are stacked from the raw input to extract progressively higher-level features.

52
Q

What is an RNN (recurrent neural network)?

A

RNN is an algorithm that uses sequential data. RNNs are used in language translation, voice recognition, image captioning, etc. There are different types of RNN architectures, such as one-to-one, one-to-many, many-to-one, and many-to-many. RNNs are used in Google's voice search and Apple's Siri.

53
Q

What are the feature vectors?

A

A feature vector is an n-dimensional vector of numerical features that represents an object. In machine learning, feature vectors are used to represent numeric or symbolic characteristics (called features) of an object in a mathematical way that's easy to analyze.

54
Q

What are the steps in making a decision tree?

A
  1. Take the entire data set as input
  2. Look for a split that maximizes the separation of the classes; a split is any test that divides the data into two sets
  3. Apply the split to the input data
  4. Re-apply steps 1 and 2 to the divided data
  5. Stop when you meet a stopping criterion
  6. Clean up the tree if you went too far doing splits; this step is called pruning
55
Q

What is root cause analysis?

A

Root cause analysis was initially developed to analyze industrial accidents, but it is now widely used in other areas. It is a problem-solving technique used for isolating the root causes of faults or problems. A factor is called a root cause if removing it from the problem-fault sequence prevents the final undesirable event from recurring.

56
Q

What is logistic regression?

A

Logistic regression is also known as the logit model. It is a technique used to forecast a binary outcome from a linear combination of predictor variables.

57
Q

What are recommender systems?

A

Recommender systems are a subclass of information filtering systems that are meant to predict the preferences or ratings that a user would give to a product.

58
Q

Explain cross-validation.

A

Cross-validation is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the objective is forecasting and one wants to estimate how accurately a model will perform in practice.

The goal of cross-validation is to set aside part of the data as a test set during the training phase (i.e. a validation data set) to limit problems like overfitting and to gain insight into how the model will generalize to an independent data set.

59
Q

What is collaborative filtering?

A

Collaborative filtering is the process most recommender systems use to find patterns and information by combining viewpoints, numerous data sources, and several agents.

60
Q

Do gradient descent methods always converge to similar points?

A

They do not, because in some cases they reach a local minimum or local optimum point rather than the global optimum. This is governed by the data and the starting conditions.

61
Q

What is the goal of A/B testing?

A

This is statistical hypothesis testing for randomized experiments with two variables, A and B. The objective of A/B testing is to identify changes to a web page that maximize or increase the outcome of a strategy, such as the conversion rate.

62
Q

What are the drawbacks of the linear model?

A

The assumption of linearity of the errors

It can’t be used for count outcomes or binary outcomes

There are overfitting problems that it can’t solve

63
Q

What is the law of large numbers?

A

It is a theorem that describes the result of performing the same experiment very frequently. This theorem forms the basis of frequency-style thinking. It states that the sample mean, sample variance, and sample standard deviation converge to what they are trying to estimate.

64
Q

What are confounding variables?

A

These are extraneous variables in a statistical model that correlate, directly or inversely, with both the dependent and the independent variable. The estimate fails to account for the confounding factor.

65
Q

What is star schema?

A

It is a traditional database schema with a central fact table. Satellite tables map IDs to physical names or descriptions and can be connected to the central fact table using the ID fields; these tables are known as lookup tables and are principally useful in real-time applications, as they save a lot of memory. Sometimes, star schemas involve several layers of summarization to recover information faster.

66
Q

How regularly must an algorithm be updated?

A

You will want to update an algorithm when:
1. You want the model to evolve as data streams through infrastructure
2. The underlying data source is changed
3. There is a case of non-stationarity

67
Q

What are eigenvalues and eigenvectors?

A

Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing, or stretching; they are key to understanding linear transformations.

Eigenvalues are the factors by which the transformation stretches or compresses along those directions. In data analysis, we usually calculate the eigenvectors (and eigenvalues) of a correlation or covariance matrix.

68
Q

Why is resampling done?

A

Resampling is done in any of these cases:
1. Estimating the accuracy of sample statistics by using subsets of accessible data, or drawing randomly with replacement from a set of data points
2. Substituting labels on data points when performing significance tests
3. Validating models by using random subsets (bootstrapping, cross-validation); a bootstrap sketch follows
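
A minimal bootstrap sketch with NumPy (the data is synthetic):

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)

# bootstrap: resample with replacement to estimate the sampling
# distribution of the mean without collecting new data
boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(1000)]
print(np.percentile(boot_means, [2.5, 97.5]))  # approximate 95% CI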

69
Q

What is selection bias?

A

Selection bias, in general, is a problematic situation in which error is introduced due to a non-random population sample

70
Q

What are the types of biases that can occur during sampling?

A

Selection bias, undercoverage bias, survivorship bias

71
Q

What is survivorship bias?

A

Survivorship bias is the logical error of focusing on aspects that support surviving a process and casually overlooking those that did not because of their lack of prominence. This can lead to wrong conclusions in numerous ways.

72
Q

How do you work towards a random forest?

A

The underlying principle of this technique is that several weak learners combine to provide a strong learner. The steps involved are:
1. Build several decision trees on bootstrapped training samples of the data
2. On each tree, each time a split is considered, a random sample of m predictors is chosen as split candidates out of all p predictors
3. Rule of thumb: at each split, take m ≈ sqrt(p)
4. Predictions: by the majority rule

73
Q

What is a bias-variance trade off?

A

While trying to overcome bias in our model, we tend to increase the complexity of the machine learning algorithm. Though this helps reduce bias, after a certain point it generates an overfitting effect, resulting in hypersensitivity and high variance. To achieve the best performance, a supervised machine learning algorithm should aim for low bias and low variance.

74
Q

Describe Markov Chains.

A

A Markov chain is a stochastic process in which a state's future probability depends only on its current state. A simple example is a weather model in which the probability that tomorrow is rainy depends only on whether today is rainy, not on any earlier days.

75
Q

Why is R used in data visualization?

A

R is widely used in data visualization for the following reasons:
1. We can create almost any type of graph using R
2. R has multiple libraries, such as lattice, ggplot2, and leaflet, and many built-in functions as well
3. It is easier to customize graphics in R compared to Python
4. R is used in feature engineering and in exploratory data analysis as well

76
Q

What is the difference between a box plot and a histogram?

A

Both box plots and histograms visually denote the frequency of a certain feature's values.
Box plots are more often used for comparing several datasets; compared to histograms, they take less space and contain fewer details. Histograms are used to understand the probability distribution underlying a dataset.
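
A side-by-side sketch with matplotlib (assumed installed; the data is synthetic):

import matplotlib.pyplot as plt
import numpy as np

data = np.random.default_rng(0).normal(size=500)

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.boxplot(data)        # compact five-number summary, flags outliers
ax2.hist(data, bins=30)  # full shape of the distribution
plt.show()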

77
Q

What does NLP stand for?

A

NLP is short for Natural Language Processing. It deals with the study of how computers learn from massive amounts of textual data through programming. A few popular examples of NLP tasks are stemming, sentiment analysis, tokenization, and removal of stop words.

78
Q

Explain the difference between an error and a residual error.

A

Error: the difference between the actual value and the predicted value. Popular means of measuring error in data science are RMSE, MAE, and MSE. An error is generally unobservable: it is how the actual population data and the observed data differ from one another.
Residual Error: the difference between the arithmetic mean (or fitted value) of a group of values and an observed value. A residual can be represented on a graph and is used to show how the sample data and the observed data differ from one another.

79
Q

Explain the difference between normalization and standardization.

A

Standardization: the technique of converting data so that it is normally distributed with a standard deviation of 1 and a mean of 0; standardization ensures the data follows the standard normal distribution. Formula: X' = (X - mu) / sigma
Normalization: the technique of converting all data values to lie between 0 and 1, also known as min-max scaling. Formula: X' = (X - Xmin) / (Xmax - Xmin)
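
A quick NumPy sketch of both formulas (the values are hypothetical):

import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

standardized = (x - x.mean()) / x.std()           # mean 0, std 1
normalized = (x - x.min()) / (x.max() - x.min())  # values in [0, 1]
print(standardized, normalized)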

80
Q

Explain the difference between point estimates and confidence intervals.

A

Confidence Intervals: a confidence interval gives a range of values likely to contain the population parameter and tells us how likely that particular interval is to contain the parameter. The confidence coefficient, denoted 1 - alpha, gives this probability or likeliness, where alpha is the level of significance.
Point Estimates: a point estimate is a single value that estimates the population parameter. Popular methods used to derive point estimates are the maximum likelihood estimator (MLE) and the Method of Moments.

81
Q

One-on-One: Which is your favorite machine learning algorithm and why?

A

Answer.

82
Q

One-on-One: Which, according to you, is the most important skill that makes a good data scientist?

A

Answer.

83
Q

One-on-One: Why do you think data science is so popular today?

A

Answer.

84
Q

One-on-One: Explain the most challenging data science project that you worked on.

A

Answer.

85
Q

One-on-One: How do you usually prefer working on a project - individually, small team or large team?

A

Answer.

86
Q

One-on-One: Based on your experience in the industry, tell me about your top 5 predictions for the next 10 years.

A

Answer.

87
Q

One-on-One: What are some unique skills that you can bring to the team as a data scientist.

A

Answer.

88
Q

One-on-One: Were you always in the data science field? If not, what made you change career paths.

A

Answer.

89
Q

One-on-One: If we give you a random data set, how will you figure out whether it suits the business needs or not?

A

Answer.

90
Q

One-on-One: Given a chance, if you could pick a career other than being a data scientist, what would you choose?

A

Olympic volleyball line judge

91
Q

One-on-One: Given the constant change in the data science field, how quickly can you adapt to new technologies?

A

Like Lightning McQueen

92
Q

One-on-One: Have you ever been in a conflict with your colleagues regarding different strategies to go about a project? How were you able to resolve it?

A

Answer.

93
Q

One-on-One: Can you break down an algorithm you have used on a recent project?

A

Answer.

94
Q

One-on-One: What tools did you use in your last project and why?

A

Answer.

95
Q

One-on-One: Think of the last technical problem you solved. If you had no limitations with the project’s budget, what would be the first thing you would do to solve the same problem?

A

Answer.

96
Q

One-on-One: When you are assigned multiple projects at the same time, how do you best organize your time?

A

Answer.

97
Q

One-on-One: Tell me about a time when your project didn’t go according to plan and what you learned from it.

A

Answer

98
Q

One-on-One: Have you ever created an original algorithm? How did you go about doing that and for what purpose?

A

Answer.

99
Q

One-on-One: What is your most favored strategy to clean a big data set and why?

A

Answer.