Data Science Interview Question Flashcards
(28 cards)
What are feature vectors?
A feature vector is an n-dimensional vector of numerical features that represent an object. In machine learning, feature vectors are used to represent numeric or symbolic characteristics (called features) of an object in a mathematical way that’s easy to analyze.
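As a minimal sketch (the objects and feature values are made up for illustration), a feature vector in Python might look like this:

```python
import numpy as np

# Hypothetical example: encode a house as a 3-dimensional feature vector.
# Features: [square footage, number of bedrooms, age in years]
house = np.array([1450.0, 3.0, 22.0])
other_house = np.array([1600.0, 3.0, 5.0])

# Numeric vectors make objects easy to analyze mathematically,
# e.g. by computing the Euclidean distance between two houses.
print(np.linalg.norm(house - other_house))
```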
What are the steps in making a decision tree?
Take the entire data set as input.
Look for a split that maximizes the separation of the classes. A split is any test that divides the data into two sets.
Apply the split to the input data (divide step).
Re-apply steps one and two to the divided data.
Stop when you meet any stopping criteria.
Clean up the tree if you went too far doing splits; this step is called pruning.
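As a minimal sketch of these steps, scikit-learn's DecisionTreeClassifier performs the split search, recursion, stopping, and pruning internally (the dataset and hyperparameter values are illustrative, not prescriptive):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Take the entire data set as input.
X, y = load_iris(return_X_y=True)

# max_depth acts as a stopping criterion; ccp_alpha enables
# cost-complexity pruning after the tree has been grown.
tree = DecisionTreeClassifier(max_depth=3, ccp_alpha=0.01)
tree.fit(X, y)  # split search and recursive division happen here
print(tree.score(X, y))
```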
What is root cause analysis?
Root cause analysis was initially developed to analyze industrial accidents but is now widely used in other areas. It is a problem-solving technique used for isolating the root causes of faults or problems. A factor is called a root cause if its removal from the problem-fault sequence prevents the final undesirable event from recurring.
What is logistic regression?
Logistic regression is also known as the logit model. It is a technique used to forecast a binary outcome from a linear combination of predictor variables.
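A minimal sketch, assuming a made-up single-predictor dataset, of fitting the logit model with scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied (predictor) vs. pass/fail (binary outcome).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# Predicted probability of passing after 3.5 hours of study.
print(model.predict_proba([[3.5]])[0, 1])
```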
What are recommender systems?
Recommender systems are a subclass of information filtering systems that are meant to predict the preferences or ratings that a user would give to a product.
Explain cross-validation.
Cross-validation is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the objective is prediction and one wants to estimate how accurately a model will perform in practice.
The goal of cross-validation is to set aside part of the data to test the model during the training phase (i.e. a validation data set) to limit problems like overfitting and gain insight into how the model will generalize to an independent data set.
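A minimal sketch of k-fold cross-validation with scikit-learn (the model and dataset are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: each fold serves once as the validation set
# while the model is trained on the remaining four folds.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```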
What is collaborative filtering?
Most recommender systems use this filtering process to find patterns and information by combining multiple viewpoints, numerous data sources, and several agents.
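A minimal sketch of user-based collaborative filtering, assuming a tiny made-up user-item rating matrix (0 means unrated):

```python
import numpy as np

# Rows = users, columns = items; 0 means "not yet rated".
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Predict user 0's rating for item 2 as a similarity-weighted
# average of the other users' ratings for that item.
target_user, target_item = 0, 2
others = [u for u in range(len(ratings)) if u != target_user]
sims = np.array([cosine_sim(ratings[target_user], ratings[u]) for u in others])
item_ratings = ratings[others, target_item]
print(sims @ item_ratings / sims.sum())
```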
Do gradient descent methods always converge to similar points?
They do not, because in some cases they reach a local minimum or a local optimum point rather than the global optimum. This is governed by the data and the starting conditions.
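A minimal sketch illustrating this on a made-up non-convex function with two local minima; the starting point determines which one gradient descent reaches:

```python
# f(x) = x**4 - 3*x**2 + x has two local minima.
def grad(x):
    return 4 * x**3 - 6 * x + 1  # derivative of f

def gradient_descent(x, lr=0.01, steps=1000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

print(gradient_descent(-2.0))  # converges near x ≈ -1.30 (global minimum)
print(gradient_descent(+2.0))  # converges near x ≈ +1.13 (local minimum)
```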
What is the goal of A/B Testing?
This is statistical hypothesis testing for randomized experiments with two variants, A and B. The objective of A/B testing is to identify changes to a web page that maximize or increase the outcome of a strategy, such as the conversion rate.
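A minimal sketch, assuming made-up conversion counts, of testing whether two page variants differ (a two-proportion z-test via statsmodels):

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical data: conversions and visitors for variants A and B.
conversions = [200, 240]
visitors = [5000, 5000]

# Null hypothesis: both variants have the same conversion rate.
stat, p_value = proportions_ztest(conversions, visitors)
print(stat, p_value)  # a small p-value suggests a real difference
```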
What are the drawbacks of the linear model?
The assumption of linearity of the errors
It can’t be used for count outcomes or binary outcomes
There are overfitting problems that it can’t solve
What is the law of large numbers?
It is a theorem that describes the result of performing the same experiment a large number of times. This theorem forms the basis of frequency-style thinking. It states that the sample mean, sample variance, and sample standard deviation converge to the quantities they are trying to estimate.
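A minimal sketch simulating coin flips to show the sample mean converging toward the true mean of 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)
flips = rng.integers(0, 2, size=100_000)  # fair coin: 0 or 1

# The sample mean gets closer to 0.5 as the sample grows.
for n in (10, 100, 10_000, 100_000):
    print(n, flips[:n].mean())
```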
What are confounding variables?
These are extraneous variables in a statistical model that correlate, directly or inversely, with both the dependent and the independent variable. An estimate that fails to account for the confounding factor will be biased.
What is star schema?
It is a traditional database schema with a central table. Satellite tables map IDs to physical names or descriptions and can be connected to the central fact table using the ID fields; these tables are known as lookup tables and are principally useful in real-time applications, as they save a lot of memory. Sometimes, star schemas involve several layers of summarization to recover information faster.
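A minimal sketch of the star-schema idea, using pandas and made-up tables: a central fact table stores IDs and measures, and a lookup table maps IDs to names:

```python
import pandas as pd

# Central fact table: one row per sale, storing only IDs and measures.
facts = pd.DataFrame({
    "product_id": [1, 2, 1, 3],
    "units_sold": [10, 5, 7, 2],
})

# Satellite lookup table mapping IDs to physical names.
products = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["widget", "gadget", "gizmo"],
})

# Connect the fact table to the lookup table via the ID field.
print(facts.merge(products, on="product_id"))
```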
How regularly must an algorithm be updated?
You will want to update an algorithm when:
You want the model to evolve as data streams through infrastructure
The underlying data source is changing
There is a case of non-stationarity
What are eigenvalues and eigenvectors?
Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing, or stretching; they are used for understanding linear transformations.
Eigenvalues are the factors by which the transformation stretches or compresses the space along those directions. In data analysis, we usually calculate the eigenvectors for a correlation or covariance matrix.
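A minimal sketch computing the eigenvalues and eigenvectors of a covariance matrix with NumPy (the data is randomly generated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 2))  # 100 samples, 2 features

cov = np.cov(data, rowvar=False)  # 2x2 covariance matrix

# eigh is suited to symmetric matrices such as a covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
print(eigenvalues)   # stretch factors along each direction
print(eigenvectors)  # the directions themselves (one per column)
```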
Why is resampling done?
Resampling is done in any of these cases:
Estimating the accuracy of sample statistics by using subsets of accessible data, or drawing randomly with replacement from a set of data points
Substituting labels on data points when performing significance tests
Validating models by using random subsets (bootstrapping, cross-validation)
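A minimal sketch of the first case, the bootstrap: drawing randomly with replacement to estimate the accuracy (standard error) of the sample mean (the data is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=50)

# Draw 1000 bootstrap resamples (with replacement) and record each mean.
boot_means = [
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(1000)
]

# The spread of the bootstrap means estimates the standard error.
print(np.std(boot_means))
```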
What is selection bias?
Selection bias, in general, is a problematic situation in which error is introduced due to a non-random population sample.
What are the types of biases that can occur during sampling?
Selection bias
Undercoverage bias
Survivorship bias
What is survivorship bias?
Survivorship bias is the logical error of focusing on the aspects that survived a process and inadvertently overlooking those that did not because of their lack of prominence. This can lead to wrong conclusions in numerous ways.
How do you work towards a random forest?
The underlying principle of this technique is that several weak learners combine to provide a strong learner. The steps involved are:
Build several decision trees on bootstrapped training samples of data
On each tree, each time a split is considered, a random sample of m predictors is chosen as split candidates out of all p predictors
Rule of thumb: at each split, m = √p
Predictions: by majority rule
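A minimal sketch with scikit-learn, which handles the bootstrapped trees, the √p rule, and the majority vote internally (the dataset is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# n_estimators trees are built on bootstrapped samples;
# max_features="sqrt" applies the m = sqrt(p) rule at each split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt")
forest.fit(X, y)
print(forest.predict(X[:3]))  # each prediction is a majority vote of the trees
```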
What is a bias-variance trade-off?
Bias: Bias is the error introduced into our model when a machine learning algorithm oversimplifies the problem. It stems from overly simple assumptions made at training time to make the target function easier to learn, and it can lead to underfitting.
Some popular machine learning algorithms that are low on the bias scale:
Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Decision Trees.
Algorithms that are high on the bias scale:
Logistic Regression and Linear Regression.
Variance: Variance is the error introduced when a machine learning model is so complex that it learns even the noise in the training data set, causing it to perform badly on a test data set. High variance leads to overfitting and hyper-sensitivity in machine learning models.
While trying to overcome bias in our model, we increase the complexity of the machine learning algorithm. Though this helps reduce the bias, past a certain point it produces an overfitting effect on the model, resulting in hyper-sensitivity and high variance.
Bias-variance trade-off: To achieve the best performance, the main target of a supervised machine learning algorithm is to have both low bias and low variance.
The following things are observed regarding some of the popular machine learning algorithms -
The Support Vector Machine algorithm (SVM) has high variance and low bias. To change the trade-off, we can decrease the parameter C, which increases the number of margin violations allowed in the training data, thereby increasing bias and decreasing variance.
Like the SVM, the K-Nearest Neighbors (KNN) algorithm has high variance and low bias. To change the trade-off, we can increase the value of K, which increases the number of neighbors that influence each prediction, thus increasing the model's bias and decreasing its variance.
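A minimal sketch of the trade-off using KNN, comparing training and test accuracy as K grows (the dataset and values of K are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Small K: low bias, high variance (near-perfect training accuracy).
# Large K: higher bias, lower variance (smoother decision boundary).
for k in (1, 15, 101):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, knn.score(X_train, y_train), knn.score(X_test, y_test))
```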
Describe Markov chains.
A Markov chain is a type of stochastic process in which a state's future probability depends only on its current state.
A perfect example of a Markov chain is a word-recommendation system. In this system, the model recognizes and recommends the next word based only on the immediately preceding word and nothing before that. The Markov chain is trained on earlier text similar to the training data set and generates recommendations for the current text based on the previous word, as sketched below.
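A minimal sketch of such a word-recommendation Markov chain, trained on a made-up corpus of word pairs:

```python
import random
from collections import defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# Transition table: current word -> words observed to follow it.
transitions = defaultdict(list)
for current, nxt in zip(corpus, corpus[1:]):
    transitions[current].append(nxt)

# Recommend a next word based only on the current word (the Markov property).
def recommend(word):
    return random.choice(transitions[word])

print(recommend("the"))  # e.g. "cat", "mat", or "fish"
```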
Why is R used in Data Visualization?
R is widely used in data visualization for the following reasons:
We can create almost any type of graph using R.
R has multiple libraries like lattice, ggplot2, and leaflet, as well as many inbuilt plotting functions.
It is easier to customize graphics in R compared to Python.
R is also used in feature engineering and exploratory data analysis.
What is the difference between a box plot and a histogram?
The frequency of a certain feature's values is denoted visually by both box plots and histograms.
Box plots are more often used for comparing several datasets; compared to histograms, they take less space and contain fewer details. Histograms are used to understand the probability distribution underlying a dataset.
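A minimal sketch drawing both plots for the same made-up data with matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=500)

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.boxplot(data)        # compact summary; easy to compare across datasets
ax2.hist(data, bins=30)  # reveals the shape of the underlying distribution
plt.show()
```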