Data Science Interview Question Flashcards
(28 cards)
What are feature vectors?
A feature vector is an n-dimensional vector of numerical features that represent an object. In machine learning, feature vectors are used to represent numeric or symbolic characteristics (called features) of an object in a mathematical way that’s easy to analyze.
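As a minimal sketch (the objects and feature values are made up for illustration), a feature vector in Python might look like this:

```python
import numpy as np

# Hypothetical example: encode a house as a 3-dimensional feature vector.
# Features: [square footage, number of bedrooms, age in years]
house = np.array([1450.0, 3.0, 22.0])
other_house = np.array([1600.0, 3.0, 5.0])

# Numeric vectors make objects easy to analyze mathematically,
# e.g. by computing the Euclidean distance between two houses.
print(np.linalg.norm(house - other_house))
```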
What are the steps in making a decision tree?
Take the entire data set as input.
Look for a split that maximizes the separation of the classes. A split is any test that divides the data into two sets.
Apply the split to the input data (divide step).
Re-apply steps one and two to the divided data.
Stop when you meet any stopping criteria.
Clean up the tree if you went too far doing splits; this step is called pruning.
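As a minimal sketch of these steps, scikit-learn's DecisionTreeClassifier performs the split search, recursion, stopping, and pruning internally (the dataset and hyperparameter values are illustrative, not prescriptive):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Take the entire data set as input.
X, y = load_iris(return_X_y=True)

# max_depth acts as a stopping criterion; ccp_alpha enables
# cost-complexity pruning after the tree has been grown.
tree = DecisionTreeClassifier(max_depth=3, ccp_alpha=0.01)
tree.fit(X, y)  # split search and recursive division happen here
print(tree.score(X, y))
```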
What is root cause analysis?
Root cause analysis was initially developed to analyze industrial accidents but is now widely used in other areas. It is a problem-solving technique used for isolating the root causes of faults or problems. A factor is called a root cause if its removal from the problem-fault sequence prevents the final undesirable event from recurring.
What is logistic regression?
Logistic regression is also known as the logit model. It is a technique used to forecast a binary outcome from a linear combination of predictor variables.
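A minimal sketch, assuming a made-up single-predictor dataset, of fitting the logit model with scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied (predictor) vs. pass/fail (binary outcome).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# Predicted probability of passing after 3.5 hours of study.
print(model.predict_proba([[3.5]])[0, 1])
```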
What are recommender systems?
Recommender systems are a subclass of information filtering systems that are meant to predict the preferences or ratings that a user would give to a product.
Explain cross-validation.
Cross-validation is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the objective is prediction and one wants to estimate how accurately a model will perform in practice.
The goal of cross-validation is to set aside part of the data to test the model during the training phase (i.e. a validation data set) to limit problems like overfitting and gain insight into how the model will generalize to an independent data set.
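A minimal sketch of k-fold cross-validation with scikit-learn (the model and dataset are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: each fold serves once as the validation set
# while the model is trained on the remaining four folds.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```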
What is collaborative filtering?
Most recommender systems use this filtering process to find patterns and information by combining multiple viewpoints, numerous data sources, and several agents.
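A minimal sketch of user-based collaborative filtering, assuming a tiny made-up user-item rating matrix (0 means unrated):

```python
import numpy as np

# Rows = users, columns = items; 0 means "not yet rated".
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Predict user 0's rating for item 2 as a similarity-weighted
# average of the other users' ratings for that item.
target_user, target_item = 0, 2
others = [u for u in range(len(ratings)) if u != target_user]
sims = np.array([cosine_sim(ratings[target_user], ratings[u]) for u in others])
item_ratings = ratings[others, target_item]
print(sims @ item_ratings / sims.sum())
```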
Do gradient descent methods always converge to similar points?
They do not, because in some cases they reach a local minimum or a local optimum point rather than the global optimum. This is governed by the data and the starting conditions.
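A minimal sketch illustrating this on a made-up non-convex function with two local minima; the starting point determines which one gradient descent reaches:

```python
# f(x) = x**4 - 3*x**2 + x has two local minima.
def grad(x):
    return 4 * x**3 - 6 * x + 1  # derivative of f

def gradient_descent(x, lr=0.01, steps=1000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

print(gradient_descent(-2.0))  # converges near x ≈ -1.30 (global minimum)
print(gradient_descent(+2.0))  # converges near x ≈ +1.13 (local minimum)
```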
What is the goal of A/B Testing?
This is statistical hypothesis testing for randomized experiments with two variants, A and B. The objective of A/B testing is to identify changes to a web page that maximize or increase the outcome of a strategy, such as the conversion rate.
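A minimal sketch, assuming made-up conversion counts, of testing whether two page variants differ (a two-proportion z-test via statsmodels):

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical data: conversions and visitors for variants A and B.
conversions = [200, 240]
visitors = [5000, 5000]

# Null hypothesis: both variants have the same conversion rate.
stat, p_value = proportions_ztest(conversions, visitors)
print(stat, p_value)  # a small p-value suggests a real difference
```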
What are the drawbacks of the linear model?
The assumption of linearity of the errors
It can’t be used for count outcomes or binary outcomes
There are overfitting problems that it can’t solve
What is the law of large numbers?
It is a theorem that describes the result of performing the same experiment a large number of times. This theorem forms the basis of frequency-style thinking. It states that the sample mean, sample variance, and sample standard deviation converge to the quantities they are trying to estimate.
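A minimal sketch simulating coin flips to show the sample mean converging toward the true mean of 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)
flips = rng.integers(0, 2, size=100_000)  # fair coin: 0 or 1

# The sample mean gets closer to 0.5 as the sample grows.
for n in (10, 100, 10_000, 100_000):
    print(n, flips[:n].mean())
```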
What are confounding variables?
These are extraneous variables in a statistical model that correlate, directly or inversely, with both the dependent and the independent variable. An estimate that fails to account for the confounding factor will be biased.
What is star schema?
It is a traditional database schema with a central table. Satellite tables map IDs to physical names or descriptions and can be connected to the central fact table using the ID fields; these tables are known as lookup tables and are principally useful in real-time applications, as they save a lot of memory. Sometimes, star schemas involve several layers of summarization to recover information faster.
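A minimal sketch of the star-schema idea, using pandas and made-up tables: a central fact table stores IDs and measures, and a lookup table maps IDs to names:

```python
import pandas as pd

# Central fact table: one row per sale, storing only IDs and measures.
facts = pd.DataFrame({
    "product_id": [1, 2, 1, 3],
    "units_sold": [10, 5, 7, 2],
})

# Satellite lookup table mapping IDs to physical names.
products = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["widget", "gadget", "gizmo"],
})

# Connect the fact table to the lookup table via the ID field.
print(facts.merge(products, on="product_id"))
```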
How regularly must an algorithm be updated?
You will want to update an algorithm when:
You want the model to evolve as data streams through infrastructure
The underlying data source is changing
There is a case of non-stationarity
What are eigenvalues and eigenvectors?
Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing, or stretching; they are used for understanding linear transformations.
Eigenvalues are the factors by which the transformation stretches or compresses the space along those directions. In data analysis, we usually calculate the eigenvectors for a correlation or covariance matrix.
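A minimal sketch computing the eigenvalues and eigenvectors of a covariance matrix with NumPy (the data is randomly generated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 2))  # 100 samples, 2 features

cov = np.cov(data, rowvar=False)  # 2x2 covariance matrix

# eigh is suited to symmetric matrices such as a covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
print(eigenvalues)   # stretch factors along each direction
print(eigenvectors)  # the directions themselves (one per column)
```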
Why is resampling done?
Resampling is done in any of these cases:
Estimating the accuracy of sample statistics by using subsets of accessible data, or drawing randomly with replacement from a set of data points
Substituting labels on data points when performing significance tests
Validating models by using random subsets (bootstrapping, cross-validation)
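A minimal sketch of the first case, the bootstrap: drawing randomly with replacement to estimate the accuracy (standard error) of the sample mean (the data is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=50)

# Draw 1000 bootstrap resamples (with replacement) and record each mean.
boot_means = [
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(1000)
]

# The spread of the bootstrap means estimates the standard error.
print(np.std(boot_means))
```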
What is selection bias?
Selection bias, in general, is a problematic situation in which error is introduced due to a non-random population sample.
What are the types of biases that can occur during sampling?
Selection bias
Undercoverage bias
Survivorship bias
What is survivorship bias?
Survivorship bias is the logical error of focusing on the aspects that survived a process and inadvertently overlooking those that did not because of their lack of prominence. This can lead to wrong conclusions in numerous ways.
How do you work towards a random forest?
The underlying principle of this technique is that several weak learners combine to provide a strong learner. The steps involved are:
Build several decision trees on bootstrapped training samples of data
On each tree, each time a split is considered, a random sample of m predictors is chosen as split candidates out of all p predictors
Rule of thumb: at each split, m = √p
Predictions: by majority rule
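A minimal sketch with scikit-learn, which handles the bootstrapped trees, the √p rule, and the majority vote internally (the dataset is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# n_estimators trees are built on bootstrapped samples;
# max_features="sqrt" applies the m = sqrt(p) rule at each split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt")
forest.fit(X, y)
print(forest.predict(X[:3]))  # each prediction is a majority vote of the trees
```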
What is a bias-variance trade-off?
Bias: Bias is the error introduced into our model when a machine learning algorithm oversimplifies the problem. It stems from overly simple assumptions made at training time to make the target function easier to learn, and it can lead to underfitting.
Some popular machine learning algorithms that are low on the bias scale:
Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Decision Trees.
Algorithms that are high on the bias scale:
Logistic Regression and Linear Regression.
Variance: Variance is the error introduced when a machine learning model is so complex that it learns even the noise in the training data set, causing it to perform badly on a test data set. High variance leads to overfitting and hyper-sensitivity in machine learning models.
While trying to overcome bias in our model, we increase the complexity of the machine learning algorithm. Though this helps reduce the bias, past a certain point it produces an overfitting effect on the model, resulting in hyper-sensitivity and high variance.
Bias-variance trade-off: To achieve the best performance, the main target of a supervised machine learning algorithm is to have both low bias and low variance.
The following things are observed regarding some of the popular machine learning algorithms -
The Support Vector Machine algorithm (SVM) has high variance and low bias. To change the trade-off, we can decrease the parameter C, which increases the number of margin violations allowed in the training data, thereby increasing bias and decreasing variance.
Like the SVM, the K-Nearest Neighbors (KNN) algorithm has high variance and low bias. To change the trade-off, we can increase the value of K, which increases the number of neighbors that influence each prediction, thus increasing the model's bias and decreasing its variance.
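A minimal sketch of the trade-off using KNN, comparing training and test accuracy as K grows (the dataset and values of K are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Small K: low bias, high variance (near-perfect training accuracy).
# Large K: higher bias, lower variance (smoother decision boundary).
for k in (1, 15, 101):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, knn.score(X_train, y_train), knn.score(X_test, y_test))
```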
Describe Markov chains.
A Markov chain is a type of stochastic process in which a state's future probability depends only on its current state.
A perfect example of a Markov chain is a word-recommendation system. In this system, the model recognizes and recommends the next word based only on the immediately preceding word and nothing before that. The Markov chain is trained on earlier text similar to the training data set and generates recommendations for the current text based on the previous word, as sketched below.
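A minimal sketch of such a word-recommendation Markov chain, trained on a made-up corpus of word pairs:

```python
import random
from collections import defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# Transition table: current word -> words observed to follow it.
transitions = defaultdict(list)
for current, nxt in zip(corpus, corpus[1:]):
    transitions[current].append(nxt)

# Recommend a next word based only on the current word (the Markov property).
def recommend(word):
    return random.choice(transitions[word])

print(recommend("the"))  # e.g. "cat", "mat", or "fish"
```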
Why is R used in Data Visualization?
R is widely used in data visualization for the following reasons:
We can create almost any type of graph using R.
R has multiple libraries like lattice, ggplot2, and leaflet, as well as many inbuilt plotting functions.
It is easier to customize graphics in R compared to Python.
R is also used in feature engineering and exploratory data analysis.
What is the difference between a box plot and a histogram?
The frequency of a certain feature's values is denoted visually by both box plots and histograms.
Box plots are more often used for comparing several datasets; compared to histograms, they take less space and contain fewer details. Histograms are used to understand the probability distribution underlying a dataset.
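A minimal sketch drawing both plots for the same made-up data with matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=500)

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.boxplot(data)        # compact summary; easy to compare across datasets
ax2.hist(data, bins=30)  # reveals the shape of the underlying distribution
plt.show()
```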