Basic_Knowledge Flashcards

(31 cards)

1
Q

What is supervised Learning?

A

Uses known and labeled data as input
Supervised learning has a feedback mechanism
The most commonly used supervised learning algorithms are decision trees, logistic regression, and support vector machines

2
Q

What is Unsupervised Learning?

A

Uses unlabeled data as input
Unsupervised learning has no feedback mechanism
The most commonly used unsupervised learning algorithms are k-means clustering, hierarchical clustering, and the Apriori algorithm

3
Q

How is logistic regression done?

A

Logistic regression measures the relationship between the dependent variable (our label of what we want to predict) and one or more independent variables (our features) by estimating probability using its underlying logistic function (sigmoid).

4
Q

What is a sigmoid function?

A

A sigmoid function is a mathematical function having a characteristic “S”-shaped curve or sigmoid curve.

A sigmoid function is a bounded, differentiable, real function that is defined for all real input values and has a non-negative derivative at each point[1] and exactly one inflection point. A sigmoid “function” and a sigmoid “curve” refer to the same object.
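
As an illustration, the logistic function 1 / (1 + e^(-x)) is the sigmoid most often meant in machine learning and the one used by logistic regression. The sketch below is only one possible way to compute it; the name `sigmoid` and the sample inputs are chosen here for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function 1 / (1 + e^(-x)), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # safe: x is negative, so exp(x) cannot overflow
    return z / (1.0 + z)


# The outputs rise monotonically from near 0 to near 1, tracing the "S" shape.
for x in (-6, -2, 0, 2, 6):
    print(f"sigmoid({x:>2}) = {sigmoid(x):.4f}")
```

The curve is symmetric around its single inflection point at x = 0, where the value is exactly 0.5.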

5
Q

Explain the steps in making a decision tree.

A

Take the entire data set as input
Calculate the entropy of the target variable, both overall and within the groups produced by splitting on each predictor attribute
Calculate the information gain of each attribute (we gain information by sorting different objects from each other)
Choose the attribute with the highest information gain as the root node
Repeat the same procedure on every branch until the decision node of each branch is finalized
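
The entropy and information gain steps above can be sketched as follows. This is a minimal illustration rather than a full decision tree: the toy data set and the names `entropy`, `information_gain`, `rows`, `gains`, and `root` are invented for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on `attribute`."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical toy data set, invented for illustration.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]

gains = {a: information_gain(rows, a, "play") for a in ("outlook", "windy")}
root = max(gains, key=gains.get)  # attribute with the highest information gain
print(gains, "-> root node:", root)
```

Here "outlook" separates the classes perfectly while "windy" tells us nothing, so "outlook" would be chosen as the root node.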

6
Q

What is a random forest model?

A

A random forest is built from a number of decision trees. Each tree is trained on a different random sample of the data, and the random forest combines the predictions of all those trees (for example, by majority vote).

7
Q

How do you build a random forest model?

A

Randomly select ‘k’ features from a total of ‘m’ features, where k << m
Among the ‘k’ features, calculate the node D using the best split point
Split the node into daughter nodes using the best split
Repeat steps two and three until leaf nodes are finalized
Build forest by repeating steps one to four for ‘n’ times to create ‘n’ number of trees

8
Q

How can you avoid overfitting your model?

A

Overfitting refers to a model that fits a small amount of training data too closely, noise included, and therefore ignores the bigger picture and generalizes poorly to new data. There are three main methods to avoid overfitting:

Keep the model simple: take fewer variables into account, thereby removing some of the noise in the training data
Use cross-validation techniques, such as k-fold cross-validation
Use regularization techniques, such as LASSO, that penalize certain model parameters if they’re likely to cause overfitting
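
To make the cross-validation idea concrete, here is a minimal sketch of how k-fold splitting divides the sample indices. The function name `k_fold_indices` is made up for this example, and the sketch does not shuffle the data, which in practice is usually done first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Every sample is used for validation exactly once, and the model is trained k times on the remaining data, so the averaged score is less dependent on one lucky or unlucky split.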

9
Q

Differentiate between univariate, bivariate, and multivariate analysis.

A

Univariate
Univariate data contains only one variable. The purpose of univariate analysis is to describe the data and find patterns that exist within it.

Example: heights of students. The patterns can be studied by drawing conclusions using the mean, median, mode, dispersion or range, minimum, maximum, etc.
________________________
Bivariate
Bivariate data involves two different variables. Analysis of this type of data looks at causes and relationships, with the aim of determining how the two variables are related.

Example: temperature and ice cream sales in the summer season. Here, temperature and sales rise together: the hotter the temperature, the better the sales.
________________________
Multivariate
Multivariate data involves three or more variables. The analysis is similar to bivariate analysis but considers more than two variables at once.

Example: data for house price prediction. The patterns can be studied by drawing conclusions using the mean, median, mode, dispersion or range, minimum, maximum, etc. You can start by describing the data and then use it to estimate what the price of a house will be.

10
Q

What are the feature selection methods used to select the right variables?

A

There are two main methods for feature selection: filter methods and wrapper methods.

Filter Methods
This involves:

Linear discriminant analysis
ANOVA
Chi-Square
The best analogy for selecting features is “bad data in, bad answer out.” When we’re limiting or selecting the features, it’s all about cleaning up the data coming in.

Wrapper Methods
This involves:

Forward Selection: We start with no features, test one feature at a time, and keep adding features until we get a good fit
Backward Selection: We start with all the features and remove them one at a time to see what works better
Recursive Feature Elimination: Repeatedly fits a model, ranks the features, and removes the weakest ones until the desired number remains
Wrapper methods are computationally expensive, and high-end computers are needed if a lot of data analysis is performed with the wrapper method.

11
Q

In your choice of language, write a program that prints the numbers ranging from one to 50.

A
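
One possible answer, using Python as the language of choice. Note that `range(1, 51)` stops before 51, so the last number printed is 50. The variable name `numbers` is used only for this example.

```python
# Build the numbers from 1 to 50 inclusive and print one per line.
numbers = list(range(1, 51))
for number in numbers:
    print(number)
```
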
12
Q

You are given a data set consisting of variables with more than 30 percent missing values. How will you deal with them?

A

The following are ways to handle missing data values:

If the data set is large, we can simply remove the rows with missing data values. It is the quickest way; we then use the rest of the data to predict the values.

For smaller data sets, we can substitute missing values with the mean of the rest of the data, using a pandas DataFrame in Python. For example, df.mean() computes the column means and df.fillna(df.mean()) fills the missing values with them.
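
The two strategies can be sketched without any third-party library, as below. This is a standard-library illustration of the same idea, not the pandas API itself; the helper names `drop_missing` and `fill_with_mean` and the sample records are invented for the example, and missing values are assumed to be stored as `None`.

```python
from statistics import mean


def drop_missing(rows):
    """Keep only the rows that have no missing (None) values."""
    return [row for row in rows if None not in row.values()]


def fill_with_mean(rows, column):
    """Return new rows where None in `column` is replaced by the mean of the observed values."""
    observed = [row[column] for row in rows if row[column] is not None]
    column_mean = mean(observed)
    return [
        {**row, column: column_mean if row[column] is None else row[column]}
        for row in rows
    ]


# Hypothetical records, invented for illustration.
rows = [
    {"age": 30, "income": 40},
    {"age": None, "income": 50},
    {"age": 50, "income": 60},
]

print(drop_missing(rows))            # the row with the missing age is removed
print(fill_with_mean(rows, "age"))   # the missing age becomes the mean of 30 and 50
```

Dropping rows loses the information in their other columns, which is why imputation tends to be preferred when the data set is small.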

13
Q

For the given points, how will you calculate the Euclidean distance in Python?

A

import math

plot1 = [1, 3]

plot2 = [2, 5]

The Euclidean distance can be calculated as follows:

euclidean_distance = math.sqrt((plot1[0] - plot2[0])**2 + (plot1[1] - plot2[1])**2)

For these two points, this gives sqrt(1 + 4) = sqrt(5), or about 2.236.

14
Q

What are dimensionality reduction and its benefits?

A

Dimensionality reduction refers to the process of converting a data set with a very large number of dimensions into data with fewer dimensions (fields) that conveys similar information concisely.

This reduction helps in compressing data and reducing storage space. It also reduces computation time as fewer dimensions lead to less computing. It removes redundant features; for example, there’s no point in storing a value in two different units (meters and inches).

15
Q

How should you maintain a deployed model?

A

Monitor
Constant monitoring of all models is needed to determine their performance accuracy. When you change something, you want to figure out how your changes are going to affect things. This needs to be monitored to ensure it’s doing what it’s supposed to do.

Evaluate
Evaluation metrics of the current model are calculated to determine if a new algorithm is needed.

Compare
The new models are compared to each other to determine which model performs the best.

Rebuild
The best performing model is re-built on the current state of data.

16
Q

What are recommender systems?

A

A recommender system predicts what a user would rate a specific product based on their preferences. It can be split into two different areas:

Collaborative Filtering
As an example, Last.fm recommends tracks that other users with similar interests play often. This is also commonly seen on Amazon after making a purchase; customers may notice the following message accompanied by product recommendations: “Users who bought this also bought…”

Content-based Filtering
As an example: Pandora uses the properties of a song to recommend music with similar properties. Here, we look at content, instead of looking at who else is listening to music.

17
Q

How do you find RMSE and MSE in a linear regression model?

A

RMSE and MSE are two of the most common measures of accuracy for a linear regression model.

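RMSE indicates the Root Mean Square Error, which is the square root of the MSE:

RMSE = sqrt( (1/N) × Σ (Predicted_i − Actual_i)² )

MSE indicates the Mean Square Error, which is the average of the squared differences between the predicted and actual values over N observations:

MSE = (1/N) × Σ (Predicted_i − Actual_i)²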
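
As a worked illustration, MSE is the mean of the squared differences between actual and predicted values, and RMSE is its square root. The sketch below computes both for a small set of made-up numbers; the function names `mse` and `rmse` and the sample values are assumptions for this example.

```python
import math


def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Square root of the MSE, expressed in the same units as the target."""
    return math.sqrt(mse(actual, predicted))


# Made-up values for illustration: squared errors are 1, 0, 1 and 4.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]

print("MSE: ", mse(actual, predicted))   # (1 + 0 + 1 + 4) / 4 = 1.5
print("RMSE:", rmse(actual, predicted))
```

Because RMSE is in the same units as the predicted variable, it is often easier to interpret than MSE.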

18
Q

How can outlier values be treated?

A

You can drop outliers only if they are garbage values.

Example: height of an adult = abc ft. This cannot be true, as the height cannot be a string value. In this case, outliers can be removed.

If the outliers have extreme values, they can be removed. For example, if all the data points are clustered between zero and 10, but one point lies at 100, then we can remove this point.

If you cannot drop outliers, you can try the following:

Try a different model. Data detected as outliers by linear models can be fit by nonlinear models. Therefore, be sure you are choosing the correct model.
Try normalizing the data. This way, the extreme data points are pulled to a similar range.
You can use algorithms that are less affected by outliers; an example would be random forests.

19
Q

How can time-series data be declared as stationary?

A

It is stationary when the variance and mean of the series are constant with time.
In the first graph, the variance is constant with time. Here, X is the time factor and Y is the variable. The value of Y goes through the same points all the time; in other words, it is stationary.

In the second graph, the waves get bigger, which means it is non-stationary and the variance is changing with time.
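
A rough way to see the difference in code is to compare the mean and variance of the first and second halves of a series. This is only a crude sketch, not a formal stationarity test; the function name `halves_summary` and both sample series are invented for illustration.

```python
from statistics import mean, pvariance


def halves_summary(series):
    """Return (mean, variance) for the first and second half of the series."""
    mid = len(series) // 2
    first, second = series[:mid], series[mid:]
    return (mean(first), pvariance(first)), (mean(second), pvariance(second))


# A series that keeps passing through the same values, and one whose waves grow.
steady = [1, -1, 1, -1, 1, -1, 1, -1]
growing = [1, -1, 2, -2, 3, -3, 4, -4]

print("steady: ", halves_summary(steady))    # same mean and variance in both halves
print("growing:", halves_summary(growing))   # variance increases in the second half
```

For the growing series the mean stays at zero but the variance rises from one half to the next, which is the non-stationary behavior described above.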

20
Q

How can you calculate accuracy using a confusion matrix?

A

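A confusion matrix tabulates the actual values against the predicted values, so you can read off the total number of observations and the counts of true positives, true negatives, false positives, and false negatives. Accuracy is the share of all observations that were predicted correctly.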

21
Q

What is the formula for accuracy?

A

Accuracy = (True Positive + True Negative) / Total Observations

= (262 + 347) / 650

= 609 / 650

≈ 0.937

As a result, we get an accuracy of about 93.7 percent.

22
Q

How do you calculate the precision and recall rate?

A

Precision = (True Positive) / (True Positive + False Positive)

= 262 / 277

≈ 0.946

Recall Rate = (True Positive) / (True Positive + False Negative)

= 262 / 288

≈ 0.910
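
The same arithmetic can be checked in a few lines of Python. The counts below are taken from the worked numbers in these cards (262 true positives, 347 true negatives, 650 observations, and the denominators 277 and 288); the variable names are chosen for this example.

```python
tp, tn, total = 262, 347, 650
predicted_positive = 277  # true positives + false positives
actual_positive = 288     # true positives + false negatives

fp = predicted_positive - tp  # false positives
fn = actual_positive - tp     # false negatives

# Sanity check: the four cells of the confusion matrix add up to the total.
assert tp + tn + fp + fn == total

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"accuracy  = {accuracy:.3f}")
print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
```

Precision asks how many of the predicted positives were right, while recall asks how many of the actual positives were found.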

23
Q

What is logistic Regression?

A

Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). Like all regression analyses, the logistic regression is a predictive analysis. Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.

24
Q

What is Linear Regression?

A

Linear regression analysis is used to predict the value of one variable based on the value of another variable by fitting a straight line to the data (for instance, if X = 1 then Y = 2, and so on). Plotted on a graph, the fitted relationship is a straight line.

25
Q

What is K-means clustering?

A

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. “The objective of K-means is simple: group similar data points together and discover underlying patterns. To achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset.” A cluster refers to a collection of data points aggregated together because of certain similarities.

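
A minimal sketch of the idea on one-dimensional data follows. It alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. The function name `k_means`, the fixed starting centroids, and the sample points are assumptions for this illustration; real implementations usually pick starting centroids at random and work in many dimensions.

```python
def k_means(points, centroids, iterations=10):
    """Simple k-means on 1-D data, starting from the given centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means(points, centroids=[0.0, 5.0])  # k = 2
print("centroids:", centroids)
print("clusters: ", clusters)
```

With k = 2, the two groups of nearby points end up in separate clusters, each centered on the mean of its members.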
26
Q

What is the Apriori algorithm?

A

Apriori is an algorithm used for association rule mining. It searches for frequent sets of items in a dataset and builds on the associations and correlations between those itemsets. It is the kind of algorithm behind the “You may also like” suggestions commonly seen on recommendation platforms.

27
Q

After studying the behavior of a population, you have identified four specific individual types that are valuable to your study. You would like to find all users who are most similar to each individual type. Which algorithm is most appropriate for this study?

A

As we are looking to group people around four different types, the number of types gives the value of k (k = 4). Therefore, K-means clustering (answer A) is the most appropriate algorithm for this study.

28
Q

We want to predict the probability of death from heart disease based on three risk factors: age, gender, and blood cholesterol level. What is the most appropriate algorithm for this case?

A

The most appropriate algorithm for this case is A, logistic regression.

29
Q

What is the True Positive Rate and what is the formula for it?

A

The True Positive Rate (TPR) is the ratio of the True Positives (TP) to the sum of the True Positives (TP) and False Negatives (FN). The formula is: TPR = TP / (TP + FN)

30
Q

What is the False Positive Rate and what is its formula?

A

The False Positive Rate (FPR) defines the probability that an actual negative result will be shown as a positive one, i.e., the probability that a model will generate a false alarm. The FPR is the ratio of the False Positives (FP) to the sum of the False Positives (FP) and True Negatives (TN), that is, to the total number of actual negatives. The formula is: FPR = FP / (FP + TN)

31
Q

What is the ROC curve?

A

The graph of the True Positive Rate on the y-axis against the False Positive Rate on the x-axis is called the ROC curve and is used in binary classification. The False Positive Rate (FPR) is calculated by taking the ratio between False Positives and the total number of negative samples, and the True Positive Rate (TPR) is calculated by taking the ratio between True Positives and the total number of positive samples. To construct the ROC curve, the TPR and FPR values are plotted at multiple threshold values. The area under the ROC curve (AUC) ranges between 0 and 1. A completely random model, which is represented by a diagonal straight line, has an AUC of 0.5. The amount by which a ROC curve deviates from this straight line indicates how well the model performs.
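
The construction can be sketched in plain Python: for each threshold, count the true and false positives, convert them to rates, and then measure the area under the resulting points with the trapezoidal rule. The function names `roc_points` and `auc`, and the labels and scores, are invented for this illustration.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, one per threshold, from the highest score down."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve using the trapezoidal rule."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


# Hypothetical true labels (1 = positive) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
print("ROC points (FPR, TPR):", points)
print("AUC:", auc(points))
```

In this example one negative sample outranks one positive sample, so the curve falls slightly short of the perfect top-left corner and the area is 8/9 rather than 1.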