Basic_Knowledge Flashcards
(31 cards)
What is Supervised Learning?
Uses known and labeled data as input
Supervised learning has a feedback mechanism
The most commonly used supervised learning algorithms are decision trees, logistic regression, and support vector machines
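As a rough sketch of what "labeled data plus feedback" looks like in practice (scikit-learn assumed; the tiny data set below is made up for illustration):

from sklearn.tree import DecisionTreeClassifier

# Labeled training data: every input row comes with a known label
X_train = [[0, 0], [1, 1], [1, 0], [0, 1]]   # features
y_train = [0, 1, 1, 0]                       # known labels (the feedback signal)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)      # learns from the labeled examples
print(model.predict([[1, 1]]))   # predicts a label for a new input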
What is Unsupervised Learning?
Uses unlabeled data as input
Unsupervised learning has no feedback mechanism
The most commonly used unsupervised learning algorithms are k-means clustering, hierarchical clustering, and apriori algorithm
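A minimal sketch of clustering unlabeled data (scikit-learn assumed; the points are illustrative):

from sklearn.cluster import KMeans

# Unlabeled data: no target values are provided
X = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # cluster assignments discovered without any labels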
How is logistic regression done?
Logistic regression measures the relationship between the dependent variable (our label of what we want to predict) and one or more independent variables (our features) by estimating probability using its underlying logistic function (sigmoid).
What is a sigmoid function?
A sigmoid function is a mathematical function having a characteristic “S”-shaped curve or sigmoid curve.
A sigmoid function is a bounded, differentiable, real function that is defined for all real input values and has a non-negative derivative at each point and exactly one inflection point. A sigmoid “function” and a sigmoid “curve” refer to the same object.
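As a quick illustration, the logistic sigmoid sigma(x) = 1 / (1 + e^(-x)) can be written in a few lines of Python (NumPy assumed):

import numpy as np

def sigmoid(x):
    # Maps any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0))                      # 0.5, the inflection point
print(sigmoid(np.array([-5, 0, 5])))   # approaches 0 and 1 at the extremes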
Explain the steps in making a decision tree.
Take the entire data set as input
Calculate entropy of the target variable, as well as the predictor attributes
Calculate your information gain of all attributes (we gain information on sorting different objects from each other)
Choose the attribute with the highest information gain as the root node
Repeat the same procedure on every branch until the decision node of each branch is finalized
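A small sketch of the entropy and information-gain calculation behind these steps (the toy labels are made up; NumPy assumed):

import numpy as np
from collections import Counter

def entropy(labels):
    # H = -sum(p * log2(p)) over the class proportions
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(target, attribute):
    weighted = 0.0
    for value in set(attribute):
        subset = [t for t, a in zip(target, attribute) if a == value]
        weighted += len(subset) / len(target) * entropy(subset)
    return entropy(target) - weighted   # pick the attribute with the largest gain

target    = ['yes', 'yes', 'no', 'no', 'yes', 'no']   # target variable
attribute = ['a',   'a',   'a',  'b',  'b',   'b']    # one predictor attribute
print(information_gain(target, attribute))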
What is a random forest model?
A random forest is built up of a number of decision trees. If you split the data into different packages and make a decision tree in each of the different groups of data, the random forest brings all those trees together.
How do you build a random forest model?
Randomly select ‘k’ features from a total of ‘m’ features where k << m
Among the ‘k’ features, calculate the node D using the best split point
Split the node into daughter nodes using the best split
Repeat steps two and three until leaf nodes are finalized
Build forest by repeating steps one to four for ‘n’ times to create ‘n’ number of trees
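With scikit-learn, this procedure is wrapped up in RandomForestClassifier, where max_features plays the role of ‘k’ and n_estimators the role of ‘n’ (the data below is illustrative):

from sklearn.ensemble import RandomForestClassifier

X = [[0, 0, 1], [1, 1, 0], [1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 0]]
y = [0, 1, 1, 0, 1, 0]

# n_estimators = 'n' trees; max_features = 'k' features considered per split
forest = RandomForestClassifier(n_estimators=100, max_features=2, random_state=0)
forest.fit(X, y)
print(forest.predict([[1, 0, 0]]))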
How can you avoid overfitting your model?
Overfitting refers to a model that is fitted so closely to a small amount of training data that it ignores the bigger picture and fails to generalize to new data. There are three main methods to avoid overfitting:
Keep the model simple—take fewer variables into account, thereby removing some of the noise in the training data
Use cross-validation techniques, such as k-fold cross-validation
Use regularization techniques, such as LASSO, that penalize certain model parameters if they’re likely to cause overfitting
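A hedged sketch of the last two points (scikit-learn assumed; the synthetic data is only for illustration):

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

# LASSO regularization shrinks coefficients, pushing noisy ones toward zero
model = Lasso(alpha=0.1)

# k-fold cross-validation checks that performance holds up on unseen folds
scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores.mean())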
Differentiate between univariate, bivariate, and multivariate analysis.
Univariate
Univariate data contains only one variable. The purpose of the univariate analysis is to describe the data and find patterns that exist within it.
Example: the heights of students. The patterns can be studied by drawing conclusions using the mean, median, mode, dispersion or range, minimum, maximum, etc.
________________________
Bivariate
Bivariate data involves two different variables. The analysis of this type of data deals with causes and relationships and the analysis is done to determine the relationship between the two variables.
Example: temperature and ice cream sales in the summer season. Here, the relationship between the two variables shows that temperature and sales are directly proportional to each other: the hotter the temperature, the better the sales.
________________________
Multivariate
When data involves three or more variables, it is categorized as multivariate. It is similar to bivariate data but contains more than one dependent variable.
Example: data for house price prediction. The patterns can be studied by drawing conclusions using the mean, median, mode, dispersion or range, minimum, maximum, etc. You can start describing the data and using it to estimate what the price of the house will be.
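For instance, with pandas (the columns below belong to a small made-up DataFrame), the three levels of analysis look like this:

import pandas as pd

df = pd.DataFrame({
    'height': [150, 160, 165, 170, 180],           # univariate: one variable
    'temperature': [20, 25, 30, 32, 35],           # bivariate: paired with sales
    'ice_cream_sales': [100, 150, 200, 230, 260],
    'house_price': [300, 320, 400, 420, 500],      # multivariate: several variables together
})

print(df['height'].describe())                        # univariate summary (mean, min, max, ...)
print(df['temperature'].corr(df['ice_cream_sales']))  # bivariate relationship
print(df.corr())                                      # pairwise view across all variables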
What are the feature selection methods used to select the right variables?
There are two main methods for feature selection: filter methods and wrapper methods.
Filter Methods
This involves:
Linear discriminant analysis
ANOVA
Chi-Square
The best analogy for selecting features is “bad data in, bad answer out.” When we’re limiting or selecting the features, it’s all about cleaning up the data coming in.
Wrapper Methods
This involves:
Forward Selection: We test one feature at a time and keep adding them until we get a good fit
Backward Selection: We test all the features and start removing them to see what works better
Recursive Feature Elimination: Recursively looks through all the different features and how they pair together
Wrapper methods are very labor-intensive, and high-end computers are needed if a lot of data analysis is performed with the wrapper method.
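A brief sketch of one filter method and one wrapper method with scikit-learn (the iris data is just a convenient example; chi-square requires non-negative features):

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter method: score each feature independently with the chi-square test
filtered = SelectKBest(score_func=chi2, k=2).fit(X, y)
print(filtered.get_support())   # mask of the selected features

# Wrapper method: recursive feature elimination around an actual model
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
print(rfe.support_)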
In your choice of language, write a program that prints the numbers ranging from one to 50.
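One possible answer, sketched in Python:

for i in range(1, 51):   # range's upper bound is exclusive, so use 51
    print(i)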
You are given a data set consisting of variables with more than 30 percent missing values. How will you deal with them?
The following are ways to handle missing data values:
If the data set is large, we can simply remove the rows with missing data values. It is the quickest way; we use the rest of the data to predict the values.
For smaller data sets, we can substitute missing values with the mean or average of the rest of the data using a pandas DataFrame in Python. There are different ways to do so, such as df.mean() and df.fillna(df.mean()).
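In pandas this might look like the following (the column names and values are made up for the sketch):

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 32, 40, np.nan],
                   'salary': [50, 60, np.nan, 80, 90]})

df_dropped = df.dropna()            # large data set: drop rows with missing values
df_filled = df.fillna(df.mean())    # smaller data set: substitute the column mean
print(df_filled)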
For the given points, how will you calculate the Euclidean distance in Python?
plot1 = [1,3]
plot2 = [2,5]
The Euclidean distance can be calculated as follows:
from math import sqrt
euclidean_distance = sqrt((plot1[0] - plot2[0])**2 + (plot1[1] - plot2[1])**2)
What are dimensionality reduction and its benefits?
Dimensionality reduction refers to the process of converting a data set with vast dimensions into data with fewer dimensions (fields) to convey similar information concisely.
This reduction helps in compressing data and reducing storage space. It also reduces computation time as fewer dimensions lead to less computing. It removes redundant features; for example, there’s no point in storing a value in two different units (meters and inches).
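For example, principal component analysis (PCA) is one common dimensionality-reduction technique (scikit-learn assumed; the data is random and only for illustration):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 200 samples, 10 dimensions

pca = PCA(n_components=3)               # keep only 3 dimensions
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (200, 3)
print(pca.explained_variance_ratio_)    # how much information each component retains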
How should you maintain a deployed model?
Monitor
Constant monitoring of all models is needed to determine their performance accuracy. When you change something, you want to figure out how your changes are going to affect things. This needs to be monitored to ensure it’s doing what it’s supposed to do.
Evaluate
Evaluation metrics of the current model are calculated to determine if a new algorithm is needed.
Compare
The new models are compared to each other to determine which model performs the best.
Rebuild
The best performing model is re-built on the current state of data.
What are recommender systems?
A recommender system predicts what a user would rate a specific product based on their preferences. It can be split into two different areas:
Collaborative Filtering
As an example, Last.fm recommends tracks that other users with similar interests play often. This is also commonly seen on Amazon after making a purchase; customers may notice the following message accompanied by product recommendations: “Users who bought this also bought…”
Content-based Filtering
As an example: Pandora uses the properties of a song to recommend music with similar properties. Here, we look at content, instead of looking at who else is listening to music.
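A very small collaborative-filtering sketch (the ratings matrix is invented; cosine similarity between users is one common choice):

import numpy as np

# Rows = users, columns = items; 0 means "not rated yet"
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
])

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Similarity of user 0 to every user; the most similar user's
# highly rated, not-yet-seen items become the recommendations
sims = [cosine_similarity(ratings[0], ratings[u]) for u in range(len(ratings))]
print(sims)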
How do you find RMSE and MSE in a linear regression model?
RMSE and MSE are two of the most common measures of accuracy for a linear regression model.
RMSE indicates the Root Mean Square Error:
RMSE = sqrt( Σ (predicted − actual)² / n )
MSE indicates the Mean Square Error:
MSE = Σ (predicted − actual)² / n
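Both can be computed in a few lines (NumPy and scikit-learn assumed; the values are illustrative):

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 7.5, 10.0])   # actual values
y_pred = np.array([2.5, 5.0, 8.0, 9.0])    # model predictions

mse = mean_squared_error(y_true, y_pred)   # mean of the squared errors
rmse = np.sqrt(mse)                        # square root of the MSE
print(mse, rmse)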
How can outlier values be treated?
You can drop an outlier only if it is a garbage value.
Example: height of an adult = abc ft. This cannot be true, as the height cannot be a string value. In this case, outliers can be removed.
If the outliers have extreme values, they can be removed. For example, if all the data points are clustered between zero to 10, but one point lies at 100, then we can remove this point.
If you cannot drop outliers, you can try the following:
Try a different model. Data detected as outliers by linear models can be fit by nonlinear models. Therefore, be sure you are choosing the correct model.
Try normalizing the data. This way, the extreme data points are pulled to a similar range.
You can use algorithms that are less affected by outliers; an example would be random forests.
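One common sketch for spotting extreme values is the interquartile-range (IQR) rule (the numbers are illustrative):

import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])   # 100 is the suspicious point

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
cleaned = data[(data >= lower) & (data <= upper)]
print(outliers, cleaned)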
How can time-series data be declared stationary?
It is stationary when the variance and mean of the series are constant with time.
For example, take a plot where X is the time factor and Y is the variable. If the value of Y fluctuates around the same level with a constant spread over time, the series is stationary.
If the waves get bigger over time, the variance is changing with time and the series is non-stationary.
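In practice, stationarity is often checked with the Augmented Dickey-Fuller test, for example via statsmodels (assuming it is installed; the random-walk series is illustrative):

import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=200))   # a random walk: its mean drifts, so it is non-stationary

adf_stat, p_value = adfuller(series)[:2]
print(p_value)   # a large p-value means we cannot reject non-stationarity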
How can you calculate accuracy using a confusion matrix?
From the confusion matrix, you can read off the values for the total data, the actual values, and the predicted values. In this example, the matrix contains 262 true positives, 347 true negatives, 15 false positives, and 26 false negatives, for a total of 650 observations.
What is the formula for accuracy?
Accuracy = (True Positive + True Negative) / Total Observations
= (262 + 347) / 650
= 609 / 650
= 0.93
As a result, we get an accuracy of 93 percent.
How do you calculate the precision and recall rate?
Precision = (True positive) / (True Positive + False Positive)
= 262 / 277
= 0.94
Recall Rate = (True Positive) / (True Positive + False Negative)
= 262 / 288
= 0.90
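The same arithmetic in a few lines of Python, using the example counts above (TP = 262, TN = 347, FP = 15, FN = 26):

tp, tn, fp, fn = 262, 347, 15, 26

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 609 / 650
precision = tp / (tp + fp)                   # 262 / 277
recall = tp / (tp + fn)                      # 262 / 288
print(accuracy, precision, recall)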
What is Logistic Regression?
Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). Like all regression analyses, the logistic regression is a predictive analysis. Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
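A minimal sketch with scikit-learn (the hours-studied vs. pass/fail data is invented):

from sklearn.linear_model import LogisticRegression

X = [[1], [2], [3], [4], [5], [6]]   # independent variable: hours studied
y = [0, 0, 0, 1, 1, 1]               # dichotomous dependent variable: fail/pass

clf = LogisticRegression().fit(X, y)
print(clf.predict([[3.5]]))          # predicted class
print(clf.predict_proba([[3.5]]))    # estimated probabilities via the sigmoid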
What is Linear Regression?
Linear regression analysis is used to predict the value of one variable based on the value of another by fitting a straight line to the data (for instance, if Y = 2X, then X = 1 gives Y = 2, and so on), plotted linearly on a graph.
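For example, the Y = 2X relationship mentioned above can be fitted directly (scikit-learn assumed):

from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4]]   # independent variable
y = [2, 4, 6, 8]           # dependent variable (Y = 2X)

reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)   # slope of about 2, intercept of about 0
print(reg.predict([[5]]))          # about 10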