Analysis Basics Flashcards

(71 cards)

1
Q

What do you use to visualize the distribution or spread of a variable?

A
  1. Histogram
  2. Box plot
2
Q

What do you do to understand the distribution?

A

Examine the “measures of central tendency”. These describe the “middle” of the data using the mean, median, and mode.

3
Q

A simple average based on adding together all of the values in the sample set and then dividing the total by the number of samples.

A

Mean

4
Q

The value in the middle of the sample values once they are sorted.

A

Median

5
Q

The most commonly occurring value in the sample set

A

Mode

6
Q

This refers to a tie for the most common value.

A

Bimodal or Multimodal
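The three measures of central tendency, and a tie for the most common value, can be illustrated with Python's standard `statistics` module. This is a minimal sketch; the sample values are hypothetical and were chosen so that two values tie for the mode.

```python
import statistics

# Hypothetical sample values, chosen so that two values tie for most common.
values = [2, 3, 3, 5, 7, 7, 8]

mean = statistics.mean(values)        # sum of the values / number of values
median = statistics.median(values)    # middle value once the data is sorted
modes = statistics.multimode(values)  # every value tied for most common

print("mean:", mean)
print("median:", median)
print("modes:", modes)  # more than one entry means the data is multimodal
```

Because 3 and 7 each appear twice, `multimode` returns both, which is the bimodal case described in the card above.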

7
Q

It refers to the data that we have on hand.

A

Samples

8
Q

It refers to all the data that we can collect.

A

Population

9
Q

Which function is used to estimate the distribution of a variable for the full population?

A

Probability Density Function (PDF)

10
Q

What type of distribution has the mean and mode at the center and symmetric tails?

A

Normal Distribution

11
Q

What type of distribution has the “bell shape” characteristic?

A

Normal Distribution

12
Q

It refers to a tendency to select certain types of values more frequently than others, in a way that misrepresents the underlying population, or ‘real world’.

A

Bias

13
Q

What are the things to remember when examining real world data?

A
  1. Check for missing values and badly recorded data
  2. Consider removal of obvious outliers
  3. Consider what real-world factors might affect your analysis and consider if your dataset size is large enough to handle this
  4. Check for biased raw data and consider your options to fix this, if found
14
Q

It is a value that lies significantly outside the range of the rest of the distribution.

A

Outlier

15
Q

Which type of distribution has the mass of the data on the left side, creating a long tail to the right because values at the extreme high end pull the mean to the right?

A

Right skewed
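A small illustration of how one extreme high value pulls the mean to the right while the median stays put. The numbers are hypothetical, and comparing mean to median is only a rough indicator of skew, not a formal test.

```python
import statistics

# Hypothetical sample: most values are small, one value is extremely high.
values = [1, 2, 2, 3, 3, 3, 4, 50]

mean = statistics.mean(values)
median = statistics.median(values)

# In right-skewed data the mean typically sits above the median.
print("mean:", mean)
print("median:", median)
print("mean above median:", mean > median)
```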

16
Q

How do you measure variability (variance) in the data?

A
  1. Range
  2. Variance
  3. Standard Deviation
17
Q

This refers to the difference between the maximum and minimum. There’s no built-in function for this, but it’s easy to calculate using the min and max functions.

A

Range

18
Q

This refers to the average of the squared differences from the mean. You can use the built-in var function to find this.

A

Variance

19
Q

This refers to the square root of the variance. You can use the built-in std function to find this.

A

Standard Deviation
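The three measures of variability can be sketched with the standard library alone. The cards refer to built-in `var` and `std` functions without naming the library, so the use of the `statistics` module here is an assumption, as are the sample values. Note that libraries differ on whether they divide by n (population form) or n - 1 (sample form); pandas, for example, defaults to the sample form.

```python
import statistics

# Hypothetical sample values.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: difference between the maximum and the minimum.
value_range = max(values) - min(values)

# Population variance: average of the squared differences from the mean.
pop_variance = statistics.pvariance(values)

# Population standard deviation: square root of the population variance.
pop_std = statistics.pstdev(values)

# Sample variance divides by (n - 1) instead of n.
sample_variance = statistics.variance(values)

print(value_range, pop_variance, pop_std, sample_variance)
```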

20
Q

It is a built-in method of the DataFrame object that returns the main descriptive statistics for all numeric columns.

A

df.describe()

21
Q

When comparing numeric variables, how do you deal with numeric data in different scales?

A

Normalize the data

22
Q

It is a technique that distributes the values proportionally on a scale of 0 to 1.

A

MinMax scaling
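A minimal sketch of the idea behind MinMax scaling, written in plain Python. The function name `min_max_scale` and the input values are hypothetical; libraries such as scikit-learn provide their own scaler, which this does not try to reproduce.

```python
def min_max_scale(values):
    """Rescale values proportionally so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map everything to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([10, 20, 30, 50])
print(scaled)
```

After scaling, variables that were measured on different scales can be compared on the same 0 to 1 axis.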

23
Q

This indicates the strength of the relationship between variables.

A

Correlation

Values above 0 indicate a positive correlation (high values of one variable tend to coincide with high values of the other), while values below 0 indicate a negative correlation (high values of one variable tend to coincide with low values of the other).
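The sign behaviour described above can be checked with a hand-written Pearson correlation coefficient. The card does not name a specific correlation measure, so Pearson is an assumption here, and the function name `pearson_r` and the data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]   # high values coincide with high values
falling = [10, 8, 6, 4, 2]  # high values coincide with low values

r_positive = pearson_r(hours, rising)
r_negative = pearson_r(hours, falling)
print(r_positive, r_negative)
```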

24
Q

What do you use to visualize the correlation between two numeric variables?

A
  1. Scatter plot
25
It is added to a scatter plot that shows the general trend in the data.
Regression line (line of best fit)
26
What is the slope-intercept form of a linear equation?
y = mx + b
Where:
  - y and x are the coordinate variables
  - m is the slope of the line
  - b is the y-intercept (where the line crosses the y-axis)
27
It is the line that gives us the lowest value for the sum of the squared errors
Least Squares Regression
28
This returns (among other things) the coefficients you need for the slope equation: slope (m) and intercept (b) based on a given pair of variable samples you want to compare.
linregress method
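The slope and intercept of the least squares line can also be computed by hand, which shows what a function such as SciPy's `scipy.stats.linregress` returns among its results. This sketch does not call SciPy; the function name `least_squares_fit` and the hours/grades data are hypothetical.

```python
def least_squares_fit(xs, ys):
    """Return slope m and intercept b of the line y = m*x + b that minimizes
    the sum of squared errors for the given points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    m = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    b = mean_y - m * mean_x
    return m, b


# Hypothetical data: hours studied and grade achieved.
hours = [1, 2, 3, 4, 5]
grades = [52, 61, 68, 79, 90]

m, b = least_squares_fit(hours, grades)

# Use the fitted line to predict a grade for 6 hours of study.
predicted = m * 6 + b
print(m, b, predicted)
```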
29
It is the process of taking a set of sample data that includes one or more features (in this case, the number of hours studied) and a known label value (in this case, the grade achieved), and using that data to derive a function that calculates predicted label values for any given set of features.
Machine Learning
30
It works by establishing a relationship between variables in the data that represents characteristics known as the features of the thing being observed and the variable that we’re trying to predict known as the label.
Regression
31
It is an example of a supervised machine learning technique in which you train a model to predict a numeric label based on an item's features.
Regression
32
It is the difference between a predicted label value and the actual label value as a measure of error.
Residuals
33
What are the kinds of Linear Regression algorithms?
  1. Least Squares
  2. Lasso
  3. Ridge
34
What are the kinds of Regression algorithms?
  1. Linear Regression
  2. Tree-based
  3. Ensemble
35
These are algorithms that build a decision tree to reach a prediction.
Tree-based Regression Algorithm
36
These are algorithms that combine the outputs of multiple base algorithms to improve generalizability.
Ensemble Algorithm
37
This algorithm works by combining multiple base estimators to produce an optimal model, either by applying an aggregate function to a collection of base models (sometimes referred to as bagging) or by building a sequence of models that build on one another to improve predictive performance (referred to as boosting).
Ensemble Algorithm
38
This algorithm works by using a tree-based approach in which the features in the dataset are examined in a series of evaluations, each of which results in a branch in a decision tree based on the feature value. At the end of each series of branches are leaf nodes with the predicted label value based on the feature values.
Decision Tree Algorithm
39
This optimization method works by applying an aggregate function to a collection of base models.
Bagging
40
This optimization method works by building a sequence of models that build on one another to improve predictive performance
Boosting
41
This boosting algorithm is similar to a Random Forest algorithm which builds multiple trees; but instead of building them all independently and taking the average result, each tree is built on the outputs of the previous one in an attempt to incrementally reduce the loss (error) in the model.
Gradient Boosting
42
This bagging algorithm applies an averaging function to multiple Decision Tree models for a better overall model.
Random Forest
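As a rough side-by-side of bagging and boosting, the sketch below trains a Random Forest and a Gradient Boosting regressor on the same data. It assumes scikit-learn and NumPy are installed; the synthetic data, the estimator counts, and the random seeds are all hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Hypothetical synthetic data: a noisy line y = 2x + 1.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 0.5, size=200)

# Bagging: many trees built independently, predictions averaged.
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Boosting: trees built in sequence, each one reducing the remaining error.
boosted = GradientBoostingRegressor(n_estimators=100, random_state=0).fit(X, y)

X_new = np.array([[2.5], [7.5]])
forest_pred = forest.predict(X_new)
boosted_pred = boosted.predict(X_new)
print(forest_pred, boosted_pred)
```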
43
This refers to changes you make to your data before it's passed to the model.
Preprocessing
44
This refers to the values that you specify to affect the behavior of a training algorithm.
Hyperparameters
45
This algorithm is an ensemble that combines multiple decision trees to create an overall predictive model.
Gradient Boosting Regressor
46
Provide preprocessing transformations to get your data ready for modeling.
  1. Scaling numeric features
  2. Encoding categorical variables
47
Provide techniques to encode categorical variables.
  1. Ordinal encoding
  2. One-hot encoding
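The two encoding techniques can be sketched in plain Python. The color values are hypothetical, and the alphabetical category order used for the ordinal codes is an arbitrary assumption; real ordinal encoding normally follows a meaningful order in the data.

```python
# Hypothetical categorical column.
colors = ["red", "green", "blue", "green"]

# Fix a category order (alphabetical here, purely for illustration).
categories = sorted(set(colors))

# Ordinal encoding: replace each category with its integer position.
ordinal_index = {category: i for i, category in enumerate(categories)}
ordinal = [ordinal_index[c] for c in colors]

# One-hot encoding: one binary column per category, with a single 1 per row.
one_hot = [[1 if c == category else 0 for category in categories] for c in colors]

print(categories)
print(ordinal)
print(one_hot)
```

Ordinal codes imply an order between categories, while one-hot columns do not, which is usually the deciding factor between the two.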
48
How do you save a model?
joblib.dump(model, filename)
49
How do you load a model?
model = joblib.load(filename)
50
How do you use a model to generate predictions?
model.predict
51
You have created a model object using the scikit-learn LinearRegression class. What should you do to train the model?
Call the fit() method of the model object, specifying the training feature and label arrays
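The train, save, load, and predict steps from the last few cards can be combined into one short round trip. This sketch assumes scikit-learn, joblib, and NumPy are installed; the training data and the file name `model.pkl` are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data following y = 2x + 1.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([3.0, 5.0, 7.0, 9.0])

# Train: call fit() with the training feature and label arrays.
model = LinearRegression()
model.fit(X_train, y_train)

# Save the trained model to a temporary file, then load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    filename = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(model, filename)
    loaded_model = joblib.load(filename)

# Predict: call predict() with new feature values.
prediction = loaded_model.predict(np.array([[5.0]]))
print(prediction)
```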
52
It is a measure of how much of the variance the model can explain.
R squared metric
53
It works by establishing a relationship between variables in the data that represent characteristics—known as the features—of the thing being observed, and the variable we're trying to predict—known as the label.
Regression
54
This metric for measuring loss in a regression squares the individual residuals, sums the squares, and calculates the mean. Squaring the residuals has the effect of basing the calculation on absolute values (ignoring whether the difference is negative or positive) and giving more weight to larger differences.
Mean Squared Error or MSE
55
This metric for measuring loss in a regression is calculated by getting the square root of the MSE. This is to express the loss in the same unit of measurement as the predicted label value itself.
Root Mean Squared Error or RMSE
56
This metric for measuring loss in a regression is also known as coefficient of determination.
R squared
57
This metric for evaluating a regression is the square of the correlation between x and y. It produces a value between 0 and 1 that measures the amount of variance that can be explained by the model. Generally, the closer this value is to 1, the better the model predicts.
R squared
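The residuals, MSE, RMSE, and R squared from the preceding cards can be computed by hand on a tiny example. The actual and predicted values are hypothetical, and R squared is computed here with the common definition 1 - SS_res / SS_tot, which is an assumption about the intended formula.

```python
import math

# Hypothetical actual and predicted label values.
y_true = [3, 5, 7, 9]
y_pred = [2, 5, 8, 9]

# Residuals: actual value minus predicted value.
residuals = [t - p for t, p in zip(y_true, y_pred)]

# MSE: mean of the squared residuals.
mse = sum(r ** 2 for r in residuals) / len(residuals)

# RMSE: square root of the MSE, in the same units as the label.
rmse = math.sqrt(mse)

# R squared: share of the variance in the labels explained by the model.
mean_true = sum(y_true) / len(y_true)
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r_squared = 1 - ss_res / ss_tot

print(residuals, mse, rmse, r_squared)
```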
58
It is a container that holds related resources for an Azure solution
Resource Group
59
What are the 4 kinds of compute resource?
  1. Compute instances
  2. Compute clusters
  3. Inference clusters
  4. Attached compute
60
These are development workstations that data scientists can use to work with data and models.
Compute instances
61
These are scalable clusters of virtual machines for on-demand processing of experiment code.
Compute clusters
62
These are deployment targets for predictive services that use your trained models.
Inference clusters
63
These are links to Azure compute resources, such as Virtual Machines or Azure Databricks clusters.
Attached compute
64
It is the variance between predicted and true values that cannot be explained by the model.
Residuals
65
It is an example of a supervised machine learning technique in which you train a model to predict a numeric label based on an item's features.
Regression
66
It is a cloud-based platform for building and operating machine learning solutions in Azure.
Microsoft Azure Machine Learning
67
It refers to the average absolute difference between predicted values and true values. The lower this value is, the better the model is predicting.
Mean Absolute Error (MAE)
68
It refers to the square root of the mean squared difference between predicted and true values. When compared to the MAE (above), a larger difference indicates greater variance in the individual errors (for example, with some errors being very small, while others are large).
Root Mean Squared Error (RMSE)
69
It refers to a relative metric between 0 and 1 based on the square of the differences between predicted and true values. The closer to 0 this metric is, the better the model is performing. Because this metric is relative, it can be used to compare models where the labels are in different units.
Relative Squared Error (RSE)
70
It refers to a relative metric between 0 and 1 based on the absolute differences between predicted and true values. The closer to 0 this metric is, the better the model is performing. Like RSE, this metric can be used to compare models where the labels are in different units.
Relative Absolute Error (RAE)
71
It summarizes how much of the variance between predicted and true values is explained by the model. The closer to 1 this value is, the better the model is performing.
Coefficient of Determination
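The last few metrics can be compared on one small example. The values are hypothetical, and the formulas are the commonly used definitions (errors measured relative to a baseline that always predicts the mean), which is an assumption about how the cards' source computes them.

```python
# Hypothetical true and predicted label values.
y_true = [3, 5, 7, 9]
y_pred = [2, 5, 8, 9]

n = len(y_true)
mean_true = sum(y_true) / n
errors = [t - p for t, p in zip(y_true, y_pred)]

# MAE: average absolute difference between predicted and true values.
mae = sum(abs(e) for e in errors) / n

# RSE: squared error relative to the squared error of predicting the mean.
rse = sum(e ** 2 for e in errors) / sum((t - mean_true) ** 2 for t in y_true)

# RAE: absolute error relative to the absolute error of predicting the mean.
rae = sum(abs(e) for e in errors) / sum(abs(t - mean_true) for t in y_true)

# Coefficient of determination: 1 minus the relative squared error.
r_squared = 1 - rse

print(mae, rse, rae, r_squared)
```

For RSE and RAE, values closer to 0 are better; for the coefficient of determination, values closer to 1 are better.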