Analysis Basics Flashcards

1
Q

What do you use to visualize the distribution or spread of a variable?

A
  1. Histogram
  2. Box plot
2
Q

What do you do to understand the distribution?

A

Examine the “measures of central tendency”. This refers to describing the “middle” of the data by getting the mean, median, and mode.

3
Q

A simple average based on adding together all of the values in the sample set and then dividing the total by the number of samples.

A

Mean

4
Q

The value in the middle of all of the sample values when they are sorted in order.

A

Median

5
Q

The most commonly occurring value in the sample set.

A

Mode

6
Q

This refers to a tie for the most common value.

A

Bimodal or Multimodal
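
The three measures of central tendency, and a tie for the mode, can be sketched with Python's standard `statistics` module. The `grades` list is hypothetical data, chosen so that two values tie for most common.

```python
import statistics

# Hypothetical sample of grades; 3 and 7 each occur twice.
grades = [2, 3, 3, 5, 7, 7, 8]

mean_grade = statistics.mean(grades)      # sum of the values / number of values
median_grade = statistics.median(grades)  # middle value of the sorted sample
modes = statistics.multimode(grades)      # every value tied for most common

# More than one mode means the sample is bimodal or multimodal.
is_multimodal = len(modes) > 1

print(mean_grade, median_grade, modes, is_multimodal)
```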

7
Q

It refers to the data that we have on hand.

A

Samples

8
Q

It refers to all the data that we can collect.

A

Population

9
Q

Which function is used to estimate the distribution of a variable for the full population?

A

Probability Density Function (PDF)

10
Q

What type of distribution has the mean and mode at the center and symmetric tails?

A

Normal Distribution

11
Q

What type of distribution has the “bell shape” characteristic?

A

Normal Distribution

12
Q

It refers to a tendency to select certain types of values more frequently than others, in a way that misrepresents the underlying population, or ‘real world’.

A

Bias

13
Q

What are the things to remember when examining real world data?

A
  1. Check for missing values and badly recorded data
  2. Consider removal of obvious outliers
  3. Consider what real-world factors might affect your analysis and consider if your dataset size is large enough to handle this
  4. Check for biased raw data and consider your options to fix this, if found
14
Q

It is a value that lies significantly outside the range of the rest of the distribution.

A

Outlier

15
Q

Which type of distribution has the mass of the data on the left side, creating a long tail to the right because values at the extreme high end pull the mean to the right?

A

Right skewed

16
Q

How do you measure variability (variance) in the data?

A
  1. Range
  2. Variance
  3. Standard Deviation
17
Q

This refers to the difference between the maximum and minimum. There’s no built-in function for this, but it’s easy to calculate using the min and max functions.

A

Range

18
Q

This refers to the average of the squared differences from the mean. You can use the built-in var function to find this.

A

Variance

19
Q

This refers to the square root of the variance. You can use the built-in std function to find this.

A

Standard Deviation
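
A minimal sketch of range, variance, and standard deviation using the standard `statistics` module instead of the `min`, `max`, `var`, and `std` functions the cards mention (assumed here to be the pandas ones). The `values` list is hypothetical. Note that pandas `var` and `std` divide by n - 1 by default (sample statistics), which corresponds to `statistics.variance` below rather than the plain average of squared differences.

```python
import statistics

# Hypothetical sample whose mean is exactly 5.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: difference between the maximum and the minimum.
value_range = max(values) - min(values)

# Population variance: average of the squared differences from the mean.
population_variance = statistics.pvariance(values)

# Sample variance: same sum of squares, divided by n - 1 instead of n.
sample_variance = statistics.variance(values)

# Standard deviation: square root of the (population) variance.
population_std = statistics.pstdev(values)

print(value_range, population_variance, sample_variance, population_std)
```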

20
Q

It is a built-in method of the DataFrame object that returns the main descriptive statistics for all numeric columns.

A

df.describe()
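
A short sketch of `describe()` on a small pandas DataFrame. The data and the column names (`Name`, `StudyHours`, `Grade`) are hypothetical and only illustrate that non-numeric columns are left out by default.

```python
import pandas as pd

# Hypothetical data; column names are illustrative only.
df = pd.DataFrame({
    "Name": ["Dan", "Joann", "Pedro"],
    "StudyHours": [10.0, 11.5, 9.0],
    "Grade": [50, 50, 47],
})

# Returns count, mean, std, min, quartiles, and max for each numeric column.
summary = df.describe()
print(summary)
```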

21
Q

When comparing numeric variables, how do you deal with numeric data in different scales?

A

Normalize the data

22
Q

It is a technique that distributes the values proportionally on a scale of 0 to 1.

A

MinMax scaling
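
As a rough illustration, MinMax scaling can be written in a few lines of plain Python. The helper name `min_max_scale` and the handling of a constant column are assumptions made for this sketch, not part of any library.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lowest, highest = min(values), max(values)
    span = highest - lowest
    if span == 0:
        # Assumption: a constant column has no spread, so map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lowest) / span for v in values]


# Hypothetical values on an arbitrary scale.
scaled = min_max_scale([10, 20, 40])
print(scaled)
```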

23
Q

This indicates the strength of the relationship between variables.

A

Correlation

Values above 0 indicate a positive correlation (high values of one variable tend to coincide with high values of the other), while values below 0 indicate a negative correlation (high values of one variable tend to coincide with low values of the other).
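
The sign behaviour described above can be checked with a small sketch. It assumes the Pearson correlation coefficient, which the card does not name explicitly; `pearson_r` and the sample lists are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance_sum / (spread_x * spread_y)


hours = [1, 2, 3, 4, 5]
positive = pearson_r(hours, [2, 4, 6, 8, 10])  # high values go with high values
negative = pearson_r(hours, [10, 8, 6, 4, 2])  # high values go with low values
print(positive, negative)
```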

24
Q

What do you use to visualize the correlation between two numeric variables?

A
  1. Scatter plot
25
Q

It is added to a scatter plot that shows the general trend in the data.

A

Regression line (line of best fit)

26
Q

What is the slope-intercept form of a linear equation?

A

y = mx + b

Where:

  • y and x are the coordinate variables
  • m is the slope of the line
  • b is the y-intercept (where the line crosses the y-axis)
27
Q

It is the line that gives us the lowest value for the sum of the squared errors.

A

Least Squares Regression
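
A minimal sketch of least squares for a single feature, using the closed-form solution for the slope m and intercept b in y = mx + b. The function names and the data are hypothetical; the last lines compare the fitted line with another candidate line to show that its sum of squared errors is lower.

```python
def least_squares_line(xs, ys):
    """Return (m, b) for the line y = m*x + b that minimizes squared error."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - m * mean_x
    return m, b


def sum_squared_errors(xs, ys, m, b):
    """Sum of squared differences between actual y and the line's prediction."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


# Hypothetical sample that does not lie exactly on a line.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

m, b = least_squares_line(xs, ys)
best_sse = sum_squared_errors(xs, ys, m, b)
other_sse = sum_squared_errors(xs, ys, 2.0, 0.0)  # an arbitrary alternative line
print(m, b, best_sse, other_sse)
```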

28
Q

This returns (among other things) the coefficients you need for the slope equation: slope (m) and intercept (b) based on a given pair of variable samples you want to compare.

A

linregress function (scipy.stats.linregress)
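
Assuming the card refers to `scipy.stats.linregress`, a short sketch of reading the slope and intercept from its result looks like this. The `study_hours` and `grades` lists are hypothetical.

```python
from scipy import stats

# Hypothetical paired samples that lie exactly on y = 2x + 1.
study_hours = [1, 2, 3, 4]
grades = [3, 5, 7, 9]

result = stats.linregress(study_hours, grades)
m, b = result.slope, result.intercept  # coefficients for y = mx + b

# Use the fitted line to predict the label for a new feature value.
predicted = m * 5 + b
print(m, b, predicted)
```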

29
Q

It is the process of taking a set of sample data that includes one or more features (in this case, the number of hours studied) and a known label value (in this case, the grade achieved), and using the sample data to derive a function that calculates predicted label values for any given set of features.

A

Machine Learning

30
Q

It works by establishing a relationship between variables in the data that represents characteristics known as the features of the thing being observed and the variable that we’re trying to predict known as the label.

A

Regression

31
Q

It is an example of a supervised machine learning technique in which you train a model to predict a numeric label based on an item’s features.

A

Regression

32
Q

It is the difference between a predicted label value and the actual label value as a measure of error.

A

Residuals

33
Q

What are the kinds of Linear Regression algorithms?

A
  1. Least Squares
  2. Lasso
  3. Ridge
34
Q

What are the kinds of Regression algorithms?

A
  1. Linear Regression
  2. Tree-based
  3. Ensemble
35
Q

These are algorithms that build a decision tree to reach a prediction.

A

Tree-based Regression Algorithm

36
Q

These are algorithms that combine the outputs of multiple base algorithms to improve generalizability.

A

Ensemble Algorithm

37
Q

This algorithm works by combining multiple base estimators to produce an optimal model, either by applying an aggregate function to a collection of base models (sometimes referred to as bagging) or by building a sequence of models that build on one another to improve predictive performance (referred to as boosting).

A

Ensemble Algorithm

38
Q

This algorithm works by using a tree-based approach in which the features in the dataset are examined in a series of evaluations, each of which results in a branch in a decision tree based on the feature value. At the end of each series of branches are leaf nodes with the predicted label value based on the feature values.

A

Decision Tree Algorithm

39
Q

This optimization method works by applying an aggregate function to a collection of base models.

A

Bagging

40
Q

This optimization method works by building a sequence of models that build on one another to improve predictive performance.

A

Boosting

41
Q

This boosting algorithm is similar to a Random Forest algorithm which builds multiple trees; but instead of building them all independently and taking the average result, each tree is built on the outputs of the previous one in an attempt to incrementally reduce the loss (error) in the model.

A

Gradient Boosting

42
Q

This bagging algorithm applies an averaging function to multiple Decision Tree models for a better overall model.

A

Random Forest
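
To contrast bagging and boosting in practice, here is a hedged sketch with scikit-learn's `RandomForestRegressor` and `GradientBoostingRegressor`. The synthetic data, the `n_estimators=50` setting, and the variable names are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Synthetic data (an assumption): one feature, label is roughly 3 * feature.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 1, size=200)

# Bagging: many independent trees whose predictions are averaged.
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Boosting: trees built in sequence, each one reducing the remaining loss.
boosted = GradientBoostingRegressor(n_estimators=50, random_state=0).fit(X, y)

X_new = np.array([[2.0], [8.0]])
forest_pred = forest.predict(X_new)
boosted_pred = boosted.predict(X_new)
print(forest_pred, boosted_pred)
```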

43
Q

This refers to changes you make to your data before it’s passed to the model.

A

Preprocessing

44
Q

This refers to the values that you specify to affect the behavior of a training algorithm.

A

Hyperparameters

45
Q

This algorithm is an ensemble that combines multiple decision trees to create an overall predictive model.

A

Gradient Boosting Regressor

46
Q

Provide preprocessing transformations to get your data ready for modeling.

A
  1. Scaling numeric features
  2. Encoding categorical variables
47
Q

Provide techniques to encode categorical variables.

A
  1. Ordinal encoding
  2. One-hot encoding
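
The two encodings can be sketched in plain Python. The `sizes` values and the declared `size_order` are hypothetical; ordinal encoding assumes the categories have a meaningful order, while one-hot encoding does not.

```python
# Hypothetical categorical feature.
sizes = ["small", "large", "medium", "small"]

# Assumed ordering of the categories, declared explicitly.
size_order = ["small", "medium", "large"]

# Ordinal encoding: replace each category with its position in the order.
ordinal = [size_order.index(s) for s in sizes]

# One-hot encoding: one 0/1 column per category, with a single 1 per row.
one_hot = [[1 if s == category else 0 for category in size_order]
           for s in sizes]

print(ordinal)
print(one_hot)
```
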
48
Q

How do you save a model?

A

joblib.dump(model, filename)

49
Q

How do you load a model?

A

model = joblib.load(filename)

50
Q

How do you use a model to generate predictions?

A

model.predict

51
Q

You have created a model object using the scikit-learn LinearRegression class. What should you do to train the model?

A

Call the fit() method of the model object, specifying the training feature and label arrays
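
Putting the last few cards together, a minimal sketch of training, saving, loading, and predicting might look like the following. The training arrays and the file name `model.pkl` are hypothetical, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data lying exactly on y = 2x + 1.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])  # features (2-D array)
y_train = np.array([3.0, 5.0, 7.0, 9.0])          # labels

# Train the model by calling fit() with the feature and label arrays.
model = LinearRegression()
model.fit(X_train, y_train)

with tempfile.TemporaryDirectory() as tmp_dir:
    filename = os.path.join(tmp_dir, "model.pkl")  # hypothetical file name
    joblib.dump(model, filename)                   # save the trained model
    loaded_model = joblib.load(filename)           # load it back

# Generate a prediction for a new feature value.
prediction = loaded_model.predict(np.array([[5.0]]))
print(prediction)
```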

52
Q

It is a measure of how much of the variance the model can explain.

A

R squared metric

53
Q

It works by establishing a relationship between variables in the data that represent characteristics—known as the features—of the thing being observed, and the variable we’re trying to predict—known as the label.

A

Regression

54
Q

This metric for measuring loss in a regression squares the individual residuals, sums the squares, and calculates the mean. Squaring the residuals has the effect of basing the calculation on absolute values (ignoring whether the difference is negative or positive) and giving more weight to larger differences.

A

Mean Squared Error or MSE

55
Q

This metric for measuring loss in a regression is calculated by getting the square root of the MSE. This is to express the loss in the same unit of measurement as the predicted label value itself.

A

Root Mean Squared Error or RMSE

56
Q

This metric for measuring loss in a regression is also known as coefficient of determination.

A

R squared

57
Q

This metric for measuring loss in a regression is the correlation between x and y squared. This produces a value between 0 and 1 that measures the amount of variance that can be explained by the model. Generally, the closer this value is to 1, the better the model predicts.

A

R squared
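
MSE, RMSE, and R squared can be worked through by hand on a tiny example. The `y_true` and `y_pred` lists are hypothetical, and R squared is computed here as 1 minus the ratio of residual to total sum of squares, a common formulation that matches the squared correlation for a simple least squares line.

```python
import math

# Hypothetical actual and predicted label values.
y_true = [3, 5, 7, 9]
y_pred = [2, 5, 8, 9]

# Residuals: actual minus predicted.
residuals = [t - p for t, p in zip(y_true, y_pred)]

# MSE: mean of the squared residuals.
mse = sum(r ** 2 for r in residuals) / len(residuals)

# RMSE: square root of MSE, in the same units as the label.
rmse = math.sqrt(mse)

# R squared: share of the label's variance explained by the predictions.
mean_true = sum(y_true) / len(y_true)
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r_squared = 1 - ss_res / ss_tot

print(mse, rmse, r_squared)
```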

58
Q

It is a container that holds related resources for an Azure solution

A

Resource Group

59
Q

What are the 4 kinds of compute resource?

A
  1. Compute instances
  2. Compute clusters
  3. Inference clusters
  4. Attached compute
60
Q

These are development workstations that data scientists can use to work with data and models.

A

Compute instances

61
Q

These are scalable clusters of virtual machines for on-demand processing of experiment code.

A

Compute clusters

62
Q

These are deployment targets for predictive services that use your trained models.

A

Inference clusters

63
Q

These are links to Azure compute resources, such as Virtual Machines or Azure Databricks clusters.

A

Attached compute

64
Q

It is the variance between predicted and true values that cannot be explained by the model.

A

Residuals

65
Q

It is an example of a supervised machine learning technique in which you train a model to predict a numeric label based on an item’s features.

A

Regression

66
Q

It is a cloud-based platform for building and operating machine learning solutions in Azure.

A

Microsoft Azure Machine Learning

67
Q

It refers to the average absolute difference between predicted values and true values. The lower this value is, the better the model is predicting.

A

Mean Absolute Error (MAE)

68
Q

It refers to the square root of the mean squared difference between predicted and true values. When compared to the MAE (above), a larger difference indicates greater variance in the individual errors (for example, with some errors being very small, while others are large).

A

Root Mean Squared Error (RMSE)

69
Q

It refers to a relative metric between 0 and 1 based on the square of the differences between predicted and true values. The closer to 0 this metric is, the better the model is performing. Because this metric is relative, it can be used to compare models where the labels are in different units.

A

Relative Squared Error (RSE)

70
Q

It refers to a relative metric between 0 and 1 based on the absolute differences between predicted and true values. The closer to 0 this metric is, the better the model is performing. Like RSE, this metric can be used to compare models where the labels are in different units.

A

Relative Absolute Error (RAE)

71
Q

It summarizes how much of the variance between predicted and true values is explained by the model. The closer to 1 this value is, the better the model is performing.

A

Coefficient of Determination
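
The remaining metrics can be sketched the same way. The formulas below are the common textbook definitions (an assumption, since the cards describe the metrics only in words), and `y_true` and `y_pred` are hypothetical.

```python
# Hypothetical actual and predicted label values.
y_true = [3, 5, 7, 9]
y_pred = [2, 5, 8, 9]

n = len(y_true)
mean_true = sum(y_true) / n
errors = [t - p for t, p in zip(y_true, y_pred)]

# MAE: average of the absolute errors.
mae = sum(abs(e) for e in errors) / n

# RSE: squared error relative to the squared spread of the true values.
rse = sum(e ** 2 for e in errors) / sum((t - mean_true) ** 2 for t in y_true)

# RAE: absolute error relative to the absolute spread of the true values.
rae = sum(abs(e) for e in errors) / sum(abs(t - mean_true) for t in y_true)

# Coefficient of determination: 1 - RSE under these definitions.
coefficient_of_determination = 1 - rse

print(mae, rse, rae, coefficient_of_determination)
```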