Analysis Basics Flashcards by Z P

What do you use to visualize the distribution or spread of a variable?

Histogram
Box plot

What do you do to understand the distribution?

Examine the “measures of central tendency”. These describe the “middle” of the data using the mean, median, and mode.

A simple average based on adding together all of the values in the sample set and then dividing the total by the number of samples.

Mean

The middle value of the sample when all of the values are sorted in order.

Median

The most commonly occurring value in the sample set

Mode

This refers to a tie for the most common value.

Bimodal or Multimodal
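
The three measures of central tendency can be sketched with Python's standard `statistics` module. The `grades` list is a hypothetical sample invented for this example; it has two values tied for most common, so it is bimodal.

```python
import statistics

# Hypothetical sample of grades, invented for this example.
grades = [2, 3, 3, 5, 7, 7, 10]

mean_grade = statistics.mean(grades)      # sum of the values / number of values
median_grade = statistics.median(grades)  # middle value of the sorted sample
modes = statistics.multimode(grades)      # every value tied for most common

print(mean_grade, median_grade, modes)
```

Because `multimode` returns a list, a tie for the most common value shows up as more than one entry.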

It refers to the data that we have on hand.

Samples

It refers to all the data that we can collect.

Population

Which function is used to estimate the distribution of a variable for the full population?

Probability Density Function (PDF)

What type of distribution has the mean and mode at the center and symmetric tails?

Normal Distribution

What type of distribution has the “bell shape” characteristic?

Normal Distribution
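
As a rough illustration of the bell shape, the density of a normal distribution can be written out directly. This is a minimal sketch: `normal_pdf` is a hypothetical helper name, and the standard library's `statistics.NormalDist` is used only as a cross-check.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


peak = normal_pdf(0.0)                          # highest point, at the mean
left, right = normal_pdf(-1.5), normal_pdf(1.5) # equal distances either side of the mean
reference = NormalDist(mu=0.0, sigma=1.0).pdf(0.0)

print(peak, left, right)
```

The density is highest at the mean and takes the same value at equal distances on either side, which is what gives the curve its symmetric tails.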

It refers to a tendency to select certain types of values more frequently than others, in a way that misrepresents the underlying population, or ‘real world’.

Bias

What are the things to remember when examining real world data?

Check for missing values and badly recorded data
Consider removal of obvious outliers
Consider what real-world factors might affect your analysis and consider if your dataset size is large enough to handle this
Check for biased raw data and consider your options to fix this, if found

It is a value that lies significantly outside the range of the rest of the distribution.

Outlier

Which type of distribution has the mass of the data on the left side, creating a long tail to the right because values at the extreme high end pull the mean to the right?

Right skewed
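
The effect of a high outlier on the mean can be seen in a small sketch. The `salaries` values are hypothetical, invented so that one extreme value sits far to the right of the rest.

```python
import statistics

# Hypothetical salaries (in thousands); the last value is an obvious high outlier.
salaries = [30, 32, 35, 35, 38, 40, 42, 200]

mean_salary = statistics.mean(salaries)      # pulled toward the outlier
median_salary = statistics.median(salaries)  # stays with the bulk of the data

print(mean_salary, median_salary)
```

In a right-skewed sample like this one, the mean ends up above the median.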

How do you measure variability (variance) in the data?

Range
Variance
Standard Deviation

This refers to the difference between the maximum and minimum. There’s no built-in function for this, but it’s easy to calculate using the min and max functions.

Range

This refers to the average of the squared differences from the mean. You can use the built-in var function to find this.

Variance

This refers to the square root of the variance. You can use the built-in std function to find this.

Standard Deviation
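
The three measures of variability can be sketched with the standard library. The `values` list is hypothetical. One detail worth knowing: the "average of the squared differences" on the card is the population variance, while pandas' `var` and `std` default to the sample version (dividing by n - 1), so both are shown here.

```python
import statistics

# Hypothetical sample, invented for this example.
values = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(values) - min(values)             # no dedicated function needed
population_variance = statistics.pvariance(values)  # divides by n
population_std = statistics.pstdev(values)          # square root of the above
sample_variance = statistics.variance(values)       # divides by n - 1

print(value_range, population_variance, population_std, sample_variance)
```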

It is a built-in method of the DataFrame object that returns the main descriptive statistics for all numeric columns.

df.describe()
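
A minimal sketch of `describe()` on a small DataFrame, assuming pandas is installed. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical student data; "Name" is non-numeric and is skipped by default.
df = pd.DataFrame(
    {
        "Name": ["Dan", "Joann", "Pedro", "Rosie"],
        "StudyHours": [10.0, 11.5, 9.0, 16.0],
        "Grade": [50, 50, 47, 97],
    }
)

summary = df.describe()  # count, mean, std, min, quartiles, and max per numeric column
print(summary)
```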

When comparing numeric variables, how do you deal with numeric data in different scales?

Normalize the data

It is a technique that distributes the values proportionally on a scale of 0 to 1.

MinMax scaling
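
MinMax scaling maps each value to (value - min) / (max - min). The sketch below uses a hypothetical helper, `min_max_scale`; returning zeros when every value is identical is an assumption made here to avoid dividing by zero.

```python
def min_max_scale(values):
    """Rescale values proportionally onto the range 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no spread, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([10, 20, 25, 40])
print(scaled)
```

The smallest value always becomes 0, the largest becomes 1, and the rest keep their relative spacing.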

This indicates the strength of the relationship between variables.

Correlation

Values above 0 indicate a positive correlation (high values of one variable tend to coincide with high values of the other), while values below 0 indicate a negative correlation (high values of one variable tend to coincide with low values of the other).
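
One common correlation statistic is the Pearson coefficient, which ranges from -1 to 1. The sketch below assumes that is the measure intended; `pearson_r` is a hypothetical helper and the data is invented.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(spread_x * spread_y)


hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # high with high: positive correlation
falling = [10, 8, 6, 4, 2]   # high with low: negative correlation

r_positive = pearson_r(hours, rising)
r_negative = pearson_r(hours, falling)
print(r_positive, r_negative)
```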

What do you use to visualize the correlation between two numeric variables?

Scatter plot

It is a line added to a scatter plot to show the general trend in the data.

Regression line (line of best fit)

What is the slope-intercept form of a linear equation?

y = mx + b
Where:
- y and x are the coordinate variables
- m is the slope of the line
- b is the y-intercept (where the line crosses the y-axis)

It is the line that gives us the lowest value for the sum of the squared errors

Least Squares Regression

This returns (among other things) the coefficients you need for the slope equation: slope (m) and intercept (b) based on a given pair of variable samples you want to compare.

linregress method
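
The card most likely refers to the `linregress` function in `scipy.stats`. To show what it computes without depending on SciPy, here is a minimal sketch of the closed-form least squares slope and intercept. The `hours` and `grades` values are hypothetical.

```python
# Hypothetical study hours (x) and grades (y).
hours = [1, 2, 3, 4, 5]
grades = [52, 58, 61, 67, 72]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(grades) / n

# Slope m: co-variation of x and y divided by the variation of x.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, grades)) / sum(
    (x - mean_x) ** 2 for x in hours
)
# Intercept b: the fitted line passes through the point (mean_x, mean_y).
intercept = mean_y - slope * mean_x

# Apply y = mx + b to an unseen value of x.
predicted_grade = slope * 6 + intercept
print(slope, intercept, predicted_grade)
```

These are the values of m and b that minimize the sum of the squared errors for this sample.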

It is the process of taking a set of sample data that includes one or more features (in this case, the number of hours studied) and a known label value (in this case, the grade achieved) and using the sample data to derive a function that calculates predicted label values for any given set of features.

Machine Learning

It works by establishing a relationship between variables in the data that represents characteristics known as the features of the thing being observed and the variable that we’re trying to predict known as the label.

Regression

It is an example of a supervised machine learning technique in which you train a model to predict a numeric label based on an item's features.

Regression

It is the difference between a predicted label value and the actual label value as a measure of error.

Residuals

What are the kinds of Linear Regression algorithms?

1. Least Squares 2. Lasso 3. Ridge

What are the kinds of Regression algorithms?

1. Linear Regression 2. Tree-based 3. Ensemble

These are algorithms that build a decision tree to reach a prediction.

Tree-based Regression Algorithm

These are algorithms that combine the outputs of multiple base algorithms to improve generalizability.

Ensemble Algorithm

This algorithm works by combining multiple base estimators to produce an optimal model, either by applying an aggregate function to a collection of base models (sometimes referred to as bagging) or by building a sequence of models that build on one another to improve predictive performance (referred to as boosting).

Ensemble Algorithm

This algorithm works by using a tree-based approach in which the features in the dataset are examined in a series of evaluations, each of which results in a branch in a decision tree based on the feature value. At the end of each series of branches are leaf-nodes with the predicted label value based on the feature values.

Decision Tree Algorithm

This optimization method works by applying an aggregate function to a collection of base models.

Bagging

This optimization method works by building a sequence of models that build on one another to improve predictive performance

Boosting

This boosting algorithm is similar to a Random Forest algorithm which builds multiple trees; but instead of building them all independently and taking the average result, each tree is built on the outputs of the previous one in an attempt to incrementally reduce the loss (error) in the model.

Gradient Boosting

This bagging algorithm applies an averaging function to multiple Decision Tree models for a better overall model.

Random Forest
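
A bagging ensemble and a boosting ensemble can be compared side by side in scikit-learn. This is a sketch assuming scikit-learn and NumPy are installed; the synthetic data and the choice of 50 estimators are arbitrary assumptions, not recommended settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Synthetic data: a linear signal in two features plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=200)

# Bagging: independent trees whose predictions are averaged.
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Boosting: each tree is fit to reduce the error left by the previous ones.
boosted = GradientBoostingRegressor(n_estimators=50, random_state=0).fit(X, y)

# R squared on the training data, which is optimistic compared with held-out data.
forest_r2 = forest.score(X, y)
boosted_r2 = boosted.score(X, y)
print(forest_r2, boosted_r2)
```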

This refers to changes you make to your data before it's passed to the model.

Preprocessing

This refers to the values that you specify to affect the behavior of a training algorithm.

Hyperparameters

This algorithm is an ensemble that combines multiple decision trees to create an overall predictive model.

Gradient Boosting Regressor

Provide preprocessing transformations to get your data ready for modeling.

1. Scaling numeric features 2. Encoding categorical variables

Provide techniques to encode categorical variables.

1. Ordinal encoding 2. One-hot encoding
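
Both encodings can be sketched in plain Python. The `colors` data is hypothetical, and assigning ordinal codes in alphabetical order is an assumption made here; in practice the order should reflect a real ranking, if one exists.

```python
# Hypothetical categorical feature.
colors = ["red", "green", "blue", "green"]

# Fix an order for the categories (alphabetical, purely for illustration).
categories = sorted(set(colors))
index_of = {category: i for i, category in enumerate(categories)}

# Ordinal encoding: one integer per category.
ordinal = [index_of[c] for c in colors]

# One-hot encoding: one 0/1 column per category, with a single 1 per row.
one_hot = [
    [1 if index_of[c] == i else 0 for i in range(len(categories))]
    for c in colors
]

print(categories, ordinal, one_hot)
```

Ordinal encoding implies an ordering between categories, while one-hot encoding does not, which is the usual reason for choosing between them.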

How do you save a model?

joblib.dump(model, filename)

How do you load a model?

model = joblib.load(filename)

How do you use a model to generate predictions?

model.predict

You have created a model object using the scikit-learn LinearRegression class. What should you do to train the model?

Call the fit() method of the model object, specifying the training feature and label arrays
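
The train, save, load, and predict steps fit together as in the sketch below, assuming scikit-learn, joblib, and NumPy are installed. The training data and the file name `model.pkl` are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data following y = 2x + 1 exactly.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([3.0, 5.0, 7.0, 9.0])

# Train: fit() takes the feature array and the label array.
model = LinearRegression()
model.fit(X_train, y_train)

# Save and reload the trained model through a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    filename = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(model, filename)
    loaded_model = joblib.load(filename)

# Predict: features go in as a 2D array, one row per item.
prediction = loaded_model.predict(np.array([[5.0]]))
print(prediction)
```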

It is a measure of how much of the variance the model can explain.

R squared metric

It works by establishing a relationship between variables in the data that represent characteristics—known as the features—of the thing being observed, and the variable we're trying to predict—known as the label.

Regression

This metric for measuring loss in a regression squares the individual residuals, sums the squares, and calculates the mean. Squaring the residuals has the effect of basing the calculation on absolute values (ignoring whether the difference is negative or positive) and giving more weight to larger differences.

Mean Squared Error or MSE

This metric for measuring loss in a regression is calculated by getting the square root of the MSE. This is to express the loss in the same unit of measurement as the predicted label value itself.

Root Mean Squared Error or RMSE

This metric for measuring loss in a regression is also known as coefficient of determination.

R squared

This metric for measuring loss in a regression is the square of the correlation between x and y. This produces a value between 0 and 1 that measures the amount of variance that can be explained by the model. Generally, the closer this value is to 1, the better the model predicts.

R squared
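
MSE, RMSE, and R squared can be computed by hand from a set of predictions. The values below are hypothetical. R squared is computed here with the general form 1 - SS_res / SS_tot, which matches the squared correlation for a simple least squares line.

```python
import math

# Hypothetical true labels and model predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 10.0]

residuals = [t - p for t, p in zip(y_true, y_pred)]

# MSE: mean of the squared residuals.
mse = sum(r ** 2 for r in residuals) / len(residuals)
# RMSE: square root of MSE, in the same units as the label.
rmse = math.sqrt(mse)

# R squared: 1 minus (unexplained variation / total variation).
mean_true = sum(y_true) / len(y_true)
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r_squared = 1 - ss_res / ss_tot

print(mse, rmse, r_squared)
```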

It is a container that holds related resources for an Azure solution

Resource Group

What are the 4 kinds of compute resource?

1. Compute instances 2. Compute clusters 3. Inference clusters 4. Attached compute

These are development workstations that data scientists can use to work with data and models.

Compute instances

These are scalable clusters of virtual machines for on-demand processing of experiment code.

Compute clusters

These are deployment targets for predictive services that use your trained models.

Inference clusters

These are links to Azure compute resources, such as Virtual Machines or Azure Databricks clusters.

Attached compute

It is the variance between predicted and true values that cannot be explained by the model.

Residuals

It is a cloud-based platform for building and operating machine learning solutions in Azure.

Microsoft Azure Machine Learning

It refers to the average absolute difference between predicted values and true values. The lower this value is, the better the model is predicting.

Mean Absolute Error (MAE)

It refers to the square root of the mean squared difference between predicted and true values. When compared to the MAE (above), a larger difference indicates greater variance in the individual errors (for example, with some errors being very small, while others are large).

Root Mean Squared Error (RMSE)

It refers to a relative metric between 0 and 1 based on the square of the differences between predicted and true values. The closer to 0 this metric is, the better the model is performing. Because this metric is relative, it can be used to compare models where the labels are in different units.

Relative Squared Error (RSE)

It refers to a relative metric between 0 and 1 based on the absolute differences between predicted and true values. The closer to 0 this metric is, the better the model is performing. Like RSE, this metric can be used to compare models where the labels are in different units.

Relative Absolute Error (RAE)

It summarizes how much of the variance between predicted and true values is explained by the model. The closer to 1 this value is, the better the model is performing.

Coefficient of Determination
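
The absolute and relative metrics can be sketched from their usual definitions, which are assumed here: the relative metrics divide the model's error by the error of always predicting the mean of the true values. The data is hypothetical.

```python
import math

# Hypothetical true labels and model predictions.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

n = len(y_true)
mean_true = sum(y_true) / n
errors = [t - p for t, p in zip(y_true, y_pred)]

# Absolute metrics, in the units of the label.
mae = sum(abs(e) for e in errors) / n
rmse = math.sqrt(sum(e ** 2 for e in errors) / n)

# Relative metrics: model error relative to a "predict the mean" baseline.
rse = sum(e ** 2 for e in errors) / sum((t - mean_true) ** 2 for t in y_true)
rae = sum(abs(e) for e in errors) / sum(abs(t - mean_true) for t in y_true)

# Coefficient of determination, using the same sums as RSE.
r_squared = 1 - rse

print(mae, rmse, rse, rae, r_squared)
```

RMSE is never smaller than MAE on the same errors, and the gap between them grows as the individual errors become more uneven.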

Analysis Basics Flashcards

(71 cards)