Class 1-12 Flashcards by Tom Free

What are the aims of numerical summaries of discrete variables?

The aim is to describe the distribution of the variable.
Questions to address: What are the relative frequencies of the different categories? Which categories are common and which are rare?
Since a categorical variable takes a finite number of possible values, the simplest thing to do is tabulate the number of occurrences of each type.
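
As a minimal sketch of such a tabulation, the snippet below counts each category and derives its relative frequency. The function name `frequency_table` and the eye-color sample are made up for illustration.

```python
from collections import Counter

def frequency_table(values):
    """Return {category: (count, relative_frequency)}, most common first."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, n / total) for cat, n in counts.most_common()}

# Hypothetical sample of a categorical variable
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue"]
table = frequency_table(eye_colors)

for category, (count, share) in table.items():
    print(f"{category:<6} count={count} share={share:.2f}")
```

Sorting by count makes the common and the rare categories visible at a glance.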

What are the aims of numerical summaries of continuous variables?

Aim is to summarize the data in terms of its distribution.

* It is common to start with some descriptive statistics to get a feeling for the data.

What is the standard deviation?

• Is a measure of how spread out numbers are; it is the square root of the Variance.
• Variance is the average of the squared differences from the Mean.
a) Calculate the Mean (the simple average of the numbers).
b) Then for each number: subtract the Mean and square the result (the squared difference).
c) Sum up those squared differences and divide by (n-1) to get the sample Variance (dividing by n instead gives the population Variance).
d) Take the square root of the Variance to get the standard deviation.
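
The steps above can be sketched in a few lines of Python. The function name `sample_std` and the data are illustrative; the result is compared against `statistics.stdev` from the standard library, which also divides by n-1.

```python
import math
import statistics

def sample_std(values):
    """Sample standard deviation, following the card's steps."""
    mean = sum(values) / len(values)                      # a) mean
    squared_diffs = [(v - mean) ** 2 for v in values]     # b) squared differences
    variance = sum(squared_diffs) / (len(values) - 1)     # c) divide by n-1
    return math.sqrt(variance)                            # square root of the variance

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical measurements
print(sample_std(data))
print(statistics.stdev(data))     # library result for comparison
```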

What is exploratory data analysis? (EDA)

• is the process of analyzing and visualizing the data to get a better understanding of the data and glean insight from it.

How Does Exploratory Data Analysis Differ from Summary Analysis?

Summary:
A summary analysis is a numeric reduction of a historical data set.
It is quite passive and focused on the past.

Exploratory:
Aims to gain insight into the engineering/scientific process behind the data.
It is active and forward-looking.

What is “variation”?

Variation is the tendency of the values of a variable to change from measurement to measurement.
• Measuring any continuous variable twice will usually give two slightly different results.
• Categorical variables can vary if you measure across different subjects (e.g., eye colors of people), or different times (e.g., the energy levels).
• Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualize the distribution of the variable’s values.

What is a “Histogram”?

A histogram is similar to a bar plot. It divides the range of a continuous variable into non-overlapping intervals for the sake of display (= binning) and shows how many observations fall into each interval.
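
Binning can be sketched as follows, assuming equal-width bins and values that are not all identical. The function name `histogram_counts` and the data are illustrative, not taken from any library.

```python
def histogram_counts(values, n_bins):
    """Split [min, max] into n_bins equal-width intervals and count values per bin.

    Assumes the values are not all identical (otherwise the bin width is zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin instead of opening a new one
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts

data = [1, 2, 2, 3, 5, 6, 7, 9, 10]   # hypothetical continuous measurements
edges, counts = histogram_counts(data, n_bins=3)
print(edges, counts)
```

The number of bins is a display choice: too few hide the shape of the distribution, too many show mostly noise.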

What is a “Density Curve”?

The y-axis represents the probability density, scaled such that the area under the curve equals one. The probability of observing a value within a given range is the area under the curve over that range.

What is a “Box Plot”?

Graphical representation of the five-number summary
• Depicts quartiles (i.e., the 25%, 50%, and 75% quantiles), minimum, maximum and outliers (if present).
• Conveys the shape of the data distribution, the presence of extreme values, and the ability to compare with other variables using the same scale
• Excellent tool for screening data, determining thresholds for variables and developing working hypotheses.
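
The five-number summary behind a box plot can be computed with the standard library, as in this sketch. Quartile conventions differ between tools; the `"inclusive"` method of `statistics.quantiles` used here is one choice among several, and the function name `five_number_summary` is illustrative.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # hypothetical sample
summary = five_number_summary(data)
print(summary)
```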

What is a Normal Distribution and Why Should You Care?

Many statistical methods are based on the properties of a normal distribution.
Applying certain methods to data that are not normally distributed can give misleading or incorrect results.
Most methods that assume normality are robust enough for all data except data that are very far from normal.

What are attributes of the “Gaussian Distribution”?

• Has the following properties:

Gaussian distributions are symmetric around their mean.
The mean, median, and mode of a Gaussian distribution are equal.
The area under the curve is equal to 1.0.
Gaussian distributions are denser in the center and less dense in the tails.
Gaussian distributions are defined by two parameters, the mean and the standard deviation.
Approximately 68% of the area under the curve is within one standard deviation of the mean.
Approximately 95% of the area of a Gaussian distribution is within two standard deviations of the mean.
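
The 68% and 95% figures can be checked numerically. For a Gaussian distribution, the area within k standard deviations of the mean equals erf(k / sqrt(2)); the helper name `normal_area_within` is illustrative.

```python
import math

def normal_area_within(k):
    """Area under a Gaussian curve within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2):
    print(f"within {k} standard deviation(s): {normal_area_within(k):.4f}")
```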

What is a “Scatterplot”?

For two continuous variables, the most common visualization technique is the scatterplot, which simply maps each variable to an x- or y-axis coordinate, so that every observation becomes one point.

When can we make use of visualization tools?

visual exploration is the first thing when dealing with a new task
when analyzing models’ performance
for sharing insights & reporting results

What is the iterative process of EDA?

generate questions about the data
search for answers by visualizing, transforming, and modeling the data
use new knowledge to ask better or new questions

Define “Data Science”

• deals with large volumes of complex data from multiple sources
• aims to develop methods, tools, or services capable of
a. ingesting such data
b. generating semiautomated decision-support systems

What is “Descriptive Analytics”?

goal: understand the past and present
tools: summary statistics, correlations, visualizations

What is “Predictive Analytics”?

goal: detect patterns in the historic data to predict what will happen
tools: statistical and machine learning methods

What is “Prescriptive Analytics”?

goal: extend predictive analytics, i.e., data is used to determine (prescribe) the best course of action
tools: optimization, heuristic search

Goal of a model

The goal of a model is to provide a simple, low-dimensional summary of a dataset. Ideally it:
• captures true “signals” i.e., patterns generated by the phenomenon of interest
• ignores “noise” i.e., random variation that we are not interested in

What are supervised models?

generate predictions by approximating the observable relationship between the data input and output
• use labeled data, i.e., we have prior knowledge of the values of our target variable
• example: regression

What are unsupervised models?

a.k.a. “data discovery” models
• do not have labeled outputs
• help to discover interesting relationships within the data, i.e., infer the natural structure present within a set of data points
• example: clustering

Predictive tasks/problems:

classification - assignment of an instance to one of the categories based on its features
regression - prediction of a numerical response variable based on other features

Descriptive tasks/problems:

clustering - identifying partitions of observations based on the features of these observations so that the members within the groups are more similar to each other than those in the other groups
anomaly detection - search for observations that are “greatly dissimilar” to the rest of the sample or to some group of instances

What is linear regression?

• represents a method for the regression task/problem (prediction of a numeric outcome)
• allows modeling an output/response variable y as a linear additive function of input variables x1, …, xn: y = β0 + β1x1 + β2x2 + … + βnxn
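
As a sketch of how the coefficients can be estimated, the snippet below fits the single-predictor special case (y = β0 + β1x1) by ordinary least squares and evaluates the general linear formula for given coefficients. The function names `fit_simple_ols` and `predict` and the data are illustrative; least squares is assumed as the estimation method because the card does not name one.

```python
def fit_simple_ols(xs, ys):
    """Least-squares estimates (beta0, beta1) for y = beta0 + beta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1

def predict(betas, features):
    """Evaluate y = beta0 + beta1*x1 + ... + betan*xn."""
    beta0, *slopes = betas
    return beta0 + sum(b * x for b, x in zip(slopes, features))

# Hypothetical data generated from y = 1 + 2x without noise
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
beta0, beta1 = fit_simple_ols(xs, ys)
print(beta0, beta1)
print(predict((beta0, beta1), [4]))
```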

What is logistic regression?

1. Allows modeling the outcome of a binary variable
• the probability that an event of interest (class) happens
2. Uses a link function (transformation) to limit the outcome to values between 0 and 1
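
A minimal sketch of that transformation, assuming the standard logistic (sigmoid) function, which is the inverse of the logit link: it maps any linear score β0 + β1x1 + … to a value between 0 and 1. The function names and coefficients are illustrative.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_probability(betas, features):
    """Probability of the event of interest for one observation."""
    beta0, *slopes = betas
    z = beta0 + sum(b * x for b, x in zip(slopes, features))
    return sigmoid(z)

# Hypothetical coefficients and one observation with two features
print(predict_probability((-1.0, 0.8, 0.3), [2.0, 1.0]))
```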

How is accuracy calculated?

accuracy = correctly_classified_instances / total_number_of_instances

What is a classification error?

classification_error = 1 − accuracy
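
Both quantities can be computed directly from true and predicted labels, as in this short sketch. The label vectors are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Share of instances whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true = [1, 0, 1, 1, 0]   # hypothetical true classes
y_pred = [1, 0, 0, 1, 1]   # hypothetical model predictions

acc = accuracy(y_true, y_pred)
classification_error = 1 - acc
print(acc, classification_error)
```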

What is overfitting?

Overfitting refers to modeling every minor variation in the input. Note: It is far more likely that minor variation is noise than true signal!

Signs of overfitting?

* performs well on the training set (low error)
* produces high error on the test set, i.e., on previously unseen data

Causes of overfitting?

* high dimensionality (a large set of predictors)
* fitting a model to achieve minimal errors on a training set
* use of nonlinear methods (e.g., tree-based methods)

What is bias?

Bias is error due to simplifying assumptions (underfitting), for example when:
• a linear model is applied to non-linear data
• too little data is used, so details are lacking

What is variance?

Variance reflects how much the model changes as the training set changes (overfitting), for example when:
• a model is trained extensively on a noisy dataset
• complex models like decision trees are applied

What is regularization?

Regularization
• is the manifestation of the bias-variance trade-off
• represents alterations to the estimation process
• i.e., an additional objective is introduced by adding a complexity penalty:
1. low training error or good fit (the initial objective)
2. low complexity (the new objective)

What is L1 (Lasso) regression?

LASSO results in sparser models due to different ways of setting upper bounds of coefficients
• L1-norm forces coefficients to take on 0 values
  • can be used for feature selection
  • mitigates the issue of multicollinearity
• L2-norm cannot

What is L2 (Ridge) regression?

Ridge penalizes very large coefficients more strongly
• penalizes the sum of squared coefficients
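
The shrinking effect of the squared penalty can be seen in the simplest possible case: one predictor and no intercept. Minimizing sum((y - βx)^2) + λβ^2 gives β = sum(xy) / (sum(x^2) + λ). The function name `ridge_slope` and the data are illustrative; this is a sketch of the idea, not a general ridge solver.

```python
def ridge_slope(xs, ys, lam):
    """Ridge estimate for y = beta * x (single predictor, no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1, 2, 3]
ys = [2, 4, 6]   # hypothetical data with true slope 2

for lam in (0, 1, 14):
    print(lam, ridge_slope(xs, ys, lam))
```

With λ = 0 the ordinary least-squares slope is recovered; as λ grows, the coefficient is pulled toward 0 but does not become exactly 0.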

What is elastic net regression?

Elastic-Net is a combination of LASSO and Ridge penalties.
• emerged from the critique of LASSO
• has the strengths of LASSO and Ridge regression
• realized in the glmnet package in R
• the penalty can be represented as
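
One common way to write the elastic-net penalty, assuming the parametrization given in the glmnet documentation, is λ · (α · sum(|βj|) + (1 - α)/2 · sum(βj^2)). With α = 1 it reduces to the LASSO penalty, with α = 0 to the Ridge penalty. The Python function below is only an illustration of that formula; the name `elastic_net_penalty` is made up and is not part of glmnet.

```python
def elastic_net_penalty(coefs, lam, alpha):
    """Elastic-net penalty: lam * (alpha * L1 + (1 - alpha) / 2 * squared L2).

    The intercept is assumed to be excluded from coefs.
    """
    l1 = sum(abs(b) for b in coefs)
    l2_squared = sum(b * b for b in coefs)
    return lam * (alpha * l1 + (1 - alpha) / 2 * l2_squared)

coefs = [1.0, -2.0, 0.0]   # hypothetical coefficients
print(elastic_net_penalty(coefs, lam=1.0, alpha=1.0))   # pure LASSO penalty
print(elastic_net_penalty(coefs, lam=1.0, alpha=0.0))   # pure Ridge penalty
print(elastic_net_penalty(coefs, lam=2.0, alpha=0.5))   # mixture of both
```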

What is the TPR (True Positive Rate)?

Also called sensitivity: True Positive / (True Positive + False Negative)

What is the FPR (False Positive Rate)?

Calculated as 1 - specificity, i.e., 1 - True Negative / (True Negative + False Positive), which equals False Positive / (False Positive + True Negative)
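
Both rates follow directly from the four cells of a confusion matrix, as this sketch shows. The counts are made up for illustration.

```python
def tpr_fpr(tp, fn, fp, tn):
    """Return (true positive rate, false positive rate) from confusion-matrix counts."""
    tpr = tp / (tp + fn)   # sensitivity
    fpr = fp / (fp + tn)   # equals 1 - specificity
    return tpr, fpr

# Hypothetical confusion-matrix counts
tpr, fpr = tpr_fpr(tp=40, fn=10, fp=5, tn=45)
specificity = 45 / (45 + 5)
print(tpr, fpr, 1 - specificity)
```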

What is AUC (Area under the ROC curve)?

* AUC summarizes ROC in a single number.
* The higher the AUC, the better the model (closer to the optimum).
* The AUC of a good classifier is well above 0.5.
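
AUC can also be read as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one (ties count as one half). The sketch below computes it that way by comparing all positive-negative pairs, which is fine for small samples; the function name `auc` and the data are illustrative.

```python
def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]              # hypothetical classes
scores = [0.1, 0.4, 0.35, 0.8]     # hypothetical predicted scores
print(auc(y_true, scores))
```

A classifier that scores at random lands near 0.5 on this measure, which is why 0.5 serves as the baseline.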

What is a decision tree?

* is easily interpretable (compared to other methods that can be regarded as “black boxes”)
* can be used for classification and regression
* works by partitioning the data into smaller, more homogeneous groups
* measures the impurity (chaos in the system)
* makes the split that minimizes the impurity in the resulting partitions
* continues the splitting process within the newly created partitions until no further improvement is possible (recursive partitioning)

What is entropy?

Entropy is the degree of chaos in the system. The higher the entropy, the less ordered the system is.
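
A minimal sketch, assuming Shannon entropy in bits, a commonly used impurity measure for classification trees (the card itself does not give a formula). The function name `entropy` and the label lists are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    total = len(labels)
    proportions = [n / total for n in Counter(labels).values()]
    return sum(p * math.log2(1 / p) for p in proportions)

print(entropy(["a", "a", "a", "a"]))   # perfectly ordered node
print(entropy(["a", "a", "b", "b"]))   # evenly mixed node with two classes
print(entropy(["a", "a", "a", "b"]))   # mostly one class
```

A node containing a single class has entropy 0, and an even two-class mix has entropy 1 bit, the maximum for two classes.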

What are stopping criteria (also called Decision Tree hyper parameters)?

1. Larger values of maxdepth will produce larger trees, thus smaller bias but larger variance.
2. Larger values of minsplit mean more data points per node, thus larger bias and smaller variance.
• minsplit: the minimum number of observations that must exist in a node in order for a split to be attempted
3. Larger values of minbucket mean more data points per terminal node, thus larger bias and smaller variance.
• minbucket: the minimum number of observations in any terminal or leaf node

Class 1-12 Flashcards

(42 cards)