Class 1-12 Flashcards
What are the aims of numerical summaries of discrete variables?
- Aim is to describe the distribution of the variable.
- Question to address: What are the relative frequencies of the different categories? Which categories are common and which are rare?
- Since a categorical variable takes a finite number of possible values, the simplest thing to do is tabulate the number of occurrences of each category.
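A minimal sketch of such a tabulation, assuming a small made-up sample of eye colors (the data and variable names are illustrative only):

```python
from collections import Counter

# Hypothetical sample: eye colors observed for ten people.
eye_colors = ["brown", "blue", "brown", "green", "brown",
              "blue", "brown", "hazel", "blue", "brown"]

counts = Counter(eye_colors)   # absolute frequency of each category
n = len(eye_colors)
rel_freq = {category: count / n for category, count in counts.items()}

# Print categories from most common to rarest.
for category, count in counts.most_common():
    print(f"{category:<6} {count:>2}  {rel_freq[category]:.0%}")
```

The relative frequencies sum to 1, so they can be read as the share of observations in each category.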
What are the aims of numerical summaries of continuous variables?
- Aim is to summarize the data in terms of its distribution.
- It is common to start with some descriptive statistics to get a feeling for the data.
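A short sketch of a first descriptive summary, using Python's standard `statistics` module on made-up measurements (the values are an assumption for illustration):

```python
import statistics

# Hypothetical continuous measurements.
measurements = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7]

summary = {
    "n": len(measurements),
    "min": min(measurements),
    "max": max(measurements),
    "mean": statistics.mean(measurements),
    "median": statistics.median(measurements),
}

for name, value in summary.items():
    print(f"{name:<6} {value}")
```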
What is the standard deviation?
• A measure of how spread out the numbers are; it is the square root of the variance.
• The (sample) variance is the sum of the squared differences from the mean, divided by n-1.
a) Calculate the mean (the simple average of the numbers).
b) For each number, subtract the mean and square the result (the squared difference).
c) Sum those squared differences and divide by n-1 to get the variance.
d) Take the square root of the variance to get the standard deviation.
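The steps above can be sketched as follows, assuming a small made-up list of numbers; the result is cross-checked against `statistics.stdev`, which also uses the n-1 divisor:

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data

# a) mean: the simple average
mean = sum(values) / len(values)

# b) squared difference from the mean, for each number
squared_diffs = [(x - mean) ** 2 for x in values]

# c) sum the squared differences and divide by n-1 (sample variance)
variance = sum(squared_diffs) / (len(values) - 1)

# standard deviation: square root of the variance
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)
print(math.isclose(std_dev, statistics.stdev(values)))
```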
What is exploratory data analysis? (EDA)
• is the process of analyzing and visualizing the data to get a better understanding of the data and glean insight from it.
How Does Exploratory Data Analysis Differ from Summary Analysis?
Summary:
A summary analysis is a numeric reduction of a historical data set.
Quite passive and focused on the past.
Exploratory:
Aims to gain insight into the engineering/scientific process behind the data
Active and futuristic.
What is “variation”?
Is the tendency of the values of a variable to change from measurement to measurement.
• Measuring any continuous variable twice will typically give two slightly different results.
• Categorical variables can vary if you measure across different subjects (e.g., eye colors of people), or different times (e.g., the energy levels).
• Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualize the distribution of the variable’s values.
What is a “Histogram”?
A histogram is similar to a bar plot. It divides the range of a continuous variable into non-overlapping intervals (bins) for the sake of display and shows how many values fall into each bin (= binning).
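A minimal sketch of binning, assuming equal-width bins and made-up height data; `histogram_counts` is a hypothetical helper written for this card, not a library function:

```python
def histogram_counts(values, low, high, n_bins):
    """Count how many values fall into each of n_bins equal-width bins on [low, high]."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        if not (low <= v <= high):
            continue                      # ignore values outside the plotted range
        # The last bin is closed on the right so that v == high is counted.
        idx = min(int((v - low) / width), n_bins - 1)
        counts[idx] += 1
    return counts

heights = [150, 152, 158, 160, 161, 165, 167, 170, 171, 179, 180]  # hypothetical
counts = histogram_counts(heights, low=150, high=180, n_bins=3)

# Crude text rendering: one '#' per observation in the bin.
for i, c in enumerate(counts):
    print(f"bin {i}: {'#' * c}")
```

Changing `n_bins` changes the picture, which is why it is worth trying several bin widths when exploring data.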
What is a “Density Curve”?
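A smoothed version of the histogram. The y-axis represents the probability density, scaled so that the area under the curve equals one; the probability of observing a value in a given interval is the area under the curve over that interval.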
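A small numerical check of the "area equals one" property, assuming a normal density with a small standard deviation as the example curve (the choice of curve and the helper `area_under` are assumptions for illustration). It also shows that the height of a density curve can exceed 1 even though the total area stays 1:

```python
from statistics import NormalDist

def area_under(pdf, low, high, steps=10_000):
    """Trapezoidal approximation of the area under pdf between low and high."""
    width = (high - low) / steps
    total = 0.5 * (pdf(low) + pdf(high))
    for i in range(1, steps):
        total += pdf(low + i * width)
    return total * width

narrow = NormalDist(mu=0.0, sigma=0.1)           # hypothetical example density

peak_height = narrow.pdf(0.0)                    # height of the curve at its center
total_area = area_under(narrow.pdf, -1.0, 1.0)   # covers +/- 10 standard deviations

print(peak_height, total_area)
```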
What is a “Box Plot”?
Graphical representation of the five-number summary
• Depicts quartiles (i.e., the 25%, 50%, and 75% quantiles), minimum, maximum and outliers (if present).
• Conveys the shape of the data distribution, the presence of extreme values, and the ability to compare with other variables using the same scale
• Excellent tool for screening data, determining thresholds for variables and developing working hypotheses.
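A sketch of the numbers behind a box plot, computed with the standard library on made-up data. Flagging outliers beyond 1.5 times the interquartile range is a common convention assumed here; the cards do not specify a rule:

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]   # hypothetical data with one extreme value

# Quartiles: the 25%, 50%, and 75% quantiles.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")

five_number_summary = (min(data), q1, median, q3, max(data))

# Assumed outlier rule: more than 1.5 * IQR beyond the quartiles.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(five_number_summary)
print(outliers)
```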
What is a Normal Distribution and Why Should You Care?
- Many statistical methods are based on the properties of a normal distribution.
- Applying certain methods to data that are not normally distributed can give misleading or incorrect results.
- Most methods that assume normality are robust enough for all data except those that deviate strongly from a normal distribution.
What are attributes of the “Gaussian Distribution”?
• Has the following properties
- Gaussian distributions are symmetric around their mean.
- The mean, median, and mode of a Gaussian distribution are equal.
- The area under the curve is equal to 1.0.
- Gaussian distributions are denser in the center and less dense in the tails.
- Gaussian distributions are defined by two parameters, the mean and the standard deviation.
- Approximately 68% of the area under the curve is within one standard deviation of the mean.
- Approximately 95% of the area of a Gaussian distribution is within two standard deviations of the mean.
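The 68% and 95% figures can be checked with the standard library's `NormalDist`; the sketch below uses the standard normal (mean 0, standard deviation 1), but the proportions are the same for any mean and standard deviation:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def area_within(k):
    """Area under the curve within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_one = area_within(1)   # roughly 0.68
within_two = area_within(2)   # roughly 0.95

print(f"{within_one:.4f} {within_two:.4f}")
```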
What is a “Scatterplot”?
For continuous variables, the most common visualization technique is the scatterplot, which simply maps each variable to an x- or y-axis coordinate.
When can we make use of visualization tools?
- visual exploration is the first step when dealing with a new task
- when analyzing models’ performance
- for sharing insights & reporting results
What is the iterative process of EDA?
- generate questions about the data
- search for answers by visualizing, transforming, and modeling the data
- use new knowledge to ask better or new questions
Define “Data Science”
• deals with large volumes of complex data from multiple sources
• aims to develop methods, tools, or services capable of
a. ingesting such data
b. generating semiautomated decision-support systems
What is “Descriptive Analytics”?
- goal: understand the past and present
- tools: summary statistics, correlations, visualizations
What is “Predictive Analytics”?
- goal: detect patterns in the historic data to predict what will happen
- tools: statistical and machine learning methods
What is “Prescriptive Analytics”?
- goal: extend predictive analytics, i.e., data is used to determine (prescribe) the best course of action
- tools: optimization, heuristic search
Goal of a model
The goal of a model is to provide a simple, low-dimensional summary of a dataset. Ideally, it:
• captures true “signals” i.e., patterns generated by the phenomenon of interest
• ignores “noise” i.e., random variation that we are not interested in
What are supervised models?
generate predictions via approximating the observable relationship between the data input and output
• use labeled data, i.e., we have prior knowledge of the values of our target variable
• example: regression
What are unsupervised models?
a.k.a. “data discovery” models
• do not have labeled outputs
• help to discover interesting relationships within the data, i.e., infer the natural structure present within a set of data points
• example: clustering
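A minimal sketch of clustering without labels, using a one-dimensional k-means as one possible method (the cards name clustering only in general; the algorithm choice, the data, and the starting centers are assumptions):

```python
def kmeans_1d(points, centers, iterations=10):
    """Very small 1-D k-means: assign points to the nearest center, then move the centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster (unchanged if empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]          # hypothetical unlabeled data
centers, clusters = kmeans_1d(points, centers=[0.0, 6.0])

print(centers)
print(clusters)
```

No target variable is supplied; the two groups emerge from the distances between the points alone.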
Predictive tasks/problems:
- classification of an instance to one of the categories based on its features
- regression - prediction of a numerical response variable based on other features
Descriptive tasks/problems:
- clustering - identifying partitions of observations based on the features of these observations so that the members within the groups are more similar to each other than those in the other groups
- anomaly detection - search for observations that are “greatly dissimilar” to the rest of the sample or to some group of instances
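One simple way to sketch anomaly detection is a z-score rule: flag observations that lie many standard deviations from the sample mean. The rule, the threshold of 2.0, and the data below are assumptions for illustration, not the only way to define "greatly dissimilar":

```python
import statistics

readings = [10, 11, 9, 10, 12, 11, 10, 50]   # hypothetical sample with one odd value

mean = statistics.mean(readings)
std_dev = statistics.stdev(readings)

THRESHOLD = 2.0   # assumed cut-off, in standard deviations

# z-score: distance from the mean measured in standard deviations.
z_scores = [(x - mean) / std_dev for x in readings]
anomalies = [x for x, z in zip(readings, z_scores) if abs(z) > THRESHOLD]

print(anomalies)
```

Because an extreme value inflates both the mean and the standard deviation, this rule can miss anomalies in small samples; rules based on the median or the quartiles are less affected.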
What is linear regression?
• represents a method for the regression task/problem (prediction of a numeric outcome)
• allows modeling an output/response variable y as a linear additive function of input variables x1, …, xn: y = β0 + β1x1 + β2x2 + … + βnxn
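A minimal sketch for the single-input case, y = β0 + β1x1, fitted by ordinary least squares on made-up data (the least-squares fitting method, the data, and the function names are assumptions; the cards only give the model form):

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for one input variable: returns (beta0, beta1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = s_xy / s_xx
    beta0 = mean_y - beta1 * mean_x   # the fitted line passes through the means
    return beta0, beta1

def predict(beta0, beta1, x):
    """Predicted response for input x under the fitted line."""
    return beta0 + beta1 * x

xs = [1, 2, 3, 4, 5]                  # hypothetical input values
ys = [3.1, 4.9, 7.2, 8.8, 11.0]       # hypothetical responses

beta0, beta1 = fit_simple_linear_regression(xs, ys)
print(beta0, beta1, predict(beta0, beta1, 6))
```

With several input variables the same idea applies, but the coefficients are usually obtained with matrix methods rather than these closed-form sums.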