Exam 1 Review Flashcards by Cherylle Finley

The process of extracting portions of a data set that are relevant to the analysis is called

subsetting
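
A minimal sketch of subsetting in plain Python. The records and field names (`region`, `spend`) are hypothetical and only illustrate the idea of keeping the rows and columns relevant to an analysis.

```python
# Hypothetical customer records; the field names are made up for illustration.
customers = [
    {"name": "Ana", "region": "West", "spend": 120.0},
    {"name": "Ben", "region": "East", "spend": 80.0},
    {"name": "Cal", "region": "West", "spend": 45.0},
    {"name": "Dee", "region": "West", "spend": 200.0},
]

# Keep only the observations (rows) relevant to the analysis ...
west_big_spenders = [
    c for c in customers if c["region"] == "West" and c["spend"] >= 100
]

# ... and only the variables (columns) that are needed.
subset = [{"name": c["name"], "spend": c["spend"]} for c in west_big_spenders]
print(subset)
```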

The methodology of extracting information and knowledge from data to improve a company’s bottom line and enhance the consumer experience

business analytics

How does business analytics benefit companies? (6)

develop better marketing strategies
deepen customer engagement
enhance efficiency in procurement
uncover ways to reduce expense
identify emerging market trends
mitigate risk and fraud

What fields does business analytics encompass?

statistics
computer science
information systems

What questions do the 3 types of analytics techniques ask?

Descriptive: What has happened?
Predictive: What could happen in the future?
Prescriptive: What should we do?

Data that have been organized, analyzed, and processed in a meaningful and purposeful way

Information

Derived from a blend of data, contextual information, experience, and intuition

Knowledge

Data collected by recording a characteristic of many subjects at the same point in time

cross-sectional data

Data collected over several time periods

Time series data

Provide examples of human-generated and machine-generated, structured and unstructured data

What are the 3 characteristics of big data?

volume (immense amount)
velocity (generated at rapid speed)
variety (different types and forms of data)

A characteristic of interest that differs in kind or degree among various observations

variable

What are the 2 broad types of variable divisions?

Categorical (qualitative)
Numerical (quantitative)

What are the 2 types of numerical variables?

provide examples

continuous
ex: weight, time, height, investment return
discrete (countable)
ex: number of points or children

What are the 4 measurement scales?

Provide definitions and examples

nominal (categorical): observations differ only by name or label
ordinal (categorical): observations can be categorized and ranked, but differences between ranks are not meaningful
ex: ratings
interval (numerical): observations can be categorized and ranked, and differences are meaningful, but there is no true zero point
ex: temperatures
ratio (numerical): interval-scale observations with a true zero point, so ratios are meaningful
ex: grades, weight, time, distance

Process of retrieving, cleansing, integrating, transforming, and enriching data to support subsequent data analysis

Data wrangling

What are the objectives of data wrangling? (3)

improve data quality
reduce time and effort required to perform analytics
help reveal true intelligence in the data

What helps us verify whether the data set is complete or has missing values?

counting & sorting

What allows us to review the range of values for each variable?

sorting data

What are 2 common strategies for dealing with missing values?

Provide definitions and when to use them

omission (complete-case analysis): exclude observations with missing values
use when: the number of missing values is small and they are expected to be randomly distributed across observations
imputation: replace missing values with a reasonable substitute
ex: replace with the mean; use when the variable with missing values is deemed important
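
The two strategies can be sketched as follows. The income figures are made up, and `None` is assumed to mark a missing value; mean imputation is only one of several possible substitutes.

```python
from statistics import mean

# Hypothetical incomes (in $1,000s); None marks a missing value.
incomes = [52.0, None, 61.0, 47.0, None, 60.0]

def omit_missing(values):
    """Omission: drop the observations whose value is missing."""
    return [v for v in values if v is not None]

def impute_mean(values):
    """Imputation: replace each missing value with the mean of the observed ones."""
    fill = mean(omit_missing(values))
    return [fill if v is None else v for v in values]

complete_cases = omit_missing(incomes)
imputed = impute_mean(incomes)
print(complete_cases)
print(imputed)
```

Omission shrinks the sample from six observations to four, while imputation keeps all six but pulls the filled values toward the center of the distribution.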

Process of converting data from one format or structure to another

Provide Examples

Data transformation
ex: convert dates into seasons; convert values into natural logarithms; combine height and weight to create BMI
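
A rough sketch of the three example transformations. The season mapping assumes Northern Hemisphere meteorological seasons, and the BMI helper assumes metric units (kg and m); both helper names are illustrative.

```python
import math
from datetime import date

def season(d: date) -> str:
    """Map a date to a season (assumes Northern Hemisphere, meteorological seasons)."""
    if d.month in (12, 1, 2):
        return "Winter"
    if d.month in (3, 4, 5):
        return "Spring"
    if d.month in (6, 7, 8):
        return "Summer"
    return "Fall"

def bmi(weight_kg: float, height_m: float) -> float:
    """Combine two variables into one: BMI = weight / height squared."""
    return weight_kg / height_m ** 2

print(season(date(2024, 7, 4)))      # date -> season
print(math.log(1500.0))              # value -> natural logarithm
print(round(bmi(70.0, 1.75), 1))     # height and weight -> BMI
```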

Process of transforming numerical variables into categorical variables

What are the constraints?

binning

Bins must be consecutive and nonoverlapping
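
A small sketch of binning ages into three groups. The cut points (18 and 65) and the labels are arbitrary choices for illustration; the bins are consecutive and nonoverlapping because each age falls on exactly one side of each cut point.

```python
from bisect import bisect_right

# Assumed cut points: [0, 18), [18, 65), [65, infinity)
edges = [18, 65]
labels = ["Under 18", "18 to 64", "65 and over"]

def bin_age(age: int) -> str:
    """Return the label of the single bin that contains this age."""
    return labels[bisect_right(edges, age)]

ages = [5, 18, 42, 64, 65, 80]
binned = [bin_age(a) for a in ages]
print(binned)
```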

What are 3 common approaches for transforming categorical data?

Explain/provide examples

category reduction: combining categories
ex: Mon-Fri = Weekdays; grouping rare categories into “Other”
dummy variables: AKA indicator or binary variables; each takes on a value of 1 or 0 to describe two categories of a variable (a variable with n categories needs n - 1 dummy variables)
category scores: recode categories as numbers; used when data are ordinal and have natural, ordered categories
ex: recode satisfaction survey responses to numbers
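
The three approaches might look like this in code. The category names, the choice of reference category, and the 1-to-3 scoring are all assumptions made for the example.

```python
# Category reduction: collapse seven days into two categories.
def day_type(day: str) -> str:
    return "Weekend" if day in ("Sat", "Sun") else "Weekday"

# Dummy variables: n categories need n - 1 dummies; the omitted
# category (here, the assumed reference) is coded as all zeros.
def dummies(value: str, categories: list, reference: str) -> dict:
    return {f"is_{c}": int(value == c) for c in categories if c != reference}

# Category scores: ordinal responses recoded as ordered numbers.
scores = {"Dissatisfied": 1, "Neutral": 2, "Satisfied": 3}

days = ["Mon", "Sat", "Wed", "Sun"]
reduced = [day_type(d) for d in days]
east = dummies("East", ["East", "West", "North"], reference="North")
north = dummies("North", ["East", "West", "North"], reference="North")
scored = [scores[r] for r in ["Satisfied", "Neutral", "Dissatisfied"]]
print(reduced, east, north, scored)
```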

In addition to binning, another common approach is to create new variables through ____ transformations

mathematical

What are the 3 common measures of central location?

mean
median
mode

What is the measure of relative position?

Explain how it works

percentile
Approximately p% of observations are less than the pth percentile
Approximately (100 - p)% of observations are greater than the pth percentile

If a variable has outliers, which measure of central location is preferred?

median is preferred over mean
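
A quick illustration with made-up salaries (in $1,000s) of why the median holds up better: one extreme value moves the mean a long way but barely shifts the median.

```python
from statistics import mean, median

salaries = [40, 42, 45, 47, 50]
with_outlier = salaries + [500]   # one extreme observation added

mean_before, median_before = mean(salaries), median(salaries)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print(mean_before, median_before)
print(round(mean_after, 1), median_after)
```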

What type of variable is mode useful for?

categorical variable

What are the 5 measures of dispersion?

Define them

Range: max - min
IQR: Q3 - Q1; range of the middle 50% of observations
Mean absolute deviation (MAD): average of the absolute differences of all observations from the mean
Variance: average of the squared differences from the mean
Standard deviation: square root of the variance (a lower value means observations are closer to the mean)
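
The five measures can be computed with the standard library as sketched below. Two assumptions are worth flagging: `statistics.variance` and `statistics.stdev` use the sample formulas (dividing by n - 1), and `statistics.quantiles` uses its default "exclusive" method, so other software may report slightly different quartiles.

```python
from statistics import mean, quantiles, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up observations
center = mean(data)

data_range = max(data) - min(data)
q1, _, q3 = quantiles(data, n=4)              # quartiles, default "exclusive" method
iqr = q3 - q1
mad = sum(abs(x - center) for x in data) / len(data)
sample_var = variance(data)                   # divides by n - 1
sample_sd = stdev(data)                       # square root of the sample variance

print(data_range, iqr, mad, round(sample_var, 3), round(sample_sd, 3))
```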

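What are the measures of shape? (2)

Define

Skewness coefficient: degree to which a distribution is not symmetric about its mean; a symmetric distribution has skewness = 0
Kurtosis coefficient: whether the tails are heavier or lighter than those of a normal distribution; a normal distribution has kurtosis = 3, so excess kurtosis = kurtosis coefficient - 3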

What are the measures of association? (2)

Define

Covariance: direction of the linear relationship (sensitive to units of measure)
Correlation coefficient: direction and strength of the linear relationship; unit-free, ranges from -1 to 1
Rough guide: 0 (no linear relationship); 0.12 (weak); 0.8 (strong)
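
A hand-rolled sketch using the sample formulas (dividing by n - 1) on made-up data. Rescaling x by 100 shows the point about units: the covariance changes, but the correlation coefficient does not.

```python
from math import sqrt

def sample_cov(x, y):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Correlation coefficient: covariance divided by both standard deviations."""
    return sample_cov(x, y) / sqrt(sample_cov(x, x) * sample_cov(y, y))

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
x_scaled = [100 * v for v in x]   # same variable in different units

cov, r = sample_cov(x, y), correlation(x, y)
cov_scaled, r_scaled = sample_cov(x_scaled, y), correlation(x_scaled, y)
print(cov, round(r, 4), cov_scaled, round(r_scaled, 4))
```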

What does a box plot graphically display?

min
Q1
Q2 (median)
Q3
max

How are the upper and lower fences calculated on a box plot?

lower fence: Q1 - (1.5 x IQR)
upper fence: Q3 + (1.5 x IQR)
Any observation below the lower fence or above the upper fence is an outlier
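
A short sketch of the fence calculation on made-up data. The quartiles come from `statistics.quantiles` with its default "exclusive" method, which is an assumption; other quartile conventions can move the fences slightly.

```python
from statistics import quantiles

data = [10, 12, 13, 14, 15, 16, 18, 40]   # made-up observations

q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Observations outside the fences are flagged as outliers.
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(lower_fence, upper_fence, outliers)
```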

What does the Empirical Rule state?

For a bell-shaped (approximately normal) distribution:
~68% of all observations fall within sample mean +/- 1 sample SD
~95% of all observations fall within sample mean +/- 2 sample SDs
almost all (~99.7%) observations fall within sample mean +/- 3 sample SDs
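
The percentages in the rule come from the normal distribution, which can be checked with the standard library. This sketch uses the theoretical standard normal rather than sample data, so real samples will only match approximately.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k: float) -> float:
    """Probability that a normal observation lies within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.4f}")
```

The three shares are roughly 0.68, 0.95, and 0.997, which is why the third band is usually described as "almost all" rather than exactly 100%.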

The population mean is referred to as a ____ and the sample mean is referred to as a _______.

1. parameter
2. statistic

What is the z-score used for?

Provide example

to find the distance of an observation from the mean in terms of standard deviations
ex: a z-score of 2 means the observation is 2 SDs above the mean

What is standardizing?

When is it commonly used?

converting observations into z-scores
common when dealing with variables measured using different scales
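
A minimal sketch of standardizing two made-up variables measured on very different scales. It assumes the sample standard deviation (n - 1 in the denominator) is the one intended.

```python
from statistics import mean, stdev

def z_score(x: float, center: float, spread: float) -> float:
    """Distance of x from the mean, measured in standard deviations."""
    return (x - center) / spread

def standardize(values):
    """Convert every observation into its z-score."""
    center, spread = mean(values), stdev(values)
    return [z_score(v, center, spread) for v in values]

heights_cm = [160, 170, 175, 180, 190]
incomes_usd = [30_000, 45_000, 52_000, 61_000, 120_000]

z_heights = standardize(heights_cm)
z_incomes = standardize(incomes_usd)
print([round(z, 2) for z in z_heights])
print([round(z, 2) for z in z_incomes])
```

After standardizing, both variables have mean 0 and standard deviation 1, so they can be compared or combined without one dominating because of its units.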

What methods are used to visualize a categorical variable? (2)

frequency distribution
bar chart (graphical representation of a frequency distribution)

What methods are used to visualize a numerical variable? (2)

frequency distribution
histogram (helps show the shape of the distribution, such as skewness)

What methods are used to visualize two categorical variables? (2)

contingency table (frequencies for 2 categorical variables)
stacked column chart
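
A contingency table is just a count of each combination of two categorical values. A rough sketch with made-up survey rows (the variables "gender" and "purchased" are hypothetical):

```python
from collections import Counter

# Each tuple is one observation: (gender, purchased)
observations = [
    ("Female", "Yes"), ("Male", "No"), ("Female", "Yes"),
    ("Male", "Yes"), ("Female", "No"),
]

table = Counter(observations)   # frequency of each (row, column) combination

rows = sorted({g for g, _ in observations})
cols = sorted({p for _, p in observations})
print("        " + "  ".join(f"{c:>4}" for c in cols))
for g in rows:
    print(f"{g:<8}" + "  ".join(f"{table[(g, c)]:>4}" for c in cols))
```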

What data visualization techniques can be used with multiple variables? (3)

Explain

bubble plot (3 numerical variables)
line chart (connects consecutive observations of a numerical variable; can track changes over time)
heat map (can identify combinations of categorical variables that have economic significance)

What method is used to visualize two numerical variables?

scatter plot (shows whether the two variables have a linear relationship; a categorical variable can also be incorporated)

Reminder: Tableau can extract data from many sources, including Excel

When the value of the response variable is uniquely determined by the predictor values

Provide example

deterministic relationship
ex: p = mv (momentum = mass x velocity)

When the value of the response variable is not uniquely determined due to other factors

stochastic relationship

The category for which a dummy variable equals 0 (the omitted category) can also be called? (2)

reference category
benchmark category

What is a measure that summarizes how well the sample regression equation fits the data?

Goodness-of-fit

Instead of $s_e^2$, we generally report the standard deviation of the residual, denoted $s_e$, more commonly referred to as...?

the standard error of the estimate

What is the residual in linear regression?

the difference between the observed and predicted values of the response variable

What are the goodness-of-fit measures? (3)

State ideal preferences

Standard error of the estimate ($s_e$): a smaller $s_e$ is preferred
Coefficient of determination ($R^2$): never decreases as more predictor variables are added to the model; the closer to 1, the better the fit
Adjusted coefficient of determination (adjusted $R^2$): choose the model with the highest adjusted $R^2$ value
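
To make the three measures concrete, here is a sketch that fits a simple linear regression by hand on made-up data and computes each one. The formulas assumed are the usual textbook ones: $s_e = \sqrt{SSE/(n-k-1)}$, $R^2 = 1 - SSE/SST$, and adjusted $R^2 = 1 - (1-R^2)(n-1)/(n-k-1)$, with k predictors.

```python
from math import sqrt

x = [1, 2, 3, 4, 5]   # made-up predictor values
y = [2, 4, 5, 4, 5]   # made-up response values
n, k = len(x), 1      # k = number of predictor variables

x_bar, y_bar = sum(x) / n, sum(y) / n

# Ordinary least squares estimates for y_hat = b0 + b1 * x
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained variation
sst = sum((yi - y_bar) ** 2 for yi in y)                # total sum of squares

se = sqrt(sse / (n - k - 1))                  # standard error of the estimate
r2 = 1 - sse / sst                            # coefficient of determination
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1) # penalizes extra predictors

print(round(se, 4), round(r2, 4), round(adj_r2, 4))
```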

We use analysis of variance (ANOVA) in the context of the linear regression model to derive $R^2$. We denote the total variation in y as $\sum_i (y_i - \bar{y})^2$, which is the numerator in the formula for the variance of y. What is this total variation called?

Total sum of squares

What is a good solution when confronted with multicollinearity?

drop one of the collinear variables
obtain more data, because the sample correlation may get weaker
sometimes, do nothing

The logistic regression model cannot be estimated with standard ordinary least squares (OLS) procedures. Instead, we rely on which method?

Maximum likelihood estimation (MLE)

In the holdout method we partition the data into two independent and mutually exclusive data sets. What are they called?

training set
validation set

Often it is preferable to use the k-fold cross-validation method, where we partition the data into k subsets, and the one that is left out in each iteration is the ____ set.

validation
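
A bare-bones sketch of how the k partitions rotate. The helper name `k_fold_indices` is hypothetical, and it assigns observations to folds in index order; in practice the data are usually shuffled first.

```python
def k_fold_indices(n: int, k: int):
    """Yield (training, validation) index lists, one pair per fold."""
    folds = [list(range(i, n, k)) for i in range(k)]   # k disjoint subsets
    for i in range(k):
        validation = folds[i]                          # the fold left out
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

splits = list(k_fold_indices(n=10, k=5))
for training, validation in splits:
    print("train:", training, "validate:", validation)
```

Each observation lands in the validation set exactly once and in the training set k - 1 times.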

What are the other performance measures for logistic regression?

Define them

accuracy: overall proportion of cases that are classified correctly
sensitivity: proportion of target class cases that are classified correctly
specificity: proportion of nontarget class cases that are classified correctly
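
The three measures follow directly from the counts of correct and incorrect classifications. In this sketch the labels are made up, and 1 is assumed to be the target class.

```python
actual    = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]   # made-up true classes
predicted = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]   # made-up model classifications

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)   # target, classified as target
fn = sum(1 for a, p in pairs if a == 1 and p == 0)   # target, missed
tn = sum(1 for a, p in pairs if a == 0 and p == 0)   # nontarget, classified as nontarget
fp = sum(1 for a, p in pairs if a == 0 and p == 1)   # nontarget, flagged as target

accuracy = (tp + tn) / len(pairs)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(accuracy, sensitivity, round(specificity, 4))
```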

(56 cards)