Exam 1 Review Flashcards

(56 cards)

1
Q

The process of extracting portions of a data set that are relevant to the analysis is called

A

subsetting

2
Q

The methodology of extracting information and knowledge from data to improve a company’s bottom line and enhance the consumer experience

A

business analytics

3
Q

How does business analytics benefit companies? (6)

A
  • develop better marketing strategies
  • deepen customer engagement
  • enhance efficiency in procurement
  • uncover ways to reduce expenses
  • identify emerging market trends
  • mitigate risk and fraud
4
Q

What topics does business analytics encompass?

A
  • statistics
  • computer science
  • information systems
5
Q

What questions do the 3 types of analytics techniques ask?

A
  • Descriptive: What has happened?
  • Predictive: What could happen in the future?
  • Prescriptive: What should we do?
6
Q

Data that have been organized, analyzed, and processed in a meaningful and purposeful way

A

Information

7
Q

Derived from a blend of data, contextual information, experience, and intuition

A

Knowledge

8
Q

Data collected by recording a characteristic of many subjects at the same point in time

A

cross-sectional data

9
Q

Data collected over several time periods

A

Time series data

10
Q

Provide examples of human-generated and machine-generated, structured and unstructured data

A
11
Q

What are the 3 characteristics of big data?

A
  • volume (immense amount)
  • velocity (generated at rapid speed)
  • variety (different types and forms of data)
12
Q

A characteristic of interest that differs in kind or degree among various observations

A

variable

13
Q

What are the 2 broad types of variables?

A
  • Categorical (qualitative)
  • Numerical (quantitative)
14
Q

What are the 2 types of numerical variables?

provide examples

A
  • continuous
    ex: weight, time, height, investment return
  • discrete (countable)
    ex: number of points or children
15
Q

What are the 4 measurement scales?

Provide definitions and examples

A
  • nominal (categorical): observations differ only by name or label
  • ordinal (categorical): observations can be categorized and ranked (but differences between ranks are not meaningful)
    ex: ratings
  • interval (numerical): observations can be categorized and ranked, and differences are meaningful (but there is no true zero point)
    ex: temperatures (°C or °F)
  • ratio (numerical): observations are on an interval scale with a true zero point
    ex: grades, weight, time, distance
16
Q

Process of retrieving, cleansing, integrating, transforming, and enriching data to support subsequent data analysis

A

Data wrangling

17
Q

What are the objectives of data wrangling? (3)

A
  • improve data quality
  • reduce time and effort required to perform analytics
  • help reveal true intelligence in the data
18
Q

What helps us verify whether the data set is complete or has missing values?

A

counting & sorting

19
Q

What allows us to review the range of values for each variable?

A

sorting data

20
Q

What are 2 common strategies for dealing with missing values?

Provide definitions and when to use them

A
  • omission (complete-case analysis): exclude observations with missing values
    use when the number of missing values is small and they are expected to be randomly distributed across observations
  • imputation: replace missing values with a reasonable substitute
    ex: replace with the mean; use when the variable with missing values is deemed important
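
A minimal Python sketch of the two strategies, assuming missing values are stored as `None`. The sample values and variable names are made up for illustration and are not from the course material.

```python
from statistics import mean

# Hypothetical sample: None marks a missing value.
incomes = [52.0, None, 61.0, 48.0, None, 59.0]

# Omission (complete-case analysis): drop observations with missing values.
complete_cases = [x for x in incomes if x is not None]

# Imputation: replace each missing value with the mean of the observed values.
observed_mean = mean(complete_cases)
imputed = [observed_mean if x is None else x for x in incomes]

print(complete_cases)
print(imputed)
```

Omission shrinks the sample from six observations to four, while mean imputation keeps all six at the cost of understating the variable's spread.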
21
Q

Process of converting data from one format or structure to another

Provide Examples

A

Data transformation
ex: convert dates into seasons; convert values into natural logarithms; combine height and weight to create BMI

22
Q

Process of transforming numerical into categorical variables

What are the constraints?

A

binning

Bins must be consecutive and nonoverlapping
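
A short Python sketch of binning, assuming made-up age cutoffs and labels. Because each bin starts exactly where the previous one ends, every value falls into exactly one bin.

```python
from bisect import bisect_right

# Hypothetical bins: [0, 18), [18, 40), [40, 65), [65, infinity)
edges = [18, 40, 65]
labels = ["minor", "young adult", "middle-aged", "senior"]

def bin_label(age):
    """Return the label of the single bin that contains age."""
    return labels[bisect_right(edges, age)]

ages = [5, 18, 39, 40, 64, 65, 90]
binned = [bin_label(a) for a in ages]
print(binned)
```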

23
Q

What are 3 common approaches for transforming categorical data?

Explain/provide examples

A
  • category reduction: combine categories to reduce their number
    ex: Mon-Fri = Weekdays; group rare categories into “Other”
  • dummy variables: AKA indicator or binary variables; each takes the value 1 or 0 to describe two categories of a variable (a variable with n categories needs n - 1 dummy variables)
  • category scores: recode categories as numbers; used when data are ordinal and have natural, ordered categories
    ex: recode satisfaction survey responses to numbers
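
A hedged Python sketch of dummy variables and category scores. The helper `dummy_encode`, the season data, and the 1 to 3 score mapping are all hypothetical names and values chosen for illustration.

```python
def dummy_encode(values, reference):
    """Create one 0/1 column per category, omitting the reference category."""
    categories = sorted(set(values) - {reference})
    return {c: [1 if v == c else 0 for v in values] for c in categories}

# A variable with n = 3 categories needs n - 1 = 2 dummy variables.
season = ["fall", "spring", "summer", "spring", "fall"]
dummies = dummy_encode(season, reference="fall")
print(dummies)

# Category scores: recode ordered categories as numbers (assumed mapping).
score_map = {"dissatisfied": 1, "neutral": 2, "satisfied": 3}
survey = ["satisfied", "neutral", "dissatisfied", "satisfied"]
scores = [score_map[answer] for answer in survey]
print(scores)
```

An observation with 0 in every dummy column belongs to the omitted category, which is why the last dummy is not needed.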
24
Q

In addition to binning, another common approach is to create new variables through ____ transformations

25
What are the 3 common measures of central location?
  • mean
  • median
  • mode
26
What is the measure of relative position? (Explain how it works)
percentile
  • approx. p% of observations are less than the pth percentile
  • approx. (100 - p)% of observations are greater than the pth percentile
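
A small Python sketch of the percentile idea. The `percentile` helper uses linear interpolation between sorted observations, which is only one of several conventions (an assumption here); textbooks and software may use slightly different rules. The data are made up.

```python
def percentile(data, p):
    """pth percentile by linear interpolation between sorted observations."""
    xs = sorted(data)
    pos = (len(xs) - 1) * p / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

# Hypothetical scores 0, 1, ..., 100.
scores = list(range(101))
p90 = percentile(scores, 90)

# Share of observations below and above the 90th percentile.
below = sum(x < p90 for x in scores) / len(scores)
above = sum(x > p90 for x in scores) / len(scores)
print(p90, round(below, 3), round(above, 3))
```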
27
If a variable has outliers, which measure of central location is preferred?
median is preferred over mean
28
What type of variable is mode useful for?
categorical variable
29
What are the 5 measures of dispersion? (Define them)
  • Range: max - min
  • Interquartile range (IQR): Q3 - Q1; range of the middle 50% of observations
  • Mean absolute deviation (MAD): average of the absolute differences of all observations from the mean
  • Variance: average of the squared differences from the mean
  • Standard deviation: square root of the variance (a lower value means observations are closer to the mean)
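
The five measures can be sketched with Python's standard `statistics` module. This assumes sample formulas (variance divides by n - 1) and the "inclusive" quartile method, which is one of several conventions. The data are made up.

```python
from statistics import mean, quantiles, stdev, variance

# Hypothetical sample, chosen so the results are easy to check by hand.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Range: max - min
data_range = max(data) - min(data)

# IQR: Q3 - Q1. Other quartile conventions can give slightly different values.
q1, _, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Mean absolute deviation: average absolute distance from the mean.
center = mean(data)
mad = mean([abs(x - center) for x in data])

# Sample variance (divides by n - 1) and its square root.
var = variance(data)
sd = stdev(data)

print(data_range, iqr, round(mad, 3), var, round(sd, 3))
```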
30
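What are the measures of shape? (2) (Define)
  • Skewness coefficient: degree to which a distribution is not symmetric about its mean; a symmetric distribution has a skewness coefficient of 0
  • Kurtosis coefficient (KC): whether the tails of a distribution are more or less extreme than those of a normal distribution; the normal distribution has KC = 3, so excess kurtosis = KC - 3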
31
What are the measures of association? (2) (Define)
  • Covariance: direction of the linear relationship (sensitive to units of measure)
  • Correlation coefficient: direction and strength of the linear relationship
    ex: 0 (no linear relationship); 0.12 (weak); 0.8 (strong)
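
A Python sketch of why covariance depends on units while the correlation coefficient does not. The helper names and the height and weight values are hypothetical, and the sample (n - 1) formulas are assumed.

```python
from math import sqrt

def sample_cov(x, y):
    """Sample covariance: average cross-product of deviations (n - 1)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Covariance rescaled by both standard deviations, so it is unit-free."""
    return sample_cov(x, y) / sqrt(sample_cov(x, x) * sample_cov(y, y))

height_m = [1.5, 1.6, 1.7, 1.8]
weight_kg = [50.0, 58.0, 63.0, 75.0]
height_cm = [h * 100 for h in height_m]  # same heights, different unit

cov_m = sample_cov(height_m, weight_kg)
cov_cm = sample_cov(height_cm, weight_kg)
r_m = correlation(height_m, weight_kg)
r_cm = correlation(height_cm, weight_kg)
print(round(cov_m, 3), round(cov_cm, 3), round(r_m, 3), round(r_cm, 3))
```

Switching from meters to centimeters multiplies the covariance by 100 but leaves the correlation coefficient unchanged.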
32
What does a box plot graphically display?
  • min
  • Q1
  • Q2 (median)
  • Q3
  • max
33
How are the upper and lower fences calculated on a box plot?
  • lower fence: Q1 - (1.5 x IQR)
  • upper fence: Q3 + (1.5 x IQR)
Any observation below the lower fence or above the upper fence is an outlier
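
A Python sketch of the fence calculation, assuming the "inclusive" quartile method from the standard `statistics` module (other quartile conventions shift the fences slightly). The function name and data are made up.

```python
from statistics import quantiles

def find_outliers(data):
    """Return the lower fence, upper fence, and observations outside them."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return lower_fence, upper_fence, outliers

lower, upper, outliers = find_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(lower, upper, outliers)
```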
34
What does the Empirical Rule state?
For a symmetric, bell-shaped distribution:
  • ~68% of all obs fall within sample mean +/- 1 x sample SD
  • ~95% of all obs fall within sample mean +/- 2 x sample SD
  • almost all (~99.7%) obs fall within sample mean +/- 3 x sample SD
35
The population mean is referred to as a ____ and the sample mean is referred to as a _______.
1. parameter
2. statistic
36
What is the z-score used for? (Provide example)
  • find the distance of an obs from the mean in terms of SDs
    ex: a z-score of 2 means the obs is 2 SDs above the mean
37
What is standardizing? (When is it commonly used?)
converting obs into z-scores; common when dealing with variables measured on different scales
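
A minimal Python sketch of standardizing, assuming the sample standard deviation is used. The two variables and their values are made up to show that z-scores put different scales on a common footing.

```python
from statistics import mean, stdev

def z_scores(values):
    """Convert each observation to its distance from the mean in SD units."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Two hypothetical variables measured on very different scales.
heights_cm = [150, 160, 170]
incomes_usd = [30000, 40000, 50000]

z_heights = z_scores(heights_cm)
z_incomes = z_scores(incomes_usd)
print(z_heights)
print(z_incomes)
```

Both variables standardize to the same z-scores, so the first observation is 1 SD below its mean in each case.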
38
What methods are used to visualize a categorical variable? (2)
  • frequency distribution
  • bar chart (graphical representation of a frequency distribution)
39
What methods are used to visualize a numerical variable? (2)
  • frequency distribution
  • histogram (helps show the shape of the distribution, such as skewness)
40
What methods are used to visualize two categorical variables? (2)
  • contingency table (frequencies for 2 categorical variables)
  • stacked column chart
41
What data visualization techniques can be used with multiple variables? (3) (Explain)
  • bubble plot (3 numerical variables)
  • line chart (connects consecutive obs of a numerical variable; can track changes over time)
  • heat map (can identify combinations of categorical variables that have economic significance)
42
What method is used to visualize two numerical variables?
  • scatter plot (shows whether the two variables have a linear relationship; can also incorporate a categorical variable)
43
Reminder: Tableau can extract data from many sources, including Excel
44
When the value of the response variable is uniquely determined by predictor values (Provide example)
Deterministic relationship
ex: p = mv
45
When the value of the response variable is not uniquely determined due to other factors
stochastic relationship
46
What is the category that is excluded when creating dummy variables called? (2)
  • reference category
  • benchmark category
47
What is a measure that summarizes how well the sample regression equation fits the data?
Goodness-of-fit
48
Instead of $s_e^2$, we generally report the standard deviation of the residual, denoted $s_e$, more commonly referred to as...?
the standard error of the estimate
49
What is the residual in linear regression?
difference between the observed and predicted values of the response variable
50
What are the goodness-of-fit measures? (3) (State ideal preferences)
  • Standard error of the estimate ($s_e$): a smaller $s_e$ is preferred
  • Coefficient of determination ($R^2$): the closer to 1, the better the fit; never decreases as more predictor variables are added to the model
  • Adjusted coefficient of determination (adjusted $R^2$): choose the model with the highest adjusted $R^2$ value
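
A hedged Python sketch of the three measures for a simple linear regression with k = 1 predictor. The helper names and the five data points are made up, and the model is fit with the usual least-squares formulas.

```python
from math import sqrt

def fit_simple(x, y):
    """Least-squares intercept b0 and slope b1 for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx
    return my - b1 * mx, b1

def goodness_of_fit(y, y_hat, k):
    """Return (standard error of the estimate, R^2, adjusted R^2)."""
    n = len(y)
    my = sum(y) / n
    sse = sum((a - b) ** 2 for a, b in zip(y, y_hat))  # squared residuals
    sst = sum((a - my) ** 2 for a in y)                # total variation in y
    se = sqrt(sse / (n - k - 1))
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return se, r2, adj_r2

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = fit_simple(x, y)
y_hat = [b0 + b1 * xi for xi in x]
se, r2, adj_r2 = goodness_of_fit(y, y_hat, k=1)
print(round(se, 4), round(r2, 4), round(adj_r2, 4))
```

Adjusted $R^2$ comes out lower than $R^2$ here because it penalizes the predictor count relative to the small sample size.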
51
We use analysis of variance (ANOVA) in the context of the linear regression model to derive $R^2$. We denote the total variation in y as $\sum (y_i - \bar{y})^2$, which is the numerator in the formula for the variance of y. What is this total variation called?
Total sum of squares
52
What is a good solution when confronted with multicollinearity?
  • drop one of the collinear variables
  • obtain more data, because the sample correlation may get weaker
  • sometimes, do nothing
53
The logistic regression model cannot be estimated with standard ordinary least squares (OLS) procedures. Instead, we rely on which method?
Maximum likelihood estimation (MLE)
54
In the holdout method we partition the data into two independent and mutually exclusive data sets. What are they called?
  • training set
  • validation set
55
Often it is preferable to use the k-fold cross-validation method, where we partition the data into k subsets, and the one that is left out in each iteration is the ____ set.
validation
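
A minimal Python sketch of how k-fold cross-validation partitions observation indices. The generator name is hypothetical, and in practice the observations would usually be shuffled first, which this sketch skips.

```python
def k_fold_indices(n, k):
    """Yield (training, validation) index lists for each of the k iterations."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        validation = folds[i]  # the fold left out in this iteration
        training = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield training, validation

splits = list(k_fold_indices(n=10, k=5))
for training, validation in splits:
    print(sorted(training), validation)
```

Each observation lands in the validation set exactly once and in the training set in the other k - 1 iterations.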
56
What are the other performance measures for logistic regression? (Define them)
  • accuracy: proportion of all cases that are classified correctly
  • sensitivity: proportion of target class cases that are classified correctly
  • specificity: proportion of nontarget class cases that are classified correctly
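
A Python sketch of the three measures computed from actual and predicted class labels. The function name, the labels (1 = target class, 0 = nontarget class), and the data are made up for illustration.

```python
def classification_measures(actual, predicted, target=1):
    """Return (accuracy, sensitivity, specificity) for a binary classifier."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == target and p == target for a, p in pairs)  # true positives
    tn = sum(a != target and p != target for a, p in pairs)  # true negatives
    fp = sum(a != target and p == target for a, p in pairs)  # false positives
    fn = sum(a == target and p != target for a, p in pairs)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity

actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
accuracy, sensitivity, specificity = classification_measures(actual, predicted)
print(accuracy, sensitivity, round(specificity, 4))
```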