Chapter 1 Flashcards
(35 cards)
What is “big data”?
explosion in secondary data typified by increases in the volume, variety, and velocity of the data being made available from a myriad set of sources
What is “bivariate partial correlation”?
simple (two-variable) correlation between 2 sets of residuals (unexplained variance) that remain after the association of other independent variables is removed
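A minimal sketch of the residual idea, assuming a single control variable z and made-up data (all names and numbers here are illustrative, not from the chapter): regress x on z and y on z, then correlate the two sets of residuals.

```python
from math import sqrt

def mean(v):
    return sum(v) / len(v)

def pearson(a, b):
    """Simple (two-variable) Pearson correlation."""
    ma, mb = mean(a), mean(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    var_a = sum((p - ma) ** 2 for p in a)
    var_b = sum((q - mb) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)

def residuals(target, control):
    """Residuals from a least-squares regression of target on control."""
    mt, mc = mean(target), mean(control)
    slope = (sum((c - mc) * (t - mt) for c, t in zip(control, target))
             / sum((c - mc) ** 2 for c in control))
    intercept = mt - slope * mc
    return [t - (intercept + slope * c) for t, c in zip(target, control)]

# Hypothetical data: z is the variable whose association is removed.
z = [1, 2, 3, 4, 5, 6, 7, 8]
x = [3, 1, 4, 6, 5, 9, 7, 10]
y = [2, 4, 3, 7, 5, 6, 9, 8]

simple_r = pearson(x, y)
partial_r = pearson(residuals(x, z), residuals(y, z))
print(f"simple r = {simple_r:.3f}, partial r (controlling z) = {partial_r:.3f}")
```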
What is “bootstrapping”?
approach to validating a multivariate model by drawing a large number of subsamples and estimating models for each subsample
● Doesn't rely on statistical assumptions about the population to assess statistical significance; instead, the assessment is based solely on the sample data
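A rough sketch of the procedure, assuming a simple regression slope as the "model" and resampling with replacement (the data, the seed, and the 95% percentile interval are illustrative choices, not from the chapter):

```python
import random

def slope(xs, ys):
    """Least-squares slope of ys on xs; None if xs has no variation."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return None
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx

def bootstrap_slopes(xs, ys, n_boot=1000, seed=42):
    """Re-estimate the slope on n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        est = slope([xs[i] for i in idx], [ys[i] for i in idx])
        if est is not None:  # skip degenerate resamples
            estimates.append(est)
    return estimates

# Hypothetical sample data.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
y = [2.3, 2.9, 4.1, 4.0, 5.6, 6.2, 6.1, 7.9, 8.4, 8.8, 10.1, 10.3]

boot = sorted(bootstrap_slopes(x, y))
lower = boot[int(0.025 * len(boot))]
upper = boot[int(0.975 * len(boot)) - 1]
print(f"95% percentile interval for the slope: [{lower:.3f}, {upper:.3f}]")
```

The interval comes only from the spread of the resampled estimates, with no distributional assumption about the population.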
What is “causal inference”?
methods that move beyond statistical inference to the stronger statement of “cause and effect” in non-experimental situations
What is “cross validation”?
original sample is divided into a number of smaller subsamples (validation samples); the validation fit is the “average” fit across all subsamples
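One way this could look in practice, assuming a simple linear regression as the model and mean squared error as the fit measure (both are illustrative assumptions, as are the data and the choice of k = 4):

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of a least-squares line."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def mse(intercept, slope, xs, ys):
    """Mean squared prediction error on the given cases."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cross_validate(xs, ys, k=4):
    """Hold out each of k subsamples in turn; average the held-out fit."""
    n = len(xs)
    fold_scores = []
    for fold in range(k):
        held_out = set(range(fold, n, k))  # every k-th case forms one subsample
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        test_x = [xs[i] for i in sorted(held_out)]
        test_y = [ys[i] for i in sorted(held_out)]
        b0, b1 = fit_line(train_x, train_y)
        fold_scores.append(mse(b0, b1, test_x, test_y))
    return sum(fold_scores) / k, fold_scores

# Hypothetical sample data.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
y = [2.3, 2.9, 4.1, 4.0, 5.6, 6.2, 6.1, 7.9, 8.4, 8.8, 10.1, 10.3]

avg_fit, per_fold = cross_validate(x, y, k=4)
print(f"average validation MSE across folds: {avg_fit:.3f}")
```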
What are “data mining models”?
based on algorithms that are widely used in big data applications
● Emphasis on predictive accuracy rather than statistical inference and explanation, as seen in statistical/data models such as multiple regression
What is “dependence technique”?
classification of statistical techniques distinguished by having a variable or set of variables identified as the dependent variable(s) and the remaining variables as independent
● Objective = prediction of the DV(s) by IV(s)
● Dependent variable → presumed effect of, or response to, a change in the IV(s)
● Independent variable → presumed cause of any change in the DV
What is “dimensional reduction”?
reduction of multicollinearity among variables by forming composite measures of multicollinear variables through such methods as exploratory factor analysis
What is “directed acyclic graph (DAG)”?
Graphical portrayal of causal relationships used in causal inference analysis to identify all “threats” to causal inference. Similar in some ways to path diagrams used in structural equation modeling.
What is a “dummy variable”?
nonmetrically measured variable transformed into a metric variable
○ Assigning a 1 or 0 to a subject
○ Always have one dummy variable less than the number of levels for the nonmetric variable
■ The omitted category is the reference category
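A small sketch of the coding rule, assuming a three-level occupation variable with "professor" chosen as the reference category (the function name and the `is_` column prefix are hypothetical):

```python
def dummy_code(values, reference):
    """Create one 0/1 variable per level, omitting the reference level."""
    levels = sorted(set(values))
    kept = [level for level in levels if level != reference]
    return [{f"is_{level}": int(v == level) for level in kept} for v in values]

occupation = ["physician", "attorney", "professor", "attorney"]
rows = dummy_code(occupation, reference="professor")
for subject, row in zip(occupation, rows):
    print(subject, row)
```

Three levels yield two dummy variables; a professor is the subject scoring 0 on both.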
Effect size
estimate of the degree to which the phenomenon being studied (e.g. correlation or difference in means) exists in the population
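As one illustration for a difference in means, a sketch using Cohen's d with a pooled standard deviation. This particular measure is an assumption chosen for the example (the card does not name one), and the two groups are made-up data:

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Standardized difference in means using the pooled sample SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b))
                  / (n_a + n_b - 2))
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

# Hypothetical scores for two groups.
group_a = [5, 6, 7, 8, 9]
group_b = [3, 4, 5, 6, 7]

d = cohens_d(group_a, group_b)
print(f"Cohen's d = {d:.3f}")
```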
Estimation sample
portion of original sample used for model estimation in conjunction with validation sample
Validation sample
portion of the sample “held out” from estimation and then used for an independent assessment of model fit on data that wasn’t used in estimation (holdout sample)
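A minimal sketch of splitting one original sample into the two parts, assuming a random 70/30 split (the share, the seed, and the function name are illustrative assumptions):

```python
import random

def split_sample(cases, validation_share=0.3, seed=0):
    """Randomly split cases into (estimation sample, validation sample)."""
    rng = random.Random(seed)
    shuffled = cases[:]          # copy so the original order is untouched
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * validation_share)
    return shuffled[n_val:], shuffled[:n_val]

cases = list(range(20))          # hypothetical case IDs
estimation, validation = split_sample(cases)
print(len(estimation), "cases for estimation,", len(validation), "held out")
```

The model would be estimated only on `estimation`, and its fit then assessed on `validation`.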
General linear model (GLM)
Fundamental linear dependence model which can be used to estimate many model types (e.g., multiple regression, ANOVA/MANOVA, discriminant analysis) with the assumption of a normally distributed dependent measure.
Generalized linear model (GLZ or GLIM)
similar in form to GLM, but able to accommodate non-normal dependent measures such as binary variables
● Logistic regression model
● Uses maximum likelihood estimation rather than ordinary least squares
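A rough sketch of maximum likelihood estimation for a logistic regression with one predictor. Plain gradient ascent on the log-likelihood is used here only to keep the example dependency-free; statistical packages typically use other iterative routines. The data, learning rate, and step count are assumptions:

```python
from math import exp, log

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def log_likelihood(b0, b1, xs, ys):
    """Bernoulli log-likelihood of a binary outcome under the logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * log(p) + (1 - y) * log(1 - p)
    return total

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Climb the log-likelihood surface by repeated small gradient steps."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)   # observed minus predicted
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Hypothetical binary outcome (e.g., buyer = 1, non-buyer = 0).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(xs, ys)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```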
Indicator
single variable used in conjunction with one or more other variables to form a composite measure
● Composite measure → combination of two or more indicators
Measurement error
inaccuracies of measuring the “true” variable values due to the fallibility of the measurement instrument, data entry errors, or respondent errors
Metric data
Also called quantitative data, interval data, or ratio data, these measurements identify or describe subjects (or objects) not only on the possession of an attribute but also by the amount or degree to which the subject may be characterized by the attribute. For example, a person’s age and weight are metric data.
● = Quantitative data, interval data, or ratio data
Non-metric Data
Also called qualitative data, these are attributes, characteristics, or categorical properties that identify or describe a subject or object. They differ from metric data by indicating the presence of an attribute, but not the amount.
Examples are occupation (physician, attorney, professor) or buyer status (buyer, non-buyer). Also called nominal data or ordinal data.
● Difference from metric → these indicate the presence of an attribute, but not the amount
Multicollinearity
Extent to which a variable can be explained by the other variables in the analysis.
- As multicollinearity increases, it complicates the interpretation of the variate because it is more difficult to ascertain the effect of any single variable, owing to their interrelationships.
Multivariate analysis
Analysis of multiple variables in a single relationship or set of relationships.
Multivariate measurement
the use of two or more variables as indicators of a single composite measure
- For example, a personality test may provide the answers to a series of individual questions (indicators), which are then combined to form a single score (summated scale) representing the personality trait.
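The personality-test example can be sketched as follows, assuming four questions answered on a 1 to 5 scale and a plain sum as the combining rule (the respondents and answers are made up):

```python
# Each list holds one respondent's answers to the four indicator questions.
answers = {
    "respondent_1": [4, 5, 3, 4],
    "respondent_2": [2, 1, 2, 3],
}

# Summated scale: combine the indicators into a single composite score.
summated = {who: sum(items) for who, items in answers.items()}
print(summated)
```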
Overfitting
estimation of model parameters that over-represent the characteristics of the sample at the expense of generalizability to the population
Practical significance
assessing multivariate analysis results based on the substantive findings rather than their statistical significance
● E.g. assesses whether the result is useful in achieving research objectives vs just finding whether the result is attributable to chance