module 5 Flashcards
(35 cards)
Variables and observations:
In research studies, information is collected from specific subjects, such as consumers or firms.
These data provide insight into how these subjects score on various variables of interest.
For every subject, data is collected for each variable.
Hence, we observe the scores on the variables for each subject.
Data sets:
In a data set:
rows capture observations (on, e.g., consumers or firms).
In cross-sectional studies (studies at one point in time):
the number of rows equals the number of subjects (consumers, firms, …) in the data set.
Columns display variables. A variable can take on different values for different subjects.
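As a minimal sketch of this layout, a cross-sectional data set can be represented in Python as a list of rows, one per subject. All variable names and values below are made up for illustration.

```python
# Hypothetical cross-sectional data set: one row per subject (consumer),
# one column per variable. All names and values are illustrative.
data_set = [
    {"id": 1, "age": 34, "gender": "female", "satisfaction": 6},
    {"id": 2, "age": 51, "gender": "male", "satisfaction": 4},
    {"id": 3, "age": 27, "gender": "female", "satisfaction": 7},
]

n_subjects = len(data_set)            # number of rows = number of subjects
variables = list(data_set[0].keys())  # column names = variables

# A variable takes on different values for different subjects.
ages = [row["age"] for row in data_set]

print(n_subjects, variables, ages)
```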
Why do we convert textual variables into numerical codes, and how?
Even though variables can take on textual values, manually entering textual data in a data set is often a nuisance.
It takes a lot of time
The probability of making mistakes increases
Therefore, researchers typically recode textual values: they replace text-based answers with a numerical code.
For example, in a survey, respondents may have to indicate whether they are male or female. Males may be coded as 1, whereas females may be coded as 0.
Other numbers could also be used to replace the textual values (e.g., male = 1, female = 2; or male = 6, female = 9).
What is a dummy variable?
Variables that only take on the values 0 or 1 are called dummy variables.
- Researchers typically use 0 and 1 because this simplifies the interpretation of statistical analyses.
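A small sketch of such a recoding, using made-up survey answers and the male = 1, female = 0 coding from the example above. It also illustrates one reason 0/1 coding is convenient: the mean of a dummy variable is the share of subjects coded 1.

```python
# Hypothetical survey answers entered as text.
answers = ["male", "female", "female", "male", "female"]

# Recode text into a dummy variable: male = 1, female = 0.
coding = {"male": 1, "female": 0}
male_dummy = [coding[a] for a in answers]

# The mean of a 0/1 variable equals the proportion of subjects coded 1.
share_male = sum(male_dummy) / len(male_dummy)

print(male_dummy)   # [1, 0, 0, 1, 0]
print(share_male)   # 0.4
```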
Ensuring your variables match your unit of analysis
The variables in your data set must match the study’s unit of analysis. Specifically:
The dependent variable is measured at the level of the unit of analysis. So are the mediator variables.
Independent and moderator variables are measured at the level of the unit of analysis or a more aggregate level.
What sampling does:
The rows in a dataset consist of subjects (firms, customers, products, …). But how do you determine which subjects will be part of your dataset? This is where sampling comes in.
(For example:
In March 2006, the American Medical Association reported disturbing rates of binge drinking and unprotected sex among college students during spring break. The report was based on what the researchers initially claimed was a survey of “a random sample” of 644 students.
The survey results were breathlessly reported on the Today Show, the CBS Early Show, and hundreds of reports followed on local television and radio newscasts. The findings also were reported in Time magazine and the New York Times.
One problem in the spring break study: the sample was not random but self-selected: the results were based on students who volunteered to answer the question as part of an online survey panel. In addition, only about a quarter of these students had ever even gone on a spring break trip. The Times eventually published a correction explaining the misrepresentation.
Statistics are certainly useful in finding answers to research questions, but they can do more harm than good if the sample is not correct. That is, if data are not collected from the right subjects, the research outcomes might be highly misleading, as the spring break example demonstrates.
The lesson here: sampling is critical! )
What is a sample?
A subset of the population of interest.
- Be careful when taking a sample: low representativeness means that properties of the population are over- or underrepresented in the sample,
leading to a high sampling error.
What is a population?
The entire group of people, firms, events, or things of interest about which you would like to make inferences.
Sample size: does it matter?
Sample size and representativeness are two related, but different issues. The size of a sample is not a guarantee of its ability to accurately represent a population. Large unrepresentative samples can perform as badly as small unrepresentative samples.
That said, a larger sample size can decrease the sampling error between your sample and the population of interest. This of course only holds if you use an appropriate sampling design.
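The difference between size and representativeness can be illustrated with a small simulation. This is only a sketch under made-up assumptions: a hypothetical population of 10,000 scores, a large sample that contains only the highest-scoring half, and two random samples of different sizes.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 subjects with scores 1..10,000.
population = list(range(1, 10001))
population_mean = statistics.mean(population)

# Large but unrepresentative sample: only the 5,000 highest-scoring subjects.
large_biased = population[5000:]

# Small but representative sample: 100 subjects drawn at random.
small_random = random.sample(population, 100)

# Larger representative sample: 2,500 subjects drawn at random.
larger_random = random.sample(population, 2500)


def sampling_error(sample):
    """Absolute gap between the sample mean and the population mean."""
    return abs(statistics.mean(sample) - population_mean)


print("large biased :", sampling_error(large_biased))
print("small random :", sampling_error(small_random))
print("larger random:", sampling_error(larger_random))
```

In this setup the large biased sample misses the population mean by 2,500 points no matter how big it is, while the random samples land much closer. Among random samples, larger ones tend to be closer on average, although any single draw can deviate.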
Steps in the sampling process:
1) Define the population you are interested in.
2) Determine the sampling frame. The sampling frame is the physical representation of the population through which one can reach out to that population.
3) Decide on the sampling design.
Errors in the sampling frame and their solutions
When determining the sampling frame, errors can occur:
Undercoverage: true population members are excluded
Miscoverage: non-population members are included
Solution:
If the error is small, recognize it but ignore it.
If the error is large, redefine the population in terms of the sampling frame.
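Both coverage errors can be pictured as set differences between the population and the sampling frame. The sketch below uses made-up subject IDs; the rates are just one possible way to judge whether an error is small or large.

```python
# Hypothetical example: the population of interest versus the sampling frame
# (e.g., a customer list used to reach that population). IDs are made up.
population = {"A", "B", "C", "D", "E", "F"}
sampling_frame = {"C", "D", "E", "F", "X", "Y"}

# Undercoverage: true population members missing from the frame.
undercoverage = population - sampling_frame

# Miscoverage: frame entries that are not population members.
miscoverage = sampling_frame - population

undercoverage_rate = len(undercoverage) / len(population)
miscoverage_rate = len(miscoverage) / len(sampling_frame)

print(sorted(undercoverage), round(undercoverage_rate, 2))
print(sorted(miscoverage), round(miscoverage_rate, 2))
```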
When deciding on the sampling design, what are the classifications?
Sampling designs can be broken down into 2 subcategories:
1) Probability sampling and 2) Non-probability sampling
These can be further broken down as follows:
1) Probability sampling
i) Simple random sampling
ii) Systematic sampling
iii) Stratified sampling
iv) Cluster sampling
2) Non-probability sampling
i) Convenience sampling
ii) Quota sampling
iii) Judgment sampling
iv) Snowball sampling
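Three of the probability designs listed above can be sketched in a few lines. The sampling frame of 100 firms, the two strata, and the sample size are assumptions made for illustration; cluster sampling and the other designs are left out to keep the example short.

```python
import random

random.seed(1)  # fixed seed for reproducibility

# Hypothetical sampling frame of 100 firms, each in one of two strata.
frame = [{"id": i, "size": "small" if i < 80 else "large"} for i in range(100)]
n = 10

# i) Simple random sampling: every subject has the same chance of selection.
simple_random = random.sample(frame, n)

# ii) Systematic sampling: random start, then every k-th subject.
k = len(frame) // n
start = random.randrange(k)
systematic = frame[start::k]

# iii) Stratified sampling: draw randomly within each stratum,
# here in proportion to the stratum's share of the frame.
stratified = []
for stratum in ("small", "large"):
    members = [f for f in frame if f["size"] == stratum]
    n_stratum = round(n * len(members) / len(frame))
    stratified.extend(random.sample(members, n_stratum))

print(len(simple_random), len(systematic), len(stratified))
```

With proportional allocation, the stratified sample always contains 8 small and 2 large firms, whereas a simple random sample may over- or underrepresent either stratum by chance.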
Factors that guide the selection of a suitable statistical test/technique
- The number of independent variables in the conceptual model: one vs. multiple.
- The measurement levels of these variables: metric vs. categorical.
Statistical tests in the case of a conceptual model with
one independent variable
What is the most suitable statistical test
when
both the dependent variable and the (single!) independent variable in the conceptual model are metric (interval or ratio)?
When
both the dependent variable and the (single!) independent variable in the conceptual model are metric (interval or ratio),
Pearson’s correlation coefficient
is the most suitable statistical test to uncover whether the two variables are related.
Statistical tests in the case of a conceptual model with
one independent variable
When
both the dependent variable and the independent variable are categorical (nominal or ordinal),
a chi-square test is the appropriate statistical test.
Statistical tests in the case of a conceptual model with
one independent variable
When
The dependent variable is metric, but the (single) independent variable is categorical,
either a t-test or a one-way analysis of variance (one-way ANOVA) is suitable.
The selection between these two hinges on the number of levels of the independent variable.
When
the independent variable has just two levels,
a t-test is the appropriate choice.
When the independent variable comprises three or more levels,
a one-way ANOVA is the appropriate test.
Statistical techniques in the case of a conceptual model with multiple independent variables
When the dependent variable is metric
and the independent variables are typically categorical,
ANOVA is the appropriate technique.
(ANOVA: the independent variables are typically categorical, but it can also handle metric variables.)
Experimental studies typically use (variations of) ANOVA.
Statistical techniques in the case of a conceptual model with multiple independent variables
When the dependent variable is metric
and the independent variables are typically metric,
linear regression analysis is the appropriate technique.
(Regression: the independent variables are typically metric, but it can also handle categorical variables.)
Archival and survey studies typically use (variations of) regression analysis.
Statistical techniques in the case of a conceptual model with multiple independent variables
When the dependent variable is categorical
and
the independent variable(s) are metric and/or categorical,
the appropriate technique is a logit analysis (logistic regression).
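The selection rules on the preceding cards can be summarized as a small decision function. This is a sketch of the cards' logic only: the function name `suggest_test` and its parameters are hypothetical, and it does not replace checking the assumptions of each test.

```python
def suggest_test(dv_level, iv_level, n_ivs=1, iv_groups=None):
    """Suggest a test following the rules summarized on these cards.

    dv_level, iv_level: "metric" or "categorical".
    n_ivs: number of independent variables in the conceptual model.
    iv_groups: number of levels of a single categorical IV, if known.
    """
    if dv_level == "categorical":
        if n_ivs == 1 and iv_level == "categorical":
            return "chi-square test"
        return "logit analysis (logistic regression)"

    # Metric dependent variable from here on.
    if n_ivs > 1:
        return "ANOVA" if iv_level == "categorical" else "linear regression"
    if iv_level == "metric":
        return "Pearson's correlation coefficient"
    if iv_groups is None:
        return "t-test (2 levels) or one-way ANOVA (3 or more levels)"
    return "t-test" if iv_groups == 2 else "one-way ANOVA"


print(suggest_test("metric", "categorical", iv_groups=2))
print(suggest_test("metric", "categorical", iv_groups=4))
print(suggest_test("categorical", "metric", n_ivs=3))
```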
i) Pearson’s correlation coefficient: a metric DV and IV
Pearson’s correlation coefficient measures the strength of the linear relationship between two metric (interval or ratio) variables.
The possible range of values for the correlation coefficient is -1.0 to 1.0. In other words, a correlation cannot exceed 1.0 or be less than -1.0. A correlation of -1.0 indicates a perfect negative correlation, while a correlation of 1.0 indicates a perfect positive correlation.
A positive relationship exists between two variables if the correlation coefficient is greater than zero.
A positive correlation means that when one variable increases, the other one also increases.
Conversely, if the correlation coefficient is less than zero, it is a negative relationship. A negative correlation means that when one variable increases, the other one decreases.
A correlation of zero indicates that there is no linear relationship between the two variables.
Note that a correlation of zero between two variables does not mean that there is no relationship between the two variables at all. It means that there is no linear relationship. The relationship could still be non-linear.
To calculate a correlation coefficient in SPSS, click Analyze –> Correlate –> Bivariate. In the pop-up window, specify the two variables to be used in the analysis by clicking the blue arrow button.
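Outside SPSS, the coefficient can be computed by hand. The sketch below uses made-up data and a hypothetical helper `pearson_r`; it shows a perfect positive, a perfect negative, and a zero correlation, where the last case is a clearly non-linear (U-shaped) relationship.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equally long lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


x = [-2, -1, 0, 1, 2]

r_positive = pearson_r(x, [2 * v + 1 for v in x])  # points on a rising line
r_negative = pearson_r(x, [-3 * v for v in x])     # points on a falling line
r_curved = pearson_r(x, [v ** 2 for v in x])       # U-shaped, non-linear

print(r_positive, r_negative, r_curved)
```

The U-shaped case yields a correlation of zero even though y is fully determined by x, which is why a zero correlation only rules out a linear relationship.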
ii) Chi-square tests: a categorical DV and IV
A chi-square test tests whether there is a relationship between two categorical variables (nominal or ordinal).
Basically, it checks whether the frequencies occurring in the sample differ significantly from the frequencies one would expect. Thus, the observed frequencies are compared with the expected frequencies and their deviations are examined.
Suppose we want to examine whether there is a relationship between gender and the highest level of education.
The frequency distribution is displayed in a contingency table.
A chi-square test tests whether gender and education are related by comparing the numbers in this table with the numbers we would expect if the two variables were independent.
How to run it in SPSS:
In SPSS, click Analyze –> Descriptive Statistics –> Crosstabs. Crosstabs creates a contingency table with the distribution of the two variables, as in the example above. Enter one categorical variable as the row variable and the other categorical variable as the column variable. To run the chi-square test, check the Chi-square box in the Statistics window.
Note that you are only allowed to perform a χ² test if you have collected enough data.
But what constitutes enough? Ideally, you want all the cells in your cross table to have an expected count of 1 or higher. Additionally, no more than 20% of the cells should have an expected count lower than 5. You can find this information just under the table in the SPSS output.
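To make the comparison of observed and expected frequencies concrete, here is a minimal sketch with a made-up gender by education table. The counts are hypothetical; the code computes the expected counts under independence, the chi-square statistic, and the "enough data" check described above.

```python
# Hypothetical contingency table: rows = gender, columns = education level.
observed = [
    [20, 30, 10],  # e.g., male: low, medium, high
    [10, 30, 20],  # e.g., female: low, medium, high
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# "Enough data" rule of thumb, applied to the expected counts.
cells = [e for row in expected for e in row]
all_at_least_1 = min(cells) >= 1
share_below_5 = sum(e < 5 for e in cells) / len(cells)
enough_data = all_at_least_1 and share_below_5 <= 0.20

print(round(chi_square, 3), df, enough_data)
```

For these made-up counts the statistic is about 6.667 with 2 degrees of freedom, which exceeds the 5% critical value of 5.991, so the two variables would be judged related.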
iii) T-test: a metric DV and a categorical IV with 2 levels
When the dependent variable is metric and the independent variable is categorical with two levels (just two, no more), a t-test is appropriate.
A t-test is used to establish whether the difference between the means of the 2 groups is statistically significant.
To compare independent samples, you need an independent samples t-test (also called an unpaired samples t-test).
The requirements for an independent samples t-test are as follows:
one nominal variable with 2 levels (e.g., gender) and
one metric variable (e.g., salary) for which the group means are calculated.
To compare 2 dependent groups, you need a dependent samples t-test (e.g., when the same group is sampled at 2 different points in time).
This requires 2 metric variables, one per time period.
When the p-value is less than 0.05, we say there is a statistically significant difference between the 2 groups.
A p-value less than 0.05 is typically considered to be statistically significant, in which case the null hypothesis should be rejected.
A p-value greater than 0.05 means that deviation from the null hypothesis is not statistically significant, and the null hypothesis is not rejected.
Null hypothesis: no mean difference exists between the groups.
Alternative hypothesis:
1) Non-directional: a mean difference exists between the groups
2) Directional: the mean of group 1 is larger than the mean of group 2 (or vice versa)
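The two t-test variants can be sketched by computing the test statistics by hand. All data below are made up, and the independent samples version assumes equal variances in both groups (pooled variance); SPSS additionally reports the p-value, which this sketch does not compute.

```python
import math
import statistics

# Hypothetical salaries (in thousands) for two independent groups.
group_1 = [30, 32, 34, 36, 38]
group_2 = [24, 26, 28, 30, 32]

n1, n2 = len(group_1), len(group_2)
mean_1, mean_2 = statistics.mean(group_1), statistics.mean(group_2)
var_1, var_2 = statistics.variance(group_1), statistics.variance(group_2)

# Independent samples t-test (pooled variance, assuming equal variances).
pooled_var = ((n1 - 1) * var_1 + (n2 - 1) * var_2) / (n1 + n2 - 2)
t_independent = (mean_1 - mean_2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
df_independent = n1 + n2 - 2

# Dependent (paired) samples t-test: the same subjects measured twice.
before = [10, 12, 9, 14, 11]
after = [12, 15, 10, 15, 14]
diffs = [a - b for a, b in zip(after, before)]
t_paired = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))
df_paired = len(diffs) - 1

print(round(t_independent, 3), df_independent)
print(round(t_paired, 3), df_paired)
```

For these numbers the independent samples statistic is 3.0 with 8 degrees of freedom, above the two-tailed 5% critical value of 2.306, so the null hypothesis of no mean difference would be rejected. The paired statistic is about 4.472 with 4 degrees of freedom, above the critical value of 2.776.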
iv) One-way ANOVA: a metric DV and a categorical IV with 3 or more levels
When the dependent variable is metric and the independent variable is categorical with three levels (or more), a one-way ANOVA is appropriate.
The goal of a one-way ANOVA is to establish whether the differences between the means of 3 or more groups are significant.
A one-way ANOVA extends the t-test: it can go beyond 2 groups (levels) to compare more groups.
Null hypothesis: no mean difference exists between the groups.
Alternative hypothesis:
At least 2 group means are significantly different from each other.
A post hoc test is used to test which group means differ.
- In SPSS, a one-way analysis of variance with repeated measures can be run by clicking Analyze –> General Linear Model –> Repeated Measures.
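The logic of a one-way ANOVA with independent groups can be sketched by computing the F statistic by hand: variation between the group means is compared with variation within the groups. The three groups and their scores are made up, and the sketch covers neither the p-value, nor post hoc tests, nor the repeated measures variant.

```python
import statistics

# Hypothetical scores for three groups (levels of the independent variable).
groups = {
    "A": [4, 5, 6],
    "B": [6, 7, 8],
    "C": [8, 9, 10],
}

all_scores = [s for scores in groups.values() for s in scores]
grand_mean = statistics.mean(all_scores)

# Between-group sum of squares: how far group means lie from the grand mean.
ss_between = sum(
    len(scores) * (statistics.mean(scores) - grand_mean) ** 2
    for scores in groups.values()
)

# Within-group sum of squares: spread of scores around their own group mean.
ss_within = sum(
    (s - statistics.mean(scores)) ** 2
    for scores in groups.values()
    for s in scores
)

df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)
f_statistic = (ss_between / df_between) / (ss_within / df_within)

print(f_statistic, df_between, df_within)
```

Here F equals 12.0 with (2, 6) degrees of freedom, which is above the 5% critical value of roughly 5.14, so at least two group means would be judged different. A post hoc test would then show which ones.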
v) ANOVA or regression analysis
When the dependent variable is metric, an ANOVA or a regression analysis can be used.
Regression analysis (metric DV, typically metric IVs) focuses on how changes in continuous independent variables affect the dependent variable (but can also deal with categorical IVs),
while ANOVA (metric DV, typically categorical IVs) focuses on uncovering group differences (but can also deal with continuous IVs).
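The overlap between the two techniques can be illustrated with a simple regression fitted by hand. In this sketch (made-up data, hypothetical helper `simple_regression`), a continuous IV yields a slope per unit change, while a categorical IV coded as a 0/1 dummy yields a slope equal to the difference between the two group means.

```python
def simple_regression(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum(
        (a - mean_x) ** 2 for a in x
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Continuous IV: the slope is the change in the DV per one-unit change in the IV.
ad_spend = [1, 2, 3, 4, 5]
sales = [12, 14, 16, 18, 20]
intercept_c, slope_c = simple_regression(ad_spend, sales)

# Categorical IV coded as a dummy (0/1): the intercept is the mean of group 0
# and the slope is the difference between the two group means.
treated = [0, 0, 0, 1, 1, 1]
outcome = [4, 5, 6, 7, 8, 9]
intercept_d, slope_d = simple_regression(treated, outcome)

print(intercept_c, slope_c)
print(intercept_d, slope_d)
```

In the dummy case the intercept (5.0) is the mean of the group coded 0 and the slope (3.0) is the gap to the group coded 1, which is the kind of group difference ANOVA focuses on.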