Introduction. Visualizing Data Flashcards

(34 cards)

1
Q

What is spurious correlation?

A

A spurious correlation occurs when two variables are correlated but don’t have a causal relationship

2
Q

Omitted variable bias

A

Bias that arises when the model leaves out a variable that has a causal effect on the dependent variable and is also correlated with an included independent variable
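
A small simulation can make the bias visible. The sketch below is only an illustration under assumed values: the true effect of x on y (1.0), the effect of the omitted variable z (2.0), and the sample size are all hypothetical choices, and `slope` is a helper written for this example.

```python
import random

random.seed(0)
n = 5000
true_effect = 1.0  # assumed causal effect of x on y

# z is the variable that will be omitted; x is correlated with z by construction.
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 1) for zi in z]
y = [true_effect * xi + 2.0 * zi + random.gauss(0, 1) for xi, zi in zip(x, z)]


def slope(xs, ys):
    """OLS slope of ys on xs in a one-regressor model: cov(x, y) / var(x)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var = sum((a - mx) ** 2 for a in xs)
    return cov / var


# Regressing y on x alone (z left out) overstates the effect of x.
biased = slope(x, y)
print(round(biased, 2))  # lands near 2.0 rather than the true 1.0
```

In this setup the estimate absorbs part of z's effect because x and z move together; if z were uncorrelated with x, leaving it out would not bias the slope.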

3
Q

Simpson’s Paradox

A

It is a phenomenon in probability and statistics in which a trend appears in several groups of data but disappears or reverses when the groups are combined
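
The reversal is easiest to see with counts. The numbers below are hypothetical and chosen only so that treatment A has the higher success rate in each group while B has the higher rate once the groups are pooled.

```python
# Hypothetical (successes, trials) per group for two treatments.
groups = {
    "group 1": {"A": (8, 10), "B": (70, 100)},
    "group 2": {"A": (30, 100), "B": (2, 10)},
}


def rate(successes, trials):
    return successes / trials


# Within each group, A has the higher success rate.
a_wins_within = all(rate(*g["A"]) > rate(*g["B"]) for g in groups.values())

# Pool the groups by summing successes and trials per treatment.
pooled = {}
for treatment in ("A", "B"):
    s = sum(g[treatment][0] for g in groups.values())
    n = sum(g[treatment][1] for g in groups.values())
    pooled[treatment] = rate(s, n)

print(a_wins_within)              # True
print(pooled["A"] < pooled["B"])  # True: the trend reverses after pooling
```

The reversal here comes from unequal group sizes: A is mostly observed in the group with low success rates, B mostly in the group with high ones.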

4
Q

Unit of analysis

A

The observation described by a set of data. For example, voters, parties, bills, elections, voting decisions, legislative output. Very often our data have multiple levels of analysis (e.g., individuals, regions, countries), calling for different statistical techniques

5
Q

Variables

A

Any characteristic related to the unit of analysis. A variable can take on different values for different observations

6
Q

Types of variables

A

Nominal (e.g., political party), ordinal (e.g., school grades), interval (e.g., temperature in degrees Celsius), ratio (e.g., duration)

7
Q

Data set

A

Set of variables for a given set of observations. Should come with a codebook

8
Q

Hypothesis

A

Statement about the nature of the social and political world, often expressed as statements about relationships between variables (e.g., “The lower X, the higher Y”)

9
Q

Cross-section data

A

Sample of voters, governments, countries, or other units, taken at a given point in time. Observations are typically assumed to be independent

10
Q

Time series data

A

Observations on units over time, e.g., number of conflicts in country X. Because past events can influence future events and lags in behavior are prevalent in social sciences, time is an important dimension in such a data set. Observations are not independent across time (serial correlation)
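
Serial correlation can be illustrated with a simulated series in which each value depends on the previous one. This is a sketch under assumptions: the AR(1) form, the coefficient `phi = 0.8`, and the helper `lag1_autocorr` are hypothetical choices for illustration, not part of the card.

```python
import random

random.seed(1)
phi = 0.8  # assumed dependence of each value on the previous one

# Simulate y_t = phi * y_{t-1} + noise.
y = [0.0]
for _ in range(1999):
    y.append(phi * y[-1] + random.gauss(0, 1))


def lag1_autocorr(series):
    """Sample correlation between the series and itself shifted by one period."""
    m = sum(series) / len(series)
    num = sum((a - m) * (b - m) for a, b in zip(series[:-1], series[1:]))
    den = sum((a - m) ** 2 for a in series)
    return num / den


r1 = lag1_autocorr(y)
print(round(r1, 2))  # close to phi, far from the 0 expected for independent draws
```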

11
Q

Pooled time series cross-section data

A

Data consist of comparable time series data observed on a variety of units. For instance, units are countries, and for each country we observe annual data on a variety of political and economic variables. Typically, we have few units, but long time series. Pooling the data increases the number of observations and makes it possible to control for exogenous shocks. Observations are usually not independent.

12
Q

Panel data

A

A large number of the same cross-sectional units, e.g., survey respondents, are observed repeatedly over a number of “waves” (interviews). With panel data, the time series is usually very short. Common in studies of political behavior. For example, the German Socio-Economic Panel (SOEP) or the German Internet Panel (GIP) in Mannheim

13
Q

A histogram

A

A bar graph that shows the distribution of the measurements of a variable. The range of the variable is split into subintervals (bins) plotted along the horizontal axis, and the height of each bar shows how many observations fall in that bin
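
The binning step can be sketched in a few lines. The data and the bin width of 2.0 are hypothetical, and the text bars stand in for a real plot.

```python
from collections import Counter

data = [1.2, 1.9, 2.1, 2.5, 3.3, 3.8, 4.0, 4.4, 4.9, 7.5]  # hypothetical
bin_width = 2.0

# Assign each observation to the half-open bin [k * width, (k + 1) * width).
counts = Counter(int(x // bin_width) for x in data)

# Print one text bar per non-empty bin; bar length = number of observations.
for k in sorted(counts):
    lo, hi = k * bin_width, (k + 1) * bin_width
    print(f"[{lo:.0f}, {hi:.0f}): {'#' * counts[k]}")
```

Changing `bin_width` changes the picture, which is one of the histogram's known weaknesses.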

14
Q

Density plot

A

Addresses the deficiencies of histograms by averaging and smoothing. It shows an estimate of the probability density function of the random variable X
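
One common way to do the averaging and smoothing is a kernel density estimate. The sketch below assumes a Gaussian kernel and a hand-picked bandwidth of 0.8; the card does not say which method is meant, and `gaussian_kde` is a helper written for this example (it is not a library call).

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return a function that averages one Gaussian bump per observation."""
    n = len(sample)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        total = sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample)
        return total / norm

    return density


sample = [1.2, 1.9, 2.1, 2.5, 3.3, 3.8, 4.0, 4.4, 4.9, 7.5]  # hypothetical
f = gaussian_kde(sample, bandwidth=0.8)

# A density should integrate to about 1; check with a simple Riemann sum.
step = 0.05
grid = [i * step for i in range(-100, 300)]  # from -5.0 up to just under 15.0
area = sum(f(x) * step for x in grid)
print(round(area, 3))
```

A larger bandwidth gives a smoother curve and a smaller one follows the data more closely, much like the bin width in a histogram.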

15
Q

Measures of Central Tendency

A

Mode, Median, Mean

16
Q

Mode

A

Most frequently occurring value of X

17
Q

Median

A

Value of X that falls in the middle position when the observations are ordered from smallest to largest. Median = 50th percentile = 2nd quartile

18
Q

Mean

A

$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
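
The three measures of central tendency can be computed side by side with Python's standard `statistics` module. The sample is hypothetical.

```python
import statistics

x = [2, 3, 3, 5, 7, 10]  # hypothetical sample

mode = statistics.mode(x)      # 3, the most frequent value
median = statistics.median(x)  # 4.0, the average of the two middle values 3 and 5
mean = sum(x) / len(x)         # 5.0, the sum 30 divided by n = 6

print(mode, median, mean)
```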

19
Q

When is mean = mode = median?

A

In a perfectly symmetric, unimodal distribution, e.g., the normal distribution

20
Q

In a right-skewed (positive skew) distribution, which is greater: the median or the mean?

A

mean > median

21
Q

In a left-skewed (negative skew) distribution, which is greater: the median or the mean?

A

median > mean
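
The direction of the inequality can be checked on small samples. Both samples below are hypothetical; the left-skewed one is simply the mirror image of the right-skewed one.

```python
import statistics

# Hypothetical samples: one with a long right tail, one with a long left tail.
right_skewed = [1, 2, 2, 3, 3, 4, 20]
left_skewed = [-v for v in right_skewed]

for name, sample in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    mean = statistics.mean(sample)
    median = statistics.median(sample)
    # The long tail pulls the mean toward it; the median stays near the bulk.
    print(f"{name}: mean={mean}, median={median}")
```

In the right-skewed sample the mean (5) exceeds the median (3); in the mirrored sample the median (-3) exceeds the mean (-5).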

22
Q

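Which is sensitive to outliers: the mean or the median?

A

The mean. The median depends only on the middle position of the ordered observations, so a single extreme value barely moves it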
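
A quick check with one extreme value shows the difference. The data are hypothetical, with the last value of the second list standing in for something like a data-entry error.

```python
import statistics

clean = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 500]  # hypothetical: one extreme value

mean_shift = statistics.mean(with_outlier) - statistics.mean(clean)
median_shift = statistics.median(with_outlier) - statistics.median(clean)

print(mean_shift)    # 99: the mean moves from 3 to 102
print(median_shift)  # 0: the median stays at 3
```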

23
Q

Sample Variance: definition and formula

A

Average of the squared deviations from the mean
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
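
The formula can be checked by computing it by hand and comparing with `statistics.variance`, which also divides by n − 1. The sample is hypothetical.

```python
import math
import statistics

x = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
n = len(x)
x_bar = sum(x) / n  # 5.0

# Sum of squared deviations from the mean, divided by n - 1.
s2_manual = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)

# The standard library uses the same n - 1 denominator.
s2_library = statistics.variance(x)

print(math.isclose(s2_manual, s2_library))  # True
```

Here the squared deviations sum to 32, so the sample variance is 32 / 7.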

24
Q

Standard deviation: definition and formula

A

Square root of the sample variance: $s = \sqrt{s^2}$

25
Q

Range: definition and formula

A

Difference between the largest and smallest measurement: $\text{Range} = x_{\max} - x_{\min}$
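
Standard deviation and range are both one-liners in Python. The sketch uses a hypothetical sample; `statistics.stdev` is the sample standard deviation (n − 1 denominator).

```python
import math
import statistics

x = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

s = statistics.stdev(x)        # square root of the sample variance
data_range = max(x) - min(x)   # largest minus smallest measurement

print(math.isclose(s, math.sqrt(statistics.variance(x))))  # True
print(data_range)                                           # 7
```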

26
Q

Interquartile range (IQR): definition and formula

A

Difference between the upper and lower quartiles (range of the middle 50% of the distribution): $\mathrm{IQR} = x_{Q3} - x_{Q1}$

27
Q

Q1 in boxplot

A

25th percentile

28
Q

Q3 in boxplot

A

75th percentile

29
Q

Q2 in boxplot

A

Median, or 50th percentile

30
Q

Q0 in boxplot

A

0th percentile, lowest data point excluding outliers

31
Q

Q4 in boxplot

A

100th percentile, highest data point excluding outliers
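
32
Q

Lower whisker

A

Bounded by $Q_1 - 1.5 \times \mathrm{IQR}$: the whisker extends to the lowest data point at or above this value, and points below it are drawn as outliers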
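
33
Q

Upper whisker

A

Bounded by $Q_3 + 1.5 \times \mathrm{IQR}$: the whisker extends to the highest data point at or below this value, and points above it are drawn as outliers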
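
The boxplot quantities can be computed together. This is a sketch under assumptions: the data are hypothetical, and the quartiles use the "inclusive" method of `statistics.quantiles`; other quartile conventions exist and can give slightly different values.

```python
import statistics

data = [1, 3, 4, 5, 6, 7, 8, 9, 10, 30]  # hypothetical, with one extreme value

# Quartiles (inclusive method: the data are treated as including min and max).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Bounds at 1.5 * IQR beyond the quartiles.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points still inside the bounds.
inside = [v for v in data if lower_bound <= v <= upper_bound]
lower_whisker, upper_whisker = min(inside), max(inside)
outliers = [v for v in data if v < lower_bound or v > upper_bound]

print(q1, q2, q3, iqr)               # 4.25 6.5 8.75 4.5
print(lower_whisker, upper_whisker)  # 1 10
print(outliers)                      # [30]
```

In this sample the bounds are -2.5 and 15.5, so the whiskers stop at the data points 1 and 10 and the value 30 is flagged as an outlier.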