Week 1 Flashcards

(48 cards)

1
Q
A

a)
A parameter describes a particular characteristic of the entire population of UK
adults. A statistic, on the other hand, describes a particular characteristic of a sample
from the population. Because 54% is based only on a sample from the UK population, it
is a statistic.

b) It is very unlikely that the new sample will contain exactly the same number of Remainers as the old sample. However, the sample size is reasonably large and we should not expect the difference between the two sample proportions to be large.

2
Q
A

a) Categorical, nominal. The responses can only be grouped into categories. The
responses are nominal because it is not possible to rank the responses.

b) Categorical, ordinal. The responses can be grouped into categories. The
measurements are ordinal because the ratings of customer service can be ranked.

c) Numerical, continuous. The outcomes correspond to actual numbers where
differences between any two values are quantitatively meaningful. The measurement is continuous because time can take any value within a given interval.

3
Q

c) Indicate whether each variable in the study is numerical or categorical. If numerical,
identify it as continuous or discrete. If it is categorical, give the level of measurement.

A

a) Each row of the data matrix represents a participant in the survey.

b) There were 1,691 participants in the survey.

c) Sex: Categorical, nominal. The responses can only be grouped into categories. The responses
are nominal because it is not possible to rank the responses.
Age: Numerical, continuous. The outcomes correspond to actual numbers and the difference
between any two ages is quantitatively meaningful. Age is continuous because it can take
any value within a given range of possible ages. In this survey, however, ages are recorded
as whole numbers and are reported as discrete variables. Even though age is reported as a
discrete variable, the units are small enough that we would treat this as a continuous
variable.
grossIncome: Categorical, ordinal. The concept of income is continuous, but in this survey it
is reported as a categorical variable. It is ordinal because the different income categories
can be ranked.
Smoke: Categorical, nominal. The responses can only be grouped into categories. The
responses are nominal because it is not possible to rank the responses.
amtWeekends: Numerical, discrete. The outcomes correspond to actual numbers. The
responses are discrete because the number of cigarettes smoked can only be a whole number.
amtWeekdays: Numerical, discrete. The outcomes correspond to actual numbers. The
responses are discrete because the number of cigarettes smoked can only be a whole number.
Similar to Age, we might treat the last two variables as continuous in practice.

4
Q
A

The two variables are positively associated. Countries in which a higher
percentage of the population have access to the internet also tend to have higher life
expectancies.

No, that is not a reasonable conclusion. Omitted third variables, such as level
of economic development, likely drive both internet use and life expectancy.

5
Q
A
6
Q
A

a) The distribution is right skewed with potential outliers on the positive end. We
should expect that these positive outliers will pull the mean above the median.

b) The distribution is somewhat symmetric and has few, if any, extreme
observations, therefore we should expect that the mean and median will be similar.

c) Most Edinburgh undergraduate students are around 20 years old, and the
majority are European and will have spent almost all of their 240 months in Europe. Very
few students will have spent more than 300 or so months in Europe, but some of the
overseas students (particularly freshers) will have lived 0 months in Europe (at least
initially). The minority of observations clustered at the extreme lower end of the
distribution should pull the mean down below the median.

d) The distribution would be right skewed. Most employees would make
something on the order of the median salary, but we would anticipate that upper
management makes much more. The distribution would have a long right tail, and we
should expect the mean to be greater than the median.

7
Q
A
8
Q
A

Since about 35% of the data is in the first bar, Q1 (the 25th percentile) falls there, between 0 and 10.
The median is in the second bar, between 10 and 20. Q3 appears to be the fifth bar, between
40 and 50.

9
Q
A

Note that when estimating the population
variance and standard deviation from a sample, we use 𝑛 − 1 rather than 𝑛 in the
denominator.

mean = 2%

standard dev = 0.79

10
Q
A
11
Q
A

a) Distribution (2) has a higher mean since 20 > 13, and a higher standard
deviation since 20 is much further from the rest of the data than 13.

b) Distribution (1) has a higher mean since -20 > -40, and distribution (2) has a
higher standard deviation since -40 is farther away from the rest of the data than -20.

c) Distribution (2) has a higher mean since all values in this distribution are
higher than those in distribution (1), but both distributions have the same standard
deviation since they are equally variable around their respective means.

d) Both distributions have the same mean since they are both centred at 300, but
distribution (2) has a higher standard deviation since the observations are farther from
the mean than in distribution (1).

12
Q
A
13
Q
A

Because each stock price has a different mean, it is misleading to simply compare
standard deviations. In order to meaningfully compare the relative volatility of stock prices,
we should calculate the coefficient of variation, which is calculated as the ratio of the standard
deviation to the mean multiplied by 100. For Ford, Honda, and Toyota, the coefficients of
variation are 49.05, 26.80, and 29.68, respectively. Therefore, Ford stock prices are the most
volatile.
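
The coefficient of variation described above can be sketched in a few lines of Python. The prices below are hypothetical and are not the Ford, Honda, or Toyota data from the exercise; the function name is also just an illustration.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

# Hypothetical prices: both series have the same standard deviation (2.0),
# but very different means.
stock_a = [10.0, 12.0, 14.0]
stock_b = [100.0, 102.0, 104.0]

cv_a = coefficient_of_variation(stock_a)
cv_b = coefficient_of_variation(stock_b)
print(f"CV of stock A: {cv_a:.2f}, CV of stock B: {cv_b:.2f}")
```

Although the two hypothetical series have identical standard deviations, stock A is more volatile relative to its price level, which is the comparison the coefficient of variation is designed to make.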

14
Q
A
15
Q
A
16
Q

The mean height is 61.52 inches with a standard deviation of 4.58 inches. Use this
information to determine if the heights approximately follow the "68-95-almost all" empirical
rule.

A
17
Q
A
18
Q
A
19
Q
A
24
Q

what is systematic sampling?

A

Every jth item in the population list is selected. This assumes the order of the list has no connection to the subject being studied;
if there is a hidden pattern or link between the list order and the study topic, the sample may be biased
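
A minimal sketch of systematic sampling in Python. The random starting point within the first j items is a common convention and an assumption here, not something stated on the card; the function name and the population of 100 ID numbers are hypothetical.

```python
import random

def systematic_sample(population, j, seed=None):
    """Pick a random start among the first j items, then take every jth item."""
    rng = random.Random(seed)
    start = rng.randrange(j)  # assumed convention: random start in 0..j-1
    return population[start::j]

# Hypothetical population list: ID numbers 1 to 100.
people = list(range(1, 101))
sample = systematic_sample(people, j=10, seed=1)
print(sample)
```

If the list order cycled with period 10 (for example, every 10th person being a team leader), this sample would contain either only leaders or none at all, which is the hidden-pattern bias the card warns about.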

25
what is a parameter?
numerical measure that describes a specific characteristic of a population
26
what is a statistic?
numerical measure that describes a specific characteristic of a sample
27
what is a sampling error?
occurs because only a subset of the population is sampled, not the entire population
28
what is a non sampling error?
unrelated to the sampling method - can occur even in a full census eg. incorrect population sampled, inaccurate responses
29
what are the statistical thinking steps?
1. define the problem 2. choose a sampling method 3. collect data 4. analyse data using statistics to make decisions about the population
30
what is a categorical variable? nominal or ordinal?
responses that belong to groups or categories - nominal = there's no inherent order eg. gender, country - ordinal = ordered categories but differences between ranks aren't measurable
31
what is a numerical variable? discrete or continuous?
represents numerical data with measurable values - discrete = countable values eg. no. of students (think of steps) - continuous = measured over an interval and can take any value within a range eg. height, weight
32
what is qualitative data?
no measurable meaning to the difference in numbers - includes nominal and ordinal measures
33
what is quantitative data?
measurable meaning eg. test scores - includes interval and ratio levels of measurement
34
what is interval data?
ranked, with measurable differences between values - there's no true zero, so ratios are meaningless - eg. 10 degrees and 30 degrees Celsius differ by 20 degrees, but 30 degrees is not 3x hotter
35
what is ratio data?
a true zero exists - ratios between values are meaningful eg. 100 lbs is half of 200 lbs
36
what graphs are used to describe categorical variables?
- frequency distribution - bar charts - cross tables - pie charts - pareto diagrams
37
what graphs are used to describe time series data?
- time series data = measurements collected over successive time intervals. The sequence of observations matters eg. annual GDP, monthly product sales - line chart
38
What do percentiles do?
indicate the relative position of a value within a data set
39
What are the measures of variability?
range - difference between the largest and smallest observations - the greater the spread of the data the larger the range will be - interquartile range = Q3-Q1
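
A short Python sketch of the range and interquartile range, using a small hypothetical data set. Quartile conventions differ between textbooks and software; this uses the default method of the standard library's `statistics.quantiles`, so a hand calculation with another convention may give slightly different Q1 and Q3.

```python
import statistics

# Hypothetical observations, for illustration only.
data = [2, 4, 4, 5, 7, 9, 10, 12]

# Range: largest observation minus smallest observation.
data_range = max(data) - min(data)

# Quartiles: n=4 splits the data into four parts and returns Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(f"range = {data_range}, Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```

Unlike the range, the IQR ignores the lowest and highest quarters of the data, so a single extreme observation does not change it much.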
40
what does a strong correlation NOT imply?
that one variable causes another
41
what are measurements of relationships between variables?
covariance correlation coefficient
42
what is Z score?
measures the number of standard deviations a specific value (xi) is from the mean
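
As a rough illustration, z-scores can be computed in Python as below. The scores are hypothetical, and the sketch assumes the sample standard deviation (n-1 denominator) is the one wanted; with a full population, `statistics.pstdev` would be used instead.

```python
import statistics

def z_scores(values):
    """Number of sample standard deviations each value lies from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation
    return [(x - mean) / sd for x in values]

# Hypothetical test scores.
scores = [50, 60, 70, 80, 90]
z = z_scores(scores)
for score, value in zip(scores, z):
    print(f"score {score}: z = {value:+.2f}")
```

A value equal to the mean has a z-score of 0, values below the mean have negative z-scores, and values above it have positive ones.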
43
what is the population variance? how does this change for a sample?
the sum of squared differences between each observation and the population mean, divided by the population size - for a sample, use the sample mean and divide by n-1 instead, which makes the sample variance an unbiased estimator of the population variance
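
The difference between the two denominators can be seen in a small Python sketch. The data are hypothetical; `statistics.pvariance` and `statistics.variance` are the standard library's population and sample versions.

```python
import statistics

# Hypothetical observations with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = statistics.mean(data)

# Sum of squared differences from the mean.
sum_sq = sum((x - mean) ** 2 for x in data)

pop_var = sum_sq / n         # population variance: divide by n
samp_var = sum_sq / (n - 1)  # sample variance: divide by n - 1

# The library functions should agree with the manual calculation.
lib_pop_var = statistics.pvariance(data)
lib_samp_var = statistics.variance(data)

print(f"population variance = {pop_var:.3f}, sample variance = {samp_var:.3f}")
```

The sample variance is always a little larger than the population formula applied to the same numbers, and the gap shrinks as n grows.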
44
what is standard deviation?
measures the average spread (distance) between each data point and the mean - it brings the data back to the original unit of measurement - it's the square root of the variance
45
what is the empirical rule?
provides an estimate of the approx % of observations that are contained within 1, 2 or 3 standard deviations of the mean (for roughly bell-shaped data) - 68% of observations are in the interval mean ± 1 st.dev - 95% are in mean ± 2 st.dev - almost all are in mean ± 3 st.dev
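
One way to check the empirical rule on a data set is to count the share of observations within 1, 2 and 3 standard deviations of the mean. The sketch below uses simulated, hypothetical heights drawn from a normal distribution (the mean and standard deviation are borrowed loosely from the heights card; the actual height data are not available here).

```python
import random
import statistics

def share_within(values, k):
    """Fraction of values within k sample standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    inside = [x for x in values if mean - k * sd <= x <= mean + k * sd]
    return len(inside) / len(values)

# Simulated (hypothetical) heights in inches, bell-shaped by construction.
rng = random.Random(0)
heights = [rng.gauss(61.5, 4.6) for _ in range(10_000)]

shares = {k: share_within(heights, k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} st.dev: {share:.1%}")
```

If the shares come out close to 68%, 95% and nearly 100%, the data approximately follow the rule; strongly skewed data usually will not.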
46
what is the covariance?
a numerical measure of the direction of a linear relationship between 2 variables - positive = direct/increasing linear relationship - negative = decreasing linear relationship
47
what is the correlation coefficient (Pearson's r)?
standardised measure of the linear relationship between 2 variables - it provides both the direction and the strength of the relationship - rule of thumb: a relationship exists if the absolute value of r is >= 2/root n
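
A minimal Python sketch of the sample covariance, Pearson's r, and the 2/root n rule of thumb from the card. The two variables are hypothetical numbers loosely modelled on the internet access and life expectancy example; the function names are illustrative.

```python
import math
import statistics

def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def pearson_r(xs, ys):
    """Covariance standardised by the two sample standard deviations."""
    return sample_covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))

# Hypothetical data: % of population with internet access, life expectancy (years).
internet = [10, 30, 50, 70, 90]
life_expectancy = [55, 62, 70, 74, 80]

cov = sample_covariance(internet, life_expectancy)
r = pearson_r(internet, life_expectancy)

# Rule of thumb from the card: a relationship is suggested if |r| >= 2 / sqrt(n).
threshold = 2 / math.sqrt(len(internet))
relationship_suggested = abs(r) >= threshold

print(f"covariance = {cov:.1f}, r = {r:.3f}, threshold = {threshold:.3f}")
```

The covariance only gives the direction and depends on the units of both variables, whereas r always lies between -1 and 1, so its size can be compared across data sets. A large r still does not show that one variable causes the other.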
48
what is Chebyshev's Theorem?