Statistics Flashcards

(18 cards)

1
Q

If a histogram is right (or positive) skewed, which is bigger, the mean or the median? Explain why!

A

A right skewed histogram has the tail towards the right, i.e. towards higher values. The mean is pulled in the same direction as the tail of the histogram so the mean is larger than the median in this case.
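
As a small illustration of the answer above, the following sketch compares the mean and the median of a right-skewed sample. The numbers are made up for the example and are not part of the original card.

```python
from statistics import mean, median

# Hypothetical right-skewed sample: most values are small, one is very large.
values = [20, 22, 25, 27, 30, 32, 35, 120]

sample_mean = mean(values)      # pulled upward by the single large value
sample_median = median(values)  # middle of the sorted values, barely affected

print(f"mean = {sample_mean}, median = {sample_median}")
```

The single large value drags the mean towards the tail, while the median stays near the bulk of the data.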

2
Q

For adult Swedish males age 50 the average weight is 84 kg and the standard deviation is 12 kg. How should the standard deviation be interpreted in this setting?

A

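Roughly speaking, the weight of a typical man in this group deviates from the mean of 84 kg by about 12 kg. Some weigh 15 kg more, some 5 kg less, and so on, but a typical deviation is about 12 kg. (Strictly, the standard deviation is the square root of the average squared deviation, not the plain average of the deviations.)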
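
The sketch below shows how a standard deviation is computed from deviations around the mean. The five weights are hypothetical, chosen so the mean is 84 kg; they are not data from the card. Here the whole list is treated as the population.

```python
from math import sqrt
from statistics import mean, pstdev

# Hypothetical weights in kg with mean 84 (illustration only).
weights = [69, 79, 84, 89, 99]

avg = mean(weights)
deviations = [w - avg for w in weights]  # -15, -5, 0, 5, 15

# Standard deviation: square root of the average squared deviation.
sd_by_hand = sqrt(sum(d ** 2 for d in deviations) / len(weights))
sd_library = pstdev(weights)  # population standard deviation

# The plain average of absolute deviations is a different, smaller number.
mean_abs_dev = mean([abs(d) for d in deviations])

print(sd_by_hand, sd_library, mean_abs_dev)
```

Both quantities describe a "typical" distance from the mean, which is why the informal reading of the standard deviation as an average deviation is common.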

3
Q

Explain how the coefficients a and b in simple linear regression are estimated using the method of least squares. There is no need for any formulas, just explain the basics of the method of least squares.

A

We need to define a line that is as close as possible to the data. The line is determined by the two coefficients a and b. We choose the values of a and b such that the sum of the squared vertical deviations from the line is as small as possible. Such a and b always exist and are unique, provided the x values are not all identical.
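
The card asks for no formulas, but a short sketch can make the idea concrete. It assumes the line is written as y = a + b*x, with a as the intercept and b as the slope (the card does not say which is which), and it uses made-up data points.

```python
def fit_line(xs, ys):
    """Least-squares estimates for y = a + b*x (a: intercept, b: slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx  # requires that the x values are not all equal
    a = y_bar - b * x_bar
    return a, b

def sum_squared_deviations(a, b, xs, ys):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data points (illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
best = sum_squared_deviations(a, b, xs, ys)

# Moving either coefficient away from the estimate increases the sum.
shifted_intercept = sum_squared_deviations(a + 0.1, b, xs, ys)
shifted_slope = sum_squared_deviations(a, b + 0.1, xs, ys)

print(a, b, best, shifted_intercept, shifted_slope)
```

Comparing `best` with the two shifted sums shows what "as small as possible" means in practice.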

4
Q

The correlation between “Exam score” and “hours spent revising” is 0.82. What does this mean? What would a correlation of 0 mean?

A

A correlation of 0.82 indicates a fairly strong positive linear relationship between the two variables. Since the correlation is positive, when one of the variables is large the other tends to be large as well.
If the correlation were 0, there would be no linear relationship between the variables (a non-linear relationship could still exist).
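
A minimal sketch of how a correlation coefficient is computed, using made-up data rather than the exam data from the card. The second data set shows that a correlation of 0 rules out a linear relationship only: there y depends exactly on x, yet r is 0.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical hours and scores with a positive linear trend.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]
r_linear = pearson_r(hours, scores)

# y = x squared on a symmetric range: a clear relationship, but not a linear one.
x = [-2, -1, 0, 1, 2]
y = [v ** 2 for v in x]
r_curved = pearson_r(x, y)

print(round(r_linear, 3), r_curved)
```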

5
Q

Explain how the residuals for a regression model are calculated and give at least one possible source of variation in the residuals.

A

Each residual is calculated by subtracting the value estimated by the linear model from the actual observation.
Two possible sources of variation are:
Random measurement errors.
Model error, i.e. using a linear model when, for instance, the true relationship is quadratic.
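
The calculation can be sketched as follows. The data points are hypothetical, and the line y = 2.2 + 0.6*x is the least-squares line for exactly these points, so it is an assumption of the example rather than something from the card.

```python
# Hypothetical observations (illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Least-squares line for the points above: y = a + b*x.
a, b = 2.2, 0.6

fitted = [a + b * x for x in xs]                 # values estimated by the model
residuals = [y - f for y, f in zip(ys, fitted)]  # observed minus estimated

for x, y, f, e in zip(xs, ys, fitted, residuals):
    print(f"x={x}  observed={y}  fitted={f:.1f}  residual={e:+.1f}")
```

For a least-squares line with an intercept the residuals sum to zero, so any pattern left in them (for example a curve) hints at model error rather than pure measurement noise.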

6
Q

The difference between the largest and the smallest data values is called the:

A

Range

7
Q

The median of a sample will always equal the:

A

50th percentile

8
Q

The correlation coefficient between the price of broccoli and the amount of rainfall during the growing season is calculated to be −0.878. This indicates that:

A

prices tend to be low when rainfall is high

9
Q
The following is the descriptive statistics printout for a set of data:
Mean - 473.4615
Median - 451
Mode - 425
Standard Deviation - 210.7663
Minimum - 264
Maximum - 1049
Sum - 6155
Count - 13

Which of the following is true for this data set?

A

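a. the distribution is right-skewed
b. the best measure of central tendency is the median
c. both (a) and (b) are correct
Answer: (c). The mean exceeds the median, which exceeds the mode, and the maximum (1049) lies far above the median. This suggests a right-skewed distribution, and for skewed data the median is the better measure of central tendency.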
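
The numbers in the printout can be checked with a few lines. The skewness measure used here, Pearson's second skewness coefficient 3 * (mean - median) / standard deviation, is one common rule of thumb and is an addition for illustration; the card itself does not name a measure.

```python
# Values taken from the printout in the question.
total, count = 6155, 13
median_value = 451
std_dev = 210.7663

mean_value = total / count  # should reproduce the printed mean

# Pearson's second skewness coefficient: positive values suggest right skew.
skewness = 3 * (mean_value - median_value) / std_dev

print(f"mean = {mean_value:.4f}, skewness coefficient = {skewness:.3f}")
```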

10
Q

The coefficient of determination is the:

A

ratio of the explained variation to the total variation.
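
A minimal sketch of this ratio, using hypothetical data and the least-squares line for those data (y = 2.2 + 0.6*x). The variable names are illustrative.

```python
# Hypothetical observations and their least-squares line y = a + b*x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

y_bar = sum(ys) / len(ys)
fitted = [a + b * x for x in xs]

total_variation = sum((y - y_bar) ** 2 for y in ys)               # around the mean
explained_variation = sum((f - y_bar) ** 2 for f in fitted)       # captured by the line
unexplained_variation = sum((y - f) ** 2 for y, f in zip(ys, fitted))

r_squared = explained_variation / total_variation

print(f"R^2 = {r_squared:.2f}")
```

For a least-squares line, the explained and unexplained parts add up to the total variation, so the ratio always lies between 0 and 1.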

11
Q

If two variables are highly correlated, then:

A

they tend to vary together (a high correlation alone does not show that one variable causes the other)

12
Q

The standard error of the regression measures the:

A

variability of the dependent variable relative to the regression line.
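
As a sketch, the standard error of the regression in simple linear regression is usually computed as the square root of the sum of squared residuals divided by n - 2. The data and the fitted line y = 2.2 + 0.6*x below are hypothetical.

```python
from math import sqrt

# Hypothetical observations and their least-squares line y = a + b*x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
sum_sq_residuals = sum(e ** 2 for e in residuals)

# Two coefficients were estimated, so n - 2 degrees of freedom remain.
n = len(xs)
standard_error = sqrt(sum_sq_residuals / (n - 2))

print(f"standard error of the regression = {standard_error:.3f}")
```

The result is in the same units as the dependent variable and can be read as a typical vertical distance between an observation and the line.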

13
Q

A multiple regression model has the form: Y = 2 + 3X1 + 4X2. When X1 increases by 1 unit (holding X2 constant), Y will:

A

increase by 3 units
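
The answer can be verified directly from the model in the question. The function name and the input values below are arbitrary choices for the illustration.

```python
def predict(x1, x2):
    """Model from the question: Y = 2 + 3*X1 + 4*X2."""
    return 2 + 3 * x1 + 4 * x2

# Increase X1 by one unit while holding X2 fixed at an arbitrary value.
change_from_x1 = predict(6, 10) - predict(5, 10)

# For comparison: increase X2 by one unit while holding X1 fixed.
change_from_x2 = predict(5, 11) - predict(5, 10)

print(change_from_x1, change_from_x2)
```

Each coefficient is the change in Y per unit change in its own variable when the other variable is held constant.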

14
Q

If the correlation coefficient r = 0.5 then the coefficient of determination is:

A

0.25

15
Q

Which is a better measure of central tendency and dispersion when the data contains outliers?

A

Median and interquartile range
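
The following sketch shows why these two measures are preferred. It uses two made-up samples that differ only in the largest value, and Python's `statistics.quantiles` with its default method to obtain the quartiles (other quartile conventions can give slightly different numbers).

```python
from statistics import mean, median, quantiles, stdev

def interquartile_range(data):
    """Distance between the first and third quartile."""
    q1, _, q3 = quantiles(data, n=4)
    return q3 - q1

# Hypothetical samples: identical except for one extreme value.
clean = [10, 12, 13, 15, 16, 18, 20]
with_outlier = [10, 12, 13, 15, 16, 18, 200]

for name, data in [("clean", clean), ("with outlier", with_outlier)]:
    print(
        f"{name}: mean={mean(data):.1f}, sd={stdev(data):.1f}, "
        f"median={median(data)}, IQR={interquartile_range(data)}"
    )
```

The mean and standard deviation change a lot when the outlier is introduced, while the median and interquartile range stay the same.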

16
Q

Explain the difference and give examples of the four different data levels (also known as the levels of measurement).

A

Nominal data, the lowest level of data, is data that can only be classified into different categories. The categories have no natural ordering.
Examples: Gender, Eye colour etc.

Ordinal data, the second lowest level of data, is data that can be classified into categories, but the categories have a natural ordering.
Examples: Grades, Attitude towards statistics etc.

Interval data, the second highest level of data, is data that is numerical but lacks an absolute zero. This means that we measure on a scale with equal distances between values, but zero is not absolute. An absolute zero means that when the measure is zero, there is nothing. Interval data is not very common. Ex: Temperature measured in Fahrenheit or Celsius (but not Kelvin).

Ratio data, the highest level of data, is numerical data that has an absolute zero. Here the zero is absolute since, for instance, a length of zero cm means there is no length. Ex: Height in cm, age, weight in kg, etc.
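
The practical difference between interval and ratio data can be sketched with the temperature example. The helper function and the two temperatures are illustrative choices: differences are meaningful on both scales, but ratios are only meaningful when the zero is absolute.

```python
def celsius_to_kelvin(celsius):
    """Convert a temperature from degrees Celsius to Kelvin."""
    return celsius + 273.15

warm_c, cool_c = 20, 10

# Differences are meaningful on an interval scale and agree across scales.
difference_celsius = warm_c - cool_c
difference_kelvin = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

# Ratios are not: 20 degrees Celsius is not "twice as warm" as 10 degrees Celsius.
ratio_celsius = warm_c / cool_c
ratio_kelvin = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(difference_celsius, round(difference_kelvin, 2))
print(ratio_celsius, round(ratio_kelvin, 3))
```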

17
Q

Explain under coverage and over coverage, and explain which of these is the bigger problem in a study.

A

Over coverage: The sampling frame contains elements that are not part of the population. Ex.: A survey among the unemployed. Unemployed people = population. You receive a list from the employment center of the unemployed people registered there. The list = sampling frame. If there are people on the list who are no longer unemployed, your sampling frame suffers from over coverage.
Under coverage: The sampling frame lacks elements of the population. Ex.: Examine the population of homeless people living in Gothenburg. You gather a list of the homeless by contacting social services, the Salvation Army, etc. Your list (sampling frame) will probably not include all homeless people in Gothenburg and hence suffers from under coverage.
Under coverage is the bigger problem. With over coverage you only need to examine the selected elements/persons and check whether they are in the population or not. If they are not, you just choose another element/person. With under coverage there are some elements/persons that you can never reach, which might influence the study greatly.
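
The two definitions can be sketched with set operations. The names are invented for the illustration; in practice the full population is of course not known, which is exactly why under coverage is hard to detect.

```python
# Hypothetical population: everyone who is actually unemployed.
population = {"Anna", "Bo", "Carl", "Dina"}

# Hypothetical sampling frame: the list received from the employment center.
sampling_frame = {"Bo", "Carl", "Dina", "Erik"}

# Over coverage: on the list, but not part of the population.
over_coverage = sampling_frame - population

# Under coverage: part of the population, but missing from the list.
under_coverage = population - sampling_frame

print("over coverage:", over_coverage)
print("under coverage:", under_coverage)
```

Elements in `over_coverage` can be screened out once they are sampled, whereas elements in `under_coverage` can never be sampled at all.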

18
Q

Explain and draw scatter plots when the correlation coefficient r = 0, r is positive and r is negative.

A

The correlation coefficient r measures the strength and direction of the linear relationship between two variables.

Positive values of r indicate positive correlation between X and Y, negative values indicate negative correlation, and r = 0 implies that X and Y are uncorrelated, i.e. there is no linear relationship.

Larger positive values of r indicate stronger positive correlation. r = 1 indicates perfect positive correlation.

More negative values of r indicate stronger negative correlation. r = -1 indicates perfect negative correlation.
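
Since the card asks for drawings, here is a minimal sketch that prints three rough text scatter plots together with their correlation coefficients. The three small data sets and the helper `ascii_scatter` are made up for the illustration; a real plot would normally be drawn by hand or with a plotting library.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

def ascii_scatter(xs, ys, size=5):
    """Draw integer points with coordinates in 0..size-1 as a text grid."""
    grid = [["."] * size for _ in range(size)]
    for x, y in zip(xs, ys):
        grid[size - 1 - y][x] = "*"  # flip rows so larger y is drawn higher
    return "\n".join("".join(row) for row in grid)

xs = [0, 1, 2, 3, 4]
examples = {
    "positive": [0, 2, 1, 3, 4],  # points rise from left to right
    "negative": [4, 3, 1, 2, 0],  # points fall from left to right
    "zero": [2, 4, 0, 4, 2],      # no upward or downward linear trend
}

for label, ys in examples.items():
    print(f"{label}: r = {pearson_r(xs, ys):.2f}")
    print(ascii_scatter(xs, ys))
    print()
```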