1.5: Point Estimates, Confidence Intervals, and Resampling Flashcards

1
Q

The two branches of statistical inference

A

hypothesis testing and estimation

2
Q

hypothesis testing

A

tests whether the value of a parameter equals some specific (hypothesized) value

3
Q

Estimation

A

seeks to find the value of the parameter

4
Q

Estimators

A

the formulas used to calculate the sample statistics

5
Q

Estimates

A

are the particular values derived from these estimators

6
Q

An unbiased estimator

A

one whose expected value equals the parameter it is estimating

7
Q

An efficient unbiased estimator

A

has the smallest sampling-distribution variance among unbiased estimators for a given sample size

ex:

–> Estimator A is efficient because its estimates are tightly grouped around the true value of μ (smaller standard error).

–> Estimator B is inefficient because its estimates are more spread out from the true value of μ (larger standard error)
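
A minimal simulation sketch of the comparison above. It assumes a normal population with hypothetical parameters (μ = 5, σ = 2) and uses the sample mean and sample median as the two estimators; both are unbiased for μ in this setting, but the mean should show the smaller standard error. The function name, parameters, and seed are illustrative and not from the source.

```python
import random
import statistics

def sampling_spread(estimator, n=25, trials=2000, mu=5.0, sigma=2.0, seed=42):
    """Return (centre, standard error) of an estimator over repeated samples."""
    rng = random.Random(seed)  # same seed, so both estimators see the same samples
    estimates = [
        estimator([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(trials)
    ]
    return statistics.fmean(estimates), statistics.stdev(estimates)

mean_center, mean_se = sampling_spread(statistics.fmean)
median_center, median_se = sampling_spread(statistics.median)

print(f"sample mean:   centre={mean_center:.3f}, standard error={mean_se:.3f}")
print(f"sample median: centre={median_center:.3f}, standard error={median_se:.3f}")
```

Both centres should sit near the assumed μ, while the spread of the median's estimates should be visibly wider.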

8
Q

A consistent estimator

A

gets closer to the population parameter’s value as the sample size increases

As the sample size approaches infinity, the standard error will approach zero, and the distribution will fully concentrate over the true population value

ex:

–> Estimator A is consistent because its standard error significantly narrows down when sample size increases.

–> Estimator B is inconsistent. Increasing sample size barely improves the accuracy of the estimate
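
A short sketch of what consistency looks like for the sample mean, assuming a standard normal population (an illustrative choice, not from the source). It simulates the standard error at three sample sizes; the helper name and seed are hypothetical.

```python
import random
import statistics

def standard_error_of_mean(n, trials=500, mu=0.0, sigma=1.0, seed=7):
    """Estimate the standard error of the sample mean by simulation."""
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(trials)
    ]
    return statistics.stdev(means)

# The spread of the sample mean should shrink as n grows.
spread_by_n = {n: standard_error_of_mean(n) for n in (10, 100, 1000)}
for n, se in spread_by_n.items():
    print(f"n={n:>4}: simulated standard error = {se:.4f}")
```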

9
Q

A point estimate is unlikely to exactly equal the population parameter due to sampling error

what should we use then?

A

An interval estimate

10
Q

A 100(1−α)% confidence interval

A

is a range that has a 1−α probability of containing the parameter, where α is the significance level

ex: using a 5% significance level creates a 95% confidence interval around the sample mean. We can be 95% confident that the population mean falls somewhere in this interval

11
Q

A 100(1−α)% confidence interval is calculated by:

A

Point Estimate ± Reliability Factor × Standard Error

12
Q

The 100(1−α)% confidence interval for a population mean from a normally distributed population with known variance is:

what does this do?

A

$\bar{X} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$

This produces a confidence interval with lower and upper bounds such that there is a total probability of α that the population mean lies outside the interval.

$z_{\alpha/2}$ is used because α/2 is the probability in each tail.
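
A minimal sketch of this calculation using only the Python standard library. The inputs (mean return of 6%, known σ of 10%, n = 100) are hypothetical, and the function name is an illustration rather than anything from the source.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for a mean when the population std dev is known."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # reliability factor z_(alpha/2)
    margin = z * sigma / sqrt(n)                 # reliability factor x standard error
    return sample_mean - margin, sample_mean + margin

# Hypothetical inputs: sample mean 6%, known sigma 10%, n = 100.
lower, upper = z_confidence_interval(0.06, 0.10, 100)
print(f"95% CI: [{lower:.4f}, {upper:.4f}]")
```

With these assumed inputs the standard error is 1%, so the interval is roughly 6% ± 1.96%.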

13
Q

When the population variance is unknown, as is often the case, it is appropriate to use the sample standard deviation as a substitute for the population standard deviation.

what is the formula?

A

$\bar{X} \pm z_{\alpha/2} \cdot \frac{s}{\sqrt{n}}$

14
Q

the t-distribution

A

used for confidence intervals when the population variance is unknown

This is valid even when the sample size is small

Since it is more conservative (i.e., the reliability factor is bigger), the confidence interval will be wider

15
Q

The confidence interval for the population mean can use the t-distribution when the variance is unknown provided the sample is large, or the population is approximately normally distributed.

what is the formula to do so?

A

$\bar{X} \pm t_{\alpha/2} \cdot \frac{s}{\sqrt{n}}$

degrees of freedom: n - 1

Look up the reliability factor in the t-table, where the confidence level intersects the degrees of freedom.
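
A small sketch of the t-based interval. The standard library has no t quantile function, so this assumes a short hard-coded excerpt of two-tailed 95% critical values from a standard t-table; the return data and all names are hypothetical.

```python
from math import sqrt
from statistics import fmean, stdev

# Excerpt of a standard t-table: two-tailed 95% critical values (alpha/2 = 0.025),
# keyed by degrees of freedom. Only these rows are included in this sketch.
T_TABLE_95 = {4: 2.776, 9: 2.262, 14: 2.145, 19: 2.093, 29: 2.045}

def t_confidence_interval_95(sample):
    """95% confidence interval for the mean with unknown population variance."""
    n = len(sample)
    t_crit = T_TABLE_95[n - 1]               # degrees of freedom = n - 1
    x_bar = fmean(sample)
    std_error = stdev(sample) / sqrt(n)      # s / sqrt(n)
    return x_bar - t_crit * std_error, x_bar + t_crit * std_error

# Hypothetical monthly returns in percent (n = 10, so 9 degrees of freedom).
returns = [1.2, -0.4, 0.8, 2.1, 0.3, -1.0, 1.5, 0.6, 0.9, 0.0]
lower, upper = t_confidence_interval_95(returns)
print(f"95% CI for the mean: [{lower:.3f}, {upper:.3f}]")
```

Because 2.262 exceeds the z value of 1.96, the interval here is wider than the z-based one would be for the same data.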

16
Q

which do we use between z and t distributions for:

large sample size

Unknown population variance

A

t is better

z is acceptable

17
Q

which do we use between z and t distributions for:

large sample size

known population variance

A

z

18
Q

which do we use between z and t distributions for:

small sample size

not a normal distribution

A

not available

19
Q

which do we use between z and t distributions for:

small sample size

normal distribution

known population variance

A

z

20
Q

which do we use between z and t distributions for:

small sample size

normal distribution

unknown population variance

A

t

21
Q

A point estimate is most accurately described as:

A
an expected value.

B
an expected value and a standard error.

C
an expected value and a confidence interval.

A

A
an expected value.

22
Q

A sampling model that produces an expected value of 5.0% for the equity risk premium is most likely considered to be an unbiased estimator if:

A
the population mean equity risk premium is 5.0%.

B
the standard error of the sample mean decreases as the sample size increases.

C
the standard error of the sample mean couldn’t get any smaller without increasing the sample size.

A

A
the population mean equity risk premium is 5.0%.

23
Q

An analyst reports that the equity risk premium is estimated to be 3.0%, with a 95% probability of being between 2% and 4%. The reliability factor is most likely:

A
1%.

B
95%.

C
1.96.

A

C
1.96.

24
Q

The reliability factor (RF)

A

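the critical value (from the z-table or t-table) that leaves α/2 probability in each tail for the chosen confidence level; for example, 1.96 for a 95% confidence interval using z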
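
As a sketch, the z-based reliability factor can be computed from the standard normal inverse CDF instead of being read off a table. The function name is hypothetical.

```python
from statistics import NormalDist

def z_reliability_factor(confidence):
    """z value that leaves alpha/2 in each tail for the given confidence level."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence -> z = {z_reliability_factor(level):.3f}")
```

For the equity risk premium card above (3.0% with a 95% interval of 2% to 4%), a z factor of 1.96 and a half-width of 1% would imply a standard error of roughly 1% / 1.96 ≈ 0.51%.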

25
Q

Resampling

A

a process that allows analysts to repeatedly draw samples from the original data set

26
Q

when is resampling important?

A

important when the sample size is too small to accurately estimate the population parameter

27
Q

two techniques for resampling

A

bootstrap resampling

jackknife resampling.

28
Q

bootstrap resampling

A

usually requires computer simulation

Each observation drawn is returned to the data set before the next draw (sampling with replacement), so the pool being drawn from stays the same size for every draw.

The size of each resample is also the same as the size of the original sample.

Bootstrap can be used to determine the standard error and confidence intervals for statistics such as the median.

In addition, it produces accurate estimates without relying on any analytical formula

ex:

an analyst may want to estimate the population mean using the mean of a single sample

The analyst may construct the distribution of the sample mean by creating multiple resamples from this single sample set

These resamples will then form a distribution that can approximate the true sampling distribution
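
A minimal bootstrap sketch for the median, a statistic with no simple standard error formula. The data, seed, and number of resamples are hypothetical, and the percentile interval used here is one common way to form a bootstrap confidence interval, not something the source specifies.

```python
import random
import statistics

def bootstrap_statistics(sample, statistic, n_resamples=2000, seed=0):
    """Apply `statistic` to resamples drawn with replacement from `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    return [
        statistic(rng.choices(sample, k=n))  # each resample has the original size
        for _ in range(n_resamples)
    ]

# Hypothetical returns in percent.
sample = [2.1, -0.5, 1.3, 0.8, 3.4, -1.2, 0.4, 1.9, 0.7, 2.6]
medians = sorted(bootstrap_statistics(sample, statistics.median))

se_median = statistics.stdev(medians)            # bootstrap standard error
lower = medians[int(0.025 * len(medians))]       # 2.5th percentile
upper = medians[int(0.975 * len(medians)) - 1]   # 97.5th percentile
print(f"SE(median) ~ {se_median:.3f}, 95% percentile interval: [{lower}, {upper}]")
```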

29
Q

the standard error of the sample mean formula when using bootstrap resampling

A

$s_{\bar{X}} = \sqrt{\frac{1}{B-1} \sum_{b=1}^{B} \left(\hat{\theta}_b - \bar{\theta}\right)^2}$

B: number of resamples drawn from the original sample

$\hat{\theta}_b$: mean of resample b

$\bar{\theta}$: mean of all resample means

Note that drawing more resamples makes this estimate more stable, but it does not by itself shrink the standard error of the sample mean, which is driven by the size of the original sample.
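
A sketch of the formula above, evaluated for several values of B on hypothetical data. It also prints the analytic s/√n for comparison; the data, seed, and names are assumptions for illustration.

```python
import random
from math import sqrt
from statistics import fmean, stdev

def bootstrap_se_of_mean(sample, B, seed=1):
    """Standard error of the sample mean from B bootstrap resamples."""
    rng = random.Random(seed)
    n = len(sample)
    theta_hat = [fmean(rng.choices(sample, k=n)) for _ in range(B)]  # resample means
    theta_bar = fmean(theta_hat)                                     # mean of resample means
    return sqrt(sum((t - theta_bar) ** 2 for t in theta_hat) / (B - 1))

# Hypothetical returns in percent.
sample = [2.1, -0.5, 1.3, 0.8, 3.4, -1.2, 0.4, 1.9, 0.7, 2.6]
analytic_se = stdev(sample) / sqrt(len(sample))

se_by_B = {B: bootstrap_se_of_mean(sample, B) for B in (100, 1000, 10000)}
for B, se in se_by_B.items():
    print(f"B={B:>5}: bootstrap SE = {se:.4f}")
print(f"analytic s/sqrt(n) = {analytic_se:.4f}")
```

The estimates should settle near the analytic value as B grows rather than trending toward zero.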

30
Q

jackknife resampling

A

draws samples by leaving out one observation at a time (without replacement)

commonly used to reduce the bias of an estimator
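
A minimal sketch of jackknife bias reduction. It assumes the estimator of interest is the plug-in variance (dividing by n), which is biased; for this particular estimator the jackknife correction is known to recover the usual sample variance (dividing by n − 1). The data and function names are hypothetical.

```python
from statistics import fmean, variance

def plug_in_variance(xs):
    """Biased variance estimator: divides by n instead of n - 1."""
    m = fmean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def jackknife_bias_corrected(sample, estimator):
    """Return (bias-corrected estimate, leave-one-out estimates)."""
    n = len(sample)
    full = estimator(sample)
    # One repetition per observation, each leaving that observation out.
    leave_one_out = [estimator(sample[:i] + sample[i + 1:]) for i in range(n)]
    bias = (n - 1) * (fmean(leave_one_out) - full)
    return full - bias, leave_one_out

sample = [4.0, 7.0, 1.0, 9.0]
corrected, replicates = jackknife_bias_corrected(sample, plug_in_variance)

print(f"plug-in variance:    {plug_in_variance(sample):.4f}")
print(f"jackknife-corrected: {corrected:.4f}")
print(f"sample variance:     {variance(sample):.4f}")
```

A sample of 4 produces exactly 4 repetitions, each using 3 observations, and no randomness is involved.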

31
Q

main differences between bootstrap resampling and jackknife resampling:

A

Results for each run:

  • Bootstrap: Different because of random sampling
  • Jackknife: Similar due to its computation procedure

Number of repetitions:

  • Bootstrap: Flexible depending on circumstances
  • Jackknife: Same as the original sample size (e.g., 4 runs for a sample of 4)
32
Q

Data snooping (or data mining)

A

refers to overusing the same data

A model is built by searching diligently for any statistically significant patterns.

Researchers tend to focus on the small number of significant patterns they find and rarely publish their many statistically insignificant results.

As noted by economist Ronald Coase, “If you torture the data long enough, it will confess.”

33
Q

To identify data snooping bias, analysts may split the data into which three separate sets?

A

Training dataset

Validation dataset

Test dataset

34
Q

Training dataset

A

Used to model and fit parameters

35
Q

Validation dataset

A

Used to evaluate model fit and tune parameters

36
Q

Test dataset

A

Used to evaluate the final model fit
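
A small sketch of the three-way split. The 60/20/20 proportions and the chronological ordering (oldest data for training, newest for testing) are assumptions for illustration; the source does not specify either.

```python
def chronological_split(observations, train_pct=60, validation_pct=20):
    """Split ordered data into training, validation, and test sets."""
    n = len(observations)
    train_end = n * train_pct // 100
    validation_end = train_end + n * validation_pct // 100
    return (
        observations[:train_end],                # fit model parameters
        observations[train_end:validation_end],  # evaluate fit and tune
        observations[validation_end:],           # final out-of-sample check
    )

# Hypothetical: 120 monthly observations, represented here by their index.
monthly_returns = list(range(120))
train, validation, test = chronological_split(monthly_returns)
print(len(train), len(validation), len(test))
```

Keeping the test set untouched until the end is what makes it a genuine out-of-sample check.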

37
Q

with data snooping (data mining), where should a genuine relationship be found?

A

it should be found in the out-of-sample test

38
Q

with data snooping (data mining), when is a model successful?

A

a model is only successful if it works in the future

39
Q

intergenerational data mining

what is it, and why does it occur?

what bias or drawback does it create?

A

using results from previous studies

many researchers use the same data sets

This often leads analysts to study the same anomalies and thus exaggerate their importance

40
Q

Sample selection bias

A

occurs if certain assets or time periods are excluded from the data

ex: survivorship bias

sometimes occurs when stock price and accounting data are used

–> For example, many studies have shown the stocks of companies with low price-to-book ratios tend to outperform in future periods. This could be because companies that fail are excluded from the studies

Delisting a company’s stock from an exchange can also cause bias because it is difficult to track subsequent performance.

–> Usually, delisting occurs because of poor performance

41
Q

survivorship bias

A

occurs if only funds still in existence are included in the study

This can even occur when studying international indices if economies that do not survive are excluded
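
A toy simulation of the effect, under loudly hypothetical assumptions: 1,000 funds with normally distributed annual returns (mean 5%, standard deviation 15%), and any fund losing more than 10% closes and drops out of the database.

```python
import random
import statistics

rng = random.Random(3)

# Hypothetical universe of fund returns.
fund_returns = [rng.gauss(0.05, 0.15) for _ in range(1000)]

# Assumed closure rule: funds that lose more than 10% disappear from the data.
survivors = [r for r in fund_returns if r > -0.10]

all_mean = statistics.fmean(fund_returns)
survivor_mean = statistics.fmean(survivors)
print(f"all funds:      n={len(fund_returns)}, mean return={all_mean:.2%}")
print(f"survivors only: n={len(survivors)}, mean return={survivor_mean:.2%}")
```

Averaging only the survivors overstates the return an investor in the full universe would have earned.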

42
Q

why does hedge fund performance data have a significant self-selection bias?

A

because the hedge fund managers voluntarily share information

–> Only managers with positive results are inclined to include results in databases

43
Q

when are investors also influenced by implicit selection bias?

A

when there is a threshold that enables self-selection

–> For example, the NYSE has higher stock listing requirements than other, smaller exchanges

–> The NYSE-listed stock investors may implicitly believe their stocks are of higher quality than those in other exchanges, although the higher listing requirements do not translate into higher expected returns

44
Q

Backfill bias

A

another variation of selection bias

When a new fund is added to an index, its past performance may be backfilled into the index’s database

–> This can inflate the index return because new funds are normally added only after they have good performance

45
Q

Look-ahead bias

A

occurs if information is used that would not have been available on the test date

For example:

accounting information such as book value will not be available for some time after the end of the period

–> It can arise implicitly if future data is inappropriately used without realizing it.

46
Q

to mitigate look-ahead bias, what can analysts use?

A

point-in-time (PIT) data

when they are available

47
Q

point-in-time (PIT) data

A

contain information that is available at the time of recording/publication
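
A minimal sketch of a point-in-time lookup. The records, dates, and book values are hypothetical; the idea is that each figure is keyed by when it was published, not by the fiscal period it describes.

```python
from datetime import date

# Hypothetical records: (fiscal period end, publication date, book value per share).
records = [
    (date(2022, 12, 31), date(2023, 3, 15), 40.0),
    (date(2023, 12, 31), date(2024, 3, 12), 44.0),
]

def book_value_as_of(as_of, records):
    """Latest book value that had already been published on `as_of`, else None."""
    available = [r for r in records if r[1] <= as_of]
    if not available:
        return None
    return max(available, key=lambda r: r[1])[2]

# On 31 Jan 2024 the FY2023 figure is not yet published, so FY2022 is returned.
print(book_value_as_of(date(2024, 1, 31), records))
```

A backtest that used the fiscal year-end date instead of the publication date would pick up the 2023 figure too early, which is the look-ahead bias described above.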

48
Q

a time-period bias

A

occurs when results are specific to the time period studied

Longer time periods are generally preferred but may include data from different structural periods

49
Q

An analyst collects a sample of 12 monthly return datapoints that have been drawn from a larger population. Wanting to reduce the bias of the expected value based on this small sample size, the analyst decides to resample the data using the jackknife method. Which of the following statements regarding this resampling process is most accurate?

A
Each repetition will include 11 observations

B
The process will be completed in 11 repetitions

C
The sample for each of the 12 repetitions will be drawn with replacement

A

A
Each repetition will include 11 observations

50
Q

An analyst is conducting a market liquidity study. After studying a stratified sample of dividend-paying stocks, the analyst concludes that the economy is sufficiently liquid and that the stock market may be undervalued. The analyst’s conclusion is most likely affected by:

A
time-period bias.

B
data-mining bias.

C
sample selection bias.

A

C
sample selection bias.

51
Q

An analyst randomly samples 100 small-cap stocks and 100 large-cap stocks that have been part of a broad equity index for at least 10 years and concludes that small-cap stocks have outperformed large-cap stocks on a risk-adjusted basis over the past decade. The analyst then considers whether this asset class can generate positive excess returns over the next five years. The analyst’s conclusion is most likely affected by:

A
look-ahead bias.

B
time-period bias.

C
survivorship bias.

A

C
survivorship bias.