Part 5. Sampling & Estimation Flashcards

1
Q

Simple Random Sampling

A

A method of selecting a sample in such a way that each item or person in the population being studied has the same likelihood of being included in the sample.

e.g. picking random numbers out of a bag.

2
Q

Systematic sampling

A

A method of forming an approximately random sample by selecting every nth member of a population.
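
As a minimal sketch (the population and sampling interval here are hypothetical), systematic sampling can be written as:

```python
# Minimal sketch of systematic sampling: take every k-th member,
# starting from a chosen index. Population and k are hypothetical.
def systematic_sample(population, k, start=0):
    return population[start::k]

members = list(range(1, 101))            # hypothetical population of 100
sample = systematic_sample(members, 10)  # every 10th member
print(sample)  # [1, 11, 21, 31, 41, 51, 61, 71, 81, 91]
```

For an approximately random sample, the starting index would itself be chosen at random.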

3
Q

Sampling error

A

The difference between a sample statistic (the mean, variance, or standard deviation of the sample) and its corresponding population parameter (the true mean, variance or standard deviation of the population).

sampling error of the mean = sample mean (x̄) − population mean (μ)
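
A minimal numeric sketch of the formula, using hypothetical return figures:

```python
from statistics import mean

# Hypothetical values: an assumed true population mean return and a small sample.
population_mean = 0.08
sample = [0.05, 0.10, 0.12, 0.07, 0.11]

# sampling error of the mean = sample mean - population mean
sampling_error = mean(sample) - population_mean
print(round(sampling_error, 4))  # 0.01
```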

4
Q

Sampling distribution

A

(Of a sample statistic)

A probability distribution of all possible sample statistics computed from a set of equal-size samples that were randomly drawn from the same population.

5
Q

Sampling distribution of the mean

A

Suppose a random sample of 100 bonds is selected from the population of a major municipal bond index consisting of 1,000 bonds, and the mean return of the 100-bond sample is then calculated.

Repeating this process many times will result in many different estimates of the population mean return.

6
Q

Stratified random sampling

A

Uses a classification system to separate the population into smaller groups based on one or more distinguishing characteristics.

From each subgroup (stratum), a random sample is taken and the results are pooled. The size of the sample drawn from each stratum is based on the stratum's size relative to the population.
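
A sketch of proportional allocation across strata; the strata, their sizes, and the total sample size below are all hypothetical:

```python
import random

random.seed(1)

# Hypothetical population of 1,000 members split into two strata.
strata = {
    "short_duration": list(range(0, 600)),    # 60% of the population
    "long_duration": list(range(600, 1000)),  # 40% of the population
}
population_size = sum(len(members) for members in strata.values())
total_sample_size = 50

# Draw from each stratum in proportion to its share of the population.
sample = []
for name, members in strata.items():
    n_stratum = round(total_sample_size * len(members) / population_size)
    sample.extend(random.sample(members, n_stratum))

print(len(sample))  # 50 (30 short-duration, 20 long-duration)
```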

7
Q

Stratified Sampling Example

A

Used often in bond indexing, due to the difficulty and cost of completely replicating the entire population of bonds.

The bonds in a population are categorised (stratified) according to major bond risk factors such as duration, maturity, coupon rate, and the like.

The samples are drawn from each separate category and combined to form a final sample.

8
Q

Time series data

A

This consists of observations taken over a period of time at specific and equally spaced time intervals.

e.g. the set of monthly returns on Microsoft stock from January 1994 to January 2004.

9
Q

Cross-sectional data

A

A sample of observations taken at a single point in time.

e.g. the sample of reported earnings per share of all Nasdaq companies as of Dec 31, 2004.

10
Q

Longitudinal data

A

Observations over time of multiple characteristics of the same entity, such as unemployment, inflation and GDP growth rates for a country over 10 years.

11
Q

Panel data

A

This contains observations over time of the same characteristic for multiple entities, such as debt/equity ratios for 20 companies over the most recent 24 quarters.

12
Q

Central Limit Theorem

A

For simple random samples of size n from a population with mean μ and finite variance σ²:

The sampling distribution of the sample mean (x̄) approaches a normal probability distribution with mean μ and variance σ²/n as the sample size becomes large.

This is useful because the normal distribution is relatively easy to apply to hypothesis testing and the construction of confidence intervals.

Inferences about the population mean can be made from the sample mean, regardless of the population's distribution, as long as the sample size is "sufficiently large", which usually means n ≥ 30.
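
The theorem can be illustrated with a small simulation, here using a deliberately non-normal (uniform) population; the population size, sample size, and number of repetitions are all illustrative:

```python
import random
from statistics import mean, pstdev

random.seed(42)

# A deliberately non-normal (uniform) population.
population = [random.uniform(0, 100) for _ in range(10_000)]
mu, sigma = mean(population), pstdev(population)

# Repeatedly draw samples of size n and record each sample mean.
n = 50
sample_means = [mean(random.sample(population, n)) for _ in range(2_000)]

# The distribution of sample means centres on mu, with dispersion
# close to sigma / sqrt(n), as the theorem predicts.
print(abs(mean(sample_means) - mu) < 1)                  # True
print(abs(pstdev(sample_means) - sigma / n ** 0.5) < 1)  # True
```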

13
Q

Important properties of central limit theorem:

A
  1. If the sample size n is sufficiently large (n ≥ 30), the sampling distribution of the sample means will be approximately normal.
    - That is, random samples of size n are repeatedly taken from the overall population; each sample has its own mean, which is itself a random variable, and this set of sample means has a distribution that is approximately normal.
  2. The mean of the population (μ) and the mean of the distribution of all possible sample means are equal.
  3. The variance of the distribution of sample means is σ²/n, the population variance divided by the sample size.
14
Q

Standard deviation of the means of multiple samples:

A

This is less than the standard deviation of single observations.

If the standard deviation of monthly stock returns is 2%, the standard error (deviation) of the average monthly return over the next six months is 2% / √6 ≈ 0.82%.

The average of several observations of a random variable will be less widely dispersed (lower standard deviation) around the expected value than a single observation of the random variable.
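
The figure in the example above can be checked directly:

```python
import math

# The example above: monthly return standard deviation of 2%,
# averaged over six months.
monthly_stdev = 0.02
n_months = 6

standard_error = monthly_stdev / math.sqrt(n_months)
print(f"{standard_error:.2%}")  # 0.82%
```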

15
Q

Desirable properties of an estimator:

A
  1. Unbiasedness
  2. Efficiency
  3. Consistency
16
Q

Unbiasedness

A

An estimator for which the expected value is equal to the parameter you are trying to estimate.

E(x̄) = μ, i.e. the expected value of the sample mean equals the population mean.

17
Q

Efficiency

A

An unbiased estimator is efficient if the variance of its sampling distribution is smaller than that of all other unbiased estimators of the parameter you are trying to estimate.

e.g. the sample mean is an unbiased and efficient estimator of the population mean.

18
Q

Consistency

A

An estimator for which the accuracy of the parameter estimate increases as the sample size increases, meaning the standard error of the sample mean falls and the sampling distribution bunches more closely around the population mean.

As the sample size approaches infinity, the standard error approaches zero.

19
Q

Point estimates

A

These are single (sample) values used to estimate population parameters.

Estimator = the formula used to compute the point estimate.

20
Q

Confidence intervals

A

A range of values in which the population parameter is expected to lie.

21
Q

Student’s t-distribution

A

A bell-shaped probability distribution that is symmetrical about its mean.

Appropriate for:

  1. Constructing confidence intervals based on small samples (n < 30) from populations with unknown variance and a normal, or approximately normal, distribution.
  2. When the population variance is unknown and the sample size is large enough that the central limit theorem will assure the sampling distribution is approximately normal.
22
Q

Properties of Student's t-distribution:

A
  1. It is symmetrical.
  2. It is defined by a single parameter, the degrees of freedom (df), equal to the number of sample observations minus 1 (n − 1) for sample means.
  3. It has more probability in the tails (fatter tails) than the normal distribution.
  4. As the degrees of freedom (i.e. the sample size) increase, the shape of the t-distribution more closely approaches a standard normal distribution.
23
Q

Student t-distribution movement

A

As the number of observations increases (df increases), the t-distribution becomes more spiked and its tails become thinner.

As df increases without bound, the t-distribution converges to the standard normal distribution (z-distribution).

The thickness of the tails relative to those of the z-distribution matters in hypothesis testing, because thicker tails mean more observations away from the center of the distribution (more outliers).

Hypothesis testing using the t-distribution therefore makes it more difficult to reject the null hypothesis than hypothesis testing using the z-distribution.
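
A sketch of this convergence, using standard published two-tailed 5% critical values (hard-coded here, since the Python standard library has no t-distribution):

```python
from statistics import NormalDist

# Two-tailed 5% (alpha/2 = 0.025) critical values from a standard t-table.
t_crit = {5: 2.571, 10: 2.228, 30: 2.042, 120: 1.980}
z_crit = NormalDist().inv_cdf(0.975)  # standard normal, about 1.960

# The gap between the t and z critical values narrows as df grows.
for df in sorted(t_crit):
    print(df, t_crit[df], round(t_crit[df] - z_crit, 3))
```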

24
Q

Confidence interval

A

These estimates result in a range of values within which the actual value of a parameter will lie, given the probability 1 − α.

These are constructed by adding or subtracting an appropriate value from the point estimate.

25
Q

Alpha (α) vs. 1 − α

A

Alpha (α) = the level of significance for the confidence interval.

1 − α = the degree of confidence.

e.g. we might estimate that the population mean of a random variable will range from 15 to 25 with a 95% degree of confidence, or at a 5% level of significance.

26
Q

Formula for confidence intervals:

A

point estimate ± (reliability factor × standard error)

where:

point estimate = the value of a sample statistic of the population parameter.

reliability factor = a number that depends on the sampling distribution of the point estimate and the probability that the point estimate falls in the confidence interval (1 − α).

standard error = the standard error of the point estimate.
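
A sketch of this formula with hypothetical sample statistics, using the z reliability factor since the sample is large (n ≥ 30):

```python
import math
from statistics import NormalDist

# Hypothetical sample statistics.
sample_mean = 80.0
sample_stdev = 15.0
n = 36

standard_error = sample_stdev / math.sqrt(n)  # 15 / 6 = 2.5
reliability = NormalDist().inv_cdf(0.975)     # 95% two-tailed z, about 1.96

# point estimate +- (reliability factor x standard error)
lower = sample_mean - reliability * standard_error
upper = sample_mean + reliability * standard_error
print(round(lower, 2), round(upper, 2))  # 75.1 84.9
```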

27
Q

Perspectives of confidence intervals:

A
  1. Probabilistic interpretation
  2. Practical interpretation

28
Q

Probabilistic interpretation

A

After repeatedly taking samples of CFA candidates, administering the practice exam, and constructing confidence intervals for each sample's mean, 99% of the resulting confidence intervals will, in the long run, include the population mean.

29
Q

Practical interpretation

A

We are 99% confident that the population mean score is between 73.55 and 86.45 for candidates from this population.

30
Q

How to look up reliability factors in t-table:

A
  1. Compute df: n − 1.
  2. Find the appropriate level of alpha or significance, depending on whether the test concerns one tail (α) or two tails (α/2).

Confidence intervals are designed to be two-tailed, as they compute an upper and a lower limit.

  3. e.g. to find t29,2.5%, find the 29 df row and match it with the 0.025 column, resulting in t = 2.045.
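
The lookup can be checked against a small hard-coded slice of a published t-table (the 0.025 column around the example's row):

```python
# alpha/2 = 0.025 column of a standard t-table, near the example's row.
t_table_0p025 = {28: 2.048, 29: 2.045, 30: 2.042}

n = 30        # sample size
df = n - 1    # 29 degrees of freedom
print(t_table_0p025[df])  # 2.045
```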
31
Q

What to do if distribution is non normal?

A
  1. If the distribution is nonnormal but the population variance is known, the z-statistic can be used as long as the sample size is large (n ≥ 30). This is possible because the central limit theorem assures that the distribution of the sample mean is approximately normal when the sample is large.
  2. If the distribution is nonnormal and the population variance is unknown, the t-statistic can be used as long as the sample size is large (n ≥ 30); it is also acceptable to use the z-statistic, although the t-statistic is more conservative.

Overall:

  • When sampling from a nonnormal distribution, we cannot create a confidence interval if the sample size is less than 30. So, all else equal, make sure you have a sample of at least 30; the larger, the better.
32
Q

Limitations of ‘larger is better’, when selecting an appropriate sample size:

A
  1. Larger samples may contain observations from a different population distribution. If we include observations that come from a different population (with a different population parameter), we may not improve, and may even reduce, the precision of our population parameter estimates.
  2. The cost of using a larger sample must be weighed against the value of the increase in precision from the increase in sample size.
33
Q

Data mining

A

Occurs when analysts repeatedly use the same database to search for patterns or trading rules until one that works is discovered.

e.g. evidence that value stocks appear to outperform growth stocks has been argued to be a product of data mining, as the data set of historical stock returns is limited.

34
Q

Data mining bias

A

Results whose statistical significance is overestimated because the pattern was found through data mining.

35
Q

Warning signs of data mining:

A
  • evidence that many different variables were tested, most of which are unreported, until significant ones were found.
  • the lack of any economic theory that is consistent with the empirical results.

Solution:

  • to avoid data mining bias, test a potentially profitable trading rule on a data set different from the one used to develop the rule (i.e. use out-of-sample data).
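
A minimal sketch of an in-sample / out-of-sample split; the series length and the 70/30 split are hypothetical:

```python
# Stand-in for 120 months of return data (hypothetical).
returns = list(range(120))

split = len(returns) * 7 // 10   # 70% in-sample
in_sample = returns[:split]      # develop the trading rule here only
out_of_sample = returns[split:]  # evaluate the rule here only

print(len(in_sample), len(out_of_sample))  # 84 36
```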
36
Q

Sample selection bias

A
  • Occurs when some data is systematically excluded from the analysis, usually due to lack of availability.
  • This practice renders the observed sample nonrandom, and conclusions drawn from the sample cannot be applied to the population, because the observed sample and the portion of the population that was not observed are different.
37
Q

Survivorship bias

A
  • The most common form of sample selection bias.
  • A good example in investments is the study of mutual fund performance: mutual fund databases such as Morningstar's only include funds currently in existence, not funds that have ceased to exist through closure or merger.
  • Funds that drop out of the sample have lower returns than surviving funds, so the surviving sample is biased toward better funds (i.e. it is not random).
  • Such samples yield results that overestimate the average mutual fund return, because the database only includes the better-performing funds.
  • The solution to this bias is to use a sample of funds that all started at the same time and to keep funds in the sample even after they cease to exist.
38
Q

Look-ahead bias

A
  • Occurs when a study tests a relationship using sample data that was not available on the test date.
    e.g. consider the test of a trading rule based on the price-to-book ratio at the end of the fiscal year: stock prices are available for all companies at the same point in time, while end-of-year book values may not be available until 30 to 60 days after the fiscal year ends.
  • To account for this bias, a study that uses price-to-book ratios to test trading strategies might estimate book value as reported at fiscal year end and market value two months later.
39
Q

Time-period bias

A
  • Results if the time period over which the data is gathered is either too short or too long.
  • If too short, research results may reflect phenomena specific to that time period, or even data mining.
  • If too long, the fundamental economic relationships that underlie the results may have changed.
    e.g. a finding that small stocks outperformed large stocks during 1980-85 may suffer from time-period bias related to too short a period; it is unclear whether this was just an isolated occurrence.
    Alternatively, a study that quantifies the relationship between inflation and unemployment during 1940-2000 suffers from time-period bias because the period is too long and covers a fundamental change in both variables that occurred in the 1980s. The data should be divided into two subsamples that span the periods before and after the change.