Think Stats - Allen Downey Flashcards

1
Q

Why would a survey ‘oversample’ certain groups?

A

So that the sample from a group that is under-represented in the population is large enough to draw statistical inferences.

2
Q

What process is used so that the sample from a group that is under-represented in the population is large enough to draw statistical inferences?

A

Oversampling

3
Q

In a survey, what documents the design of the study, the survey questions, and the encoding of the responses?

A

The codebook

4
Q

What is the codebook for?

A

It documents the design of the study, the survey questions, and the encoding of the responses.

5
Q

Where might a codebook for public source data be held?

A

It might be hosted in a GitHub repository.

6
Q

What is a DataFrame?

A

The fundamental data structure provided by pandas, containing a row for each record, and a column for each variable

7
Q

What is a way to access a column from a dataframe?

A

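By indexing the DataFrame with a single column name, which returns a Series (like a Python list, but with an index):

`MySeries = df2["columnName"]`

Indexing with a list of column names returns a DataFrame, not a Series:

`subset = df2[["columnName", "columnName2", "columnName3"]]`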
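
A minimal sketch of the difference, using a made-up DataFrame (the data and the variable names `one_column` and `two_columns` are hypothetical, for illustration only):

```python
import pandas as pd

# Hypothetical data, for illustration only.
df2 = pd.DataFrame({
    "columnName": [1, 2, 3],
    "columnName2": [4.0, 5.0, 6.0],
})

one_column = df2["columnName"]                     # single name -> Series
two_columns = df2[["columnName", "columnName2"]]   # list of names -> DataFrame

print(type(one_column).__name__)   # Series
print(type(two_columns).__name__)  # DataFrame
```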

8
Q

What is a recode?

A

A value generated by calculation or other logic applied to raw data.

Example: ‘processDuration’ could be a recode calculated as processFinish - processStart.

9
Q

In Pandas, how do you add a new column to a DataFrame?

A

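Assign to the new column name using dictionary (bracket) syntax, with the values it is to be populated with on the right:

`df['totalwgt_lb'] = df.birthwgt_lb + df.birthwgt_oz / 16.0`

NOT dot notation, like this, which sets an attribute on the DataFrame object instead of creating a column:

`df.totalwgt_lb = df.birthwgt_lb + df.birthwgt_oz / 16.0`
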
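
A small runnable sketch of bracket assignment, assuming made-up pound and ounce values (only the column names come from the card):

```python
import pandas as pd

# Hypothetical birth weights in pounds and ounces.
df = pd.DataFrame({"birthwgt_lb": [7, 8], "birthwgt_oz": [8, 4]})

# Bracket (dictionary-style) assignment creates a real column.
df["totalwgt_lb"] = df.birthwgt_lb + df.birthwgt_oz / 16.0

print(df["totalwgt_lb"].tolist())  # [7.5, 8.25]
```
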
10
Q

Are histograms good for comparing two distributions against each other?

A

Not always. For example, if one distribution has fewer data points than the other, some of the apparent differences between the histograms will be due to sample size.

11
Q

In statistics, what is a parameter?

A

A value that tells us something about the whole population.

12
Q

What does ddof stand for?

A

Delta degrees of freedom
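
As a sketch of what the setting changes, here is NumPy's `np.var` on a made-up array: the divisor is `n - ddof`, so `ddof=0` gives the population variance and `ddof=1` the usual sample variance.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])  # made-up data

# ddof=0 (NumPy's default) divides by n.
population_var = np.var(values, ddof=0)
# ddof=1 divides by n - 1.
sample_var = np.var(values, ddof=1)

print(population_var)  # 1.25
print(sample_var)
```

Note that pandas `Series.var` defaults to `ddof=1`, whereas NumPy defaults to `ddof=0`.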

13
Q

In statistics what is estimation?

A

The process of inferring the parameters of a distribution from a sample.

14
Q

In statistics, what is an ‘estimator’?

A

A statistic used to estimate a parameter.

15
Q

What is anecdotal evidence?

A

Evidence based on data that is unpublished and usually personal.

16
Q

What are the 5 steps to approach a problem using statistics?

A

1) Data collection
2) Descriptive statistics
3) Exploratory analysis
4) Estimation
5) Hypothesis testing

17
Q

What is anecdotal evidence?

A

Evidence, often personal, that is collected casually rather than by a well-designed study.

18
Q

What is the population?

A

The group we are interested in studying. “Population” often refers to a group of people, but the term is used for other subjects, too.

19
Q

What is a cross-sectional study?

A

A study that collects data about a population at a particular point in time.

20
Q

In a study, what is a cycle?

A

In a repeated cross-sectional study, each repetition of the study is called a cycle.

21
Q

What is a longitudinal study?

A

A study that follows a population over time, collecting data from the same group repeatedly.

22
Q

In a statistical study what is a record?

A

In a dataset, a collection of information about a single person or other subject.

23
Q

In a statistical study what is a respondent?

A

A person who responds to a survey.

24
Q

In a statistical study what is a sample?

A

The subset of a population used to collect data.

25
Q

In a statistical study what does ‘representative’ mean?

A

A sample is representative if every member of the population has the same chance of being in the sample.

26
Q

In a statistical study what is oversampling?

A

The technique of increasing the representation of a subpopulation in order to avoid errors due to small sample sizes.

27
Q

In a statistical study what is raw data?

A

Values collected and recorded with little or no checking, calculation or interpretation.

28
Q

What is data cleaning?

A

Processes that include:

1) validating data
2) identifying errors
3) translating between data types and representations, etc.

29
Q

What is a distribution?

A

The values that appear in a sample and the frequency of each.

30
Q

What is a histogram?

A

A mapping from values to frequencies, or a graph that shows this mapping.

31
Q

In statistics, what is frequency?

A

The number of times a value appears in a sample.

32
Q

What is the mode?

A

The most frequent value in a sample, or one of the most frequent values.
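
A minimal sketch tying together histogram, frequency, and mode, assuming a made-up sample and using the standard library's `Counter` as the value-to-frequency mapping:

```python
from collections import Counter

sample = [1, 2, 2, 3, 5]  # made-up values

hist = Counter(sample)                     # histogram: value -> frequency
mode, mode_freq = hist.most_common(1)[0]   # most frequent value and its count

print(hist[2])  # 2
print(mode)     # 2
```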

33
Q

What is the normal distribution?

A

An idealization of a bell-shaped distribution; also known as a Gaussian distribution.

34
Q

What is a uniform distribution?

A

A distribution in which all values have the same frequency.

35
Q

In statistics what is a tail?

A

The part of a distribution at the high and low extremes.

36
Q

What is central tendency?

A

A characteristic of a sample or population; intuitively, it is an average or typical value.

37
Q

What is an outlier?

A

A value far from the central tendency.

38
Q

What is ‘spread’?

A

A measure of how spread out the values in a distribution are.

39
Q

What is a ‘summary statistic’?

A

A statistic that quantifies some aspect of a distribution, like central tendency or spread.

40
Q

In statistics what is ‘variance’?

A

A summary statistic often used to quantify spread.

41
Q

What is ‘Standard Deviation’?

A

The square root of variance, also used as a measure of spread.

42
Q

In statistics what is an ‘effect size’?

A

A summary statistic intended to quantify the size of an effect like a difference between groups.
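
One common effect size is Cohen's d: the difference in means divided by a pooled standard deviation. The sketch below is an illustration under stated assumptions, not the only convention: the function name `cohen_effect_size` is chosen here, the data are made up, and the pooled variance is weighted by group size using population variances.

```python
import math
import statistics

def cohen_effect_size(group1, group2):
    """Difference in means divided by the pooled standard deviation."""
    diff = statistics.mean(group1) - statistics.mean(group2)
    n1, n2 = len(group1), len(group2)
    var1 = statistics.pvariance(group1)
    var2 = statistics.pvariance(group2)
    # Pooled variance weighted by group size (one of several conventions).
    pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)
    return diff / math.sqrt(pooled_var)

d = cohen_effect_size([2, 4, 6], [1, 3, 5])  # made-up groups
print(round(d, 3))  # 0.612
```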

43
Q

In statistics what does ‘clinically significant’ mean?

A

A result, like a difference between groups, that is relevant in practice.

44
Q

What is a Probability mass function (PMF)?

A

A representation of a distribution as a function that maps from values to probabilities.

45
Q

In probability mass functions (PMFs), what is a probability?

A

A frequency expressed as a fraction of the sample size.

46
Q

In statistics what is ‘normalisation’?

A

The process of dividing a frequency by a sample size to get a probability.
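
A short sketch of normalisation on a made-up sample: each frequency is divided by the sample size, turning a histogram into a PMF.

```python
from collections import Counter

sample = [1, 2, 2, 3, 5]  # made-up values

hist = Counter(sample)  # value -> frequency
n = len(sample)
pmf = {value: freq / n for value, freq in hist.items()}  # value -> probability

print(pmf[2])  # 0.4
```

After normalisation the probabilities add up to 1, up to floating-point rounding.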

47
Q

In Pandas, what is an ‘index’?

A

In a pandas DataFrame, the index is a special column that contains the row labels.

48
Q

In statistics what is a ‘percentile rank’?

A

The percentage of values in a distribution that are less than or equal to a given value.

49
Q

In statistics what is a ‘percentile’?

A

The value associated with a given percentile rank.
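
The two definitions can be sketched as a pair of functions. This is a simple illustration on made-up scores; the function names are chosen here, and libraries use various interpolation rules that can give slightly different percentiles.

```python
def percentile_rank(scores, your_score):
    """Percentage of scores less than or equal to your_score."""
    count = sum(1 for score in scores if score <= your_score)
    return 100.0 * count / len(scores)

def percentile(scores, rank):
    """Smallest score whose percentile rank is at least `rank`.

    Returns None if no score qualifies (for example, rank > 100).
    """
    for score in sorted(scores):
        if percentile_rank(scores, score) >= rank:
            return score
    return None

scores = [55, 66, 77, 88, 99]  # made-up data
print(percentile_rank(scores, 88))  # 80.0
print(percentile(scores, 50))       # 77
```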

50
Q

In statistics, what is a ‘cumulative distribution function’ (CDF)?

A

CDF(x) is the fraction of the sample less than or equal to x.
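
A minimal sketch of evaluating an empirical CDF directly from this definition (the function name `eval_cdf` and the sample are illustrative assumptions):

```python
def eval_cdf(sample, x):
    """Fraction of values in the sample that are <= x."""
    count = sum(1 for value in sample if value <= x)
    return count / len(sample)

sample = [1, 2, 2, 3, 5]  # made-up values
print(eval_cdf(sample, 2))  # 0.6
```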

51
Q

In statistics, what is an inverse CDF (inverse cumulative distribution function)?

A

A function that maps from a cumulative probability, p, to the corresponding value.

52
Q

In percentiles, what is the ‘median’?

A

The 50th percentile, often used as a measure of central tendency.

53
Q

What is the ‘interquartile range’?

A

The difference between the 75th and 25th percentiles, used as a measure of spread.
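
As a sketch, the median and interquartile range can be computed with Python's `statistics` module. The data are made up, and the `"inclusive"` method is one of several percentile conventions, so other tools may give slightly different quartiles.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up values

# Quartiles: the 25th, 50th and 75th percentiles.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(statistics.median(data))  # 5
print(iqr)                      # 4.0
```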

54
Q

In statistics, what is a ‘quantile’?

A

One of a sequence of values that correspond to equally spaced percentile ranks. For example, the quartiles of a distribution are the 25th, 50th and 75th percentiles.

55
Q

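In statistical sampling, what is ‘replacement’?

A

A property of a sampling process. “With replacement” means that the same value can be chosen more than once; “without replacement” means that once a value is chosen, it is removed from the population.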
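
A short sketch of both kinds of sampling with the standard library, assuming a made-up population and an arbitrary fixed seed so the run is repeatable:

```python
import random

population = [1, 2, 3, 4, 5]  # made-up population
rng = random.Random(17)       # arbitrary seed, for repeatability only

# With replacement: the same value can be chosen more than once.
with_replacement = rng.choices(population, k=5)

# Without replacement: each member is chosen at most once.
without_replacement = rng.sample(population, k=5)

print(with_replacement)
print(without_replacement)
```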

56
Q

What is an ‘empirical distribution’?

A

The distribution of values in a sample.

57
Q

What is an ‘analytical distribution’?

A

A distribution whose CDF (cumulative distribution function) is an analytic function.

58
Q

When considering statistical distributions, what is a ‘model’?

A

A useful simplification. Analytic distributions are often good models of more complex empirical distributions.

59
Q

What is the ‘interarrival time’?

A

The elapsed time between two events.

60
Q

What is the ‘complementary CDF’ (complementary cumulative distribution function)?

A

A function that maps from a value, x, to the fraction of values that exceed x, which is 1-CDF(x).
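
A minimal sketch for an empirical sample, counting the values that exceed x directly (the function name `eval_ccdf` and the sample are illustrative assumptions):

```python
def eval_ccdf(sample, x):
    """Fraction of values in the sample that exceed x, i.e. 1 - CDF(x)."""
    count = sum(1 for value in sample if value > x)
    return count / len(sample)

sample = [1, 2, 2, 3, 5]  # made-up values
print(eval_ccdf(sample, 2))  # 0.4
```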

61
Q

What is the ‘Standard Normal Distribution’?

A

The normal distribution with mean 0 and standard deviation 1.

62
Q

What is a normal probability plot?

A

A plot of the values in a sample versus random values from a standard normal distribution.
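
A sketch of how the points for such a plot could be built: sort the sample, sort an equally sized random draw from the standard normal distribution, and pair them up. The function name, seed, and data are assumptions for illustration, and the plotting itself is left out.

```python
import random

def normal_probability_points(sample, seed=0):
    """Pair sorted standard-normal draws (x) with the sorted sample (y)."""
    rng = random.Random(seed)  # arbitrary seed, for repeatability only
    normal_values = sorted(rng.gauss(0, 1) for _ in sample)
    return list(zip(normal_values, sorted(sample)))

sample = [3.1, 2.4, 4.8, 3.9, 2.9, 3.5]  # made-up measurements
points = normal_probability_points(sample)
```

If the sample is roughly normal, the points should fall close to a straight line.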

63
Q

In statistical hypothesis testing, what is the p-value?

A

The p-value (or probability value) is the probability of obtaining test results at least as extreme as the results observed, on the assumption that the null hypothesis is correct.
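
One way to estimate a p-value is a permutation test, sketched below under stated assumptions: the test statistic is the absolute difference in group means, the null hypothesis is that group labels do not matter, and the function name, seed, and data are made up for illustration.

```python
import random
import statistics

def permutation_p_value(group1, group2, iters=1000, seed=0):
    """Fraction of label shuffles with a difference at least as extreme."""
    rng = random.Random(seed)  # arbitrary seed, for repeatability only
    observed = abs(statistics.mean(group1) - statistics.mean(group2))
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    count = 0
    for _ in range(iters):
        # Under the null hypothesis, any split of the pooled data is equally likely.
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n1]) - statistics.mean(pooled[n1:]))
        if diff >= observed:
            count += 1
    return count / iters

p = permutation_p_value([1, 2, 3, 4, 5], [101, 102, 103, 104, 105])
print(p)
```

A small value means a difference this large would rarely arise by chance if the null hypothesis were true.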