Think Stats - Allen Downey Flashcards

1
Q

Why would a survey deliberately oversample certain groups?

A

So that the number of respondents from underrepresented groups in the population is large enough to draw valid statistical inferences.

2
Q

What process is used to make sure the number of respondents from underrepresented groups in the population is large enough to draw statistical inferences?

A

Oversampling

3
Q

In a survey, what documents the design of the study, the survey questions, and the encoding of the responses?

A

The codebook

4
Q

What is the codebook for?

A

It documents the design of the study, the survey questions, and the encoding of the responses.

5
Q

Where might a codebook for public source data be held?

A

It might be in a GitHub repository, alongside the data.

6
Q

What is a DataFrame?

A

The fundamental data structure provided by pandas, containing a row for each record, and a column for each variable

7
Q

What is a way to access a column from a DataFrame?

A

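By selecting it by name with single brackets, which returns a Series (like a Python list, but with an index):

my_series = df2["columnName"]

Passing a list of names in double brackets returns a DataFrame, not a Series:

my_subset = df2[["columnName", "columnName2", "columnName3"]]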
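
A minimal sketch of the difference between single and double brackets, assuming pandas is installed. The DataFrame `df` and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "height_cm": [150.0, 160.0, 170.0],
    "weight_kg": [50.0, 60.0, 70.0],
})

# Single brackets with one column name return a Series.
heights = df["height_cm"]

# Double brackets (a list of names) return a DataFrame, even for one column.
subset = df[["height_cm", "weight_kg"]]

print(type(heights).__name__)  # Series
print(type(subset).__name__)   # DataFrame
```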

8
Q

What is a recode?

A

A value generated by calculation or other logic applied to raw data. Example: ‘processDuration’ could be a recode calculated as processFinish - processStart.

9
Q

In pandas, how do you add a new column to a DataFrame?

A

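Assign to the new column name using bracket (dictionary-style) notation:

df['totalwgt_lb'] = df.birthwgt_lb + df.birthwgt_oz / 16.0

Do NOT use dot notation for the assignment, like this:

df.totalwgt_lb = df.birthwgt_lb + df.birthwgt_oz / 16.0

Dot notation sets an attribute on the DataFrame object instead of creating a column.
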
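
A minimal sketch of adding a computed column, assuming pandas is installed. The column names follow the card above; the values are invented for illustration.

```python
import pandas as pd

# Invented values; only the column names come from the card above.
df = pd.DataFrame({
    "birthwgt_lb": [7, 6, 8],
    "birthwgt_oz": [8, 0, 4],
})

# Bracket notation creates a real column in the DataFrame.
df["totalwgt_lb"] = df["birthwgt_lb"] + df["birthwgt_oz"] / 16.0

print(df["totalwgt_lb"].tolist())  # [7.5, 6.0, 8.25]
```
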
10
Q

Are histograms good for comparing two distributions against each other?

A

Not always. For example, if there are fewer data points in one distribution than the other, then some of the apparent differences in the histograms will be due to sample sizes. Normalizing the frequencies into probabilities (a PMF) removes that effect.
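
A small sketch of the sample-size problem, using only the standard library. The two samples are invented: `large` is just `small` repeated, so the shapes are identical but the raw counts are not. The helper `pmf` is a hypothetical name, not a library function.

```python
from collections import Counter


def pmf(values):
    """Map each value to its frequency divided by the sample size."""
    counts = Counter(values)
    n = len(values)
    return {value: count / n for value, count in counts.items()}


small = [38, 39, 39, 40, 40, 40, 41]
large = small * 3  # same shape, three times as many data points

# Raw frequencies differ only because the sample sizes differ.
print(Counter(small)[40], Counter(large)[40])  # 3 9

# After normalizing, the two distributions are identical.
print(pmf(small) == pmf(large))  # True
```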

11
Q

In statistics, what is a parameter?

A

A value that characterizes the whole population, such as the population mean.

12
Q

What does ddof stand for?

A

Delta degrees of freedom
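
In NumPy and pandas, `ddof` sets the divisor used for variance and standard deviation to N - ddof. NumPy's `var` defaults to `ddof=0`, while pandas' `Series.var` defaults to `ddof=1`. The sketch below mirrors that convention in plain Python; `variance` is a hypothetical helper written for this card, not a library call.

```python
import statistics


def variance(xs, ddof=0):
    """Variance with divisor N - ddof, mirroring the ddof convention."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - ddof)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(variance(data, ddof=0))  # 4.0, divides by N
print(variance(data, ddof=1))  # divides by N - 1

# The standard library exposes the same two conventions by name.
print(statistics.pvariance(data), statistics.variance(data))
```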

13
Q

In statistics what is estimation?

A

Inferring the parameters of a distribution from a sample statistic.

14
Q

In statistics, what is an ‘estimator’?

A

A statistic, used to estimate a parameter.

15
Q

What is anecdotal evidence?

A

Evidence based on data that is unpublished and usually personal.

16
Q

What are the five steps for approaching a problem using statistics?

A

1 - Data Collection
2 - Descriptive Statistics
3 - Exploratory Analysis
4 - Estimation
5 - Hypothesis Testing

17
Q

What is anecdotal evidence?

A

Evidence, often personal, that is collected casually rather than by a well-designed study.

18
Q

What is the population?

A

The group we are interested in studying. “Population” often refers to a group of people, but the term is used for other subjects, too.

19
Q

What is a cross-sectional study?

A

A study that collects data about a population at a particular point in time.

20
Q

In a study, what is a cycle?

A

In a repeated cross-sectional study, each repetition of the study is called a cycle.

21
Q

What is a longitudinal study?

A

A study that follows a population over time, collecting data from the same group repeatedly.

22
Q

In a statistical study what is a record?

A

In a dataset, a collection of information about a single person or other subject.

23
Q

In a statistical study what is a respondent?

A

A person who responds to a survey.

24
Q

In a statistical study what is a sample?

A

The subset of a population used to collect data.

25
Q

In a statistical study what does ‘representative’ mean?

A

A sample is representative if every member of the population has the same chance of being in the sample.

26
Q

In a statistical study what is oversampling?

A

The technique of increasing the representation of a subpopulation in order to avoid errors due to small sample sizes.

27
Q

In a statistical study what is raw data?

A

Values collected and recorded with little or no checking, calculation or interpretation.

28
Q

What is data cleaning?

A

Processes that include:

1) validating data
2) identifying errors
3) translating between data types and representations, etc.

29
Q

What is a distribution?

A

The values that appear in a sample and the frequency of each.

30
Q

What is a histogram?

A

A mapping from values to frequencies, or a graph that shows this mapping.

31
Q

In statistics, what is frequency?

A

The number of times a value appears in a sample.

32
Q

What is the mode?

A

The most frequent value in a sample, or one of the most frequent values.

33
Q

What is the normal distribution?

A

An idealization of a bell-shaped distribution; also known as a Gaussian distribution.

34
Q

What is a uniform distribution?

A

A distribution in which all values have the same frequency.

35
Q

In statistics what is a tail?

A

The part of a distribution at the high and low extremes.

36
Q

What is central tendency?

A

A characteristic of a sample or population; intuitively, it is an average or typical value.

37
Q

What is an outlier?

A

A value far from the central tendency.

38
Q

In statistics what is ‘spread’?

A

A measure of how spread out the values in a distribution are.

39
Q

What is a ‘summary statistic’?

A

A statistic that quantifies some aspect of a distribution, like central tendency or spread.

40
Q

In statistics what is ‘variance’?

A

A summary statistic often used to quantify spread.

41
Q

What is ‘Standard Deviation’?

A

The square root of variance, also used as a measure of spread.

42
Q

In statistics what is an ‘effect size’?

A

A summary statistic intended to quantify the size of an effect like a difference between groups.
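
One commonly used effect size is Cohen's d: the difference in means divided by a pooled standard deviation. The sketch below is one way to compute it, assuming the pooled variance weights each group's population variance by its size (other pooling conventions exist). The function name and data are made up for illustration.

```python
import math
import statistics


def cohen_d(group1, group2):
    """Difference in means divided by a pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    diff = statistics.mean(group1) - statistics.mean(group2)
    # Pool the two variances, weighting each by its group size.
    pooled_var = (n1 * statistics.pvariance(group1)
                  + n2 * statistics.pvariance(group2)) / (n1 + n2)
    return diff / math.sqrt(pooled_var)


# Invented data, for illustration only.
a = [2.0, 4.0, 6.0]
b = [1.0, 3.0, 5.0]
print(round(cohen_d(a, b), 3))  # 0.612
```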

43
Q

In statistics what does ‘clinically significant’ mean?

A

A result, like a difference between groups, that is relevant in practice.

44
Q

What is a Probability mass function (PMF)?

A

A representation of a distribution as a function that maps from values to probabilities.

45
Q

In a probability mass function (PMF), what is a probability?

A

A frequency expressed as a fraction of the sample size.

46
Q

In statistics what is ‘normalisation’?

A

The process of dividing a frequency by a sample size to get a probability.

47
Q

In Pandas, what is an ‘index’?

A

In a pandas DataFrame, the index is a special column that contains the row labels.

48
Q

In statistics what is a ‘percentile rank’?

A

The percentage of values in a distribution that are less than or equal to a given value.

49
Q

In statistics what is a ‘percentile’?

A

The value associated with a given percentile rank.
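
Percentile rank and percentile are inverse lookups of each other. A minimal sketch in plain Python, with hypothetical function names and invented scores; the simple loop is chosen for clarity, not speed.

```python
def percentile_rank(scores, your_score):
    """Percentage of scores less than or equal to your_score."""
    count = sum(1 for score in scores if score <= your_score)
    return 100.0 * count / len(scores)


def percentile(scores, rank):
    """Smallest score whose percentile rank is at least `rank`."""
    for score in sorted(scores):
        if percentile_rank(scores, score) >= rank:
            return score
    # Only reached if rank is above 100.
    return max(scores)


scores = [55, 66, 77, 88, 99]
print(percentile_rank(scores, 88))  # 80.0
print(percentile(scores, 50))       # 77
```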

50
Q

In statistics, what is a ‘cumulative distribution function’ (CDF)?

A

CDF(x) is the fraction of the sample less than or equal to x.

51
Q

In statistics, what is an inverse CDF (inverse cumulative distribution function)?

A

A function that maps from a cumulative probability, p, to the corresponding value.
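
A sketch of an empirical CDF and its inverse for a small sample, using only the standard library. The function names are hypothetical, and the inverse here returns the smallest sample value whose CDF is at least p, which is one of several reasonable conventions.

```python
import bisect
import math


def eval_cdf(sample, x):
    """Fraction of the sample less than or equal to x."""
    ordered = sorted(sample)
    return bisect.bisect_right(ordered, x) / len(ordered)


def inverse_cdf(sample, p):
    """Smallest value in the sample whose CDF is at least p (0 < p <= 1)."""
    ordered = sorted(sample)
    index = math.ceil(p * len(ordered)) - 1
    return ordered[max(index, 0)]


sample = [1, 2, 2, 3, 5]
print(eval_cdf(sample, 2))       # 0.6
print(inverse_cdf(sample, 0.5))  # 2
```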

52
Q

In percentiles, what is the ‘median’?

A

The 50th percentile, often used as a measure of central tendency.

53
Q

What is the ‘interquartile range’?

A

The difference between the 75th and 25th percentiles, used as a measure of spread.
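
A sketch of the median and interquartile range using the standard library's `statistics.quantiles`. Libraries differ in how they interpolate between data points, so other tools may give slightly different quartiles for the same data; the `inclusive` method is an assumption made here, and the data are invented.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 asks for the three cut points that split the data into quartiles.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print(q2)       # 5.0, the median (50th percentile)
print(q3 - q1)  # 4.0, the interquartile range
```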

54
Q

In statistics, what is a ‘quantile’?

A

A sequence of values that correspond to equally spaced percentile ranks. For example, the quartiles of a distribution are the 25th, 50th and 75th percentiles.

55
Q

In statistical sampling, what is ‘replacement’?

A

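A property of a sampling process. “With replacement” means that the same value can be chosen more than once; “without replacement” means that once a value is chosen, it is removed from the population.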
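
A short sketch of both kinds of sampling with Python's `random` module: `choices` draws with replacement and `sample` draws without. The population and seed are arbitrary.

```python
import random

population = list(range(10))
rng = random.Random(17)  # fixed seed so the sketch is repeatable

# With replacement: the same value can be drawn more than once.
with_replacement = rng.choices(population, k=10)

# Without replacement: each value can be drawn at most once.
without_replacement = rng.sample(population, k=5)

print(with_replacement)
print(without_replacement)
```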

56
Q

What is an ‘empirical distribution’?

A

The distribution of values in a sample.

57
Q

What is an ‘analytical distribution’?

A

A distribution whose CDF (cumulative distribution function) is an analytic function.

58
Q

When considering statistical distributions, what is a ‘model’?

A

A useful simplification. Analytic distributions are often good models of more complex empirical distributions.

59
Q

What is the ‘interarrival time’?

A

The elapsed time between two events.
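
For a list of event times, the interarrival times are the differences between consecutive events. A tiny sketch with invented timestamps (in arbitrary time units):

```python
# Invented arrival times, sorted in increasing order.
arrival_times = [0.0, 1.5, 2.0, 4.5]

# Pair each event with the next one and take the difference.
interarrival_times = [
    later - earlier
    for earlier, later in zip(arrival_times, arrival_times[1:])
]

print(interarrival_times)  # [1.5, 0.5, 2.5]
```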

60
Q

What is the ‘complementary CDF’ (complementary cumulative distribution function)?

A

A function that maps from a value, x, to the fraction of values that exceed x, which is 1-CDF(x).

61
Q

What is the ‘Standard Normal Distribution’?

A

The normal distribution with mean 0 and standard deviation 1.

62
Q

What is a normal probability plot?

A

A plot of the values in a sample versus random values from a standard normal distribution.
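
A sketch of how the points for such a plot could be computed, without drawing anything. It pairs the sorted sample with the same number of sorted draws from a standard normal distribution; if the sample is roughly normal, the points fall near a straight line. The function name, seed, and data are assumptions for illustration.

```python
import random


def normal_probability_points(sample, seed=0):
    """Pair sorted standard normal draws with the sorted sample values."""
    rng = random.Random(seed)
    normal_draws = sorted(rng.gauss(0.0, 1.0) for _ in sample)
    return list(zip(normal_draws, sorted(sample)))


# Invented data, for illustration only.
weights = [7.1, 6.4, 8.0, 7.5, 6.9, 7.3]
points = normal_probability_points(weights)

# Each row is one point: x from the standard normal, y from the sample.
for x, y in points:
    print(f"{x:6.2f}  {y:4.1f}")
```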

63
Q

In statistical hypothesis testing, what is the p-value?

A

The p-value (or probability value) is the probability of obtaining test results at least as extreme as the results observed, on the assumption that the null hypothesis is correct.
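
One way to estimate a p-value is a permutation test: shuffle the pooled data many times and count how often the shuffled difference is at least as extreme as the observed one. The sketch below assumes the test statistic is the absolute difference in means and the null hypothesis is that both groups come from the same distribution; the function name, iteration count, and data are made up for illustration.

```python
import random
import statistics


def permutation_p_value(group1, group2, iters=1000, seed=0):
    """Estimate a two-sided p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group1) - statistics.mean(group2))
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    count = 0
    for _ in range(iters):
        # Under the null hypothesis the group labels are interchangeable.
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n1]) - statistics.mean(pooled[n1:]))
        if diff >= observed:
            count += 1
    return count / iters


# Invented, clearly separated groups: the p-value should be small.
a = [1, 2, 3, 4, 5]
b = [101, 102, 103, 104, 105]
print(permutation_p_value(a, b))
```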