Basics Flashcards
(127 cards)
Summation / Sigma Notation
This is the sigma symbol: ∑
It tells us that we are summing something.
n is the summation index - when evaluating the expression we substitute each value of the index in turn and add the results.
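As a minimal sketch, summing i² for i = 1 to 4 in Python mirrors what sigma notation expresses (the numbers here are just an illustration):

```python
# Sigma notation sketch: sum of i**2 for i = 1..4, i.e. 1 + 4 + 9 + 16 = 30
total = sum(i**2 for i in range(1, 5))
print(total)  # 30
```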
Frequency tables - lists - dot plots
Used to represent a single variable.
A list is just a list of variable values
A Frequency Table is a table showing each value and how often it occurs
A dot plot is a visual frequency table, with the variable value on the x and the frequency on the y.
These are all ways of representing the same info.
Once the data is organized we can start to analyze it with summary stats etc.
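A frequency table can be sketched in Python with `collections.Counter`; the data values here are hypothetical:

```python
from collections import Counter

data = [2, 3, 3, 5, 2, 3]   # hypothetical list of variable values
freq = Counter(data)        # frequency table: value -> how often it occurs
for value in sorted(freq):
    print(value, freq[value])
```

Printing each value with its count is the list/frequency-table view; plotting the same counts as stacked dots would give the dot plot.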
Histogram
Used to represent a single variable
Like a bar chart but both the x and y axis are numerical.
x-axis = intervals
y-axis = absolute frequency of each interval.
The bars will be touching to show that one interval begins where the other ends.
Instead of just plotting the frequency of each discrete value, like a frequency table or dot plot, a histogram arranges the data into categories and then shows how many values fall within each category. The categories are often called buckets or bins.
Bins should not overlap
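Binning can be sketched directly in Python; the data and bin edges below are hypothetical, with each bin covering [edge, next edge):

```python
data = [1.2, 3.7, 2.5, 8.1, 4.4, 6.3, 2.9]   # hypothetical values
edges = [0, 2, 4, 6, 8, 10]                  # non-overlapping bins: [0,2), [2,4), ...
counts = [0] * (len(edges) - 1)              # absolute frequency per bin

for x in data:
    for i in range(len(edges) - 1):
        if edges[i] <= x < edges[i + 1]:
            counts[i] += 1
            break

print(counts)  # [1, 3, 1, 1, 1]
```

Making each bin half-open is one way to guarantee the bins don't overlap: a value on a boundary lands in exactly one bin.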
Descriptive statistics
Ways of describing data without just providing the raw data. It’s about describing the data with a smaller set of numbers.
This would include things like summary statistics.
Inferential Statistics
Ways of gaining insight from the data set and figuring out what the data means. How can we use the data to understand what the population value might be.
The key to inferential statistics is understanding that samples do not always accurately reflect the population they came from.
A large part of inferential statistics is quantifying our uncertainty about a population by looking at a smaller sample.
Average/ Central Tendency
Average = Typical or middle value of a data set. The “central tendency” of the data
Common types:
Mean
Median
Mode
The ‘best’ measure of central tendency will depend on which measure best represents the actual data and how it is skewed (or not).
All measures should be used in combination to understand the data set.
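Python's standard `statistics` module computes all three measures; the data set here is hypothetical and deliberately skewed by one large value:

```python
import statistics

data = [1, 2, 2, 3, 14]            # hypothetical, skewed by the 14
print(statistics.mean(data))       # 4.4 -- pulled up by the outlier
print(statistics.median(data))     # 2
print(statistics.mode(data))       # 2
```

Note how the mean lands above every value but the outlier, while the median and mode stay with the bulk of the data.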
Median
The middle number of the data set when the set is placed in numerical order. If there is an even number of values, you take the mean of the two middle numbers.
The median is useful if there are outliers that will skew the mean and make it misleading.
Mode
The most common number in the data set. If there is no most common number then there is no mode.
Typically the least used measure of central tendency
The mode is useful if there are outliers that skew the mean and if there is a single number that shows up a lot.
Location of central tendency and skewness
In symmetrical distributions the mean, median and mode are identical or very close.
In left skewed distributions the mean is typically to the left of the median, which is to the left of the mode.
In right skewed distributions the mean is typically to the right of the median, which is to the right of the mode.
Left and right skew
A left skew means the tail/ outliers are to the left
A right skew means the tail/ outliers are on the right.
Interquartile Range (IQR)
The IQR is a measure of how spread out the data is.
(IQR) is the distance between the first and third quartile marks (25th to 75th percentile).
The IQR is a measurement of the variability about the median.
IQR tells us the range of the middle half of the data.
To find the IQR:
1. Find the median of the data set
2. Find the median of each set of numbers on either side of the median number. The IQR is the difference between these two numbers.
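The two steps above can be sketched in Python. This follows the card's method (medians of the halves on either side of the overall median, excluding the median itself when n is odd); the data is hypothetical:

```python
import statistics

def iqr(data):
    """IQR via the median-of-halves method described above."""
    s = sorted(data)
    half = len(s) // 2
    lower = s[:half]                  # values below the overall median
    upper = s[half + len(s) % 2:]     # values above it (middle value excluded if n is odd)
    return statistics.median(upper) - statistics.median(lower)

print(iqr([1, 3, 4, 6, 7, 9, 12]))   # 6: Q3 = 9, Q1 = 3
```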
Outliers
The definition of what is reasonably an outlier is subject to some interpretation based on the specific qualities of the data set.
Common definition:
An outlier is any number that is more than 1.5x the interquartile range below Q1 or above Q3
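The 1.5 × IQR rule can be sketched as a small Python function; the data and quartile values below are hypothetical:

```python
def outliers(data, q1, q3):
    """Flag values beyond 1.5 * IQR below Q1 or above Q3."""
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lo or x > hi]

# Hypothetical data with Q1 = 3, Q3 = 9: fences are 3 - 9 = -6 and 9 + 9 = 18
print(outliers([1, 3, 4, 6, 7, 9, 30], 3, 9))  # [30]
```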
Sample Mean
Calculated the same way as the population mean: add up all the values and divide by the number of values, x̄ = ∑x / n.
Measures of Variability - Univariate
Variance
Standard Deviation
Coefficient of Variation
Sample Variance
Variance is a measure of the spread or dispersion of a set of data points around their mean. It quantifies how much the individual data points deviate from the average.
Sample variance is generally a pretty good statistic in terms of approximating the true variance of the population.
A better approximation of the population parameter can usually be gained by dividing by n-1.
This approximation is AKA ‘The unbiased sample variance’.
Dividing by just n will tend to underestimate the population variance.
Dividing by n is fine if you just want the variance/SD of the sample itself.
Written as s² (s squared).
s² = ∑(x - x̄)² / (n - 1)
Using n-1 instead of n is AKA Bessel's correction.
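The difference between dividing by n and by n - 1 can be seen directly in Python; the sample below is hypothetical, and the `statistics` module provides both versions:

```python
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample
n = len(sample)
mean = sum(sample) / n
ss = sum((x - mean) ** 2 for x in sample)   # sum of squared deviations

print(ss / n)        # 4.0        -- divides by n (biased low as a population estimate)
print(ss / (n - 1))  # ~4.571     -- divides by n - 1 (Bessel's correction)

print(statistics.pvariance(sample))  # divides by n
print(statistics.variance(sample))   # divides by n - 1
```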
Standard Deviation
Measure of the dispersion or spread of a set of data points around their mean. It is closely related to variance but is expressed in the same units as the original data, making it easier to interpret and compare.
The standard deviation is the square root of:
The population variance
OR
The unbiased sample variance (s², the version that divides by n-1)
The square root of the sample variance (AKA the sample standard deviation) will not be an unbiased approximation of the population standard deviation.
This is because the square root function is non-linear.
SD is Written as:
s
std(x) - SD of random variable x
σ (lowercase sigma)
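The `statistics` module exposes both square roots directly; the sample here is hypothetical:

```python
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample
print(statistics.pstdev(sample))     # population SD: sqrt of the divide-by-n variance -> 2.0
print(statistics.stdev(sample))      # sample SD: sqrt of the divide-by-(n-1) variance
```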
Variance - Interpretation
Variance is always non-negative
Interpreting variance involves considering the magnitude of the variance value and its relationship to the data set.
Consider:
Magnitude - Higher variance means more dispersion from the mean
Units - Variance is expressed in squared units of the original data. Restore the original units by converting to standard deviation
Comparison - If one data set has a significantly higher variance than another, it implies that the observations in the first data set are more widely scattered.
Outliers - Variance is sensitive to outliers, they can inflate the variance, making it a less reliable measure of dispersion
Limitations:
Variance does not tell us the direction of variations from the mean.
It treats positive and negative differences equally.
Not robust in non-normal - heavily skewed data sets.
Coefficient of Variation(CV)
AKA relative standard deviation.
Calculated as the standard deviation divided by the mean. It's just the standard deviation relative to the mean.
There are separate formulas for population and sample data for this measurement as well.
CV is used to compare the variation of two different data sets.
It will return a number that is not in units and is directly comparable across data sets.
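A small Python sketch of the sample CV; the two data sets below are hypothetical and in different units, which is exactly where the CV helps:

```python
import statistics

def cv(data):
    """Coefficient of variation: sample SD divided by the mean (unitless)."""
    return statistics.stdev(data) / statistics.mean(data)

heights_cm = [160, 170, 175, 180]   # hypothetical data in centimeters
weights_kg = [55, 70, 80, 95]       # hypothetical data in kilograms
print(cv(heights_cm))               # ~0.05
print(cv(weights_kg))               # ~0.22 -- relatively more spread out
```

Even though the raw SDs are in different units, the CVs are unitless and directly comparable.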
Standard Deviation - Interpretation
Easier to interpret than variance because it is expressed in the original units of the data.
Quantifies the typical amount of variation or “typical distance” of data points from the average.
Consider:
Magnitude - Higher SD means data points are spread farther from the mean.
Units - SD is expressed in the same units as the original data. This makes interpretation and comparison easier.
Range - SD provides a useful range around the mean (68-95-99.7). It helps us visualize where the data is falling using a single number.
Comparison - Comparing the standard deviations of different data sets allows you to assess their relative spread.
Outliers - SD is sensitive to outliers. Outliers, which are extreme values, can have a significant impact on the standard deviation.
Limitations - SD assumes a normal or symmetrical distribution. If the data has a heavy skew, other measures might be more appropriate.
Mean vs. Median as Central Tendency
The measures work in pairs:
More symmetrical Data:
Mean = central tendency
Standard Deviation = Spread
More Skewed Data:
Median = central tendency
IQR = Spread
Outlier values will move the mean quite a lot, but they don't affect the median much: the median depends on the position of the middle value(s), not on how extreme the outlying values are.
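The robustness of the median can be demonstrated in Python with a hypothetical data set and one extreme value:

```python
import statistics

data = [10, 12, 13, 14, 15]
with_outlier = data + [200]          # add one extreme value

print(statistics.mean(data), statistics.mean(with_outlier))      # 12.8 -> 44.0
print(statistics.median(data), statistics.median(with_outlier))  # 13   -> 13.5
```

The single outlier more than triples the mean but barely moves the median.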
Z-scores
One of the most common measures in statistics.
A Z-score tells you how many standard deviations away from the population mean a given data point is.
This helps you tell how usual or unusual a data point is.
This can be useful for comparing data points from different distributions. The scales are different but the relative position to the mean can still be compared.
To calculate the Z-score for a data point x:
(x-µ) / standard deviation
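The formula translates to a one-line Python function; the mean and SD below are hypothetical:

```python
def z_score(x, mu, sigma):
    """How many standard deviations x lies from the mean mu."""
    return (x - mu) / sigma

# Hypothetical population with mean 70 and SD 10
print(z_score(85, 70, 10))   # 1.5  -- 1.5 SDs above the mean
print(z_score(55, 70, 10))   # -1.5 -- 1.5 SDs below the mean
```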
Z-scores - Interpretation
The Z-score of a data point tells you how many standard deviations from the mean the point is.
A negative Z-score indicates that the data point is below the mean, a positive Z-score indicates that the data point is above the mean
A z-score of 0 means the data point is equal to the mean.
A z-score of 1 means the data point is one standard deviation above the mean (and a z-score of -1 means one standard deviation below).
A z-score of 2 indicates it is two standard deviations away, and so on.
Typically, data points with z-scores greater than 3 or less than -3 are considered extreme outliers
You can use a table to find the percentile of a data point given its z-score.
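In Python, `statistics.NormalDist` can stand in for a z-table: the standard normal CDF gives the fraction of values at or below a given z-score.

```python
from statistics import NormalDist

z = 1.0
percentile = NormalDist().cdf(z)   # fraction of a normal population at or below z
print(round(percentile, 4))        # ~0.8413, matching the z-table entry for z = 1.00
```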
Empirical Rule and Normal Distributions
The Empirical Rule is AKA the 68-95-99.7 Rule: in a normal distribution, about 68% of values fall within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3.
Marginal Distribution
The distribution formed by the totals of a single variable in a two way table. This data can be represented as numbers or as percentages.
Look at the margins of the table.