Data Analytics Theory Flashcards

Question

What values can a quantile take?

Answer 1

Between 0 and 1.

Answer 2

Between 0 and 100.

Answer 3

The percentile is the quantile expressed on a "percent scale" of 0 to 100, ie the p quantile (0 <= p <= 1) is the (100 x p)th percentile.

Answer 4

The percentile is the quantile expressed on a "percent scale" of 0 to 100, ie the p quantile (0 <= p <= 1) is the (100 x p)th percentile. The Pth percentile is the cutoff point at or below which at least P percent of the observations in the dataset fall.

Answer 5

The 80th percentile is the cutoff point which indicates that 80% of observations in the dataset may be found at this point or below.

Answer 6

Quartiles are three cut off points that divide the dataset into four equal groups (Q1, Q2, Q3)

Answer 7

Q1 = 0.25 quantile = 25th percentile. This is the middle value between the smallest observation and the median, ie it is the median of the lower half of the dataset.

Answer 8

Q2 = 0.5 quantile = 50th percentile. This is the median of the dataset (the value which splits the dataset in half).

Answer 9

Q3 = 0.75 quantile = 75th percentile. This is the middle value between the median and the highest observation in the dataset, ie it is the median of the upper half of the dataset.

Answer 10

The range is the difference between the smallest and largest observations in a numerical variable. It is extremely sensitive to outliers and therefore not very useful as a general measure of dispersion in the data.

Answer 11

It is extremely sensitive to outliers - its calculation involves the use of extreme values.

Answer 12

This provides basic information about variability in the dataset. It consists of the 0th percentile (minimum), 25th percentile (Q1), 50th percentile (Q2), 75th percentile (Q3) and 100th percentile (maximum). Ie it is the quartiles plus the maximum and minimum values.

Answer 13

The interquartile range (IQR) measures the width of the "middle 50 percent" of the data. It is the range of values between Q1 (0.25 quantile) and Q3 (0.75 quantile). It is very resistant to outliers as it doesn't consider the extremes where outliers are present.
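
A minimal sketch of the five-number summary and the IQR, using only Python's standard library. The helper name `five_number_summary` and the two small datasets are made up for illustration. Quartile conventions differ between software packages; `method="inclusive"` is just one of them.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # statistics.quantiles sorts the data itself; n=4 gives the three quartiles.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)

clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # same data, one extreme value

for name, data in [("clean", clean), ("with outlier", with_outlier)]:
    lowest, q1, q2, q3, highest = five_number_summary(data)
    # The range uses the extremes; the IQR only uses Q1 and Q3.
    print(name, "range =", highest - lowest, "IQR =", q3 - q1)
```

The single extreme value changes the range from 8 to 99 while the IQR stays at 4, which is the resistance to outliers described above.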

Answer 14

The IQR measures the range across the middle 50% of the data, and therefore unlike the range it doesn't consider the extremes where the outliers are present.

Answer 15

Sort the data in ascending order.

Answer 16

Covariance measures joint variability — the extent of variation between two random variables. It quantifies how two variables vary together.

Answer 17

R = 0 - there is no linear relationship between numerical variables x and y. R > 0 - there is a positive linear relationship between numerical variables x and y (as x increases, y increases and vice versa). R < 0 - there is a negative linear relationship between numerical variables x and y (as x increases, y decreases and vice versa)

Answer 18

R > 0 - as x increases, y increases and vice versa

Answer 19

R < 0 - as x increases, y decreases and vice versa

Answer 20

Correlation

Answer 21

Covariance can tell us if there is a relationship between two variables, but it cannot measure how strong the relationship is as there is no scale to compare the value of r to.

Answer 22

Numerical variables.

Answer 23

We cannot quantify the strength of the linear relationship between two variables, because there are no upper or lower limits on the values that the covariance can take.
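
The sketch below illustrates why covariance has no fixed scale while correlation does. It assumes the usual sample formulas (dividing by n - 1); the function names and the toy data are hypothetical.

```python
import math

def sample_covariance(x, y):
    """Sample covariance of two equal-length lists (divides by n - 1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def pearson_correlation(x, y):
    """Covariance rescaled by both standard deviations, so it lies in [-1, 1]."""
    return sample_covariance(x, y) / math.sqrt(
        sample_covariance(x, x) * sample_covariance(y, y)
    )

x = [1, 2, 3, 4, 5]
y_metres = [2, 4, 6, 8, 10]
y_centimetres = [v * 100 for v in y_metres]  # same relationship, new units

print(sample_covariance(x, y_metres), sample_covariance(x, y_centimetres))
print(pearson_correlation(x, y_metres), pearson_correlation(x, y_centimetres))
```

Changing the units multiplies the covariance by 100, but the correlation is unchanged, so only the correlation can be read as a measure of strength.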

Answer 24

The direction and strength of an association between two variables. It is used to interpret the covariance.

Answer 25

Pearson’s product-moment correlation coefficient (ρxy, "rho xy").

Answer 26

There are guidelines available to interpret the value of rho (L.R. = linear relationship): |rho| = 0.0 – no L.R.; 0.0 < |rho| <= 0.19 – very weak L.R.; 0.20 <= |rho| <= 0.39 – weak L.R.; 0.40 <= |rho| <= 0.59 – moderate L.R.; 0.60 <= |rho| <= 0.79 – strong L.R.; 0.80 <= |rho| < 1.0 – very strong L.R.; |rho| = 1.0 – perfect L.R.
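
As a rough sketch, the guideline bands can be written as a lookup function. The name `describe_strength` is hypothetical. The guidelines leave small gaps between bands (for example between 0.19 and 0.20), so this sketch assumes each band runs up to the start of the next one.

```python
def describe_strength(rho):
    """Map a correlation coefficient to the guideline label for |rho|."""
    size = abs(rho)
    if size > 1:
        raise ValueError("rho must lie between -1 and 1")
    if size == 0:
        return "no linear relationship"
    if size < 0.20:
        return "very weak"
    if size < 0.40:
        return "weak"
    if size < 0.60:
        return "moderate"
    if size < 0.80:
        return "strong"
    if size < 1.0:
        return "very strong"
    return "perfect"

# The sign gives the direction; the label only describes the strength.
print(describe_strength(-0.85), describe_strength(0.45))
```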

Answer 27

If rho = 1, there is a perfect positive linear relationship between variables x and y. If 0 < rho < 1, there is a positive linear relationship between x and y. The closer to 1 the stronger it is. If rho = -1, there is a perfect negative linear relationship between x and y. If -1 < rho < 0, there is a negative linear relationship between x and y. The closer to -1 the stronger it is. If rho = 0, there is no linear relationship between x and y.

Answer 28

Rho is between -1 and 1.

Answer 29

It is scaled between -1 and 1.

Answer 30

A statistical technique used to get more insight into the properties of categorical variables.

Answer 31

1 - category. 2 - frequency column (F) - the number of occurrences of each category; the column totals n. 3 - relative frequency (RF) - the proportion of occurrences of each category (F/n); the relative frequencies, written as proportions, must sum to 1. 4 - percentages (P) - proportions multiplied by 100; this column must sum to 100.
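
A small sketch of building these four columns with the standard library. The function name `frequency_table` and the colour data are invented for illustration.

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (category, frequency, relative frequency, percentage)."""
    counts = Counter(values)
    n = len(values)
    # most_common() lists categories from most to least frequent.
    return [(cat, f, f / n, 100 * f / n) for cat, f in counts.most_common()]

colours = ["red", "blue", "red", "green", "red", "blue"]
table = frequency_table(colours)
for category, f, rf, pct in table:
    print(f"{category:<6} F={f} RF={rf:.3f} P={pct:.1f}%")
```

The frequency column sums to n, the relative frequencies to 1, and the percentages to 100, which is a quick check that the table is complete.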

Answer 32

They help us to summarise large amounts of data and display this information clearly. We can see the most/least common categories and can calculate proportions.

Answer 33

A contingency table summarises data for two categorical variables (table of counts by category). Each value in the table represents the number of times a particular combination of variable outcomes occurred.
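
One way to sketch a contingency table is to count each combination of outcomes. The records below (a hypothetical group and a yes/no answer) and the function name are made up for illustration.

```python
from collections import Counter

def contingency_table(pairs):
    """Count how often each (row category, column category) pair occurs."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    # Missing combinations get a count of 0.
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}

records = [
    ("group A", "yes"), ("group A", "no"), ("group A", "yes"),
    ("group B", "no"), ("group B", "no"), ("group B", "yes"),
]
table = contingency_table(records)
for row, cells in table.items():
    print(row, cells)
```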

Answer 34

Both tables are used to summarise information on categorical variables. A frequency table summarises information on a single categorical variable, whereas a contingency table summarises the data for two categorical variables.

Answer 35

Two categorical variables - contingency table

Answer 36

Categorical variables. This can be represented as frequency or proportion.

Answer 37

Bar charts - this can be by frequency or proportion.

Answer 38

The x-axis represents the different symbols (categories) of a categorical variable. The y-axis represents the frequency or proportion of the occurrence of each category.

Answer 39

A graphical representation of the information in a contingency table. It is similar to a bar plot.

Answer 40

A mosaic plot can be used to visualise one or two categorical variables from a contingency table.

Answer 41

Mosaic plots use the area of each box to represent the number of observations in that box.

Answer 42

A mosaic plot

Answer 43

One category (x) is used to create an initial one variable mosaic plot where the area represents the number of observations for that category. The second category (y) is represented by splitting each bar proportionally according to the fractions of y.

Answer 44

Numerical variables

Answer 45

A plot that provides a case-by-case view of data for two numerical variables.

Answer 46

Scatterplots are helpful in quickly spotting associations between two numerical variables.

Answer 47

A visualisation technique used for explaining important features of the distribution of the target numerical variable. It provides insight into centrality, spread, skewness and possible outliers.

Answer 48

Centrality (median), spread (quartiles), skewness and possible outliers.

Answer 49

No, the whiskers may not capture the maximum and minimum values. How the whiskers are determined depends on the software package used, eg they may reach at most 1.5 x the IQR beyond the quartiles.

Answer 50

Identifying outliers.

Answer 51

Right-skewed

Answer 52

Left-skewed

Answer 53

Suspected outliers are the observations beyond the maximum reach of the whiskers.
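
A minimal sketch of flagging suspected outliers, assuming the common convention that whiskers reach at most 1.5 x IQR beyond Q1 and Q3. Software packages differ on both the whisker rule and the quartile method; the function name and data are illustrative only.

```python
import statistics

def suspected_outliers(data, k=1.5):
    """Return the observations beyond the maximum reach of the whiskers."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr  # furthest a lower whisker may reach
    upper_fence = q3 + k * iqr  # furthest an upper whisker may reach
    return [x for x in data if x < lower_fence or x > upper_fence]

data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(suspected_outliers(data))
```

Here Q1 = 3 and Q3 = 7, so the fences sit at -3 and 13 and only the value 100 is flagged.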

Answer 54

An outlier is an observation that appears extreme relative to the rest of the data

Answer 55

- To identify a strong skew in the distribution - To identify data collection or entry errors - To get an insight into interesting properties of the data

Answer 56

Side-by-side box plots are a traditional tool for comparing numerical observations across categories. They are particularly useful for comparing the centrality and spread of numerical observations between categories.

Answer 57

Side-by-side box plots

Answer 58

Comparison of centrality and spread of numerical observations between categories.

Answer 59

- Describe what you see - Relate this to the question (ie what does this mean in real life) - Support with figures from the graph

Answer 60

Histograms are plots that are used for describing the shape of the data distribution of the target numerical variable. They also provide a view of the data density of the target numerical variable (higher bars represent where data is more common).

Answer 61

Where the data are relatively more common.

Answer 62

Histogram - where higher bars represent where the data are relatively more common.

Answer 63

They use bars to represent frequencies / they both measure frequencies.

Answer 64

- Histograms are used for displaying distributions of numerical variables while bar charts are used for categorical variables. - Both measure frequencies, but in histograms, observations first need to be "binned"

Answer 65

A defined interval (used to group individual numerical values). The number of observations that fall within each interval are counted and this frequency is used to determine the height of the bar for that interval.

Answer 66

The chosen bin width can alter the story that the histogram is telling. Increasing the bin width may reduce the number of modes visible.

Answer 67

1 - define the bins and bin sizes (software may determine this) 2 - once defined, count how many observations fall into each interval 3 - plot
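
The binning steps can be sketched as below, assuming equal-width bins that start at 0 and include their left edge. The function name and the data are hypothetical, and empty bins are simply omitted from the result.

```python
from collections import Counter

def bin_counts(data, width, start=0):
    """Count observations per interval [left, left + width)."""
    # Step 1: the bin index of x follows from the chosen start and width.
    indices = Counter(int((x - start) // width) for x in data)
    # Step 2: report the count for each interval that contains data.
    return {
        (start + i * width, start + (i + 1) * width): indices[i]
        for i in sorted(indices)
    }

data = [1, 2, 2, 3, 5, 6, 6, 7, 9]
narrow = bin_counts(data, width=2)
wide = bin_counts(data, width=5)
print(narrow)  # two separate peaks are visible
print(wide)    # the peaks merge into two broad bars
```

Step 3 (plotting) would draw one bar per interval with these counts as heights. Comparing the two results shows how a wider bin can hide a mode.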

Answer 68

The mode is represented by a prominent peak in the distribution.

Answer 69

Histograms can show how many and what the modes of a distribution are. - Unimodal / bimodal / multimodal

Answer 70

When data trails off to the right ie observations are clustered on the left of the axis and there is a long tail to the right.

Answer 71

When data trails off to the left ie observations are clustered on the right of the axis and there is a long tail to the left.

Answer 72

Right-skewed

Answer 73

Left-skewed

Answer 74

A dataset that shows roughly equal trailing off in both directions.

Answer 75

A lot of statistical inference relies on data being normally distributed.

Answer 76

Mean and standard deviation

Answer 77

Median and IQR - they are robust to outliers.

Answer 78

mean ~ median ~ mode

Answer 79

mode < median < mean

Answer 80

mean < median < mode

Answer 81

The mean is pulled in the direction of the tail, towards the extremes. The mode is pulled in the opposite direction (where the data is clustered)

Answer 82

y = sqrt(x), y = ln(x), y = -1/x, in increasing order of skewness severity.

Answer 83

y = x^2, y = x^3, in increasing order of skewness severity.

Answer 84

Depending on bin size, the story the graph tells can change. If the bin size is too wide, it may mislead you into thinking that the data is normally distributed.

Answer 85

Absolute frequency or relative frequency (F/n)

Answer 86

They have the same shape. The difference is the Y-axis and the fact that the areas of the bars of the relative frequency histogram add up to one.

Answer 87

The absolute frequency divided by the total number of observations.

Answer 88

Use the relative frequency histogram when we want to investigate whether the proportion is less than or greater than a certain value. Ie we want to look at proportion rather than frequency.

Answer 89

Can't determine an exact answer with these bin widths, we can only estimate. To answer accurately we need to have a narrower histogram (one with smaller bins)

Answer 90

The histogram forms a smoother curve, approaching the density curve.

Answer 91

A density curve is a smoothed version of the relative frequency histogram. It is used for the visualisation of continuous variables or very large populations. It also represents a probability density function. The area under the curve is equal to 1.

Answer 92

A continuous variable.

Answer 93

The area corresponds to measuring probabilities. The total area is equal to 1. Similar to the bars in a relative frequency diagram.

Answer 94

The probability that x is equal to some value from the continuous distribution is ALWAYS equal to 0. This happens because a single point on the density curve diagram has a width of 0 and therefore we can't obtain the area underneath the curve at a single point.

Answer 95

The normal curve or normal distribution.

Answer 96

- It is a unimodal, bell-shaped curve, symmetric around its mean - Mean, mode and median are equal - It is determined by two parameters (mu and sigma), usually denoted as N(mu, sigma) - The area under the normal curve is 1

Answer 97

Mu and sigma - N(mu, sigma)

Answer 98

A normal distribution where mu = 0 and sigma = 1, represented as N(0,1)

Answer 99

The standard normal distribution

Answer 100

Transform our dataset onto the standard normal distribution. This enables us to refer to the standardised tables.

Answer 101

Mu (mean) - the centre of the curve. Changing mu shifts the curve left / right. Sigma (standard deviation) - the width of the curve. Changing sigma stretches or constricts the curve.

Answer 102

68-95-99.7 Rule - 68% of observations lie within 1 SD away from the mean in the normal distribution - 95% of observations lie within 2 SDs - 99.7% of observations lie within 3 SDs
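
The rule can be checked numerically with the standard normal distribution from Python's `statistics` module. The helper name `share_within` is made up for this sketch.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.4f}")
```

The computed proportions are roughly 0.6827, 0.9545 and 0.9973, which the rule rounds to 68%, 95% and 99.7%.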

Answer 103

68%, 95%, 99.7%

Answer 104

We should convert available observations into the standard deviation units and measure their distances from the mean. To perform this type of conversion we use the standardisation technique called Z-score.

Answer 105

The Z-score of an observation is the number of standard deviations it falls above or below the mean. It is used to analyse normally distributed data.

Answer 106

For an observation x that follows the normal distribution N(mu, sigma): Z = (x - mu) / sigma. By calculating a Z-score we "convert" the data value from its normal distribution N(mu, sigma) to a value from the standard normal distribution N(0,1) in such a way that it maintains all the properties of the original dataset.
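
A short worked example of the Z-score formula. The values mu = 100 and sigma = 15 are assumed purely for illustration.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

mu, sigma = 100, 15  # hypothetical population mean and standard deviation
print(z_score(130, mu, sigma))  # 2.0: two SDs above the mean
print(z_score(85, mu, sigma))   # -1.0: one SD below the mean
```

Because |2.0| > |-1.0|, the observation 130 is the more unusual of the two.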

Answer 107

The observation is one standard deviation above the mean.

Answer 108

You can use Z-scores to roughly identify which observations are more unusual than others. If the absolute value of the Z-score is larger, it is more unusual - |z1| > |z2| means z1 is more unusual.

Answer 109

The more unusual observation will have a larger absolute Z-score, ie it will be more standard deviations away from the mean.

Answer 110

Magnitude - the number of standard deviations away from the mean the observation is. Sign - whether the observation is above (positive) or below (negative) the mean.

Answer 111

Z ~ N(0,1) It follows that it is normally distributed once transformed.

Answer 112

We transform it to the standard normal distribution (Z scores) and use the N(0,1) percentiles, which are listed in a normal probability table to determine the percentile based on the Z score.

Answer 113

1 – draw and label a picture of the normal distribution (doesn’t need to be exact) 2 – shade in the region of interest 3 – calculate the Z-score of the cutoff value 4 – look up the percentile for the Z-score in the normal probability table 5 – do you need to subtract from 1? Always verify that the final answer makes sense with the picture you drew.
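
Steps 3 to 5 can be sketched in code, with `NormalDist().cdf` standing in for the normal probability table. The function name and the example values (mu = 100, sigma = 15, cutoff 130) are assumptions for illustration; the picture from steps 1 and 2 is still worth drawing by hand.

```python
from statistics import NormalDist

def upper_tail_probability(cutoff, mu, sigma):
    """P(X > cutoff) for X ~ N(mu, sigma)."""
    z = (cutoff - mu) / sigma          # step 3: Z-score of the cutoff
    percentile = NormalDist().cdf(z)   # step 4: area to the LEFT of z
    return 1 - percentile              # step 5: upper tail, so subtract from 1

p = upper_tail_probability(130, mu=100, sigma=15)
print(round(p, 4))
```

A cutoff two standard deviations above the mean leaves only a small area to its right, so an answer of about 0.023 agrees with the shaded picture.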

Answer 114

Z-score is a statistical measurement that describes a value's relationship to the mean of a group of values. Z-score is measured in terms of standard deviations from the mean.

Answer 115

Nominal categorical variables - have no implied order. Ordinal categorical variables - have a natural ordering.

Answer 116

- Statistical tests - Visualisation techniques. Ideally, use both.

Answer 117

- Shapiro-Wilk test - Kolmogorov-Smirnov test - Anderson-Darling test, etc.

Answer 118

Statistical tests are very sensitive to the presence of outliers. If a certain number of outliers are present in a normally distributed data set, statistical tests may report that the data set is not drawn from a normal distribution. Visualisation techniques may help overcome this problem.

Answer 119

- Histograms with the best fitting normal curve overlaid on the plot - The normal probability plot (quantile-quantile plot or QQ plot)

Answer 120

The normal probability plot.

Answer 121

This is used to visualise normality assessment. The sample mean and SD are used as the parameters for the best fitting normal curve. The closer the curve is to the histogram, the more reasonable the normal model assumption is.

Answer 122

This is used to visualise normality assessment. Data are plotted on the y-axis of the plot and theoretical quantiles (following normal distribution) are plotted on the x-axis. The closer the points are to a perfect straight line, the more confident we can be that the data follow the normal model.

Answer 123

A smaller sample size will show more variability around the curve. A larger sample size increases the confidence.

Answer 124

A curve closer to the histogram means it is more reasonable to assume the data is normally distributed.

Answer 125

Points bend up and to the left of the line.

Answer 126

Points bend down and to the right of the line.

Answer 127

Perform further analysis, eg different visualisations or investigating if and why there are outliers.

Answer 128

Short tails (narrower than the normal distribution) - points follow an S-shaped curve.

Answer 129

Long tails (wider than the normal distribution) - points start below the line, bend to follow it, and end above it.

Answer 130

To draw conclusions about and assess population parameters for a specific population based on a sample of data taken from that population.

Answer 131

Sample statistics (mean, proportions etc) are used as point estimates for the unknown population parameters of interest, as it is difficult (or impossible) to collect data from the complete population.

Answer 132

In statistics, a point estimate is a single value that is calculated from sample data to estimate an unknown population parameter. It is a "best guess" or "best estimate" of the population parameter. They generally vary from one sample to another and this sampling variation suggests our estimates may be close, but not exactly the true population parameter.

Answer 133

This sampling variation suggests that the estimate is not exactly equal to the true population parameter.

Answer 134

The distribution of point estimates based on samples of a fixed size from a certain population.

Answer 135

The central "balance" point of a sampling distribution is its mean. The standard deviation of a sampling distribution is referred to as a standard error.

Answer 136

The standard deviation of a sampling distribution. It reflects the fact that probabilities are no longer tied to raw measurements/observations, but rather to a quantity calculated from a sample of such observations. The standard error of an estimate describes how far a typical point estimate is from the true population parameter, eg how far the typical sample mean is from the actual population mean.

Answer 137

The standard deviation measures the variability of individual data points within the sample. The standard error measures how far the point estimate typically is from the population parameter.

Answer 138

If a sample consists of at least 30 independent observations and the data are not strongly skewed, then the distribution of the sample mean is approximated well by the normal distribution. The central limit theorem says that the sampling distribution of the mean will be approximately normal, as long as the sample size is large enough. Regardless of whether the population has a normal, Poisson, binomial, or any other distribution, the sampling distribution of the mean will be approximately normal.
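
A small simulation can make the central limit theorem concrete. This sketch assumes a strongly right-skewed population (exponential with mean 1 and SD 1), a sample size of 40 and a fixed random seed; all of these are arbitrary choices for illustration.

```python
import math
import random
import statistics

def simulate_sample_means(n, repeats, seed=0):
    """Draw `repeats` samples of size n and return each sample's mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(repeats)
    ]

n = 40
means = simulate_sample_means(n, repeats=2000)

# Centre of the sampling distribution: close to the population mean (1).
print("mean of sample means:", statistics.fmean(means))
# Its standard deviation is the standard error, close to sigma / sqrt(n).
print("SD of sample means:  ", statistics.stdev(means))
print("sigma / sqrt(n):     ", 1 / math.sqrt(n))
```

Plotting a histogram of `means` would show a roughly bell-shaped distribution even though the individual observations are heavily skewed.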

Answer 139

[See flashcard]

Answer 140

Independence - sample observations must be independent. Sample size/skew - either the population distribution is normal, or if the population distribution is skewed, the sample size is large.

Answer 141

- Random sampling / assignment is used - If sampling without replacement, n is less than 10% of the population

Answer 142

The more skewed the population distribution, the larger the sample size we need for the CLT to apply. For moderately skewed distributions, n > 30 is a widely used rule of thumb.

Answer 143

We can check it using the sample data and assume that the sample mirrors the population.

Answer 144

- Independence - sampled observations must be independent. - Sample size / skew - at least 10 success and 10 failure observations. eg for the marathon example, at least 10 who ran < 2 hours and 10 who ran > 2 hours

Answer 145

It is very likely that we will not capture the exact population parameter. Instead, if we report a range of the plausible values, we have a good chance to capture a true population parameter. A plausible range of values for the population parameter is called a confidence interval.

Answer 146

A plausible range of values for the population parameter. They may be constructed in different ways, depending on the type of statistic and therefore the shape of the corresponding sampling distribution.

Answer 147

[See flashcard]

Answer 148

Z* is the critical value and can have a different value depending on the confidence level.

Answer 149

The margin of error. For a given sample the margin of error changes as the confidence level changes.

Answer 150

Adjust Z* in the formula

Answer 151

95% confidence interval: Z* = 1.96. 99% confidence interval: Z* = 2.58.

Answer 152

Use the normal Z-table, eg to find the Z* needed to be 96% confident.
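
Instead of a printed table, the critical value can be sketched with the inverse normal CDF. The interval form used here, point estimate ± Z* x standard error, is the usual normal-based construction and is an assumption of this sketch; the function names and example numbers are hypothetical.

```python
from statistics import NormalDist

def critical_value(confidence):
    """Z* that leaves (1 - confidence) / 2 in each tail of N(0, 1)."""
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)

def confidence_interval(point_estimate, standard_error, confidence=0.95):
    """Normal-based interval: point estimate +/- margin of error."""
    margin = critical_value(confidence) * standard_error
    return point_estimate - margin, point_estimate + margin

for level in (0.95, 0.96, 0.99):
    print(level, round(critical_value(level), 2))

print(confidence_interval(point_estimate=50.0, standard_error=2.0))
```

A higher confidence level gives a larger Z*, so the margin of error grows and the interval becomes wider.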

Answer 153

The confidence interval needs to increase ie become wider. This will increase our confidence level. Too wide an interval may not be very informative.

Answer 154

It may not be very informative eg weather example.

Answer 155

We are XY% (eg 95%) confident that the true population parameter is between the lower bound (l) and upper bound (u) of our confidence interval.

Answer 156

Confidence intervals try to capture the population parameter - they say nothing about the confidence of capturing individual observations, a proportion of observations or about capturing point estimates.

Answer 157

1 - formulation of the practical problem in terms of statistical hypotheses 2 - construction of a test statistic 3 - description of a critical region and/or the calculation of the p-value 4 - significance level or size of the test 5 - further assessment

Answer 158

The null hypothesis H0 represents what we currently hold as true. H0 is basically a standard with which the evidence for HA can be compared. One-sample: there is no difference from our previous knowledge (maintenance of status quo) Two-sample: there is no difference between the populations being compared.

Answer 159

HA represents what we want to test. It expresses the range of situations that we wish the test to be able to diagnose. Depending upon the outcome of the test we may take action.

Answer 160

Language - is there enough evidence to reject the null hypothesis (we never accept it). "H0 is rejected in favour of HA" "There is insufficient evidence to reject H0 in favour of HA"

Answer 161

The test statistic is a number calculated from a statistical test of a hypothesis. It shows how closely your observed data match the distribution expected under the null hypothesis of that statistical test. The test statistic is used to calculate the p value of your results, helping to decide whether to reject your null hypothesis. It is a function of the data plus the information in the hypothesis H0.

Answer 162

1 - its probability distribution must be calculable (at least approximately) under the assumption that H0 is true 2 - it should behave differently when H0 is true from when HA is true

Answer 163

A region of values of the test statistic t which support our preference for HA rather than H0

Answer 164

We reject H0 in favour of HA Otherwise, we are unable to reject H0 in favour of HA

Answer 165

We are unable to reject H0 in favour of HA.

Answer 166

Because a lack of information, particularly too little data, tends to result in non-critical values of the test statistic, it is unwise to talk positively about "accepting H0". Lack of strong evidence to reject H0 in favour of HA may indicate that we have not collected enough data to reject it.

Answer 167

Lack of information, particularly too little data, tends to result in non-critical values of the test statistic. Lack of strong evidence to reject H0 in favour of HA may indicate that we have not collected enough data to reject it.

Answer 168

A p-value, or probability value, is the probability of obtaining data at least as extreme as the observed data, assuming the null hypothesis of the statistical test is true. The p-value quantifies the strength of the evidence against the null hypothesis H0 and in favour of the alternative hypothesis HA.

Answer 169

Either H0 is true and an improbable event has occurred, or HA is true.

Answer 170

If the p-value is small, H0 is rejected in favour of HA. If the p-value is not "small", the evidence does not support the rejection of H0 in favour of HA.

Answer 171

- Calculate the p-value - Investigate the test statistic and the critical region

Answer 172

Type 1 - false positive. H0 is rejected when in fact it is true. Type 2 - false negative. H0 is not rejected when in fact it is false.

Answer 173

Would choose a smaller significance level - we would rather have 1 in 100 errors than 5 in 100 errors.

Answer 174

The significance level of a statistical test is the probability of rejecting H0 when in fact it is true, ie of committing a Type 1 error.

Answer 175

Depends on the particular problem and how serious it is if a true H0 is rejected (false positive), eg medical trials.

Answer 176

We will allow 5 incorrect rejections of H0 in every 100 tests where H0 is actually true, ie there is a 5% chance of rejecting H0 when it is in fact true.

Answer 177

P <= 5% (p <= 0.05) – the test is significant at 5% level and H0 is rejected in favour of HA P > 5% (p > 0.05) – the test is not significant at the 5% level and H0 is not rejected in favour of HA
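
The decision rule can be sketched as follows. The p-value here comes from a two-sided test whose test statistic follows N(0, 1) under H0, which is an assumption of this example; the function names and the value z = 2.5 are hypothetical.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """P(|Z| >= |z|) when Z ~ N(0, 1) under H0."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def decide(p_value, alpha=0.05):
    """Apply the rule: reject H0 only when p <= alpha."""
    if p_value <= alpha:
        return "H0 is rejected in favour of HA"
    return "insufficient evidence to reject H0 in favour of HA"

p = two_sided_p_value(2.5)
print(round(p, 4), "->", decide(p))
print(0.20, "->", decide(0.20))
```

Note the wording of the second outcome: H0 is never "accepted", we only fail to reject it.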

Answer 178

- P > 10% - there is no (or very little) evidence for rejecting H0 in favour of HA - 5% < P <= 10% - on the available evidence, we cannot reject H0 in favour of HA but we have some suspicion (ie we would like to obtain more evidence), eg you didn’t reject the null due to a small dataset - 1% < P <= 5% - significant at the 5% level and H0 is rejected in favour of HA. If the decision to change is important, we should probably seek further evidence - 0.1% < P <= 1% - highly significant (significant at the 1% level). There is considerable evidence for rejection of H0 in favour of HA - P <= 0.1% - very highly significant (significant at the 0.1% level). We are very confident that HA is to be preferred to H0

Answer 179

[See flashcard]

Answer 180

[See flashcard]

Answer 181

[See flashcard]

Answer 182

[See flashcard]

Answer 183

The t-distribution, also known as Student’s t-distribution, is a probability distribution that is similar to the normal distribution, with its bell shape, but has heavier (fatter) tails, ie a greater chance of extreme values. It is used for estimating population parameters for small sample sizes or unknown variances.

Answer 184

They are both bell-shaped curves centred at 0. The t-distribution has fatter tails, meaning observations are more likely to fall further away from the mean (over 2 SDs from the mean). The thicker tails are helpful for resolving our problem with a less reliable estimate of the standard error (since n is small).

Answer 185

When the population SD is unknown and we have a small data sample (n<30) we address the uncertainty of the standard error using the t distribution.

Answer 186

It is centred at zero and influenced by one parameter, the degrees of freedom (df). The larger the degrees of freedom, the more closely the t-distribution resembles the standard normal model. When df >= 30, it is nearly indistinguishable from the normal distribution.

Answer 187

Degrees of freedom are the maximum number of logically independent values, which may vary in a data sample. Degrees of freedom are calculated by subtracting one from the number of items within the data sample.

Answer 188

n < 30 - for n >= 30, the t-distribution and the normal distribution are nearly indistinguishable

Answer 189

A t table is a reference statistical table that contains critical values of the t distribution, also known as the t score or t value. Each row represents a t-distribution with different degrees of freedom. The columns correspond to tail probabilities.

Answer 190

[See flashcard]

Answer 191

The Paired Samples t Test compares the means of two measurements taken from the same individual, object, or related units. Each subject has two observations.
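
A minimal sketch of the paired test statistic, assuming the standard form t = mean difference / (SD of differences / sqrt(n)) with n - 1 degrees of freedom. The before/after measurements and the function name are invented for illustration.

```python
import math
import statistics

def paired_t_statistic(first, second):
    """Return (t, degrees of freedom) for paired measurements."""
    # Work on the per-subject differences, not on the two samples separately.
    differences = [b - a for a, b in zip(first, second)]
    n = len(differences)
    standard_error = statistics.stdev(differences) / math.sqrt(n)
    return statistics.mean(differences) / standard_error, n - 1

before = [10, 12, 9, 11]
after = [12, 14, 10, 14]
t, df = paired_t_statistic(before, after)
print(round(t, 3), df)
```

The resulting t would then be compared with the critical value from the t table row for these degrees of freedom.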

Answer 192

[See flashcard]

Answer 193

[See flashcard]

Answer 194

Use the pooled variance in the calculations

Answer 195

[See flashcard]

Answer 196

Goodness-of-fit test for classified data - the distribution of a categorical variable in a sample often needs to be compared with a hypothesised distribution or with the distribution of that variable in another sample. A chi-squared test is a statistical hypothesis test used in the analysis of contingency tables when the sample sizes are large. In simpler terms, it is primarily used to examine whether two categorical variables are independent of each other.

Answer 197

In chi squared tests we don't assume normal distribution.

Answer 198

Each observation is classified into k mutually exclusive and exhaustive classes ie each observation belongs to one and only one class.

Answer 199

The critical region lies in the right-hand tail only. This is because, if H0 is not true, we would expect the expected counts (Ei) to be quite different from the observed counts (Oi), resulting in a larger than expected phi squared value. A small phi squared results when the Ei and Oi are in good agreement - we wouldn't want to reject H0 in this case.

Answer 200

The exact distribution of phi squared is discrete and is approximated by the continuous chi squared distribution. - For this approximation to be reasonable, Ei should be > 5 for each class - If not, combine adjacent classes with the resultant loss of one or more degrees of freedom

Answer 201

Ei should be > 5 for each class. If not, combine adjacent classes with the resultant loss of one or more degrees of freedom.
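
The sketch below computes the usual goodness-of-fit statistic, the sum over classes of (Oi - Ei)^2 / Ei, and enforces the Ei > 5 condition first. The formula is the standard Pearson form and is an assumption here, as are the function name and the counts.

```python
def chi_squared_statistic(observed, expected):
    """Sum of (Oi - Ei)^2 / Ei over all classes."""
    if any(e <= 5 for e in expected):
        # The chi-squared approximation is unreliable for small Ei.
        raise ValueError("Ei should be > 5 for each class; combine adjacent classes")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [18, 22, 20]
expected = [20, 20, 20]  # eg equal proportions under H0
statistic = chi_squared_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1  # k - 1 when no parameters are estimated
print(statistic, degrees_of_freedom)
```

Good agreement between Oi and Ei gives a small statistic, which falls outside the right-hand critical region.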

Answer 202

Combine adjacent classes with the resultant loss of one or more degrees of freedom.

Answer 203

[See flashcard]

Answer 204

Yates' continuity correction - take the magnitude (absolute value) of each difference Oi - Ei and subtract 1/2 before squaring. [See flashcard]

Answer 205

The chi squared distribution has just one parameter, called the degrees of freedom (df), which influences the shape, centre and spread of the distribution.

Answer 206

Higher degrees of freedom – the distribution shifts to the right and becomes flatter

Answer 207

One important difference from the t-table is that the chi-square table only provides upper tail values

Answer 208

ANOVA, or Analysis of Variance, is a test used to determine differences between results from three or more unrelated samples or groups. ANOVA is used to assess whether the mean of the outcome variable is different for different levels of a categorical variable.

Answer 209

- 2 groups: Z or T statistic - 3 or more groups: Analysis of Variance (ANOVA) test and a new statistic called F

Answer 210

1 - The observations should be independent within and between groups. If the data are a simple random sample from less than 10% of the population, the condition is satisfied. Eg no pairing 2 - The observations within each group should be nearly normal (important when sample sizes are small) 3 - The variability across the groups should be about equal (especially important when the sample sizes differ between groups).

Answer 211

F statistic

Answer 212

Compare to see whether they are so far apart that the observed difference cannot reasonably be attributed to sampling variability.

Answer 213

They are equivalent, but only if we use a pooled standard variance in the denominator of the test statistic.

Answer 214

An overall grand mean

Answer 215

F = variability between sample groups / variability within sample groups

Answer 216

A large F statistic is needed for the p-value to be small enough to reject H0. A large F statistic means the variability between sample groups is greater than the variability within sample groups.

Answer 217

- Group: dfg = k - 1 - Total: dft = n - 1 - Error: dfe = dft - dfg, ie the difference between the total and the group degrees of freedom

Answer 218

- SSG: sum of squares between groups, measures the variability between the groups [see flashcard] - SST: sum of squares total, measures the total variability in the dataset [see flashcard] - SSE: sum of squares error, measures variability within groups. SSE = SST - SSG

Answer 219

From F-tables, find the F* value as the value from the column dfg and the row dfe. If F > F*, it is in the critical region, so the result is significant and at least one group mean is different. The P value can also be computed. A larger F value corresponds to a smaller P value, therefore if F > F* then P < alpha (eg 0.05).

Answer 220

The mean square. Calculated for the group row and the error row as sum of squares / degrees of freedom.
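
The pieces of the ANOVA table (sums of squares, degrees of freedom, mean squares and F) can be put together in a short Python sketch. The SSG and SST formulas used here are the standard textbook definitions and are assumed to match the ones on the referenced flashcards; the function name and the toy data are hypothetical.

```python
from statistics import mean


def one_way_anova(groups):
    """Return the entries of a one-way ANOVA table for a list of samples."""
    all_values = [x for g in groups for x in g]
    n, k = len(all_values), len(groups)
    grand_mean = mean(all_values)

    # SSG: spread of the group means around the grand mean, weighted by group size.
    ssg = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # SST: spread of every observation around the grand mean.
    sst = sum((x - grand_mean) ** 2 for x in all_values)
    sse = sst - ssg

    df_g, df_t = k - 1, n - 1
    df_e = df_t - df_g
    msg, mse = ssg / df_g, sse / df_e  # mean square = sum of squares / df
    return {"SSG": ssg, "SSE": sse, "SST": sst,
            "dfG": df_g, "dfE": df_e, "dfT": df_t,
            "MSG": msg, "MSE": mse, "F": msg / mse}


groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # toy data, three groups
table = one_way_anova(groups)
```

For this toy data the between-group variability is much larger than the within-group variability, so F is large.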

Answer 221

Use common variance (MSE from the ANOVA table) instead of each group's variances in the calculation of the SE. Use common degrees of freedom (dfE from the ANOVA table). Use a modified significance level, this resolves the issue of increasing the type I error rate if we run too many tests (false positives).

Answer 222

Multiple comparisons

Answer 223

The Bonferroni correction, which is a more stringent significance level. alpha* = alpha / K, where K is the number of comparisons being considered. For k groups, K = k(k-1) / 2

Answer 224

alpha* = alpha / K, where K is the number of comparisons being considered. For k groups, K = k(k-1) / 2
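
A small worked example of the Bonferroni formula in Python (the function name and the choice of alpha = 0.05 with k = 4 groups are illustrative assumptions):

```python
def bonferroni_alpha(alpha, k):
    """Return (K, alpha*) for all pairwise comparisons among k groups."""
    n_comparisons = k * (k - 1) // 2   # K = k(k-1) / 2
    return n_comparisons, alpha / n_comparisons


K, alpha_star = bonferroni_alpha(0.05, 4)
```

With four groups there are six pairwise comparisons, so each one is tested at 0.05 / 6 rather than 0.05.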

Answer 225

[see flashcard]

Answer 226

Linear regression is a statistical technique that can be used for prediction and evaluating whether there is a linear relationship between two numerical variables x and y. Linear regression assumes that the relationship between two variables can be modelled by a straight line

Answer 227

y = B0 + B1x, where: - x: predictor variable (explanatory variable, independent variable) - y: response variable (dependent variable) - B0: intercept (expected value of the response variable when the predictor is 0) - B1: slope parameter (the change in the mean response for each one-unit increase in the predictor)

Answer 228

The predictor x has no effect on the value of the response y

Answer 229

Using data - these are point estimates b0 and b1

Answer 230

y_hat indicates it is a collection of estimated (predicted) observations of observed variable y, based on the input collection of predictor observations x

Answer 231

y_hat = b0 + b1x

Answer 232

Residuals (epsilon). n is the same: there are as many residuals as data points.

Answer 233

The differences between the observed and estimated values.

Answer 234

The difference between the observed response (yi) and the response we would predict based on the model fit (y_hati): Ei = yi - y_hati
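
The residual definition can be illustrated with a few lines of Python. The coefficients b0 and b1 below are hypothetical fitted values for the toy data, chosen only for the example.

```python
def residuals(x, y, b0, b1):
    """Observed response minus predicted response, one value per data point."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6   # hypothetical fitted intercept and slope for this toy data
res = residuals(x, y, b0, b1)
```

There is one residual per observation, so `res` has the same length as `x` and `y`.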

Answer 235

The residuals are small. The best-fitting regression line is the line that has the smallest possible residuals. A poorly fitting regression line has large residuals.

Answer 236

Ordinary least squares regression (OLS)

Answer 237

OLS - ordinary least squares regression. The goal is to find the line that minimises the least squares criterion, ie minimises the sum of the squared residuals [see flashcard]. The line that minimises this least squares criterion is usually called the least squares line.
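
For a single predictor, the line that minimises the sum of squared residuals has a well-known closed form. The sketch below assumes that standard form (slope = Sxy / Sxx, intercept = y_bar - slope * x_bar); the function name and the toy data are hypothetical.

```python
from statistics import mean


def least_squares_line(x, y):
    """Return (b0, b1) of the line minimising the sum of squared residuals."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx             # slope estimate
    b0 = y_bar - b1 * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return b0, b1


x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares_line(x, y)
```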

Answer 238

The least squares line

Answer 239

[see flashcard]

Answer 240

The data should show a linear trend. If there is a nonlinear trend, an advanced regression method should be applied.

Answer 241

- Linearity - Nearly normal residuals - Constant variability

Answer 242

We can use input values of x to get predicted values y_hat. With a fitted simple linear model, you're able to calculate a point estimate y_hati of the mean response value yi.

Answer 243

Generally, the residuals must be nearly normal. When this condition is found to be unreasonable, it is usually because of outliers or concerns about influential points. Residuals are normally distributed if they are scattered around 0 with uniform variance.

Answer 244

The variability of the points around the least squares line remains roughly constant

Answer 245

We want to determine how good our model is. One approach is using the coefficient of determination R^2. R^2 describes the proportion of the variation in the response that can be attributed to the predictor, ie is explained by the least squares line. Formula [ see flashcard ]. If we can calculate how much of the variance is left in the residuals, we can calculate how much is explained by the model.

Answer 246

We want to determine how good our model is. One approach is using the coefficient of determination R^2. R^2 describes the proportion of the variation in the response that can be attributed to the predictor ie is explained by the least squares line.
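
One common way to compute R^2 is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below assumes that form; the function name and the toy values (including the predicted values y_hat) are hypothetical.

```python
from statistics import mean


def r_squared(y, y_hat):
    """Proportion of the variation in y explained by the predictions y_hat."""
    y_bar = mean(y)
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)               # total
    return 1 - ss_res / ss_tot


y = [2, 4, 5, 4, 5]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]   # hypothetical predictions from a fitted line
r2 = r_squared(y, y_hat)
```

In this toy example 60% of the variation in the response is explained by the line.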

Answer 247

Descriptive analysis - this helps to understand how the data is distributed and provides important information for further steps.

Answer 248

By position or by name

Answer 249

Turning raw data into understanding, insight and knowledge

Answer 250

A quantity, quality or property that you can measure. (values may vary from measurement to measurement)

Answer 251

- Table column - Field - Attribute - Property - Feature - Vector - Dimension

Answer 252

Numeric Categorical

Answer 253

Variables whose values are recorded as numbers (integer or real values)

Answer 254

Variables whose values are recorded as symbols. Eg - gender Eg - countries

Answer 255

Discrete - numeric variables that may only take on certain (distinct) numeric values. Usually obtained by counting eg people in a class. Synonyms: integer, count. Continuous - numeric variables that may take any real value in some interval. Synonyms: float, double, interval, numeric

Answer 256

Discrete - numeric variables that may only take on certain (distinct) numeric values. Usually obtained by counting eg people in a class. Synonyms: integer, count.

Answer 257

Continuous - numeric variables that may take any real value in some interval. Synonyms: float, double, interval, numeric

Answer 258

Ordinal - categorical variables whose values can be naturally ranked (eg education levels, driving speed categories). Nominal - categorical variables whose values cannot be naturally ranked (eg eye colour, gender)

Answer 259

Ordinal - categorical variables whose values can be naturally ranked (eg education levels, driving speed categories).

Answer 260

Nominal - categorical variables whose values cannot be naturally ranked (eg eye colour, gender)

Answer 261

How we store collections of variables

Answer 262

Univariate dataset: a dataset consisting of measurements that correspond to a single variable. Multivariate dataset: a dataset consisting of measurements that correspond to two or more variables. Most relevant when individual components aren't as useful when considered on their own, eg spatial coordinates. Allows us to think about two or more variables together. Corresponding data analysis: univariate data analysis is the analysis performed on a single variable; multivariate data analysis is the simultaneous analysis of two or more variables.

Answer 263

Measurements made under similar conditions

Answer 264

A set of values, each associated with a variable and an observation. Variables are table columns. Observations are table rows.

Answer 265

Tabular data - a set of values, each associated with a variable and an observation. Tabular data is tidy if each value is placed in its own "cell" - each variable in its own column, each observation in its own row.

Answer 266

Defined by the number of observations (rows) in the table

Answer 267

Defined by the number of variables (columns) in the table

Answer 268

Size - observations (row) Dimensionality - variables (columns)
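
As a tiny illustration, a tidy table can be held in Python as a list of rows, each row a dictionary keyed by variable name. The variable names and values below are made up for the example.

```python
# Each dict is one observation (row); each key is one variable (column).
rows = [
    {"name": "Ann", "height_cm": 170, "eye_colour": "blue"},
    {"name": "Ben", "height_cm": 182, "eye_colour": "brown"},
    {"name": "Cal", "height_cm": 175, "eye_colour": "green"},
    {"name": "Dee", "height_cm": 168, "eye_colour": "brown"},
]

size = len(rows)               # number of observations (rows)
dimensionality = len(rows[0])  # number of variables (columns)
```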

Answer 269

The (usually) large pool of observational units that we are interested in.

Answer 270

A smaller collection of observational units selected from the population.

Answer 271

Sampling refers to the process of selecting observations from a population. Sampling strategies: - Simple random sampling - Stratified sampling - Cluster sampling - Multistage sampling

Answer 272

- Simple random sampling - Stratified sampling - Cluster sampling - Multistage sampling

Answer 273

It doesn't make sense to collect data for the whole population - it is probably impossible to collect and calculate the actual population mean so we need a sample.

Answer 274

A sample is said to be a representative sample if the characteristics of the observational units selected are a good approximation of the characteristics of the original population. Meal analogy.

Answer 275

Bias corresponds to a favouring of one group in a population over another group

Answer 276

Generalisability refers to the largest group about which it makes sense to make inferences from the sample collected. This is directly related to how the sample was selected.

Answer 277

Parameters and statistics are calculations based on the population and sample respectively. - Population - parameter - Greek letters - Sample - statistic - Latin (Roman) letters - The differences are denoted in the notation used

Answer 278

A calculation based on one or more variables measured in the population. Denoted by Greek letters.

Answer 279

A calculation based on one or more variables measured in the sample. Denoted by lower case Latin (Roman) letters (sometimes in combination with other symbols)

Answer 280

A sampling strategy where the individuals are selected from the list of units in the population, by means of some random process, in such a way that each individual has an equal chance of being selected. Eg random number tables or pseudo-random number generators. Selection can be performed sequentially (one at a time without replacement, so that at each stage, remaining individuals in the population have the same probability of being selected).

Answer 281

In simple random sampling, selection can be performed sequentially. Individuals can be selected from the population one at a time without replacement, so that at each stage, remaining individuals in the population have the same probability of being selected.
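
Simple random sampling without replacement can be sketched with Python's standard `random` module. The population of 100 unit IDs and the seed are assumptions made so the example is repeatable.

```python
import random

population = list(range(1, 101))   # hypothetical unit IDs 1..100
rng = random.Random(42)            # seeded only so the example is repeatable

# sample() draws without replacement, so no unit can be selected twice,
# and every unit has the same chance of being selected.
sample = rng.sample(population, 10)
```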

Answer 282

There is usually an assumption that all observations are independent of each other - replacing them would lose this.

Answer 283

Stratified sampling is a divide-and-conquer sampling strategy. The population is divided into groups called strata. The sample of individuals is then drawn from each stratum using some other random sampling process, usually simple random sampling. Strata are chosen so that units in each stratum are as alike as possible and units in different strata are as different as possible. This sampling strategy is used in cases when it is known that the population is heterogeneous with respect to one or more variables which may have a bearing on the factor being studied. Eg if there was a difference in height by gender, you know to take it into consideration. This ensures things are well represented.
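
A minimal Python sketch of stratified sampling, assuming simple random sampling within each stratum and an equal number of units per stratum (proportional allocation is another common choice). The function name, the stratum labels and the toy population are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, n_per_stratum, rng):
    """Draw a simple random sample of n_per_stratum units from every stratum."""
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)   # divide the population into strata
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(n_per_stratum, len(members))))
    return sample


# Toy population: 30 units labelled "F" and 70 labelled "M".
units = [("F", i) for i in range(30)] + [("M", i) for i in range(70)]
rng = random.Random(0)
sample = stratified_sample(units, lambda u: u[0], 5, rng)
```

Every stratum contributes to the sample, which is how the strategy keeps each group represented.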

Answer 284

Stratified sampling

Answer 285

Strata are chosen so that units in each stratum are as alike as possible and units in different strata are as different as possible.

Answer 286

1 - to increase the accuracy and precision of the overall population estimates. 2 - to ensure that domains of study are adequately represented.

Answer 287

A sampling strategy where the population is divided into many groups, called clusters, and then we sample a fixed number of clusters and include all observations from each of those clusters in the sample. [Clusters are separated based on convenience, not a measure of interest ie the measure of interest is not why you're in that cluster] Eg divide the class into tables and pick a sample of two tables.
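
The class-tables example can be sketched in Python. The table names, student names and function name are made up for illustration.

```python
import random


def cluster_sample(clusters, n_clusters, rng):
    """Pick n_clusters clusters at random and keep every unit in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return chosen, [unit for name in chosen for unit in clusters[name]]


# Hypothetical class divided into tables (the clusters).
clusters = {
    "table_1": ["Ann", "Ben", "Cal"],
    "table_2": ["Dee", "Eli"],
    "table_3": ["Fay", "Gus", "Hal", "Ivy"],
    "table_4": ["Jo", "Kim", "Lee"],
}
chosen, sample = cluster_sample(clusters, 2, random.Random(1))
```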

Answer 288

A sampling strategy where the population is divided into many groups, called clusters, and then we sample a fixed number of clusters and collect a random sample within each selected cluster. Similar to cluster sampling (but rather than keeping all observations in each cluster, we collect a random sample within each selected cluster)
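
A two-stage version can be sketched in Python: first choose clusters at random, then take a simple random sample inside each chosen cluster. The neighbourhood names, household IDs and function name are hypothetical.

```python
import random


def multistage_sample(clusters, n_clusters, n_per_cluster, rng):
    """Stage 1: pick clusters. Stage 2: sample units within each picked cluster."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        members = clusters[name]
        sample.extend(rng.sample(members, min(n_per_cluster, len(members))))
    return chosen, sample


# Hypothetical neighbourhoods, each with 20 household IDs.
names = ["north", "south", "east", "west", "centre"]
clusters = {name: [f"{name}_{i}" for i in range(20)] for name in names}
chosen, sample = multistage_sample(clusters, 3, 4, random.Random(7))
```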

Answer 289

Sometimes it can be more economical than the alternative sampling techniques. They are most helpful when there is a lot of case-to-case variability within the cluster, but the clusters themselves don't look very different from one another eg neighbourhoods as clusters

Answer 290

More advanced analysis techniques are typically required.

Answer 291

The situation, time and money. Simple random sampling may be the best to get representation but it can be expensive. Multistage sampling can reduce the costs without reducing reliability.

Answer 292

1 - Collect data, process it and clean it. 2 - EXDA and use of machine learning, algorithms and statistical models. 3 - Communicate, visualisations and report findings [which leads to making decisions]. 4 - Build data product. Data analysis is a cyclical process - once you build the data product, more data becomes viable.

Answer 293

A creative process of exploring data sets for patterns and relationships. Starting with lots of visualisations and summaries is a good idea.

Answer 294

1 - Develop an understanding about data by formulating questions 2 - Search for answers using visualisation techniques and summary statistics 3 - use answers obtained to refine questions and/or generate new questions

Answer 295

Using visualisations and summary techniques - Visualise distributions of all variables (using box plots and histograms) - Visualise time series of data - Investigate all pairwise relationships between variables using scatterplots - Perform data cleaning and variable transformation - Perform summary statistics (mean, median, lower and upper quartiles, minimum and maximum values, identify missing data, errors and outliers)

Answer 296

Start simple. It is difficult to ask revealing questions at the start of analysis as you do not know what insights are hidden in your dataset. There are no universal rules about which questions to ask to guide research. Useful starting points: - What type of variation occurs within my variables? - What is the relationship between variables?

Answer 297

Statistics used to quantitatively describe a collection of measurements by summarising them in the form of a single value

Answer 298

Summary statistics: - Measures of centrality (mean, mode, median) ie the most typical values - Measures of variability (variance, standard deviation, range, quantiles, five number summary) ie the spread of the data Visualisation techniques: - Histograms - Boxplots
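
The summary statistics listed above can be computed with Python's standard `statistics` module. The data are made up, and the quartiles use the module's "inclusive" method; other quartile conventions can give slightly different values.

```python
import statistics

data = [1, 3, 3, 5, 7, 9, 11, 13, 20]   # toy measurements

# Measures of centrality
centre = {
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "mode": statistics.mode(data),
}

# Measures of variability (sample variance and standard deviation)
spread = {
    "variance": statistics.variance(data),
    "sd": statistics.stdev(data),
    "range": max(data) - min(data),
}

# Quartiles and the five number summary
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
five_number = (min(data), q1, q2, q3, max(data))
iqr = q3 - q1
```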

Answer 299

Summary statistics and visualisation techniques Numeric: - Measures of centrality - Measures of variability - Histograms and box plots Categorical: - Counts - Percentage - Proportions - Bar charts

Answer 300

Summary statistics - Counts - Percentages - Proportions Visualisation techniques - Bar charts

Answer 301

Summary statistics - Covariance and correlation (N-N) - Contingency tables (C-C) Visualisation techniques - Scatterplots (N-N) - Paired boxplots (N-C) - Paired histograms (N-C) - Mosaic plots (C-C)
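
Two of the summaries above can be sketched in plain Python: sample covariance and correlation for a numeric pair (N-N), and a contingency table of counts for a categorical pair (C-C). The helper names and the toy data are hypothetical.

```python
from collections import Counter
from math import sqrt
from statistics import mean


def covariance(x, y):
    """Sample covariance (divides by n - 1)."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))


# Numeric vs numeric
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
cov_xy = covariance(x, y)
corr_xy = correlation(x, y)

# Categorical vs categorical: count each (eye colour, degree) combination
eye = ["blue", "brown", "brown", "blue", "green"]
degree = ["BSc", "BSc", "MSc", "MSc", "BSc"]
contingency = Counter(zip(eye, degree))
```

Each key of `contingency` is one cell of the contingency table and its value is the count in that cell.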

Data Analytics Theory Flashcards

(333 cards)