Data Summary Flashcards

1
Q

What is quantitative data?

A

Quantitative data measure some quantity resulting in a numerical value, e.g. weight, salary.

2
Q

What is qualitative data?

A

Qualitative data measure the quality of something resulting in a value that does not have a numerical meaning, e.g. colour, religion, season.

3
Q

What is discrete quantitative data?

A

Discrete: data whose possible values form a distinct series of numbers, typically counts (e.g. number of traffic accidents, number of children born to a woman).

4
Q

What is continuous quantitative data?

A

Continuous: data whose values can be measured ever more precisely, so that any value within a range is possible (e.g. height, speed).

5
Q

What is ordinal qualitative data?

A

Ordinal: non-numeric value but the values have some natural ordering; e.g. poor, fair, good, excellent.

6
Q

What is nominal qualitative data?

A

Nominal: unordered, distinct by name only; e.g. retail, construction, manufacturing.

7
Q

What is a frequency distribution?

A

A frequency distribution summarizes discrete variables or qualitative data by counting how often each value occurs.

8
Q

What is the mode?

A

The mode is the most frequently occurring value in a dataset.
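As a minimal sketch of the two cards above, the following Python snippet builds a frequency distribution and reads the mode off it. The household data and the variable names are made up for illustration.

```python
from collections import Counter

# Hypothetical sample: number of children per household (discrete data).
children = [0, 2, 1, 2, 3, 2, 0, 1, 2, 4]

# Frequency distribution: how often each distinct value occurs.
freq = Counter(children)
for value in sorted(freq):
    print(value, freq[value])

# Mode: the value with the highest frequency.
mode_value, mode_count = freq.most_common(1)[0]
print("mode:", mode_value, "occurs", mode_count, "times")
```

If two values tied for the highest count, `most_common(1)` would report only one of them, so a bimodal sample needs a check of the full table.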

9
Q

What is a bimodal distribution?

A

A bimodal distribution has two distinct peaks in the frequency of values.

10
Q

What are the 3 measures of centre in statistics?

A

mode
mean
median

11
Q

What are the 4 measures of spread?

A

range
interquartile range (IQR)
sample variance
standard deviation

12
Q

Why is it important to know both the centre and spread of a dataset?

A

Knowing both provides a better understanding of the data’s behavior. The center gives us a “typical” value, while the spread tells us how much variability or dispersion exists in the data.

13
Q

What are the population mean and the sample mean?

A

The population mean is a parameter (𝜇), which is typically unknown.

Because 𝜇 is unknown, we take a sample and use the sample mean (𝜇̂) as an estimate of it.

14
Q

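How do you find the position of the sample median for odd and even sample sizes?

A

odd 𝑛: the median is the value at position (𝑛 + 1)/2 in the sorted data
even 𝑛: the median is the average of the values at positions 𝑛/2 and (𝑛 + 2)/2
𝑛: sample size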
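The position rule can be sketched in Python as follows. The function name `sample_median` and the example values are assumptions for illustration; positions on the card count from 1, while Python indexes from 0.

```python
def sample_median(values):
    """Return the sample median of a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: position (n + 1) / 2, which is zero-based index n // 2.
        return ordered[mid]
    # Even n: average the values at positions n / 2 and n / 2 + 1.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_median = sample_median([7, 3, 9, 1, 5])   # sorted: 1 3 5 7 9
even_median = sample_median([7, 3, 9, 1])     # sorted: 1 3 7 9
print(odd_median, even_median)
```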

15
Q

What is the range?

A

The range is the difference between the maximum and minimum value.

16
Q

What is one disadvantage of the range?

A

It depends only on the two most extreme values, so it can be misleading if one value is very different from the rest (an outlier).

17
Q

What is an outlier?

A

An outlier is a value that is very different to the other values recorded.
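A short sketch of why a single outlier can make the range misleading. The salary figures (in thousands) and the helper name `value_range` are hypothetical.

```python
def value_range(values):
    """Range: maximum value minus minimum value."""
    return max(values) - min(values)


salaries = [31, 33, 35, 36, 38]    # hypothetical salaries, in thousands
with_outlier = salaries + [250]    # the same sample plus one extreme value

range_plain = value_range(salaries)
range_outlier = value_range(with_outlier)
print(range_plain, range_outlier)
```

Five of the six values still lie within 7 of each other, yet the range jumps from 7 to 219 because of one observation.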

18
Q

What are percentiles and how are they used?

A

Percentiles: Values that divide the dataset into 100 equal parts.

25th percentile (lower quartile or 1st quartile): 25% of data lies below it.

75th percentile (upper quartile or 3rd quartile): 75% of data lies below it.

19
Q

What is the interquartile range?

A

The difference between the 75th percentile and 25th percentile, representing the spread of the middle 50% of data.
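The quartiles and IQR can be sketched with Python's standard `statistics` module. The data are made up, and the choice of `method="inclusive"` is an assumption: several quartile conventions exist, and they can give slightly different values on small samples.

```python
import statistics

# Hypothetical sample of nine measurements.
data = [2, 4, 5, 7, 8, 9, 11, 12, 15]

# n=4 splits the data into quarters and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print("25th percentile:", q1)
print("median:", q2)
print("75th percentile:", q3)
print("IQR:", iqr)
```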

20
Q

What is the population variance formula?

A

𝜎² = ∑(𝑦𝑖 - 𝜇)² / 𝑁

𝑁: population size
𝑦𝑖: each value
𝜇: population mean

21
Q

What does variance measure?

A

Variance measures how far the data are spread around the population mean (𝜇), as the average squared deviation from it.

22
Q

What is sample variance and how is it different from population variance?

A

Sample variance measures the spread of the data around the sample mean (𝜇̂).
It divides by (𝑛 - 1) instead of 𝑁 to correct for bias in estimating the population variance.

23
Q

What is the sample variance formula?

A

𝑠² = ∑(𝑦𝑖 - 𝜇̂)² / (𝑛 - 1)

where 𝑛 - 1 is the degrees of freedom.

24
Q

Why do we use the standard deviation?

A

Variance is expressed in squared units, so we take its square root to get a measure of spread in the original units of the data.

25
Q

What is the standard deviation formula?

A

𝑠 = √(𝑠²), the square root of the sample variance.

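The variance and standard deviation formulas on the cards above can be checked by hand in Python. This is a sketch on made-up data; treating the same list first as a whole population and then as a sample is only for comparison.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical measurements
n = len(data)
mean = sum(data) / n

# Sum of squared deviations from the mean, shared by both formulas.
squared_devs = sum((y - mean) ** 2 for y in data)

pop_var = squared_devs / n          # population variance: divide by N
samp_var = squared_devs / (n - 1)   # sample variance: divide by n - 1
samp_sd = math.sqrt(samp_var)       # standard deviation: back to original units

print(pop_var, samp_var, samp_sd)

# The standard library gives the same results.
print(statistics.pvariance(data), statistics.variance(data), statistics.stdev(data))
```

The sample variance is always a little larger than the population-style figure on the same data, because it divides by the smaller number 𝑛 - 1.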
26
Q

What is a bar plot and when is it used?

A

A bar plot represents frequency information across discrete categories or groups. The height of each bar corresponds to the count or proportion of observations in that category.

27
Q

Why are pie charts useful?

A

Pie charts are useful for displaying frequency distributions across different groups: each slice shows a group's share of the total.

28
Q

What is a histogram and what does it show?

A

A histogram displays continuous data by grouping values into bins. The x-axis represents the data bins, and the y-axis represents frequency. It helps visualize the center, spread, and skewness of the data.

29
Q

How do you find the median in a histogram?

A

The median is the point where 50% of the area of the histogram lies to the left and 50% to the right.

30
Q

What is skewness?

A

Skewness is a measure of asymmetry about the mean.

31
Q

How can you tell if data is skewed using a histogram?

A

Right (positive) skewed: long right tail, mean > median.
Left (negative) skewed: long left tail, mean < median.
Symmetric distribution: mean = median.

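The mean-versus-median comparison can be sketched in Python. The data and the helper name `skew_hint` are hypothetical, and the comparison is a rule of thumb rather than a formal skewness statistic.

```python
import statistics


def skew_hint(values):
    """Compare mean and median as a rough indication of skew."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "right (positive) skew suggested"
    if mean < median:
        return "left (negative) skew suggested"
    return "no skew suggested (mean equals median)"


right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]     # one large value stretches the right tail
left_tailed = [-y for y in right_tailed]     # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5]

for sample in (right_tailed, left_tailed, symmetric):
    print(skew_hint(sample))
```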
32
Q

How do you convert frequency to density in a histogram?

A

Density in interval 𝑖 = Frequency in interval 𝑖 / (Bin width × Total number of observations)

This standardizes the histogram so that the total area sums to 1, making it easier to compare different distributions.

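A sketch of the frequency-to-density conversion, using made-up heights in centimetres and equal-width bins. The bin start, bin width and variable names are assumptions chosen for the example.

```python
# Hypothetical heights in cm.
heights = [152, 158, 161, 164, 166, 169, 171, 173, 178, 185]
bin_start = 150   # left edge of the first bin
bin_width = 10
n = len(heights)

# Count the observations falling in each bin, keyed by the bin's left edge.
counts = {}
for h in heights:
    left_edge = bin_start + bin_width * ((h - bin_start) // bin_width)
    counts[left_edge] = counts.get(left_edge, 0) + 1

# Density = frequency / (bin width * total number of observations).
densities = {left: c / (bin_width * n) for left, c in counts.items()}

# Each bar's area is density * bin width; the areas should sum to 1.
total_area = sum(d * bin_width for d in densities.values())

for left in sorted(counts):
    print(left, left + bin_width, counts[left], densities[left])
print("total area:", total_area)
```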
33
Q

What information does a box plot convey?

A

The box runs from the 25th percentile (lower edge) to the 75th percentile (upper edge), so it spans the IQR.
The median is drawn as a line inside the box.
Whiskers extend either to the minimum and maximum values or to the most extreme values lying within 1.5×IQR of the box.
Values beyond the whiskers are plotted individually as outliers.

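The numbers behind a box plot with 1.5×IQR whiskers can be sketched as below. The data are invented, and the `inclusive` quartile method is an assumption; plotting libraries may use a different quartile convention.

```python
import statistics

# Hypothetical sample with one extreme value.
data = [2, 4, 5, 7, 8, 9, 11, 12, 40]

q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences: the furthest a whisker may reach from the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences.
inside = [y for y in data if lower_fence <= y <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Anything outside the fences is drawn as an individual outlier point.
outliers = [y for y in data if y < lower_fence or y > upper_fence]

print("box:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```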
34
Q

What do notched box plots include?

A

Notched box plots show a confidence interval for the median.

35
Q

What are violin plots?

A

A violin plot combines a box plot with a smoothed, sideways histogram.
It displays the median (red dot) and quartiles (box).
It shows the shape of the distribution, which helps in understanding the spread of the data.

36
Q

When should you use a cross-tabulation?

A

Use it when both variables are qualitative, or discrete with a small number of values. It helps summarize relationships between categorical variables.

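A cross-tabulation is a table of counts for each combination of two categorical variables. The sketch below builds one with the standard library; the sector and rating records are invented for illustration.

```python
from collections import Counter

# Hypothetical records: (industry sector, satisfaction rating).
records = [
    ("retail", "good"),
    ("retail", "poor"),
    ("construction", "good"),
    ("retail", "good"),
    ("manufacturing", "fair"),
    ("construction", "poor"),
]

# Count how often each (sector, rating) pair occurs.
table = Counter(records)

sectors = sorted({sector for sector, _ in records})
ratings = ["poor", "fair", "good"]   # ordinal variable, kept in its natural order

# Print one row per sector and one column per rating.
print("sector".ljust(15) + "".join(r.rjust(6) for r in ratings))
for sector in sectors:
    row = "".join(str(table[(sector, r)]).rjust(6) for r in ratings)
    print(sector.ljust(15) + row)
```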
37
Q

How can histograms or box plots compare two variables?

A

If one variable is continuous and the other is discrete, use side-by-side histograms or box plots to compare groups.

38
Q

What is a scatter plot used for?

A

A scatter plot is used to visualize the relationship between two continuous variables by plotting:
Response variable on the y-axis
Explanatory variable on the x-axis
This helps identify trends, correlations, and patterns in data.

39
Q

What is a quilt plot?

A

A quilt plot is used for summarizing relationships between three continuous variables.
The x and y axes form a grid of sections.
Each grid square is colored based on the average value of a third variable (e.g., water depth).
Useful for spatial analysis and heat maps.

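The grid averaging behind a quilt plot can be sketched without any plotting library. The coordinates, depths and cell size below are hypothetical; a real plot would then map each cell's average to a color.

```python
from collections import defaultdict

# Hypothetical observations: (x, y, water depth).
points = [
    (0.5, 0.5, 10.0),
    (0.8, 0.2, 14.0),
    (1.5, 0.5, 20.0),
    (1.2, 1.7, 30.0),
    (0.1, 1.9, 8.0),
]
cell_size = 1.0   # side length of each grid square

# Accumulate the depth total and the number of points in each grid cell.
totals = defaultdict(lambda: [0.0, 0])
for x, y, depth in points:
    cell = (int(x // cell_size), int(y // cell_size))
    totals[cell][0] += depth
    totals[cell][1] += 1

# Average depth per cell: the value that sets the cell's color.
grid_means = {cell: total / count for cell, (total, count) in totals.items()}

for cell in sorted(grid_means):
    print(cell, grid_means[cell])
```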
40
Q

What can be seen from the random component?

A

Its values might follow a recognisable distribution (e.g. Normal). It is used to decide whether the chosen fixed component is useful.

41
Q

What are the two components of data partitioning in linear models?

A

Fixed component: represents the systematic part of the data; can be complex (e.g., includes multiple predictors).
Random component: represents random variation or error; often follows a recognizable distribution (e.g., Normal); helps assess whether the fixed component is useful.
Measurement = Fitted Value + Residual (the residual may be positive or negative)

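The partition into fitted values and residuals can be illustrated with a straight-line fit. This is a sketch under assumptions: the data are invented, and a single-predictor least-squares line stands in for the fixed component, which in practice may be more complex.

```python
# Hypothetical explanatory (x) and response (y) values.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept for the line y = intercept + slope * x.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Fixed component: the fitted values. Random component: the residuals.
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

for yi, fi, ri in zip(y, fitted, residuals):
    # Each measurement is recovered as fitted value + residual.
    print(yi, round(fi, 3), round(ri, 3))
```

Looking at the residuals, for example in a histogram, is one way to judge whether the chosen fixed component describes the data well.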
42
Q

What are key visual summaries in data analysis?

A

Histograms (distribution)
Box plots (spread and outliers)
Scatter plots (relationships)