Topic 3: Numerical Summaries Flashcards

1
Q

What are the main features of numerical summaries?

A
  • Max & min
  • Centre (mean, median)
  • Spread (standard deviation, range, interquartile range)
2
Q

Why do we need to pair centre features with spread?

A

If only a measure of centre is reported, it can lead to misleading interpretations and hasty assumptions about the dataset: two datasets with the same centre can have very different spreads.

3
Q

Why do we need numerical summaries?

A

Numerical summaries reduce all of the data to a single number (or a few numbers). Although this loses a lot of information, it makes communication and comparison much easier.

4
Q

What is the mean and how is it calculated?

A

The mean is the balancing point of the dataset and takes every data point into account. The gaps of the readings above the mean and the gaps of the readings below it cancel each other out.

Mean = sum of the data / number of data points
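
A minimal sketch of the formula and the "balancing point" idea. The readings are made up for illustration only.

```python
# Hypothetical readings, chosen only for illustration.
data = [2, 4, 4, 5, 10]

mean = sum(data) / len(data)      # sum / size
gaps = [x - mean for x in data]   # signed gaps from the mean

print(mean)       # 5.0
print(sum(gaps))  # 0.0: gaps above and below the mean cancel out
```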

5
Q

What is the median and how is it calculated?

A

The median is the middle point of the sorted dataset; its value depends only on the 1 or 2 central points.

If the dataset has an odd number of readings, the median is unique (the middle reading).
If the dataset has an even number of readings, the median can be anywhere between the 2 middle points (by convention, take their average).
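
The odd and even cases can be sketched as below. The function name `median` and the sample values are illustrative assumptions; the even case uses the usual convention of averaging the 2 middle points.

```python
def median(values):
    """Middle value of the sorted data; average of the 2 middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd: unique middle reading
    return (ordered[mid - 1] + ordered[mid]) / 2   # even: average of the 2 middle readings

print(median([7, 1, 3]))     # 3
print(median([7, 1, 3, 5]))  # 4.0
```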

6
Q

When to use mean vs median?

A

The mean is used for fairly symmetric data.
The median is used for skewed data or data with outliers.
If the distribution is bimodal, neither is a suitable summary of the centre.

7
Q

What is standard deviation and how is it calculated?

A

Standard deviation measures how spread out the data is around the mean.

SD = RMS of gaps from the mean = sqrt[mean of (gaps from the mean)^2]
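
A short sketch of the RMS formula, step by step, on made-up data. Note that this is the population SD (dividing by the number of data points); Python's `statistics.pstdev` computes the same quantity, while `statistics.stdev` divides by n - 1 instead.

```python
import math
import statistics

# Hypothetical data, chosen so the numbers come out round.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)                        # 5.0
squared_gaps = [(x - mean) ** 2 for x in data]      # (gaps from the mean)^2
sd = math.sqrt(sum(squared_gaps) / len(data))       # root of the mean of the squares

print(sd)                       # 2.0
print(statistics.pstdev(data))  # library equivalent (population SD)
```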

8
Q

What percentage of the dataset lies within 1 SD, 2 SD, and 3 SD of the mean?

A

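For data with a roughly normal (bell-shaped) distribution:
Within 1 SD of the mean: about 68%
Within 2 SD of the mean: about 95%
Within 3 SD of the mean: about 99.7%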
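
These percentages can be checked against the normal curve. The sketch below assumes the data follows a normal distribution exactly, which real data only approximates.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of a normal distribution lying within k SDs of the mean.
shares = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} SD: {share:.1%}")
# within 1 SD: 68.3%
# within 2 SD: 95.4%
# within 3 SD: 99.7%
```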

9
Q

What is the IQR and how is it calculated?
What does it represent in a boxplot?

A

The IQR is the interquartile range: the range of the middle 50% of the data.

IQR = Q3 - Q1 = 75th percentile - 25th percentile

The IQR is the length of the box in a boxplot.
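
A minimal sketch of the calculation using Python's standard library. Quartile conventions differ between textbooks and software; the `"inclusive"` method used here is one choice, not necessarily the one your course uses, and the data is made up.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # hypothetical readings

# quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q3)  # 3.0 7.0
print(iqr)     # 4.0
```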

10
Q

How does the mean compare to the median for differently shaped datasets?

A

In symmetric data, the mean is close to the median.

In left-skewed data, the small data points in the left tail drag the mean down
–> mean < median

In right-skewed data, the large data points in the right tail drag the mean up
–> mean > median
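
The three cases can be illustrated with tiny made-up datasets, where a single extreme value stands in for a long tail:

```python
import statistics

symmetric = [1, 2, 3, 4, 5]
left_skewed = [-44, 2, 3, 4, 5]   # one small value in the left tail
right_skewed = [1, 2, 3, 4, 50]   # one large value in the right tail

for name, values in (("symmetric", symmetric),
                     ("left skewed", left_skewed),
                     ("right skewed", right_skewed)):
    print(name, "mean:", statistics.mean(values), "median:", statistics.median(values))
```

The median is 3 in all three datasets, while the mean moves toward the tail.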

11
Q

What is the difference between using (mean,SD) and (median,IQR)?

A

(Median, IQR) is more robust than (mean, SD): it is barely affected by outliers, so it is better suited to skewed data.
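
A rough sketch of what "robust" means here: replacing one reading with an outlier changes the mean and SD a lot, but leaves the median and IQR unchanged in this made-up example. The helper `iqr` and the `"inclusive"` quartile method are assumptions for illustration.

```python
import statistics

def iqr(values):
    """Q3 - Q1, using one common quartile convention."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 90]   # last reading replaced by an outlier

for name, values in (("clean", clean), ("with outlier", with_outlier)):
    print(name,
          "| mean:", statistics.mean(values),
          "| SD:", round(statistics.pstdev(values), 2),
          "| median:", statistics.median(values),
          "| IQR:", iqr(values))
```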

12
Q

What are standard units and how are they calculated?

A

Standard units measure how many SDs a data point is above or below the mean.

Standard units = (data point - mean)/SD
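
A small sketch of the conversion, assuming the population SD (RMS of gaps from the mean). The function name `standard_units` and the data are illustrative.

```python
import math

def standard_units(value, values):
    """How many SDs `value` lies above (+) or below (-) the mean of `values`."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))
    return (value - mean) / sd

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data with mean 5 and SD 2

print(standard_units(9, data))  # 2.0  (2 SDs above the mean)
print(standard_units(2, data))  # -1.5 (1.5 SDs below the mean)
```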

13
Q

What is the coefficient of variation and how is it calculated?

A

The coefficient of variation is a relative measure of spread: it combines the SD and the mean into 1 summary.

CoV = SD/mean
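
One way to see why this is a relative measure: the same made-up heights in centimetres and in metres have different SDs but the same CoV. The sketch uses the population SD; the function name is an assumption.

```python
import math
import statistics

def coefficient_of_variation(values):
    """SD divided by the mean (population SD assumed)."""
    return statistics.pstdev(values) / statistics.mean(values)

heights_cm = [150, 160, 170, 180, 190]        # hypothetical heights
heights_m = [h / 100 for h in heights_cm]     # same heights in metres

print(statistics.pstdev(heights_cm), statistics.pstdev(heights_m))  # SDs differ by a factor of 100
print(coefficient_of_variation(heights_cm))   # about 0.083
print(coefficient_of_variation(heights_m))    # about 0.083
```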

14
Q

What features are included in a boxplot?

A
  • Q2: median
  • Q1, Q3: 25th and 75th percentiles
  • Lower threshold: LT = Q1 - 1.5*IQR
  • Upper threshold: UT = Q3 + 1.5*IQR
  • Data points lying outside the thresholds are outliers.
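
The threshold rule above can be sketched as follows. The data is made up, and the `"inclusive"` quartile method is one convention among several, so other software may give slightly different quartiles.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 30]   # hypothetical readings

q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
lower_threshold = q1 - 1.5 * iqr
upper_threshold = q3 + 1.5 * iqr

# Points outside [LT, UT] are flagged as outliers.
outliers = [x for x in data if x < lower_threshold or x > upper_threshold]

print(lower_threshold, upper_threshold)  # -3.0 13.0
print(outliers)                          # [30]
```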
15
Q

Describe the differences between quantile and quartile.

A

The set of q-quantiles divides the data into q equal-size sets (in terms of the percentage of data), using q - 1 cut points.

Quartiles are the special case q = 4: they divide the data into quarters.
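
A brief sketch of how cut points relate to groups, using `statistics.quantiles` from the Python standard library on made-up data: asking for n groups returns n - 1 cut points.

```python
import statistics

data = list(range(1, 101))   # hypothetical data: 1, 2, ..., 100

quartile_cuts = statistics.quantiles(data, n=4)    # cut points for 4 groups (quartiles)
decile_cuts = statistics.quantiles(data, n=10)     # cut points for 10 groups (deciles)

print(len(quartile_cuts))  # 3
print(len(decile_cuts))    # 9
```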

16
Q

What are some steps in data wrangling?

A

Sourcing: checking the reliability, integrity, and original source of the data

Scraping: extracting data from any source (web scraping: extracting it from websites)

Cleaning and tidying: producing neat datasets

17
Q

What makes a dataset neat?

A
  • Each variable is a column.
  • Each subject/observation is a row.
  • Each type of observational unit forms a table.
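
A small sketch of tidying, in plain Python. The "wide" table and its column names (`subject`, `week1`, `week2`, `score`) are hypothetical: the week columns hide a variable in the column headers, so the data is reshaped so that each variable is a column and each observation is a row.

```python
# Hypothetical untidy table: one row per subject, one column per week.
wide = [
    {"subject": "A", "week1": 5, "week2": 7},
    {"subject": "B", "week1": 6, "week2": 4},
]

# Neat version: one row per (subject, week) observation,
# with columns subject, week, score.
tidy = [
    {"subject": row["subject"], "week": week, "score": row[week]}
    for row in wide
    for week in ("week1", "week2")
]

for row in tidy:
    print(row)
```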