Notes Flashcards Preview


Flashcards in Notes Deck (56)

One of the most useful and commonly used graphical representations of data is

A histogram.


What does a histogram display?

Frequency, or number, of data points (often called observations) that fall within specified bins.


What are the advantages of histograms?

They allow us to quickly discern trends or patterns in a data set, and they are easy to construct using programs such as Excel.


Key concepts of a histogram

On the horizontal axis is a series of single values, each of which represents a bin, or range of possible values.
On the vertical axis is the frequency of the observations in each bin.
By convention, Excel includes in the range the number represented by the bin label. For example, bin 1 includes all countries with oil consumption less than or equal to 1 million barrels per day (x <= 1); bin 2 includes all countries with oil consumption greater than 1 but less than or equal to 2 million barrels per day (1 < x <= 2).
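The bin convention above can be sketched in Python. This is a minimal illustration, not Excel's implementation: the function names and the consumption figures are made up, and anything at or below the first bin label is assumed to land in the first bin.

```python
import math
from collections import Counter

def bin_label(value, bin_width=1):
    """Return the upper-edge label of the bin that contains value.

    Bins are right-inclusive: with bin_width=1, label 2 covers 1 < x <= 2.
    Assumption: anything at or below the first label goes into the first bin.
    """
    return max(math.ceil(value / bin_width), 1) * bin_width

def histogram_counts(values, bin_width=1):
    """Count how many observations fall in each right-inclusive bin."""
    counts = Counter(bin_label(v, bin_width) for v in values)
    return dict(sorted(counts.items()))

# Hypothetical oil consumption figures, in million barrels per day
consumption = [0.2, 0.7, 1.0, 1.3, 2.0, 2.4, 4.8]
print(histogram_counts(consumption))  # {1: 3, 2: 2, 3: 1, 5: 1}
```

Note that 1.0 is counted in bin 1 and 2.0 in bin 2, because each bin includes its own label. Empty bins (bin 4 here) are simply absent from the result.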


What impact do the bins have on what a histogram reveals about the underlying data?

Using larger bins simplifies our graph, but provides less detail about the distribution. Large bins can prevent us from seeing interesting trends in the data.
Very small bins can create graphs that show such low frequencies that it can also be difficult to discern patterns.
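To see the effect of bin size, the sketch below counts the same made-up data set with three different bin widths. The helper name and the data are assumptions chosen for illustration only.

```python
import math
from collections import Counter

def counts_by_bin(values, bin_width):
    """Right-inclusive bin counts, keyed by each bin's upper edge."""
    counts = Counter(math.ceil(v / bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Made-up values that form two separate clusters
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12, 13, 13, 14]

print(counts_by_bin(data, 20))  # {20: 13}: one large bin hides the two clusters
print(counts_by_bin(data, 5))   # {5: 9, 15: 4}: moderate bins reveal two groups
print(counts_by_bin(data, 1))   # many bins, each holding only 1 to 3 observations
```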


What does it mean that the histogram is skewed?

It means that the histogram has a tail that extends out to one side. The tail is the part of a graph that appears long or "flattens", and has bins with lower frequencies.


What does skewness measure?

The degree of asymmetry of a distribution.


Definition of "right-tailed" and "left-tailed"

If the right tail is longer, we say the distribution is skewed to the right or "right-tailed."
If the left tail is longer, we say the distribution is skewed to the left or "left-tailed."
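One common way to put a number on skewness is the moment coefficient, which is positive for a right-tailed distribution, negative for a left-tailed one, and zero for a perfectly symmetric one. The cards do not say which formula the course uses, so treat this as an assumed choice; the data sets are made up.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5.

    m2 and m3 are the second and third central moments (population form).
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]   # long tail to the right
left_tailed = [-10, -4, -3, -3, -3, -2, -2, -1]  # mirror image, tail to the left
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tailed) > 0)  # True
print(skewness(left_tailed) < 0)   # True
print(skewness(symmetric))         # 0.0
```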


Definition of outlier

Data points that fall far from the rest of the data.
More precisely, a data point that lies more than a specific distance below the lower quartile or above the upper quartile of a data set.
Specifically, a data point that is less than Q1 - 1.5(IQR) or greater than Q3 + 1.5(IQR).


Why do outliers exist?

1. An unusual but valid data point
2. Data entry error
3. The data point was collected in a different manner or at a different time than the rest of the data


Three approaches to deal with outliers

1. Leave it as is
2. Change it to a corrected value
3. Remove it from the data set (very rarely)


The lower quartile

Q1, the 25th percentile--by definition, 25% of all observations fall below Q1.


The upper quartile

Q3, the 75th percentile--by definition, 75% of all observations fall below Q3.


The interquartile range (IQR)

The distance between the upper and lower quartiles.
IQR = Q3 - Q1


The appropriate range

1.5(IQR) = 1.5(Q3-Q1)
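Putting the quartiles, the IQR, and the 1.5(IQR) range together, the sketch below flags outliers in a made-up data set. It uses `statistics.quantiles` with the "inclusive" method as an assumed choice; tools differ slightly in how they interpolate quartiles, so Excel or another package may report somewhat different cutoffs.

```python
from statistics import quantiles

def find_outliers(values):
    """Return the points below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR)."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1                      # IQR = Q3 - Q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    return [x for x in values if x < lower_limit or x > upper_limit]

data = [2, 4, 5, 7, 8, 9, 11, 30]  # made-up data
# Here Q1 = 4.75 and Q3 = 9.5, so IQR = 4.75 and 1.5(IQR) = 7.125.
# The limits are -2.375 and 16.625, which leaves 30 outside.
print(find_outliers(data))  # [30]
```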


What are graphs very useful for providing insight into?

A data set's patterns, trends and outliers.


Descriptive statistics (summary statistics)

Summarize a data set numerically.
Describe the data with just one or two numbers.
Provide a quick overview of a data set without showing every data point.


"Central tendency" of a data set

An indication of where the "center" of the data set lies.


The most common measurement of central tendency



Mean (average)

The "average" of a set of numbers.
The sum of all of the data points in a set, divided by n, the number of data points.
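As a quick worked example of that definition, using a made-up data set and a hypothetical helper name:

```python
def mean(values):
    """Sum of all data points divided by n, the number of data points."""
    return sum(values) / len(values)

scores = [4, 8, 15, 16, 23, 42]  # made-up data, n = 6
print(mean(scores))  # 108 / 6 = 18.0
```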



Median

The middle value of a data set.
The same number of data points fall above and below the median.


How to find the median?

First arrange the values in order of magnitude. If the total number of data points is odd, the median is the value that lies in the middle. If the total number is even, the median is the average of the two middle values.
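The procedure above translates directly into a few lines of Python. The function name and sample values are illustrative only.

```python
def median(values):
    """Sort the values, then take the middle one (odd n)
    or the average of the two middle ones (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([7, 1, 5]))     # 5   (ordered: 1, 5, 7)
print(median([7, 1, 5, 3]))  # 4.0 (ordered: 1, 3, 5, 7; average of 3 and 5)
```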



Mode

The value that occurs most frequently in a data set.
If a data set has more than one value with the highest frequency, that data set has more than one mode.
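A minimal sketch of finding every mode, including the case where several values tie for the highest frequency. The helper name and data are made up.

```python
from collections import Counter

def modes(values):
    """Return all values that share the highest frequency, in sorted order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([5, 5, 6]))           # [5]
print(modes([1, 2, 2, 3, 3, 4]))  # [2, 3]: two values tie, so there are two modes
```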



Bimodal distribution

A distribution with two clearly defined peaks (two points with very high frequency).
The two peaks may have equal frequency and hence be true modes, or one peak may be a mode and the other peak may simply have a very high (but not the highest) frequency.



Multimodal distributions

Distributions with multiple peaks.


Conditional mean

The mean of a specific subset of data.
We apply a condition and calculate the mean for values that meet that condition.


Calculate a conditional mean in Excel

=AVERAGEIF(range, criteria, [average_range])
* range contains the one or more cells to which we want to apply the criteria or condition.
* criteria is the condition that is to be applied to the range.
* [average_range] is the range of cells containing the data we wish to average.
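The same three arguments can be mirrored in Python to show what a conditional mean computes. This is a sketch, not a reimplementation of Excel: the function name is hypothetical, the criteria is passed as a Python function rather than an Excel criteria string, and the region data is made up.

```python
def conditional_mean(range_values, criteria, average_range=None):
    """Average the entries of average_range whose matching entry
    in range_values satisfies criteria.

    If average_range is omitted, range_values itself is averaged.
    """
    if average_range is None:
        average_range = range_values
    selected = [avg for val, avg in zip(range_values, average_range) if criteria(val)]
    if not selected:
        raise ValueError("no values meet the condition")
    return sum(selected) / len(selected)

# Made-up data: region of each country and its oil consumption
regions = ["Europe", "Asia", "Europe", "Africa"]
consumption = [1.5, 3.0, 2.5, 0.5]

print(conditional_mean(regions, lambda r: r == "Europe", consumption))  # 2.0
print(conditional_mean(consumption, lambda c: c > 1))  # mean of 1.5, 3.0, 2.5
```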



Percentile

The value beneath which a certain percentage of the data lie.


The 25th percentile (the first quartile)

The smallest value that is greater than or equal to 25% of the data points.
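That definition corresponds to what is often called the nearest-rank method, sketched below with made-up data. This is an assumed reading of the card; Excel's percentile functions interpolate between data points, so they can return a different value for the same data.

```python
import math

def percentile_nearest_rank(values, p):
    """Smallest data value that is >= p percent of the data points (0 < p <= 100)."""
    ordered = sorted(values)
    n = len(ordered)
    rank = max(math.ceil(p * n / 100), 1)  # 1-based position in the ordered data
    return ordered[rank - 1]

data = [12, 15, 18, 20, 22, 25, 30, 40]  # made-up data, n = 8

# 25% of 8 points is 2 points, so the 25th percentile is the 2nd smallest value.
print(percentile_nearest_rank(data, 25))   # 15
print(percentile_nearest_rank(data, 100))  # 40
```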


What percentile does the mean represent?

The answer cannot be determined without further information.
The mean's location depends upon the distribution of the data set. Recall how the location of the mean differs for a symmetrical distribution and a skewed distribution. Therefore, there is no way to determine the percentile of the mean without more information about the data set.
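A small experiment makes the point concrete: in the three made-up data sets below, the share of observations at or below the mean varies widely with the shape of the distribution. The helper name is hypothetical.

```python
def percent_at_or_below_mean(values):
    """Percentage of data points that are less than or equal to the mean."""
    mean = sum(values) / len(values)
    at_or_below = sum(1 for v in values if v <= mean)
    return 100 * at_or_below / len(values)

symmetric = [1, 2, 3, 4, 5, 6]        # mean 3.5
right_skewed = [1, 2, 3, 4, 5, 45]    # mean 10, pulled up by the right tail
left_skewed = [-35, 5, 6, 7, 8, 9]    # mean 0, pulled down by the left tail

print(round(percent_at_or_below_mean(symmetric), 1))     # 50.0
print(round(percent_at_or_below_mean(right_skewed), 1))  # 83.3
print(round(percent_at_or_below_mean(left_skewed), 1))   # 16.7
```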