After you have collected and normalized your data, the next step is to analyze it.
This module
will focus on the basic principles and tools of data analysis, both numerical and graphical, the
application of which is a crucial first step in developing a cost estimate or risk analysis.
Data Analysis Overview
Key Ideas
Visual Display of Information
Central Tendency of Data
Dispersion (Spread) of Data
Data Accumulation
Outliers
Analytical Constructs
Descriptive statistics
* Mean, median, mode
* Variance, std deviation, CV
* Functional forms
Practical Applications
Making sense of your data
Related topics
Parametric
Distributions
Probability and Statistics
Data Analysis Within The Cost Estimating Framework
Past: Understanding your historical data
Present: Developing estimating tools (average cost: Mean = $34.19)
Future: Estimating the new system (Confidence Interval = +/- $5.76)
Here we present univariate cost data, in this case, monthly natural gas bills for a condominium over
about a six-year period.
The past data that have accumulated are displayed as a histogram, which is
a common way to show the density of univariate data. (This and other methods, such as the box
plots and stem-and-leaf graphs shown in the Related and Advanced Topics section, are all
essentially variants of plotting numbers on a number line.) We will revisit this particular graph, but
the standard Excel labels for histograms indicate that the data points in each bar are less than the
value shown on the x-axis, so that the bar labeled 30 gives the frequency of gas bills that are greater
than or equal to $15.00 but less than $30.00.
In the present, we develop estimating tools, in this case, a simple average: about $34.19 per month.
To apply this average for estimating or budgeting purposes, we would like to get a sense of its
precision, which we can obtain by calculating a confidence interval.
In this case, we see that the true
mean of future monthly gas bills is very likely to be within plus or minus $5.76 of the calculated
sample mean. Note, however, that most individual observations are outside this interval: the hot
summer months are much cheaper, and the cold winter months are much more expensive. Thus,
knowing the average precisely helps us (for household budgeting purposes, say) only if we have a
cost-smoothing deal with our utility company. If we want to accurately predict individual bills from
month to month, we need to seek out a cost-driver variable (such as mean monthly temperature).
This would lead to a set of bivariate graphs such as those shown on the Cost Estimating Framework
slide in Module 8 Regression Analysis.
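As a concrete illustration of the mean-plus-confidence-interval idea, here is a small Python sketch using only the standard library. The bill amounts below are hypothetical stand-ins, not the actual course data set, and the sketch uses the normal (z) approximation rather than the t distribution, a reasonable simplification for a sample of several dozen points.

```python
import statistics
from statistics import NormalDist

def mean_confidence_interval(data, confidence=0.95):
    """Return (mean, half_width) for a confidence interval on the mean.

    Uses the normal (z) approximation, which is reasonable for larger
    samples such as ~74 monthly bills.
    """
    n = len(data)
    m = statistics.mean(data)
    s = statistics.stdev(data)             # sample std dev (n-1 denominator)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * s / n ** 0.5          # critical value times standard error
    return m, half_width

# Hypothetical monthly gas bills (not the module's actual data)
bills = [18.0, 22.5, 31.0, 45.2, 60.1, 27.3, 19.9, 55.0, 33.3, 24.8]
m, hw = mean_confidence_interval(bills)
print(f"mean = ${m:.2f}, 95% CI = +/- ${hw:.2f}")
```

As in the slide, the interval bounds the precision of the mean itself; individual monthly bills will routinely fall well outside it.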
Note that estimating tasks are by definition always conducted in the present. We don’t travel back in
time to collect data, nor forward in time to make our estimates. This “triptych” Cost Estimating
Framework is simply meant to remind us that data, the basis for our estimates, always originates in the past.
Data Analysis Outline
Core Knowledge
* Types of Data
* Univariate Data Analysis
* Scatter Plots
* Variables
* Axes and Function Types
* Data Validation
* Descriptive Statistics
* Outliers
* Rules of Thumb
Summary
Resources
Related and Advanced Topics
This module will cover various types of data sets and the functional relationships that may be present
therein; how to uncover these relationships by scatter plotting the correct variables on the correct set
of axes; and how to perform elementary data validation by examining descriptive statistics,
appropriately treating potential outliers, and applying rules of thumb. As with all modules, we’ll
conclude the Core Knowledge section with a summary and present resources for reference and
further study.
Types of Data
Univariate
Bivariate
Multivariate
Time Series
The first step in data analysis is to think about what type of data you have. While we present this as
a linear process, you’ll need to revisit this step each time you get new data. We’ll cover univariate,
bivariate, multivariate, and time series data sets.
Univariate
‐ Single variable
‐ Use descriptive and inferential statistics
Bivariate
‐ One independent variable and one dependent
variable (i.e., y is a function of x)
‐ Use descriptive and inferential statistics
Multivariate
‐ Several independent variables and one dependent
variable (i.e., y is a function of x1, x2, and x3)
‐ Use descriptive and inferential statistics
Univariate data consists of a single variable, such as cost data for a single element or a set of
historical cost growth factors for various programs in a given phase. It can be displayed graphically
using histograms (shown), stem-and-leaf plots, or boxplots.
In this case, descriptive statistics (mean,
median, standard deviation, Coefficient of Variation (CV), etc.) should be used to find the central
tendency and dispersion of the data.
Inferential statistics are not as frequently used with univariate data, but can be used to assess whether the data set seems to match a certain mean, variance, or
distribution.
Note that true univariate data sets are quite rare in cost estimating, but we start with them in our first-principles approach. As soon as you ascribe categories to data or add other metadata (such as the year of the costs), you are arguably in the realm of bivariate or multivariate
data.
Bivariate data has one independent variable and one dependent variable. For example, software
development cost as a function of the number of lines of code. It is generally displayed using a
scatter plot (shown).
In this case, both descriptive and inferential statistics (regression, t and F statistics, etc.) should be used. Descriptive statistics are calculated to find the central tendency and dispersion of the dependent variable.
Then, inferential statistics are used to test the relationship between the independent and dependent variable, e.g., is the number of lines of code a good
predictor of software development cost?
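The descriptive-then-inferential workflow for bivariate data can be sketched in a few lines of Python. The lines-of-code and cost figures below are hypothetical, and the slope and intercept formulas are the standard ordinary-least-squares ones, covered properly in Module 8 Regression Analysis.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x (one independent variable)."""
    xbar, ybar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b = sxy / sxx                 # slope: incremental cost per extra KLOC
    a = ybar - b * xbar           # intercept
    return a, b

# Hypothetical data: thousands of lines of code vs development cost ($K)
kloc = [10, 20, 30, 40, 50]
cost = [120, 190, 310, 390, 520]
a, b = fit_line(kloc, cost)
print(f"cost ~ {a:.0f} + {b:.0f} * KLOC ($K)")
```

Whether the fitted slope is statistically meaningful (the "good predictor?" question) is the job of the inferential t and F tests mentioned above.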
Multivariate data has several independent variables and one dependent variable. For example, the
cost of supplies as a function of both crew size and hours underway. It can be displayed using a 3-D
plot (shown) or pairwise plots of the dependent variable against the various independent variables.
In this case, both descriptive and inferential statistics should be used.
As with the other data types,
descriptive statistics give an idea of the central tendency and dispersion of the dependent variable.
Inferential statistics – such as multiple regression – are used to test the relationship between the
independent variables and the dependent variable, e.g., do crew size and hours underway together
give a good prediction of supplies cost?
Types of Data
Time as the independent variable
* Interval matters! Make sure you use an XY (Scatter) and
not a Line Chart in Excel unless intervals are equally
spaced
Smooth trends are rarely found in time series
Possible rare exceptions (e.g., corrosion over time)
“Standard” trends such as investment and inflation
Look for paradigm shifts, cycles, autocorrelation
Use moving averages, divide data into groups and
compare descriptive statistics
Regression is often not useful as it only
picks up smooth trends unless AR1/ARIMA
ANOVA and mean comparisons are more useful
Time series data are quite different from univariate, bivariate, and multivariate data, and thus, a
somewhat different approach must be used.
Time series data are generally bivariate data with time
(in the form of year, quarter, month, etc.) as the independent variable: for example, cost growth as a
function of the year of program initiation, or worker productivity measured by quarter.
(For examples
like the latter, tracking production metrics over time, see Module 11 Manufacturing Cost Estimating.)
Time series data can also be used to check CERs by including a time variable; time data can reflect
different points in the life cycle of a system. As with any bivariate data set, you can plot it on a standard
Cartesian xy-plane, but the time intervals must be plotted correctly on the x-axis.
In Excel, you
should generally choose the XY (Scatter) Chart type instead of Line; the latter may treat the x-axis as
Category instead of Time-scale, which will force even spacing of your data points. If your data are
equally spaced, you may choose to use a line chart.
Unlike other data types, we rarely see smooth trends in time series data. Something like corrosion
over time may be an example of a rare exception. There are also expected “standard” trends, such
as the fact that investments and inflation generally follow an exponential function over time due to the
compounding of rates (and one would hope that the former outpaces the latter!). Instead, we usually
see evidence of paradigm shifts, cycles, or autocorrelation. A paradigm shift would show a marked
change in the nature of the data occurring at some point or over some period. An example of a
paradigm shift is finding lower cost growth in programs entering production after 1986 (at the end of
the Reagan ramp-up period) than before 1986.
Since regression only picks up smooth trends, it will not detect shifts and cycles and therefore is
often not a useful tool in time series data. Instead, we should use scatter plots and moving averages
to look for possible paradigm shifts and cycles. In addition, we can divide the data into subgroups
(e.g., 1980-85, 1985-90, 1990-95, 1995-2000, and 2000-05) and compare descriptive statistics.
Analysis of Variance (ANOVA) can be used to test for significant differences between subgroups.
AR1/ARIMA (Auto Regressive Integrated Moving Average) models are models that forecast a value
in a time series as a linear combination of the past values, past errors and current and past values of
other time series.
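The moving-average and subgroup-comparison approach described above can be sketched in a few lines of Python. The cost-growth factors below are hypothetical, constructed with a deliberate shift at 1986 to mimic the ramp-up example.

```python
import statistics

def moving_average(series, window=3):
    """Mean of each consecutive window; smooths out year-to-year noise."""
    return [statistics.mean(series[i:i + window])
            for i in range(len(series) - window + 1)]

# Hypothetical cost-growth factors by program start year, shifting in 1986
years  = list(range(1980, 1992))
growth = [1.40, 1.35, 1.45, 1.38, 1.42, 1.36,    # pre-1986
          1.15, 1.12, 1.18, 1.10, 1.14, 1.16]    # 1986 and later
smoothed = moving_average(growth, window=3)

# Divide the data into subgroups and compare descriptive statistics
pre  = [g for y, g in zip(years, growth) if y < 1986]
post = [g for y, g in zip(years, growth) if y >= 1986]
print(statistics.mean(pre), statistics.mean(post))  # shift shows up in the means
```

A formal test of whether the subgroup means differ significantly would use ANOVA or a mean-comparison test, as noted on the slide.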
Univariate Data Analysis
Visual Display of Information
* Histogram, stem-and-leaf, box plot
What does it look like?
Measures of Central Tendency
* Mean (or median or mode)
What’s your best guess?
Measures of Variability
* Standard deviation (or variance),
coefficient of variation (CV)
How much remains unexplained?
Measures of Uncertainty
* Confidence Interval (CI)
How precise are you?
Statistical Tests
* t test, chi square test, Kolmogorov-Smirnov
(K-S) test
How can you be sure?
This analysis framework is mirrored in
bivariate and multivariate analysis.
When conducting univariate data analysis, the following questions should be addressed (preferably in
the provided order). We will address methods used to answer these questions in the first section of
this module.
*What does it look like? Some useful visual displays in univariate data analysis are histograms,
stem-and-leaf plots, and box plots. Histograms, the most common graphical tool in univariate data
analysis, are discussed in the slides that follow; an explanation of stem-and-leaf plots and box plots
can be found in the Related and Advanced Topics section of this module.
*What’s your best guess? Measures of Central Tendency, such as the mean, median, or mode, are
useful when describing your “best guess” or probable outcome of a dataset. These measures are
single points used to represent the total data set.
*How much remains unexplained? You must also address the variability around your point
estimate (i.e., your mean). Common tools for measuring the variability include the standard deviation
(or variance) and the coefficient of variation.
*How precise are you? Confidence intervals are used to measure the certainty (or uncertainty)
around your point estimate. Confidence intervals are introduced in this module, and are discussed in
depth in Module 10 Probability and Statistics.
*How can you be sure? Statistical tests can be conducted with univariate data sets. Some tests are
introduced in this module; the “how” is left for Module 10 Probability and Statistics.
Visual Display - Histograms
Histograms should be used to give an idea of the distribution of the data
Skew-right distribution, possibly
Exponential, Triangular, or Lognormal
Tip: Create the histogram manually using Chart type Column so that results do update when the data change
One useful type of graph is the histogram. Here we have graphed six years’ worth of personal
monthly gas bills, measured in dollars (as first seen on the Cost Estimating Framework slide earlier).
This data, and the corresponding data on gas used per month (measured in units called therms), will
show up occasionally throughout this module. Though we would not use this specific data in day-to-
day cost estimating, we find it helpful here, since characteristics of these data can be used to
illustrate several of the topics we will explore (for example, time series and identification of outliers).
Histograms group data into several “bins” and plot the bins on the horizontal axis with the frequency
(or relative frequency) on the vertical axis. That is, the vertical column displays the number of
observations (or percentage of observations) that fall within that bin. By convention, the bin labels
indicate the upper end of the bin, so that the first bar represents the number of monthly bills less than
$15.00, the second bar between $15.00 and $30.00, and so on. Histograms give a good sense of the
distribution of the data, since they are essentially depictions of an empirical probability density
function (pdf), and can be useful in identifying potential outliers.
In the above histogram, a skew-right distribution is evident, leading us to investigate fitting a
triangular or lognormal (or even exponential) distribution to the data. The data points on the far right
are possible outliers.
Histograms are the primary means for cost estimators to visually display univariate data (and, as we
shall continue to emphasize throughout this module, it is extremely important to look at your data).
There are other possible methods, such as the stem-and-leaf plots and boxplots that are shown in
the Related and Advanced Topics section. Remember, of course, that monthly data is, strictly
speaking, not univariate – it has cost and month!
Visual Display – Histograms and their bins
It is necessary to choose bins carefully. The two histograms shown use the same data set, but
different bin sizes. The histogram on the left allows Excel to automatically choose the number and
size of bins. Almost all of the data ends up in one bin, so we do not get a good idea of the
distribution.
In the histogram on the right, the analyst specified the number and size of bins to be
used. Here, the distribution is clearly skewed right with a potential outlier. In this case, however, the
lowest bin specified is $15, so we lose the fact that there is only one bill less than $11.15, and it is no
longer clear that the distribution has a “hump.” The bins on the right hide or distort, because they
create the impression that the distribution is exponential, when in fact, the data show a void under
some value … in other words, there are few-to-no near-zero months. If the distribution were truly
exponential, as the right histogram suggests, lower values around zero would be the most common.
As demonstrated in this example, poor choices of histogram bins can hide important information!
“Sometimes a beautiful graph is an orchid, sometimes it’s a Venus Flytrap.”
Because statistical samples have an inescapable random element, there is always a tradeoff in the
numbers of bins between texture (more bins) and smoothness (fewer bins). In general, you should
play around with the intervals for your bins to get a sense of the data before settling on the final
display. Follow the link to the Related and Advanced Topics section for some possible rules for
determining number of bins and bin width.
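To make the binning mechanics concrete, here is a small Python sketch. The bill values are hypothetical; the bin counting follows the upper-edge labeling convention described earlier (a value lands in the bin whose label is the smallest edge strictly greater than it); and Sturges' rule is shown as one widely cited rule of thumb for the number of bins, not necessarily one of the specific rules given in the Related and Advanced Topics section.

```python
import math

def sturges_bins(n):
    """Sturges' rule: a common starting point for the number of bins."""
    return math.ceil(math.log2(n)) + 1

def histogram(data, bin_width, start=0.0):
    """Count observations into equal-width bins labeled by their upper edge.

    A value x lands in the bin whose upper edge is the smallest edge
    strictly greater than x (the convention described in the text).
    """
    counts = {}
    for x in data:
        upper = start + bin_width * (math.floor((x - start) / bin_width) + 1)
        counts[upper] = counts.get(upper, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical monthly bills
values = [11.0, 16.5, 22.0, 27.3, 29.9, 33.0, 41.2, 44.0, 58.7, 74.9]
print(sturges_bins(len(values)))          # suggested number of bins
print(histogram(values, bin_width=15.0))  # frequencies keyed by upper edge
```

Rerunning the last line with different bin widths is the "play around with the intervals" step: compare the pictures before settling on a final display.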
Central Tendency - Mean
The mean is the Expected Value of a random variable
In Excel, use the “AVERAGE( )” function
Means of example data sets:
Gas bill (74 months), $26.52
Therms used (74 months), 14.8
Throughout our discussion of descriptive statistics, we will refer to sample statistics (here, we
calculate the sample mean). Module 10 Probability and Statistics will further discuss the relationship
between your sample and the population from which it comes. Because we can never have perfect
knowledge of a population, our statistics generally represent best estimates of population parameters
based on our sample.
The arithmetic mean or average of a data set is simply the sum of the data values divided by the
number of data points. In this case, we add up the costs of all the bills (and corresponding therms
used) and divide by the number of bills, 74.
You can use the =AVERAGE() function in Excel, which
automatically skips blanks in a range of cells.
Follow the link to the Related and Advanced Topics section if you want to learn a mental math trick
for more easily calculating means (or at least approximations) in your head.
Recall that the arithmetic mean (AM) is distinct from the geometric mean (GM) and harmonic mean
(HM) introduced in Module 5 Inflation and Index Numbers.
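In Python, the standard library's statistics module plays the role of Excel's AVERAGE(). The sketch below uses hypothetical index-ratio data and also shows the geometric and harmonic means recalled above (statistics.geometric_mean requires Python 3.8+).

```python
import statistics

# Hypothetical index ratios; None entries mimic blank cells,
# which Excel's AVERAGE() skips automatically
raw = [1.10, None, 1.25, 1.50, None, 2.00]
data = [x for x in raw if x is not None]

am = statistics.mean(data)              # arithmetic mean
gm = statistics.geometric_mean(data)    # geometric mean (Python 3.8+)
hm = statistics.harmonic_mean(data)     # harmonic mean

# For positive data the classic inequality HM <= GM <= AM always holds
print(am, gm, hm)
```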
Central Tendency - Median
The sample median is the “middle” data point, with 50% of the remaining observations falling under that
point, and 50% above
If a data set has an odd number of points, the middle value is the median
The median of the data set {2,5,7,9,25} is 7
If a data set has an even number of points, the two middle values are averaged
The median of the set {3, 6, 8, 11, 13, 30} is 9.5 (average of 8 and 11)
In general, the kth percentile is the point with k% of the data below and (100-k)% of the data above
Quartiles (25, 50, 75), deciles (10, 20,…, 80, 90), icosatiles (5, 10, 15,…, 95)
When there are extreme data points, the median may be more representative than the mean, because the median is robust: outliers impact the mean more than the median
“Representative” is a descriptive term, not a mathematical term
There are many mathematical reasons to prefer mean over median
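The median examples above can be verified directly with Python's statistics module, along with a small demonstration of why the median resists extreme points (the base/extreme data sets are hypothetical):

```python
import statistics

odd  = [2, 5, 7, 9, 25]
even = [3, 6, 8, 11, 13, 30]

assert statistics.median(odd) == 7      # middle value of an odd-sized set
assert statistics.median(even) == 9.5   # average of the two middle values

# Robustness: one extreme point drags the mean but barely moves the median
base    = [10, 11, 12, 13, 14]
extreme = [10, 11, 12, 13, 140]
print(statistics.mean(base), statistics.median(base))        # 12, 12
print(statistics.mean(extreme), statistics.median(extreme))  # 37.2, 12
```

The statistics.quantiles function (Python 3.8+) gives quartiles, deciles, and other percentiles in the same spirit.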
Mean, Median, and Skew
The mean and the median are equal if the distribution is symmetric
Unequal means and medians are an indication of skewness
Median < Mean
Skew(ed) Right
Median = Mean
Symmetric
Median > Mean
Skew(ed) Left
If the distribution or data set in question is symmetric, the mean and the median will be equal (as
illustrated by the normal distribution in the center of the slide). While the other two relationships
illustrated above are not unfailingly true, they generally hold for the “regular” types of distributions
common in cost estimating and risk analysis. (Check out the Wikipedia article on skewness for the
“legal disclaimer.”) Inequality of mean and median is a general indication that the distribution or data
set is skewed.
If the median is lower than (to the left of) the mean, as illustrated by the lognormal
distribution on the left, this is an indication that the distribution or data set is skewed right or skew
right, since it stretches out to the right.
A median higher than (to the right of) the mean, as illustrated
by the beta distribution on the right, is an indication that the distribution or data set is skewed left or
skew left. Notice that the direction of skew follows the “tail” of the data and not the “hump”.
In these continuous distributions, the blue median line splits the area under the curve exactly in half.
The mean, shown in red, is the point at which the x-axis would balance if the pdf indicated its linear
density.
The mode, which we’ll introduce next, is the x-value at which the peak of the distribution
occurs. Note that for unimodal distributions such as the ones shown, the mode falls on the opposite
side of the median from the mean.
Central Tendency - Mode
The sample mode is the most frequent point to occur in a data set
The mode of a distribution is its peak
Value with the greatest probability mass (or density)
The mode of the set {2,4,4,7,9,9,9} is 9
The mode is a descriptive metric answering the question “what happens most frequently?”
It can help give a visual idea of what the distribution looks like
Most useful in discrete data
The mode is defined as the most frequently occurring point. This is the least used of the three
measures of central tendency. The measure is used to answer the question “what happens most
frequently?” so it’s only useful when the mode is fairly common. Note that the mode is simply a
plurality, not a majority. If the mode is a point that only occurs three times in a set of 100 data points,
it’s not really common enough for us to expect it to happen. More often we look at the modal bin for a
histogram, indicating the most likely range of values when compared with other adjacent non-
overlapping intervals of equal width.
The mode of a distribution is its peak, or the value where it attains its greatest probability mass (in the
discrete case) or density (in the continuous case). More on that in Module 10 Probability and
Statistics.
The mode is often most useful in discrete sets, particularly qualitative or categorical ones. For
example, the “mean color” of cows is useless, but the mode may be black-and-white piebald. This is
a most-frequent and discrete-case example, which makes the mode a good measure to use. A
quantitative discrete example would be a roll of a pair of dice, where the sum of seven (7) appears
most frequently, a key fact for you craps players out there! In continuous distributions, since the
probability of any point is zero, the “most common” idea is less useful: what you “expect” out of a
continuous random variable is the mean (the “expected value”), not the mode (the humped part of the
pdf).
The mode can be a good parameter to describe a distribution; it helps to get a sense of what the
picture looks like. The bottom line is that the mode is a visual parameter, one way or the other, not a
mathematically useful parameter. It is most useful in discrete data, and most useful when the modal value is highly frequent.
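A short Python sketch of the mode examples above (the cow colors are, of course, hypothetical):

```python
import statistics
from collections import Counter
from itertools import product

# The slide's example set: 9 occurs most frequently
assert statistics.mode([2, 4, 4, 7, 9, 9, 9]) == 9

# Categorical example: a "mean color" is meaningless, but the mode is defined
colors = ["piebald", "brown", "piebald", "black", "piebald"]
assert statistics.mode(colors) == "piebald"

# Discrete quantitative example: sums of two fair dice
sums = Counter(a + b for a, b in product(range(1, 7), repeat=2))
print(sums.most_common(1))   # 7 is the modal sum, in 6 of the 36 rolls
```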
Variability – Variance / Standard Deviation
The sample variance measures the deviation of the data points from their mean
In Excel, use the “VAR( )” function
The sample standard deviation is simply the square root of the sample variance
The standard deviation is expressed in the same units as the original data
In Excel, use the “STDEV( )” function
The variance is the average squared distance of the data points from their mean; it is a measure of
the spread of a distribution. A lower variance indicates less dispersion (tighter data).
The standard deviation of a distribution is simply the square root of the variance and measures the absolute
distance of the data points from their mean. When the exact population distribution is not known,
which is always the case in practical applications, you can find the variance of the sample; the
sample standard deviation is again the square root of the variance.
When we find the sample variance, we divide the sum of the squares of the distances from the mean by (n-1) instead of n. This
is because the variance and standard deviation measure the distance from the mean, which is itself
calculated from the data; if we have collected n data points, the number of data points that can vary
independently from the mean is (n-1), since the nth data point is exactly determined by the mean and
the values of the other data points. This idea is called “degrees of freedom.” This denominator also
ensures that the sample variance is an unbiased estimator of the population variance. In
particular, because the sample mean minimizes the sum of squared deviations (an exercise for the
student!), the numerator is almost certainly smaller than if the true population mean were used, so
the denominator needs to be adjusted accordingly.
Note that while the formula on the left is easier to remember – it is essentially the definition – the
formula on the right is easier to calculate, since it involves fewer computational operations (squaring
each data point instead of having to take the delta from the mean first and then square it). In fact,
you can compute it with a simple two-column table, with the xs and the x squareds, presaging the
four-column table that we’ll see in Module 8 Regression Analysis. We recommend that you know
both formulae (and be able to derive the latter from the former in a pinch!).
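The two equivalent sample-variance formulas discussed above, the definitional form s² = Σ(x − x̄)²/(n − 1) and the computational shortcut s² = (Σx² − n·x̄²)/(n − 1), can be checked against each other in a few lines of Python (the therms data are hypothetical):

```python
import statistics

def var_definition(xs):
    """Definition: mean squared deviation from the mean, n-1 denominator."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def var_shortcut(xs):
    """Computational form: (sum of squares - n * mean^2) / (n - 1)."""
    n = len(xs)
    xbar = sum(xs) / n
    return (sum(x * x for x in xs) - n * xbar ** 2) / (n - 1)

therms = [5, 8, 12, 20, 30]   # hypothetical monthly therms used

assert abs(var_definition(therms) - var_shortcut(therms)) < 1e-9
assert abs(var_definition(therms) - statistics.variance(therms)) < 1e-9
sd = statistics.stdev(therms)  # back in the same units as the data (therms)
```

Note that the shortcut form needs only a running sum of the xs and of the x squareds, which is what makes it the easier hand calculation.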
The units of measure for variance are squared units, which is not a useful measure. Because of this,
the standard deviation is often the statistic reported as the measure of the spread of a data set. In
our earlier example (therms of natural gas used), the variance would be reported in squared therms;
the standard deviation would be reported in therms.
Variance and standard deviation will be revisited in
Module 10 Probability and Statistics. For now, try
to remember that sample statistics, like s, are generally the estimators of the corresponding
population parameters, usually denoted by the counterpart Greek letter. In this case, the sample
standard deviation, s, is the estimate for the population standard deviation, sigma (σ), which is the square root of the population variance (σ²).
Variability - Coefficient of Variation
The Coefficient of Variation (CV) expresses the standard deviation as a percent of the mean
Tip: Low CV indicates less
dispersion, i.e., tighter data.
15% or less is desired
Large CVs indicate that the mean is a poor estimator
Consider regression on cost drivers
Examine data for multiple populations (outliers)
CVs of example data sets:
Gas bill, 74.4% (69.2%)
Therms used, 104.2% (102.5%)
Note that sums and averages tend to have smaller
variances
The coefficient of variation (CV), usually expressed as a percentage, is a measure of the size of the
standard deviation relative to the mean. This descriptive statistic is unit-less and therefore allows an
analyst to compare the variability across distributions.
In practice, a low coefficient of variation (say, 5%) would indicate that the average (mean) of the cost
data is a useful description of the data set. On the other hand, if the CV is much higher (say, greater
than 15%), there should be a cost driver in the data set that causes the cost to vary. This should
prompt the analyst to develop CERs in order to find the cost driver. If after running CERs the
coefficient of variation is not significantly reduced, you may have incorrectly identified the cost driver.
It is important to keep in mind that some data are inherently more noisy than other data, so the lack
of noise reduction may be the fault of the data rather than a misidentification of the cost driver. The
analogous calculation for CV for CERs is presented in
Module 8 Regression Analysis.
The CVs for the sample data set are shown, with the first number being the CV of all the individual
months and the second number (in parentheses) being the CV of the 12 monthly averages. This
illustrates the principle that sums and averages tend to have relatively smaller variances, as indicated
by the CV. In this case, the CVs aren’t decreased much, since it is the month-to-month variation in
temperatures that is driving the change in demand for gas, not the year-to-year variations for each
given month.
It is also interesting to note that the variation in demand (therms) is greater than the variation in cost.
This implies that the unit price for gas must be fluctuating in opposition to demand (negative
correlation) to provide a damping effect of sorts. This runs contrary to what we learned in Econ 101:
increased demand is supposed to drive prices up, not down!
Dispersion and CV
These two data sets have the same mean, but different standard deviations
This data has a lower CV (17%)
and is more tightly distributed
This data has a higher CV
(38%) and has more
dispersion
It is always important to look at the dispersion present in the data. Two different data sets may have
the same mean, but much different spreads. The data sets shown here illustrate this point. Both
sets have the same mean. However, the data on the left is more tightly distributed around the mean,
while the data on the right shows more dispersion. While both data sets have the same expected
value (mean), we would expect a significantly wider range when predicting based on the right-hand
data.
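The same-mean, different-spread point can be illustrated numerically. The two small data sets below are hypothetical (they are not the 17% and 38% CV sets pictured on the slide), but they show the same effect:

```python
import statistics

def cv(data):
    """Coefficient of variation: standard deviation as a fraction of the mean."""
    return statistics.stdev(data) / statistics.mean(data)

tight  = [90, 95, 100, 105, 110]    # clustered around the mean
spread = [40, 70, 100, 130, 160]    # same mean, much wider dispersion

assert statistics.mean(tight) == statistics.mean(spread) == 100
print(f"tight CV = {cv(tight):.1%}, spread CV = {cv(spread):.1%}")
```

Because CV is unit-less, the same comparison works across data sets measured in different units, which is exactly why it is the dispersion statistic cost estimators report most often.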