Data analysis Flashcards
(42 cards)
Key forms of data analysis
Descriptive
Inferential
Predictive
Descriptive analysis
Presents data in a simpler format that is more easily understood by the user
Describes the data actually presented
Key measures/parameters used in a descriptive analysis
Measure of central tendency
Measure of the dispersion
(Also the shape of the (empirical) distribution)
Measurements of central tendency
Mean
Median
Mode
Measurements of the dispersion
Standard deviation
Ranges such as the interquartile range
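The measures above can be computed directly with Python's standard library; the claim amounts below are invented purely for illustration.

```python
import statistics

# Hypothetical sample: claim amounts (in £) from a small portfolio
data = [120, 150, 150, 180, 200, 220, 250, 300, 450]

# Measures of central tendency
mean = statistics.mean(data)      # arithmetic average
median = statistics.median(data)  # middle value when sorted
mode = statistics.mode(data)      # most frequently occurring value

# Measures of dispersion
sd = statistics.stdev(data)                  # sample standard deviation
q1, q2, q3 = statistics.quantiles(data, n=4) # quartiles
iqr = q3 - q1                                # interquartile range

print(mean, median, mode, sd, iqr)
```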
Inferential analysis
Gathers data for a sample, which is then used to represent the wider population
Measures/Parameters of inferential analysis
Measure of central tendency
Measure of the dispersion
(Testing Hypothesis)
Predictive analysis
Extends the principles behind inferential analysis so that the user can analyse past data and make predictions about future events
How is predictive analysis used to make projections
It uses an existing set of data with known attributes/features (training set) in order to discover potentially predictive relationships.
Those relationships are tested using a different set of data (test set) to assess the strength of those relationships
Typical example of a predictive analysis
Regression analysis
Linear regression
The relationship between a scalar dependent variable and an explanatory (independent) variable is assumed to be linear, and the training set is used to determine the slope and intercept of the line
Eg a car’s speed and braking distance
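The training/test idea above can be sketched with least-squares linear regression. The data below is synthetic: distances are generated from an assumed linear relationship (slope 0.5, intercept 2, values chosen only for illustration) plus random noise, and the fit is then assessed on a separate test set.

```python
import random

random.seed(0)

# Synthetic training set: (speed, braking distance) pairs, where
# distance = 0.5 * speed + 2 plus noise (illustrative values only)
train = [(s, 0.5 * s + 2 + random.uniform(-1, 1)) for s in range(20, 80, 5)]

# Least-squares estimates of slope and intercept from the training set
n = len(train)
mean_x = sum(s for s, _ in train) / n
mean_y = sum(d for _, d in train) / n
sxy = sum((s - mean_x) * (d - mean_y) for s, d in train)
sxx = sum((s - mean_x) ** 2 for s, _ in train)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Separate test set: assess the strength of the fitted relationship
# on data that was not used in training
test = [(s, 0.5 * s + 2 + random.uniform(-1, 1)) for s in (25, 45, 65)]
errors = [abs((intercept + slope * s) - d) for s, d in test]

print(round(slope, 2), round(intercept, 2), round(max(errors), 2))
```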
Data Analysis Process
Develop a well-defined set of objectives
Identify the data items required for the analysis
Collection of the data from appropriate sources
Processing and formatting data for analysis
Cleaning data
Exploratory data analysis (descriptive/inferential/predictive)
Modelling the data
Communicating the results
Monitoring the process, update the data and repeat if necessary (actuarial control cycle)
The modelling team throughout the data analysis process
Ensure that any relevant professional guidance has been complied with
Ensure any relevant legal requirements are complied with
Possible issues with the data collection process that the analyst should be aware of
Whether the process was manual or automated
Limitations on the precision of the data collected
Whether there was any validation at source
If data was not collected automatically, how was it converted to an electronic form
Why is randomisation used?
Reduce the effect of bias
Reduce the effect of confounding variables (a variable that influences both the dependent variable and independent variable causing a false association)
Random sampling schemes
Simple random sampling
Stratified sampling
Another sampling method
Simple random sampling
Each item in the sample space has an equal chance of being selected
Stratified sampling
The sample space would first be divided into groups defined by specific criteria, before items are randomly selected from each group
Why would stratified sampling be used instead of simple random sampling
It aims to overcome the issue that a simple random sample may not fully reflect the characteristics of the population
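The two schemes above can be contrasted in code. The population below is hypothetical: two unequal groups ("north" and "south" are invented labels), with the stratified scheme allocating the sample proportionally so each group is represented.

```python
import random
from collections import defaultdict

random.seed(1)

# Hypothetical population: (region, value) pairs with unequal group sizes
population = [("north", i) for i in range(800)] + \
             [("south", i) for i in range(200)]

# Simple random sampling: every item has an equal chance of selection,
# so the split between regions in the sample is left to chance
simple_sample = random.sample(population, 100)

# Stratified sampling: divide the sample space into groups first,
# then randomly select items from each group (proportional allocation)
groups = defaultdict(list)
for region, value in population:
    groups[region].append((region, value))

stratified_sample = []
for region, items in groups.items():
    k = round(100 * len(items) / len(population))  # share of sample
    stratified_sample.extend(random.sample(items, k))

print(len(stratified_sample))
```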
A common example of pre-processing
Grouping
Why was grouping used in the past
To reduce the amount of storage space required
To make the number of calculations manageable
Why is data currently grouped
To anonymise the data
To remove the possibility of extracting sensitive (or commercially sensitive) details
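A minimal sketch of grouping as pre-processing: exact ages (the records below are invented) are collapsed into 10-year bands, so published totals no longer reveal individual-level detail.

```python
# Hypothetical individual records: (exact age, claim amount)
records = [(23, 150), (27, 200), (34, 180), (38, 220), (41, 300)]

# Group exact ages into 10-year bands; banded totals replace the raw
# records, anonymising the data before analysis
grouped = {}
for age, amount in records:
    lower = (age // 10) * 10
    band = f"{lower}-{lower + 9}"
    grouped[band] = grouped.get(band, 0) + amount

print(grouped)  # {'20-29': 350, '30-39': 400, '40-49': 300}
```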
Other aspects of data which are determined by the collection process which affect the way it is analysed
Cross-sectional data
Longitudinal data
Censored data
Truncated data
Cross-sectional data
Involves recording values of the variables of interest for each case in the sample at a single moment in time
Eg the amount spent by each of the members of a loyalty card scheme this week