Midterm Flashcards
(54 cards)
Data Analytics
The process of evaluating data with the purpose of drawing conclusions to address business questions
Big Data
Data sets that are too large and complex for a business's existing systems to handle
IMPACT cycle steps
Identify the business questions- what is our target?
Master the data- ETL, how to access data, reliability, data normalization, data cleaning
Perform the test plan- pick methods to approach the data
Address and refine results- identify the reasons why the results happened
Communicate insights
Track outcomes
8 methods of approaching data
Classification
Regression
Clustering
Similarity matching
Profiling- characterizing typical behavior of a group
Link prediction- predicting a relationship between 2 items
Data reduction- focusing only on critical items
Co-occurrence grouping- “customers also bought…”
4 V’s of Big Data
Volume = amount of data and size of the population
Velocity = how quickly the data is generated and changes
Variety = different types of data, structured or unstructured
Veracity = reliability and accuracy of the data
5 steps of ETL process (mastering the data)
Determine the purpose and scope of data request
Obtain the data- data request form or obtain yourself
Validate the data for integrity and completeness
Clean the data- find out why data is missing
Load the data in preparation for data analysis
4 steps of validating data
Compare number of extracted records to number of database records
Calculate and compare descriptive statistics (min, max, average, median)
Convert and validate Date/Time fields
Compare string limits for text fields
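The four validation checks above can be sketched in Python. This is a minimal illustration, not a standard routine: the invoice records, the control figures in `source`, and the helper names `is_valid_date` and `validate` are all hypothetical.

```python
from datetime import datetime
from statistics import mean, median

# Hypothetical extract: (invoice_id, amount, invoice_date) per record.
extracted = [
    ("INV-001", 120.00, "2023-01-15"),
    ("INV-002", 75.50, "2023-02-03"),
    ("INV-003", 310.25, "2023-02-28"),
]

# Control figures assumed to come from the source database.
source = {"count": 3, "min": 75.50, "max": 310.25, "id_limit": 7}

def is_valid_date(text):
    """Return True if the text parses as a YYYY-MM-DD date."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False

def validate(records, control):
    amounts = [amount for _, amount, _ in records]
    return {
        # 1. Extracted record count matches the database record count.
        "count_ok": len(records) == control["count"],
        # 2. Descriptive statistics match (avg and median are reported for comparison).
        "min_ok": min(amounts) == control["min"],
        "max_ok": max(amounts) == control["max"],
        "avg": mean(amounts),
        "median": median(amounts),
        # 3. Every date/time field converts cleanly.
        "dates_ok": all(is_valid_date(d) for _, _, d in records),
        # 4. No text value exceeds the field's string limit.
        "length_ok": all(len(i) <= control["id_limit"] for i, _, _ in records),
    }

report = validate(extracted, source)
```

Any `False` entry in `report` would point to the check that needs follow-up before cleaning and loading.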
Ways to deal with missing data
Drop the records with missing data (as long as not too many are dropped)
Replace the missing values with the average
Impute- assume someone who is in a cluster has the same data as others in the cluster
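A minimal Python sketch of the three options above. The customer rows, the cluster labels, and the function names are hypothetical, and the cluster average is only one assumed way of giving a record "the same data as others in the cluster".

```python
from statistics import mean

# Hypothetical records: (customer, cluster, annual_spend); None marks a missing value.
rows = [
    ("A", "retail", 100.0),
    ("B", "retail", None),
    ("C", "retail", 300.0),
    ("D", "wholesale", 1000.0),
    ("E", "wholesale", None),
]

def drop_missing(data):
    """Option 1: drop every record with a missing value."""
    return [r for r in data if r[2] is not None]

def fill_with_average(data):
    """Option 2: replace missing values with the overall average."""
    avg = mean(v for _, _, v in data if v is not None)
    return [(c, g, avg if v is None else v) for c, g, v in data]

def impute_by_cluster(data):
    """Option 3: replace missing values with the average of the record's own cluster."""
    known = {}
    for _, g, v in data:
        if v is not None:
            known.setdefault(g, []).append(v)
    cluster_avg = {g: mean(vals) for g, vals in known.items()}
    return [(c, g, cluster_avg[g] if v is None else v) for c, g, v in data]
```

In this toy data the overall average is pulled up by the wholesale customer, so the cluster-based fill gives customer B a more plausible value than the overall average does.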
3 types of missing data
MCAR (missing completely at random)
MAR (missing at random)- there is a pattern within the missing data, but it doesn’t impact the analysis
MNAR (missing not at random)- there is a pattern that can directly impact the analysis
Flat file
A means of storing data in one place (e.g., an Excel spreadsheet) as opposed to multiple tables
4 ways to clean data
Remove headings and subtotals
Clean leading zeroes
Format negative numbers (parentheses)
Correct inconsistencies across data
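Three of these cleaning steps can be sketched as small Python helpers. The function names and sample values are hypothetical, and the sketch assumes that "cleaning" leading zeroes means removing them and that subtotal rows are labeled in the first column.

```python
def clean_amount(text):
    """Convert accounting-style text such as '(1,250.00)' to a signed float."""
    text = text.strip().replace(",", "")
    if text.startswith("(") and text.endswith(")"):
        # Parentheses mark a negative number in accounting formats.
        return -float(text[1:-1])
    return float(text)

def strip_leading_zeros(code):
    """Remove leading zeroes, keeping a single '0' if nothing else is left."""
    return code.lstrip("0") or "0"

def remove_subtotals(rows, label="Subtotal"):
    """Drop rows whose first column is the subtotal label."""
    return [r for r in rows if r[0] != label]

# Hypothetical report rows: (label, amount as text).
report_rows = [("Sales", "1,250.00"), ("Returns", "(200.00)"), ("Subtotal", "1,050.00")]
cleaned = [(name, clean_amount(amount)) for name, amount in remove_subtotals(report_rows)]
```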
Similarity matching
Identifying similar individuals based on data known about them
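One simple way to illustrate similarity matching is to treat each individual as a point and pick the nearest one. This is a sketch under stated assumptions: the customers, their two features (visits per month, average basket size), and the use of Euclidean distance are all illustrative choices, not the only way to measure similarity.

```python
from math import dist

# Hypothetical customers: (visits per month, average basket size).
customers = {
    "Ann": (4, 50.0),
    "Bob": (5, 55.0),
    "Cy": (20, 10.0),
}

def most_similar(name, people):
    """Return the other individual with the smallest Euclidean distance."""
    target = people[name]
    others = {n: p for n, p in people.items() if n != name}
    return min(others, key=lambda n: dist(target, others[n]))

match = most_similar("Ann", customers)
```

In practice the features would usually be scaled first so that one large-valued field does not dominate the distance.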
Clustering
Dividing individuals into groups in a useful or meaningful way
Co-occurrence grouping
Discovering associations between individuals based on transactions involving them
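Co-occurrence can be illustrated by counting how often two items appear in the same transaction. The baskets below are hypothetical, and raw pair counts are only a simple stand-in for fuller association measures.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: the set of items in each basket.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "jam"},
    {"milk"},
]

pair_counts = Counter()
for basket in transactions:
    # Sort so each pair is always counted under the same key.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs bought together in at least two baskets.
frequent_pairs = {pair for pair, count in pair_counts.items() if count >= 2}
```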
Profiling
Characterizing the typical behavior of an individual or group
Link prediction
Predicting a relationship between two data items
Unsupervised approach
Exploring the data for potential patterns of interest, no specific target
Profiling, co-occurrence grouping, data reduction, fuzzy cluster
Supervised approach
Using historical data to predict a future outcome
Classification, similarity matching, regression, QCA
Target
An expected attribute or value that we want to evaluate
Class
A manually assigned category applied to a record based on an event
Fuzzy cluster
Like a Venn diagram: some individuals fall in between clusters and could belong to either one
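The overlap can be illustrated by giving each individual a degree of membership in every cluster instead of a single label. This is a rough sketch, not a full fuzzy clustering algorithm: the one-dimensional cluster centers and the inverse-distance weighting are assumptions made for illustration.

```python
def memberships(value, centers):
    """Return a membership share per cluster center; the shares sum to 1."""
    distances = [abs(value - c) for c in centers]
    if 0 in distances:
        # A point sitting exactly on a center belongs fully to that cluster.
        return [1.0 if d == 0 else 0.0 for d in distances]
    inverse = [1 / d for d in distances]
    total = sum(inverse)
    return [w / total for w in inverse]

centers = [0.0, 10.0]                  # hypothetical cluster centers
near_first = memberships(2.0, centers)  # mostly in the first cluster
in_between = memberships(5.0, centers)  # split evenly between both clusters
```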
Configurational vs Regression
Configurational- the idea that multiple different paths can lead to the same outcome (e.g., job desirability)
Regression- assumes there is only one path to success
2 methods of data reduction
Data- decide which records to keep and which to drop
Variable- decide which fields to keep and which to drop
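Both kinds of reduction can be sketched on a small list of records. The invoice fields, the 1,000 threshold, and the function names are hypothetical choices for illustration.

```python
# Hypothetical invoice records.
invoices = [
    {"id": 1, "vendor": "Acme", "amount": 5000.0, "clerk": "jd"},
    {"id": 2, "vendor": "Bolt", "amount": 40.0, "clerk": "mk"},
    {"id": 3, "vendor": "Acme", "amount": 1200.0, "clerk": "jd"},
]

def reduce_records(data, min_amount):
    """Data reduction: keep only the records at or above the threshold."""
    return [r for r in data if r["amount"] >= min_amount]

def reduce_variables(data, fields):
    """Variable reduction: keep only the chosen fields in each record."""
    return [{f: r[f] for f in fields} for r in data]

large = reduce_records(invoices, 1000.0)
slim = reduce_variables(large, ["id", "amount"])
```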
(Protiviti) 4 types of transformations
Customer Engagement
Digitizing Products and services
Better informed decisions
Performance measurement