Chapter 3: Performing the Test Plan and Analyzing the Results Flashcards by Lars Van Olmen

What are the 4 main types of data analytics?

Descriptive

Diagnostic

Predictive

Prescriptive

What is Descriptive Analytics?

procedures that summarize existing data to determine WHAT HAS HAPPENED IN THE PAST

What are common tools in Descriptive Analytics?

Summary statistics (mean, median, etc.)

Data reduction/filtering (fuzzy matching)

summary statistics

describes a set of data in terms of location, range, and shape

helps to quickly see how the data is distributed
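
A minimal sketch of such summary statistics in Python, using only the standard library. The invoice amounts and the `summarize` helper are made up for illustration.

```python
import statistics

def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),      # location
        "median": statistics.median(values),  # location, robust to outliers
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),   # spread
        "stdev": statistics.stdev(values),    # sample standard deviation
    }

# Hypothetical invoice amounts with one unusually large value
invoices = [120, 150, 150, 200, 980]
summary = summarize(invoices)
```

Here the single large invoice pulls the mean (320) well above the median (150), which is a quick hint that the distribution is skewed.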

(descriptive) data reduction or filtering

reduces the number of observations to focus on relevant items –> turns a large set into a smaller set

helps to isolate high-risk items, so you can focus on what matters

ex: filtering a large vendor list to only include vendors with transactions over 10k
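
The vendor example could look roughly like this in Python. The vendor names, amounts, and the `filter_over` helper are hypothetical.

```python
# Hypothetical vendor list: (vendor name, total transaction amount)
vendors = [
    ("Alpha Supplies", 2_500),
    ("Beta Logistics", 18_400),
    ("Gamma Consulting", 10_000),
    ("Delta Parts", 52_000),
]

def filter_over(records, threshold):
    """Keep only the records whose amount is strictly above the threshold."""
    return [(name, amount) for name, amount in records if amount > threshold]

# Reduce the full list to the vendors with more than 10k in transactions
high_value = filter_over(vendors, 10_000)
```

Note that "over 10k" is read here as strictly greater than 10,000, so a vendor at exactly 10,000 is left out; that boundary choice is an assumption.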

What is Diagnostic Analytics?

exploring current data to determine WHY SOMETHING HAS HAPPENED, typically by comparing the data to a benchmark

What are the 4 key methods in Diagnostic Analytics?

Profiling

Clustering

Similarity Matching

Co-occurrence Grouping

(diagnostic) profiling

identifies the typical behaviour of an individual or group by compiling summary statistics
–> COMPARING INDIVIDUALS TO POPULATION

shows the typical behaviour of a group –> helps to detect abnormal patterns (for ex: fraud)
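
One possible way to compare individuals to the population is a z-score: how many standard deviations someone sits from the group mean. The sketch below assumes that approach; the expense figures, the `flag_outliers` helper, and the cut-off of 2 are illustrative choices, not part of the course material.

```python
import statistics

def flag_outliers(amounts_by_person, z_threshold=2.0):
    """Return the people whose amount is far from the group's typical value."""
    values = list(amounts_by_person.values())
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # everyone behaves identically, nothing stands out
    flagged = []
    for person, amount in amounts_by_person.items():
        z = (amount - mean) / spread
        if abs(z) > z_threshold:
            flagged.append(person)
    return flagged

# Hypothetical monthly expense claims per employee
claims = {"Ann": 100, "Bob": 110, "Cara": 90, "Dan": 105, "Eve": 95, "Finn": 400}
unusual = flag_outliers(claims)
```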

(diagnostic) clustering

identifies groups that have common underlying characteristics –> reveals hidden relationships

groups data into clusters based on natural patterns, without pre-defined labels
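
As an illustration, here is a small one-dimensional k-means sketch. K-means is just one clustering technique, chosen here as an assumption; the data, the `kmeans_1d` helper, and the simple way of picking starting centroids are all hypothetical.

```python
def kmeans_1d(values, k, iterations=20):
    """Group numbers into k clusters (k >= 2) without using any labels."""
    ordered = sorted(values)
    # Assumption: start with centroids spread evenly over the sorted values
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical transaction amounts with two natural groups
amounts = [10, 12, 11, 200, 210, 205]
centroids, clusters = kmeans_1d(amounts, k=2)
```

No labels are supplied anywhere: the two groups (small and large amounts) emerge only from the distances between the values.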

(diagnostic) similarity matching

measures how alike two items are; the similarity scores can be used to group data into clusters
–> finds things that look alike

for ex: detecting suspicious entries or fraud

(diagnostic) co-occurrence grouping

find items that often appear/happen together
“people who buy x, also buy y”

helps reveal recurring transactional patterns
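
A rough sketch of the idea: count how often each pair of items shows up in the same transaction. The baskets and the `pair_counts` helper are invented for illustration.

```python
from collections import Counter
from itertools import combinations

def pair_counts(transactions):
    """Count how many transactions contain each pair of items."""
    counts = Counter()
    for items in transactions:
        # sorted(set(...)) gives each pair one canonical order, counted once
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts

# Hypothetical shopping baskets
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
    ["bread", "butter", "eggs"],
]
counts = pair_counts(baskets)
top_pair, top_count = counts.most_common(1)[0]
```

In this toy data, bread and butter appear together in 3 of the 4 baskets, which is the "people who buy x, also buy y" pattern.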

What is Predictive Analytics?

It uses historical data and models to forecast future outcomes.

procedures used to generate a model that can determine WHAT IS LIKELY TO HAPPEN IN THE FUTURE

What are 3 common predictive methods?

Regression

Classification

Link Prediction

(predictive methods) regression

A: A statistical model used to predict a number (e.g., sales, income) based on other variables.

predicts a number (ex: sales) and shows how variables are related

a good model has:
a high R² (close to 1)
statistically significant coefficients (p < 0.05)
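
A minimal sketch of a one-variable linear regression with ordinary least squares, including R². The ad-spend and sales figures and the `fit_line` helper are hypothetical; p-values are not computed here, since they would normally come from a statistics package.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by least squares and return R² as well."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R² = 1 - (unexplained variation / total variation)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared

# Hypothetical data: advertising spend (x) and sales (y)
ad_spend = [1, 2, 3, 4, 5]
sales = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept, r_squared = fit_line(ad_spend, sales)

# Predict sales for a new, unseen level of ad spend
predicted_sales = intercept + slope * 6
```

With this data the R² is above 0.99, so the line explains almost all of the variation in sales.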

(predictive methods) classification

predicts the class/category of a new observation, based on classes that were manually identified for previous observations
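
To make this concrete, here is a sketch of a nearest-centroid classifier: it averages the previously labeled observations per class and assigns a new observation to the closest average. This particular classifier, the feature values, and the helper names are assumptions for illustration.

```python
import math

def train_centroids(samples):
    """samples: list of (features, label). Returns the mean point per label."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def classify(centroids, features):
    """Assign the label whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(features, centroids[label]))

# Hypothetical labeled history: (amount, number of approvals skipped)
history = [
    ((100, 1), "normal"),
    ((120, 2), "normal"),
    ((5000, 9), "suspicious"),
    ((5200, 8), "suspicious"),
]
centroids = train_centroids(history)
new_label = classify(centroids, (4800, 7))
```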

(predictive methods) link prediction

predicts which new connections or relationships are likely to form in a network, based on existing data

predicts connections between items –> like suggesting friends on social media
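
A simple way to sketch this is the common-neighbours heuristic: two people who are not yet connected but share many friends are likely to connect. The heuristic is an assumed choice here, and the network and `common_neighbor_scores` helper are made up.

```python
from itertools import combinations

def common_neighbor_scores(graph):
    """Score each unconnected pair by the number of neighbours they share."""
    scores = {}
    for a, b in combinations(sorted(graph), 2):
        if b in graph[a]:
            continue  # already connected, nothing to predict
        shared = graph[a] & graph[b]
        if shared:
            scores[(a, b)] = len(shared)
    return scores

# Hypothetical friendship network: person -> set of current friends
network = {
    "ann": {"bob", "cara"},
    "bob": {"ann", "cara", "dan"},
    "cara": {"ann", "bob", "dan"},
    "dan": {"bob", "cara"},
}
suggestions = common_neighbor_scores(network)
```

Ann and Dan are the only unconnected pair, and they share two friends (Bob and Cara), so that link is the one suggested.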

Q: What is Prescriptive Analytics?

procedures to identify the best possible options for WHAT SHOULD BE DONE IN THE FUTURE

What tools are used in Prescriptive Analytics?

Decision support systems
= helps users make decisions by combining data and analysis to recommend best action

machine learning and AI
= they recommend a course of action + model adapts to new external data

Decision support systems

rule-based systems that gather data and recommend actions based on the input

for ex: cash flow forecasting and management tools

(prescriptive) machine learning and AI

learns from data to improve suggestions

helps to continuously scan transactions, for example to flag suspicious vendors

Q: (Similarity Matching) What is fuzzy matching?

finds text entries that are similar but not exactly the same

technique for detecting suspicious records in imperfect data

Q: (Similarity Matching) When do we use fuzzy matching?

uses probability to identify data that is likely similar
used when the data is imperfect or has inconsistencies
like “123 Main St.” vs. “123 Main Street.”
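
A small sketch of fuzzy matching on the address example, using `difflib.SequenceMatcher` from Python's standard library. The `similarity` helper and the 0.8 cut-off are illustrative assumptions; real tools may score similarity differently.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity ratio between 0 and 1, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

score = similarity("123 Main St.", "123 Main Street.")

# Hypothetical threshold: treat scores of 0.8 or more as a probable match
is_probable_match = score >= 0.8
```

The two addresses are not identical, so an exact comparison would miss them, but their similarity score is high enough to flag them as a likely match.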

Q: (Similarity Matching) What’s the risk of using a high fuzzy matching threshold? And of using a low one?

A (high threshold): Fewer false positives but more false negatives (you miss real matches).

A (low threshold): More false positives (many things look matched, but aren’t real matches).
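
The trade-off can be sketched by scoring the same candidate pairs under a strict and a loose threshold. The pairs, their "real match" labels, the helper names, and both thresholds are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity ratio between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical candidate pairs: (entry_a, entry_b, is_real_match)
pairs = [
    ("123 Main St.", "123 Main Street.", True),
    ("Acme Corp", "Acme Corporation", True),
    ("Acme Corp", "Apex Core", False),
]

def evaluate(candidates, threshold):
    """Return (false_positives, false_negatives) at the given threshold."""
    false_positives = 0
    false_negatives = 0
    for a, b, is_real in candidates:
        flagged = similarity(a, b) >= threshold
        if flagged and not is_real:
            false_positives += 1
        elif not flagged and is_real:
            false_negatives += 1
    return false_positives, false_negatives

strict = evaluate(pairs, 0.95)  # high threshold
loose = evaluate(pairs, 0.50)   # low threshold
```

On this toy data the strict threshold misses both real matches, while the loose one catches them but also flags the unrelated pair.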

Q: (Classification) What is pruning in decision trees?

removes branches from a decision tree to avoid overfitting

an overfitted model works too well on the training data and will not work well on test data

Q: (Classification) Why do we prune a decision tree?

to make the model simpler and more generalizable to new data

Q: (Classification) What is a linear classifier?

A: A model that separates data using a straight line (or flat boundary in higher dimensions) based on input variables.

Q: (Classification) Why use linear classifiers?

simple, fast, interpretable results; good when the data can be separated linearly

Q: (Classification) What’s a limitation of linear classifiers?

performs badly when the relationship between variables is nonlinear or complex

Q: (Classification) What is a Support Vector Machine (SVM)?

a discriminating classifier that is defined by a separating hyperplane; it tries to divide 2 groups, and NOT JUST WITH A STRAIGHT LINE

it first finds the widest margin between the groups and then takes the middle of the 2 margin lines –> that middle line is the boundary

Q: (Classification) What does the SVM margin mean?

distance between the separating line (hyperplane) and the closest data points from each class
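
A sketch of how that distance can be computed for a given hyperplane w·x + b = 0, where a point's distance is |w·x + b| / ||w||. This only measures the margin of a hyperplane that is already known; it does not train an SVM. The weights, bias, points, and the `margin` helper are hypothetical.

```python
import math

def margin(weights, bias, points):
    """Smallest distance from any point to the hyperplane w·x + b = 0."""
    norm = math.hypot(*weights)  # length of the weight vector
    return min(
        abs(sum(w * x for w, x in zip(weights, p)) + bias) / norm
        for p in points
    )

# Hypothetical separating line 3x + 4y - 10 = 0 with points from both classes
weights, bias = (3, 4), -10
points = [(4, 2), (6, 3), (0, 0), (-2, -1)]
m = margin(weights, bias, points)
```

An SVM looks for the hyperplane that makes this smallest distance as large as possible.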

Q: (Classification) Why is SVM useful?

A: It works well for both linear and non-linear classification, especially with small datasets and clear class separation.

Q: (Classification) What is underfitting?

A: A model that is too simple and misses important patterns in the data (high bias, poor training & testing accuracy).

Q: (Classification) What is overfitting?

A: A model that is too complex and fits the training data too well, including the noise (low bias, high variance, poor generalization). It does well on training data but poorly on test data.

Q: (Classification) How to detect underfitting vs. overfitting?

Underfitting: low training & test accuracy
Overfitting: high training but low test accuracy
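
This rule of thumb can be written as a small check. The `diagnose_fit` helper and its cut-offs (0.70 for "low" accuracy, 0.10 for a "large" train/test gap) are arbitrary assumptions for illustration, not standard values.

```python
def diagnose_fit(train_accuracy, test_accuracy, low=0.70, max_gap=0.10):
    """Label a model from its training and test accuracy (values in 0..1)."""
    if train_accuracy < low and test_accuracy < low:
        return "underfitting"  # poor on both training and test data
    if train_accuracy - test_accuracy > max_gap:
        return "overfitting"   # good on training data, much worse on test data
    return "reasonable fit"

# Hypothetical accuracy scores for three models
simple_model = diagnose_fit(0.55, 0.52)
complex_model = diagnose_fit(0.99, 0.65)
balanced_model = diagnose_fit(0.90, 0.88)
```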

(diagnostic) profiling: BENFORD'S LAW

early detection of irregularities
continuous monitoring enables the company to analyze real-time data
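
Benford's law says that in many naturally occurring sets of numbers the leading digit d appears with probability log10(1 + 1/d), so about 30% of values start with 1. A minimal sketch of comparing observed first digits with that expectation follows; the amounts and helper names are made up, and amounts are assumed to be non-zero.

```python
import math
from collections import Counter

def benford_expected(digit):
    """Expected share of values whose first digit is `digit` (1 to 9)."""
    return math.log10(1 + 1 / digit)

def first_digit(amount):
    """First significant digit of a non-zero number."""
    digits = str(abs(amount)).lstrip("0.")
    return int(digits[0])

def first_digit_shares(amounts):
    """Observed share of each leading digit 1..9 in the data."""
    counts = Counter(first_digit(a) for a in amounts)
    total = len(amounts)
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

# Hypothetical transaction amounts (far too few for a real test)
amounts = [1200, 1450.75, 230, 19.99, 3400, 0.045, 110, 870]
observed = first_digit_shares(amounts)

# Gap between observed and expected share per digit; large gaps merit a look
gaps = {d: observed[d] - benford_expected(d) for d in range(1, 10)}
```

A real analysis would need a much larger sample before large gaps are treated as a sign of irregularities.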

Labeled data

Labeled data = data with a known outcome or answer (label).
Each observation (row in your dataset) has not only input values (such as age, gender, income) but also a label, for example: "yes/no", "good/bad", "1/0", "fraud/non-fraud", "approved/rejected".
→ Used in supervised learning
→ The model learns from these known answers to predict new cases.

Unlabeled data

Unlabeled data = data without an outcome or answer.
You only have characteristics or properties (features), but no category or value that you know in advance.
→ Used in unsupervised learning
→ The model looks for patterns or groups by itself, without being given the right answer.

(37 cards)