Chapter 3: Performing the Test Plan and Analyzing the Results Flashcards

(37 cards)

1
Q

What are the 4 main types of data analytics?

A

Descriptive

Diagnostic

Predictive

Prescriptive

2
Q

What is Descriptive Analytics?

A

Procedures that summarize existing data to determine WHAT HAS HAPPENED IN THE PAST.

3
Q

What are common tools in Descriptive Analytics?

A

Summary statistics (mean, median, etc.)

Data reduction/filtering (fuzzy matching)

4
Q

summary statistics

A

Describe a set of data in terms of location, range, and shape.

Help to quickly see how the data is distributed.
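
As a rough illustration of location and spread, the sketch below computes a few summary statistics with Python's standard `statistics` module. The invoice amounts are invented for the example; note how a single large value pulls the mean far above the median.

```python
import statistics

# Hypothetical invoice amounts, invented for illustration.
amounts = [120.0, 135.0, 150.0, 150.0, 980.0]

summary = {
    "mean": statistics.mean(amounts),      # location, sensitive to outliers
    "median": statistics.median(amounts),  # location, robust to outliers
    "min": min(amounts),
    "max": max(amounts),
    "range": max(amounts) - min(amounts),  # spread
    "stdev": statistics.stdev(amounts),    # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```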

5
Q

(descriptive) data reduction or filtering

A

Reduces the number of observations to focus on relevant items → turns a large data set into a smaller one.

Helps to isolate high-risk items so you can focus on what matters.

Example: filtering a large vendor list to include only vendors with transactions over 10k.
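
The vendor example can be sketched as a simple filter. The vendor names and amounts are made up, and "over 10k" is read here as strictly greater than 10,000, which is an assumption.

```python
# Hypothetical vendor records; names and amounts are invented.
vendors = [
    {"vendor": "Acme Supplies", "amount": 2_500},
    {"vendor": "Birch Logistics", "amount": 14_200},
    {"vendor": "Cobalt Consulting", "amount": 10_000},
    {"vendor": "Delta Print", "amount": 31_750},
]

THRESHOLD = 10_000  # assumption: "over 10k" means strictly greater than 10,000

# Keep only the records that exceed the threshold.
high_value = [v for v in vendors if v["amount"] > THRESHOLD]

print([v["vendor"] for v in high_value])
```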

6
Q

What is Diagnostic Analytics?

A

Exploring current data to determine WHY SOMETHING HAS HAPPENED, typically by comparing the data to a benchmark.

7
Q

What are the 4 key methods in Diagnostic Analytics?

A

Profiling

Clustering

Similarity Matching

Co-occurrence Grouping

8
Q

(diagnostic) profiling

A

Identifies the typical behaviour of an individual or group by compiling summary statistics
→ COMPARING INDIVIDUALS TO THE POPULATION

Shows the typical behaviour of a group → helps to detect abnormal patterns (for example, fraud).
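
One common way to compare an individual to the population is a z-score. The sketch below is only an illustration: the expense claims are invented and the cutoff of 1.5 standard deviations is an arbitrary assumption, not a rule from the chapter.

```python
import statistics

# Hypothetical monthly expense claims per employee (invented data).
claims = {"ana": 410, "ben": 395, "cai": 430, "dee": 405, "eli": 1250}

values = list(claims.values())
mean = statistics.mean(values)
stdev = statistics.pstdev(values)  # population standard deviation

# z-score: how many standard deviations an individual is from the group mean.
z_scores = {name: (value - mean) / stdev for name, value in claims.items()}

CUTOFF = 1.5  # assumed cutoff for "abnormal"; choose per engagement
outliers = [name for name, z in z_scores.items() if abs(z) > CUTOFF]

print(outliers)
```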

9
Q

(diagnostic) clustering

A

Identifies groups that share common underlying characteristics → reveals hidden relationships.

Groups data into clusters based on natural patterns, without pre-defined labels.
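
To make "groups without prior labels" concrete, here is a deliberately tiny k-means sketch for one-dimensional data. k-means is just one clustering technique, chosen here as an assumption; the amounts and the starting centers are invented.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small k-means for one-dimensional data (illustrative only)."""
    groups = []
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Hypothetical transaction amounts; no labels are supplied.
amounts = [12, 15, 14, 13, 210, 190, 205]
centers, groups = kmeans_1d(amounts, centers=[min(amounts), max(amounts)])

print(groups)
```

The algorithm is never told which amounts are "small" or "large"; the two groups emerge from the distances alone.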

10
Q

(diagnostic) similarity matching

A

Measures how alike two items are; the result is used to group data into clusters.
Finds things that look alike.

Example: detecting suspicious entries or fraud.

11
Q

(diagnostic) co-occurrence grouping

A

Finds items that often appear or happen together.
“People who buy x also buy y.”

Helps reveal recurring transactional patterns.
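
A minimal sketch of the "buy x, also buy y" idea: count how often each pair of items appears in the same transaction. The baskets are invented for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase baskets (invented data).
baskets = [
    {"printer", "ink", "paper"},
    {"printer", "ink"},
    {"paper", "stapler"},
    {"printer", "ink", "stapler"},
]

pair_counts = Counter()
for basket in baskets:
    # Sort so that ("ink", "printer") and ("printer", "ink") count as one pair.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

most_common_pair, count = pair_counts.most_common(1)[0]
print(most_common_pair, count)
```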

12
Q

What is Predictive Analytics?

A

Uses historical data and models to forecast future outcomes.

Procedures used to generate a model that can determine WHAT IS LIKELY TO HAPPEN IN THE FUTURE.

13
Q

What are 3 common predictive methods?

A

Regression

Classification

Link Prediction

14
Q

(predictive methods) regression

A

A statistical model used to predict a number (e.g., sales, income) based on other variables.

It also shows how the variables are related.

A regression model is good when it has:
a high R² (close to 1)
statistically significant coefficients (p < 0.05)
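
The sketch below fits a one-variable least-squares line by hand and computes R², to show where that number comes from. The spend and sales figures are invented, and the sketch does not compute p-values; those would normally come from a statistics package.

```python
# Hypothetical data: advertising spend (x) and sales (y), invented.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least squares for a single predictor.
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
intercept = mean_y - slope * mean_x

# R² = 1 - (unexplained variation / total variation).
predicted = [intercept + slope * xi for xi in x]
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.2f} intercept={intercept:.2f} R2={r_squared:.4f}")
```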

15
Q

(predictive methods) classification

A

Predicts the class or category of a new observation based on classes that were manually identified for previous observations.

16
Q

(predictive methods) link prediction

A

PREDICTS which new connections or relationships are likely to form in a network, based on existing data.

Predicts connections between items → like suggesting friends on social media.
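
A simple way to picture friend suggestions is to score every unconnected pair by the number of neighbours they share. Common-neighbour counting is one basic heuristic, picked here as an assumption; the network is invented.

```python
from itertools import combinations

# Hypothetical friendship network (invented, symmetric).
friends = {
    "ana": {"ben", "cai"},
    "ben": {"ana", "cai", "dee"},
    "cai": {"ana", "ben", "dee"},
    "dee": {"ben", "cai"},
}

scores = {}
for a, b in combinations(sorted(friends), 2):
    if b not in friends[a]:  # only score pairs that are not yet connected
        # More shared friends suggests a link is more likely to form.
        scores[(a, b)] = len(friends[a] & friends[b])

print(scores)
```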

17
Q

What is Prescriptive Analytics?

A

Procedures to identify the best possible options for WHAT SHOULD BE DONE IN THE FUTURE.

18
Q

What tools are used in Prescriptive Analytics?

A

Decision support systems
= help users make decisions by combining data and analysis to recommend the best action

Machine learning and AI
= recommend a course of action, and the model adapts to new external data

19
Q

Decision support systems

A

Rule-based systems that gather data and recommend actions based on input.

Example: cash flow forecasting and management tools.

20
Q

(prescriptive) machine learning and AI

A

Learns from data to improve its suggestions.

Helps to scan transactions continuously, for example by flagging suspicious vendors.

21
Q

(Similarity Matching) What is fuzzy matching?

A

Finds text entries that are similar but not exactly the same.

A technique for detecting suspicious records in imperfect data.

22
Q

(Similarity Matching) When do we use fuzzy matching?

A

When the data is imperfect or has inconsistencies,
like “123 Main St.” vs. “123 Main Street.”
Fuzzy matching uses probability to identify data that is likely similar.
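
As an illustration, Python's standard `difflib.SequenceMatcher` returns a similarity ratio between 0 and 1 for two strings. It is only one of many possible similarity measures and is used here as an assumption; the helper name `similarity` and the second address are made up for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio, ignoring letter case (hypothetical helper)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# The two address spellings from the card above.
score = similarity("123 Main St.", "123 Main Street.")

# An unrelated address, invented for contrast.
unrelated = similarity("123 Main St.", "98 Oak Avenue")

print(f"near-duplicate: {score:.2f}, unrelated: {unrelated:.2f}")
```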

23
Q

(Similarity Matching) What are the risks of using a high or a low fuzzy matching threshold?

A

High threshold: fewer false positives but more false negatives (you miss real matches).

Low threshold: more false positives (many things look matched but aren’t real matches).
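
The trade-off can be seen by counting errors at two thresholds. The similarity scores and the "is it a real match" labels below are invented; in practice the ground truth is usually not known in advance.

```python
# Hypothetical candidate pairs with a similarity score and a known truth (invented).
candidates = [
    {"score": 0.95, "is_real_match": True},
    {"score": 0.86, "is_real_match": True},
    {"score": 0.72, "is_real_match": True},
    {"score": 0.70, "is_real_match": False},
    {"score": 0.55, "is_real_match": False},
]


def evaluate(threshold):
    """Return (false positives, false negatives) at the given threshold."""
    fp = sum(1 for c in candidates if c["score"] >= threshold and not c["is_real_match"])
    fn = sum(1 for c in candidates if c["score"] < threshold and c["is_real_match"])
    return fp, fn


print("high threshold 0.9:", evaluate(0.9))  # strict: misses real matches
print("low threshold 0.6:", evaluate(0.6))   # lenient: flags non-matches
```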

24
Q

(Classification) What is pruning in decision trees?

A

Removing branches from a decision tree to avoid overfitting.

An overfitted model works too well on the training data and will not work well on test data.

25
Q

(Classification) Why do we prune a decision tree?

A

To make the model simpler and more generalizable to new data.

26
Q

(Classification) What is a linear classifier?

A

A model that separates data using a straight line (or a flat boundary in higher dimensions) based on input variables.

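A linear classifier boils down to checking which side of the boundary a point falls on. In this sketch the weights and bias are assumed rather than learned from data, and the points are invented.

```python
def classify(point, weights, bias):
    """Return 1 if the point lies on or above the boundary, otherwise 0."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return 1 if score >= 0 else 0


# Assumed (not fitted) boundary: x1 + x2 = 5.
weights, bias = [1.0, 1.0], -5.0

print(classify([4.0, 3.0], weights, bias))  # 4 + 3 - 5 = 2, positive side
print(classify([1.0, 2.0], weights, bias))  # 1 + 2 - 5 = -2, negative side
```
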
27
Q

(Classification) Why use linear classifiers?

A

Simple, fast, and interpretable results.

Good when the data can be separated linearly.

28
Q

(Classification) What’s a limitation of linear classifiers?

A

They perform badly when the relationship between the variables is nonlinear or complex.

29
Q

(Classification) What is a Support Vector Machine (SVM)?

A

A discriminative classifier, defined by a separating hyperplane, that tries to divide two groups.

NOT JUST A STRAIGHT LINE: it first finds the widest margin between the two groups and then takes the line in the middle of the two margin lines → that middle line is the separating hyperplane.

30
Q

(Classification) What does the SVM margin mean?

A

The distance between the separating line (hyperplane) and the closest data points from each class.

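The margin can be computed as the smallest distance from any data point to the hyperplane, using |w·x + b| / ||w||. The hyperplane below is assumed rather than fitted by an SVM, and the points are invented so that the nearest point of each class sits at the same distance.

```python
import math


def distance_to_hyperplane(point, weights, bias):
    """Distance from a point to the hyperplane w.x + b = 0."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return abs(score) / math.sqrt(sum(w * w for w in weights))


# Assumed (not fitted) separating line: x1 = 5.
weights, bias = [1.0, 0.0], -5.0

# Hypothetical points for two classes.
class_a = [[8.0, 2.0], [9.0, 4.0]]
class_b = [[2.0, 2.0], [0.0, 0.0]]

# The margin is set by the closest points (the support vectors).
margin = min(distance_to_hyperplane(p, weights, bias) for p in class_a + class_b)
print(margin)
```
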
31
Q

(Classification) Why is SVM useful?

A

It works well for both linear and non-linear classification, especially with small datasets and clear class separation.

32
Q

(Classification) What is underfitting?

A

A model that is too simple and misses important patterns in the data (high bias, poor training and testing accuracy).

33
Q

(Classification) What is overfitting?

A

A model that is too complex and fits the training data too well, including the noise (low bias, high variance, poor generalization).

It will do poorly on test data.

34
Q

(Classification) How to detect underfitting vs. overfitting?

A

Underfitting: low training and test accuracy.

Overfitting: high training accuracy but low test accuracy.

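The rule of thumb above can be written as a small check on training and test accuracy. The function name and both cutoffs (what counts as "low" accuracy and as a "large" gap) are arbitrary assumptions for illustration.

```python
def diagnose(train_acc, test_acc, low=0.70, gap=0.10):
    """Label a model from its accuracies; `low` and `gap` are assumed cutoffs."""
    if train_acc < low and test_acc < low:
        return "underfitting"    # poor on both training and test data
    if train_acc - test_acc > gap:
        return "overfitting"     # good on training data, much worse on test data
    return "reasonable fit"


print(diagnose(0.62, 0.60))
print(diagnose(0.99, 0.71))
print(diagnose(0.88, 0.85))
```
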
35
Q

(diagnostic) profiling: BENFORD'S LAW

A

Early detection of irregularities.

Continuous monitoring enables a company to analyze real-time data.

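Benford's law states that in many naturally occurring sets of numbers the leading digit d appears with probability log10(1 + 1/d), so about 30% of values start with 1. The sketch below compares that expectation with the first digits of some invented amounts; how large a deviation is worth investigating is left open.

```python
import math
from collections import Counter

# Expected first-digit frequencies under Benford's law.
expected = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Hypothetical transaction amounts (invented; positive integers assumed).
amounts = [1200, 1875, 132, 2400, 310, 1020, 960, 150, 4700, 118]

first_digits = [int(str(a)[0]) for a in amounts]
counts = Counter(first_digits)
observed = {d: counts[d] / len(amounts) for d in range(1, 10)}

# Positive deviation: the digit appears more often than Benford's law expects.
deviation = {d: observed[d] - expected[d] for d in range(1, 10)}

for d in range(1, 10):
    print(d, f"expected={expected[d]:.3f}", f"observed={observed[d]:.3f}")
```
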
36
Q

Labeled data

A

Labeled data = data with a known outcome or answer (label).

Each observation (row in your dataset) has not only input values (such as age, gender, income) but also a label, for example: "yes/no", "good/bad", "1/0", "fraud/non-fraud", "approved/rejected".

→ Used in supervised learning.
→ The model learns from these known answers to predict new cases.

37
Q

Unlabeled data

A

Unlabeled data = data without an outcome or answer.

You only have characteristics or properties (features), but no category or value that you know in advance.

→ Used in unsupervised learning.
→ The model searches for patterns or groups by itself, without being given the correct answer.