Business Intelligence & Data Management Flashcards

1
Q

Data

A

Unprocessed raw facts, items that are the most elementary descriptions of things, events, activities, and transactions.

2
Q

Information

A

Organized data that has meaning and value

3
Q

Knowledge

A

Processed data or information that is applicable to a business decision problem

4
Q

Structured data

A

Data with a predefined structure; what you typically find in a database table or spreadsheet.

5
Q

Unstructured data

A

Data without a predefined structure, often complex; data that you find 'in the wild', e.g. images, sound, video, written text.

6
Q

Data management

A

Discipline that focuses on the proper generation, storage, and retrieval of data.

7
Q

Database

A

A shared and integrated computer structure that stores a collection of end user data and metadata.

8
Q

Database management System (DBMS)

A

Collection of programs that manages the DB structure and controls access to the database.

9
Q

Single user database

A

Database that supports only one user at a time. E.g. desktop database

10
Q

Multiuser database

A

Database that supports multiple users at a time. E.g. Workgroup database or enterprise database.

11
Q

Centralized database

A

Database for which the data is stored at a single site.

12
Q

Distributed database

A

Database for which the data is stored across several sites.

13
Q

Operational database

A

Also known as a transactional or production database. A database that supports an organization's day-to-day operations.

14
Q

Data warehouse (description)

A

A system with the primary focus of storing data to be used to generate information to make tactical or strategic decisions.

15
Q

Semi-structured data

A

Data that you have already prepared to some extent.

16
Q

Extensible Markup Language (XML)

A

A special language used to represent and manipulate data elements in a textual format.

17
Q

XML Database

A

Database that supports the storage and management of semi-structured XML data

18
Q

Database design

A

Refers to the activities that focus on the design of a database. If done properly, it requires the designer to identify precisely the database’s expected use.

19
Q

Data model

A

A simple representation of a more complex real world data structure.

20
Q

Entity

A

Any physical or conceptual object about which data is to be collected or stored.

21
Q

Attribute

A

Characteristics of an entity (e.g. weight, class, size, value, name)

22
Q

Relationship

A

AKA Cardinality. The association between entities. Can be:

One to many (1:M)
Many to many (M:N)
One to one (1:1)

23
Q

Constraint

A

Restriction placed on an attribute in a database. E.g. Employee age must be between 16 and 99.

24
Q

Table

A

Contains rows and columns. It is a persistent representation of logical relations.

25
Q

Relational table

A

Table for which each row can be uniquely identified

26
Q

Primary key

A

Attribute or combination of attributes that uniquely identifies an entity (or row in a table). A primary key is also a candidate key and a super key. The primary key is conventionally underlined in diagrams.

27
Q

Key

A

One or more attributes that determine other attributes. E.g. an order ID determines what products were bought and by whom.

28
Q

Composite key

A

A key that uniquely identifies an entity only because it is composed of two or more attributes that are not uniquely identifying on their own. For example, a postal code + house number to identify a house.

29
Q

Functional dependence

A

Attribute Y is functionally dependent on attribute X if the value of X determines the value of Y.

30
Q

Full functional dependence

A

Attribute Y is fully functionally dependent on a (composite) key X if Y is functionally dependent on X but not on any proper subset of X.

31
Q

Super key

A

Key that uniquely identifies each row and determines all of the entity's attributes.

32
Q

Candidate key

A

A minimal (irreducible) superkey. A superkey that does not contain a subset of attributes that is itself a superkey.

33
Q

Entity integrity

A

Each primary key value must be unique (and not null), so as to ensure that each entity is uniquely identified by its primary key.

34
Q

Controlled redundancy

A

This is how relational databases work. By placing foreign keys in tables, you can refer to an entity in another table. The foreign key value is present in both the referring and the referred table, thus storing a controlled amount of redundant information.

35
Q

Referential integrity

A

If a foreign key contains a value, that value must refer to an existing, valid tuple (row) in another relation.

36
Q

Secondary key

A

A key that is used strictly for data retrieval purposes. It does not necessarily identify a unique row (multiple entities can share the same secondary key value).

37
Q

One-to-many (1:M) relationship

A

For example, a mother (1) can have multiple children (M).

In this case, each child has one (1) mother, but she can have multiple (M) children.

38
Q

One-to-one (1:1) relationship

A

For example a car can (at one time) only be driven by one person, and that person can only drive one car (at one time).

39
Q

Many-to-many (M:N) relationships

A

For example, a company can have multiple employees and each employee can have multiple employers (jobs).

To implement this relationship in a database you need a composite entity: a table that links each unique combination of the two.

40
Q

Composite entity

A

AKA bridge entity or associative entity. A (linking) table that is used to link two other tables that have an M:N relationship. The composite entity contains the primary keys of the other two tables, to identify each unique relationship.

41
Q

Normalization

A

The process of evaluating and correcting table structures to minimize data redundancies. There are 4+1 normalization levels (normal forms). It is not always necessary to normalize to the highest level; more normalization means more join operations.

42
Q

Denormalization

A

The process of lowering the normal form. For example, 3NF can be converted to 2NF through denormalization. It decreases the number of operations required to manage the database (increasing performance), but increases data redundancy.

43
Q

Data redundancy anomalies

A

There are three types:

  • Update anomalies
  • Insertion anomalies
  • Deletion anomalies

If the same data is stored in multiple tables, you can forget to update all copies of the data (update anomaly), accidentally delete data you still need (deletion anomaly), and so on.

44
Q

Internal data

A

Data that comes from within the company

45
Q

External data

A

Data that comes from elsewhere but is useful for a company

46
Q

Business Intelligence

A

Is the combination of data warehousing and descriptive analytics. It is an umbrella term that combines processes, technology and tools.

47
Q

Business Analytics

A

Is the combination of predictive and prescriptive analytics. However, in this course both terms refer to the same thing.

48
Q

Database system

A

Consists of: Data, Software, Hardware, Users

49
Q

Relational database

A

Allows data to be grouped into tables and defines relationships between tables.

50
Q

Structured Query Language (SQL)

A

Most popular querying language for relational databases. SQL is used at the back end and front end of many database management systems.

51
Q

Join (SQL)

A

Joins two tables together in several different ways:

(Inner) Join - returns the entries that are shared between the two tables
Left/Right outer join - returns the entire left/right table plus the shared entries
Full outer join - returns both tables in full
Minus - returns one table except the entries it has in common with the other
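
An illustrative note (not part of the original card): assuming pandas is available, the same join variants can be demonstrated on two tiny made-up tables.

# Hypothetical example: join variants expressed as pandas merges.
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cat"]})
right = pd.DataFrame({"id": [2, 3, 4], "order": ["A", "B", "C"]})

inner = left.merge(right, on="id", how="inner")        # only the shared ids (2, 3)
left_outer = left.merge(right, on="id", how="left")    # all of left + matches from right
full_outer = left.merge(right, on="id", how="outer")   # everything from both tables
minus = left[~left["id"].isin(right["id"])]            # 'minus': left rows with no match in right
print(inner)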

52
Q

Production stage

A

Consists of:

Different platforms and databases. Internal, external, (un)structured data, inconsistent data, limited history

53
Q

Extraction Transformation Load (ETL) stage

A

Stage that extracts data from operational / production stage. Transforms data to make it fit the warehouse. Then loads into warehouse.

54
Q

Data warehouse (makeup)

A

Relational DBMS

Is of high quality, subject-oriented, integrated, time-variant and nonvolatile.

55
Q

Data mart

A

Subset (or small warehouse) of a data warehouse to support a specific Business Unit

56
Q

Metadata

A

Data about data. For example the location, meaning or origin of other data.

57
Q

Business intelligence (front-end) application

A

Used for:
Querying and reporting
Data mining
Data visualization

58
Q

Subject-oriented data

A

Focuses on the analysis of data for decision makers

Provides a simple view around a particular subject by excluding irrelevant / not useful data

59
Q

Integrated data

A

Data is combined (integrated) with data from multiple heterogeneous data sources in a clean and consistent way

60
Q

Time-variant data

A

The ability to store data as it existed at multiple points in time.

61
Q

Non-volatile data

A

This means that data is not updated once it's in the data warehouse. Only new data is added (for example a new point in time).

62
Q

Six data warehouse architectures

A
  • Direct BI on source systems
  • Canned DW
  • Independent data marts with bus architecture
  • Enterprise DW
  • Hub & Spoke (enterprise + data marts) - Most popular
  • Federated DW
63
Q

DW Development approaches

A

Bottom-up - Start with data marts for each BU, then integrate those
Top-down - Start with one large DW, then make smaller marts for each BU

64
Q

Data Lake

A

Database that holds raw data in its native format until it's needed.

65
Q

Features of DW (Data, structure, processing, security, Users)

A
Data Warehouse
Data - Cleaned, aggregated
Data structure - Structured, processed
Processing - Focus on write
Security - Mature
Users - Business professionals and data scientists
66
Q

Features of Data Lake (Data, structure, processing, security, Users)

A
Data Lake
Data - Raw
Data structure - Unstructured
Processing - Focus on read
Security - Maturing
Users - Data scientists
67
Q

Data converting

A

Converting data types into other types. For example, converting the integers 0 and 1 to the booleans true and false, or performing currency conversions.

68
Q

Data cleansing (clearing)

A

Validation and filtering of data

69
Q

Integral load

A

In this case the ETL stage loads all the data (both changed and unchanged) into the DW at one time

70
Q

Delta / Incremental load

A

Here the ETL stage only loads the changes since the previous load

71
Q

Online Analytical Processing (OLAP)

A

Enables users to interactively analyze multi-dimensional data from multiple perspectives. It's an intuitive way to organize large amounts of data; the user acts as a sort of explorer. Data is structured as a cube.

72
Q

Roll up (OLAP)

A

Aggregating measures to a higher dimension level (from quarter to year)

73
Q

Drill down (OLAP)

A

Reverse of roll up (e.g. from months to days)

74
Q

Slice & dice (OLAP)

A

Selecting subsets of cells that satisfy a certain condition

75
Q

Pivots (OLAP)

A

Rotates the data axes

76
Q

Lattice Cubes

A

The lattice of cuboids: the set of all possible aggregations (cuboids) of a data cube, one for each combination of dimension levels, ranging from the most detailed base cuboid to the fully aggregated apex cuboid.

77
Q

OLAP Software basics

A

OLAP software is based on:
Measures - the numerical data you want to compute or aggregate
Dimensions - the different perspectives on those measures, usually categorical data
Dimension hierarchies - the structure behind the dimensions

78
Q

Star scheme

A

A table layout with a fact table (containing numeric data) in the middle, with dimension tables connected to it (containing categorical data).

This scheme limits the number of required SQL joins, increasing performance.

79
Q

Snowflake scheme

A

Variant / refinement of the star scheme. Some dimensional data is normalized into smaller dimension tables.

80
Q

Central fact table

A

The central table in a star or snowflake scheme. Multi-valued since all numeric data is stored within the fact table.

81
Q

Dimension table

A

The surrounding tables in a star or snowflake scheme. They are single-valued, with meaningless surrogate keys.

82
Q

Surrogate key

A

A primary key that replaces the original (operational) primary key, in an effort to save storage space. The dimension table it identifies has many columns and always a 1:M relationship with the fact table.

83
Q

Dimensions

A

The different perspectives from which measures can be analyzed; usually categorical data (e.g. time, product, location).

84
Q

Conformed dimensions

A

Dimension tables that are shared between multiple fact tables.

85
Q

Data mining

A

The process of discovering new, valuable knowledge in databases. Can be both hypothesis-driven (test a claim) and discovery-driven (explore for knowledge).

It draws on machine learning, statistics, and databases.

86
Q

Big data’s five V’s

A

Volume - Long and wide
Velocity - High speed
Variety - Lots of different numbers, texts, images, etc.
Veracity - Quality issues
Value - Should be valuable to the organization

87
Q

Deduction

A

Example:
All cows give milk.
Betsie is a cow.
Therefore, Betsie gives milk.

88
Q

Abduction

A
Example:
All cows give milk.
Betsie gives milk.
Therefore, Betsie is a cow.
This inference is logically invalid, of course.
89
Q

Induction

A
Betsie is a cow and Betsie gives milk.
Clarabelle is a cow and Clarabelle gives milk.
Therefore, all cows give milk.
Data mining with machine learning works in a similar (inductive) way.
90
Q

Regression analysis

A

The analysis of statistical relationship between variables:

Y = alpha + beta * X + residual

91
Q

R-square

A

A number between 0 and 1 that indicates the goodness of the fit. 1 is perfect, 0 is the worst.

92
Q

Ordinary least Squares (OLS)

A

Also referred to as linear regression. It is an estimation method used in linear regression that minimizes the squared errors associated with each predicted value of Y.
It uses squares because otherwise positive and negative deviations would cancel each other out.
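
An illustrative sketch (not course code), assuming numpy is available: estimating alpha and beta by minimizing the sum of squared errors, plus the R-square of the fit.

# Hypothetical data; model: Y = alpha + beta * X + error
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

beta = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
alpha = Y.mean() - beta * X.mean()

residuals = Y - (alpha + beta * X)
# R-square = 1 - (residual sum of squares / total sum of squares)
r_square = 1 - np.sum(residuals ** 2) / np.sum((Y - Y.mean()) ** 2)
print(alpha, beta, r_square)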

93
Q

Regression model selection

A

Forward selection
Create a model with one independent variable and keep adding variables until additional ones only add noise.

Backward selection
Create a model with all the independent variables and keep removing them until only the ones that influence the dependent variable are left.

94
Q

K-nearest neighbors

A

A classification method. Based on a training set, you build a model that can determine the class of new data points. You also split a validation set from the training set to determine the value of K. After determining the best K, use the test set as a final check.
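
A minimal sketch (illustrative only, made-up points): classify a new point by majority vote over its K closest training points, using Euclidean distance.

# Plain-Python K-nearest neighbors (math.dist requires Python 3.8+)
from collections import Counter
import math

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((4.0, 4.2), "B"), ((3.8, 4.0), "B")]

def knn_predict(point, train, k=3):
    # sort the training points by distance to the new point, keep the k closest
    nearest = sorted(train, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((1.1, 0.9), train, k=3))   # -> "A"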

95
Q

Euclidian distance

A

To calculate the distance between points A and B:
sqrt( (Xb - Xa)^2 + (Yb - Ya)^2 )

It's based on the Pythagorean theorem.

96
Q

Manhattan distance

A

To calculate the distance between A and B:

|Xb - Xa| + |Yb - Ya|
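
A small illustrative sketch covering this card and the previous one, with made-up points:

import math

def euclidean(a, b):
    # straight-line distance (Pythagorean theorem)
    return math.sqrt((b[0] - a[0]) ** 2 + (b[1] - a[1]) ** 2)

def manhattan(a, b):
    # 'city block' distance: sum of the absolute coordinate differences
    return abs(b[0] - a[0]) + abs(b[1] - a[1])

A, B = (1, 2), (4, 6)
print(euclidean(A, B))   # 5.0 (a 3-4-5 triangle)
print(manhattan(A, B))   # 7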

97
Q

Choosing value of K

A

If K is too small, the model will be sensitive to noise and not representative of the class.
If K is too large, the neighborhood may include points from other classes.

To optimize K, test a range of values and pick the one that performs best (e.g. on a validation set).

98
Q

Curse of dimensionality

A

With a high number of dimensions (many attributes), everything is 'far' apart. This results in no nearby points unless an extremely large amount of data is available (exponential in the number of dimensions). For 10 dimensions, splitting each axis in half gives 2^10 = 1024 subcubes; you need more than 1024 data points to have, on average, one per subcube.

99
Q

Naive Bayes formula

A

P(Y|X) = P(X|Y) * P(Y) / P(X)
OR
P(Y|X1,...,Xn) = P(X1,...,Xn|Y) * P(Y) / P(X1,...,Xn)
The 'naive' assumption is that the attributes are conditionally independent given the class: P(X1,...,Xn|Y) = P(X1|Y) * ... * P(Xn|Y).
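
A tiny worked sketch with made-up probabilities (illustrative only), showing the naive factorization and the normalization by P(X1, X2):

p_y = {"yes": 0.6, "no": 0.4}              # prior P(Y)
p_x1_given_y = {"yes": 0.7, "no": 0.2}     # P(X1 = observed value | Y)
p_x2_given_y = {"yes": 0.5, "no": 0.6}     # P(X2 = observed value | Y)

# unnormalized posteriors: P(X1|Y) * P(X2|Y) * P(Y)
scores = {y: p_x1_given_y[y] * p_x2_given_y[y] * p_y[y] for y in p_y}
total = sum(scores.values())               # plays the role of P(X1, X2)
posterior = {y: s / total for y, s in scores.items()}
print(posterior)                           # roughly {'yes': 0.81, 'no': 0.19}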

100
Q

Precision

A

The fraction of positive class predictions that are correct.

A precision of 80% means that 80% of the items predicted to be in the class actually belong to it.

Formula:
True Positives / (True Positives + False Positives)

101
Q

Recall

A

The fraction of correct class predictions over the actual amount in that class.

Recall of 80% means that 80% of the items of a class were found.

Formula:
True Positives / (True Positives + False Negatives)

102
Q

F1-Measure

A

Combination of precision and recall in one score

Formula: (2 * Precision * Recall) / (Precision + recall)
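
A quick sketch tying the precision, recall and F1 cards together (the counts are made up):

tp, fp, fn, tn = 40, 10, 10, 40                              # hypothetical confusion-matrix counts

precision = tp / (tp + fp)                                   # 0.8
recall = tp / (tp + fn)                                      # 0.8
f1 = 2 * precision * recall / (precision + recall)           # 0.8
accuracy = (tp + tn) / (tp + tn + fp + fn)                   # 0.8
error_rate = (fp + fn) / (tp + tn + fp + fn)                 # 0.2
print(precision, recall, f1, accuracy, error_rate)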

103
Q

ROC Curve

A

The ROC curve is a plot of the recall (true positive rate) against the false positive rate (1 - specificity).

104
Q

Decision Tree

A

A data-driven classification method that is easy to explain and can easily be transformed into a ruleset

105
Q

Information Gain

A

A method to determine the best way to split data in a decision tree: the split with the largest reduction in impurity. Calculate M0 - M12 and M0 - M34 (parent impurity minus the combined impurity of each candidate split's child nodes) using the Gini index; the split with the larger gain is the better one.

106
Q

Gini Index

A

1 - (the probability of “yes”)^2 - (the probability of “no”)^2
The lower, the better the split.

107
Q

Entropy Measure

A

Similar to the Gini index, but based on entropy: the proportion of instances that belong to each class k is multiplied by its base-2 logarithm and summed (with a minus sign). The lower, the better the split.
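
An illustrative sketch of both impurity measures and the split-gain idea (two-class node, made-up counts):

import math

def gini(p_yes):
    p_no = 1 - p_yes
    return 1 - p_yes ** 2 - p_no ** 2

def entropy(p_yes):
    return -sum(p * math.log2(p) for p in (p_yes, 1 - p_yes) if p > 0)

# parent node with 10 "yes" / 10 "no", split into two children of 10 records each
parent = gini(10 / 20)                               # 0.5
children = 0.5 * gini(8 / 10) + 0.5 * gini(2 / 10)   # weighted child impurity = 0.32
gain = parent - children                             # 0.18: the larger, the better the split
print(parent, children, gain, entropy(0.5))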

108
Q

Recall Formula

A

tp/(tp+fn)

109
Q

Precision Formula

A

tp/(tp+fp)

110
Q

Accuracy Formula

A

(tp+tn)/(tp+tn+fp+fn) or (tp+tn) / n

111
Q

F1-Measure Formula

A

2*(Precision * Recall) / (Precision + Recall)

112
Q

Error Rate Formula

A

(fp + fn) / N

113
Q

Univariate

A

Statistical analysis with a type of data which consists of observations on only a single characteristic or attribute. For example, the salaries of workers in an industry

114
Q

Multivariate

A

Statistical analysis with a type of data which consists of observations on two or more characteristics or attributes. For example, the salaries and ages of workers in an industry.

115
Q

Binary Split

A

Trees with only binary (two-way) splits are called binary trees.

116
Q

Multiway split

A

Trees with more than two-way splits are called Multiway trees

117
Q

Overfitting

A

When you make a model that fits every point of the training data perfectly but fails to capture the underlying structure, resulting in poor predictive or classification power on test data.

118
Q

Underfitting

A

When you make a model that doesn’t fit the training data enough so that it is unable to capture the underlying (actual) structure

119
Q

Parametric

A

An analytical method that assumes that the underlying data follow a normal distribution and have homogeneous sample variances.

120
Q

Non-parametric

A

An analytical method that does NOT assume that the underlying data follow a normal distribution and have homogeneous sample variances.

121
Q

Supervised training

A

A training method that requires that you already know the actual class labels for the training set, allowing you to train the model to predict those classes on test data.

122
Q

Unsupervised training

A

A training method that does not require you to know the actual class for the training set.

123
Q

Antecedent

A

Something that existed before, or logically precedes something else

124
Q

Consequent

A

Something that exists after, or logically follows something else (following as a result / effect)

125
Q

Item

A

A single attribute-value pair. For example: Outlook = sunny

126
Q

Itemset

A

All items occurring in a transaction or record. For example: Outlook = sunny, Humidity = high, PlayTennis = No

127
Q

Frequent itemset

A

An itemset that has minimal support of K (where K is predefined by the user)

128
Q

Association rule

A

A rule that follows the IF - THEN format.

For example: IF humidity = high AND play = no THEN windy = false AND outlook = sunny

129
Q

Apriori Property

A

The Apriori algorithm is a seminal algorithm for mining frequent itemsets for Boolean association rules. It exploits the Apriori property: every subset of a frequent itemset is also frequent.

Example:
If (milk, bread, jam) is a frequent itemset, then (milk, bread) is also a frequent itemset.

130
Q

Interesting association rules

A

Association rules can be interesting for two kinds of reasons:

Objective measures & Subjective measures

131
Q

Objective association rule measure

A

An association rule is objectively interesting if it has:
Support
Confidence
Lift

132
Q

Subjective association rule measure

A

An association rule is subjectively interesting if it:
Is unexpected
Is actionable

133
Q

(Association rule) Support Formula

A

frequency (X, Y) / N

134
Q

(Association rule) Confidence Formula

A

frequency (X, Y) / frequency (X)

135
Q

(Association rule) Lift Formula

A
Support(X, Y) / (Support(X) * Support(Y))
OR
Confidence / Prob(RHS)
OR
Prob(LHS and RHS) / (Prob(LHS) * Prob(RHS))
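
An illustrative sketch of the three objective measures for the rule IF milk THEN bread, using a small made-up transaction list:

transactions = [
    {"milk", "bread"},
    {"milk", "bread", "jam"},
    {"milk"},
    {"bread"},
    {"jam"},
]
N = len(transactions)

freq_lhs = sum("milk" in t for t in transactions)               # frequency(X)
freq_rhs = sum("bread" in t for t in transactions)              # frequency(Y)
freq_both = sum({"milk", "bread"} <= t for t in transactions)   # frequency(X, Y)

support = freq_both / N                                 # 2/5 = 0.4
confidence = freq_both / freq_lhs                       # 2/3 = 0.67
lift = support / ((freq_lhs / N) * (freq_rhs / N))      # 0.4 / 0.36 = 1.11
print(support, confidence, lift)
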
136
Q

Lift value meaning

A

If Lift = 0, LHS and RHS never occur together.

If Lift = 1, LHS and RHS are independent.

If Lift > 1, there is a positive association between LHS and RHS (LHS is a strong indicator for RHS).

If Lift < 1, there is a negative association (LHS and RHS occur together less often than expected under independence).

137
Q

Confusion Matrix order

A

True Negative, False Positive

False Negative, True Positive

138
Q

Decision tree node meaning

A

Red node = Non-acceptor
Blue node = Acceptor
Left = Non-acceptor
Right = Acceptor

A red node showing:
21
[20, 1]
means 21 cases were classified as non-acceptor: 20 were truly non-acceptors (true negatives) and 1 was actually an acceptor (false negative).

139
Q

Clustering

A

Unsupervised classification method where data is divided into natural groups such that objects in the same cluster are similar and objects in different clusters are dissimilar.

140
Q

Clustering methods

A

Partitional algorithms:
Construct various partitions and then evaluate them by some criterion

Hierarchical algorithms:
Create a hierarchical decomposition of the objects using some criterion

141
Q

Partitional Clustering Method

A

Pre-specify a desired number of clusters, i.e. 'K'
Assign each object to a cluster (K clusters)
Minimise the sum of distances within clusters
Maximise the sum of distances between clusters
In this course only K-means is considered

142
Q

Centroid

A

A virtual object that is used to represent all the physical objects in a cluster

143
Q

K-Means clustering

A

Partitional clustering approach
Each cluster is associated with a centroid (center point of the cluster)
Each point is assigned to the cluster with the closest centroid
Basic algorithm is very simple

144
Q

K-Means clustering algorithm

A
  1. Select K points as the initial centroids (at random)
    REPEAT
  2. Form K clusters by assigning all points to the closest centroid
  3. Recompute the centroid of each cluster
    UNTIL the centroids don’t change
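
A minimal sketch of these steps (illustrative only) on made-up 1-D points with K = 2:

import random

points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
K = 2
centroids = random.sample(points, K)              # step 1: random initial centroids

while True:
    # step 2: assign every point to its closest centroid
    clusters = [[] for _ in range(K)]
    for p in points:
        nearest = min(range(K), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # step 3: recompute each centroid as the mean of its cluster
    new_centroids = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
    if new_centroids == centroids:                # UNTIL the centroids don't change
        break
    centroids = new_centroids

print(centroids, clusters)
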
145
Q

Hierarchical Clustering Method

A

Hierarchical decomposition of the objects using some criterion.
Begin with a matrix of the distances between all object pairs, then continue with a bottom-up (agglomerative) approach.

146
Q

Bottom-Up (agglomerative) Approach

A

1. Put each item in its own cluster
REPEAT
2. Merge the best (closest) pair of clusters into a new cluster
UNTIL all clusters are fused together
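
A minimal single-linkage sketch of this loop (illustrative only, made-up 1-D points):

points = [1.0, 1.5, 5.0, 5.2, 9.0]
clusters = [[p] for p in points]                  # each item starts in its own cluster

def single_linkage(c1, c2):
    # distance between the two closest members of the two clusters
    return min(abs(a - b) for a in c1 for b in c2)

while len(clusters) > 1:                          # UNTIL all clusters are fused together
    # find the closest pair of clusters and merge them
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = single_linkage(clusters[i], clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    d, i, j = best
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]
    print(f"merged at distance {d}: {clusters}")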

147
Q

Single linkage distance

A

Determined by the distance of the two closest records (nearest neighbors) in the different clusters.
Tends to produce long chains / elongated clusters.

148
Q

Complete linkage distance

A

Determined by the maximum distance between any two records in the different clusters.
Tends to produce spherical clusters with consistent diameter.

149
Q

Centroid linkage distance

A

Calculated as the distance between the centroids of the different clusters.

150
Q

Ward’s method

A

Starts with n clusters, each containing a single object. These n clusters are combined to make one cluster containing all objects. At each step, the process makes a new cluster that minimizes variance, measured by an index called E (Error sum of squares)

151
Q

Dendrogram (Binary tree)

A

Is a treelike diagram that summarizes the process of clustering. The records are drawn on the X-axis (horizontal). Vertical line reflects distance between records. Use a cutoff value to limit the amount of clusters.

152
Q

Average linkage distance

A

Calculated as the average distance between all pairs of records between the different clusters.
Is less affected by outliers.

153
Q

Aspects to consider when validating clusters

A

Cluster interpretability
Cluster stability
Cluster separation
Number of clusters

154
Q

Retrospective analytics

A

Descriptive analytics and diagnostic analytics

Methods: Database management, Data warehousing framework, OLAP databases & Dashboards

155
Q

Prospective analytics

A

Predictive & Prescriptive analytics

Methods: Data Mining Process and all the Python models

156
Q

Static Typing

A

Variable type is checked at compile-time

The variable itself is defined as a certain type. All values of the variable need to be of that type.

157
Q

Dynamic Typing

A

Variable type is checked at run-time

The variable is not defined as a certain type. The value of the variable is assigned a type

158
Q

Interpreter

A

Program that directly executes instructions written in a programming language

159
Q

Compiler

A

Program that transforms source code into a lower-level (machine) language.

160
Q

Inmon approach

A

Top-down, Complex, Subject- or data-driven

Enterprise-wide (atomic) DW, Feeds departmental DB’s

161
Q

Kimball approach

A

Bottom-up, Simple, Process-oriented, Dimensional modelling. The 'DW' is a collection of data marts.

Data marts model single business processes and use conformed dimensions.

162
Q

Transient data source

A

Name for the supporting databases that store intermediate transformation results (ETL-stage)

163
Q

Transform stage

A

Simple data conversions (time notation, etc)
Complex type conversions (categorical data, address formats)
Currency conversions
Language conversions

164
Q

Dimension Hierarchy

A

The hierarchy within a dimension. For example, the hierarchy in the Time dimension: All, Year, Month, Day.

166
Q

Calculate number of cubes in a multi-dimensional database

A

(Number of hierarchy levels per dimension + 1) ^ number of dimensions. For example, with 3 dimensions that each have 3 hierarchy levels: (3 + 1)^3 = 64 cuboids.

167
Q

Star scheme: restating

A

Overwrite old values. Use with infrequent changes (e.g. customer’s address), or to correct errors. Kills history (overwriting previous values).

168
Q

Nominal Data

A

AKA Categorical data. Values themselves only serve as label. E.g. Sunny, Overcast, Rain

169
Q

Ordinal Data

A

Impose order, but no distance between values can be derived. E.g. Hot, Mild, Cold

170
Q

Interval Data

A

Impose order and distance between value can be derived. However, no natural zero-point. E.g. Temperature in Fahrenheit or the current year.

171
Q

Ratio Data

A

Impose order, and distance between values can be derived. Also has a natural zero point. E.g. length, weight.

172
Q

SEMMA

A

Sample, Explore, Modify, Model, Assess

173
Q

CRISP-DM

A
Cross Industry Standard Process for Data Mining
Step 1: Business Understanding
Step 2: Data Understanding
Step 3: Data Preparation (I)
**Step 1-3 take about 85% of project time.**
Step 4: Model Building
Step 5: Testing and Evaluation
Step 6: Deployment
174
Q

Laplace Smoothing

A

Prevents nullification of the Naive Bayes product by preventing any single conditional probability from being zero.
Formula: (Count(A, C) + 1) / (Count(C) + number of distinct values of the attribute)
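
A small illustrative sketch with made-up counts for one attribute within one class:

counts = {"sunny": 3, "overcast": 2, "rain": 0}     # hypothetical value counts, Count(A, C)
class_count = sum(counts.values())                  # Count(C) = 5
n_values = len(counts)                              # number of distinct attribute values = 3

smoothed = {v: (c + 1) / (class_count + n_values) for v, c in counts.items()}
print(smoothed)                                     # 'rain' now gets 1/8 instead of 0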

175
Q

Mean Absolute Error / Deviation (MAE)

A

(1/n) * ∑ |e(i)|

176
Q

Mean Percentage Error (MPE)

A

100 * (1/n) * ∑( e(i) / y(i) )
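
A closing illustrative sketch of both error measures, assuming e(i) = y(i) - predicted(i) and made-up values:

y = [10.0, 20.0, 30.0, 40.0]              # actual values
y_hat = [12.0, 18.0, 33.0, 39.0]          # predicted values
errors = [yi - yh for yi, yh in zip(y, y_hat)]
n = len(y)

mae = sum(abs(e) for e in errors) / n                        # mean absolute error = 2.0
mpe = 100 * sum(e / yi for e, yi in zip(errors, y)) / n      # mean percentage error = -4.375%
print(mae, mpe)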