test Flashcards
(24 cards)
What is data mining?
The non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data.
How does data mining relate to Knowledge Discovery in Databases (KDD)?
Data mining is one (central) step within the broader KDD process.
Name four common data-mining tasks.
Classification, clustering, association-rule mining, and regression (prediction).
Which two broad categories do data-mining tasks fall into?
Predictive mining and descriptive mining.
During KDD, what is the purpose of the pattern-evaluation step?
To apply interestingness measures to the patterns produced by the mining algorithms and keep only those that are truly interesting and actionable.
List the main stages of the KDD pipeline in order.
Data cleaning → data integration → data selection → data transformation → data mining → pattern evaluation → knowledge presentation.
How does data mining differ from Online Analytical Processing (OLAP)?
OLAP supports fast, interactive exploration of known aggregates, whereas data mining digs deeper to discover previously unknown patterns automatically.
What is DMQL and why is it useful?
The Data Mining Query Language – a SQL-like language that lets users specify, in a high-level way, which patterns to mine and how to post-process them.
Why is the data-transformation step necessary before mining?
Because many algorithms expect data in specific formats or scales (e.g., normalised numbers, encoded categories).
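As an illustration of the answer above, here is a minimal sketch of two common transformations: min-max normalisation for numeric values and one-hot encoding for categories. The function names and sample data are hypothetical, chosen only for this example.

```python
def min_max_normalise(values):
    """Rescale numbers linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot_encode(labels):
    """Turn each label into a 0/1 vector with one position per distinct category."""
    categories = sorted(set(labels))  # fixed column order: alphabetical
    return [[1 if label == c else 0 for c in categories] for label in labels]


# Hypothetical sample data.
ages = [20, 30, 40]
colours = ["red", "blue", "red"]

scaled_ages = min_max_normalise(ages)       # [0.0, 0.5, 1.0]
encoded_colours = one_hot_encode(colours)   # columns are (blue, red)
```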
Define data cleaning in the context of KDD.
The process of detecting, correcting or removing errors, inconsistencies and noise from the raw data.
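The sketch below shows one possible cleaning pass over a handful of records: it drops exact duplicates, treats impossible values as missing, and fills missing values with the mean of the valid ones. The record layout, the rule that a negative age is an error, and mean imputation are all assumptions made for this example; real cleaning rules depend on the data.

```python
from statistics import mean

# Hypothetical raw records.
raw_records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},  # missing value
    {"id": 3, "age": -5},    # impossible value, treated as an error
    {"id": 1, "age": 34},    # exact duplicate of the first record
    {"id": 4, "age": 26},
]


def clean(records):
    """Remove duplicates, flag invalid ages as missing, then impute the mean."""
    seen, unique = set(), []
    for record in records:
        key = (record["id"], record["age"])
        if key not in seen:
            seen.add(key)
            unique.append(dict(record))  # copy so the input is left untouched

    # Assumed rule: a negative age is an error and becomes a missing value.
    for record in unique:
        if record["age"] is not None and record["age"] < 0:
            record["age"] = None

    # Fill every missing age with the mean of the valid ages.
    fill_value = mean(r["age"] for r in unique if r["age"] is not None)
    for record in unique:
        if record["age"] is None:
            record["age"] = fill_value
    return unique


cleaned = clean(raw_records)
```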
What is the role of data integration in KDD?
To merge data from multiple heterogeneous sources into a coherent data set.
Why are measures of interestingness important?
They rank or filter mined patterns so that only the most relevant, novel or useful results are presented.
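Support and confidence are two widely used objective interestingness measures for association rules. The card above does not name specific measures, so the choice of these two, the helper names and the toy transactions are assumptions made for this sketch.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in the itemset."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimated probability of the consequent given the antecedent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


rule_support = support({"bread", "milk"}, transactions)          # 2 of 4 baskets
rule_confidence = confidence({"bread"}, {"milk"}, transactions)  # 2 of the 3 bread baskets
```

A mining system could then keep only the rules whose support and confidence exceed user-chosen thresholds.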
What are the four coupling schemes for integrating a mining system with a database?
No coupling, loose coupling, semi-tight coupling, and tight coupling.
What does the acronym OLAM stand for?
Online Analytical Mining.
Give one advantage of multidimensional data mining over traditional flat-table mining.
It can exploit pre-computed cube aggregates to speed up pattern discovery.
What is a fact table in a data-warehouse star schema?
A central table containing the numeric measures of a business process, together with foreign keys that reference the surrounding dimension tables.
How does a dimension table differ from a fact table?
Dimension tables store descriptive attributes that define the perspectives (e.g., time, product, geography) used to analyse facts.
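To make the fact/dimension split concrete, the following sketch models a tiny star schema with plain Python structures: one dimension table keyed by a surrogate key and one fact table holding a measure plus that key. The table names, columns and figures are hypothetical.

```python
from collections import defaultdict

# Dimension table: descriptive attributes, keyed by product_key.
dim_product = {
    1: {"name": "Laptop", "category": "Electronics"},
    2: {"name": "Desk", "category": "Furniture"},
    3: {"name": "Phone", "category": "Electronics"},
}

# Fact table: one row per sale, with a numeric measure and a foreign key.
fact_sales = [
    {"product_key": 1, "amount": 1200},
    {"product_key": 2, "amount": 300},
    {"product_key": 3, "amount": 800},
    {"product_key": 2, "amount": 150},
]


def revenue_by_category(facts, products):
    """Join each fact to its dimension row and sum the measure per category."""
    totals = defaultdict(int)
    for row in facts:
        category = products[row["product_key"]]["category"]
        totals[category] += row["amount"]
    return dict(totals)


revenue = revenue_by_category(fact_sales, dim_product)
```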
What is metadata in a data-mining context?
Data that describes other data – such as schema definitions, data provenance, quality indicators and transformation histories.
Describe a schema hierarchy in OLAP.
A concept hierarchy defined as a total or partial order among attributes in the schema (e.g., city → state → country), so that data can be viewed at successive levels of abstraction.
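A roll-up along the city, state, country hierarchy can be sketched as repeated aggregation through parent mappings. The mappings, the sales figures and the `roll_up` helper below are hypothetical and stand in for what an OLAP engine would do internally.

```python
from collections import defaultdict

# Hypothetical hierarchy mappings: each member points to its parent level.
city_to_state = {
    "Chicago": "Illinois",
    "Springfield": "Illinois",
    "Seattle": "Washington",
}
state_to_country = {"Illinois": "USA", "Washington": "USA"}

# Hypothetical measure at the lowest level of the hierarchy.
sales_by_city = {"Chicago": 500, "Springfield": 200, "Seattle": 300}


def roll_up(totals, parent_of):
    """Aggregate totals one level up the hierarchy using the parent mapping."""
    rolled = defaultdict(int)
    for member, amount in totals.items():
        rolled[parent_of[member]] += amount
    return dict(rolled)


sales_by_state = roll_up(sales_by_city, city_to_state)
sales_by_country = roll_up(sales_by_state, state_to_country)
```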
What are the three strategies for materialising a data cube?
Do-not-materialise, partial materialisation, and full materialisation.
Contrast OLTP and OLAP systems in one sentence.
OLTP captures current, transaction-level data for routine operations, whereas OLAP stores integrated, historical data optimised for analysis.
Why is visualisation considered part of the KDD process?
Because it transforms mined patterns into human-readable forms (charts, graphs, dashboards) that users can interpret and act on.
What is data reduction and when is it applied?
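A pre-processing technique that produces a much smaller but still representative version of the data set (e.g., by sampling, aggregation or dimensionality reduction); it is applied before mining, when running the analysis on the full data set would be too slow or costly.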
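Simple random sampling without replacement is one way to reduce data volume. The sketch below uses only Python's standard `random` module; the function name, the fixed seed and the synthetic population are assumptions made for this example.

```python
import random


def simple_random_sample(rows, k, seed=0):
    """Draw k distinct rows uniformly at random; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(rows, k)


# Hypothetical data set of 10,000 row identifiers, reduced to 100 rows.
population = list(range(10_000))
sample = simple_random_sample(population, 100)
```

Other reduction methods, such as aggregation or dimensionality reduction, trade volume for fidelity in different ways and are not shown here.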
Give one ethical consideration before performing data mining.
Whether the mining activity violates privacy or could lead to discriminatory outcomes.