DM Exam Paper Qs Flashcards

(96 cards)

1
Q

Give a definition of an outlier

A

A data point that lies unusually far from the majority of datapoints

2
Q

Give an example of an outlier in credit card transactions

A

A fraudulent transaction

3
Q

Propose two methods that can be used to detect outliers

A
  • Clustering
  • combined computer and human inspection
  • regression
  • box plots
4
Q

Which outlier detection method is the most reliable

A

Combined computer and human inspection is the most reliable: the computer detects the suspicious values, which are then checked by a human. This two-step process is more reliable than relying on an algorithm alone.

5
Q

Steps of K-means algorithm

A
  1. randomly select k initial cluster centroids
  2. assign every item to its nearest centroid
  3. recompute the centroids
  4. Repeat from step 2 until no reassignments occur
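A minimal Python sketch of these steps (pure NumPy; points is assumed to be an (n, d) array, and the seed, max_iters cap, and convergence test are illustrative choices — it also assumes no cluster ever becomes empty):

```python
import numpy as np

def k_means(points, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. randomly select k initial cluster centroids from the data
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    assignments = None
    for _ in range(max_iters):
        # 2. assign every item to its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = distances.argmin(axis=1)
        # 4. stop when no reassignments occur
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # 3. recompute each centroid as the mean of its assigned points
        centroids = np.array([points[assignments == j].mean(axis=0) for j in range(k)])
    return centroids, assignments
```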
6
Q

Give three application examples of spatiotemporal data streams

A
  • traffic monitoring
  • environmental monitoring
  • weather pattern analysis
  • asset tracking in logistics
7
Q

Discuss what kind of interesting knowledge can be mined from such data streams, with limited time and resources.

A

Summarisation and aggregation of data at different levels of granularity

8
Q

Identify and discuss the major challenges in spatiotemporal data mining

A

High dimensionality: each data point involves multiple dimensions (space and time)

  • analysing and visualising such data can be computationally intensive
  • requires careful consideration of the data structure
  • and of the choice of computation method (selective, partial, or full materialisation)
9
Q

Using one application example, sketch a method to mine one kind of knowledge from such stream data efficiently.

A
  • discuss the inputs and outputs of weather data warehouse
  • Draw star schema of weather warehouse
  • Use OLAP for efficient data analysis
10
Q

What is the bottleneck of Apriori Algorithm

A

Candidate Generation
- very large candidate sets
- multiple scans of the database

11
Q

Briefly explain a method to improve Apriori’s efficiency

A

Transaction reduction: A transaction that does not contain any frequent itemsets is useless in subsequent scans
Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB

12
Q

Apriori Algorithm Steps

A

Given a Database D and a minimum support MS:
1. Initial scan of D to get the frequencies of each individual item (Candidate generation)
2. Eliminate candidates that don’t have MS
3. Construct new candidates by combining eligible candidates (resulting itemset must be only one item larger)
4. Repeat pruning and construction until all frequent itemsets are found
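A compact Python sketch of these steps (transactions as iterables of items, min_support as an absolute count; the candidate-join step is a simple illustrative version rather than an optimised one):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return every frequent itemset (as a frozenset) with its support count."""
    # 1. initial scan: count each individual item (the 1-item candidates)
    counts = {}
    for t in transactions:
        for item in set(t):
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    # 2. eliminate candidates that do not reach minimum support
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    all_frequent, k = dict(frequent), 2
    while frequent:
        # 3. build k-item candidates whose (k-1)-item subsets are all frequent
        items = sorted({i for s in frequent for i in s})
        candidates = [frozenset(c) for c in combinations(items, k)
                      if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))]
        # one scan of the database to count each surviving candidate
        counts = {c: sum(1 for t in transactions if c <= set(t)) for c in candidates}
        # 4. prune again and repeat until no new frequent itemsets are found
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent
```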

13
Q

Give the necessary steps of the FP-growth algorithm.

A
  1. Scan dataset and construct frequency table of each item
  2. Eliminate items without minimum support
  3. Transform each transaction by reordering its frequent items in descending order of frequency
  4. Construct tree in top-down recursive fashion
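A minimal Python sketch of these steps, building the FP-tree itself (min_support as an absolute count; the recursive mining of conditional pattern bases is omitted, and transactions are assumed to contain no duplicate items):

```python
from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    # 1. scan the dataset and construct a frequency table of each item
    freq = Counter(item for t in transactions for item in t)
    # 2. eliminate items without minimum support
    freq = {i: c for i, c in freq.items() if c >= min_support}
    root, header = FPNode(None), defaultdict(list)   # header table links the nodes of each item
    for t in transactions:
        # 3. keep only frequent items, ordered by descending frequency
        ordered = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        # 4. insert the ordered transaction top-down, sharing common prefixes
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = FPNode(item, parent=node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header
```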
14
Q

What is a data cube and how is it formed?

A

A multidimensional data model views data in the form of a data cube.
It is formed from the lattice of cuboids and allows data to be modelled and viewed in multiple dimensions

15
Q

What is a fact table

A

It contains the measures (numerical facts) and keys to each of the related dimension tables

16
Q

Give a definition for each of the three categories of measures that can be used for the data warehouse

A

Distributive: can be computed independently on partitions of the data and the partial results combined (e.g. count)
Algebraic: computed by applying an algebraic function to a bounded number of distributive measures (e.g. avg = sum / count)
Holistic: require analysing the entire dataset as a whole, because there is no constant bound on the storage needed for a sub-aggregate (e.g. median)

17
Q

Discuss how to efficiently calculate the top 10 values of a feature in a data cube that is partitioned into multiple chunks

A

Get the maximum of all the chunks, remove it and repeat the process 10 times
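The same idea sketched in Python with heapq: each chunk reports only its own largest values, and the small per-chunk lists are merged to obtain the global top 10 (the chunk contents are illustrative):

```python
import heapq

chunks = [
    [3.2, 9.1, 4.4, 7.0],       # feature values in chunk 1
    [8.8, 1.5, 9.9, 6.3],       # chunk 2
    [2.2, 9.5, 0.4, 7.7, 5.1],  # chunk 3
]

# local step: each chunk computes its own top 10 independently
local_tops = [heapq.nlargest(10, chunk) for chunk in chunks]

# global step: merge the small per-chunk lists and take the top 10 overall
top10 = heapq.nlargest(10, (v for local in local_tops for v in local))
print(top10)
```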

18
Q

Define the notions of support and confidence

A

rule: Y → Z
Support: the probability that a transaction contains both Y and Z (i.e. Y ∪ Z)
Confidence: the conditional probability that a transaction containing Y also contains Z
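For example, with transactions represented as sets (the items and the rule below are illustrative), both measures follow directly from these definitions:

```python
def support_and_confidence(transactions, Y, Z):
    n = len(transactions)
    n_Y = sum(1 for t in transactions if Y <= t)         # transactions containing Y
    n_YZ = sum(1 for t in transactions if (Y | Z) <= t)  # transactions containing Y and Z
    return n_YZ / n, n_YZ / n_Y                          # support = P(Y ∪ Z), confidence = P(Z | Y)

transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}]
print(support_and_confidence(transactions, {"bread"}, {"milk"}))  # (0.5, 0.666...)
```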

19
Q

Explain why holistic measures are not desirable when designing a data warehouse.

A
  • holistic measures can provide valuable insights
  • but they carry trade-offs: computational complexity, scalability, and storage requirements (no constant bound on storage size)
  • this makes them less desirable in a data warehouse, where performance and responsiveness are critical considerations
20
Q

K-means complexity calculation

A

O(tkn)
t: number of iterations
k: number of centroids
n: number of data points

21
Q

What kind of problem is k-means clustering and what does it mean

A

It is NP-hard, which means that it is at least as hard as any problem in NP, and it may in fact be harder

22
Q

Describe the possible negative effects of proceeding directly to mine the data that has not been pre-processed.

A
  • inaccurate results
  • overfitting
  • data inconsistency
23
Q

Define Information Retrieval

A

the process of obtaining relevant information from a large repository of data, typically in the form of documents or records, in response to a user’s information need

24
Q

Define Precision and Recall

A

Precision: The percentage of retrieved documents that are in fact relevant to the query
Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved
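Expressed over sets of document IDs (a hypothetical example):

```python
def precision_recall(retrieved, relevant):
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved)  # fraction of retrieved docs that are relevant
    recall = len(hits) / len(relevant)      # fraction of relevant docs that were retrieved
    return precision, recall

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}
print(precision_recall(retrieved, relevant))  # (0.5, 0.666...)
```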

25
What is the main difference between clustering and classification
Clustering is an unsupervised learning technique which groups the data. Classification is a supervised learning technique that predicts the class or value of unseen data
26
Describe briefly the necessary steps for handling ordinal variables when computing dissimilarity measures.
Replace each ordinal value by its rank r ∈ {1, …, M}, then normalise the ranks onto [0, 1] via z = (r − 1)/(M − 1) so that the ordinal relationship is preserved; the variable can then be treated as interval-scaled when computing dissimilarity.
27
How to calculate the dissimilarity matrix for an attribute variable
1. Calculate dissimilarity: for each pair of observations, calculate the dissimilarity using the chosen distance metric
2. Form the matrix: create a square matrix where each element represents the dissimilarity between two observations
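A small NumPy sketch combining this card with the previous one: an ordinal attribute is mapped to normalised ranks, and a square dissimilarity matrix is built from the pairwise absolute differences (the scale and observations are illustrative):

```python
import numpy as np

scale = ["low", "medium", "high"]                 # ordinal categories in order
observations = ["low", "high", "medium", "high"]

# map each value to its rank r and normalise: z = (r - 1) / (M - 1)
M = len(scale)
rank = {cat: i + 1 for i, cat in enumerate(scale)}
z = np.array([(rank[o] - 1) / (M - 1) for o in observations])

# square matrix: entry (i, j) is the dissimilarity between observations i and j
D = np.abs(z[:, None] - z[None, :])
print(D)
```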
28
Define the notion of a frequent itemset
An itemset whose support (the fraction of transactions that contain it) is no less than a given minimum support threshold. (By the Apriori property, any subset of a frequent itemset must also be frequent.)
29
The FP-growth algorithm adopts the strategy of divide-and-conquer. What is the main advantage of this strategy over the one used in the Apriori algorithm?
Its compactness: the divide-and-conquer strategy works on a compact FP-tree instead of repeatedly generating and scanning candidates over the whole database.
- reduces irrelevant info: no infrequent items
- descending ordering of frequencies: frequent items are more likely to be shared
- the result is never larger than the original dataset
30
Describe briefly the data mining process steps
1. Data preprocessing: gathering, cleaning, transformation, integration
2. Data analysis: model building and evaluation
31
It was claimed that the clustering can be used for both pre-processing and data analysis. Explain the difference between the two applications of clustering
- In pre-processing, clustering is used for sampling (choosing representatives of the data)
- In data analysis, clustering is used to gain insights into the data
32
Explain the main differences between data sampling and data clustering during the pre-processing step
- Sampling uses representatives drawn from the clusters
- Clustering is the process of identifying those clusters in the first place
33
Explain how a query-driven approach is applied on heterogeneous datasets.
Wrappers and integrators are built on top of the heterogeneous sources; a meta-dictionary is used to translate a user query into queries for the individual sources and to resolve inconsistencies between them.
34
Give the advantages and disadvantages of an update-driven approach
ADV: faster query processing, since the data is copied, integrated, and stored in a structured form in advance
DISADV: creates unnecessary overhead when the datasets are small or homogeneous
35
Explain why an update-driven approach is preferred to a query-driven approach
When you have a large dataset of heterogeneous sources, it provides faster processing with potentially lower costs in the long term
36
Describe situations where a query-driven approach is preferable to an update-driven approach in pre-processing
when the dataset is small or consists of homogeneous sources
37
How to draw a snowflake schema
Start from the star schema (a central fact table linked to dimension tables) and normalise the dimension tables into further sub-dimension tables, giving the snowflake shape
38
Give a brief description of the core OLAP operations
- Roll-up (drill-up): summarise data by climbing up a hierarchy or by dimension reduction
- Drill-down (roll-down): the reverse of roll-up
- Slice and dice: project and select
- Pivot (rotate): reorientate the cube
39
One of the benefits of the FP-tree structure is Compactness. Explain why FP-growth method is compact
- reduces irrelevant info: no infrequent items
- descending ordering of frequencies: more frequent items are more likely to be shared
- never larger than the original dataset
40
Explain how the FP-growth method avoids the two costly problems of the Apriori algorithm
The two costly problems of huge candidate sets and repeated database scans are avoided: FP-growth generates no candidates at all and scans the database only twice (once to count item frequencies, once to build the FP-tree).
41
Can you always find an optimal clustering with k-means? Justify your answer
No. K-means converges to a local optimum that depends on the initial choice of centroids, so the global optimum is not guaranteed.
42
Illustrate the strength and weakness of k-means in comparison with a hierarchical clustering scheme (e.g. AGNES)
- initialisation: AGNES is robust, whereas k-means depends on the initial centroids
- cluster shapes: AGNES is more flexible; k-means favours spherical clusters
- outlier sensitivity: k-means is very sensitive
- k-means requires the number of clusters to be specified in advance
- k-means is computationally less expensive
43
Illustrate the strength and weakness of k-means in comparison with k-medoids
- outliers: k-medoids is more robust
- cluster shapes: k-medoids does not assume spherical shapes
- computational complexity: k-medoids is higher
44
what is the basic methodology of Latent Semantic Indexing
Create a term-frequency table, then use a singular value decomposition (SVD) technique to reduce its size, retaining only the most significant rows.
45
What does DBSCAN discover, and what does it rely on?
- clusters of arbitrary shape in spatial datasets with noise
- it relies on a density-based notion of cluster
46
Properties of a spatial data warehouse
Same as DW; integrated, subject-oriented, time-variant, and non-volatile
47
What are the two parameters in density-based clustering
Eps: maximum radius of the neighbourhood
MinPts: minimum number of points in an Eps-neighbourhood of that point
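These two parameters map directly onto, for example, scikit-learn's DBSCAN (assuming scikit-learn is installed; the data points are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.2],   # a dense group
              [8.0, 8.0], [8.1, 8.2],               # another dense group
              [50.0, 50.0]])                        # an isolated point

# eps         -> maximum radius of the neighbourhood
# min_samples -> minimum number of points in an eps-neighbourhood (MinPts)
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)   # points labelled -1 are treated as noise
```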
48
List clustering applications
- pattern recognition
- image processing
- economic science
- spatial data analysis
- WWW
49
What is the simplified assumption in the Naive Bayes Classifier
attributes are conditionally independent
50
Bayesian Theorem Formula
P(A|B) = P(B|A) P(A) / P(B)
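A small hypothetical worked example of the formula (the numbers are made up for illustration):

```python
# A: an email is spam; B: the word "offer" appears in the email
p_A = 0.2           # P(A): prior probability of spam
p_B_given_A = 0.6   # P(B|A): "offer" appears given the email is spam
p_B = 0.25          # P(B): overall probability that "offer" appears

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(p_A_given_B)  # 0.48
```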
51
Briefly explain the two methods to avoid overfitting in Decision Trees
Post-pruning: remove branches from a “fully grown” tree to obtain a sequence of progressively pruned trees
Pre-pruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
52
How is an attribute split performed using the Gini Index?
The attribute that provides the smallest Gini Split is chosen to split the node
53
What does the Gini index measure?
The impurity of a dataset, i.e. how mixed the class labels are; lower values indicate purer nodes
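A short Python sketch covering this card and the previous one: the Gini index of a label set, and the weighted Gini of a binary split, where the attribute yielding the smallest split value would be chosen (the labels are illustrative):

```python
from collections import Counter

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions (0 = pure)."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Weighted Gini of a binary split; smaller means a better split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

labels = ["yes", "yes", "no", "no", "yes", "no"]
print(gini(labels))                                            # 0.5 (maximally mixed)
print(gini_split(["yes", "yes", "yes"], ["no", "no", "no"]))   # 0.0 (perfect split)
```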
54
How is a Decision Tree constructed using a basic algorithm (a greedy algorithm)?
In a top-down recursive divide-and-conquer manner
55
What is tree pruning?
Identification and removal of branches that reflect noise or outliers
56
Give an example of web structure mining using (a) links and (b) generalisation
a) PageRank: assignment of weights to pages using the interconnections between pages
b) VWV: multi-level database representation of the web
57
What is Singular Value Decomposition (SVD)?
A matrix factorization method that decomposes a matrix into the product of three matrices: U, S, and Vᵀ.
58
What are some major difficulties of keyword-based retrieval?
- Synonymy: a keyword does not appear anywhere in the document, even though the document is closely related to the keyword
- Polysemy: the same keyword may mean different things in different contexts
59
what is the strategy for multidimensional analysis of complex data objects?
1. generalise the plan-base in different directions
2. look for sequential patterns in the generalised plans
3. derive high-level plans
60
what is a plan and plan mining
- a plan is a variable sequence of actions
- plan mining is extracting significant generalised (sequential) patterns from a plan-base (a large collection of plans)
61
Why Decision Tree Induction in data mining?
- relatively faster learning speed (than other classification methods)
- convertible to simple and easy-to-understand classification rules
- can use SQL queries for accessing databases
- comparable classification accuracy with other methods
62
List some enhancements to basic decision tree induction
- allow continuous-valued attributes
- handle missing attribute values
- attribute construction
63
What is the general methodology of Association-Based Document Classification
- Extract keywords and terms by information retrieval and association analysis techniques
- Obtain the concept hierarchies, then perform classification and association mining methods
64
Keyword-based association analysis definition
Collect sets of keywords that occur frequently together and then find the association or correlation relationships among them
65
what is the step-by-step method for performing Latent Semantic Indexing
1. Create a term-frequency matrix
2. SVD construction
3. Vector identification
4. Index creation
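A minimal NumPy sketch of these steps (the term-document matrix and the number of retained dimensions k are illustrative):

```python
import numpy as np

# 1. term-frequency matrix: rows = terms, columns = documents
A = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 3, 1],
              [0, 0, 1, 2]], dtype=float)

# 2. SVD construction: A = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# 3. vector identification: keep only the k most significant singular values
k = 2
docs_k = (np.diag(S[:k]) @ Vt[:k]).T   # each row: a document in the reduced k-dim space

# 4. index creation: store the reduced document vectors for similarity search
print(docs_k.shape)   # (4 documents, 2 latent dimensions)
```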
66
What are the purposes of dimension tables, cardinality, and operators in generalisation-based sequence mining
- use dimension tables to generalise the plan-base in a multidimensional way
- cardinality determines the right level of generalisation (level planning)
- use operators (merge +, option [ ]) to further generalise patterns
67
what are the steps for mining spatial association
1. rough spatial computation (as a filter)
2. detailed spatial algorithm (as a refinement)
68
two categories of similarity queries in time-series analysis
- whole matching
- subsequence matching
69
List major features of density-based clustering methods
- discover clusters of arbitrary shape
- handle noise
- one scan
- need density parameters as a stopping condition
70
in K-means, the global optimum may be found using what techniques?
- deterministic annealing
- genetic algorithms
71
List the Data Mining Tasks
1. Problem definition
2. Data gathering and preparation
3. Model building and evaluation
4. Knowledge deployment
72
List the four Data Quality Types in order
Perfect, not perfect, inspection, soft.
73
What is data discretisation?
reducing the number of values for a continuous variable by dividing the range into intervals, replacing the actual values with interval labels.
74
List the three central tendency statistics
mean, median, midrange
75
List the three data dispersion statistics
quantiles, IQR, variance
76
Five Number Summary
min, q1, median, q3, max
77
How to do equal-width partitioning?
Divide the range into N intervals of equal size: Width = (Max − Min) / N
78
How to do equal-depth partitioning?
Divide the range into N intervals, each containing approximately the same number of objects
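A small illustration of both partitioning schemes (the values and N are illustrative):

```python
import numpy as np

values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
N = 3

# equal-width: boundaries every (Max - Min) / N units
width = (values.max() - values.min()) / N
print("equal-width edges:", values.min() + width * np.arange(1, N))   # [14. 24.]

# equal-depth: boundaries at quantiles, so each bin holds ~the same number of values
print("equal-depth edges:", np.quantile(values, [i / N for i in range(1, N)]))
```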
79
name a method to detect redundant data
correlation-based analysis
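For two numeric attributes, this can be as simple as checking the correlation coefficient (the data and the 0.95 threshold are illustrative):

```python
import numpy as np

# the second attribute is (almost) a scaled copy of the first, i.e. redundant
height_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
height_in = height_cm / 2.54 + np.array([0.1, -0.2, 0.0, 0.3, -0.1])

r = np.corrcoef(height_cm, height_in)[0, 1]
if abs(r) > 0.95:   # a strong correlation suggests one of the attributes is redundant
    print(f"correlation {r:.3f}: attributes look redundant")
```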
80
Give an example of a non-parametric method for achieving numerosity reduction
Histograms: divide the data into buckets and store the average for each bucket
Clustering: partition the dataset into clusters and store only the cluster representations
Stratified sampling: approximate the percentage of each class in the overall database to choose a representative subset of the data
81
How to use concept hierarchies for data reduction
collect and replace low level concepts (numerical age) by higher level concepts (young, old)
82
What is a Virtual Data Warehouse?
A set of views over operational databases
83
How is a DW subject-orientated?
- Organised around major subjects, such as customer, product, sales
- Focused on the modelling and analysis of data for decision makers, not on daily operations or transaction processing
- Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision-support process
84
How is a DW integrated?
- integrates multiple heterogeneous data sources
- applies techniques of data cleaning and data integration
85
How is a DW time variant?
- the data provides information from a historical perspective rather than current-value data
- every key structure in the DW contains an element of time, explicitly or implicitly
86
How is a DW non-volatile?
once data is loaded in, it is typically not subject to frequent changes or updates. The data remains relatively stable and unchanged over time.
87
What is Online Transaction Processing?
- OLTP is a major task of traditional relational DBs
- used by IT for day-to-day operations
88
What is a challenge in Weather Pattern Analysis when using a spatial data warehouse
A merged region may contain hundreds of primitive regions (polygons)
89
What are Polygons in the context of spatial data warehouses
Spatial Areas
90
What Dimensions and measurements are in the Fact Table of a Spatial Data Warehouse for mining weather pattern analysis
- Dimensions: region_name, time, precipitation, temperature
- Measurements: region_map, area, count
91
What is a reasonable choice for choice of computation method for Spatial Data Cubes
Selective computation: Only materialise spatial objects that will be accessed frequently
92
What is the difference between a traditional DB and a Data Warehouse
- DB: used for day-to-day operations using OLTP
- DW: used for data analysis using OLAP
93
How to generalise spatial data, and what does it require
- Generalise detailed geographic points into clustered regions, such as business, residential, industrial, or agricultural areas, according to land usage
- requires the merge of a set of geographic areas by spatial operations
94
Structure of star schema for weather warehouse
Fact table with the four dimensions and three measures:
- Time: time_key, day, month...
- Region: region_key, name, location, city,...
- Temperature: temp_key, range, temp_value, description
- Precipitation: key, range, value, description
- Measures: map, area, count
95
Inputs and output of Weather Spatial Data Warehouse
Input:
- a map with weather probes scattered around in an area
- daily weather data
- concept hierarchies for all attributes
Output:
- a map that reveals patterns: merged (similar) regions
96
Method to efficiently generate candidate sets
Store candidate itemsets in a hash-tree
- a leaf node of the tree contains a list of itemsets and counts
- an interior node contains a hash table
- subset function: finds all the candidates contained in a transaction