DM Exam Paper Qs Flashcards

(96 cards)

1
Q

Give a definition of an outlier

A

A data point that lies unusually far from the majority of datapoints

2
Q

Give an example of an outlier in credit card transactions

A

A fraudulent transaction

3
Q

Propose two methods that can be used to detect outliers

A
  • Clustering
  • combined computer and human inspection
  • regression
  • box plots
4
Q

Which outlier detection method is the most reliable

A

Combined computer and human inspection is the most reliable: the computer detects the suspicious values, which are then checked by a human. This two-step process is more reliable than relying on an algorithm alone.

5
Q

Steps of K-means algorithm

A
  1. randomly select k initial cluster centroids
  2. assign every item to its nearest centroid
  3. recompute the centroids
  4. Repeat from step 2 until no reassignments occur
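A minimal Python sketch of these steps (pure NumPy; points is assumed to be an (n, d) array, and the seed, max_iters cap, and convergence test are illustrative choices — it also assumes no cluster ever becomes empty):

```python
import numpy as np

def k_means(points, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. randomly select k initial cluster centroids from the data
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    assignments = None
    for _ in range(max_iters):
        # 2. assign every item to its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = distances.argmin(axis=1)
        # 4. stop when no reassignments occur
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # 3. recompute each centroid as the mean of its assigned points
        centroids = np.array([points[assignments == j].mean(axis=0) for j in range(k)])
    return centroids, assignments
```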
6
Q

Give three application examples of spatiotemporal data streams

A
  • traffic monitoring
  • environmental monitoring
  • weather pattern analysis
  • asset tracking in logistics
7
Q

Discuss what kind of interesting knowledge can be mined from such data streams, with limited time and resources.

A

Summarisation and aggregation of data at different levels of granularity

8
Q

Identify and discuss the major challenges in spatiotemporal data mining

A

High dimensionality: each data point involves multiple dimensions (space and time)

  • analysing and visualising such data can be computationally intensive
  • requires careful consideration of the data structure
  • and of the choice of computation method (selective, partial, or full materialisation)
9
Q

Using one application example, sketch a method to mine one kind of knowledge from such stream data efficiently.

A
  • discuss the inputs and outputs of weather data warehouse
  • Draw star schema of weather warehouse
  • Use OLAP for efficient data analysis
10
Q

What is the bottleneck of Apriori Algorithm

A

Candidate Generation
- very large candidate sets
- multiple scans of the database

11
Q

Briefly explain a method to improve Apriori’s efficiency

A

Transaction reduction: A transaction that does not contain any frequent itemsets is useless in subsequent scans
Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB

12
Q

Apriori Algorithm Steps

A

Given a Database D and a minimum support MS:
1. Initial scan of D to get the frequencies of each individual item (Candidate generation)
2. Eliminate candidates that don’t have MS
3. Construct new candidates by combining eligible candidates (resulting itemset must be only one item larger)
4. Repeat pruning and construction until all frequent itemsets are found
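A compact Python sketch of these steps (transactions as iterables of items, min_support as an absolute count; the candidate-join step is a simple illustrative version rather than an optimised one):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return every frequent itemset (as a frozenset) with its support count."""
    # 1. initial scan: count each individual item (the 1-item candidates)
    counts = {}
    for t in transactions:
        for item in set(t):
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    # 2. eliminate candidates that do not reach minimum support
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    all_frequent, k = dict(frequent), 2
    while frequent:
        # 3. build k-item candidates whose (k-1)-item subsets are all frequent
        items = sorted({i for s in frequent for i in s})
        candidates = [frozenset(c) for c in combinations(items, k)
                      if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))]
        # one scan of the database to count each surviving candidate
        counts = {c: sum(1 for t in transactions if c <= set(t)) for c in candidates}
        # 4. prune again and repeat until no new frequent itemsets are found
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent
```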

13
Q

Give the necessary steps of the FP-growth algorithm.

A
  1. Scan dataset and construct frequency table of each item
  2. Eliminate items without minimum support
  3. Transform each transaction by reordering its frequent items in descending order of frequency
  4. Construct tree in top-down recursive fashion
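A minimal Python sketch of these steps, building the FP-tree itself (min_support as an absolute count; the recursive mining of conditional pattern bases is omitted, and transactions are assumed to contain no duplicate items):

```python
from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    # 1. scan the dataset and construct a frequency table of each item
    freq = Counter(item for t in transactions for item in t)
    # 2. eliminate items without minimum support
    freq = {i: c for i, c in freq.items() if c >= min_support}
    root, header = FPNode(None), defaultdict(list)   # header table links the nodes of each item
    for t in transactions:
        # 3. keep only frequent items, ordered by descending frequency
        ordered = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        # 4. insert the ordered transaction top-down, sharing common prefixes
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = FPNode(item, parent=node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header
```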
14
Q

What is a data cube and how is it formed?

A

A multidimensional data model views data in the form of a data cube.
It is formed from the lattice of cuboids and allows data to be modelled and viewed in multiple dimensions

15
Q

What is a fact table

A

It contains the measures (numerical facts) and keys to each of the related dimension tables

16
Q

Give a definition for each of the three categories of measures that can be used for the data warehouse

A

Distributive: can be computed independently on partitions of the data and the partial results combined (e.g. count)
Algebraic: computed by applying an algebraic function to a bounded number of distributive measures (e.g. avg = sum / count)
Holistic: require analysing the entire dataset as a whole, because there is no constant bound on the storage needed for a sub-aggregate (e.g. median)

17
Q

Discuss how to efficiently calculate the top 10 values of a feature in a data cube that is partitioned into multiple chunks

A

Get the maximum of all the chunks, remove it and repeat the process 10 times
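The same idea sketched in Python with heapq: each chunk reports only its own largest values, and the small per-chunk lists are merged to obtain the global top 10 (the chunk contents are illustrative):

```python
import heapq

chunks = [
    [3.2, 9.1, 4.4, 7.0],       # feature values in chunk 1
    [8.8, 1.5, 9.9, 6.3],       # chunk 2
    [2.2, 9.5, 0.4, 7.7, 5.1],  # chunk 3
]

# local step: each chunk computes its own top 10 independently
local_tops = [heapq.nlargest(10, chunk) for chunk in chunks]

# global step: merge the small per-chunk lists and take the top 10 overall
top10 = heapq.nlargest(10, (v for local in local_tops for v in local))
print(top10)
```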

18
Q

Define the notions of support and confidence

A

rule: Y → Z
Support: the probability that a transaction contains both Y and Z (i.e. Y ∪ Z)
Confidence: the conditional probability that a transaction containing Y also contains Z
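For example, with transactions represented as sets (the items and the rule below are illustrative), both measures follow directly from these definitions:

```python
def support_and_confidence(transactions, Y, Z):
    n = len(transactions)
    n_Y = sum(1 for t in transactions if Y <= t)         # transactions containing Y
    n_YZ = sum(1 for t in transactions if (Y | Z) <= t)  # transactions containing Y and Z
    return n_YZ / n, n_YZ / n_Y                          # support = P(Y ∪ Z), confidence = P(Z | Y)

transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}]
print(support_and_confidence(transactions, {"bread"}, {"milk"}))  # (0.5, 0.666...)
```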

19
Q

Explain why holistic measures are not desirable when designing a data warehouse.

A
  • holistic measures can provide valuable insights
  • but they carry trade-offs: computational complexity, scalability, and storage requirements (no constant bound on storage size)
  • this makes them less desirable in a data warehouse, where performance and responsiveness are critical considerations
20
Q

K-means complexity calculation

A

O(tkn)
t: number of iterations
k: number of centroids
n: number of data points

21
Q

What kind of problem is k-means clustering and what does it mean

A

It is NP-hard, which means that it is at least as hard as any problem in NP, and it may in fact be harder

22
Q

Describe the possible negative effects of proceeding directly to mine the data that has not been pre-processed.

A
  • inaccurate results
  • overfitting
  • data inconsistency
23
Q

Define Information Retrieval

A

the process of obtaining relevant information from a large repository of data, typically in the form of documents or records, in response to a user’s information need

24
Q

Define Precision and Recall

A

Precision: The percentage of retrieved documents that are in fact relevant to the query
Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved
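Expressed over sets of document IDs (a hypothetical example):

```python
def precision_recall(retrieved, relevant):
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved)  # fraction of retrieved docs that are relevant
    recall = len(hits) / len(relevant)      # fraction of relevant docs that were retrieved
    return precision, recall

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}
print(precision_recall(retrieved, relevant))  # (0.5, 0.666...)
```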

25
What is the main difference between clustering and classification
Clustering is an unsupervised learning technique which groups the data. Classification is a supervised learning technique that predicts the class or value of unseen data
26
Describe briefly the necessary steps for handling ordinal variables when computing dissimilarity measures.
Replace each ordinal value by its rank r ∈ {1, …, M}, then normalise the ranks onto [0, 1] via z = (r − 1)/(M − 1) so that the ordinal relationship is preserved; the variable can then be treated as interval-scaled when computing dissimilarity.
27
How to calculate the dissimilarity matrix for an attribute variable
1. Calculate dissimilarity: for each pair of observations, calculate the dissimilarity using the chosen distance metric
2. Form the matrix: create a square matrix where each element represents the dissimilarity between two observations
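A small NumPy sketch combining this card with the previous one: an ordinal attribute is mapped to normalised ranks, and a square dissimilarity matrix is built from the pairwise absolute differences (the scale and observations are illustrative):

```python
import numpy as np

scale = ["low", "medium", "high"]                 # ordinal categories in order
observations = ["low", "high", "medium", "high"]

# map each value to its rank r and normalise: z = (r - 1) / (M - 1)
M = len(scale)
rank = {cat: i + 1 for i, cat in enumerate(scale)}
z = np.array([(rank[o] - 1) / (M - 1) for o in observations])

# square matrix: entry (i, j) is the dissimilarity between observations i and j
D = np.abs(z[:, None] - z[None, :])
print(D)
```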
28
Define the notion of a frequent itemset
An itemset whose support (the fraction of transactions that contain it) is no less than a given minimum support threshold. (By the Apriori property, any subset of a frequent itemset must also be frequent.)
29
The FP-growth algorithm adopts the strategy of divide-and-conquer. What is the main advantage of this strategy over the one used in the Apriori algorithm?
Its compactness: the divide-and-conquer strategy works on a compact FP-tree instead of repeatedly generating and scanning candidates over the whole database.
- reduces irrelevant info: no infrequent items
- descending ordering of frequencies: frequent items are more likely to be shared
- the result is never larger than the original dataset
30
Describe briefly the data mining process steps
1. Data preprocessing: gathering, cleaning, transformation, integration
2. Data analysis: model building and evaluation
31
It was claimed that the clustering can be used for both pre-processing and data analysis. Explain the difference between the two applications of clustering
- In pre-processing, clustering is used for sampling (choosing representatives of the data)
- In data analysis, clustering is used to gain insights into the data
32
Explain the main differences between data sampling and data clustering during the pre-processing step
- Sampling uses representatives drawn from the clusters
- Clustering is the process of identifying those clusters in the first place
33
Explain how a query-driven approach is applied on heterogeneous datasets.
Wrappers and integrators are built on top of the heterogeneous sources; a meta-dictionary is used to translate a user query into queries for the individual sources and to resolve inconsistencies between them.
34
Give the advantages and disadvantages of an update-driven approach
ADV: faster query processing, since the data is copied, integrated, and stored in a structured form in advance
DISADV: creates unnecessary overhead when the datasets are small or homogeneous
35
Explain why an update-driven approach is preferred to a query-driven approach
When you have a large dataset of heterogeneous sources, it provides faster processing with potentially lower costs in the long term
36
Describe situations where a query-driven approach is preferable to an update-driven approach in pre-processing
when the dataset is small or consists of homogeneous sources
37
How to draw a snowflake schema
Start from the star schema (a central fact table linked to dimension tables) and normalise the dimension tables into further sub-dimension tables, giving the snowflake shape
38
Give a brief description of the core OLAP operations
- Roll-up (drill-up): summarise data by climbing up a hierarchy or by dimension reduction
- Drill-down (roll-down): the reverse of roll-up
- Slice and dice: project and select
- Pivot (rotate): reorientate the cube
39
One of the benefits of the FP-tree structure is Compactness. Explain why FP-growth method is compact
- reduces irrelevant info: no infrequent items
- descending ordering of frequencies: more frequent items are more likely to be shared
- never larger than the original dataset
40
Explain how the FP-growth method avoids the two costly problems of the Apriori algorithm
The two costly problems of huge candidate sets and repeated database scans are avoided: FP-growth generates no candidates at all and scans the database only twice (once to count item frequencies, once to build the FP-tree).
41
Can you always find an optimal clustering with k-means? Justify your answer
No. K-means converges to a local optimum that depends on the initial choice of centroids, so the global optimum is not guaranteed.
42
Illustrate the strength and weakness of k-means in comparison with a hierarchical clustering scheme (e.g. AGNES)
- initialisation: AGNES is robust, whereas k-means depends on the initial centroids
- cluster shapes: AGNES is more flexible; k-means favours spherical clusters
- outlier sensitivity: k-means is very sensitive
- k-means requires the number of clusters to be specified in advance
- k-means is computationally less expensive
43
Illustrate the strength and weakness of k-means in comparison with k-medoids
- outliers: k-medoids is more robust
- cluster shapes: k-medoids does not assume spherical shapes
- computational complexity: k-medoids is higher
44
what is the basic methodology of Latent Semantic Indexing
Create a term-frequency table, then use a singular value decomposition (SVD) technique to reduce its size, retaining only the most significant rows.
45
What does DBSCAN discover, and what does it rely on?
- clusters of arbitrary shape in spatial datasets with noise
- it relies on a density-based notion of cluster
46
Properties of a spatial data warehouse
Same as DW; integrated, subject-oriented, time-variant, and non-volatile
47
What are the two parameters in density-based clustering
Eps: maximum radius of the neighbourhood
MinPts: minimum number of points in an Eps-neighbourhood of that point
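These two parameters map directly onto, for example, scikit-learn's DBSCAN (assuming scikit-learn is installed; the data points are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.2],   # a dense group
              [8.0, 8.0], [8.1, 8.2],               # another dense group
              [50.0, 50.0]])                        # an isolated point

# eps         -> maximum radius of the neighbourhood
# min_samples -> minimum number of points in an eps-neighbourhood (MinPts)
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)   # points labelled -1 are treated as noise
```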
48
List clustering applications
- pattern recognition
- image processing
- economic science
- spatial data analysis
- WWW
49
What is the simplified assumption in the Naive Bayes Classifier
attributes are conditionally independent
50
Bayesian Theorem Formula
P(A|B) = P(B|A) P(A) / P(B)
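A small hypothetical worked example of the formula (the numbers are made up for illustration):

```python
# A: an email is spam; B: the word "offer" appears in the email
p_A = 0.2           # P(A): prior probability of spam
p_B_given_A = 0.6   # P(B|A): "offer" appears given the email is spam
p_B = 0.25          # P(B): overall probability that "offer" appears

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(p_A_given_B)  # 0.48
```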
51
Briefly explain the two methods to avoid overfitting in Decision Trees
Post-pruning: remove branches from a “fully grown” tree to obtain a sequence of progressively pruned trees
Pre-pruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
52
How is an attribute split performed using the Gini Index?
The attribute that provides the smallest Gini Split is chosen to split the node
53
What does the Gini index measure?
The impurity of a dataset, i.e. how mixed the class labels are; lower values indicate purer nodes
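A short Python sketch covering this card and the previous one: the Gini index of a label set, and the weighted Gini of a binary split, where the attribute yielding the smallest split value would be chosen (the labels are illustrative):

```python
from collections import Counter

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions (0 = pure)."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Weighted Gini of a binary split; smaller means a better split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

labels = ["yes", "yes", "no", "no", "yes", "no"]
print(gini(labels))                                            # 0.5 (maximally mixed)
print(gini_split(["yes", "yes", "yes"], ["no", "no", "no"]))   # 0.0 (perfect split)
```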
54
How is a Decision Tree constructed using a basic algorithm (a greedy algorithm)?
In a top-down recursive divide-and-conquer manner
55
What is tree pruning?
Identification and removal of branches that reflect noise or outliers
56
Give an example of web structure mining using (a) links and (b) generalisation
a) PageRank: assignment of weights to pages using the interconnections between pages
b) VWV: multi-level database representation of the web
57
What is Singular Value Decomposition (SVD)?
A matrix factorization method that decomposes a matrix into the product of three matrices: U, S, and Vᵀ.
58
What are some major difficulties of keyword-based retrieval?
- Synonymy: a keyword does not appear anywhere in the document, even though the document is closely related to the keyword
- Polysemy: the same keyword may mean different things in different contexts
59
what is the strategy for multidimensional analysis of complex data objects?
1. generalise the plan-base in different directions
2. look for sequential patterns in the generalised plans
3. derive high-level plans
60
what is a plan and plan mining
- a plan is a variable sequence of actions
- plan mining is extracting significant generalised (sequential) patterns from a plan-base (a large collection of plans)
61
Why Decision Tree Induction in data mining?
- relatively faster learning speed (than other classification methods)
- convertible to simple and easy-to-understand classification rules
- can use SQL queries for accessing databases
- comparable classification accuracy with other methods
62
List some enhancements to basic decision tree induction
- allow continuous-valued attributes
- handle missing attribute values
- attribute construction
63
What is the general methodology of Association-Based Document Classification
- Extract keywords and terms by information retrieval and association analysis techniques
- Obtain the concept hierarchies, then perform classification and association mining methods
64
Keyword-based association analysis definition
Collect sets of keywords that occur frequently together and then find the association or correlation relationships among them
65
what is the step-by-step method for performing Latent Semantic Indexing
1. Create a term-frequency matrix
2. SVD construction
3. Vector identification
4. Index creation
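A minimal NumPy sketch of these steps (the term-document matrix and the number of retained dimensions k are illustrative):

```python
import numpy as np

# 1. term-frequency matrix: rows = terms, columns = documents
A = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 3, 1],
              [0, 0, 1, 2]], dtype=float)

# 2. SVD construction: A = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# 3. vector identification: keep only the k most significant singular values
k = 2
docs_k = (np.diag(S[:k]) @ Vt[:k]).T   # each row: a document in the reduced k-dim space

# 4. index creation: store the reduced document vectors for similarity search
print(docs_k.shape)   # (4 documents, 2 latent dimensions)
```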
66
What are the purposes of dimension tables, cardinality, and operators in generalisation-based sequence mining
- use dimension tables to generalise the plan-base in a multidimensional way
- cardinality determines the right level of generalisation (level planning)
- use operators (merge +, option [ ]) to further generalise patterns
67
what are the steps for mining spatial association
1. rough spatial computation (as a filter)
2. detailed spatial algorithm (as a refinement)
68
two categories of similarity queries in time-series analysis
- whole matching
- subsequence matching
69
List major features of density-based clustering methods
- discover clusters of arbitrary shape
- handle noise
- one scan
- need density parameters as a stopping condition
70
in K-means, the global optimum may be found using what techniques?
- deterministic annealing
- genetic algorithms
71
List the Data Mining Tasks
1. Problem definition
2. Data gathering and preparation
3. Model building and evaluation
4. Knowledge deployment
72
List the four Data Quality Types in order
Perfect, not perfect, inspection, soft.
73
What is data discretisation?
reducing the number of values for a continuous variable by dividing the range into intervals, replacing the actual values with interval labels.
74
List the three central tendency statistics
mean, median, midrange
75
List the three data dispersion statistics
quantiles, IQR, variance
76
Five Number Summary
min, q1, median, q3, max
77
How to do equal-width partitioning?
Divide the range into N intervals of equal size: Width = (Max − Min) / N
78
How to do equal-depth partitioning?
Divide the range into N intervals, each containing approximately the same number of objects
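A small illustration of both partitioning schemes (the values and N are illustrative):

```python
import numpy as np

values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
N = 3

# equal-width: boundaries every (Max - Min) / N units
width = (values.max() - values.min()) / N
print("equal-width edges:", values.min() + width * np.arange(1, N))   # [14. 24.]

# equal-depth: boundaries at quantiles, so each bin holds ~the same number of values
print("equal-depth edges:", np.quantile(values, [i / N for i in range(1, N)]))
```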
79
name a method to detect redundant data
correlation-based analysis
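For two numeric attributes, this can be as simple as checking the correlation coefficient (the data and the 0.95 threshold are illustrative):

```python
import numpy as np

# the second attribute is (almost) a scaled copy of the first, i.e. redundant
height_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
height_in = height_cm / 2.54 + np.array([0.1, -0.2, 0.0, 0.3, -0.1])

r = np.corrcoef(height_cm, height_in)[0, 1]
if abs(r) > 0.95:   # a strong correlation suggests one of the attributes is redundant
    print(f"correlation {r:.3f}: attributes look redundant")
```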
80
Give an example of a non-parametric method for achieving numerosity reduction
Histograms: divide the data into buckets and store the average for each bucket
Clustering: partition the dataset into clusters and store only the cluster representations
Stratified sampling: approximate the percentage of each class in the overall database to choose a representative subset of the data
81
How to use concept hierarchies for data reduction
collect and replace low level concepts (numerical age) by higher level concepts (young, old)
82
What is a Virtual Data Warehouse?
A set of views over operational databases
83
How is a DW subject-orientated?
- Organised around major subjects, such as customer, product, sales
- Focused on the modelling and analysis of data for decision makers, not on daily operations or transaction processing
- Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision-support process
84
How is a DW integrated?
- integrates multiple heterogeneous data sources
- applies techniques of data cleaning and data integration
85
How is a DW time variant?
- the data provides information from a historical perspective rather than current-value data
- every key structure in the DW contains an element of time, explicitly or implicitly
86
How is a DW non-volatile?
once data is loaded in, it is typically not subject to frequent changes or updates. The data remains relatively stable and unchanged over time.
87
What is Online Transaction Processing?
- OLTP is a major task of traditional relational DBs
- used by IT for day-to-day operations
88
What is a challenge in Weather Pattern Analysis when using a spatial data warehouse
A merged region may contain hundreds of primitive regions (polygons)
89
What are Polygons in the context of spatial data warehouses
Spatial Areas
90
What Dimensions and measurements are in the Fact Table of a Spatial Data Warehouse for mining weather pattern analysis
- Dimensions: region_name, time, precipitation, temperature
- Measurements: region_map, area, count
91
What is a reasonable choice for choice of computation method for Spatial Data Cubes
Selective computation: Only materialise spatial objects that will be accessed frequently
92
What is the difference between a traditional DB and a Data Warehouse
- DB: used for day-to-day operations using OLTP
- DW: used for data analysis using OLAP
93
How to generalise spatial data, and what does it require
- Generalise detailed geographic points into clustered regions, such as business, residential, industrial, or agricultural areas, according to land usage
- requires the merge of a set of geographic areas by spatial operations
94
Structure of star schema for weather warehouse
Fact table with the four dimensions and three measures:
- Time: time_key, day, month...
- Region: region_key, name, location, city,...
- Temperature: temp_key, range, temp_value, description
- Precipitation: key, range, value, description
- Measures: map, area, count
95
Inputs and output of Weather Spatial Data Warehouse
Input:
- a map with weather probes scattered around in an area
- daily weather data
- concept hierarchies for all attributes
Output:
- a map that reveals patterns: merged (similar) regions
96
Method to efficiently generate candidate sets
Store candidate itemsets in a hash-tree
- a leaf node of the tree contains a list of itemsets and counts
- an interior node contains a hash table
- subset function: finds all the candidates contained in a transaction