Final Flashcards

1
Q

Given an input dataset and the Apriori algorithm, how to trace the algorithm for intermediate results? (Review)

A
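The deck leaves this answer blank. A minimal Apriori sketch to trace against, in Python; the transaction data and min_sup value in the test are illustrative, not from the course:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Plain Apriori sketch: returns {frozenset: support count} for all
    frequent itemsets. Intermediate results are the successive L_k sets."""
    # L1: count single items and keep those meeting min_sup
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(current)
    k = 2
    while current:
        items = sorted(set().union(*current))
        # Candidate generation with the apriori property:
        # every (k-1)-subset of a candidate must itself be frequent.
        candidates = [frozenset(c) for c in combinations(items, k)
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))]
        counts = {c: sum(1 for t in transactions if c <= set(t))
                  for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(current)
        k += 1
    return frequent
```

Tracing means writing down `current` (the frequent k-itemsets) after each pass of the while loop.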
2
Q

How to derive strong rules from the given frequent itemsets L and a conf_rate?

A

Generate candidate rules from each frequent itemset, then filter out the rules with conf < min_conf; the rules that remain are the strong rules.
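A sketch of this step, assuming the frequent itemsets L are given as a map from itemset to support count (the data in the example is illustrative):

```python
from itertools import combinations

def generate_strong_rules(supports, min_conf):
    """Derive strong rules from frequent itemsets.

    supports: dict mapping frozenset -> support count; assumed to contain
    every frequent itemset and all of its subsets (the apriori property
    guarantees the subsets are frequent too).
    Returns a list of (antecedent, consequent, confidence) with conf >= min_conf.
    """
    rules = []
    for itemset, sup in supports.items():
        if len(itemset) < 2:
            continue  # no nonempty antecedent/consequent split possible
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = sup / supports[antecedent]  # conf(A -> B) = sup(AB) / sup(A)
                if conf >= min_conf:
                    rules.append((antecedent, itemset - antecedent, conf))
    return rules
```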

3
Q

How to improve the efficiency of the rule generation procedure by applying the apriori property?

A

Prune while generating rules: if a rule fails the confidence threshold, any rule formed by moving further items from its antecedent into its consequent has equal or lower confidence, so those candidates can be skipped without testing.

4
Q

What are the two general purposes of DM, use some examples of mined association patterns to explain for each purpose?

A
  • Find frequent itemsets
    • find all frequent itemsets from D
  • Generate association rules
    • Derive rules from each frequent itemset
5
Q

How can the association mining process be mapped to the empirical cycle model of scientific research?

A
  • Observation: observe all the data
  • Analysis: generate all associations
  • Theory: apply a priori knowledge (support / confidence)
  • Prediction: predict X given the a priori knowledge
6
Q

Why is classification mining a supervised learning process? How about association mining?

A
  • It partitions training data based on a divide-and-conquer strategy.
  • It trains on a portion of the data and then tests on the other portion. If the accuracy is acceptable, the model can be used to predict.
  • Association mining, by contrast, is unsupervised: it uses no class labels and no training/testing split.
7
Q

What are the major phases of conducting a classification mining application?

A
  • Training
    • Each tuple is assumed to belong to a predefined class, as determined by the class label attribute
    • The set of tuples used for model construction is the training set
  • Testing
    • The known label of a test sample is compared with the classified result from the model
    • Accuracy rate is the percentage of test set samples that are correctly classified by the model
  • Predicting
    • If the accuracy is acceptable, use the model to classify unseen data
8
Q

Can you describe a mapping between a classification application process and the empirical cycle?

A
  • Analysis –> Classification algorithm
  • Theory –> Classification Model
  • Prediction –> Testing & Prediction
  • Observation –> Training Data
9
Q

What is the general idea/strategy/method/algorithm of DT induction for classification mining?

A
  • Supervised learning
    • Derive a model from a training data set
  • Inductive learning process
    • Constructing a tree using a top-down, divide-and-conquer strategy
    • Testing -> Choose -> Split
  • Tree construction/induction by greedy search
    • Depth-first search
    • Heuristic function
10
Q

What is the general strategy of Inductive Learning (via observing examples)?

A
  • Divide and conquer strategy
  • Continue dividing D into subsets, based on a search method, until each subset has only one label, i.e. all examples in the subset share the same class label.
11
Q

What are the major technical issues of DT Induction approach for classification mining?

A
  • Preparing datasets: (training & testing)
    • A training dataset for learning a model
    • A test dataset for evaluating the learned model
  • Classification model discovery: (constructing a DT)
    • Stopping criteria for testing at each node
    • How to choose which attribute to split, and how to split (method)
    • Control structure for tree construction (recursive process)
    • Pruning method
12
Q

What is the heuristic function used in ID3 algorithm for evaluating search directions?

A

Entropy calculation - ID3 uses the expected reduction in entropy (information gain) to evaluate candidate attributes.

13
Q

What is the notion of Information Gain, and how is it applied in the ID3 algorithm?

A
  • Expected reduction in entropy
    • Define a preferred sequence of attributes to investigate to most rapidly narrow down the state of X
  • ID3 uses information gain to select among the candidate attributes at each step while growing the tree
14
Q

How to convert the ID3 algorithm into an implementation code structure?

A
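The answer is blank in the deck. One possible code structure for ID3's recursive induction - stopping criteria, attribute choice by information gain, and divide-and-conquer recursion - sketched in Python (function names and the toy data in the example are illustrative, not from the course):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Choose the attribute with the highest information gain (ID3 heuristic)."""
    def gain(a):
        remainder = 0.0
        for v in set(r[a] for r in rows):
            sub = [l for r, l in zip(rows, labels) if r[a] == v]
            remainder += len(sub) / len(rows) * entropy(sub)
        return entropy(labels) - remainder
    return max(attributes, key=gain)

def build_tree(rows, labels, attributes):
    """Recursive (divide-and-conquer) tree construction."""
    if len(set(labels)) == 1:              # stopping criterion: pure node
        return labels[0]
    if not attributes:                     # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attributes)
    node = {'attribute': a, 'branches': {}}
    for v in set(r[a] for r in rows):      # split on each value and recurse
        idx = [i for i, r in enumerate(rows) if r[a] == v]
        node['branches'][v] = build_tree([rows[i] for i in idx],
                                         [labels[i] for i in idx],
                                         [x for x in attributes if x != a])
    return node
```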
15
Q

How to quantify information contained in a message?

A

Q(message) = P2(outcome_after) - P1(outcome_before)

Q, the quantity of information contained in a message.

P1, the probability of outcome before receiving a message.

P2, the probability of outcome after receiving the message.

  • Information always associates with a question and a message for answering that question
  • An information measure quantifies the outcome of some expectation from the message answering the question
16
Q

Suppose a missing cow has strayed into a pasture represented as an
8 x 8 array of “cells”.

Question: Where is the cow?
Outcome: the probability of finding the cow.
Answer 1: Nobody knows.
Answer 2: The cow is in cell (4, 7).

What is the information received?

A

Outcome1 (cow before) = 1/64

Outcome2 (cow after) = 1

Information received = log2 P2 - log2 P1

= log2 (P2/P1)

= log2 (1 / (1/64))

= log2 (64)

= 6 bits
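The same arithmetic, checked in Python:

```python
from math import log2

# Probabilities of locating the cow in the 8 x 8 pasture
p_before = 1 / 64   # before the message: any of the 64 cells
p_after = 1.0       # after the message: the exact cell is known

# Information received = log2(P2) - log2(P1) = log2(P2 / P1)
information_bits = log2(p_after / p_before)   # log2(64) = 6 bits
```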

17
Q

What is the message and information received formulas?

A
  • Q(message) = P2(after) - P1(before)
  • Information received = log2 P2 - log2 P1
    • log2(P2/P1)
18
Q

How can the concept of 1 be applied to a classification method, such as the ID3 algorithm?

A

[????]

19
Q

What is entropy and information gain? How to use information gain for choosing an attribute?

A
  • The purpose is to select the best attribute - the one with the highest information gain
  • Entropy measures the information required to classify an arbitrary tuple
  • Information gain is used to determine the best split
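A worked sketch of the two measures. The numbers are illustrative (the familiar 14-tuple weather dataset, not from these notes):

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution given as a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

# Dataset with 9 positive and 5 negative tuples:
info_d = entropy([9, 5])                    # entropy before splitting, ~0.940 bits

# An attribute with three values splits D into subsets of sizes 5, 4, 5
# with class distributions (2,3), (4,0), (3,2):
split = [(5, [2, 3]), (4, [4, 0]), (5, [3, 2])]
info_a = sum(n / 14 * entropy(dist) for n, dist in split)

gain = info_d - info_a                      # expected reduction in entropy
```

The attribute chosen at each node is the one whose split maximizes `gain`.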
20
Q

What is ID3’s induction bias?

A

There is a natural bias in the information gain measure that favors attributes with many distinct values over those with few distinct values.

21
Q

What is over fitting?

A

Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.

22
Q

What are the technical options of overcoming the bias problem for possible improvement?

A
  • Use the gain ratio instead (the Gain Ratio measure is used in the C4.5 algorithm)
23
Q

How differently is a classification task done by DT induction and by a Naïve Bayes classifier? (*Give 3 differences.)

A
  • NB uses a probabilistic model vs. DT's inductive model
  • NB uses Bayes' theorem to weight the evidence
  • NB calculates prior probabilities

[Need review]

24
Q

What are the two assumptions for using NB classifier?

A
  • The quantities of interest are governed by the distribution of prior probabilities in that optimal decisions can be made by reasoning about these probabilities together with observed data
  • Attributes are independent of each other
25
Q

Why is the Naïve Bayes algorithm more suitable to high-dimensional data?

A

Because it makes a very strong (conditional independence) assumption: each attribute is estimated separately, so the number of probabilities to estimate grows only linearly with the number of dimensions.

26
Q

What is text classification? What is the basic idea to convert unstructured text data into structured data for classification?

A

Consider a classification problem in which the instances are text documents:

  • Spam filtering
  • Text document categorization
  • Classify survey comments

Make text data be structured:

  • Treat each text document as an object
  • Treat each word position as an attribute
  • Treat words as domain values
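A minimal sketch of that conversion, treating each word position as an attribute whose domain is the vocabulary. The helper name, padding token, and example documents are assumptions for illustration:

```python
def to_instance(text, length):
    """Turn a text document into a fixed-length tuple of word positions:
    position i is attribute i, its value is the word at that position."""
    words = text.lower().split()
    # Pad short documents / truncate long ones to a fixed attribute count
    words = words[:length] + ['<pad>'] * max(0, length - len(words))
    return tuple(words)

docs = ["Buy cheap pills now", "Meeting agenda attached"]
instances = [to_instance(d, 4) for d in docs]
```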
27
Q

How to estimate the number of prior probabilities which need to be calculated for text classification?

A

Estimation = C * I * K

C, the number of target values (classes).

I, the number of text positions.

K, the number of distinct words.

- OR -

If word position is ignored, this reduces to: Estimation = C * K

28
Q

Why is the Naïve Bayes algorithm more suitable for text data mining? What is its limitation?

A

Pros

  • Very good in domains with many equally important features
  • A good dependable baseline for text classification

Cons

  • The independence assumption states that the probability a word contributes to the text classification has no association with its location in the text (i.e. the words are independent of each other)
  • This assumption is clearly inaccurate. In practice, however, the naive Bayes algorithm performs remarkably well when classifying text by topic despite the incorrectness of this independence assumption.
29
Q

What are the main differences between topic oriented and sentiment oriented text classification?

A
  • Topic-oriented classification relies on word distribution and the assumption that words contribute independently to the classification evidence
  • Sentiment analysis requires knowledge of individual word meanings and the relationships between the words.
30
Q

Based on what principle all classification methods can be commonly divided into two general categories (name each)? Provide two example classification algorithms to illustrate each category.

A
  • Model based classification
    • DT and ANN
  • Instance based classification
    • Naive Bayes and k-nearest neighbor
31
Q

Comment on the ANN classification approach in terms of its principle and trade-offs.

A
  • Borrows from the human brain: neurons and synapses
  • Has an input layer, hidden layers and an output layer
  • Weights are used and refined using training data

Trade-offs

  • Expensive to train
  • Unpredictable (hard to interpret)
32
Q

What are the main differences between Classification and Clustering DM (list 3 from different perspectives)?

A
  • Clustering does not use training data
  • There is no predefined class; rather, you discover the classes
  • Cluster mining is less precise (by design) - it looks at the big picture
  • Gain insights with hidden data concepts and data distribution
33
Q

Provide two application examples of clustering DM, explain how the DM result may be used for supporting business decision making.

A
  • Data clustering for customer group analysis
  • Generated concepts for supporting information retrieval (clustering content for search engines)
34
Q

What are the general criteria for judging quality of clustering DM results?

A
  • high intra-class similarity
  • low inter-class similarity
  • scalability
  • deal with different types of attributes
  • deal with noise and outliers
  • high dimensionality
  • interpretability and usability
35
Q

What is the basic idea to convert non-numerical data into numerical ones and how to prepare your data with various attribute types for clustering?

A

You need to calculate the Euclidean distance between objects, so their attributes must be quantified. You prepare your data as follows:

  • Binary variables {symmetric, asymmetric}
  • Nominal variables (e.g. {red, green, blue})
  • Ordinal variables (e.g. medals)
36
Q

What are the basic data structures for clustering mining?

A
  • Data matrix
  • Dissimilarity matrix
38
Q

How to calculate dissimilarity of object pairs for a dataset with mixed attribute types for clustering?

A

Euclidean distance

For example, let x1 = (1,2) and x2 = (3,5) represent two objects in a data matrix.

d(x1, x2) = sqrt((3-1)^2 + (5-2)^2) = sqrt(2^2 + 3^2) = sqrt(13) ≈ 3.61
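The same distance as a small Python helper (the function name is illustrative):

```python
from math import sqrt

def euclidean(p, q):
    """Euclidean distance between two numeric objects of equal dimension."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

d = euclidean((1, 2), (3, 5))   # sqrt(2^2 + 3^2) = sqrt(13), about 3.61
```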

39
Q

What are the main categories of clustering mining methods?

A
  • Partition-based clustering
  • Hierarchical clustering
40
Q

How to trace K-means algorithm on a given dataset?

Given

1) a 1D data set: {2, 4, 10, 12, 3, 20, 30, 11, 25}
2) K = 2.

A
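The deck leaves this trace blank. A plain 1D k-means sketch that can be run on the given dataset; the initial seeds [2, 4] are an assumption, since the notes don't specify them:

```python
def kmeans_1d(data, centroids, max_iter=100):
    """Plain 1D k-means: assign each point to the nearest centroid,
    recompute centroids as cluster means, repeat until stable."""
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for x in data:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        new_centroids = [sum(c) / len(c) for c in clusters]
        if new_centroids == centroids:   # converged: assignments are stable
            break
        centroids = new_centroids
    return clusters, centroids

data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, centroids = kmeans_1d(data, centroids=[2, 4])
```

Tracing by hand means recording the assignments and the recomputed centroids after each pass; with these seeds the process converges to the clusters {2, 3, 4, 10, 11, 12} and {20, 25, 30}.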
41
Q

Explain why clustering DM is said to discover concepts hidden in large datasets.

A

It provides a high-level view of the dataset - that is, you see the bigger picture. Clustering doesn't use a predetermined class but can be used to find classes.

[Review]

42
Q

What are outliers? How to handle them in applications?

A
  • Outliers are data points whose values differ greatly from those of the remaining set of data
  • Handle them as noise or as targets (depending on the algorithm)
43
Q

What are the main differences between the two partition based methods: k-means and k-medoids?

A
  • Instead of taking the mean value of the objects in a cluster as a reference point, a medoid - the most centrally located object in the cluster - is used.
  • A medoid's average dissimilarity to all objects in the cluster is minimal, i.e. it is the most centrally located point in the cluster
  • K-medoids handles outliers better than k-means
  • K-medoids is efficient on small data sets
44
Q

What are the main strength and limitation of k-medoids algorithm?

A
  • Strengths
    • Efficient on small data sets
    • Handles outliers better than k-means
  • Limitations
    • Inefficient on large data sets
45
Q

Use a small dataset to trace each algorithm.

A

See Wikipedia article.

46
Q

What are the differences between partition based clustering and hierarchical clustering?

A

Clusters are formed in levels, creating a nested set of clusters at each level; partition-based clustering instead produces a single flat partition into k clusters.

47
Q

How are the mined clusters stored in a dendrogram data structure in a hierarchical clustering result?

A
  • The root is a single cluster containing all objects
  • Each leaf is an individual object (a singleton cluster)
  • A cluster at level i is the union of its children clusters at level i+1
48
Q

What kind of constraints may be applied to cluster analysis?

A
  • A single target concept (i.e. single-dimension clustering) such as salary
  • A selected set of attributes such as salary, female, age (20-39)
  • All attributes