[1] Data Mining Flashcards

Question 1

Q

What is KDD?

Answer

A

Knowledge Discovery in Databases; the process of extracting knowledge from data

Question 2

Q

What are the steps for KDD?

Answer

A

[1] Selection - choose which data to use
[2] Pre-processing - handle missing data and errors
[3] Transformation - change the data format; consider feature reduction or extraction
[4] Data mining - apply algorithms to the transformed data to generate results
[5] Interpretation and evaluation - i.e. visualization

Question 3

Q

Why is data mining important?

Answer

A

The exponential growth of data exceeds human abilities to understand it themselves

Question 4

Q

What factors influence the selection of DM algorithms?

Answer

A

The kind of data i.e. numeric or symbolic
Whether the task is supervised or unsupervised
What is the desired output i.e. a blackbox or a tree

Question 5

Q

What are the ways to approach KDD?

Answer

A

Top-down discovery - analyst suggests hypotheses; results are analysed to see if they support this
Bottom-up discovery - system automatically suggests patterns, analyst considers if they are relevant
Mixed - analyst provides an area of search, and refine this based on the hypotheses generated

Question 6

Q

How are database queries different from DM?

Answer

A

Database queries return a subset of the data or an aggregate of it

DM outputs a KDD object i.e. a rule or tree which dind’t exist before

Question 7

Q

What is a data warehouse?

Answer

A

A collection of data on a single subject where the number of datasets can change

Question 8

Q

How is KDD applied to websites?

Answer

A

Web content mining - collect and analyse the content of pages
Web structure mining - use graph theory to study how pages are connected with hyperlinks

Question 9

Q

What is IR?

Answer

A

Information Retrieval i.e. search

Question 10

Q

At a high-level, what do classification algorithms do?

Answer

A

Find separating curves. Note that SVM provides the equations for these, while neural networks only give the regions

Question 11

Q

What defines AI systems?

Answer

A

They are machines that mimic cognitive functions such as learning and problem solving

Question 12

Q

What are fuzzy sets?

Answer

A

In contrast to crisp sets, they allow partial membership based on an membership function

Question 13

Q

How do set operators work for fuzzy sets?

Answer

A

AND(x,y) -> m(x N y) = min(m(x), m(y))
OR(x,y) -> m(x V y) = max(m(x), m(y))
NOT(x) -> m(~x) = 1 - m(x)

Question 14

Q

What is the process for fuzzy logic?

Answer

A

~ Fuzzify the input values into membership functions
~ Use rules to compute fuzzy outputs
~ De-fuzzify the output functions to get crisp outputs

Question 15

Q

What are the main DM tasks?

Answer

A

Classification - map data into #pre-defined# classes
Regression - map data to real-valued output
Prediction - predict a future state #in time# based on current state
Time Series Analysis - special case of classification, regression and prediction when the attributes vary over time
[C.R.P.T]

Clustering - the groups are pre-defined, so it is unsupervised
Segmentation - special cases of clustering where the DB is partitioned into #dis-jointed# sets
Association Rules - find relationships between instances i.e. items that frequently occur together
Sequence Discovery - find relationships based on time i.e. if X then Y will happen
[C.S.A.S]

Question 16

Q

What are the assumptions used for Bayes classifiers?

Answer

A

Only one hypothesis (class) can occur at a time

- Conditional independence - P(X1, X2 | Y) = P(X1|Y) * P(X2|Y)

Question 17

Q

What is a key practical consideration for Bayes classifiers?

Answer

A

When the conditional etc. probabilities are calculated, start the counts at 1 to avoid multiplying by 0

Question 18

Q

What at the names of the probabilities in Bayes classifiers?

Answer

A

Posterior - p(hj|xi) - [this is the target]
Prior - p(hj)
Unconditional probability - p(x)
Conditional probability - p(xi|hj)

Question 19

Q

What do SVMs do?

Answer

A

Find a hyperplane which correctly classifies the data points and #separates them as far as possible#

Question 20

Q

In the context of SVMs, what are margins and what types are there?

Answer

A

This is a measure of how well they separate the data

A hard margin is the distance to the nearest point; a soft margin penalizes close points

Question 21

Q

What is a key limitation of SVMs, and how can it be addressed?

Answer

A

They can only classify data if it is linearly separable.

However, kernels transform data into higher dimensions so they do become separable (the actual hyperplanes are always linear)

Question 22

Q

What are the limitations of SVM?

Answer

A

They require selection a good kernel
They use a lot of memory and CPU
Special consideration is needed for multiple classes i.e. non-binary problems

Question 23

Q

How do SVMs handle multiple classes?

Answer

A

Transform the problem into a set of binary problems, and either:

one-vs-rest - train a classifier for each class to determine if it is or isn’t that class
one-vs-many - train a classifier for each pair of classes, and use a voting protocol

Question 24

Q

In the context of DM, what is segmentation?

Answer

A

A special case of clustering where the tuples are disjointed i.e. aren’t neighbors

Question 25

Q

What is the capacity of a function?

Answer

A

A measure of its expressive power and flexibility etc.

Question 26

Q

What does shattering mean?

Answer

A

A function f(theta) shatters a set of point X if for all assignments of classes to that point, there exists a theta such that all the points are correctly classified

Question 27

Q

What is the VC Dimension?

Answer

A

A measure of #capacity# for a function based on the largest number of points it can shatter.

Note that it might only work for a particular set of locations for the points

Question 28

Q

What is Structural Risk Minimization?

Answer

A

The process of finding models with the lowest risk

Risk is the chance of prediction error on some unseen data points which are iid for some unknown distribution

Question 29

Q

How is risk found?

Answer

A

The expected risk is the true risk, but it can’t be determined

However, it is bounded by the empirical risk (based on points already classified) and the VC dimension confidence (based on the number of points and the model’s VC)

Question 30

Q

How the risk of a model be reduced?

Answer

A

Train a better model to reduce to empirical risk
Minimize the VC dimension i.e. use a simpler model
Use more data points, as this lowers the VC dimension confidence

Question 31

Q

Why is Structural Risk Minimization often infeasible?

Answer

A

The VC dimension can only be calculated for a limited number of models

Question 32

Q

What are some key ways to measure the error of an algorithm?

Answer

A

TSS (total sum of squared error)
MSE
RMSE
R-squared - the ratio of the error to the variance of the response

Question 33

Q

What is the advantage of R-squared, and how is it calculated?

Answer

A

It is easily interprete.

R2 = 1 - TSS/Var(y)

Question 34

Q

What is accuracy? What is its opposite?

Answer

A

Accuracy is the portion of instances correctly classified.

Error rate is the portion of instances incorrectly classified

Question 35

Q

What is TPR?

Answer

A

The portion of all actually true instances which are classified as true.

It is also known as sensitivity or recall

TPR = TP / (TP + FN)

Question 36

Q

What is sensitivity?

Question 37

Q

What is recall?

Question 38

Q

What is TNR?

Answer

A

The portion of truly negative instances which are classified as negative

It is also known as specificity

Question 39

Q

What is specificity?

Question 40

Q

How are FPR and FNR related to TPR and TNR

Answer

A

They sum to 1 i.e.e FPR+TPR = 1

Question 41

Q

What is an ROC curve?

Answer

A

It plots how the FPR on the x-axis affects the TPR on the y-axis

Question 42

Q

What is the worst possible result on an ROC curve?

Answer

A

A diagonal line (y=x), if it were below this, predictions could be flipped

Question 43

Q

What is AUC and what are its limits?

Answer

A

It is bounded between 0.5 and 1, with higher values being better

However:

it is difficult to calculate when the points are not well spread across the ROC space
it depends a lot on the non-interesting part of the ROC curve

Question 44

Q

What is the distance measure?

Answer

A

The distance from the center of the ROC and the intersection of the ROC curve and the minor diagonal

Question 45

Q

What metrics are used for object detection?

Answer

A

Detection rate (DR)
False alarm / objects (FA/O)
Extended ROC

Question 46

Q

What is detection rate?

Answer

A

objects found / total objects

Question 47

Q

What is false alarm / objects?

Answer

A

non-objects reported / total objects

It can be greater than 100% for difficult problems

Question 48

Q

What is the extended ROC curve?

Answer

A

It has FA/O on the x-axis and DR on the y-axis

Unlike ROC, models can perform worse than the diagonal

Question 49

Q

What evaluation metrics are used for information retrieval?

Answer

A

Precision and recall

Question 50

Q

In the context of IR, what is precision?

Answer

A

The quality of records retrieved:

relevant documents retrieved / number of documents retrieved

Question 51

Q

In the context of IR, what is recall?

Answer

A

How useful the information retrieved was:

number of relevant documents retrieved / total relevant documents in DB

Question 52

Q

What are the disciplines within computer imaging?

Answer

A

computer vision* - the outputs are for machines

* image processing* - the outputs are for people

Question 53

Q

What are two tasks to do with improving the quality of images?

Answer

A

image restoration* - return image to its original appearance
image enhancement* - make image better than ever

Question 54

Q

What is dynamic range?

Answer

A

The ratio between brightest and darkest values that can be shown i.e. the contrast ratio

Question 55

Q

How many colors does 8 bit color give?

Answer

A

256 i.e. 0 to 255

Question 56

Q

What are images outside human’s perceptual range called?

Answer

A

Multi-spectral

Question 57

Q

Describe JPEG images.

Answer

A

They have 24 bit color (16 million colors); compress well but have artifacts; don’t support transparency or animation

Question 58

Q

Describe PNG images.

Answer

A

They use lossless compression. Tehy support 24-bit color but can’t be animated

Question 59

Q

Describe GIF images.

Answer

A

Support transparency and animation.

They use a palette of 256 colors. They are lossless if less than this many colors

Question 60

Q

Describe TIFF images.

Answer

A

They support a range of color formats i.e. indexed color, 24-bit RGB, and 32-bit RGBA

They can have a range of compression methods and support layers

Question 61

Q

Describe BMP images.

Answer

A

They support 8-bit, 16-bit or 24-bit RGB color, but don’t scale well

Question 62

Q

What is the image format taught in this course?

Answer

A

PNM (Portable aNyMap). It supports:

binary (.pbm)
grayscale (.pgm)
pixmap (.ppm)

Question 63

Q

What is the format of PNM images?

Answer

A

A magic number
The width of the image
The height of the image
The maximum grayscale value i.e. 15 or 255
Optional: comments starting with #
The actual image data: pixels separated by whitespace

Question 64

Q

What are the magic numbers for PNM images?

Answer

A

P2 for PGM in ASCII, P5 for PGM in RAWBITS

Answer 62

A

Region of Interest Extraction
Image Algebra
Spatial filters

Answer 63

A

Use cropping, zooming, shrinking, translation or rotation focus on part of the image

Answer 64

A

Shrinking

Answer 65

A

zero-order hold repeats pixel values; first-order-hold uses averaging or convolution etc.

Answer 66

A

Addition: superimpose image
Subtraction: remove additive noise or detect motion
Multiplication: adjust brightness (generally with a constant)
Logical operators: masking

Answer 67

A

Image operations which change each pixel based on its neighborhood. They include:

mean filters
median filters
enhancement filters
image quantization

Answer 68

A

They are form of linear filter, which means that:

If the coefficients sum to 1, the average brightness is maintained
If the coefficients sum to 0, the average brightness is lost
If the coefficients alternate positive and negative, edge information will be returned
If the coefficients are all positive, the image will be blurred

Answer 69

A

They enhance features in an image

Laplacian filters enhance the image in all directions equally:

-1 on edges with 8 in the centre
-1 in NESW, 4 in the centre, 0 on the diagonals

Difference filters enhance in a particular direction. Note they have two 1s and one -1

Answer 70

A

The process of reducing the amount of image data. It includes:

Grey-level reduction
Spatial reduction

Answer 71

A

A form of image quantization in which the number of grey levels is reduced.

Thresholding makes it binary based on whether the value is above or below the threshold

Answer 72

A

Image quantization where groups of pixels that are #spatially adjacent# are mapped to a single pixel

This can be done with averaging, median or decimation in which some of the pixels are simply eliminated

Answer 73

A

The process of identifying sharp changes in brightness over short spatial distances

This is particularly difficult in noisy images

Answer 74

A

Roberts operator
Sobel operator
Canny edge detector

Answer 75

A

It only marks edge points. It does so by checking change of intensity in the diagonal direction.

It is generally used for binary images

Answer 76

A

Detect edges in the horizontal and vertical directions.

The kernels are: [-1 -2 -1; 0 0 0; 1 2 2] for horizontal (S1)

The edge magnitude is sqrt(S1^2 + S2^2). The direction in atan(S1/S2)

Answer 77

A

A general process for edge detection:

Smooth the image using Gaussian to minimize the impact of noise
Estimate the gradients using partial derivatives or Sobel
Thin edges using non-maxima suppression (suppress pixels that aren’t larger than their neighbor in the positive and negative #gradient direction#

Answer 78

A

It divides images into segments (#connected# pixels)

Answer 79

A

Four-connectivity, eight-connectivity and six-connectivity (includes top-left and bottom-right)

Answer 80

A

Uses thresholding to segment based on brightness.

It is very vulnerable to noise.

The threshold can be chosen manually, or automatically using K-means clustering or by minimizing group variance

Answer 81

A

The splitting and merging algorithm

[1] Define a homogeneity test to measure how similar regions are
[2] Split the image into equally sized regions
[3] Calculate the homogeneity measure for each region. If the test is passed, attempt to merge with neighbors; Otherwise, split
[4] Repeat [3] until all regions pass the homogeneity test

Answer 82

A

Growing - no splitting is applied

Shrinking - no merging is applied

Answer 83

A

By transformation from the spatial domain to the frequency domain (spectral domain)

Answer 84

A

The image will have the same size

Areas of rapid brightness change correspond to high frequency.

Answer 85

A

A set of basis images is used, where each basis image is the same size of the original image

T(u,v) = SumSum I(r,c)B(r,c;u,v)

Fourier transforms are generally used; cosine transforms don’t use sine functions

Answer 86

A

Low-pass filters eliminate high-frequencies, blurring the image

High-pass frequencies can be used for edge detection

Answer 87

A

Those that let some of the signal through if it is close to the threshold

Answer 88

A

Transforms that include both frequency and spatial information

They break the image into four sub-sampled images by high-pass and low-pass filtering the image in both directions

They are commonly used for image compression

Answer 89

A

Feature extraction is the general process of turning image data into feature vectors

The features may be domain specific or domain independent

Good features are RST invariant i.e. unaffected by rotation, scale or transformation

Answer 90

A

Using binary objects of interest
Histogram features
Global features
LBP
SIFT
Pixel Statistics

Answer 91

A

Features can be extracted based on its:

area
center of area
axis of the least second moment (gives an indication of its orientation)
thinness
the Euler number

Answer 92

A

A measure of how thin an object is based on the ratio of its area to perimeter:

T = 4 pi A/P^2

Note that a circle has a thinness of 1, a very thin object approaches 0

Answer 93

A

The number of objects in an image minus the number of holes

Answer 94

A

Global features based on the image’s brightness or contrast

Answer 95

A

Global features generally assume that there is only one object in the image; local filters can be more difficult because you have to implement the case when there are multiple objects

Local features generally perform better than global if there is significant clutter or occlusion

Answer 96

A

Local Binary Pattern is a #global# feature based on:

For each central pixel, identify p points at distance r from the central pixel
Calculate the difference between the central pixel and each pixel
Encode these differences in binary - 1 if the positive values and 0 for negative
Create a histogram of these binary codes across the image

Answer 97

A

An algorithm for matching features between images.

Answer 98

A

Resolution - what is the minimum size objects that can be distinguished
Number of features - algorithms take longer if more features are used
Invariances - are the features affected by rotation or lighting changes etc.
Redundancy - could fewer features encode the same amount of information?
Completeness - can the original image be reconstructed based on the features

Answer 99

A

The image is divided into regions and aggregates of pixel values are computed for each region

Answer 100

A

Template matching - the template is swept across the image to find matches
Nearest neighbor - find labelled images most similar to the current image

Answer 101

A

With a cutout square. It should be large enough to contain any object, but not large enough to contain multiple objects

Brainscape's Knowledge GenomeTM

[1] Data Mining Flashcards

Brainscape's Knowledge Genome^TM