[1] Data Mining Flashcards

1
Q

What is KDD?

A

Knowledge Discovery in Databases; the process of extracting knowledge from data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What are the steps for KDD?

A

[1] Selection - choose which data to use
[2] Pre-processing - handle missing data and errors
[3] Transformation - change the data format; consider feature reduction or extraction
[4] Data mining - apply algorithms to the transformed data to generate results
[5] Interpretation and evaluation - i.e. visualization

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Why is data mining important?

A

The exponential growth of data exceeds human abilities to understand it themselves

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What factors influence the selection of DM algorithms?

A
  • The kind of data i.e. numeric or symbolic
  • Whether the task is supervised or unsupervised
  • What is the desired output i.e. a blackbox or a tree
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What are the ways to approach KDD?

A
  • Top-down discovery - analyst suggests hypotheses; results are analysed to see if they support this
  • Bottom-up discovery - system automatically suggests patterns, analyst considers if they are relevant
  • Mixed - analyst provides an area of search, and refine this based on the hypotheses generated
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

How are database queries different from DM?

A

Database queries return a subset of the data or an aggregate of it

DM outputs a KDD object i.e. a rule or tree which dind’t exist before

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What is a data warehouse?

A

A collection of data on a single subject where the number of datasets can change

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

How is KDD applied to websites?

A

Web content mining - collect and analyse the content of pages
Web structure mining - use graph theory to study how pages are connected with hyperlinks

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What is IR?

A

Information Retrieval i.e. search

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

At a high-level, what do classification algorithms do?

A

Find separating curves. Note that SVM provides the equations for these, while neural networks only give the regions

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What defines AI systems?

A

They are machines that mimic cognitive functions such as learning and problem solving

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What are fuzzy sets?

A

In contrast to crisp sets, they allow partial membership based on an membership function

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

How do set operators work for fuzzy sets?

A

AND(x,y) -> m(x N y) = min(m(x), m(y))
OR(x,y) -> m(x V y) = max(m(x), m(y))
NOT(x) -> m(~x) = 1 - m(x)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What is the process for fuzzy logic?

A

~ Fuzzify the input values into membership functions
~ Use rules to compute fuzzy outputs
~ De-fuzzify the output functions to get crisp outputs

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What are the main DM tasks?

A

Classification - map data into #pre-defined# classes
Regression - map data to real-valued output
Prediction - predict a future state #in time# based on current state
Time Series Analysis - special case of classification, regression and prediction when the attributes vary over time
[C.R.P.T]

Clustering - the groups are pre-defined, so it is unsupervised
Segmentation - special cases of clustering where the DB is partitioned into #dis-jointed# sets
Association Rules - find relationships between instances i.e. items that frequently occur together
Sequence Discovery - find relationships based on time i.e. if X then Y will happen
[C.S.A.S]

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What are the assumptions used for Bayes classifiers?

A
  • Only one hypothesis (class) can occur at a time

- Conditional independence - P(X1, X2 | Y) = P(X1|Y) * P(X2|Y)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

What is a key practical consideration for Bayes classifiers?

A

When the conditional etc. probabilities are calculated, start the counts at 1 to avoid multiplying by 0

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

What at the names of the probabilities in Bayes classifiers?

A

Posterior - p(hj|xi) - [this is the target]
Prior - p(hj)
Unconditional probability - p(x)
Conditional probability - p(xi|hj)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

What do SVMs do?

A

Find a hyperplane which correctly classifies the data points and #separates them as far as possible#

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

In the context of SVMs, what are margins and what types are there?

A

This is a measure of how well they separate the data

A hard margin is the distance to the nearest point; a soft margin penalizes close points

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

What is a key limitation of SVMs, and how can it be addressed?

A

They can only classify data if it is linearly separable.

However, kernels transform data into higher dimensions so they do become separable (the actual hyperplanes are always linear)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

What are the limitations of SVM?

A
  • They require selection a good kernel
  • They use a lot of memory and CPU
  • Special consideration is needed for multiple classes i.e. non-binary problems
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

How do SVMs handle multiple classes?

A

Transform the problem into a set of binary problems, and either:

  • one-vs-rest - train a classifier for each class to determine if it is or isn’t that class
  • one-vs-many - train a classifier for each pair of classes, and use a voting protocol
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

In the context of DM, what is segmentation?

A

A special case of clustering where the tuples are disjointed i.e. aren’t neighbors

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
Q

What is the capacity of a function?

A

A measure of its expressive power and flexibility etc.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
26
Q

What does shattering mean?

A

A function f(theta) shatters a set of point X if for all assignments of classes to that point, there exists a theta such that all the points are correctly classified

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
27
Q

What is the VC Dimension?

A

A measure of #capacity# for a function based on the largest number of points it can shatter.

Note that it might only work for a particular set of locations for the points

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
28
Q

What is Structural Risk Minimization?

A

The process of finding models with the lowest risk

Risk is the chance of prediction error on some unseen data points which are iid for some unknown distribution

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
29
Q

How is risk found?

A

The expected risk is the true risk, but it can’t be determined

However, it is bounded by the empirical risk (based on points already classified) and the VC dimension confidence (based on the number of points and the model’s VC)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
30
Q

How the risk of a model be reduced?

A
  • Train a better model to reduce to empirical risk
  • Minimize the VC dimension i.e. use a simpler model
  • Use more data points, as this lowers the VC dimension confidence
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
31
Q

Why is Structural Risk Minimization often infeasible?

A

The VC dimension can only be calculated for a limited number of models

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
32
Q

What are some key ways to measure the error of an algorithm?

A
  • TSS (total sum of squared error)
  • MSE
  • RMSE
  • R-squared - the ratio of the error to the variance of the response
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
33
Q

What is the advantage of R-squared, and how is it calculated?

A

It is easily interprete.

R2 = 1 - TSS/Var(y)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
34
Q

What is accuracy? What is its opposite?

A

Accuracy is the portion of instances correctly classified.

Error rate is the portion of instances incorrectly classified

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
35
Q

What is TPR?

A

The portion of all actually true instances which are classified as true.

It is also known as sensitivity or recall

TPR = TP / (TP + FN)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
36
Q

What is sensitivity?

A

The TPR

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
37
Q

What is recall?

A

The TPR

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
38
Q

What is TNR?

A

The portion of truly negative instances which are classified as negative

It is also known as specificity

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
39
Q

What is specificity?

A

TNR

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
40
Q

How are FPR and FNR related to TPR and TNR

A

They sum to 1 i.e.e FPR+TPR = 1

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
41
Q

What is an ROC curve?

A

It plots how the FPR on the x-axis affects the TPR on the y-axis

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
42
Q

What is the worst possible result on an ROC curve?

A

A diagonal line (y=x), if it were below this, predictions could be flipped

43
Q

What is AUC and what are its limits?

A

It is bounded between 0.5 and 1, with higher values being better

However:

  • it is difficult to calculate when the points are not well spread across the ROC space
  • it depends a lot on the non-interesting part of the ROC curve
44
Q

What is the distance measure?

A

The distance from the center of the ROC and the intersection of the ROC curve and the minor diagonal

45
Q

What metrics are used for object detection?

A
  • Detection rate (DR)
  • False alarm / objects (FA/O)
  • Extended ROC
46
Q

What is detection rate?

A

objects found / total objects

47
Q

What is false alarm / objects?

A

non-objects reported / total objects

It can be greater than 100% for difficult problems

48
Q

What is the extended ROC curve?

A

It has FA/O on the x-axis and DR on the y-axis

Unlike ROC, models can perform worse than the diagonal

49
Q

What evaluation metrics are used for information retrieval?

A

Precision and recall

50
Q

In the context of IR, what is precision?

A

The quality of records retrieved:

relevant documents retrieved / number of documents retrieved

51
Q

In the context of IR, what is recall?

A

How useful the information retrieved was:

number of relevant documents retrieved / total relevant documents in DB

52
Q

What are the disciplines within computer imaging?

A
  • computer vision* - the outputs are for machines

* image processing* - the outputs are for people

53
Q

What are two tasks to do with improving the quality of images?

A
  • image restoration* - return image to its original appearance
  • image enhancement* - make image better than ever
54
Q

What is dynamic range?

A

The ratio between brightest and darkest values that can be shown i.e. the contrast ratio

55
Q

How many colors does 8 bit color give?

A

256 i.e. 0 to 255

56
Q

What are images outside human’s perceptual range called?

A

Multi-spectral

57
Q

Describe JPEG images.

A

They have 24 bit color (16 million colors); compress well but have artifacts; don’t support transparency or animation

58
Q

Describe PNG images.

A

They use lossless compression. Tehy support 24-bit color but can’t be animated

59
Q

Describe GIF images.

A

Support transparency and animation.

They use a palette of 256 colors. They are lossless if less than this many colors

60
Q

Describe TIFF images.

A

They support a range of color formats i.e. indexed color, 24-bit RGB, and 32-bit RGBA

They can have a range of compression methods and support layers

61
Q

Describe BMP images.

A

They support 8-bit, 16-bit or 24-bit RGB color, but don’t scale well

62
Q

What is the image format taught in this course?

A

PNM (Portable aNyMap). It supports:

  • binary (.pbm)
  • grayscale (.pgm)
  • pixmap (.ppm)
63
Q

What is the format of PNM images?

A
  • A magic number
  • The width of the image
  • The height of the image
  • The maximum grayscale value i.e. 15 or 255
  • Optional: comments starting with #
  • The actual image data: pixels separated by whitespace
64
Q

What are the magic numbers for PNM images?

A

P2 for PGM in ASCII, P5 for PGM in RAWBITS

65
Q

What are the basic types of image pre-processing?

A
  • Region of Interest Extraction
  • Image Algebra
  • Spatial filters
66
Q

What is ROI extraction?

A

Use cropping, zooming, shrinking, translation or rotation focus on part of the image

67
Q

What is the opposite of zooming?

A

Shrinking

68
Q

What are the techniques for zooming?

A

zero-order hold repeats pixel values; first-order-hold uses averaging or convolution etc.

69
Q

What is image algebra used for?

A
  • Addition: superimpose image
  • Subtraction: remove additive noise or detect motion
  • Multiplication: adjust brightness (generally with a constant)
  • Logical operators: masking
70
Q

What are spatial filters?

A

Image operations which change each pixel based on its neighborhood. They include:

  • mean filters
  • median filters
  • enhancement filters
  • image quantization
71
Q

What is special about mean filters?

A

They are form of linear filter, which means that:

  • If the coefficients sum to 1, the average brightness is maintained
  • If the coefficients sum to 0, the average brightness is lost
  • If the coefficients alternate positive and negative, edge information will be returned
  • If the coefficients are all positive, the image will be blurred
72
Q

What are enhancement features? What are the main types?

A

They enhance features in an image

Laplacian filters enhance the image in all directions equally:

  • -1 on edges with 8 in the centre
  • -1 in NESW, 4 in the centre, 0 on the diagonals

Difference filters enhance in a particular direction. Note they have two 1s and one -1

73
Q

What is image quantization?

A

The process of reducing the amount of image data. It includes:

  • Grey-level reduction
  • Spatial reduction
74
Q

What is grey-level reduction?

A

A form of image quantization in which the number of grey levels is reduced.

Thresholding makes it binary based on whether the value is above or below the threshold

75
Q

What is spatial reduction?

A

Image quantization where groups of pixels that are #spatially adjacent# are mapped to a single pixel

This can be done with averaging, median or decimation in which some of the pixels are simply eliminated

76
Q

What is edge detection?

A

The process of identifying sharp changes in brightness over short spatial distances

This is particularly difficult in noisy images

77
Q

What are the methods for edge detection?

A
  • Roberts operator
  • Sobel operator
  • Canny edge detector
78
Q

What is Roberts operator?

A

It only marks edge points. It does so by checking change of intensity in the diagonal direction.

It is generally used for binary images

79
Q

What is the Sobel operator?

A

Detect edges in the horizontal and vertical directions.

The kernels are: [-1 -2 -1; 0 0 0; 1 2 2] for horizontal (S1)

The edge magnitude is sqrt(S1^2 + S2^2). The direction in atan(S1/S2)

80
Q

What is the Canny edge detector?

A

A general process for edge detection:

  • Smooth the image using Gaussian to minimize the impact of noise
  • Estimate the gradients using partial derivatives or Sobel
  • Thin edges using non-maxima suppression (suppress pixels that aren’t larger than their neighbor in the positive and negative #gradient direction#
81
Q

What is image segmentation?

A

It divides images into segments (#connected# pixels)

82
Q

How is connectivity of pixels defined?

A

Four-connectivity, eight-connectivity and six-connectivity (includes top-left and bottom-right)

83
Q

What is grey level segmentation? What are its limitations?

A

Uses thresholding to segment based on brightness.

It is very vulnerable to noise.

The threshold can be chosen manually, or automatically using K-means clustering or by minimizing group variance

84
Q

What is a general algorithm for image segmentation?

A

The splitting and merging algorithm

[1] Define a homogeneity test to measure how similar regions are
[2] Split the image into equally sized regions
[3] Calculate the homogeneity measure for each region. If the test is passed, attempt to merge with neighbors; Otherwise, split
[4] Repeat [3] until all regions pass the homogeneity test

85
Q

What are the variations of the splitting are merging algorithm?

A

Growing - no splitting is applied

Shrinking - no merging is applied

86
Q

How can changes in brightness over space be analysed?

A

By transformation from the spatial domain to the frequency domain (spectral domain)

87
Q

What are the affects of transformation to the frequency domain?

A

The image will have the same size

Areas of rapid brightness change correspond to high frequency.

88
Q

How are images transformed to the frequency domain?

A

A set of basis images is used, where each basis image is the same size of the original image

T(u,v) = SumSum I(r,c)B(r,c;u,v)

Fourier transforms are generally used; cosine transforms don’t use sine functions

89
Q

How is filtering applied to the frequency domain?

A

Low-pass filters eliminate high-frequencies, blurring the image

High-pass frequencies can be used for edge detection

90
Q

What are non-ideal filters?

A

Those that let some of the signal through if it is close to the threshold

91
Q

What are wavelet transforms?

A

Transforms that include both frequency and spatial information

They break the image into four sub-sampled images by high-pass and low-pass filtering the image in both directions

They are commonly used for image compression

92
Q

What is feature extraction for images? What are some key considerations?

A

Feature extraction is the general process of turning image data into feature vectors

The features may be domain specific or domain independent

Good features are RST invariant i.e. unaffected by rotation, scale or transformation

93
Q

What are some key ways of doing feature extraction in images?

A
  • Using binary objects of interest
  • Histogram features
  • Global features
  • LBP
  • SIFT
  • Pixel Statistics
94
Q

How are features extracted from a binary object of interest?

A

Features can be extracted based on its:

  • area
  • center of area
  • axis of the least second moment (gives an indication of its orientation)
  • thinness
  • the Euler number
95
Q

What is thinness?

A

A measure of how thin an object is based on the ratio of its area to perimeter:

T = 4 pi A/P^2

Note that a circle has a thinness of 1, a very thin object approaches 0

96
Q

What is the Euler number of an object?

A

The number of objects in an image minus the number of holes

97
Q

What are histogram features?

A

Global features based on the image’s brightness or contrast

98
Q

How do global and local features compare semantically?

A

Global features generally assume that there is only one object in the image; local filters can be more difficult because you have to implement the case when there are multiple objects

Local features generally perform better than global if there is significant clutter or occlusion

99
Q

What is LBP?

A

Local Binary Pattern is a #global# feature based on:

  • For each central pixel, identify p points at distance r from the central pixel
  • Calculate the difference between the central pixel and each pixel
  • Encode these differences in binary - 1 if the positive values and 0 for negative
  • Create a histogram of these binary codes across the image
100
Q

What is SIFT?

A

An algorithm for matching features between images.

101
Q

What should be considered when evaluating image features?

A
  • Resolution - what is the minimum size objects that can be distinguished
  • Number of features - algorithms take longer if more features are used
  • Invariances - are the features affected by rotation or lighting changes etc.
    Redundancy - could fewer features encode the same amount of information?
    Completeness - can the original image be reconstructed based on the features
102
Q

What are pixel statistics?

A

The image is divided into regions and aggregates of pixel values are computed for each region

103
Q

What are the basic approaches to image classification?

A

Template matching - the template is swept across the image to find matches
Nearest neighbor - find labelled images most similar to the current image

104
Q

How are detected objects classified?

A

With a cutout square. It should be large enough to contain any object, but not large enough to contain multiple objects