[1] Data Mining Flashcards
(104 cards)
What is KDD?
Knowledge Discovery in Databases; the process of extracting knowledge from data
What are the steps for KDD?
[1] Selection - choose which data to use
[2] Pre-processing - handle missing data and errors
[3] Transformation - change the data format; consider feature reduction or extraction
[4] Data mining - apply algorithms to the transformed data to generate results
[5] Interpretation and evaluation - i.e. visualization
Why is data mining important?
The exponential growth of data exceeds human abilities to understand it themselves
What factors influence the selection of DM algorithms?
- The kind of data i.e. numeric or symbolic
- Whether the task is supervised or unsupervised
- What is the desired output i.e. a blackbox or a tree
What are the ways to approach KDD?
- Top-down discovery - analyst suggests hypotheses; results are analysed to see if they support this
- Bottom-up discovery - system automatically suggests patterns, analyst considers if they are relevant
- Mixed - analyst provides an area of search, and refine this based on the hypotheses generated
How are database queries different from DM?
Database queries return a subset of the data or an aggregate of it
DM outputs a KDD object i.e. a rule or tree which dind’t exist before
What is a data warehouse?
A collection of data on a single subject where the number of datasets can change
How is KDD applied to websites?
Web content mining - collect and analyse the content of pages
Web structure mining - use graph theory to study how pages are connected with hyperlinks
What is IR?
Information Retrieval i.e. search
At a high-level, what do classification algorithms do?
Find separating curves. Note that SVM provides the equations for these, while neural networks only give the regions
What defines AI systems?
They are machines that mimic cognitive functions such as learning and problem solving
What are fuzzy sets?
In contrast to crisp sets, they allow partial membership based on an membership function
How do set operators work for fuzzy sets?
AND(x,y) -> m(x N y) = min(m(x), m(y))
OR(x,y) -> m(x V y) = max(m(x), m(y))
NOT(x) -> m(~x) = 1 - m(x)
What is the process for fuzzy logic?
~ Fuzzify the input values into membership functions
~ Use rules to compute fuzzy outputs
~ De-fuzzify the output functions to get crisp outputs
What are the main DM tasks?
Classification - map data into #pre-defined# classes
Regression - map data to real-valued output
Prediction - predict a future state #in time# based on current state
Time Series Analysis - special case of classification, regression and prediction when the attributes vary over time
[C.R.P.T]
Clustering - the groups are pre-defined, so it is unsupervised
Segmentation - special cases of clustering where the DB is partitioned into #dis-jointed# sets
Association Rules - find relationships between instances i.e. items that frequently occur together
Sequence Discovery - find relationships based on time i.e. if X then Y will happen
[C.S.A.S]
What are the assumptions used for Bayes classifiers?
- Only one hypothesis (class) can occur at a time
- Conditional independence - P(X1, X2 | Y) = P(X1|Y) * P(X2|Y)
What is a key practical consideration for Bayes classifiers?
When the conditional etc. probabilities are calculated, start the counts at 1 to avoid multiplying by 0
What at the names of the probabilities in Bayes classifiers?
Posterior - p(hj|xi) - [this is the target]
Prior - p(hj)
Unconditional probability - p(x)
Conditional probability - p(xi|hj)
What do SVMs do?
Find a hyperplane which correctly classifies the data points and #separates them as far as possible#
In the context of SVMs, what are margins and what types are there?
This is a measure of how well they separate the data
A hard margin is the distance to the nearest point; a soft margin penalizes close points
What is a key limitation of SVMs, and how can it be addressed?
They can only classify data if it is linearly separable.
However, kernels transform data into higher dimensions so they do become separable (the actual hyperplanes are always linear)
What are the limitations of SVM?
- They require selection a good kernel
- They use a lot of memory and CPU
- Special consideration is needed for multiple classes i.e. non-binary problems
How do SVMs handle multiple classes?
Transform the problem into a set of binary problems, and either:
- one-vs-rest - train a classifier for each class to determine if it is or isn’t that class
- one-vs-many - train a classifier for each pair of classes, and use a voting protocol
In the context of DM, what is segmentation?
A special case of clustering where the tuples are disjointed i.e. aren’t neighbors