Exam 1 Flashcards
(84 cards)
Where is the most effort put in data mining?
Data preparation and cleaning
What are the data-related steps in the CRISP-DM guide?
Select/find data, clean the data, prepare the data, integrate the data, and format the data
How is data represented?
Numeric: Continuous attributes. Measurements, int, float data types
Nominal - values are symbolic labels (sunny, old, yellow); only equality checks are possible. Categorical coding may use numbers such as “1”, but they carry no arithmetic meaning.
Ratio - the measurement scheme defines a true zero point (e.g., a distance, or a temperature differential but not the temperature itself); math operations are valid.
Ordinal - rank order: “cold, cool, warm, hot” or “good, better, best”. No defined distance between values; can perform equality checks.
Interval - ordered and measured in fixed units.
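A minimal sketch of these scale types in code, assuming Python with pandas (the tooling is my assumption, not part of the course material):

```python
import pandas as pd

# Nominal: symbolic labels; only equality checks are meaningful.
weather = pd.Series(["sunny", "rainy", "sunny"], dtype="category")
print((weather == "sunny").sum())        # equality check -> 2

# Ordinal: ordered categories, but no defined distance between them.
temps = pd.Categorical(["cool", "hot", "warm"],
                       categories=["cold", "cool", "warm", "hot"],
                       ordered=True)
print(temps.min())                       # order is defined -> 'cool'

# Interval: ordered, fixed units, no true zero (e.g., degrees Celsius).
celsius = pd.Series([10.0, 20.0, 30.0])
print(celsius.diff())                    # differences are meaningful

# Ratio: true zero point, so ratios and all arithmetic are valid.
distances_km = pd.Series([5.0, 10.0])
print(distances_km[1] / distances_km[0]) # "twice as far" -> 2.0
```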
When is numeric data easy to interpret?
When defined ranges exist.
How do you measure something like “good”, “bad”, or “healthy”?
You need a domain expert.
What are some cautions on data cleaning?
Document what you do, work carefully, don’t make assumptions, be aware of bias.
What are some ways to introduce bias?
Language - different terms or grammars to describe the domain, data attributes, or the problem.
Search - the chosen search strategy may miss better solutions; look at other search options.
Overfitting - results provide a solution based on bad assumptions/patterns, or the search stops too soon.
Actions already performed on the data.
How the data was gathered (how questions were asked, how responses were interpreted, who asked the questions, how samples were selected).
Note that “bias” here is essentially a synonym for “error”.
What are some examples of data cleaning?
Handling invalid values, duplicates, missing data, data entry errors, converting data to specific values in order to perform correct measurements.
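A hedged sketch of a few of these cleaning steps, assuming Python with pandas; the column names and value ranges are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical customer records with several of the problems listed above.
df = pd.DataFrame({
    "age":  [34, -5, 34, np.nan, 210],                 # invalid + missing values
    "city": ["New York", "NY", "New York", "Boston", "Boston"],
})

df = df.drop_duplicates()                              # remove exact duplicates
df.loc[(df["age"] < 0) | (df["age"] > 120), "age"] = np.nan  # invalid -> missing
df["city"] = df["city"].replace({"NY": "New York"})    # unify coding schemes
print(df)
```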
What is meant by dirty data?
Data that is incorrect, inaccurate, irrelevant or incomplete.
Data needing to be converted (nominal to numeric; see the sketch after this list)
Data with different formats or coding schemes (such as dates)
Data from >1 file with different field delimiters
Data that is coded
Data that must be summarized (“rolled up”)
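For the nominal-to-numeric item above, one common conversion is one-hot encoding; a minimal sketch, assuming Python with pandas and a made-up “outlook” attribute:

```python
import pandas as pd

df = pd.DataFrame({"outlook": ["sunny", "rainy", "overcast", "sunny"]})

# Each label becomes its own 0/1 column, so no false ordering is implied.
encoded = pd.get_dummies(df, columns=["outlook"])
print(encoded)
```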
How does data get “dirty”?
Inconsistent definitions, meanings (especially when combining different sources)
Data entry mistakes
Collection errors
Corrupted data transmissions
Conversion errors.
What are some data issues?
Out of range entries
Unknown, unrecorded or irrelevant data
Missing values
Language translation issues
Unavailable readings
Inapplicable data (asking a male if pregnant)
Customer provided incorrect data
Duplicate data
Stale data
Unavailable data
Data may be available but not in electronic form.
Data associated with the wrong person
User provided wrong data.
Consider representing dates as YYYYMM or YYYYMMDD. What’s good about this formatting? What is the limitation?
Good: You can sort the data.
Limitation: Does not preserve intervals (e.g., 20040201 - 20040131 = 70 as integers, yet the dates are only one day apart).
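A quick illustration of that limitation in Python (the language choice is my assumption):

```python
from datetime import date

# YYYYMMDD integers sort correctly, but integer subtraction is meaningless.
print(20040201 - 20040131)            # 70, not 1

# Converting to real dates preserves intervals.
d1, d2 = date(2004, 2, 1), date(2004, 1, 31)
print((d1 - d2).days)                 # 1
```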
What are some legacy issues when it comes to dates?
Y2K - 2-digit years. Is year 02 1902 or 2002? It depends on context (a child’s birthday vs. the year a house was built). The typical approach is to set a cutoff year: if YY < cutoff, then 20YY, else 19YY.
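A sketch of that cutoff rule in Python; the cutoff value of 30 is an arbitrary assumption, so pick one that fits your domain:

```python
CUTOFF = 30  # assumed value; choose based on the data's context

def expand_year(yy: int) -> int:
    """Map a 2-digit year to a 4-digit year using a cutoff."""
    return 2000 + yy if yy < CUTOFF else 1900 + yy

print(expand_year(2))   # 2002
print(expand_year(85))  # 1985
```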
What are some reasons values may be missing?
- They are unknown, unrecorded, irrelevant data
- Malfunctioning equipment
- Changes in the design
- Collation/merge of different datasets.
- Unavailable data
- Removals because of security or privacy issues.
- Translation issues (especially languages)
- The data being used for a different purpose than originally planned (ethical/legal issues)
- self-reporting - people may omit if the input mechanism does not require an input.
How should one deal with missing values?
- Ignore the attribute or entire instances. (May throw out the needle in the haystack!)
- Try to estimate or predict: use mean, mode, or median values. Relatively easy and not bad on average.
- Treat missing as a separate value
- Look for sentinel values such as 0, “.”, 999, or N/A. Decide on a standard and create a new value.
- Does missing imply a default value?
- Compute the value based on previous values.
- If inserting zeros for missing values, think about what it has done to the mean and standard deviation.
- Be careful when using tools (some have default operations to handle missing data)
- Randomly select values from current distribution (pro: won’t change overall shape of the curve - little impact on the mean).
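A hedged sketch of two of the options above (median imputation and random sampling from the current distribution), assuming Python with pandas/NumPy and made-up income data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = pd.Series([40_000, 52_000, np.nan, 61_000, np.nan, 48_000])

# Option 1: impute with the median -- easy and "not bad on average".
median_filled = income.fillna(income.median())

# Option 2: randomly sample from the observed distribution, which keeps
# the overall shape of the curve and has little impact on the mean.
observed = income.dropna().to_numpy()
sampled = income.copy()
mask = sampled.isna()
sampled[mask] = rng.choice(observed, size=mask.sum())

print(median_filled.mean(), sampled.mean())
```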
Again, what are some sources of inaccurate data? :)
- Data entry mistakes
- Measurement errors
- Outliers previously removed
- Duplicates
- Stale data
- Different representations of the same value: New York, NY, N.Y.
How can you find inaccurate data?
Look for the obvious (run statistical tools) and look for nonsensical data (a negative grade or age).
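A minimal sketch of both checks, assuming Python with pandas and hypothetical column names:

```python
import pandas as pd

df = pd.DataFrame({"age": [21, 19, -3, 22], "grade": [88, 95, 102, -10]})

# Run simple statistics first: min/max in describe() expose the obvious.
print(df.describe())

# Then flag nonsensical, out-of-range rows explicitly.
bad = df[(df["age"] < 0) | (df["grade"] < 0) | (df["grade"] > 100)]
print(bad)
```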
What is discretization?
- Binning: converting continuous values into a set of discrete ranges
- Useful for generating summary data
- Produces discrete values
What is one issue that can come from binning with equal-width?
It can result in clumping. For example, if 99% of employees earn $0-200,000 and the owner makes $2,000,000, then with a bin width of 200,000 nearly everyone lands in the lowest bin and only one person (the owner) is in the upper bin.
How can we even out the distribution?
By binning with equal-height. Instead of defining bin sizes of range N, assign N values to each bin.
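A sketch of equal-width vs. equal-height binning, assuming Python with pandas (pd.cut vs. pd.qcut) and the salary scenario from the previous card:

```python
import pandas as pd

# Mostly low salaries plus one huge outlier, as in the example above.
salaries = pd.Series([30_000, 45_000, 60_000, 75_000, 90_000,
                      120_000, 150_000, 180_000, 2_000_000])

# Equal-width: 4 bins of equal range; the outlier clumps nearly
# everyone into the lowest bin.
print(pd.cut(salaries, bins=4).value_counts().sort_index())

# Equal-height (equal-frequency): roughly the same count in each bin.
print(pd.qcut(salaries, q=4).value_counts().sort_index())
```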
When binning…
- Do not split repeated values across bins
- Create a separate bin for special values.
Talk about the considerations with equal-width and equal-height binning.
Equal-width is simplest and works well in many situations, but equal-height is usually preferred and tends to give better results because it avoids clumping.
Are my bins okay?
After you create bins, create a histogram of the values and look at its general shape. A jagged shape may indicate a weakness in the way the bins were formed, so try a different number of bins and different boundaries (shift the ranges).
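One way to eyeball this check, sketched in Python with pandas and made-up values; printing bin counts stands in for a plotted histogram:

```python
import pandas as pd

values = pd.Series([1, 2, 2, 3, 3, 3, 4, 7, 8, 8, 9, 15])

# Try a few bin counts and compare the shape of each distribution.
for n_bins in (3, 4, 6):
    print(f"--- {n_bins} bins ---")
    print(pd.cut(values, bins=n_bins).value_counts().sort_index())
```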
Why use rollup?
It can help reduce the complexity of your model by aggregating detailed records into summary values.
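A small rollup sketch, assuming Python with pandas and hypothetical daily sales data aggregated to monthly totals:

```python
import pandas as pd

daily = pd.DataFrame({
    "date":  pd.to_datetime(["2004-01-30", "2004-01-31", "2004-02-01"]),
    "sales": [120, 95, 140],
})

# Rolling up daily rows to monthly totals shrinks the data and
# simplifies any model built on it.
monthly = daily.groupby(daily["date"].dt.to_period("M"))["sales"].sum()
print(monthly)
```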