Midterm Flashcards

(45 cards)

1
Q

5 V’s of Big Data

A

Value - Turning big data into value
Velocity - The speed at which data is generated and at which changes occur across the diverse data sets
Volume - The amount of data being generated
Variety - Data can be structured as well as unstructured
Veracity - Data reliability and trust

2
Q

Data Mining

A

Extraction of interesting patterns or knowledge from huge amounts of data

3
Q

Web Mining Framework

A

Data cleaning
Data integration
Data selection
Data transformation
Data mining
Pattern evaluation
Knowledge presentation

AKA

Data pre-processing
Data Mining
Post-processing
Patterns, Info, Knowledge

4
Q

Data Mining on what data?

A
  • Text files
  • Database-oriented data sets and applications
  • Advanced data sets and advanced applications
5
Q

Supervised learning (classification)

A

Supervision: the training data are accompanied by labels indicating the class of the observations
- New data are classified based on the training set

6
Q

Unsupervised learning (clustering)

A
  • The class labels of the training data are unknown
  • Given a set of measurements, observations, etc. - try to establish the existence of classes or clusters in the data
7
Q

Classification and label prediction

A
  • construct models based on some training examples
  • describe and distinguish classes or concepts for future prediction
  • predict the class, classify the new example
8
Q

Regression

A
  • Predict a value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency
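
As a toy illustration of the linear case, here is an ordinary least-squares fit of y = w*x + b on made-up points (the function name and data are mine, not from the course):

```python
# A minimal sketch of linear regression: fit y = w*x + b by least squares.
def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); intercept from the means.
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - w * mx
    return w, b

# Toy data lying exactly on y = 2x + 1.
w, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```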
9
Q

Attribute

A

A property or characteristic of an object (columns)

10
Q

Object

A

An entity described by a collection of attributes (rows)

11
Q

Types of Data sets

A

Record (Data matrix, documents, transactions)
Graph (World Wide Web, molecular structures)
Ordered (spatial data, temporal data, sequential data, genetic sequence data)
Structured vs unstructured data

12
Q

Important characteristics of structured data

A

Dimensionality - Many attributes per object
Sparsity - only presence counts
Resolution - Patterns depend on the scale
Distribution

13
Q

Types of Attributes

A

Nominal - ID numbers, gender, zip codes
Ordinal - rankings, grades, height

Numeric Attribute Types:
Interval - measures on a scale of equal-sized units
Ratio - Inherent zero-point

14
Q

Properties of Attribute Values

A

The type of an attribute depends on which of the following properties/operations it possesses:
Distinctness
Order
Differences are meaningful
Ratios are meaningful

15
Q

Discrete vs Continuous Attributes

A

Discrete Attribute - Has only a finite or countably infinite set of values
- Sometimes represented as integer variables
- countable
- number of students, shoe size

Continuous attribute - measurable
- height, weight, length
- represented as floating-point variables

16
Q

Similarity and Dissimilarity Measures

A

Similarity - numerical measure of how alike two data objects are
- Value is higher when objects are more alike
- Often falls in the range [0,1]

Dissimilarity - numerical measure of how different two data objects are
- Value is lower when objects are more alike
- minimum dissimilarity is often 0

Proximity refers to a similarity or dissimilarity

17
Q

Cosine Similarity

A

Cosine measure can be used to measure the similarity between 2 document vectors
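
A minimal sketch of the measure, assuming plain term-frequency vectors (toy data, function name mine):

```python
import math

# Cosine similarity: cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)
def cosine_similarity(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

sim = cosine_similarity([1, 2, 0], [2, 4, 0])   # parallel vectors -> 1.0
sim2 = cosine_similarity([1, 0], [0, 1])        # orthogonal vectors -> 0.0
```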

18
Q

What is frequent pattern analysis?

A

Frequent pattern: a pattern that occurs frequently in a data set
Motivation: Finding inherent regularities in data

19
Q

(absolute) support, or support count of X is

A

The frequency or occurrence count of an itemset X

20
Q

(relative) support

A

Is the fraction of transactions that contain X

21
Q

An itemset X is frequent IF

A

X’s support is no less than a minimum support threshold

22
Q

support s is the probability that

A

a transaction contains X ∪ Y (i.e., both X and Y)

23
Q

confidence c, conditional probability that a transaction

A

containing X also contains Y

24
Q

Frequent itemsets

A
  • An itemset that contains k items is a k-itemset
  • rules that satisfy the minimum support and minimum confidence thresholds are considered strong rules
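
The support and confidence cards above can be illustrated on a tiny hypothetical transaction set (item names and counts are made up):

```python
# Hypothetical transaction database.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset):
    # Relative support: fraction of transactions containing the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    # Confidence of rule X -> Y: support(X ∪ Y) / support(X).
    return support(x | y) / support(x)

s = support({"bread", "milk"})        # 3 of 5 transactions -> 0.6
c = confidence({"bread"}, {"milk"})   # 3 of 4 bread transactions have milk -> 0.75
```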
25
Q

Basic association rule process

A

1. Find all frequent itemsets - each of these itemsets must occur at least as frequently as the predetermined minimum support count
2. Generate strong association rules from the frequent itemsets - these rules must satisfy the minimum support and minimum confidence
26
Q

Apriori: A candidate generation and test approach

A

If any itemset is infrequent, its supersets should not be generated/tested - in other words, all subsets of a frequent itemset must be frequent
27
Q

General Apriori method

A

- Scan the dataset to get frequent 1-itemsets
- Generate length (k+1) candidate itemsets from length-k frequent itemsets
- Test the candidates against the dataset to obtain support counts
- Terminate when no frequent or candidate set can be generated
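
A minimal, unoptimized sketch of those steps (function name and toy data are mine, not the course's):

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    # Scan once for frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    current = [frozenset([i]) for i in sorted(items)
               if sum(i in t for t in transactions) >= min_support_count]
    frequent = list(current)
    k = 1
    while current:
        # Generate (k+1)-candidates by joining frequent k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Apriori prune: every k-subset of a candidate must itself be frequent.
        prev = set(current)
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k))}
        # Test candidates against the dataset to obtain support counts.
        current = [c for c in candidates
                   if sum(c <= t for t in transactions) >= min_support_count]
        frequent.extend(current)
        k += 1  # terminates when no candidate survives
    return frequent

txns = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "c"}]
result = apriori(txns, 2)  # all singletons and pairs, but not {a,b,c}
```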
28
Q

Major Tasks in Data Preprocessing

A

Data cleaning - fill in missing values, smooth noisy data, identify or remove outliers
Data integration - integration of multiple databases, data streams, or files
Data reduction - dimensionality reduction, numerosity reduction
Data transformation and data discretization - normalization
29
Q

Data Cleaning

A

Handles data that is incomplete, noisy, or inconsistent
30
Q

How to handle missing data?

A

Ignore the record, or fill in the value automatically: with a constant like NA, the attribute mean, or the attribute mean for all samples belonging to the same class (the smartest approach)
31
Q

How to handle noisy data?

A

- Binning - first sort the data and partition it into equal-frequency bins
- Regression - smooth by fitting the data to regression functions
- Clustering - detect and remove outliers
- Combined computer and human inspection - detect suspicious values and check them manually
32
Q

What is data integration?

A

Combining data from multiple sources into a coherent dataset
- Schema integration - integrate metadata from different sources
33
Q

Handling Redundancy in Data Integration

A

- Object identification
- Derivable data
- Redundant attributes may be detected by correlation analysis and covariance analysis
34
Q

Correlation Analysis (Nominal Data)

A

Chi-squared test: X^2 = SUM of (O - E)^2 / E
The larger the X^2 value, the more likely the variables are related
Correlation does not imply causality
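
A sketch of the statistic on a hypothetical 2x2 contingency table of observed counts (function name and numbers are mine):

```python
# Chi-squared statistic: sum over all cells of (observed - expected)^2 / expected,
# where expected = row_total * col_total / grand_total.
def chi_squared(observed):
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total  # expected count
            chi2 += (o - e) ** 2 / e
    return chi2

# Large statistic -> the two nominal attributes are likely related.
stat = chi_squared([[250, 200], [50, 1000]])
```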
35
Q

Covariance

A

How much two attributes change together
- Positive covariance - if Cov(A,B) > 0, then A and B both tend to be larger than their expected values
- Negative covariance - if Cov(A,B) < 0, then when A is larger than its expected value, B is likely to be smaller than its expected value
- Independence - Cov(A,B) = 0
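
A toy sketch of the computation, dividing by n to match the expectation-based definition (data made up):

```python
# Cov(A, B) = E[(A - E[A]) * (B - E[B])], estimated over paired samples.
def covariance(a, b):
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n

# Positive result: the two attributes tend to rise together.
cov = covariance([2, 3, 5, 4, 6], [5, 8, 10, 11, 14])
```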
36
Q

Data Reduction

A

Obtain a reduced representation of the data set that is much smaller in volume but produces the same (or almost the same) analytical results
37
Q

Normalization is

A

Scaling data to fall within a smaller, specified range
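
For example, min-max normalization, one common scaling scheme (function and parameter names are mine):

```python
# Min-max normalization: map values linearly into [new_min, new_max].
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

scaled = min_max_normalize([20, 40, 60, 80, 100])  # -> [0.0, 0.25, 0.5, 0.75, 1.0]
```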
38
Q

Sampling

A

The main technique for data reduction - used because obtaining and processing the entire set of data of interest is too expensive or time consuming
39
Q

Types of Sampling

A

- Simple random sampling - equal probability of selecting any particular item
- Sampling without replacement - once an object is selected, it is removed from the population
- Sampling with replacement - a selected object is not removed, so it may be drawn again
- Stratified sampling - partition the data set, then draw samples from each partition
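
The schemes above can be sketched with the standard library (strata chosen arbitrarily by parity; the seed is fixed only for repeatability):

```python
import random

random.seed(42)  # fixed seed purely so repeated runs give the same draws
population = list(range(100))

# Simple random sampling without replacement: no item can repeat.
without_replacement = random.sample(population, 10)
# Sampling with replacement: the same item may be drawn more than once.
with_replacement = random.choices(population, k=10)
# Stratified sampling: partition the population, then sample each partition.
strata = [[x for x in population if x % 2 == 0],
          [x for x in population if x % 2 == 1]]
stratified = [v for stratum in strata for v in random.sample(stratum, 5)]
```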
40
Q

Curse of Dimensionality

A

When dimensionality increases, data becomes increasingly sparse in the space that it occupies
41
Q

Discretization

A

- The process of converting a continuous attribute into an ordinal attribute
- A potentially infinite number of values are mapped into a small number of categories
- Commonly used in classification
42
Q

Binning

A

- Equal-width - partition based on a set bin width
- Equal-frequency - partition based on the number of values in each bin
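
A sketch of both partitioning strategies on a small made-up sample (function names mine; the equal-frequency version assumes the length divides evenly by k):

```python
# Equal-width binning: k bins spanning equal-sized value ranges.
def equal_width_bins(values, k):
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        i = min(int((v - lo) / width), k - 1)  # clamp the max value into the last bin
        bins[i].append(v)
    return bins

# Equal-frequency binning: k bins each holding the same number of values.
def equal_frequency_bins(values, k):
    values = sorted(values)
    size = len(values) // k  # assumes len(values) is divisible by k
    return [values[i * size:(i + 1) * size] for i in range(k)]

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
ew = equal_width_bins(data, 3)      # ranges [4,14), [14,24), [24,34]
ef = equal_frequency_bins(data, 3)  # three values per bin
```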
43
Q

Unsupervised discretization

A

Finds breaks in the data values
44
Q

Supervised discretization

A

Uses class labels to find breaks
45
Q

Binarization

A

- Maps a continuous or categorical attribute into one or more binary values
- Typically used for association analysis
- Continuous is first converted to categorical, then categorical to binary
- Association analysis needs asymmetric binary attributes
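
A minimal sketch of the categorical-to-binary step, one-hot style, where each category gets its own asymmetric binary attribute (names and data are mine):

```python
# Binarization: map each categorical value to a 0/1 vector with a single 1
# in the slot of its category.
def binarize(values, categories):
    return [[1 if v == c else 0 for c in categories] for v in values]

colors = ["red", "green", "red", "blue"]
encoded = binarize(colors, ["red", "green", "blue"])
# encoded[0] -> [1, 0, 0]
```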