Week 5: Data Preparation Flashcards

1
Q

REVERSED

from py_stringmatching import similarity_measure as sm

A

What is the python library for computing similarity measures?

2
Q

REVERSED

  • Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the dataset
  • The attribute with the most distinct values is placed at the lowest level of the hierarchy
  • E.g. Country (highest level) -> state -> city -> street (lowest level)
  • This is also a type of data smoothing
A

What is concept hierarchy generation?

3
Q

REVERSED

Effective if data is clustered but not if data is “smeared”

A

When is data reduction through clustering useful and when is it not useful?

4
Q

REVERSED

Random error or variance in a measured variable

A

What is noise in data?

5
Q

REVERSED

  • Stepwise forward selection: starts with empty set of attributes. Best of original attributes are determined and added to the set at each step
  • Stepwise backward elimination: starts with full set of attributes. At each step, removes worst of remaining attributes
  • Combination of forward selection and backward elimination: start with empty set, combine methods so that at each step the procedure adds the best attribute to reduced set and removes the worst attribute from initial set
  • Decision tree induction: tree is constructed from given data. All attributes that do not appear in the tree are considered irrelevant
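
A minimal sketch of greedy stepwise selection (an assumption, not from the course notes) using scikit-learn's SequentialFeatureSelector; direction="forward" mimics stepwise forward selection and direction="backward" mimics backward elimination:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
# greedily add the best remaining attribute at each step until 2 attributes are selected
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=2, direction="forward")
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected attributes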
A

What are the 4 heuristic methods for selecting the subset in attribute subset selection?

6
Q

REVERSED

Quantifies the local density of a data point using a neighbourhood of size k
-Introduces a smoothing parameter: the reachability distance RD
RDk(x, y) = max{k-dist(y), dist(x, y)}, where k-dist(y) is the distance between y and its k-th nearest neighbour
-the local reachability density of point x is:
LRDk(x) = k / [sum over y in kNN(x) of RDk(x, y)]
-the local outlier factor LOF is:
LOFk(x) = [sum over y in kNN(x) of LRDk(y) / LRDk(x)] / k

-Generally, LOF > 1 means x has a lower density than its neighbours
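
A minimal sketch (assuming scikit-learn, whose LocalOutlierFactor implements this score):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.array([[0.0], [0.1], [0.2], [0.15], [5.0]])  # toy data with one obvious outlier
lof = LocalOutlierFactor(n_neighbors=3)
labels = lof.fit_predict(X)             # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_  # LOF values; > 1 suggests lower density than the neighbours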

A

What is the local outlier factor for outlier detection?

7
Q

REVERSED

lev_sim = sm.levenshtein.Levenshtein()
lev_sim.get_sim_score(s1, s2)

A

How do you compute the levenshtein similarity between strings s1 and s2 in python?

8
Q

REVERSED

  • Conformance to schema: evaluate constraints on a snapshot
  • Conformance to business rules: evaluate constraints on changes in the database
  • Accuracy: perform inventory (expensive), or use proxy (track complaints)
  • Glitches in analysis
  • Successful completion of end-to-end process
A

What are examples of data quality metrics? (5)

9
Q

REVERSED

Novelty detection involves checking whether new data fits with the existing data or would be considered an outlier

A

What is the difference between outlier detection and novelty detection?

10
Q

REVERSED

Attributes that duplicate much or all of the information contained in one or more other attributes

A

What are redundant attributes?

11
Q

REVERSED

Transform the multivariate outlier detection task into a univariate outlier detection problem

A

What is the general approach for outlier detection with multivariate data?

12
Q

REVERSED

  • Supervised: uses class information
  • Bottom-up merge: find the best neighbouring intervals to merge
  • Initially each distinct value is an interval; chi-squared tests are performed on every pair of adjacent intervals, and those with the lowest chi-squared values are merged. Merging is performed recursively until a predefined stopping condition is satisfied
A

What is correlation analysis for discretisation?

13
Q

REVERSED

Fit a model to the data and save the model instead

A

What is model based data reduction?

14
Q

REVERSED

Problem of identifying and linking/grouping different representations of the same real-world object

A

What is entity resolution?

15
Q

REVERSED

df.corr()

A

How do you find the correlation matrix for a dataframe in python?

16
Q

REVERSED

global, contextual, collective

A

What are the three kinds of outliers?

17
Q

REVERSED

Don’t assume an a priori statistical model; determine the model from the input data
e.g. histogram and kernel density estimation

A

What are non-parametric methods for outlier detection?

18
Q

REVERSED

  • Supervised methods: domain experts examine and label a sample of the underlying data and the sample is used for testing and training. Outlier detection modelled as a classification problem
  • Unsupervised methods: assume that normal objects are somewhat clustered. Outliers are expected to occur far away from any of the groups of normal objects
  • Semi-supervised methods: only a small set of the normal or outlier objects are labelled, but most of the data are unlabelled. The labelled normal objects together with unlabelled objects that are close by, can be used to train a model for normal objects
A

What are the three types of outlier detection methods?

19
Q

REVERSED

Simple random sampling may have poor performance in the presence of skew

A

When does simple random sampling have poor performance?

20
Q

REVERSED

checking permitted characters
finding type-mismatched data

A

What is data validation?

21
Q

REVERSED

  • Reflects the use of the data
  • Leads to improvements in processes
  • Measurable (we can define metrics)
A

What do we need in a definition of data quality? (3)

22
Q

REVERSED

Assumes that the normal data is generated by a parametric distribution with parameter theta

  • The probability density function of the parametric distribution, f(x, theta), gives the probability that x is generated by the distribution
  • The smaller this value, the more likely x is an outlier
A

What are parametric methods for outlier detection?

23
Q

REVERSED

#fill each na with the value before it
data.fillna(method='pad')    # or method='ffill'
#fill each na with the value after it
data.fillna(method='bfill')  # or method='backfill'
#set a limit on the number of forward or backward fills
data.fillna(method='pad', limit=1)
A

What are the 2 different methods for filling nas in python?

24
Q

REVERSED

  • Inconsistent: containing discrepancies in codes or names
  • Intentional: e.g. disguised missing data such as Jan 1st for all birthdays
A

What makes data “dirty”? (2)

25
Q

REVERSED

capitalisation, white space normalisation, correcting typos, replacing abbreviations, variations, nicknames

A

What is data normalisation in text?

26
Q

REVERSED

  • Binning
  • Histograms
  • Clustering
  • Classification (e.g. decision trees)
  • Correlation
A

What are discretisation methods? (5)

27
Q

REVERSED

O(n^2)

A

What is the time complexity of computing pairwise similarity?

28
Q

REVERSED

  • Divides range into N intervals, each containing approximately the same number of samples
  • Managing categorical attributes can be tricky
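
A minimal pandas sketch (not from the original notes); qcut produces equal-frequency intervals:

import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = pd.qcut(prices, q=3)     # 3 intervals, each holding roughly the same number of samples
print(bins.value_counts())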
A

What is equal-depth partitioning for discretisation? What is a problem with it?

29
Q

REVERSED

Transform the data by moving the decimal point of the values of attribute A
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
e.g. if the maximum absolute value of A is 986, divide each value by 1000 (j = 3)
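
A small illustrative sketch (assuming numpy/pandas):

import numpy as np
import pandas as pd

a = pd.Series([45, -120, 986])                 # max |v| = 986
j = int(np.ceil(np.log10(a.abs().max() + 1)))  # smallest j with max(|v'|) < 1, here j = 3
a_scaled = a / 10 ** j                         # -> 0.045, -0.120, 0.986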

A

How do you normalise data by decimal scaling?

30
Q

REVERSED

Global approaches: the reference set contains all other data objects
Local approaches: the reference set contains only a small subset of the data objects, and there is no assumption on the number of normal mechanisms

A

What is the difference between global and local approaches to outlier detection

31
Q

REVERSED

  • Accuracy: the data was recorded correctly
  • Completeness: all relevant data was recorded
  • Uniqueness: entities are recorded once
  • Timeliness: the data is kept up to date
  • Consistency: the data agrees with itself
  • Believability: how much the data is trusted by users
  • Interpretability: how easy the data is understood
A

What is the definition of data quality? (7 parts)

32
Q

REVERSED

  • Divides the range into N intervals of equal size: uniform grid
  • If A and B are the smallest and largest values of the attribute, the width of the intervals will be W = (B-A)/N
  • The most straightforward, but outliers may dominate presentation
  • Skewed data is not handled well
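
A minimal pandas sketch (not from the original notes); cut produces N equal-width intervals:

import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = pd.cut(prices, bins=3)   # interval width W = (34 - 4)/3 = 10
print(bins.value_counts())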
A

What is equal width partitioning for discretisation? What are the 2 problems with it?

33
Q

REVERSED

  • similarity measures have different scales
  • pairwise similarity between records is expensive
A

What are issues with computing similarity measures? (2)

34
Q

REVERSED

Given two records, compute a vector of similarity scores for corresponding features
-Score can be Boolean (match/mismatch) or a continuous value based on specific similarity measure (distance function)

A

What is matching features?

35
Q

REVERSED

  • Binning: first sort data and partition into equal frequency (equidepth) bins, then one can smooth by bin means, smooth by bin median, smooth by bin boundaries etc
  • Regression: smooth by fitting the data into regression functions
  • Clustering: detect and remove outliers that do not belong to any of the clusters
  • Combined computer and human inspection: detect suspicious values and check by human
A

What are 4 ways to handle noisy data?

36
Q

REVERSED

data.fillna() 
#inplace=True replaces the values in the original dataframe
A

What is the python code for filling in missing values?

37
Q

REVERSED

unsupervised, top down splitting method

A

What type of discretisation method is binning?

38
Q

REVERSED

Attributes that contain no information that is useful for the data mining task at hand

A

What are irrelevant attributes?

39
Q

REVERSED

An object is a contextual outlier (Oc, also called a conditional outlier) if it deviates significantly with respect to a selected context
Issue: how to define or formulate a meaningful context

A

What is a contextual outlier and what is the issue with detecting them

40
Q

REVERSED

  • Ignore the tuple: usually done when the class label is missing - not effective when the % of missing values is large
  • Fill in the missing value manually: tedious + infeasible
  • Fill in the missing value automatically (data imputation) with: a global constant e.g. “unknown” or a new class, the attribute mean, the attribute mean for all samples belonging to the same class, or the most probable value found through regression, inference or a decision tree
A

What are 3 ways of handling missing data? (3)

41
Q

REVERSED

Transform the data from the given range [minA, maxA] to a new interval [new_minA, new_maxA] for a given attribute A
v' = (v - minA)/(maxA - minA) * (new_maxA - new_minA) + new_minA
where v is the current value
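
A small pandas sketch of the formula (column name and target range [0, 1] are assumptions):

import pandas as pd

df = pd.DataFrame({"A": [12000, 16000, 73600, 98000]})
v = df["A"]
new_min, new_max = 0.0, 1.0
df["A_scaled"] = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min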

A

What is min-max normalisation?

42
Q

REVERSED

remove unimportant attributes

A

What is dimensionality reduction?

43
Q

REVERSED

assume the normal data is generated by a mixture of normal distributions
For any object o in the dataset, the probability that o is generated by the mixture of distributions is the sum of the probability density functions of the component distributions at o
Use the EM algorithm to learn the parameters of the mixture; an object is an outlier if it does not belong to any of the main groups of the data
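
A minimal sketch with scikit-learn's GaussianMixture, which fits the mixture via EM (the 1% threshold below is an arbitrary illustration):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(10, 1, (100, 2))])
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
log_density = gmm.score_samples(X)                          # log-likelihood of each object under the mixture
outliers = X[log_density < np.quantile(log_density, 0.01)]  # flag the lowest-probability objects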

A

For multivariate data, how do you overcome the simplified assumption that data is generated by a normal distribution? What method for outlier detection can you use for this new assumption?

44
Q

REVERSED

  • Noise
  • Duplicate data
  • Outliers
  • Unreliable sources
  • Inconsistent values
  • Outdated values
  • Missing values
A

What are data quality issues? (7)

45
Q

REVERSED

integrate metadata from different sources

A

What is schema integration?

46
Q

REVERSED

jaro_sim = sm.jaro.Jaro()
jaro_sim.get_raw_score(s1, s2)

A

How do you compute jaro similarity between strings s1 and s2 in python?

47
Q

REVERSED

Removing irrelevant or redundant attributes

A

What is attribute subset selection?

48
Q

REVERSED

  • An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism
  • Outliers are data or model glitches
A

What is an outlier?

49
Q

REVERSED

data set involving two or more attributes or variables

A

What is multivariate data?

50
Q

REVERSED

numeric data only

A

What type of data can you perform principal component analysis on?

51
Q

REVERSED

Using a histogram: graph the data as a histogram of percentages; a value is an outlier if it falls in a bin containing a very small percentage of the data
Or use kernel density estimation to estimate the probability density function of the data. For an object o, the density function f(o) gives the estimated probability that the object is generated by the stochastic process. If f(o) is low, the object is likely an outlier
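
A minimal kernel density estimation sketch (assuming scikit-learn); objects with low estimated density are flagged:

import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = np.append(rng.normal(0, 1, 200), 8.0).reshape(-1, 1)   # mostly normal data plus one extreme value
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)
log_f = kde.score_samples(X)                     # log of the estimated density f(o)
outliers = X[log_f < np.quantile(log_f, 0.01)]   # objects in low-density regions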

A

What is a non-parametric method for outlier detection with multivariate data?

52
Q

REVERSED

A subset of data objects collectively deviate significantly from the whole data set, even if the individual data object may not be outliers
Need to have the background knowledge on the relationship among the data objects, such as distance or similarity measure on objects

A

What are collective outliers?

53
Q

REVERSED

dividing the range of a continuous attribute into intervals

A

What is data discretisation?

54
Q

REVERSED

aff = sm.affine.Affine(…)
aff.get_raw_score(s1, s2)

A

How do you compute the affine gap similarity in python?

55
Q

REVERSED

obtaining a small sample s to represent the whole data set N; choose a representative subset of the data

A

How do you reduce data by sampling?

56
Q

REVERSED

Given N data vectors from d-dimensions, find k <= d principal components that can accurately represent the data. Steps:

  • Normalise the input data: so that each attribute falls within the same range
  • Compute k orthonormal (unit) vectors i.e. principal components. These are unit vectors that each point in a direction perpendicular to the others. Each input data (vector) is a linear combination of the k principal components
  • The principal components are sorted in order of decreasing significance or strength. The principal components serve as new set of axes for the data. The first axis (first ranked principal component) shows the most variance among the data
  • The components are sorted. Reduce the data dimensionality by eliminating the weak components. Weak components have low variance
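
A minimal sketch of these steps with scikit-learn (an illustration, not from the notes):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 5)                  # 100 data vectors in d = 5 dimensions
X_std = StandardScaler().fit_transform(X)   # step 1: normalise the input data
pca = PCA(n_components=2)                   # keep the k = 2 strongest components
X_reduced = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)        # variance captured by each component, in decreasing order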
A

What are the steps of principal component analysis?

57
Q

REVERSED

Tests the hypothesis that attributes A and B are independent based on the chi-squared statistic
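
A minimal sketch with scipy (column names and data are placeholders):

import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({"A": ["x", "x", "y", "y", "x", "y"],
                   "B": ["p", "p", "q", "q", "p", "q"]})
table = pd.crosstab(df["A"], df["B"])              # contingency table of the two nominal attributes
chi2, p_value, dof, expected = chi2_contingency(table)
# a small p-value rejects the independence hypothesis, i.e. A and B are correlated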

A

What is the chi-squared correlation test for nominal data?

58
Q

REVERSED

An object is a global outlier (Og, also called a point anomaly) if it deviates significantly from the rest of the data set
Issue: finding an appropriate measure of deviation

A

What is a global outlier and what is the issue with detecting them?

59
Q

REVERSED

Assume that the normal data objects belong to large and dense clusters, whereas outliers belong to small or sparse clusters

A

What is clustering-based outlier detection?

60
Q

REVERSED

Considering the output of an outlier detection algorithm
Labelling approaches: binary output - data objects are labeled either normal or outlier
Scoring approaches: continuous output - for each object an outlier score is computed, e.g. the probability of it being an outlier

A

What is the difference between labelling and scoring for outlier detection?

61
Q

REVERSED

label encoding, one-hot encoding
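
A small pandas sketch of both techniques (column name is a placeholder):

import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})
# label encoding: map each category to an integer code
df["colour_label"] = df["colour"].astype("category").cat.codes
# one-hot encoding: one binary indicator column per category
one_hot = pd.get_dummies(df["colour"], prefix="colour")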

A

What are the names of 2 techniques for turning categorical data into numerical data?

62
Q

REVERSED

Transform the data by converting the values to a common scale with an average of zero and a standard deviation of one
v’ = (v - mean(A))/sd(A)

A

What is z-score normalisation?

63
Q

REVERSED

  • Simple random sampling: there is an equal probability of selecting any particular item
  • Simple random sampling without replacement: once an object is selected, it is removed from the population
  • Simple random sampling with replacement: a selected object is not removed from the population
  • Cluster sampling: random sampling of clusters
  • Stratified sampling: partition data set and draw samples from each partition proportionally, i.e. approximately the same percentage of the data. Used in conjunction with skewed data
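
A small pandas sketch of simple random and stratified sampling (column name is a placeholder):

import pandas as pd

df = pd.DataFrame({"value": range(100), "class_label": ["a"] * 90 + ["b"] * 10})
srs_without = df.sample(n=10)                             # simple random sampling without replacement
srs_with = df.sample(n=10, replace=True)                  # simple random sampling with replacement
stratified = df.groupby("class_label").sample(frac=0.1)   # draw ~10% from each class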
A

What are the 5 types of sampling?

64
Q

REVERSED

data set involving only one attribute or variable

A

What is univariate data?

65
Q

REVERSED

  • Schema matching: e.g. contact number vs phone
  • Compound attributes: e.g. address vs street, city, zip
A

What is schema normalisation?

66
Q

REVERSED

Assume that the data are normally distributed and learn the parameters from the input data. An object is an outlier if it is more than 3 standard deviations from the mean, i.e. the z-score (x - mean)/sd has an absolute value greater than 3
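
A minimal numpy sketch of the 3-standard-deviation rule:

import numpy as np

rng = np.random.default_rng(0)
x = np.append(rng.normal(25, 2, 500), 60.0)   # mostly normal data plus one extreme value
z = (x - x.mean()) / x.std()                  # z-score of each value
outliers = x[np.abs(z) > 3]                   # values more than 3 sd from the mean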

A

What is the maximum likelihood method for outlier detection?

67
Q

REVERSED

Blocking: divide the records into blocks, perform pairwise comparison between records in the same block only

A

How can you reduce the time complexity of pairwise similarity

68
Q

REVERSED

  • Assume that an object is an outlier if the nearest neighbours of the object are far away
  • Two types of proximity based methods: distance-based and density-based
A

What are proximity based methods for outlier detection?

69
Q

REVERSED

  • Smoothing: removing noise from the data. includes binning, regression, clustering
  • Attribute/feature construction: new attributes constructed from the given ones
  • Aggregation: summary or aggregation operations applied, data cube construction
  • Normalisation: scaled to fall within a smaller, specified range. Includes min-max normalisation, Z-score normalisation, normalisation by decimal scaling
  • Data reformatting: e.g. Jack Wilsher -> Wilsher, J.
  • Using the same unit: e.g. inches and cm
  • Discretisation: raw values of numeric data attributes by interval labels or conceptual labels
  • Concept hierarchy generation: attributes such as street generalised to higher level concepts like city
A

What are methods for data transformation? (8)

70
Q

REVERSED

from sklearn.preprocessing import StandardScaler
StandardScaler().fit_transform(df)

A

How do you normalise data by z-score in python?

71
Q

REVERSED

Combining data from multiple sources into a coherent data store

A

What is data integration?

72
Q

REVERSED

  • Smoothing by bin means: each value in a bin is replaced by the mean value of the bin
  • Smoothing by bin medians: each value in a bin is replaced by the median value of the bin
  • Smoothing by bin boundary: the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value
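
A small pandas sketch of smoothing by bin means (an illustration using equal-frequency bins):

import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = pd.qcut(prices, q=3)                         # partition into equal-frequency bins
smoothed = prices.groupby(bins).transform("mean")   # replace each value by its bin mean (9, 22, 29)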
A

What ways can you handle noisy data through binning? (3)

73
Q

REVERSED

  • Supervised: given class labels, top down recursive split
  • Using entropy to determine split point (discretisation point)
A

What is the classification/decision tree method of discretisation?

74
Q

REVERSED

split = top down method 
merge = bottom up method
A

What does split and merge mean in discretisation?

75
Q

REVERSED

Let o* be the mean vector of a multivariate dataset. The Mahalanobis distance from an object o to o* is:
MDist(o, o*) = (o - o*)^T S^(-1) (o - o*), where S is the covariance matrix
Apply an outlier detection technique such as Grubbs' test to the MDist values to detect outliers
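
A minimal numpy sketch of the MDist computation (the Grubbs' test step is not shown):

import numpy as np

X = np.random.default_rng(0).normal(size=(100, 3))   # multivariate dataset
o_star = X.mean(axis=0)                               # mean vector o*
S_inv = np.linalg.inv(np.cov(X, rowvar=False))        # inverse covariance matrix S^-1
diff = X - o_star
mdist = np.einsum("ij,jk,ik->i", diff, S_inv, diff)   # (o - o*)^T S^-1 (o - o*) for every object o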

A

What is mahalanobis distance for outlier detection?

76
Q

REVERSED

data.dropna()

A

What is the python code for removing missing values?

77
Q

REVERSED

Redundant attributes can be detected by correlation and covariance analysis

A

How can you detect/handle redundant data attributes?

78
Q

REVERSED

Transformations are applied to obtain a reduced or compressed representation of the original data

A

What is data compression?

79
Q

REVERSED

Partition data set into clusters based on similarity and store cluster representation (e.g. centroid and diameter) only

A

How do you reduce data using clustering?

80
Q

REVERSED

Use a model to summarise the data e.g. linear regression. data points that do not conform to the model are potential outliers

A

What is a model-based approach to outlier detection?

81
Q

REVERSED

Assume that the normal data objects are generated by a stochastic process (a generative model) and that data not following the model are outliers. Learn a generative model fitting the given data set, and then identify the objects in low probability regions of the model as outliers

A

What is a statistical approach to outlier detection?

82
Q

REVERSED

Obtain a reduced representation of the dataset that is much smaller in volume but yet produces the same (or almost the same) analytical results

A

What is data reduction?

83
Q

REVERSED

Judge a point based on its distance to its neighbours
Given a radius r and a fraction pi, a data point x is considered an outlier if the fraction of other data points that lie within distance r of x (relative to the total size of the dataset) is less than pi
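
A minimal numpy/scipy sketch (the values of r and pi are arbitrary illustrations):

import numpy as np
from scipy.spatial.distance import cdist

X = np.random.default_rng(0).normal(size=(200, 2))
r, pi = 1.0, 0.05
D = cdist(X, X)                                             # pairwise distances
frac_within_r = ((D < r).sum(axis=1) - 1) / (len(X) - 1)    # fraction of other points within radius r of x
is_outlier = frac_within_r < pi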

A

How does the distance-based approach to outlier detection work?

84
Q

REVERSED

  • Unmeasurable: accuracy and completeness are extremely difficult, perhaps impossible to measure
  • Context independent: no accounting for what is important
  • Incomplete: what about interpretability, accessibility, metadata, analysis etc
  • Vague: the previous definition provides no guidance towards practical improvements of the data
A

What are the problems in the definition of data quality (4)

85
Q

REVERSED

Must use a density-based approach; distance-based methods can't detect local outliers

A

What proximity based approach should you use to detect local outliers?

86
Q

REVERSED

corr(A,B) = cov(A,B) / (sd(A) * sd(B))

A

How are correlation and covariance related?

87
Q

REVERSED

when dimensionality increases, data becomes increasingly sparse, and density and distance between points become less meaningful

A

What is the curse of dimensionality?

88
Q

REVERSED

#take sample of 3 rows without replacement: 
df.sample(3) 
#take sample of 3 rows with replacement: 
df.sample(3, replace=True)
A

How do you take a sample of a dataframe with and without replacement in python?

89
Q

REVERSED

  • Principal component analysis (PCA)
  • Singular value decomposition (SVD)
  • Feature subset selection, feature creation
A

What are 3 strategies for dimensionality reduction?

90
Q

REVERSED

  • Divide the data into buckets and store the average (or sum) for each bucket
  • Partitioning rules: equal-width (equal bucket range) and equal-frequency (equal depth, where each bucket contains the same number of data points)
A

How do you reduce data with histograms?

91
Q

REVERSED

contextual attributes define the context, behavioural attributes define the characteristics of the object used in outlier evaluation

A

What are contextual and behavioural attributes?

92
Q

REVERSED

df = pd.DataFrame(np.arange(20).reshape(5, 4))

A

What is the python code to generate a dataframe with 20 elements with 5 rows and 4 columns?

93
Q

REVERSED

The cost of obtaining a sample is proportional to the size of the sample s, not the size of the dataset N. Therefore sampling complexity is potentially sublinear to the size of the data

A

What is an advantage of sampling?

94
Q

REVERSED

A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values

A

What is data transformation?

95
Q

REVERSED

  • Modelling normal objects and outliers properly
  • Application-specific outlier detection
  • Handling noise in outlier detection
  • Understandability
  • A data set may have multiple types of outlier
  • One object may belong to more than one type of outlier
A

What are challenges of outlier detection? (6)

96
Q

REVERSED

data["column1"].fillna(data.groupby("column2")["column1"].transform("mean"))

A

What is the code in python to: fill nas in column 1 with mean values of column 1 grouped by column 2

97
Q

REVERSED

  • Data can be aggregated: for example, if you have the sales for each quarter, create a new variable with yearly sales. The resulting dataset is smaller
  • Data cubes store multidimensional aggregated information
A

What is data cube aggregation?

98
Q

REVERSED

Business understanding
Data understanding
Data preparation
Modelling
Evaluation
Deployment

A

What are the steps of CRISP-DM (Cross-Industry Standard Process for Data Mining)? (6)

99
Q

REVERSED

O(k(n/k)^2)

A

What is the time complexity of doing pairwise similarity in blocks with k blocks and block size n/k?

100
Q

REVERSED

x's nearest neighbours (the closest cluster) are far from x, i.e. x lies in a low-density region

A

What does a low local reachability density mean?

101
Q

REVERSED

np.add(A, B)

A

How do you add two lists A and B by element addition using numpy as np in python?

102
Q

REVERSED

  • Too many bins: the data won't be smoothed, the noise is kept, and a lot of computation is required
  • Too few bins: a lot of detail in the data is hidden
A

What are the disadvantages of too many or too few bins when smoothing data?

103
Q

REVERSED

  • Find a projection that captures the largest amount of variation in data
  • We find the eigenvectors of the covariance matrix, and these eigenvectors define the new space
A

What is principal component analysis?

104
Q

REVERSED

  • Difference between numerical values
  • Jaro for comparing names
  • Edit distance for typos
  • Phonetic-based
  • Jaccard for sets
  • Cosine for vectors
A

What similarity measures can be used for matching features? (6)

105
Q

REVERSED

np.mean(data)

A

How do you get summary statistics such as mean using numpy as np in python?

106
Q

REVERSED

  • Replace the original data volume by alternative, smaller forms of data representation
  • Includes modelling, histograms, clustering, sampling and data cube aggregation
A

What is numerosity reduction?