Chapter 2 - Data Flashcards

1
Q

Elaborate on the “type of data”

A

The “type of data” is one of the four main issues we consider regarding data. It covers how attributes are characterized (nominal, ordinal, interval, ratio; discrete vs. continuous; symmetric vs. asymmetric) and which kinds of data sets we work with (record, graph-based, ordered).
2
Q

What is quantitative data and what is qualitative data?

A

Quantitative data can be measured; qualitative data cannot.

We find quantitative data in the shape of questions like “how many…” and “how often…”.
Quantitative data usually has some aggregation method applied in order to compute a result. It is very suitable for statistics.
Strengths of quantitative data: it is objective and concise.
Weakness of quantitative data: it lacks context. People can pull numbers from anywhere and show graphs, but it can be difficult to understand the actual context.

Qualitative data is descriptive and concerns itself with questions like “why” and “how”. It is typically gathered through interviews, focus groups, and observations. Qualitative data is about trying to understand nuances in human behavior and to understand concepts.
Analysis methods include pattern recognition and theme identification.
Strengths: the data is very rich and detailed. It provides details that we can never get from quantitative data, which can help us identify interconnectedness between aspects of the data.
Weakness: difficult to gather and, at times, to use.

For me, this means that different types of data serve different purposes.

ANOTHER WORD for quantitative data, is “numerical”.
ANOTHER WORD for qualitative data, is “categorical”.

3
Q

Name some common dangers of data quality

A

Outliers and noise.

Missing, inconsistent or duplicate data.

Data that is biased.

Examples:
Asking only people who own solar panels questions intended to reveal how the general population views solar panels.

4
Q

What could be done in the “preprocessing” step of data mining?

A

Convert continuous attributes into discrete ones. For instance, length sometimes needs to be converted into “short”, “medium”, or “long” in order to apply some technique. NB: this is not about the storage size of a variable, but about a division made on the basis of the actual length value.

It is also common to reduce the number of attributes in a data set, because many techniques are more effective when run on fewer attributes.
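
A minimal sketch of such a conversion, assuming pandas is available (the bin edges and length values below are made up for illustration):

```python
import pandas as pd

# Hypothetical continuous lengths, discretized into three labels.
lengths = pd.Series([0.4, 1.2, 3.5, 7.9, 12.0])
labels = pd.cut(lengths,
                bins=[0, 2, 8, float("inf")],
                labels=["short", "medium", "long"])
print(labels.tolist())  # ['short', 'short', 'medium', 'medium', 'long']
```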

5
Q

What are the main issues regarding data?

A

There are four which we consider:

1) Types of data.

2) Data quality

3) Data preprocessing

4) Measures of similarity and dissimilarity

6
Q

Define a “data set”

A

A data set is a collection of data objects.

Other names for “data objects” are:
- vector
- record
- point
- observation

7
Q

How are data objects described?

A

Data objects are described by a number of attributes that define each object. The attributes capture the characteristics of the object, for instance the “mass” of some physical object.

Attributes are commonly also referred to as:
- variables
- fields
- features
- dimensions

8
Q

What are attributes?

A

A property or characteristic of a data object that can vary with object or time.

This definition highlights variability. If ALL instances/data objects share the same value for some attribute, then it is not worth storing and analyzing, as it is trivial.

9
Q

What is a “measurement scale”?

A

A measurement scale is a rule (function) that associates a numerical or symbolic value with an attribute of an object.

Therefore, we can sort of describe a measurement scale as the domain of values that some attribute can take.

10
Q

What is the process of “measurement”?

A

The process of “measurement” is the application of measurement scale to associate a value with a particular attribute of a specific object.

Ex: We perform a measurement whenever we, for instance, step on a scale and measure our weight.

11
Q

What is the “type of an attribute”?

A

The type of an attribute is the type of its measurement scale. If the measurement scale produces integers, say “years old”, then the attribute type is integer because the measurement scale type is integer.

12
Q

What is important regarding the type of attributes?

A

The properties of an attribute need not be the same as the properties of the values used to measure it. For instance, even though integers can be averaged, it makes no sense to compute the average of the attribute “ID”.

In other words: the values we use to represent an attribute can have properties that are NOT properties of the attribute itself, and vice versa.

For us, this means that we should know what the attributes really represent before analyzing data for correlations etc. It would be embarrassing to announce a correlation between attributes that have nothing in common.

13
Q

What are the four types of attributes?

A

Nominal (distinctness)
Ordinal (order)
Interval (subtraction and addition)
Ratio (multiplication and division)

14
Q

What are nominal attributes?

A

Nominal attributes are attributes whose values are just names. We can only use nominal attributes for equality checks.

The usual suspects include: employee ID number, blood type, eye color, etc.

The point is that nominal attributes are very limited in terms of aggregation. Nominal attributes are in fact qualitative attributes.

However, we can still do things like “count distinct blood types” or “count IDs grouped by blood type”, etc.

15
Q

What are ordinal attributes?

A

Ordinal attributes are also qualitative attributes. Ordinal refers to the fact that the values can be ordered (and are often encoded as numbers), but the numbers themselves carry no magnitude information, as they are not necessarily evenly spaced.

The key point is that ordinal attributes can be used for ORDERING. Recall preferences.

16
Q

What are interval attributes?

A

Attributes where the differences between values are meaningful. This means that we can also perform addition and subtraction on these attributes. A classic example is temperature in Celsius.

17
Q

What are ratio attributes?

A

Ratio attributes are attributes whose values are such that we can perform multiplication and division on them. In other words, all of the “above” applies.

Examples include “mass”, “length”, “monetary value”.

18
Q

Can we represent the four types of attributes as transformations?

A

Yes, like this:

Nominal attributes permit any one-to-one mapping. This means we can always exchange all values for new values IF we follow a strict one-to-one function. For instance, if all employee IDs are changed, this is fine as long as we use a one-to-one function.

Ordinal attributes permit any ORDER-PRESERVING transformation. This means we can change all values as long as the function we use is monotonic.

Interval attributes must preserve the fact that differences in values matter. We achieve this with a linear function f(x) = ax + b.

Ratio attributes permit transformations of the form newValue = a * oldValue.

IMPORTANT: the meaning behind all this is that the four attribute types can be defined by the transformations under which their meaning stays exactly the same.
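
A small sketch of the interval and ratio cases (the Celsius-to-Fahrenheit and meters-to-feet maps are illustrative choices of a and b, not from the card):

```python
def interval_transform(x, a=9/5, b=32):
    # Linear map f(x) = a*x + b, e.g. Celsius -> Fahrenheit (interval type).
    return a * x + b

def ratio_transform(x, a=3.28084):
    # Pure scaling f(x) = a*x, e.g. meters -> feet (ratio type, no offset).
    return a * x

c1, c2 = 10.0, 20.0
# The interval map preserves differences up to the factor a:
assert abs((interval_transform(c2) - interval_transform(c1)) - (9/5) * (c2 - c1)) < 1e-9
# The ratio map preserves ratios exactly:
assert abs(ratio_transform(c2) / ratio_transform(c1) - c2 / c1) < 1e-9
```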

19
Q

Can binary attributes be nominal?

A

Yes. We don’t restrict ourselves to numbers. A binary attribute can take on values such as {male, female}, although we can of course also model this using 1 and 0.

20
Q

What are asymmetric attributes?

A

Asymmetric attributes are attributes where the only values of interest are those that are non-zero.

Asymmetric attributes are all about the likelihood of values. It all makes sense when placed in the context of measuring similarity or dissimilarity. If the attribute is such that we only consider similarity on the values that are not 0, then the attribute is asymmetric. Take the market basket case: since we would NEVER say that two grocery bags are similar just because both lack a lot of the same items, these attributes are asymmetric.

However, if the attribute is such that the values 1 and 0 both occur relatively often, then the attribute is symmetric. For instance, gender is a symmetric attribute.

21
Q

Do the four types of data attributes include all possible types of attributes?

A

No, there are many others. But they are a very good starting point.

22
Q

We focus on 3 types of data sets. There are of course other sets as well. But, name and briefly explain the 3 we focus on

A

Record data: collections of records (data objects) that all share the same fixed set of attributes, with no explicit relationships among the records.

Graph-based data: data where relationships among the objects are captured as a graph, or where the objects themselves are graphs.

Ordered data: data whose attributes involve an order in time or space.

23
Q

Elaborate on the general characteristics of data sets

A

We focus on 3 characteristics of data sets:
1) dimensionality
2) Distribution
3) Resolution

These 3 characteristics apply to very many data sets and usually have quite some impact.

Dimensionality is the number of attributes.
Increasing dimensions makes everything more difficult (curse of dimensionality: exponential increase). Therefore, we often do a lot of work with dimensionality reduction.

Distribution refers to the frequency of values or sets of values for the attributes comprising data objects.
There are different types of distributions. We have normal (gaussian) etc.
However, many data sets have distributions of data that do not follow known statistical distributions.
Therefore, many data mining algorithms do not assume anything about the statistical distribution of the data they analyze.

Resolution refers to the granularity of the data. The important point is that data can seem to have different properties when viewed at different resolutions. Moving to a coarser resolution reduces the level of noise; however, too coarse and we risk losing important details. There is therefore a balance to strike.

24
Q

What are “record data”?

A

Record data is just a set of basic observations (records) that all follow the same set of attributes. There is no explicit relationship among the records.

The records are usually stored in a basic flat file or in a relational database.

25
Q

What is transaction data?

A

Transaction data, also called “market basket data”, is a specific type of record data.

Each record (transaction) involves a set of items, for instance a grocery list.

Transaction data is a collection of sets of items, but it can be viewed as a set of records whose fields are asymmetric attributes. Most often the attributes are binary, indicating whether an item was purchased or not, but they can also be more general and represent the number of items bought (discrete or continuous, no longer binary).

26
Q

What is the “data matrix”?

A

If all the data objects in a collection have the same numeric attributes, then we can view the entire collection as a matrix of vectors in some n-dimensional space, where each dimension represents one attribute describing the objects.

The data matrix is a type of record data as well, but it has key strengths.

Since all values are numerical, we can perform matrix operations on it to transform and manipulate the data.
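
A minimal sketch, assuming NumPy, of treating record data as a data matrix and applying a matrix operation (here, z-scoring each attribute; the values are made up):

```python
import numpy as np

# 3 data objects (rows), 2 numeric attributes (columns).
X = np.array([[1.0, 200.0],
              [2.0, 180.0],
              [3.0, 220.0]])

# Standardize each attribute: subtract column mean, divide by column std.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(Z.mean(axis=0))  # approximately [0. 0.]
```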

27
Q

What is the “sparse data matrix”?

A

A special case of the data matrix where all attributes are of the same type and are asymmetric. Transaction data is therefore an example of a sparse data matrix.

28
Q

Elaborate on graph-based data

A

There are especially two cases where graph-based data is a useful structure:

1) To capture relationships among data objects

2) The data objects themselves are represented as graphs

The first case is self-explanatory, but the second is more interesting.

The second case is useful when our data objects have some internal structure, perhaps with sub-objects that have relationships among themselves, for instance molecular structures.

29
Q

Elaborate on ordered data

A

Ordered data have attributes that involve some order in either time or space.

There are 4 types:
1) Sequential data (temporal data): an extension of record data where each record carries a time stamp.
2) Sequence data: the order of the values matters, but there are no time stamps, ex genetic code or a document.
3) Time series data: time-stamped series of measurements, ex stock charts.
4) Spatial data: positional attributes matter.

For instance, if we extend transaction data to include a time stamp, it becomes “sequential transaction data”. With such data we can find information in timed events, such as sales before holidays.

Important: the ordered aspect allows us to use the structure of the data to find key information. This is not limited to time: if the observed data has an order, it can be used. For instance, in genetic code the order can be used to establish characteristics.

30
Q

Elaborate briefly on data quality issues

A

Data mining algorithms are usually applied to data that was originally collected for other purposes. Therefore, we cannot rely on fixing quality issues at the source of the data.
Preventing data quality issues is usually not an option, so we focus on how to handle some level of quality problems.

We focus on 2 things regarding data quality issues
1) Detection and correction of data quality problems

2) Using algorithms that tolerate some level of quality problems

31
Q

What are some sources of bad data quality?

A

Human error, measurement errors, flaws in the data collection process.

Sometimes, entire data objects can be missing. Sometimes, we may experience “double counting” resulting in duplicates.

Sometimes there are errors, such as writing 20 instead of 200 and similar cases.

31
Q

What is “measurement error”?

A

Measurement error refers to any error stemming from the measurement process.
The difference between the measured value and the actual value is called the “error”, and it is one kind of measurement error.

31
Q

What is a “data collection error”?

A

A data collection error is any error arising from omitting data objects or attribute values, or from inappropriately including a data object.

32
Q

What is “noise”?

A

Noise is the random component of measurement error.

The term noise is most often used for data with spatial or temporal components.

Furthermore, noise is data that has been distorted somehow.

An example of noise when recording audio is background sound: if it is very windy, it is difficult to make out the actual data of interest.

33
Q

What is an “artifact”?

A

An artifact in data quality refers to a deterministic distortion of the data. For instance, if we take pictures with a camera whose lens has a streak, the streak will affect the images consistently, in a deterministic way.

34
Q

Define precision

A

Precision is defined as the closeness of repeated measurements to one another.
It is typically measured by the standard deviation.

35
Q

Define bias

A

A systematic variation in the measurements from the quantity being measured.

Bias is measured as the difference between the “mean” of the data set and the known value of the quantity being measured.

This means that bias can only be quantified if we know the actual value we’re measuring.

36
Q

Define accuracy

A

The closeness of measurements to the true value of the quantity being measured.

37
Q

What are outliers?

A

Outliers are either:
1) Data objects with characteristics that are different from most of the other data objects

2) Values of an attribute that are unusual for that attribute.

38
Q

Elaborate on the distinction between noise and outliers

A

Outliers are very often legitimate data. In fact, in many cases we are actually only interested in anomaly/outlier detection, such as in fraud detection.

Outliers are legitimate, but in most cases they do not help us reach our goal of finding the underlying pattern. In classification, outliers put us at risk of overfitting.

Noise on the other hand, is the random component of measurements.

39
Q

How can we deal with “missing attribute values”?

A

There are multiple choices, each with their own strengths and weaknesses.

We can eliminate the objects that have missing values. The strength is easy to identify: it makes our job easy.
The weakness is also apparent: we risk losing significant information from the remaining attributes that do have values.
The key point is that deleting data objects should only be considered if the number of objects with missing attributes is very low compared to the total number of data objects.

Another strategy is to estimate the missing values. This is often possible with time values, typically by interpolation.

Also, sometimes we can simply ignore the missing values. For instance, if we are comparing data objects for similarity, we can ignore the missing attribute and compare on the remaining ones.
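
A sketch of the estimation strategy, assuming pandas; the series values are hypothetical time-ordered measurements:

```python
import numpy as np
import pandas as pd

s = pd.Series([20.0, np.nan, 24.0, np.nan, 30.0])
# Linear interpolation fills each gap from its neighbors in time order.
print(s.interpolate().tolist())  # [20.0, 22.0, 24.0, 27.0, 30.0]
```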

40
Q

Name examples of inconsistent values

A

Negative height values.

41
Q

Give a definition of high quality data

A

We can say that data is of high quality if it is suitable for the intended use.

42
Q

What are the 3 main issues related to applications in terms of data quality?

A

Timeliness: some data starts to age very fast.

Relevance: the data must be representative of what we are measuring; this is closely tied to sampling. Say we want data on car accidents: relevant data should at least include the age and gender of the drivers.

Knowledge about the data: the level of documentation accompanying the data. Ideally, we want documentation that explains the data, possible dependencies, correlations, etc.

43
Q

Briefly, what is data preprocessing?

A

Strategies that we perform before data mining with the goal of making the data mining itself more effective.

We are essentially trying to make the data more suitable for data mining.

44
Q

Name the most popular approaches to data preprocessing

A

1) Aggregation
2) Sampling
3) Dimensionality reduction
4) Feature subset selection
5) Feature creation
6) Discretization and binarization
7) Variable transformation

Roughly speaking, all of these preprocessing methods fall into one of the two following categories:
1) Selecting a subset of the data
2) Creating new attributes/variables that are interesting

45
Q

Elaborate on “aggregation” in the context of data preprocessing

A

Aggregation here refers to combining multiple data objects into a single data object,
for instance reducing the full set of per-store sales to “worldwide sales” or “sales in country x”.

Because aggregation reduces the number of objects, it allows for more computationally expensive data mining algorithms.
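
A minimal sketch with pandas (the table and column names are made up): aggregating per-store sales records into one object per country:

```python
import pandas as pd

sales = pd.DataFrame({
    "country": ["NO", "NO", "SE", "SE"],
    "store":   ["Oslo", "Bergen", "Stockholm", "Malmo"],
    "revenue": [120.0, 80.0, 150.0, 90.0],
})

# Four store-level objects become two country-level objects.
by_country = sales.groupby("country", as_index=False)["revenue"].sum()
print(by_country)
```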

46
Q

Elaborate on “sampling” in the context of data preprocessing

A

Sampling is a term used for selecting a subset of a population of values.

NB: The motivation for sampling differs between statistics and data mining. In statistics, sampling is used because obtaining the entire population is practically impossible.
In data mining, we use sampling because we want to reduce the required computational load.

KEY: effective sampling requires that the sample is representative of the population.

47
Q

Elaborate on the different sampling methods

A

SIMPLE RANDOM SAMPLING: Equal probability of selecting any particular object. Can be done with or without replacement.
Simple random sampling fails if we absolutely need to capture different characteristics of different groups of data objects.

Stratified sampling: We draw an equal number of random samples from each group of interest.

Progressive sampling: Increase sample until appropriate size is obtained.
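
A sketch contrasting the first two methods, assuming NumPy; the group sizes are hypothetical. Note that the rare group can vanish from a simple random sample but not from the stratified one:

```python
import numpy as np

rng = np.random.default_rng(0)
labels = np.array(["common"] * 95 + ["rare"] * 5)

# Simple random sampling: every object equally likely; "rare" may be missed.
simple = rng.choice(labels, size=10, replace=False)

# Stratified sampling: draw an equal number from each group of interest.
stratified = [rng.choice(np.where(labels == g)[0], size=5, replace=False)
              for g in ("common", "rare")]

print(np.sum(simple == "rare"))                        # possibly 0
print([labels[i] for idx in stratified for i in idx])  # always 5 of each
```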

48
Q

Name 3 examples of applications of “similarity” and “dissimilarity”

A

They are used in a wide number of data mining applications, such as:
Clustering.
Nearest neighbor classification.
Anomaly detection.

49
Q

Informally, what is “similarity”?

A

Similarity is informally defined as a numerical measure of how much two data objects are alike. We use a scale where higher degree of similarity equals higher number/measure.

It is common to use a scale of [0, 1] to describe similarity, where 0 is nothing alike and 1 is exact copies.

50
Q

Informally, what is “dissimilarity”?

A

Dissimilarity is a numerical measure of how far apart two data objects are; in other words, a measure of how different two objects are. It is common to use the term “distance”.

51
Q

Elaborate on transformations in the context of similarities and dissimilarities

A

Some algorithms and software packages work on a predetermined scale of these measures: some work only with similarities, others only with dissimilarities. Therefore, we need to know some transformation techniques for converting one proximity measure into another.

52
Q

What is the most common transformation regarding similarities etc.?

A

Most common is a transformation that converts a proximity measure onto the [0,1] similarity scale. The reason is that there is great benefit in having similarity expressed on a common, interpretable scale.

Such transformations are easy:

newValue = (oldValue - minValue) / (maxValue - minValue)

For instance, on a 1-to-10 proximity scale: newValue = (oldValue - 1) / (10 - 1).
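
A one-function sketch of this min-max style transformation:

```python
def to_unit_interval(value, lo, hi):
    # Map a proximity value measured on [lo, hi] linearly onto [0, 1].
    return (value - lo) / (hi - lo)

print(to_unit_interval(7, 1, 10))  # 0.666... on a 1-to-10 scale
```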

53
Q

Consider data objects having a single attribute only. Let us say that this attribute is a “nominal attribute”. How does this impact similarity?

How about single ordinal attribute?

how about single interval or ratio attributes?

A

Nominal attributes only convey information about distinctness. Therefore, when trying to capture similarity, we can ONLY say whether two values are equal or not.

Ordinal attributes support ordering, like preferences. Therefore, we can use this to make some statement about the distance between two data objects. Consider a scale like {poor, fair, ok, good, wonderful}. We can map these to numerical values {0, 1, 2, 3, 4}. Then distance(wonderful, fair) = 4 - 1 = 3. We can transform this into [0,1] by dividing by the largest possible distance: 3/4 = 0.75. NB: this gives the distance between them, which is a DISSIMILARITY measure; we take s = 1 - d to get the similarity.

For interval or ratio attributes, we just use simple subtraction (or its absolute value), because for these types the spacing between values is meaningful.
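
A sketch of the ordinal case above, mapping the scale to 0..n-1 and normalizing by n-1:

```python
SCALE = ["poor", "fair", "ok", "good", "wonderful"]
RANK = {value: i for i, value in enumerate(SCALE)}

def ordinal_dissimilarity(a, b):
    # Normalized rank difference, in [0, 1].
    return abs(RANK[a] - RANK[b]) / (len(SCALE) - 1)

d = ordinal_dissimilarity("wonderful", "fair")
print(d, 1 - d)  # dissimilarity 0.75, similarity 0.25
```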

54
Q

What “distances” do we have?

A

We have:
Euclidean.

Minkowski.

Hamming distance.

Supremum distance.

55
Q

Elaborate on EUCLIDEAN distance

A

We consider “n” to be the number of dimensions and “k” to index the k’th attribute. The Euclidean distance is then

d(x, y) = sqrt( ∑[k=1..n] (x_k - y_k)^2 )

So we’re basically just squaring the attribute-wise differences, summing over all attributes, and taking the square root.

56
Q

Elaborate on MINKOWSKI distance

A

The Euclidean distance is actually a special case of the more general “Minkowski distance”, defined like this:

d(x, y) = ( ∑[k=1..n] |x_k - y_k|^r )^(1/r)

The cool thing about the Minkowski distance is that different values of “r” give us different distances. For instance, if r=1, we get the Manhattan distance.
If r=2, we get the Euclidean distance.

As r -> infinity, we get the limit value, the supremum (Chebyshev) distance, which is the maximum difference between any single pair of attribute values.
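
A compact sketch of the three named special cases:

```python
def minkowski(x, y, r):
    # Minkowski distance: (sum_k |x_k - y_k|^r)^(1/r).
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

def supremum(x, y):
    # The r -> infinity limit: the largest attribute-wise difference.
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (0, 0), (3, 4)
print(minkowski(x, y, 1))  # 7.0  Manhattan (r=1)
print(minkowski(x, y, 2))  # 5.0  Euclidean (r=2)
print(supremum(x, y))      # 4    supremum/Chebyshev
```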

57
Q

What are the properties of distances?

A

Positivity, symmetry, triangle inequality.

For similarities, the triangle inequality typically does not hold.

58
Q

What can we say about similarity measures for binary data?

A

Say we have two data objects X and Y that have N binary attributes. We are therefore comparing two binary vectors. When comparing two binary vectors, we get the following FOUR quantities:
f00 = number of attributes where X is 0 and Y is 0
f01 = number of attributes where X is 0 and Y is 1
f10 = number of attributes where X is 1 and Y is 0
f11 = number of attributes where both X and Y are 1

One way to use these is the SMC, the Simple Matching Coefficient:

SMC = (f00 + f11) / (f00 + f01 + f10 + f11) = numberOfMatches / totalNumberOfAttributes

EXAMPLE: we could use this to compare students’ answers when checking for cheating. An SMC of 1 would indicate a higher probability of cheating than an SMC of 0.

59
Q

What is the weakness of SMC (Simple Matching Coefficient)?

A

Consider transactions from a grocery store. These are asymmetric variables, and the number of 0’s far outweighs the number of 1’s. SMC would therefore say that virtually every pair of transactions is similar, which is not useful.

Therefore, when dealing with asymmetric variables, we use the “Jaccard coefficient” instead:
J = numberOfMatchingPresences / (number of attributes not involved in 00 matches) = f11 / (f01 + f10 + f11)
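
A sketch computing both coefficients from the f-counts defined above; the vectors are made-up sparse transactions:

```python
def binary_counts(x, y):
    f00 = sum(a == 0 and b == 0 for a, b in zip(x, y))
    f01 = sum(a == 0 and b == 1 for a, b in zip(x, y))
    f10 = sum(a == 1 and b == 0 for a, b in zip(x, y))
    f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    return f00, f01, f10, f11

def smc(x, y):
    f00, f01, f10, f11 = binary_counts(x, y)
    return (f00 + f11) / (f00 + f01 + f10 + f11)

def jaccard(x, y):
    f00, f01, f10, f11 = binary_counts(x, y)
    return f11 / (f01 + f10 + f11)

x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y = (0, 0, 0, 0, 0, 1, 0, 0, 0, 1)
print(smc(x, y))      # 0.7 -- inflated by shared absences
print(jaccard(x, y))  # 0.0 -- no shared presences
```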

60
Q

Elaborate on document similarity

A

We keep an attribute for every possible word and maintain the frequency of each word. This resembles the Jaccard case, but we are no longer using binary variables. Recall, however, that when comparing documents, the number of 0’s greatly outnumbers the other values. Therefore we need to do what Jaccard does, but with non-binary attribute values.
To do this, we use “cosine similarity”.

Cosine similarity is defined via the scalar (dot) product: cos(x, y) = (x · y) / (||x|| * ||y||).
Cosine maps to a value between -1 and 1 (between 0 and 1 for non-negative frequency vectors), and shared zeros do not contribute. Therefore, it is well suited.
The geometric 2D visualization (the angle between the two vectors) is a great way to get some intuition for the concept.
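
A sketch of cosine similarity on two hypothetical term-frequency vectors:

```python
import math

def cosine(x, y):
    # cos(x, y) = (x . y) / (||x|| * ||y||); shared zeros contribute nothing.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
print(round(cosine(d1, d2), 3))  # 0.315
```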

61
Q

elaborate on correlation

A

Correlation is defined as covariance(x, y) / (std(x) * std(y)).

Correlation is ALWAYS in the range [-1, 1].
A correlation of 0 indicates no linear relationship, but there can still be a non-linear relationship.

A correlation of 1 indicates a perfect positive linear relationship: x_k = a*y_k + b, where a and b are constants and a is positive.
A correlation of -1 indicates a perfect negative linear relationship: x_k = -a*y_k + b, with a positive.

Don’t make it more complicated than it is: the constant a, in the case of perfect correlation, is simply the slope of the exact linear relationship between the two sets of values.
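
A sketch computing correlation straight from the definition, assuming NumPy, and checking the perfect-linear cases:

```python
import numpy as np

def pearson(x, y):
    # covariance(x, y) / (std(x) * std(y)), using population statistics.
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = ((x - x.mean()) * (y - y.mean())).mean()
    return cov / (x.std() * y.std())

x = np.array([1.0, 2.0, 3.0, 4.0])
print(pearson(x, 2 * x + 1))   #  1.0  perfect positive linear relationship
print(pearson(x, -2 * x + 1))  # -1.0  perfect negative linear relationship
```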

62
Q

What kind of data quality problems are we considering?

A

Noise and outliers.

Wrong data.

Fake data.

Missing data.

Duplicate data.

63
Q

What is a typical issue when merging data from different heterogeneous sources?

A

The risk of duplicate values.

64
Q

What is the curse of dimensionality?

A

Bellman: The number of samples required to estimate an arbitrary function with a given level of accuracy grows exponentially with the number of variables, that is, with the dimensionality of the function.

65
Q

Regarding “Feature selection”, what alternatives do we have?

A

We have 3:

1) Embedded approaches: the algorithm itself decides which attributes to use and which to ignore.
2) Filter approaches: features are selected independently of the algorithm.
3) Wrapper approaches: the data mining algorithm is used as a black box to evaluate candidate feature subsets, without enumerating all possible subsets.

66
Q

What is discretization?

A

Discretization is about converting a continuous attribute into a categorical attribute.

67
Q

There are 3 types of attribute characterizations..?

A

Based on measurements

Based on number of values (discrete vs. continuous, with binary as a special case of discrete).

Based on importance.

68
Q

What is a typical mistake when using an interval attribute? Give an example

A

The typical mistake is to assume that “since the differences make sense, the ratios also make sense”. For instance, temperature (in Celsius) is an interval attribute. It makes sense to say that if it is 20 degrees today and 10 degrees tomorrow, the difference is 10 degrees. HOWEVER, it makes no sense to say that 20/10 = 2, so it is twice as hot today compared to tomorrow.

69
Q

if we are given a nominal attribute, what kind of operations can we do with it?

A

Mode
Entropy
Contingency correlation
Chi-squared test

70
Q

if we are given an ordinal attribute, what kind of operations can we do with it?

A

Median
Percentiles
Rank correlation

71
Q

if we are given an interval attribute, what kind of operations can we do with it?

A

Mean
Standard deviation

72
Q

if we are given a ratio attribute, what kind of operations can we do with it?

A

Geometric mean
Harmonic mean

73
Q

When we categorize data attributes by importance, what are we talking about?

A

We’re talking about the preference for certain outcomes.

For instance, say we have a binary attribute. If there is no preference regarding which value is coded 0 and which is coded 1, we say we have a symmetric attribute: both outcomes are equally important.
If, for instance, we have market basket transactions, we are dealing with asymmetric attributes.

74
Q

There are 3 data characteristics; what are they?

A

Dimensionality: the number of attributes a particular object has. Curse of dimensionality: the difficulties associated with analyzing high-dimensional data. It can be mitigated with dimensionality reduction techniques.

Sparsity: when the number of non-zero values is significantly smaller than the number of zero values. This is an advantage, since only the non-zero values need to be stored and processed.

Resolution: the level at which we make measurements. For instance, do we measure stock prices daily, weekly, etc.? The important thing is that patterns appear and disappear at different granularities.

75
Q

What types of data sets do we have?

A

Record based: Data matrix, document data, transaction data.

Graph based: WWW, molecular structures

Ordered: Spatial, temporal, sequential

76
Q

elaborate on document data

A

Each document becomes a term vector.

Each term is an attribute of the vector, and its value is typically the number of times the term occurs in the document.

77
Q

What is the key about record based data?

A

No interesting relationship between individual records

78
Q

What happens if there are relationships between data instances?

A

A graph structure emerges. The relationships are the edges.

79
Q

There are two types of graph data

A

Objects with relationships between them.

Actual graphs, like molecular structures.

80
Q

name some data quality issues

A

noise and outliers
wrong data
fake data
missing values
duplicate data

81
Q

What is noise?

A

The random component of measurement error.

Typically discussed for data with spatial or temporal components.

Robust algorithms: algorithms that can handle noise.

82
Q

What are robust algorithms?

A

Robust algorithms are those that can handle a lot of noise.

83
Q

How do we deal with missing values?

A

We can eliminate the entire object.

We can try to estimate the missing value.

We can ignore it (this only works in some cases).

84
Q

When is duplicate data a problem?

A

Typically when we pull data from several heterogeneous data sources. It can be difficult to know whether there are duplicates.

85
Q

What is data preprocessing?

A

Techniques for improving the data before we feed it to the mining methods.

The improvement can be in time, cost, or quality.

Typical steps include:
1) Aggregation
2) Sampling
3) Discretization
4) Binarization
5) Attribute transformation
6) Dimensionality reduction
7) Feature subset selection
8) Feature creation

86
Q

What is aggregation?

A

Combining either multiple attributes or multiple objects to create a single attribute or a single object.

Ex: cities can be aggregated into regions, countries, etc.

87
Q

What is sampling?

A

A preprocessing step.

Sampling is the main technique for data reduction.

It is very useful because obtaining the entire population is in many cases impossible.

Sampling can also be used when the entire set is too expensive to work with.

IMPORTANT: the sample must be representative. It is representative if it has the same properties as the entire population.

3 methods:
1) Simple random sampling
2) Stratified sampling
3) Progressive sampling

88
Q

elaborate on simple random sampling

A

Equal probability of selecting any particular item.

Sampling like this can be done with or without replacement.

89
Q

elaborate on stratified sampling

A

Simple random sampling struggles when we have “rare” groups that are at risk of not being properly represented in the sample. To combat this, we can use stratified sampling.

In stratified sampling, an equal number of objects is drawn from each group. Alternatively, we can draw a number from each group that is proportional to the size of the group.

90
Q

What is curse of dimensionality?

A

As the number of dimensions grows, the data becomes increasingly sparse in the space it occupies; estimating an arbitrary function to a given accuracy then requires exponentially more samples (see Bellman’s definition earlier in this deck).
91
Q

name methods of dimensionality reduction

A

PCA: Principal Component Analysis

Singular value decomposition.

92
Q

Elaborate on feature selection

A

We use domain knowledge to identify the features we need and the features we don’t need. This allows us to cut some dimensions, which simplifies our analysis.

93
Q

what is discretization?

A

Transform a continuous attribute into a categorical attribute.

94
Q

What is binarization?

A

Transform either continuous or categorical (discrete) attributes into binary attributes.

95
Q

elaborate on types of attributes and why it is crucial to know our types

A

We consider 4 types of attributes:
1) Nominal
2) Ordinal
3) Interval
4) Ratio

1) it makes sense to check for distinctness
2) it makes sense to check for ordering
3) it makes sense to check for differences
4) it makes sense to check for ratios (e.g. twice as heavy, etc.)

The reason it is crucial to know the types we’re working with is so that we don’t do something foolish, like computing the average of IDs or similar.
IDs can only be compared for equality or inequality, nothing else. Therefore ID is nominal.

96
Q

What is the relationship between the types of attributes?

A

Any operation we can do on nominal attributes, we can also do on the others. Any operation we can do on ordinal attributes, we can do on intervals and ratio as well. etc.

97
Q

Opposite of categorical attribute

A

quantitative/numerical

98
Q

elaborate on describing attributes by the number of values

A

Number of values refers to whether the values are discrete or continuous. Whether the value set is finite also matters. Binary attributes come into play here as a special case of discrete.

99
Q

What data sets do we have?

A

We consider 3.
1) Record based
2) Graph based
3) Ordered data

100
Q

how does transaction data differ from regular record data?

A

The set of items varies from record to record; we typically don’t list the attributes out explicitly.
Ex: {TID: Bread, Milk, Eggs}
However, if we want to use a binary representation, we can list them all.

101
Q

What is the sparse data matrix?

A

A special case of the data matrix where the attributes are of the same type and are asymmetric.

A common example of a sparse data matrix is document data. In that case we take the “bag of words” approach and omit the order of the words.
The outcome is a matrix with words/terms as columns and documents as rows. Each entry then holds the number of times a term occurs in a document.

102
Q

What is ordered data?

A

Ordered data is data that is almost completely defined by its spatial or temporal component, for instance stock price data.

Sequential data is a type of ordered data, regarded as an extension of record data: it is transaction data with a time stamp included.

Time series data is a type of ordered data; stock prices are an example.

Sequence data is a type of ordered data, rather like sequential data but without time stamps; there is just a strict order. Genetic information is an example.

103
Q

What is important regarding temporal data?

A

Temporal autocorrelation. Basically, today’s stock price is likely to be similar to yesterday’s because the two measurements are close in time.

104
Q

What is important regarding spatial data?

A

Spatial autocorrelation. Objects that are physically close tend to be similar in other ways as well.

105
Q

Name a type of data set that carry both spatial and temporal information

A

Planes, cars, anything that moves basically.

106
Q

Difference between dimensionality reduction and feature subset selection/feature selection?

A

Dimensionality reduction involve combining dimensions to reduce the total number of dimensions.

Feature subset selection is about simply choosing a subset of the attributes/dimensions to work with, thereby eliminating dimensions.

107
Q

What is curse of dimensionality?

A

As the number of dimensions increases, the data becomes increasingly sparse in the space it occupies.

This means that if we want to increase the number of dimensions, we need vastly more data.

The outcome is similar for all types of data mining algorithms: we typically get poor results on high-dimensional data.

108
Q

Which technique for dimensionality reduction is particularly suited for continuous data…?

A

PCA - Principal Component Analysis

109
Q

If we have nominal attribute, how is similarity typically defined?

A

We assign the value 1 if they match and 0 if they don’t.

If we measure dissimilarity instead, we assign 0 if they are the same and 1 if they are not.

NB: this is only for nominal attributes. The case is more involved for ordinal attributes.

110
Q

if we have an ordinal attribute, how is similarity typically defined?

A

We map the values in the domain to successive integers. We can also apply a transformation so that the final result lies between 0 and 1.

111
Q

measures that satisfy the properties of positivity, symmetry, and the triangle inequality are called…

A

metrics. We usually use the term distance when we’re talking about metrics.

112
Q

similarity measures between objects that are binary are called…?

A

Similarity coefficients. They typically take values between 0 and 1.

One example of a similarity coefficient is the SMC, the simple matching coefficient.

113
Q

Elaborate on the SMC

A

SMC - Simple Matching Coefficient - is equal to [the number of matching attribute values] / [the number of attributes].

Thus, if all values match, we get 1. If all are different, we get 0.

114
Q

Elaborate on Jaccard

A

The Jaccard coefficient is another similarity coefficient.

The Jaccard coefficient can be viewed as a fix for the weak spot of SMC. Consider what happens if we use SMC on asymmetric transaction data: it would say that essentially all data objects are similar, purely as a result of the asymmetry.

Jaccard: J = [number of matching presences] / [number of attributes not involved in 00 matches]

115
Q

When should we immediately think of Jaccard?

A

Jaccard coefficient should be used whenever we have asymmetric data objects and want to measure similarity.

116
Q

What is special about similarity of document data?

A

Document data, like transaction data, tends to be very asymmetric. However, document data holds non-binary values. Therefore, we cannot use the Jaccard coefficient.

So we need something that can ignore the 00 matches while handling non-binary attribute values. The solution is cosine similarity.

117
Q

Define Pearson’s correlation

A

covariance(x, y) / (std(x) * std(y))

118
Q

how do we handle the curse of dimensionality?

A

We do dimensionality reduction. Principal component analysis is the way to go here; we can also use singular value decomposition. Both principal component analysis and singular value decomposition are linear algebra techniques.

We can also handle the curse by feature subset selection.

119
Q

What approaches do we have to feature selection?

A

Embedded: the algorithm itself determines which attributes to use and which to ignore.

Filter approaches: the features are selected before (and independently of) the algorithm being run.

Wrapper approaches: like the “ideal” approach of enumerating all possible feature subsets and checking the performance of each, but without brute-forcing all possible enumerations.

120
Q

What happens when we let the Minkowski distance run towards infinity?

A

We get the Chebyshev (supremum) distance, which is max over k of |x_k - y_k|.
