Chapter 2 - Data Flashcards
Elaborate on the “type of data”
What is quantitive data and what is qualitative data?
Quantitative data can be measured numerically; qualitative data cannot.
We find quantitative data in the shape of questions like "how many…" and "how often…".
Quantitative data usually supports aggregation methods for computing results, which makes it very suitable for statistics.
Strengths of quantitative data: it is objective and concise.
Weakness of quantitative data: it lacks context. People can pull numbers from anywhere and show graphs, but it can be difficult to understand the actual context.
Qualitative data is descriptive and concerns itself with questions like "why" and "how". It is typically gathered using interviews, focus groups, and observations. Qualitative data is about trying to understand nuances in human behavior and to understand concepts.
Analysis methods include pattern recognition and theme identification.
Strengths include the data being very rich and detailed. It provides details that we can never get from quantitative data. This can help us identify interconnectedness between aspects of the data.
Weakness: it is difficult to gather and to analyze.
For me, this means that different types of data serve different purposes.
ANOTHER WORD for quantitative data is "numerical".
ANOTHER WORD for qualitative data is "categorical".
Name some common dangers of data quality
Outliers and noise.
Missing, inconsistent or duplicate data.
Data that is biased.
Examples:
Asking only people who own solar panels questions meant to answer how the general population views solar panels.
What could be done in the “preprocessing” step of data mining?
Convert continuous attributes into discrete ones. For instance, length sometimes needs to be converted to "short", "medium" or "long" in order to apply some technique. NB: this does not refer to the storage size of a variable, but to a division made on the basis of the actual length value.
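A minimal sketch of such discretization in Python, assuming pandas; the column name and bin edges are made-up illustrations, not from the source:

```python
import pandas as pd

# Hypothetical length measurements in centimeters.
df = pd.DataFrame({"length_cm": [3.2, 14.8, 27.5, 8.1, 41.0]})

# Discretize the continuous attribute into three ordinal categories.
# The bin edges (10 and 25) are arbitrary choices for illustration.
df["length_class"] = pd.cut(
    df["length_cm"],
    bins=[0, 10, 25, float("inf")],
    labels=["short", "medium", "long"],
)
print(df)
```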
It is also common to reduce the number of attributes in a data set. The reason is that many techniques are more effective when run on fewer attributes.
What are the main issues regarding data?
There are four which we consider:
1) Types of data
2) Data quality
3) Data preprocessing
4) Measures of similarity and dissimilarity
Define a “data set”
A data set is a collection of data objects.
Other names for “data objects” are:
- vector
- record
- point
- observation
How are data objects described?
Data objects are described by a number of attributes that define each object. The attributes capture the characteristics of the object. For instance, the "mass" of some physical object.
Attributes are commonly also referred to as:
- variable
- field
- feature
- dimension
What are attributes?
A property or characteristic of a data object that can vary with object or time.
This definition highlights variability. If ALL instances/data objects share the same value for some attribute, then it is not worth storing and analyzing, as it is trivial.
What is a “measurement scale”?
A measurement scale is a rule (function) that associates a numerical or symbolic value with an attribute of an object.
Therefore, loosely speaking, we can describe a measurement scale as determining the domain of values that some attribute can take.
What is the process of “measurement”?
The process of "measurement" is the application of a measurement scale to associate a value with a particular attribute of a specific object.
Ex: we perform a measurement whenever we, for instance, step on a scale and read off our weight.
What is the “type of an attribute”?
The type of an attribute is commonly taken to be the type of its measurement scale. If the values of the measurement scale are integers, say "years old", then the attribute type is integer because the measurement scale type is integer.
What is important regarding the type of attributes?
The properties of an attribute need not be the same as the properties of the values used to measure it. For instance, even though integers can be averaged, it makes no sense to compute the average of the attribute "ID".
In other words: the values we use to represent an attribute can have properties that are NOT properties of the attribute itself, and vice versa.
For us, this means that we should know what the attributes really represent before analyzing the data for correlation etc. It would be embarrassing to announce a correlation between attributes that have nothing in common.
What are the four types of attributes?
Nominal (distinctness)
Ordinal (order)
Interval (subtraction and addition)
Ratio (multiplication and division)
What are nominal attributes?
Nominal attributes are attributes whose values are just names. We can only use nominal attributes for equality checks.
The usual suspects include: Employee-ID number, blood-type, eye color etc.
The point is that nominal attributes are very limited in terms of aggregation. Nominal attributes are in fact qualitative attributes.
However, we can do things like "count distinct blood-type" or "count ID GROUP BY blood-type", as in the sketch below.
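A minimal pandas sketch of such counting; the column names and values are made up for illustration:

```python
import pandas as pd

# Hypothetical nominal attributes: we can compare values for equality
# and count them, but arithmetic on them would be meaningless.
df = pd.DataFrame({
    "employee_id": [101, 102, 103, 104, 105],
    "blood_type": ["A", "O", "A", "B", "O"],
})

# "count distinct blood-type"
print(df["blood_type"].nunique())                      # 3

# "count ID GROUP BY blood-type"
print(df.groupby("blood_type")["employee_id"].count())
```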
What are ordinal attributes?
Ordinal attributes are also qualitative attributes. Ordinal values provide enough information to order objects, but the numbers themselves carry no further meaning, as they are not necessarily evenly spaced.
The key point is that ordinal attributes can be used for ORDERING. Recall preference rankings.
What are interval attributes?
Attributes where the differences between values are actually meaningful. This means that we can now also perform addition and subtraction on these attributes.
What are ratio attributes?
Ratio attributes are attributes whose values additionally support multiplication and division, so ratios of values are meaningful. In other words, all of the "above" applies as well.
Examples include “mass”, “length”, “monetary value”.
Can we represent the four types of attributes as transformations?
Yes, like this:
Nominal attributes admit any one-to-one mapping. This means we can exchange all values for new values IF we follow a strict one-to-one function. For instance, reassigning all employee IDs is fine as long as we use a one-to-one function.
Ordinal attributes admit transformations that are ORDER PRESERVING. This means we can change all values as long as the function we use is monotonic.
Interval attributes must preserve the fact that differences in values matter. We achieve this with a linear function f(x) = ax + b.
Ratio attributes admit transformations of the form new_value = a * old_value.
IMPORTANT: the meaning behind all this is that each of the four attribute types can be defined by the transformations that leave its meaning exactly the same before and after.
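A small Python illustration of permissible transformations for interval and ratio attributes; the example values are made up:

```python
# Interval attribute: temperature in Celsius. A linear map f(x) = a*x + b
# preserves the meaning of differences; Celsius -> Fahrenheit is the classic case.
celsius = [0.0, 10.0, 20.0]
fahrenheit = [1.8 * c + 32.0 for c in celsius]   # a = 1.8, b = 32

# Ratio attribute: length in meters. Only scalings new = a * old are allowed,
# because the zero point is meaningful; meters -> feet is such a scaling.
meters = [1.0, 2.5, 4.0]
feet = [3.28084 * m for m in meters]

# Ratios are preserved for the ratio attribute...
assert abs(feet[1] / feet[0] - meters[1] / meters[0]) < 1e-9
# ...but not for the interval attribute: 20 C is not "twice as hot" as 10 C.
print(fahrenheit[2] / fahrenheit[1], celsius[2] / celsius[1])
```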
Can binary attributes be nominal?
Yes. We don't restrict ourselves to numbers. Binary attributes can take on values like {male, female}. Although, we can definitely model this using 1 and 0 anyway.
What are asymmetric attributes?
Asymmetric attributes are attributes where the only values of interest are those that are non-zero.
Asymmetry is all about the likelihood of values. It all makes sense when placed in the context of measuring similarity or dissimilarity. If the attribute is such that we only consider similarity of the values that are not 0, then the attribute is asymmetric. For instance, the grocery case: since we would NEVER say that two grocery bags are similar just because they both lack a lot of the same items, these attributes are asymmetric.
However, if the attribute is such that the values 1 and 0 both occur relatively often, then the attribute is symmetric. For instance, gender is a symmetric attribute.
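A minimal sketch contrasting a symmetric measure (simple matching) with an asymmetric one (Jaccard), using made-up binary basket vectors:

```python
# Two shoppers' baskets over five items: 1 = bought, 0 = not bought.
x = [1, 0, 0, 0, 1]
y = [1, 0, 0, 0, 0]

matches_11 = sum(a == b == 1 for a, b in zip(x, y))
matches_00 = sum(a == b == 0 for a, b in zip(x, y))
n = len(x)

# Simple matching coefficient counts 0-0 matches as evidence of similarity.
smc = (matches_11 + matches_00) / n        # 0.8

# The Jaccard coefficient ignores 0-0 matches, which suits asymmetric
# attributes: baskets are not similar merely because both lack most items.
jaccard = matches_11 / (n - matches_00)    # 0.5
print(smc, jaccard)
```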
Do the four types of data attributes include all possible types of attributes?
No, there are many others. But they are a very good starting point.
We focus on 3 types of data sets. There are of course other sets as well. But, name and briefly explain the 3 we focus on
Record data: collections of records that all share the same fixed set of attributes.
Graph-based data: the graph either captures relationships among data objects, or the data objects themselves are graphs.
Ordered data: the attributes involve an order in time or space.
Elaborate on the general characteristics of data sets
We focus on 3 characteristics of data sets:
1) dimensionality
2) Distribution
3) Resolution
These 3 characteristics apply to very many data sets and usually have considerable impact.
Dimensionality is the number of attributes.
Increasing the number of dimensions makes everything more difficult (the curse of dimensionality: difficulty grows exponentially with dimensionality). Therefore, we often put a lot of work into dimensionality reduction.
Distribution refers to the frequency of values or sets of values for the attributes comprising data objects.
There are different types of distributions. We have normal (gaussian) etc.
However, many data sets have distributions of data that do not follow known statistical distributions.
Therefore, many data mining algorithms do not assume anything regarding the statistical distribution of the data they analyze.
Resolution refers to the granularity of data. The important point about resolution is that the data can seem to have different properties when viewed at different resolutions. We also reduce the level of noise by moving to a coarser level of resolution. However, if the resolution is too coarse we risk losing important details. Therefore, there is a trade-off to strike here.
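A small sketch of changing resolution by resampling with pandas; the hourly series is synthetic and the daily granularity is an arbitrary choice:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly measurements: a slow pattern plus random noise.
idx = pd.date_range("2024-01-01", periods=24 * 14, freq="h")
rng = np.random.default_rng(0)
values = np.sin(np.arange(len(idx)) / 24) + rng.normal(0, 0.5, len(idx))
signal = pd.Series(values, index=idx)

# Moving to a coarser resolution (daily means) smooths out much of the
# noise, but too coarse a resolution would hide the pattern itself.
daily = signal.resample("D").mean()
print(daily.head())
```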
What are “record data”?
Record data is just a collection of basic observations (records), all following the same set of attributes. There is no explicit relationship among the records.
The records are usually stored in a basic flat file or in a relational database.
What is transaction data?
Transaction data, also called “market basket data”, is a specific type of record data.
Each record (transaction) involves a set of items. For instance, a grocery list.
Transaction data is a collection of sets of items, but it can be viewed as a set of records whose fields are asymmetric attributes. Most often, the attributes are binary, indicating whether some item was purchased or not, but they can of course also be general and represent the number of items bought (discrete or continuous, no longer binary).
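A minimal sketch of turning transactions into binary asymmetric attributes, assuming pandas; the baskets are made up:

```python
import pandas as pd

# Hypothetical transactions: each is a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"milk", "eggs", "butter"},
    {"bread", "butter"},
]

# One binary column per item: 1 = purchased, 0 = not purchased.
items = sorted(set().union(*transactions))
basket = pd.DataFrame(
    [[int(item in t) for item in items] for t in transactions],
    columns=items,
)
print(basket)
```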
What is the “data matrix”?
If all the data objects in a collection have the same numeric attributes, then we can view the entire collection as a matrix of vectors in an n-dimensional space, where each dimension represents one attribute describing the objects.
The data matrix is a type of record data as well, but it has key strengths.
Since all values are numerical, we can perform matrix operations on it to transform and manipulate the data.
What is the “sparse data matrix”?
A special case of the data matrix where all attributes are of the same type and are asymmetric, so only the non-zero values need to be stored. Transaction data is therefore an example of a sparse data matrix.
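A small sketch of sparse storage, assuming scipy (the binary matrix is made up; any sparse-matrix library would do):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Dense binary basket matrix: rows = transactions, columns = items.
dense = np.array([
    [1, 0, 1, 0, 1],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
])

# CSR format stores only the non-zero entries, which is exactly what we
# care about for asymmetric attributes such as purchases.
sparse = csr_matrix(dense)
print(sparse.nnz, "non-zeros out of", dense.size, "cells")
```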
Elaborate on graph-based data
There are especially two cases where graph-based data is a useful structure:
1) To capture relationships among data objects
2) The data objects themselves are represented as graphs
The first case is self-explanatory, but the second is more interesting.
The second case is useful when our data objects have some sort of internal structure, perhaps with sub-objects that have relationships among themselves.
Elaborate on ordered data
Ordered data have attributes that involve some order in either time or space.
There are 4 types:
1) Sequential, timing matters, but no specific time necessary.
2) Sequence, the order of values matters, ex genetic code or document.
3) Time series data, time stamped data, ex stock charts.
4) Spatial, positional attributes
For instance, if we extend transaction data to include a time-stamp, it becomes “sequential transaction data”. When we use such data, we can find information in timed events, such as sales before holidays etc.
Important: the ordered aspect allows us to use the structure of the data to find key information. This is not limited to just time. If the observed data has an order, it can be used. For instance, in genetic code the order can be used to establish characteristics.
Elaborate briefly on data quality issues
Data mining algorithms are usually applied on data that was originally intended for other purposes. Therefore, we cannot rely on fixing quality issues at the source of the data.
Preventing data quality issues is usually not an option. Therefore, we focus on how we can handle a given level of quality issues.
We focus on 2 things regarding data quality issues
1) Detection and correction of data quality problems
2) Using algorithms that tolerate some level of quality problems
What are some sources of bad data quality?
Human error, measurement errors, flaws in the data collection process.
Sometimes, entire data objects can be missing. Sometimes, we may experience “double counting” resulting in duplicates.
Sometimes there are errors, such as writing 20 instead of 200 and similar cases.
What is “measurement error”?
Measurement error refers to any error stemming from the measurement process.
The difference between the actual value and the measured value is called the "error", and is one kind of measurement error.
What is a “data collection error”?
A data collection error is any error of the kind "omitting data objects or attribute values, or inappropriately including a data object".
What is “noise”?
Noise is the random component of a measurement error.
The term noise is often used for data that has spatial and/or temporal components.
Furthermore, noise refers to data that has been distorted somehow.
An example of noise when recording audio is background noise: if it is very windy, it is difficult to make out the actual signal of interest.
What is an “artifact”?
An artifact in data quality refers to a deterministic distortion of the data. For instance, if we take pictures with a camera whose lens has a streak, the streak will affect every image in the same, deterministic way.
Define precision
Precision is defined as the closeness in repeated measurements to one another.
It is typically measured by the standard deviation of the measurements.
Define bias
A systematic variation in the measurements from the quantity being measured.
Bias is measured as the difference between the “mean” of the data set and the known value of the quantity being measured.
This means that bias can only be quantified if we know the actual value we’re measuring.
Define accuracy
The closeness of measurements to the true value of the quantity being measured.
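A small numeric sketch of these three notions, assuming numpy and a known true value; all numbers are made up:

```python
import numpy as np

true_value = 10.0
# Hypothetical repeated measurements of the same quantity.
measurements = np.array([10.2, 10.1, 10.3, 10.2, 10.2])

precision = measurements.std()           # closeness of measurements to one another
bias = measurements.mean() - true_value  # systematic offset; requires the true value
print(f"precision (std): {precision:.3f}, bias: {bias:.3f}")
# Accuracy reflects both: low bias and high precision together give
# measurements close to the true value.
```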
What are outliers?
Outliers are either one of:
1) Data objects with characteristics that are different from most of the other data objects
2) Values of an attribute that are unusual for that attribute.
Elaborate on the distinction between noise and outliers
Outliers are very often legitimate data. In fact, in many cases we are actually only interested in anomaly/outlier detection, such as in fraud detection.
Outliers are legitimate, but in most cases they do not help us reach our goal of finding the underlying pattern. In classification, we risk overfitting to outliers.
Noise on the other hand, is the random component of measurements.
How can we deal with “missing attribute values”?
There are multiple choices, each with their own strengths and weaknesses.
We can eliminate the objects that have missing values. The strength is easy to identify: it makes our job easy.
The weakness is also very apparent: we risk losing significant values on the remaining attributes that actually have values.
The key is that this strategy of deleting data objects should only be considered if the number of data objects with missing attributes is very low compared to the total number of data objects.
Another strategy is to estimate the missing values. This is often possible with time-ordered values, typically by interpolation.
Also, sometimes we can simply ignore the missing values. For instance, if we are comparing data objects for similarity, we can ignore the missing attribute and compare on the remaining ones. A sketch of these options follows below.
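A minimal pandas sketch of the three strategies; the column names and values are made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "time": [0, 1, 2, 3, 4],
    "temp": [20.0, np.nan, 22.0, np.nan, 24.0],
})

# 1) Eliminate objects with missing values.
dropped = df.dropna()

# 2) Estimate the missing values, e.g. by linear interpolation.
estimated = df.assign(temp=df["temp"].interpolate())

# 3) Ignore missing values during analysis, e.g. a mean that skips NaN.
mean_ignoring_missing = df["temp"].mean()   # skipna=True by default
print(dropped, estimated, mean_ignoring_missing, sep="\n")
```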
Name examples of inconsistent values
Negative height values.
Give a definition of high quality data
We can say that data is of high quality if it is suitable for the intended use.
What are the 3 main issues related to applications in terms of data quality?
Timeliness: Some data start to age very fast.
Relevance: The data must be representative of what we are measuring. Consider sampling: if we want data on car accidents, relevant data should at least include the age and gender of the drivers.
Knowledge about the data: This is about the level of documentation that accompanies the data. Ideally, we want documentation that explains the data, possible dependencies and correlations, etc.
Briefly, what is data preprocessing?
Strategies that we perform before data mining with the goal of making the data mining itself more effective.
We are essentially trying to make the data more suitable for data mining.
Name the most popular approaches to data preprocessing
1) Aggregation
2) Sampling
3) Dimensionality reduction
4) Feature subset selection
5) Feature creation
6) Discretization and binarization
7) Variable transformation
Roughly speaking, all of these preprocessing methods fall into one of two categories:
1) Selecting a subset of the data
2) Creating new attributes/variables that are interesting
Elaborate on “aggregation” in the context of data preprocessing
Aggregation here refers to combining data objects into a single data object.
For instance, reducing the full set of per-store sales records to "worldwide sales" or "sales in country X".
Aggregation reduces the number of data objects, which allows for more expensive data mining algorithms.
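A minimal pandas sketch of this kind of aggregation; the sales table is made up:

```python
import pandas as pd

sales = pd.DataFrame({
    "country": ["NO", "NO", "SE", "SE"],
    "store":   ["Oslo", "Bergen", "Stockholm", "Malmo"],
    "amount":  [120.0, 80.0, 150.0, 90.0],
})

# Replace many per-store data objects with one object per country.
per_country = sales.groupby("country", as_index=False)["amount"].sum()
print(per_country)
```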
Elaborate on “sampling” in the context of data preprocessing
Sampling is a term used for selecting a subset of a population of values.
NB: The motivation for sampling differs between statistics and data mining. In statistics, sampling is used because it is practically impossible to obtain the entire population.
In data mining, we use sampling because we want to reduce the computational load required.
KEY: Effective sampling requires that the sample is representative of the population.
Elaborate on the different sampling methods
SIMPLE RANDOM SAMPLING: Equal probability of selecting any particular object. Can be done with or without replacement.
Simple random sampling fails if we absolutely need to capture different characteristics of different groups of data objects: objects from a rare group may not appear in the sample at all.
Stratified sampling: we draw random samples from each group of interest, either an equal number per group or a number proportional to each group's size.
Progressive sampling: start with a small sample and increase the sample size until a sample of appropriate size is obtained. A sketch of simple random and stratified sampling follows below.
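A minimal pandas sketch of simple random versus stratified sampling; the data and group sizes are made up:

```python
import pandas as pd

# 90 objects in group A, only 10 in the rare group B.
df = pd.DataFrame({
    "group": ["A"] * 90 + ["B"] * 10,
    "value": range(100),
})

# Simple random sampling: every object has equal selection probability,
# so the rare group B may be missed entirely in a small sample.
simple = df.sample(n=10, random_state=0)

# Stratified sampling: draw a fixed number from each group, so the rare
# group is guaranteed to be represented.
stratified = df.groupby("group").sample(n=5, random_state=0)

print(simple["group"].value_counts())
print(stratified["group"].value_counts())
```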