Chapter 2 - Data Flashcards
Elaborate on the “type of data”
What is quantitive data and what is qualitative data?
Quantitative data can be measured numerically; qualitative data cannot.
We find quantitative data in the shape of questions like "how many…" and "how often…".
Quantitative data usually supports aggregation methods for computing results, which makes it very suitable for statistics.
Strengths of quantitative data: it is objective and concise.
Weakness of quantitative data: it lacks context. People can pull numbers from anywhere and show graphs, but it can be difficult to understand the actual context.
Qualitative data is descriptive and concerns itself with questions like "why" and "how". It is typically gathered using interviews, focus groups, and observations. Qualitative data is about trying to understand nuances in human behavior and to understand concepts.
Analysis methods include pattern recognition and theme identification.
Strengths include the data being very rich and detailed. It provides details that we can never get from quantitative data. This can help us identify interconnectedness between aspects of the data.
Weakness: it is difficult to gather and to analyze.
For me, this means that different types of data serve different purposes.
ANOTHER WORD for quantitative data is "numerical".
ANOTHER WORD for qualitative data is "categorical".
Name some common dangers of data quality
Outliers and noise.
Missing, inconsistent or duplicate data.
Data that is biased.
Examples:
Asking only people who own solar panels questions meant to answer how the general population views solar panels.
What could be done in the “preprocessing” step of data mining?
Convert continuous attributes into discrete ones. For instance, length sometimes needs to be converted to "short", "medium" or "long" in order to apply some technique. NB: this does not refer to the storage size of a variable, but to a division made on the basis of the actual length value.
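A minimal sketch of such discretization in Python, assuming pandas; the column name and bin edges are made-up illustrations, not from the source:

```python
import pandas as pd

# Hypothetical length measurements in centimeters.
df = pd.DataFrame({"length_cm": [3.2, 14.8, 27.5, 8.1, 41.0]})

# Discretize the continuous attribute into three ordinal categories.
# The bin edges (10 and 25) are arbitrary choices for illustration.
df["length_class"] = pd.cut(
    df["length_cm"],
    bins=[0, 10, 25, float("inf")],
    labels=["short", "medium", "long"],
)
print(df)
```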
It is also common to reduce the number of attributes in a data set. The reason is that many techniques are more effective when run on fewer attributes.
What are the main issues regarding data?
There are four which we consider:
1) Types of data
2) Data quality
3) Data preprocessing
4) Measures of similarity and dissimilarity
Define a “data set”
A data set is a collection of data objects.
Other names for “data objects” are:
- vector
- record
- point
- observation
How are data objects described?
Data objects are described by a number of attributes that define each object. The attributes capture the characteristics of the object. For instance, the "mass" of some physical object.
Attributes are commonly also referred to as:
- variable
- field
- feature
- dimension
What are attributes?
A property or characteristic of a data object that can vary with object or time.
This definition highlights variability. If ALL instances/data objects share the same value for some attribute, then it is not worth storing and analyzing, as it is trivial.
What is a “measurement scale”?
A measurement scale is a rule (function) that associates a numerical or symbolic value with an attribute of an object.
Therefore, loosely speaking, we can describe a measurement scale as determining the domain of values that some attribute can take.
What is the process of “measurement”?
The process of "measurement" is the application of a measurement scale to associate a value with a particular attribute of a specific object.
Ex: we perform a measurement whenever we, for instance, step on a scale and read off our weight.
What is the “type of an attribute”?
The type of an attribute is commonly taken to be the type of its measurement scale. If the values of the measurement scale are integers, say "years old", then the attribute type is integer because the measurement scale type is integer.
What is important regarding the type of attributes?
The properties of an attribute need not be the same as the properties of the values used to measure it. For instance, even though integers can be averaged, it makes no sense to compute the average of the attribute "ID".
In other words: the values we use to represent an attribute can have properties that are NOT properties of the attribute itself, and vice versa.
For us, this means that we should know what the attributes really represent before analyzing the data for correlation etc. It would be embarrassing to announce a correlation between attributes that have nothing in common.
What are the four types of attributes?
Nominal (distinctness)
Ordinal (order)
Interval (subtraction and addition)
Ratio (multiplication and division)
What are nominal attributes?
Nominal attributes are attributes whose values are just names. We can only use nominal attributes for equality checks.
The usual suspects include: Employee-ID number, blood-type, eye color etc.
The point is that nominal attributes are very limited in terms of aggregation. Nominal attributes are in fact qualitative attributes.
However, we can do things like "count distinct blood-type" or "count ID GROUP BY blood-type", as in the sketch below.
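A minimal pandas sketch of such counting; the column names and values are made up for illustration:

```python
import pandas as pd

# Hypothetical nominal attributes: we can compare values for equality
# and count them, but arithmetic on them would be meaningless.
df = pd.DataFrame({
    "employee_id": [101, 102, 103, 104, 105],
    "blood_type": ["A", "O", "A", "B", "O"],
})

# "count distinct blood-type"
print(df["blood_type"].nunique())                      # 3

# "count ID GROUP BY blood-type"
print(df.groupby("blood_type")["employee_id"].count())
```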
What are ordinal attributes?
Ordinal attributes are also qualitative attributes. Ordinal values provide enough information to order objects, but the numbers themselves carry no further meaning, as they are not necessarily evenly spaced.
The key point is that ordinal attributes can be used for ORDERING. Recall preference rankings.
What are interval attributes?
Attributes where the differences between values are actually meaningful. This means that we can now also perform addition and subtraction on these attributes.
What are ratio attributes?
Ratio attributes are attributes whose values additionally support multiplication and division, so ratios of values are meaningful. In other words, all of the "above" applies as well.
Examples include “mass”, “length”, “monetary value”.
Can we represent the four types of attributes as transformations?
Yes, like this:
Nominal attributes admit any one-to-one mapping. This means we can exchange all values for new values IF we follow a strict one-to-one function. For instance, reassigning all employee IDs is fine as long as we use a one-to-one function.
Ordinal attributes admit transformations that are ORDER PRESERVING. This means we can change all values as long as the function we use is monotonic.
Interval attributes must preserve the fact that differences in values matter. We achieve this with a linear function f(x) = ax + b.
Ratio attributes admit transformations of the form new_value = a * old_value.
IMPORTANT: the meaning behind all this is that each of the four attribute types can be defined by the transformations that leave its meaning exactly the same before and after.
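A small Python illustration of permissible transformations for interval and ratio attributes; the example values are made up:

```python
# Interval attribute: temperature in Celsius. A linear map f(x) = a*x + b
# preserves the meaning of differences; Celsius -> Fahrenheit is the classic case.
celsius = [0.0, 10.0, 20.0]
fahrenheit = [1.8 * c + 32.0 for c in celsius]   # a = 1.8, b = 32

# Ratio attribute: length in meters. Only scalings new = a * old are allowed,
# because the zero point is meaningful; meters -> feet is such a scaling.
meters = [1.0, 2.5, 4.0]
feet = [3.28084 * m for m in meters]

# Ratios are preserved for the ratio attribute...
assert abs(feet[1] / feet[0] - meters[1] / meters[0]) < 1e-9
# ...but not for the interval attribute: 20 C is not "twice as hot" as 10 C.
print(fahrenheit[2] / fahrenheit[1], celsius[2] / celsius[1])
```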
Can binary attributes be nominal?
Yes. We don't restrict ourselves to numbers. Binary attributes can take on values like {male, female}. Although, we can definitely model this using 1 and 0 anyway.
What are asymmetric attributes?
Asymmetric attributes are attributes where the only values of interest are those that are non-zero.
Asymmetry is all about the likelihood of values. It all makes sense when placed in the context of measuring similarity or dissimilarity. If the attribute is such that we only consider similarity of the values that are not 0, then the attribute is asymmetric. For instance, the grocery case: since we would NEVER say that two grocery bags are similar just because they both lack a lot of the same items, these attributes are asymmetric.
However, if the attribute is such that the values 1 and 0 both occur relatively often, then the attribute is symmetric. For instance, gender is a symmetric attribute.
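A minimal sketch contrasting a symmetric measure (simple matching) with an asymmetric one (Jaccard), using made-up binary basket vectors:

```python
# Two shoppers' baskets over five items: 1 = bought, 0 = not bought.
x = [1, 0, 0, 0, 1]
y = [1, 0, 0, 0, 0]

matches_11 = sum(a == b == 1 for a, b in zip(x, y))
matches_00 = sum(a == b == 0 for a, b in zip(x, y))
n = len(x)

# Simple matching coefficient counts 0-0 matches as evidence of similarity.
smc = (matches_11 + matches_00) / n        # 0.8

# The Jaccard coefficient ignores 0-0 matches, which suits asymmetric
# attributes: baskets are not similar merely because both lack most items.
jaccard = matches_11 / (n - matches_00)    # 0.5
print(smc, jaccard)
```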
Do the four types of data attributes include all possible types of attributes?
No, there are many others. But they are a very good starting point.
We focus on 3 types of data sets. There are of course other sets as well. But, name and briefly explain the 3 we focus on
Record data: collections of records that all share the same fixed set of attributes.
Graph-based data: the graph either captures relationships among data objects, or the data objects themselves are graphs.
Ordered data: the attributes involve an order in time or space.
Elaborate on the general characteristics of data sets
We focus on 3 characteristics of data sets:
1) dimensionality
2) Distribution
3) Resolution
These 3 characteristics apply to very many data sets and usually have considerable impact.
Dimensionality is the number of attributes.
Increasing the number of dimensions makes everything more difficult (the curse of dimensionality: difficulty grows exponentially with dimensionality). Therefore, we often put a lot of work into dimensionality reduction.
Distribution refers to the frequency of values or sets of values for the attributes comprising data objects.
There are different types of distributions. We have normal (gaussian) etc.
However, many data sets have distributions of data that do not follow known statistical distributions.
Therefore, many data mining algorithms do not assume anything regarding the statistical distribution of the data they analyze.
Resolution refers to the granularity of data. The important point about resolution is that the data can seem to have different properties when viewed at different resolutions. We also reduce the level of noise by moving to a coarser level of resolution. However, if the resolution is too coarse we risk losing important details. Therefore, there is a trade-off to strike here.
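A small sketch of changing resolution by resampling with pandas; the hourly series is synthetic and the daily granularity is an arbitrary choice:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly measurements: a slow pattern plus random noise.
idx = pd.date_range("2024-01-01", periods=24 * 14, freq="h")
rng = np.random.default_rng(0)
values = np.sin(np.arange(len(idx)) / 24) + rng.normal(0, 0.5, len(idx))
signal = pd.Series(values, index=idx)

# Moving to a coarser resolution (daily means) smooths out much of the
# noise, but too coarse a resolution would hide the pattern itself.
daily = signal.resample("D").mean()
print(daily.head())
```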
What are “record data”?
Record data is just a collection of basic observations (records), all following the same set of attributes. There is no explicit relationship among the records.
The records are usually stored in a basic flat file or in a relational database.
What is transaction data?
Transaction data, also called “market basket data”, is a specific type of record data.
Each record (transaction) involves a set of items. For instance, a grocery list.
Transaction data is a collection of sets of items, but it can be viewed as a set of records whose fields are asymmetric attributes. Most often, the attributes are binary, indicating whether some item was purchased or not, but they can of course also be general and represent the number of items bought (discrete or continuous, no longer binary).
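A minimal sketch of turning transactions into binary asymmetric attributes, assuming pandas; the baskets are made up:

```python
import pandas as pd

# Hypothetical transactions: each is a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"milk", "eggs", "butter"},
    {"bread", "butter"},
]

# One binary column per item: 1 = purchased, 0 = not purchased.
items = sorted(set().union(*transactions))
basket = pd.DataFrame(
    [[int(item in t) for item in items] for t in transactions],
    columns=items,
)
print(basket)
```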
What is the “data matrix”?
If all the data objects in a collection have the same numeric attributes, then we can view the entire collection as a matrix of vectors in an n-dimensional space, where each dimension represents one attribute describing the objects.
The data matrix is a type of record data as well, but it has key strengths.
Since all values are numerical, we can perform matrix operations on it to transform and manipulate the data.
What is the “sparse data matrix”?
A special case of the data matrix where all attributes are of the same type and are asymmetric, so only the non-zero values need to be stored. Transaction data is therefore an example of a sparse data matrix.
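A small sketch of sparse storage, assuming scipy (the binary matrix is made up; any sparse-matrix library would do):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Dense binary basket matrix: rows = transactions, columns = items.
dense = np.array([
    [1, 0, 1, 0, 1],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
])

# CSR format stores only the non-zero entries, which is exactly what we
# care about for asymmetric attributes such as purchases.
sparse = csr_matrix(dense)
print(sparse.nnz, "non-zeros out of", dense.size, "cells")
```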
Elaborate on graph-based data
There are especially two cases where graph-based data is a useful structure:
1) To capture relationships among data objects
2) The data objects themselves are represented as graphs
The first case is self-explanatory, but the second is more interesting.
The second case is useful when our data objects have some sort of internal structure, perhaps with sub-objects that have relationships among themselves.
Elaborate on ordered data
Ordered data have attributes that involve some order in either time or space.
There are 4 types:
1) Sequential, timing matters, but no specific time necessary.
2) Sequence, the order of values matters, ex genetic code or document.
3) Time series data, time stamped data, ex stock charts.
4) Spatial, positional attributes
For instance, if we extend transaction data to include a time-stamp, it becomes “sequential transaction data”. When we use such data, we can find information in timed events, such as sales before holidays etc.
Important: the ordered aspect allows us to use the structure of the data to find key information. This is not limited to just time. If the observed data has an order, it can be used. For instance, in genetic code the order can be used to establish characteristics.
Elaborate briefly on data quality issues
Data mining algorithms are usually applied on data that was originally intended for other purposes. Therefore, we cannot rely on fixing quality issues at the source of the data.
Preventing data quality issues is usually not an option. Therefore, we focus on how we can handle a given level of quality issues.
We focus on 2 things regarding data quality issues
1) Detection and correction of data quality problems
2) Using algorithms that tolerate some level of quality problems
What are some sources of bad data quality?
Human error, measurement errors, flaws in the data collection process.
Sometimes, entire data objects can be missing. Sometimes, we may experience “double counting” resulting in duplicates.
Sometimes there are errors, such as writing 20 instead of 200 and similar cases.
What is “measurement error”?
Measurement error refers to any error stemming from the measurement process.
The difference between the actual value and the measured value is called the "error", and is one kind of measurement error.
What is a “data collection error”?
A data collection error is any error of the kind "omitting data objects or attribute values, or inappropriately including a data object".
What is “noise”?
Noise is the random component of a measurement error.
The term noise is often used for data that has spatial and/or temporal components.
Furthermore, noise refers to data that has been distorted somehow.
An example of noise when recording audio is background noise: if it is very windy, it is difficult to make out the actual signal of interest.
What is an “artifact”?
An artifact in data quality refers to a deterministic distortion of the data. For instance, if we take pictures with a camera whose lens has a streak, the streak will affect every image in the same, deterministic way.
Define precision
Precision is defined as the closeness in repeated measurements to one another.
It is typically measured by the standard deviation of the measurements.
Define bias
A systematic variation in the measurements from the quantity being measured.
Bias is measured as the difference between the “mean” of the data set and the known value of the quantity being measured.
This means that bias can only be quantified if we know the actual value we’re measuring.
Define accuracy
The closeness of measurements to the true value of the quantity being measured.
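A small numeric sketch of these three notions, assuming numpy and a known true value; all numbers are made up:

```python
import numpy as np

true_value = 10.0
# Hypothetical repeated measurements of the same quantity.
measurements = np.array([10.2, 10.1, 10.3, 10.2, 10.2])

precision = measurements.std()           # closeness of measurements to one another
bias = measurements.mean() - true_value  # systematic offset; requires the true value
print(f"precision (std): {precision:.3f}, bias: {bias:.3f}")
# Accuracy reflects both: low bias and high precision together give
# measurements close to the true value.
```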
What are outliers?
Outliers are either one of:
1) Data objects with characteristics that are different from most of the other data objects
2) Values of an attribute that are unusual for that attribute.
Elaborate on the distinction between noise and outliers
Outliers are very often legitimate data. In fact, in many cases we are actually only interested in anomaly/outlier detection, such as in fraud detection.
Outliers are legitimate, but in most cases they do not help us reach our goal of finding the underlying pattern. In classification, we risk overfitting to outliers.
Noise on the other hand, is the random component of measurements.
How can we deal with “missing attribute values”?
There are multiple choices, each with their own strengths and weaknesses.
We can eliminate the objects that have missing values. The strength is easy to identify: it makes our job easy.
The weakness is also very apparent: we risk losing significant values on the remaining attributes that actually have values.
The key is that this strategy of deleting data objects should only be considered if the number of data objects with missing attributes is very low compared to the total number of data objects.
Another strategy is to estimate the missing values. This is often possible with time-ordered values, typically by interpolation.
Also, sometimes we can simply ignore the missing values. For instance, if we are comparing data objects for similarity, we can ignore the missing attribute and compare on the remaining ones. A sketch of these options follows below.
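A minimal pandas sketch of the three strategies; the column names and values are made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "time": [0, 1, 2, 3, 4],
    "temp": [20.0, np.nan, 22.0, np.nan, 24.0],
})

# 1) Eliminate objects with missing values.
dropped = df.dropna()

# 2) Estimate the missing values, e.g. by linear interpolation.
estimated = df.assign(temp=df["temp"].interpolate())

# 3) Ignore missing values during analysis, e.g. a mean that skips NaN.
mean_ignoring_missing = df["temp"].mean()   # skipna=True by default
print(dropped, estimated, mean_ignoring_missing, sep="\n")
```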
Name examples of inconsistent values
Negative height values.
Give a definition of high quality data
We can say that data is of high quality if it is suitable for the intended use.
What are the 3 main issues related to applications in terms of data quality?
Timeliness: Some data start to age very fast.
Relevance: The data must be representative of what we are measuring. Consider sampling: if we want data on car accidents, relevant data should at least include the age and gender of the drivers.
Knowledge about the data: This is about the level of documentation that accompanies the data. Ideally, we want documentation that explains the data, possible dependencies and correlations, etc.
Briefly, what is data preprocessing?
Strategies that we perform before data mining with the goal of making the data mining itself more effective.
We are essentially trying to make the data more suitable for data mining.
Name the most popular approaches to data preprocessing
1) Aggregation
2) Sampling
3) Dimensionality reduction
4) Feature subset selection
5) Feature creation
6) Discretization and binarization
7) Variable transformation
Roughly speaking, all of these preprocessing methods fall into one of two categories:
1) Selecting a subset of the data
2) Creating new attributes/variables that are interesting
Elaborate on “aggregation” in the context of data preprocessing
Aggregation here refers to combining data objects into a single data object.
For instance, reducing the full set of per-store sales records to "worldwide sales" or "sales in country X".
Aggregation reduces the number of data objects, which allows for more expensive data mining algorithms.
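A minimal pandas sketch of this kind of aggregation; the sales table is made up:

```python
import pandas as pd

sales = pd.DataFrame({
    "country": ["NO", "NO", "SE", "SE"],
    "store":   ["Oslo", "Bergen", "Stockholm", "Malmo"],
    "amount":  [120.0, 80.0, 150.0, 90.0],
})

# Replace many per-store data objects with one object per country.
per_country = sales.groupby("country", as_index=False)["amount"].sum()
print(per_country)
```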
Elaborate on “sampling” in the context of data preprocessing
Sampling is a term used for selecting a subset of a population of values.
NB: The motivation for sampling differs between statistics and data mining. In statistics, sampling is used because it is practically impossible to obtain the entire population.
In data mining, we use sampling because we want to reduce the computational load required.
KEY: Effective sampling requires that the sample is representative of the population.
Elaborate on the different sampling methods
SIMPLE RANDOM SAMPLING: Equal probability of selecting any particular object. Can be done with or without replacement.
Simple random sampling fails if we absolutely need to capture different characteristics of different groups of data objects: objects from a rare group may not appear in the sample at all.
Stratified sampling: we draw random samples from each group of interest, either an equal number per group or a number proportional to each group's size.
Progressive sampling: start with a small sample and increase the sample size until a sample of appropriate size is obtained. A sketch of simple random and stratified sampling follows below.
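A minimal pandas sketch of simple random versus stratified sampling; the data and group sizes are made up:

```python
import pandas as pd

# 90 objects in group A, only 10 in the rare group B.
df = pd.DataFrame({
    "group": ["A"] * 90 + ["B"] * 10,
    "value": range(100),
})

# Simple random sampling: every object has equal selection probability,
# so the rare group B may be missed entirely in a small sample.
simple = df.sample(n=10, random_state=0)

# Stratified sampling: draw a fixed number from each group, so the rare
# group is guaranteed to be represented.
stratified = df.groupby("group").sample(n=5, random_state=0)

print(simple["group"].value_counts())
print(stratified["group"].value_counts())
```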