03_data and features Flashcards
What is data?
-info output by a sensing device or organ
-includes both useful and irrelevant or redundant info
-must be processed to be meaningful
-info in digital form that can be transmitted or processed
-factual information (eg measurements or statistics)
-used as a basis for reasoning, discussion or calculation
What is the pipeline/process associated with data? (3 steps)
1) data acquisition
2) data storage (used to be a bottleneck)
3) data analysis
Is all existing data technically accessible for analysis?
No, most of it is privately owned
What are two types of data?
1) structured data
2) unstructured data
What is structured data?
preprocessed and formatted data that is easily queryable
eg quantitative data in a table
most data analysis techniques require data to be available in a structured form for easier processing
How is structured data represented?
Always in a database schema
(eg a table in 2 dimensions)
What is unstructured data?
Unprocessed and unformatted data that is not easily queryable
eg qualitative data, textual data, image data, data stream, audio data, video data (with increasing data complexity)
What is quantitative data?
can be measured,
distances can be defined
What are two kinds of quantitative data?
1) continuous data
2) discrete data
What is continuous data?
real-valued numbers;
potentially within a given range
eg
- temperatures
- a person’s height
- prices
What is discrete data?
discrete numbers;
whole numbers or real numbers;
potentially within a given range
eg
- number of people in a room
- inventory counts
What is qualitative (categorical) data?
cannot be measured,
distances not defined
What are two types of qualitative data?
1) nominal data
2) ordinal data
What is nominal data?
Labels for different categories
without ordering
eg
- color of hair
- names of persons
- types of fruit
What is ordinal data?
Labels for different categories
following an inherent ranking scheme
eg
- rank in a competition
- grades
- day of the week
What is feature engineering?
Turning unstructured data into structured data
Why do we need feature engineering?
Before ML methods can be applied to unstructured data, we have to process the data and extract useful features from it
What are features?
features are quantitative and independent variables
based on which our ML model learns
What is the process of data analysis?
1) raw data (qualitative)
- feature engineering
2) features (quantitative), x
- input
3) ML model, f(x)
- output
4) target (supervised setup), y
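The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the reviews, the labels, and the helper names `engineer_features` and `model` are made up, and the "model" is a hand-written rule standing in for a trained ML model f(x).

```python
# 1) Raw data: hypothetical short text reviews (not directly usable by a model).
raw_data = ["great product, works great", "bad quality", "great value"]

# 4) Targets y for a supervised setup (hypothetical: 1 = positive, 0 = negative).
y = [1, 0, 1]


def engineer_features(text):
    """Feature engineering: turn one raw text into a quantitative feature vector x."""
    words = text.replace(",", "").split()
    # Features: number of words, count of "great", count of "bad".
    return [len(words), words.count("great"), words.count("bad")]


def model(x):
    """Stand-in for an ML model f(x): a fixed rule, not something learned from data."""
    _, n_great, n_bad = x
    return 1 if n_great > n_bad else 0


# 2) Features x, one vector per raw record.
features = [engineer_features(text) for text in raw_data]

# 3) Model output f(x), which can then be compared with the targets y.
predictions = [model(x) for x in features]
```

In a real supervised setup the rule inside `model` would be learned from pairs of features x and targets y rather than written by hand.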
What does feature engineering do?
extract or create features that may provide an ML model with rich info on its task based on domain knowledge
can be applied to raw data, resulting in quantitative data that can be directly fed into the ML model (features)
What is feature engineering for quantitative data?
create meaningful features through mathematical transformations
What are examples for mathematical transformations for feature engineering with quantitative data?
- arithmetic:
eg for difference of two variables
- aggregation of features:
eg for aggregation of two business units to one overall result
- geometric transformations:
eg to identify common wind speed patterns, vector-calculations
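A minimal sketch of the three kinds of transformation, using only the Python standard library. All column names and values (`temp_max`, `sales_unit_a`, `wind_u`, and so on) are hypothetical and chosen purely for illustration.

```python
import math

# Hypothetical quantitative records; names and numbers are made up.
rows = [
    {"temp_max": 21.0, "temp_min": 12.5,
     "sales_unit_a": 100.0, "sales_unit_b": 250.0,
     "wind_u": 3.0, "wind_v": 4.0},
    {"temp_max": 18.0, "temp_min": 10.0,
     "sales_unit_a": 80.0, "sales_unit_b": 120.0,
     "wind_u": 0.0, "wind_v": 2.0},
]


def add_features(row):
    """Return a copy of the row with engineered features added."""
    out = dict(row)
    # Arithmetic: difference of two variables.
    out["temp_range"] = row["temp_max"] - row["temp_min"]
    # Aggregation: two business units combined into one overall result.
    out["sales_total"] = row["sales_unit_a"] + row["sales_unit_b"]
    # Geometric: wind vector components turned into speed and angle.
    out["wind_speed"] = math.hypot(row["wind_u"], row["wind_v"])
    out["wind_angle_deg"] = math.degrees(math.atan2(row["wind_v"], row["wind_u"]))
    return out


engineered = [add_features(row) for row in rows]
```

Which derived features are actually useful depends on the task and on domain knowledge; the ones here are just one plausible choice.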
What is feature engineering for qualitative data?
since qualitative/categorical data cannot be fed into ML models directly, they have to be turned into quantitative data first
What are two methods to feature engineer qualitative data?
- label encoding
- one-hot encoding
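Both methods can be sketched in plain Python. The helpers `label_encode` and `one_hot_encode` and the sample values are hypothetical; ML libraries offer their own encoders, which are not shown here.

```python
def label_encode(values):
    """Label encoding: map each category to one integer (alphabetical order here)."""
    categories = sorted(set(values))
    mapping = {category: index for index, category in enumerate(categories)}
    return [mapping[value] for value in values], mapping


def one_hot_encode(values):
    """One-hot encoding: one 0/1 column per category, exactly one 1 per row."""
    categories = sorted(set(values))
    encoded = [[1 if value == category else 0 for category in categories]
               for value in values]
    return encoded, categories


# Hypothetical nominal data: types of fruit.
fruits = ["apple", "banana", "apple", "cherry"]

labels, mapping = label_encode(fruits)
one_hot, categories = one_hot_encode(fruits)
```

Label encoding yields a single numeric column, but the integers imply an order and distances between categories, so it tends to fit ordinal data better. One-hot encoding avoids that implied order for nominal data at the cost of one column per category.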