Part 2: Chapter 1, 2, 3 Flashcards

1
Q

Data

A

Collection of facts, usually obtained as the result of experiences, web page visits, observations, or experiments. Data may consist of numbers, words, images, …
- Data is the lowest level of abstraction (from which information and knowledge are derived).

2
Q

CRISP-DM

A

Cross-Industry Standard Process for Data Mining: a standard process model for data mining projects. 6 steps:

  • Business understanding: what is the project about and what do we want to achieve?
  • Data understanding
  • Data preparation: collect data from all sources and clean it.
  • Model building: select the model that best produces the results you are after.
  • Testing and evaluation
  • Deployment: how are we going to use the results in practice?
The first 3 steps take roughly 85% of the project time.
The last 3 steps are supported by 'SEMMA':
- Sample
- Explore
- Modify
- Model
- Assess
3
Q

Data mining

A

The process of discovering new, valuable knowledge in databases. Data mining = machine learning (AI) + statistics + databases.
2 types of data mining:
1. Hypothesis-driven data mining. This is classical statistical hypothesis testing: you formulate a hypothesis beforehand and then test it.
2. Discovery-driven data mining. This is modern exploratory research: you explore the data and the hypothesis develops while you look at the data.
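The two types above can be contrasted in a minimal sketch. The records, attribute names, and the `mean_difference` helper are all made up for illustration, and a plain difference in means stands in for a real statistical test.

```python
import statistics

# Made-up customer records (hypothetical attributes and values).
records = [
    {"loyalty_card": True,  "online": True,  "city": True,  "spend": 120},
    {"loyalty_card": True,  "online": False, "city": False, "spend": 80},
    {"loyalty_card": False, "online": True,  "city": True,  "spend": 115},
    {"loyalty_card": False, "online": False, "city": False, "spend": 75},
    {"loyalty_card": True,  "online": True,  "city": False, "spend": 110},
    {"loyalty_card": False, "online": False, "city": True,  "spend": 85},
]

def mean_difference(rows, attribute):
    """Mean spend where the attribute is True minus mean spend where it is False."""
    with_attr = [r["spend"] for r in rows if r[attribute]]
    without_attr = [r["spend"] for r in rows if not r[attribute]]
    return statistics.mean(with_attr) - statistics.mean(without_attr)

# Hypothesis-driven: the hypothesis is fixed beforehand
# ("loyalty card holders spend more") and only that is checked.
hypothesis_effect = mean_difference(records, "loyalty_card")

# Discovery-driven: scan every attribute and let the data suggest
# which relationship is worth turning into a hypothesis.
attributes = ["loyalty_card", "online", "city"]
effects = {a: mean_difference(records, a) for a in attributes}
strongest = max(effects, key=lambda a: abs(effects[a]))

print(f"hypothesis-driven effect of loyalty_card: {hypothesis_effect:.2f}")
print(f"discovery-driven strongest attribute: {strongest}")
```

In this toy data the discovery-driven scan points at a different attribute than the one hypothesized beforehand, which is the situation in which a new hypothesis develops while looking at the data.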

4
Q

Similarities between data mining and statistics

A
  • Both are scientific fields for analyzing data.
  • Both are complex processes for learning from data that require profound understanding and mastery.

5
Q

Differences between data mining and statistics

A
  • Purpose of data: in statistics, data are primarily collected to check a hypothesis formulated beforehand. Data mining deals mostly with data gathered from operational processes.
  • Amount of data: data mining usually deals with vast amounts of data.
  • Data analysis: the results of a data mining process should be easy for human decision makers to understand and easy to explain to them, and should comply with the business objectives.
6
Q

The 5 V’s of big data

A
  • Volume: the incredible amounts of data generated from different sources.
  • Variety: all the different types of data we can use.
  • Velocity: the speed at which vast amounts of data are being generated, collected, and analyzed.
  • Value: the worth of the data being extracted (often forgotten).
  • Veracity: the quality or trustworthiness of the data.
7
Q

Principles from logic

A
  • Deduction: all birds can fly, Koko is a bird, therefore Koko can fly. -> sound: the conclusion follows from the premises.
  • Abduction: all birds can fly, Koko can fly, therefore Koko is a bird. -> not sound: airplanes can fly as well.
  • Induction: Koko is a bird, Koko can fly, Tweety is a bird, Tweety can fly, therefore all birds can fly. -> not sound, because we only have 2 cases. However, this is the kind of reasoning used in data mining.
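Why induction is not sound can be shown in a small sketch: a rule induced from a few cases holds only until a counterexample turns up. The `induce_rule` helper and the third observation are hypothetical, added purely for illustration.

```python
# Observed cases as (name, is_bird, can_fly); the first two come from the card.
observations = [("Koko", True, True), ("Tweety", True, True)]

def induce_rule(cases):
    """Return True if every observed bird can fly, i.e. the rule holds so far."""
    return all(can_fly for _, is_bird, can_fly in cases if is_bird)

# With only the two observed cases, induction yields "all birds can fly".
rule_before = induce_rule(observations)

# A new, hypothetical observation (a bird that cannot fly) contradicts the rule.
observations.append(("Pingu", True, False))
rule_after = induce_rule(observations)

print(f"rule holds on 2 cases: {rule_before}")
print(f"rule holds after counterexample: {rule_after}")
```

A rule learned from data is supported by the cases seen so far, not proven, so it should be re-checked when new data arrives.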
8
Q

Selection bias

A

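Bias that arises when the dataset (sample) is not a good representation of the real world. Make sure that your sample is representative of the population you want to draw conclusions about.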
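The effect of an unrepresentative sample can be sketched with simulated data. The population of customer ages and the "only people under 40 respond" selection rule are both assumptions made up for this illustration.

```python
import random
import statistics

random.seed(42)

# Hypothetical population: ages of 10,000 customers (simulated, not real data).
population = [random.randint(18, 80) for _ in range(10_000)]

# Representative sample: every customer has the same chance of being picked.
random_sample = random.sample(population, 500)

# Biased sample: assume only customers younger than 40 answer the survey.
biased_sample = [age for age in population if age < 40][:500]

population_mean = statistics.mean(population)
random_mean = statistics.mean(random_sample)
biased_mean = statistics.mean(biased_sample)

print(f"population mean age:    {population_mean:.1f}")
print(f"random sample mean age: {random_mean:.1f}")
print(f"biased sample mean age: {biased_mean:.1f}")
```

The random sample lands close to the population mean, while the biased sample cannot exceed 40 no matter how large it gets: collecting more data does not fix selection bias.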

9
Q

Social and legal aspects

A
  • Social impacts:
    + privacy: access to detailed personal information.
    + ethics: handling misinformation and discrimination.
  • Legal aspects:
    + National and EU legislation
    + General Data Protection Regulation (GDPR)