Preprocessing Flashcards

1
Q

Which dimensions are used to measure data quality?

A
  • Completeness: is the data fully available? What to do if not?
  • Consistency: do data units or naming conventions differ?
  • Timeliness: do the measurements come from different epochs?
    Were outdated measuring devices used?
  • Believability: is the data source reliable?
  • Interpretability: how easily can the data be understood?
2
Q

Which types of errors should you know about for data cleaning?

A
  • Incomplete: lacking attribute values, lacking certain attributes of
    interest, or only aggregate data available
  • Noisy: containing noise, errors, or outliers
  • Inconsistent: containing discrepancies in codes or names
  • Intentionally imprecise
    – Jan. 1 as everyone’s birthday
3
Q

How can missing data be handled?

A
  • Ignore the affected entries: little effect when the dataset is large
  • Fill in the entries manually
  • Fill in the entries automatically (global constant, mean, or the most
    probable value inferred, e.g., with a Bayesian formula or a decision
    tree based on the other attributes)
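
The first two automatic options can be sketched in a few lines of Python. This is only an illustration: the column `ages`, the use of `None` for missing entries, and the sentinel `UNKNOWN` are made-up assumptions, not part of the card.

```python
from statistics import mean

# Hypothetical attribute column; None marks a missing entry.
ages = [23.0, None, 31.0, 40.0, None, 26.0]

# Option 1: replace missing entries with a global constant (made-up sentinel).
UNKNOWN = -1.0
filled_constant = [UNKNOWN if a is None else a for a in ages]

# Option 2: replace missing entries with the mean of the observed values.
observed_mean = mean(a for a in ages if a is not None)
filled_mean = [observed_mean if a is None else a for a in ages]

print(filled_constant)
print(filled_mean)
```

Inferring the most probable value (Bayesian formula, decision tree) would need a model trained on the other attributes and is not shown here.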
4
Q

What is data integration?

A

Data integration combines data from multiple sources into a coherent store.

5
Q

Which methods can be used to detect redundant attributes?

A

chi-square test (nominal attributes)
correlation analysis (numeric attributes)

6
Q

What are the benefits of data integration?

A
  • reduce/avoid redundancies and inconsistencies
  • improve mining speed and quality
7
Q

Describe the chi-square test mathematically.

A
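
A standard textbook formulation of Pearson's chi-square statistic for two nominal attributes is sketched below. The notation is an assumption chosen for this sketch, not taken from the card: attribute A has c distinct values a_i, attribute B has r distinct values b_j, o_ij is the observed count of the joint event (A = a_i, B = b_j), e_ij is the count expected if A and B were independent, and n is the number of tuples.

```latex
\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}},
\qquad
e_{ij} = \frac{\mathrm{count}(A = a_i) \cdot \mathrm{count}(B = b_j)}{n}
```

The statistic is compared against the chi-square distribution with (r - 1)(c - 1) degrees of freedom.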
8
Q

What does a high chi-square value mean?

A

→ data distributions are statistically different

9
Q

What does a low chi-square value mean?

A

distributions are similar
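
To make high versus low values concrete, here is a minimal Python sketch that computes the statistic for a contingency table. The function name `chi_square` and both tables are made up for illustration, and the sketch assumes every row and column total is non-zero.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in cell (i, j) if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Made-up counts: rows are values of attribute A, columns are values of B.
dependent = [[90, 10], [10, 90]]    # B's distribution differs strongly by row
independent = [[50, 50], [50, 50]]  # B's distribution is identical in each row

print(chi_square(dependent))    # 128.0
print(chi_square(independent))  # 0.0
```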

10
Q

How does ChiMerge work?

A

Starting from a set of intervals, ChiMerge repeatedly uses the chi-square test to check whether the class-label distributions of two adjacent intervals are similar, and merges the two intervals if they are.
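
The merging loop can be sketched as follows. This is a simplified illustration, not a reference implementation: the names `pair_chi_square` and `chimerge` are made up, every distinct value starts as its own interval, classes that occur in neither interval are simply skipped, and the stopping rule is a fixed chi-square threshold.

```python
from collections import Counter

def pair_chi_square(counts_a, counts_b, classes):
    """Chi-square statistic for the class counts of two adjacent intervals."""
    total = sum(counts_a.values()) + sum(counts_b.values())
    stat = 0.0
    for counts in (counts_a, counts_b):
        row_total = sum(counts.values())
        for c in classes:
            col_total = counts_a[c] + counts_b[c]
            expected = row_total * col_total / total
            if expected > 0:  # simplification: skip classes absent from both
                stat += (counts[c] - expected) ** 2 / expected
    return stat

def chimerge(values, labels, threshold):
    """Merge adjacent intervals bottom-up while their chi-square is below threshold."""
    classes = sorted(set(labels))
    # Start with one interval per distinct value: (lower bound, class counts).
    by_value = {}
    for v, y in zip(values, labels):
        by_value.setdefault(v, Counter())[y] += 1
    intervals = [(v, by_value[v]) for v in sorted(by_value)]

    while len(intervals) > 1:
        scores = [pair_chi_square(intervals[i][1], intervals[i + 1][1], classes)
                  for i in range(len(intervals) - 1)]
        best = min(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            break  # every adjacent pair looks statistically different
        low, counts = intervals[best]
        intervals[best:best + 2] = [(low, counts + intervals[best + 1][1])]
    return [low for low, _ in intervals]  # lower bounds of the final intervals

# Made-up data; 2.706 is the chi-square critical value for 1 degree of
# freedom at the 0.10 significance level.
print(chimerge([1, 2, 3, 10, 11, 12], list("aaabbb"), 2.706))  # [1, 10]
```

In the example the values 1 to 3 all carry label "a" and 10 to 12 all carry label "b", so the sketch ends with two intervals starting at 1 and 10.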

11
Q

Describe Pearson’s product-moment coefficient mathematically.

A
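
A common textbook form is sketched below. The notation is an assumption chosen for this sketch, not taken from the card: A and B are numeric attributes with n paired observations (a_i, b_i), means \bar{A} and \bar{B}, and population standard deviations \sigma_A and \sigma_B.

```latex
r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n \, \sigma_A \sigma_B}
        = \frac{\sum_{i=1}^{n} a_i b_i - n \, \bar{A} \, \bar{B}}{n \, \sigma_A \sigma_B},
\qquad -1 \le r_{A,B} \le 1
```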
12
Q

Pearson’s product moment coefficient

What does r > 0 mean?

A

A and B are positively correlated

13
Q

Pearson’s product moment coefficient

What does r = 0 mean?

A

uncorrelated, not necessarily independent
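
A small Python sketch shows why r = 0 does not imply independence. The helper `pearson_r` and the data are made up for illustration; here b is completely determined by a, yet the linear correlation vanishes.

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson correlation using population (1/n) moments."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    std_a = sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    std_b = sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    return cov / (std_a * std_b)

a = [-2, -1, 0, 1, 2]
b = [x * x for x in a]  # b depends on a, but not linearly

print(pearson_r(a, b))  # 0.0
```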

14
Q

Pearson’s product moment coefficient

What does r < 0 mean?

A

negatively correlated

15
Q

How is the covariance computed?

A
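
The usual definition is sketched below. The notation is an assumption chosen for this sketch: n paired observations (a_i, b_i) of attributes A and B, with \bar{A} = E[A] and \bar{B} = E[B] denoting their expected values (means). The population form with 1/n is shown; a sample estimate would divide by n - 1 instead.

```latex
\mathrm{Cov}(A, B) = E\big[(A - \bar{A})(B - \bar{B})\big]
                   = \frac{1}{n} \sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})
```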
16
Q

What does a covariance greater than zero mean?

A

A and B tend to deviate from their expected values in the same direction:
both larger or both smaller.

17
Q

What does a covariance less than zero mean?

A

If A is larger than its expected value, B is likely to be smaller than
its expected value.

18
Q

How can the covariance formula be simplified?

A
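
A standard simplification follows from expanding the product and using linearity of expectation. In this sketch \bar{A} = E[A] and \bar{B} = E[B] are assumed to denote the expected values of A and B.

```latex
\begin{aligned}
\mathrm{Cov}(A, B) &= E\big[(A - \bar{A})(B - \bar{B})\big] \\
                   &= E[A \cdot B] - \bar{A}\,E[B] - \bar{B}\,E[A] + \bar{A}\,\bar{B} \\
                   &= E[A \cdot B] - \bar{A}\,\bar{B}
\end{aligned}
```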
19
Q
A
20
Q

How is element ij of a covariance matrix computed?

A

It is the covariance between feature i and feature j.
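
As a minimal sketch in plain Python: the function name `covariance_matrix` and the data are made up for illustration, and the population form (division by n) is assumed; a sample estimate would divide by n - 1.

```python
def covariance_matrix(rows):
    """Population covariance matrix; rows are samples, columns are features."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(row[k] for row in rows) / n for k in range(d)]
    # Entry [i][j] is the covariance between feature i and feature j.
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in rows) / n
             for j in range(d)]
            for i in range(d)]

# Made-up data: 4 samples, 2 features (the second is twice the first).
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
sigma = covariance_matrix(data)

print(sigma)  # [[1.25, 2.5], [2.5, 5.0]]
```

The matrix is symmetric, and its diagonal holds the variance of each feature.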

21
Q

Which binning strategies exist?

A
  • equal-width (every bin covers the same value range)
  • equal-frequency (every bin holds the same number of samples)
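
Both strategies can be sketched in plain Python. The function names and the `prices` list are made up for illustration, and the equal-width helper assumes the values are not all identical (otherwise the bin width would be zero).

```python
def equal_width_bins(values, k):
    """Assign each value to one of k bins of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum would fall into bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Split the sorted values into k bins with (nearly) equal sample counts."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]

print(equal_width_bins(prices, 3))      # [0, 0, 1, 1, 1, 2, 2, 2, 2]
print(equal_frequency_bins(prices, 3))  # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
```

With equal width the bins hold 2, 3, and 4 values; with equal frequency each bin holds exactly 3.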
22
Q

Which smoothing strategies can be applied after binning?

A
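
Assuming the usual textbook options (smoothing by bin means, by bin medians, and by bin boundaries), a minimal Python sketch could look like this. The function names and the example bins are made up for illustration.

```python
from statistics import mean, median

def smooth_by_means(bins):
    """Replace every value by the mean of its bin."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value by the median of its bin."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    out = []
    for b in bins:
        lo, hi = min(b), max(b)
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

# Made-up equal-frequency bins of sorted values.
bins = [[4, 8, 15], [21, 21, 24], [25, 28, 34]]

print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_medians(bins))     # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```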
23
Q

Which two approaches to dimensionality reduction exist?

A
  • Feature selection: A process that chooses an optimal
    subset of features according to an objective function
  • Feature extraction: refers to the mapping of the original
    high-dimensional data onto a lower-dimensional space
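
The structural difference between the two can be sketched in plain Python. Everything here is a made-up illustration: the kept column indices and the projection weights are hand-picked, whereas in practice an objective function would choose the subset and a method such as PCA would derive the mapping.

```python
def select_features(rows, keep):
    """Feature selection: keep a subset of the original columns unchanged."""
    return [[row[k] for k in keep] for row in rows]

def extract_features(rows, weights):
    """Feature extraction: map each row to new features via a linear projection."""
    return [[sum(w * x for w, x in zip(w_row, row)) for w_row in weights]
            for row in rows]

# Made-up data: 3 samples, 3 features.
data = [[1.0, 2.0, 0.5], [2.0, 4.0, 0.1], [3.0, 6.0, 0.9]]

# 3 features -> 2 of the original features.
selected = select_features(data, keep=[0, 2])
# 3 features -> 1 new feature (the sum of the first two).
projected = extract_features(data, weights=[[1.0, 1.0, 0.0]])

print(selected)   # [[1.0, 0.5], [2.0, 0.1], [3.0, 0.9]]
print(projected)  # [[3.0], [6.0], [9.0]]
```

Selection keeps original, interpretable attributes, while extraction produces new attributes that combine several originals.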
24
Q

What does descriptive dimensionality reduction minimize?

A

the information loss

25
Q

What does predictive dimensionality reduction maximize?

A

the class discrimination (how well the classes can be told apart)

26
Q

How is the separation quality determined?

A