Danny's Zenne Stof Flashcards

1
Q

What is data analytics in exploratory analytics?

A

It’s about the extraction of useful information and knowledge from large volumes of data, in order to improve decision making.

2
Q

Why do we do data exploration?

A

We explore our data in order to understand it better.

3
Q

How do you get started on data exploration?

A
  • Import the data in the right format.
  • Understand the meaning of the variables.
  • Understand their typical values.
  • Understand how values interact with each other.
  • Understand how to combine different datasets.
  • Understand the data types.
  • Are there missing values?
  • Are there outliers?
  • What is the overall quality of our data?
4
Q

Exploring data is twofold. Explain.

A

Descriptive statistics: distributions, relationships, …

Visualizations: scatterplots, histograms, barplots, …

Data visualization is arguably the most important. The information conveyed via visuals can be very quickly absorbed by the human brain.

5
Q

You cannot prepare the data without understanding the data. Explain?

A

You need to know the quality of your data to know what to do in your preparation steps. Are there outliers, missing values, etc.?

6
Q

A good start is half the battle… why?

A

You can’t begin working on your project unless you know and understand your data.

7
Q

Datasets typically consist of rows and columns. What do they mean?

A

Rows are the observations/data points/entities.

Columns are the attributes/features/variables of your observations.

8
Q

Which kinds of data sources are there?

A

Internal and External.

9
Q

Give some examples of internal data sources.

A
  • Company Website
  • Customer Information: make sure to contact the person responsible for privacy before working with Personally Identifiable Information (GDPR)!
  • Operations/Logistics data
  • Financial data
10
Q

Give some examples of external data sources

A
  • APIs (e.g. tweets)
  • Public Records (open source data, available to anyone, e.g. government)
  • Manually Labelled (e.g. reCaptcha, labeled customer reviews, …)
11
Q

What kinds of data storage exist?

A
  • Servers on Premise (small- to medium-sized datasets)
  • Cloud (any kind of dataset)

e.g. Amazon AWS, Google Cloud, Azure Cloud, …

12
Q

What kinds of data do you have? Give some examples.

A

Structured:

  • Tabular data
  • Customer information
  • Transactional data

Unstructured:

  • Text
  • Email
  • Video
  • Audio
  • Web Pages
  • Social Media
13
Q

What kind of databases are used for structured data?

A

Relational databases.

14
Q

What kind of databases are used for unstructured data?

A

Document databases.

15
Q

What query language is used to access document databases?

A

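NoSQL. Strictly speaking, NoSQL is a family of non-relational databases rather than a single query language; each document database offers its own query language or API.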

16
Q

What query language is used to access relational databases?

A

SQL

17
Q

How can we turn unstructured data into structured data?

A

By means of feature extraction.

18
Q

What is character encoding and why is it important?

A

Character encoding tells the software how to interpret the bytes of your data. This is important so that your data is interpreted correctly.

Common encodings include UTF-8 and Latin-1. Latin-1 cannot represent Kanji, for example.
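
A small sketch of why the encoding matters, using only Python's built-in `str.encode` and `bytes.decode`. The sample strings are made up for illustration.

```python
# The same text stored as bytes under two different encodings.
text = "café"
utf8_bytes = text.encode("utf-8")      # 'é' takes two bytes in UTF-8
latin1_bytes = text.encode("latin-1")  # 'é' takes one byte in Latin-1

# Reading UTF-8 bytes as if they were Latin-1 silently garbles the text.
garbled = utf8_bytes.decode("latin-1")

# Latin-1 has no way to represent Kanji, so encoding fails outright.
try:
    "漢字".encode("latin-1")
    kanji_fits_latin1 = True
except UnicodeEncodeError:
    kanji_fits_latin1 = False

print(len(utf8_bytes), len(latin1_bytes), garbled, kanji_fits_latin1)
```

When importing a file, passing the right `encoding` to the reader avoids the garbled result.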

19
Q

What are missing values?

A

Missing values are values that were not recorded or are not available for some observations in your dataset.

20
Q

What are some important steps to consider when importing data?

A
  • Are we using the correct character encoding?
  • Are there any missing values?

21
Q

What types of data are there?

A

Categorical:

  • Nominal (unranked)
  • Ordinal (ranked)

Numerical:

  • Discrete (counted, not measured)
  • Continuous (measured, not counted)
22
Q

What is nominal data? Give some examples.

A

Categorical data that does not indicate an order between the values.

  • Male/Female
  • Colours (red, green, blue)
23
Q

What is ordinal data? Give some examples.

A

Categorical data that does have some kind of order.

  • Small, Medium, Large
  • First Class, Second Class, Third Class
  • Temperature labeled as “cold, mild, hot”
24
Q

What is continuous data? Give some examples.

A

Continuous data is data that can be measured, but not counted.

  • Length
  • Weight
25
Q

What is discrete data? Give some examples.

A

Discrete data can be counted, but not measured.

  • Number of students
  • Number of pens in the box.
  • Number of chickens that walked out of the chicken coop.
26
Q

What kind of statistics can you do with nominal data?

A

You can count the frequencies.

You can count the proportions.
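
A minimal sketch of both statistics with Python's `collections.Counter`. The list of colours is invented for illustration.

```python
from collections import Counter

# Hypothetical nominal attribute: the colour of six products.
colours = ["red", "green", "blue", "red", "red", "green"]

# Frequencies: how often each category occurs.
frequencies = Counter(colours)

# Proportions: each frequency divided by the number of observations.
total = sum(frequencies.values())
proportions = {value: count / total for value, count in frequencies.items()}

print(frequencies)
print(proportions)
```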

27
Q

How can you visualize nominal data?

A

Bar charts and pie charts.

28
Q

What kind of statistics can you do with ordinal data?

A

Frequencies, proportions.

Percentiles and median.

29
Q

What kind of statistics can you do with continuous and discrete data?

A

You can summarize your data using percentiles, median, mean, standard deviation, range …

30
Q

How can you visualize numeric data?

A

Histograms.

Boxplots.

31
Q

Which type of plot can show outliers? Histograms or Boxplots?

A

Boxplots. Histograms only show tendencies of your data, not individual outliers.

32
Q

What do you call a variable that identifies a sample?

A

An object identifier.

33
Q

Give some examples of object identifiers.

A

Row indexes, names, database ids.

34
Q

What kind of information does a histogram give you?

A

The general tendencies of your data.

35
Q

What are descriptive statistics?

A

Descriptive statistics give you insights by summarizing the data.

36
Q

Give some examples of descriptive statistics.

A
  • Average of the annual income.
  • Median home prices in the neighbourhood.
  • Range of credit scores of a population.
37
Q

What is univariate exploration?

A

This is the analysis of one attribute at a time.

38
Q

What is the mean?

A

This is the average of all observations in a dataset for a certain variable.

39
Q

What is the median?

A

This is the middle value of a variable when its observations are sorted: half of the observations lie below it and half above.

40
Q

What is variability?

A

Variability describes how far the values of a variable are spread out. For instance, two variables with similar means and medians can have vastly different variability if their minimums and maximums are different.

41
Q

What is range?

A

Range is the difference between the minimum and maximum value.

The range is very susceptible to the presence of outliers and fails to consider the distribution of all data points in the attribute.

42
Q

What is spread?

A

Spread describes how far the values lie from the mean. It is quantified by the deviation, the variance and the standard deviation.

43
Q

What is deviation?

A

The difference between an observation and the mean of its variable.

44
Q

What is variance?

A

Variance is the average of the squared deviations of a variable from its mean.

45
Q

What is standard deviation?

A

The square root of the variance.
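
A short sketch that computes the univariate statistics from the preceding cards with Python's `statistics` module. The income values are hypothetical, and the population formulas (`pvariance`, `pstdev`) are an assumption; some tools divide by n - 1 instead.

```python
import statistics

# Hypothetical annual incomes (in thousands); 200 is an outlier.
incomes = [20, 22, 25, 27, 30, 200]

mean = statistics.mean(incomes)            # average of all observations
median = statistics.median(incomes)        # middle value of the sorted data
value_range = max(incomes) - min(incomes)  # maximum minus minimum

# Deviation: difference between each observation and the mean.
deviations = [x - mean for x in incomes]

# Variance: average of the squared deviations (population formula).
variance = statistics.pvariance(incomes)

# Standard deviation: square root of the variance.
std_dev = statistics.pstdev(incomes)

print(mean, median, value_range)
print(variance, std_dev)
```

The single outlier pulls the mean (54) far above the median (26), which is why comparing the two is a quick check for outliers or skew.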

46
Q

What does it mean when an attribute has a high standard deviation?

A

It means that the datapoints are spread widely around the central point.

47
Q

What does it mean when an attribute has a low standard deviation?

A

It means that the datapoints are spread closely around the central point.

48
Q

What is multivariate exploration?

A

It means that we study more than one attribute simultaneously.

49
Q

What is correlation?

A

Correlation measures the statistical relationship between two attributes.

50
Q

What is spurious correlation?

A

A correlation that happens by accident, or because of an (unseen) third factor.

It’s a correlation that’s not causal.

51
Q

What is the Pearson correlation coefficient?

A

A value (r) between -1 and 1. It describes how strongly two variables are linearly correlated.

-1 : perfectly negatively correlated
1 : perfectly positively correlated
0 : no linear correlation at all
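
A minimal sketch of how r can be computed by hand, using the standard formula (covariance divided by the product of the spreads). The data values are invented for illustration; the last example adds one extreme point to show how strongly a single outlier can move r.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of the deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


hours = [1, 2, 3, 4, 5]
perfect_positive = pearson_r(hours, [2, 4, 6, 8, 10])
perfect_negative = pearson_r(hours, [10, 8, 6, 4, 2])

# One extreme observation is enough to flip the sign of r.
with_outlier = pearson_r(hours + [6], [2, 4, 6, 8, 10, -30])

print(perfect_positive, perfect_negative, with_outlier)
```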

52
Q

Pearson’s correlation coefficient is sensitive to outliers. Correct?

A

Yes.

53
Q

What do we use scatterplots for?

A

We use scatterplots to compare 2 numerical attributes. We can compare more attributes by using colours, shapes, etc. to plot a third attribute.

54
Q

What is a histogram? What do you use it for?

A

A histogram can be used to visualize the distribution of data by plotting the frequency of occurrence in a range.

55
Q

What’s the optimal number of bins or binwidth in a histogram?

A

There is no optimal number, it depends on the data.
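
A small sketch of what a histogram counts, and of how the picture changes with the bin width. The helper `histogram_counts` and the ages are hypothetical, written only for this illustration.

```python
def histogram_counts(values, bin_width):
    """Count how many values fall in each bin [start, start + bin_width)."""
    counts = {}
    for value in values:
        start = (value // bin_width) * bin_width  # left edge of the bin
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))


ages = [21, 22, 25, 31, 34, 35, 38, 47, 52, 58]

wide_bins = histogram_counts(ages, 20)    # two coarse bins
narrow_bins = histogram_counts(ages, 10)  # four finer bins

print(wide_bins)
print(narrow_bins)
```

The wide bins hide the dip around age 40 that the narrow bins reveal, so it is worth trying several bin widths.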

56
Q

How can we compare the histograms of a categorical third factor?

A

By using colours. This could be useful to see how the X and Y attributes compare for the various values of a third categorical variable.

57
Q

What is a boxplot?

A

A boxplot is a simple but powerful visual way of showing the distribution of a numerical variable. A boxplot shows useful information like outliers and the interquartile range.

58
Q

What makes boxplots interesting?

A

You can compare them easily.

59
Q

What is Q1, Q2 and Q3 in a boxplot?

A

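Q1 and Q3 are the first and third quartiles (25th and 75th percentiles) and mark the edges of the box. Q2 is the median (50th percentile) of the distribution, drawn as the line inside the box.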
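
A sketch of the numbers behind a boxplot, using `statistics.quantiles`. The data are made up, and the 1.5 x IQR rule for flagging outliers is an assumption (a common convention, not something these cards prescribe).

```python
import statistics

# Hypothetical data; 50 is an extreme value.
values = [1, 2, 3, 4, 5, 6, 7, 8, 50]

# Q1, Q2 and Q3 cut the sorted data into four equal parts.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the height of the box

# Assumption: points beyond 1.5 x IQR from the box count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(q1, q2, q3, outliers)
print(statistics.median(values), statistics.mean(values))
```

Here Q2 equals the median (5), while the mean is pulled upwards by the outlier.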

60
Q

What is R²?

A

A measure of model fit: the proportion of the variance in the target variable that the model explains. A higher number indicates a better model fit.

61
Q

Where are the data samples located on a linear regression between two highly correlated numerical variables?

A

Very close to the linear regression line.

62
Q

Where are the data samples located on a linear regression between two weakly correlated numerical variables?

A

Very scattered and not along the linear regression line.

63
Q

Do outliers strongly influence the linear regression calculation?

A

Yes.

64
Q

What is a scatter matrix?

A

Creating a separate scatterplot for every pair of numerical attributes is tedious in datasets with many numerical features.

You can use a scatter matrix to quickly show comparisons for all of them.

A scatter matrix shows a scatterplot for each pair of attributes below the main diagonal.

The main diagonal shows a histogram of the attribute it represents.

Above the main diagonal is the r-value, which shows how strongly each pair of attributes is correlated.

65
Q

What problems does a parallel chart solve? What can you do with them?

A

In what way does type X differ from type Y?

Which classes have the highest numerical values for attribute A?

How do classes X and Y compare for attribute B and C?

66
Q

What is the data exploration roadmap?

A
  1. Import & Organize
  2. Data Quality

Check the data quality. Are there missing or incorrect values? How will you deal with them?

This is an iterative process between the data scientists & the business.

Values might have to be imputed.

  3. Univariate Statistics

Calculate the mean and median for each numerical attribute and the class label.

If they are very different, it may indicate the presence of an outlier or a non-normal distribution for the attribute.

Calculate the standard deviation and spread. Compare the standard deviation with the mean to understand the data.

  4. Univariate Visualizations

Display the histogram and distribution plots for each attribute. Repeat for class-stratified histograms. Use colour coding for each class to make comparisons.

Compare with and without outliers.

  5. Multivariate Statistics

Calculate the correlation between attributes and develop a correlation matrix. Notice which attributes are dependent on each other. Investigate why they are dependent. Ask the business for help to explain these results.

  6. Multivariate Visualizations

Plot a scatter matrix to show the correlation between multiple attributes at once.

Remember to stratify by class if applicable.

  7. High Dimensional Visualization

Create parallel charts to observe the class differences exhibited by each attribute.

Group box plots to compare them for each attribute.

67
Q

What is a root node?

A

The node at the top of a decision tree, where the first split is made. Each decision tree has one root node.

68
Q

What is a (child) node?

A

A node below the root that has child nodes of its own (an internal node).

69
Q

What is a leaf?

A

A node without child nodes.

70
Q

What is the purpose of Decision Trees?

A

Exploration. Gain insight into a large number of candidate input variables.

Classification & Estimation. Easily understandable rules for predicting the most likely class or the value of a continuous variable.

71
Q

What makes decision trees useful or interesting compared to other algorithms?

A

Because they provide insight into their decision making. They are a white-box approach.

72
Q

How to read/interpret a decision tree?

A

Each decision tree starts with a root node and asks a question: for example “sex = male?”.

Each node contains 3 rows of information.
Row 1: one number, the majority class in this node (for the class we’re trying to predict).
Row 2: two numbers, the proportion of observations belonging to each class.
Row 3: the % of total observations in this node.

Example from the Titanic case: (figure not available)

73
Q

What is decision tree node purity?

A

This indicates how pure a node is. A node is pure if (almost) all of its observations belong to one class and few or none to the other class.

74
Q

How does a decision tree find the best splits?

A

By using purity. A decision tree automatically tries to find the best splits by feature and value of the feature to make its splits as pure as possible.

This means that each split contains one dominant class.

75
Q

Decision Trees are exhaustive algorithms. What does that mean?

A

It’s an exhaustive algorithm because it tries all possibilities to make the best mathematical decision of purity.

76
Q

What is a recursive algorithm?

A

An algorithm that will continue to apply itself again and again until it completes its task.

77
Q

How is purity quantified?

A

Using GINI, Entropy or Information Gain Ratio.

78
Q

How do you choose the decision tree splitting criteria?

A

Splits are evaluated based on their effect on the node purity in terms of the target variable.

This means that the choice of an appropriate splitting criterion depends on the type of the target variable.

Categorical target -> GINI is OK.

Continuous/numeric target -> other tests.

79
Q

What are appropriate decision tree splitting criteria for categorical target variables?

A

GINI,
Entropy,
Information Gain Ratio

80
Q

What is GINI?

A

GINI is the sum of the squared proportions of the classes in a node. A pure node scores 1; the more mixed the node, the lower the score.

81
Q

How do you calculate GINI if you have a dataset of 5 triangles and 8 squares?

A

(5/13)² + (8/13)² = 0.527

82
Q

How do you calculate a GINI split that takes an original dataset of 5 triangles and 8 squares, and splits it in two groups. Group 1: 6 squares, 1 triangle. Group 2: 2 squares, 4 triangles.

A
GINI1 = (6/7)² + (1/7)² = 0.755
GINI2 = (2/6)² + (4/6)² = 0.556

GINI(split) = (7/13) × 0.755 + (6/13) × 0.556 = 0.663
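
The calculation can be checked with a few lines of Python. This sketch follows the definition used in these cards (sum of squared class proportions, weighted by group size for a split); the function names are made up for illustration.

```python
def gini(counts):
    """Sum of squared class proportions for one node."""
    total = sum(counts)
    return sum((count / total) ** 2 for count in counts)


def gini_split(groups):
    """Weighted average of the GINI scores of the groups after a split."""
    n = sum(sum(group) for group in groups)
    return sum(sum(group) / n * gini(group) for group in groups)


# Counts are given as [triangles, squares].
parent = gini([5, 8])
group_1 = gini([1, 6])
group_2 = gini([4, 2])
split = gini_split([[1, 6], [4, 2]])

print(round(parent, 3), round(group_1, 3), round(group_2, 3), round(split, 3))
# -> 0.527 0.755 0.556 0.663
```

The split scores higher than the parent node (0.663 versus 0.527), so it makes the groups purer.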

83
Q

When does a decision tree stop growing?

A

When it can no longer split the data. Until then, it will keep growing.

84
Q

What is pruning?

A

Eliminating unstable splits by merging smaller splits.

85
Q

What do the lower levels of a decision tree typically indicate?

A

The subtle patterns in the training set. They do not generalize well.

86
Q

Why are decision trees not ideal for regression?

A

Because they make lumpy estimations: every observation that ends up in the same leaf gets the same predicted value.

87
Q

Should you use Decision Trees for multiclass problems?

A

You can, but in practice it might not give good results. They are better suited for binary classifications.

88
Q

What is a model structure?

A

A mathematical function of a set of numeric attributes.

89
Q

What is parameter learning / parametric modelling?

A

These are models that learn by tweaking a set of parameters that are closely tied to the attributes that are used for learning.

For example, y = a + bx1 + cx2. In this case, x1 and x2 are the chosen attributes of our training set. The variables a, b and c are chosen by the model in such a way that the mathematical result has the best result on the training set (the lowest cost).

90
Q

What is a linear discriminant?

A

This is the line you can draw through a dataset in order to determine the target class.

This is also a mathematical function, like:

Class(x) = 400 - 5 * size (where 400 and 5 were determined by the model).

If the result is < 0, the observation is assigned to the negative class; if it’s > 0, to the positive class.
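
The example function can be written out as a tiny classifier. This is only a sketch of the rule on this card: the coefficients 400 and 5 come from the example, while the function names and the test sizes are hypothetical.

```python
def discriminant(size):
    """Linear discriminant from the example: 400 - 5 * size."""
    return 400 - 5 * size


def classify(size):
    """Assign a class based on the sign of the discriminant."""
    score = discriminant(size)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "on the boundary"


print(classify(60))   # 400 - 300 = 100  -> positive
print(classify(100))  # 400 - 500 = -100 -> negative
print(classify(80))   # 400 - 400 = 0    -> on the boundary
```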

91
Q

When the linear discriminant is for two attributes, we speak of a…

A

A line.

92
Q

When the linear discriminant is used for three attributes, we speak of a…

A

A plane.

93
Q

When the linear discriminant is used for four or more attributes, we speak of a…

A

Hyperplane.

94
Q

Can you visualize a hyperplane?

A

No, because it has more than 3 dimensions.

95
Q

What are objective functions?

A

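An objective function expresses the goal that the parameters of a parametric model are fitted to optimize (for example, minimizing an error). Different objective functions lead to different algorithms.

Example: SVM, Logistic Regression, Linear Regression, …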

96
Q

What is scoring/ranking?

A

Determining how likely it is (the probability) that an observation belongs to a class.

97
Q

How does SVM work?

A

It classifies observations based on a linear function of the features.

98
Q

What are support vectors?

A

The data points that are closest to the separating line (the decision boundary).

99
Q

What’s the goal with SVM?

A

Fitting the fattest possible line between the classes, i.e. the separating line with the widest margin.

100
Q

What is better, a narrower or a wider margin?

A

Wider margin.

101
Q

What does SVM do with points that are on the wrong side of the line?

A

It penalizes the model for them. Distance increases penalty. Further away from the line and on the wrong side => bigger penalty.

102
Q

What are the pros of SVM?

A

Accuracy.
Works well on small, clean datasets.
Robust against overfitting.

103
Q

What are the cons of SVM?

A

Requires a lot of computational power for large datasets.

104
Q

What is RMSE?

A

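Root Mean Squared Error. It is used in regression to estimate how well your model performs: it measures how far the predictions are from the real values. The model fits its parameters in such a way that the error on the training set is minimized; the RMSE on the test set shows how well the model generalizes.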
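
A minimal sketch of the RMSE calculation. The actual and predicted values are invented for illustration.

```python
import math


def rmse(actual, predicted):
    """Root of the mean of the squared prediction errors."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]

# Errors are 1, 0, -1 and -2, so the mean squared error is 6 / 4 = 1.5.
error = rmse(actual, predicted)
print(error)
```

Because the errors are squared, one large miss weighs more than several small ones.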

105
Q

What are the pros of linear regression?

A

Easy to fit and apply.

Less prone to overfitting.

Interpretable.

106
Q

What are the cons of linear regression?

A

It can only express linear and additive relationships.

Prone to collinearity: problems arise when input variables are correlated with each other.

Sensitive to outliers.

107
Q

What does it mean when the R² value is close to zero?

A

That the model isn’t very good, not really much better than guessing the answer.

108
Q

What does it mean when the R² value is close to one?

A

That the model is very good at predicting the value. It’s a good fit for the data.
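
A sketch of the usual R² formula (1 minus the residual sum of squares divided by the total sum of squares), which these cards do not spell out. The numbers are hypothetical.

```python
def r_squared(actual, predicted):
    """Share of the variance in the actual values explained by the model."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [2.0, 4.0, 6.0, 8.0]

# Predictions close to the real values give an R² close to one.
good_fit = r_squared(actual, [2.1, 3.9, 6.2, 7.8])

# Always predicting the mean gives an R² of zero.
mean_only = r_squared(actual, [5.0, 5.0, 5.0, 5.0])

print(good_fit, mean_only)
```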

109
Q

What does a good fitting linear regression model look like?

A

In a plot of predicted versus actual values, the x = y line runs roughly through the center of the points.

110
Q

What does a poorly fitting linear regression model look like?

A

The points diverge from the x = y line.

  • => need to model more complex relationships
  • => not all necessary variables are included (model is too simple)
111
Q

What is a residual plot?

A

It shows the goodness of fit by plotting the residuals: the distances between the real values and the model's predictions. The model is fitted by minimizing the sum of squared errors; the smaller this value, the better the model.

112
Q

What is logistic regression?

A

A model that indicates the probability that an observation belongs to the class of interest.

113
Q

What kind of variables does logistic regression need?

A

It needs numerical input variables and a categorical output variable.

114
Q

What’s the grey area surrounding a logistic regression line?

A

Most likely the confidence interval of the fitted curve.

115
Q

Why is logistic regression a stable algorithm for binary classification tasks?

A

It’s not very sensitive to outliers. Only points around the classification boundary (the middle) have a large influence on the model.

116
Q

Can you use linear regression for binary classification?

A

Yes, but it doesn’t tell us much as it’s very sensitive to outliers, so we preferably don’t do it.

117
Q

What is contractual churn?

A

Example: if you cancel your cable TV account.

118
Q

What is non-contractual churn?

A

Example: you stop shopping at Delhaize and go to Colruyt instead.

119
Q

What is voluntary churn?

A

When the user decides to stop doing business.

Example: doesn’t use their TV, so doesn’t want to pay anymore.

120
Q

What is involuntary churn?

A

When the company decides to stop doing business with the customer.

Example: customer violates terms of service and company bans the customer.

121
Q

When should you aim for high precision?

A

If the cost of making false positives is high: for example, if you want to make an effort to target churners and the cost of targeting a non-churner is high.

Example: you could offer better benefits to potential churners. If you start handing these out willy-nilly, it costs your company a lot of money.

122
Q

When should you aim for high recall?

A

If the cost of not identifying your positive class is high.

Example: if losing customers you didn’t think would churn is more expensive than acquiring new ones.

You don’t want to misclassify “real” churners.

123
Q

What does the ROC curve plot?

A

The True Positive Rate (y-axis) against the False Positive Rate (x-axis), for every classification threshold.

124
Q

At what threshold do you have a good AUC?

A

0.7-0.8

125
Q

When comparing multiple models in an ROC curve, which one should you choose?

A

The one that’s closest to the top left corner, the model with the biggest area-under-curve (AUC).

126
Q

What’s the F1-score?

A

This is the combined value of precision and recall. It’s the harmonic mean of these two values.

A high F1-score means you have a very well performing model, even with unbalanced classes.
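
A small sketch of precision, recall and the F1-score computed from confusion-matrix counts. The counts (30 true positives, 10 false positives, 20 false negatives) are hypothetical churn predictions.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of predicted churners that really churn
    recall = tp / (tp + fn)     # share of real churners that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(precision, recall, round(f1, 3))
```

The harmonic mean stays low when either precision or recall is low, so a model cannot score well on F1 by excelling at only one of them.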

127
Q

How does bagging work?

A

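In parallel: several models are trained independently on bootstrap samples of the data and their predictions are combined.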

128
Q

How does boosting work?

A

By building on the results of the previous classifier sequentially.

129
Q

What should you take into account with regards to variables for machine learning algorithms?

A

Most machine learning algorithms make assumptions about the distribution of your data.

Variables should be on the same scale (= standardization)

Variables should be of numeric types.

130
Q

Which models require standardized data?

A

Models that use linear objective functions.

131
Q

How do you turn categorical values into numeric values?

A

You use binary or one hot encoding.
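
A minimal sketch of one hot encoding in plain Python. The helper `one_hot` and the `is_` column prefix are made up for this illustration; in practice a library function would normally do this.

```python
def one_hot(values):
    """Turn a list of categories into one 0/1 column per category."""
    categories = sorted(set(values))
    return [
        {f"is_{category}": int(value == category) for category in categories}
        for value in values
    ]


encoded = one_hot(["red", "green", "blue", "green"])
for row in encoded:
    print(row)
```

Each observation gets exactly one 1, in the column of its own category.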

132
Q

What is feature selection?

A

Dropping unnecessary features.

Dropping highly correlated features.

133
Q

What’s important when dealing with imbalanced datasets with regards to test/train sets?

A

Use stratified sampling.

134
Q

What is feature engineering?

A

Create new features based on existing features.

Creates insights into relationships between features.

Should consult with business and subject matter experts.

135
Q

What is distance between datapoints? What do we mean?

A

We mean: how similar or dissimilar are two data points?

136
Q

How to calculate dissimilarity for nominal attributes?

A

d(i, j) = (p - m) / p

m = number of matches
p = total number of attributes describing the object
137
Q

How to calculate similarity of nominal attributes?

A

1 - d(i, j)
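
A sketch of both formulas for two objects described by nominal attributes. The two customers and their attribute values are hypothetical.

```python
def nominal_dissimilarity(a, b):
    """d(i, j) = (p - m) / p for two objects with nominal attributes."""
    p = len(a)                                  # total number of attributes
    m = sum(1 for x, y in zip(a, b) if x == y)  # number of matches
    return (p - m) / p


customer_i = ("male", "red", "Belgium", "student")
customer_j = ("male", "blue", "Belgium", "teacher")

d = nominal_dissimilarity(customer_i, customer_j)  # 2 of 4 attributes match
similarity = 1 - d
print(d, similarity)
```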

138
Q

How to calculate dissimilarity of objects described by numeric attributes?

A

Use the Euclidean distance:

d(i,j) = sqrt( (xi1 - xj1)² + (xi2 - xj2)² + … + (xin - xjn)² )

We want to minimise dissimilarity => more similar!

139
Q

Why is scaling important for numeric similarity calculations?

A

Because attributes with large values swamp attributes with small values, making the latter almost irrelevant to the distance.

140
Q

What are the characteristics of a standard scaled numeric attribute?

A

Its mean is 0 and its standard deviation is 1.
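
A sketch of standard scaling and its effect on the Euclidean distance. The ages and incomes are invented, and dividing by the population standard deviation (`pstdev`) is an assumption; some tools use the sample standard deviation instead.

```python
import math
import statistics


def euclidean(a, b):
    """Euclidean distance between two points with numeric attributes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def standard_scale(values):
    """Subtract the mean and divide by the standard deviation."""
    mean = statistics.mean(values)
    std_dev = statistics.pstdev(values)
    return [(value - mean) / std_dev for value in values]


ages = [25, 35, 45]
incomes = [30000, 50000, 40000]

# Unscaled: the income difference swamps the age difference.
raw_distance = euclidean((ages[0], incomes[0]), (ages[1], incomes[1]))

scaled_ages = standard_scale(ages)
scaled_incomes = standard_scale(incomes)

# Scaled: both attributes contribute on a comparable scale.
scaled_distance = euclidean(
    (scaled_ages[0], scaled_incomes[0]),
    (scaled_ages[1], scaled_incomes[1]),
)

print(raw_distance, scaled_distance)
```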

141
Q

What is the simplest form of K-Nearest Neighbours?

A

1-Nearest Neighbours.

142
Q

How should you decide on the ideal number of neighbours in KNN?

A

Test the accuracy for multiple values of K and use the best result. Alternatively, plot the test and training errors on a graph. When the test error starts going up again while the training error still goes down => overfitting.

143
Q

How is the standard distance in K-NN calculated?

A

Euclidean distance

144
Q

What happens if we use all possible neighbours for KNN?

A

Every new observation will receive the label of the majority class.
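
A minimal K-NN sketch with a majority vote over the K closest training points, using the Euclidean distance (`math.dist`). The training points, labels and function name are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, new_point, k):
    """Label a new point by majority vote among its k nearest neighbours."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], new_point))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


train = [
    ((1.0, 1.0), "churn"),
    ((1.2, 0.8), "churn"),
    ((5.0, 5.0), "stay"),
    ((5.5, 4.5), "stay"),
    ((6.0, 5.0), "stay"),
]
new_point = (1.5, 1.5)

print(knn_predict(train, new_point, k=1))           # closest single neighbour
print(knn_predict(train, new_point, k=3))           # vote among three
print(knn_predict(train, new_point, k=len(train)))  # all points: majority class
```

With K equal to the size of the training set, the new point is labelled "stay" even though it sits right next to the "churn" points.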

145
Q

Why would you not use a 1-nearest neighbour?

A

It overfits very strongly: every prediction depends on a single, possibly noisy, neighbour.

146
Q

What are the pros of KNN?

A

Very little time to build.

Can handle missing values in new observation.

147
Q

What are the cons of KNN?

A

Storage.
No model description.
Curse of dimensionality. There could be many irrelevant attributes that might pose a problem for predicting a neighbour.

148
Q

What type of learning does KNN use?

A

Instance-based learning.

149
Q

What do we use unsupervised learning for? Like clustering?

A

We might want to find groups within our data, without looking for any specific classification.

150
Q

What is another name for clustering?

A

Unsupervised segmentation.

151
Q

What is hierarchical clustering?

A

A clustering method that builds a hierarchy of clusters. The number of clusters does not have to be known ahead of time.

152
Q

What do we use a dendrogram for?

A

To indicate hierarchies in clusters.

153
Q

What is the lowest level in a dendrogram?

A

Each data point.

154
Q

What is the highest level in a dendrogram?

A

The entire dataset.

155
Q

How does hierarchical clustering work?

A

First, it groups the two data points that are closest together. Then, it groups the next two closest points or groups. It keeps doing that until everything is grouped into one large group containing the entire dataset.

It groups based on distances.
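
A rough sketch of this bottom-up procedure. It assumes single linkage (the distance between two groups is the distance between their closest members), which these cards do not specify; the points and the function name are hypothetical.

```python
import math


def agglomerate(points):
    """Merge the two closest clusters until one remains; return the merges."""
    clusters = [[p] for p in points]  # start: every point is its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Assumption: single linkage (closest pair of members).
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


points = [(0, 0), (0, 1), (5, 5), (5, 7), (20, 20)]
merges = agglomerate(points)
for left, right, distance in merges:
    print(left, "+", right, "at distance", round(distance, 2))
```

The merge distances are the heights at which the branches join in a dendrogram.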

156
Q

What does the dendrogram show?

A

It shows a collection of ways to group the points. You can cut it with a horizontal line at a certain height; the number of branches the line crosses indicates the number of groups.

157
Q

What is the advantage of hierarchical clustering?

A

It allows the data scientist to see the groups, the landscape of data similarity, before deciding on the number of groups to select.

158
Q

What is a centroid?

A

The center of a cluster.

159
Q

What is K-Means?

A

A clustering algorithm that breaks the observations into a pre-defined number of clusters. K = number of clusters.

160
Q

What does K-Means output?

A

The cluster centroids and the clustered datapoints.

161
Q

How does K-Means work?

A
  1. Randomly assign each observation to one of the K clusters.
  2. Then, calculate the center of each cluster. The center is the average position of all observations in that cluster.
  3. Then, we have the first iteration. Each observation is assigned to the nearest cluster center.
  4. Then, recalculate the center of each cluster.
  5. Keep repeating steps 3 and 4 until the centers stop changing.

Stopping can also be done early by specifying the maximum number of times the centers may be updated.
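
The steps above can be sketched in plain Python. This is an illustrative version, not a production implementation: the function name, the `seed` and `max_iterations` parameters and the handling of empty clusters are assumptions.

```python
import math
import random


def k_means(points, k, max_iterations=100, seed=0):
    """Cluster points into k groups following the steps described above."""
    rng = random.Random(seed)
    # Step 1: randomly assign each observation to one of the k clusters.
    assignments = [rng.randrange(k) for _ in points]
    centroids = [None] * k
    for _ in range(max_iterations):  # early stopping via an iteration cap
        # Steps 2 and 4: each center is the average position of its members.
        new_centroids = []
        for cluster in range(k):
            members = [p for p, a in zip(points, assignments) if a == cluster]
            if not members:
                # Assumption: an empty cluster is re-seeded with a random point.
                members = [rng.choice(points)]
            new_centroids.append(
                tuple(sum(axis) / len(members) for axis in zip(*members))
            )
        # Step 5: stop once the centers no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
        # Step 3: assign each observation to its nearest center.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
    return centroids, assignments


points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, assignments = k_means(points, k=2)
print(centroids)
print(assignments)
```

The function returns both outputs named on the earlier card: the cluster centroids and the cluster of each datapoint.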