# Lecture 8 – Grouping data Flashcards

1
Q

Segmenting data

A

Sometimes data is segmented because of its context (e.g. different sources)

Sometimes we don’t have pre-determined segments but still want a segmentation
- e.g. identifying customer segments

2
Q

What is a segmentation model?

A

A model that divides the data into groups (segments) whose members are similar in some attributes, e.g. customer segments.
3
Q

Which of the following is not a segmentation (clustering) task?

A. Grouping all the shopping items available on the web.
B. Identification of areas of similar land use in an earth observation database.
C. Weather prediction based on last month’s temperature.

A

Probably C: weather prediction predicts a value from inputs rather than grouping similar items, whereas A and B both group similar items into segments.

4
Q

What are regression trees?

A
• A regression tree is a supervised machine learning algorithm that predicts a continuous-valued response variable by learning decision rules from the predictors (or independent variables)
• Two main steps:
  1. Divide the data into subsets of similar values.
  2. Estimate the response within each subset (see the sketch below).
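
As an illustration of these two steps, a minimal sketch using scikit-learn's DecisionTreeRegressor on synthetic data (the library choice and the data are assumptions for illustration, not from the lecture):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))          # one continuous predictor
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)  # continuous response

# Step 1: the fitted splits divide the data into subsets;
# step 2: each leaf predicts the mean response of its subset.
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
print(tree.predict([[2.5]]))  # predicted response for a new observation
```
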
5
Q

What is ANOVA?
How is it used to split regression trees?

A

ANOVA = analysis of variance
–> a type of statistical test used to determine whether there is a statistically significant difference between two or more categorical groups by testing for differences of means using variance.

In regression trees, the ANOVA criterion scores a candidate split by how much it reduces the within-node variance (the sum of squared errors around each child’s mean); the split with the largest reduction is chosen, as in the sketch below.
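
A minimal sketch of this criterion, assuming a single numeric predictor (illustrative code, not from the lecture):

```python
import numpy as np

def best_split(x, y):
    """Return (within-child SSE, threshold) of the best split on x."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_sse, best_t = np.inf, None
    for i in range(1, len(x)):
        left, right = y[:i], y[i:]
        # total within-child sum of squared errors for this candidate split
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_t = sse, (x[i - 1] + x[i]) / 2  # midpoint threshold
    return best_sse, best_t

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8])
print(best_split(x, y))  # splits at 3.5, where the mean response jumps
```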

6
Q

What are pros and cons of regression trees?

A

Pros:
- Easy to understand.
- Visualizing the tree can reveal crucial information, such as how decision rules are formed, the importance of different predictors and the effect of the splitting points in the predictors.
- Implicitly performs feature selection, as some of the predictors may not be included in the tree.
- Not sensitive to the presence of missing values and outliers.
- No assumptions about the shape and the distribution of the data.

Cons:
- The fit has a high variance, meaning small changes in the data set can lead to an entirely different tree.
- Overfitting is a problem for tree-based models, but we can adjust the stopping conditions and prune the tree.
- Can be inefficient when performing an exhaustive search for the splitting points of continuous numerical predictors.
- Greedy algorithms cannot guarantee the return of the globally optimal regression tree.

7
Q

When do you use regression trees and when classification trees?

A
• Regression tasks: predicting a quantitative (numerical) variable from the input variables
• Classification tasks: predicting a qualitative value (e.g., a category or class) from the input variables
–> categorical variables (nominal data, ordinal data)
8
Q

How do you split classification trees?

A

The most popular split criteria are Gini impurity and entropy (see the sketch below).
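
As a sketch, both criteria can be computed from the class proportions in a node; a split is chosen to maximize the impurity decrease from parent to children (illustrative code, not from the lecture):

```python
import numpy as np

def gini(p):
    """Gini impurity of a node with class proportions p."""
    p = np.asarray(p)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Entropy (in bits) of a node with class proportions p."""
    p = np.asarray(p)
    p = p[p > 0]  # avoid log(0)
    return -np.sum(p * np.log2(p))

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # most impure node: 0.5, 1.0
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))  # pure node: 0.0, 0.0
```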

9
Q

Clustering and segmentation

A
10
Q

Explain the use of clustering

A

Use of clustering:

• Text documents, e.g., patents, legal cases, web pages, questions and feedback ==> topic modelling
• Clients, e.g., recommendation systems
• Fault detection, e.g., fraud, network security
• Missing data
• A clustering task may require a number of different algorithms/approaches.
11
Q

What are elements of a cluster?

A
• Are similar in some attributes
• May consider some attributes to weigh more than others ==> not all attributes are equally important (feature selection)
• May be considered to be close to each other ==> needs distance measurements (see the sketch below)
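
A minimal sketch of a distance measurement with attribute weights, assuming numeric attributes (the values and weights are made up for illustration):

```python
import numpy as np

a = np.array([1.0, 5.0, 0.2])
b = np.array([2.0, 3.0, 0.1])
w = np.array([1.0, 0.5, 10.0])  # hypothetical weights: the third attribute matters most

euclidean = np.sqrt(np.sum((a - b) ** 2))     # all attributes weigh equally
weighted = np.sqrt(np.sum(w * (a - b) ** 2))  # weighted Euclidean distance
print(euclidean, weighted)
```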

12
Q

What are the two clustering approaches you learned in the lecture?

A
1. k-means algorithm
2. Hierarchical clustering
13
Q

Explain the k-means algorithm

A
1. Randomly select centroids for K clusters
2. Assign each data point to its nearest centroid to form the cluster populations
3. Find the mean value of each cluster and use it as the new centroid
4. Re-evaluate populations and centroids until stable (convergence), as in the sketch below
• Does not work with categorical data and is susceptible to outliers
• Have to predefine a value for K
• No guarantee there are actually clusters to find
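
A minimal sketch of these four steps in NumPy (illustrative only; it does not handle empty clusters or categorical data):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Randomly select K data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # 2. Assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. The mean of each cluster becomes the new centroid
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Stop when the centroids are stable (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs in 2D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)  # roughly (0, 0) and (5, 5)
```
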
14
Q

Explain hierarchical clustering

A

Clusters within clusters!
• Agglomerative (bottom-up) vs divisive (top-down)

• Agglomerative:
• Treat each data point as a centroid in a cluster of population 1
• Form new clusters by merging nearby clusters
• Continue until only one cluster
• Various ways to calculate which clusters should be merged, often looking at (min or max) distances of the cluster population to each other
• The results of hierarchical clustering are usually presented in a dendrogram
• Greedy!
• Can be costly, due to having to calculate a lot of distances for each level of the tree.
• But with no randomness, the same tree will be produced each time.
• Can cut the tree at any level to obtain a chosen number of clusters (see the sketch below).
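
A minimal sketch of agglomerative clustering with SciPy (a library choice assumed for illustration, not from the lecture); scipy.cluster.hierarchy.dendrogram can plot the resulting tree:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(5, 0.3, (10, 2))])

# 'complete' linkage merges clusters by the max pairwise distance;
# 'single' linkage would use the min distance instead.
Z = linkage(X, method='complete')

# "Cut the tree" to obtain a chosen number of clusters (here: 2)
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```
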
15
Q

What is a network made up of?

A

Nodes and edges

• Nodes (vertices) –> entities in the data
• Edges (arcs) –> relationships between the entities
16
Q

Explain the specific terms for network data

A
• Directed graphs – direction of connections, visualised as arrows, e.g., retweets, power relationships, dispersion of resources
• Weight – the strength of a connection, e.g., number of instances
• Degree – how many connections a node has:
  • Can incorporate the weight of the connections
  • Can distinguish between in-degree and out-degree (see the sketch below)
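
A minimal sketch of these terms with NetworkX (a library choice assumed for illustration; the nodes and weights are made up):

```python
import networkx as nx

G = nx.DiGraph()                      # directed graph: edges have a direction
G.add_edge('alice', 'bob', weight=3)  # e.g., alice retweeted bob 3 times
G.add_edge('carol', 'bob', weight=1)
G.add_edge('bob', 'alice', weight=2)

print(G.in_degree('bob'))                   # incoming connections: 2
print(G.in_degree('bob', weight='weight'))  # weighted in-degree: 3 + 1 = 4
print(G.out_degree('bob'))                  # outgoing connections: 1
```
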
17
Q

How do you evaluate nodes?

A

Significant nodes:
- The closeness of nodes is not a Euclidean distance.
- The centrality of a node can be measured in various ways (see the sketch at the end of this card):
  - Degree
  - Betweenness
  - Closeness

• Clustering nodes
• You can still identify clusters of nodes
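
A minimal sketch of the three centrality measures above with NetworkX (a library choice assumed for illustration):

```python
import networkx as nx

G = nx.path_graph(5)  # a simple chain: 0 - 1 - 2 - 3 - 4

print(nx.degree_centrality(G))       # share of other nodes each node connects to
print(nx.betweenness_centrality(G))  # how often a node lies on shortest paths
print(nx.closeness_centrality(G))    # inverse of the average distance to all other nodes
```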