Lecture 10 Flashcards

1
Q

10.1 Explain the motivation for data visualisation

A
  • Reveals characteristics of the data, relationships between objects or relationships between features
  • Simplifies the data
  • Humans are very good at analysing information in a visual format
  • Spot trends, patterns, outliers
  • Visualisation can help show data quality
  • Visualisation helps tell a story
2
Q

10.5 What is a dissimilarity matrix and what are the steps for its construction?

A

Compute all pairwise distances (e.g. Euclidean) between objects. This gives a dissimilarity matrix D, where D[i][j] is the distance from object i to object j.
The matrix can be visualised as a heat map, where the colour of each cell represents its value.
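The construction above can be sketched in a few lines (a minimal sketch using NumPy; the toy data values are hypothetical):

```python
import numpy as np

# Toy dataset: 4 objects, 2 features (hypothetical values).
X = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [5.0, 5.0],
              [5.0, 6.0]])

# All pairwise Euclidean distances: D[i, j] = ||X[i] - X[j]||.
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=2))
```

A heat map of D (e.g. via matplotlib's `imshow`) then shows each distance as a coloured cell.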

3
Q

10.7 What are the steps for reordering a dissimilarity matrix using the VAT algorithm?

A
  1. Find the pair of objects that are furthest apart and pick one of them. This is the starting object.
  2. From the objects not yet selected, pick the one closest to any already-selected object and add it to the ordering.
  3. Repeat step 2 until every object has been selected, then reorder the rows and columns of the dissimilarity matrix according to this ordering.
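The steps above can be sketched as follows (a minimal sketch; `vat_order` is a hypothetical helper name, and ties are broken by index order):

```python
import numpy as np

def vat_order(D):
    """Return a VAT ordering of objects for dissimilarity matrix D."""
    n = D.shape[0]
    # Step 1: start from one object of the furthest-apart pair.
    start, _ = np.unravel_index(np.argmax(D), D.shape)
    order = [start]
    remaining = [k for k in range(n) if k != start]
    # Step 2: repeatedly add the unselected object that is
    # closest to any already-selected object.
    while remaining:
        sub = D[np.ix_(order, remaining)]
        nearest = int(np.argmin(sub.min(axis=0)))
        order.append(remaining.pop(nearest))
    return order

# Two obvious clusters on a line: {0, 1} and {2, 3}.
points = np.array([0.0, 1.0, 10.0, 11.0])
D = np.abs(points[:, None] - points[None, :])
order = vat_order(D)  # objects from the same cluster end up adjacent
```

Reordering D as `D[np.ix_(order, order)]` then produces the VAT image.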
4
Q

10.8 Why is the VAT algorithm useful?

A

VAT lets you easily identify clusters by grouping close objects together.
However:
• The VAT algorithm won’t be effective in every situation
– For datasets with complex-shaped clusters (significant overlap between clusters, or irregular cluster geometries), the quality of the VAT image may degrade significantly.

5
Q

10.9 How can VAT be used to estimate the number of clusters in a dataset?

A

A dark block appears along the diagonal of the VAT image only when a tight group exists in the data (low within-cluster dissimilarities). Counting these diagonal dark blocks therefore gives an estimate of the number of clusters.

6
Q

10.10 What are the advantages and disadvantages of using parallel coordinates to visualise a dataset?

A

+ Often, the lines representing a distinct class of objects group together, at least for some features
– Hard to read for high-dimensional datasets (many axes, overlapping lines)
– Scaling axes: Affects the visualisation. May choose to scale all features into the range [0,1] via a pre-processing step
– Ordering of axes: Influences the relationships that can be seen. Correlations between pairs of features may only be visible in certain orderings
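The [0, 1] scaling mentioned above is a simple per-feature min-max transform (a minimal sketch; the data values are hypothetical):

```python
import numpy as np

# Two features on very different scales (hypothetical values).
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Scale each feature (column) into [0, 1] so no single axis
# dominates the parallel-coordinates plot.
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```

The scaled data can then be plotted with, e.g., `pandas.plotting.parallel_coordinates`.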

7
Q

10.12 What are the motivations for dimensionality reduction?

A
  • Reduce amount of time and memory required by data processing algorithms
  • Allow data to be more easily visualised
  • Help eliminate irrelevant features or noise
8
Q

10.13 Understand the concept of dimensionality reduction of a dataset (what is the input, what is the output, and what is their relationship?)

A

Input: A dataset with N features and K objects
Output: A transformed dataset with the same K objects but n features, where n < N
Relationship: the transformed dataset should preserve as much of the structure (e.g. variability) of the original data as possible

9
Q
10.14 What methods can be used to perform dimensionality reduction?
A
  • Selecting a subset of the original features
  • Generating a small number of new features (The new features can be functions that encapsulate the data from the old features they represent)
10
Q

10.15 What is the purpose of using PCA for dimensionality reduction?

A

PCA – Principal Components Analysis

Goal is to find a projection that captures the largest amount of variation in data.

11
Q

10.16 How does PCA work? (It is not necessary to understand the mathematical formulas used for PCA.)

A
  • Find a new set of features that better captures the variability of the data
  • First dimension chosen to capture as much of the variability as possible.
  • The second dimension is orthogonal to the first and, subject to that constraint, captures as much of the remaining variability as possible.
  • The third dimension is orthogonal to the first and second, and subject to that constraint, captures as much of the remaining variability as possible.
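The procedure above can be sketched with a covariance eigen-decomposition (a minimal sketch; the synthetic data is hypothetical, and in practice a library routine such as scikit-learn's `PCA` would typically be used):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data whose variability lies mostly along the first axis.
X = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])

# 1. Centre the data so each feature has zero mean.
Xc = X - X.mean(axis=0)
# 2. Eigen-decompose the covariance matrix of the centred data.
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
# 3. Order components by decreasing captured variance;
#    they are mutually orthogonal by construction.
idx = np.argsort(eigvals)[::-1]
components = eigvecs[:, idx]
# 4. Project onto the first principal component (reduce 2-D to 1-D).
X_reduced = Xc @ components[:, :1]
```

Each successive column of `components` captures as much of the remaining variance as possible, subject to being orthogonal to the earlier ones.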
12
Q

10.5 What are the properties of a dissimilarity matrix?

A

• The diagonal of D is all zeros (because the distance from an object to itself is 0)
• D is symmetric about its leading diagonal
• In general, visualising the (raw) dissimilarity matrix may not reveal enough useful information
– Further processing is needed

13
Q

How do you interpret a dissimilarity matrix that has been reordered using the VAT algorithm?

A

View it as a heat map where the colour of each cell is based on the value in the matrix. After VAT reordering, clusters appear as dark blocks along the diagonal; the size of each block indicates the size of the corresponding cluster.

14
Q

What are the potential benefits of using PCA for data visualisation?

A

PCA addresses the problems of visualising high-dimensional data (too many axes to plot, and the higher the number of dimensions the closer objects appear to one another): projecting the data onto the first two or three principal components lets it be plotted directly while retaining as much of its variability as possible.
