Lecture 10 Flashcards

1
Q

10.1 Explain the motivation for data visualisation

A
  • Reveals characteristics of the data, relationships between objects or relationships between features
  • Simplifies the data
  • Humans are very good at analysing information in a visual format
  • Spot trends, patterns, outliers
  • Visualisation can help show data quality
  • Visualisation helps tell a story
2
Q

10.5 What is a dissimilarity matrix and what are the steps for its construction?

A

Compute all pairwise distances (e.g. Euclidean) between objects. This gives a dissimilarity matrix D, where D[i][j] is the distance from object i to object j.
The matrix can be visualised as a heat map, where the colour of each cell represents its value.
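The construction above can be sketched in a few lines (a minimal sketch using NumPy; the toy data values are hypothetical):

```python
import numpy as np

# Toy dataset: 4 objects, 2 features (hypothetical values).
X = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [5.0, 5.0],
              [5.0, 6.0]])

# All pairwise Euclidean distances: D[i, j] = ||X[i] - X[j]||.
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=2))
```

A heat map of D (e.g. via matplotlib's `imshow`) then shows each distance as a coloured cell.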

3
Q

10.7 What are the steps for reordering a dissimilarity matrix using the VAT algorithm?

A
  1. Find the pair of objects that are furthest apart and pick one of them. This is the starting object.
  2. From the objects not yet selected, pick the one closest to any already-selected object and add it to the ordering.
  3. Repeat step 2 until every object has been selected, then reorder the rows and columns of the dissimilarity matrix according to this ordering.
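The steps above can be sketched as follows (a minimal sketch; `vat_order` is a hypothetical helper name, and ties are broken by index order):

```python
import numpy as np

def vat_order(D):
    """Return a VAT ordering of objects for dissimilarity matrix D."""
    n = D.shape[0]
    # Step 1: start from one object of the furthest-apart pair.
    start, _ = np.unravel_index(np.argmax(D), D.shape)
    order = [start]
    remaining = [k for k in range(n) if k != start]
    # Step 2: repeatedly add the unselected object that is
    # closest to any already-selected object.
    while remaining:
        sub = D[np.ix_(order, remaining)]
        nearest = int(np.argmin(sub.min(axis=0)))
        order.append(remaining.pop(nearest))
    return order

# Two obvious clusters on a line: {0, 1} and {2, 3}.
points = np.array([0.0, 1.0, 10.0, 11.0])
D = np.abs(points[:, None] - points[None, :])
order = vat_order(D)  # objects from the same cluster end up adjacent
```

Reordering D as `D[np.ix_(order, order)]` then produces the VAT image.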
4
Q

10.8 Why is the VAT algorithm useful?

A

VAT lets you easily identify clusters by grouping close objects together.
However:
• The VAT algorithm won’t be effective in every situation
– For datasets with complex-shaped clusters (significant overlap between clusters, or irregular cluster geometries), the quality of the VAT image may degrade significantly.

5
Q

10.9 How can VAT be used to estimate the number of clusters in a dataset?

A

A dark block appears along the diagonal of the VAT image only when a tight group exists in the data (low within-cluster dissimilarities). Counting these diagonal dark blocks therefore gives an estimate of the number of clusters.

6
Q

10.10 What are the advantages and disadvantages of using parallel coordinates to visualise a dataset?

A

+ Often, the lines representing a distinct class of objects group together, at least for some features
– Hard to read for high-dimensional datasets (many axes, overlapping lines)
– Scaling axes: Affects the visualisation. May choose to scale all features into the range [0,1] via a pre-processing step
– Ordering of axes: Influences the relationships that can be seen. Correlations between pairs of features may only be visible in certain orderings
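The [0, 1] scaling mentioned above is a simple per-feature min-max transform (a minimal sketch; the data values are hypothetical):

```python
import numpy as np

# Two features on very different scales (hypothetical values).
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Scale each feature (column) into [0, 1] so no single axis
# dominates the parallel-coordinates plot.
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```

The scaled data can then be plotted with, e.g., `pandas.plotting.parallel_coordinates`.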

7
Q

10.12 What are the motivations for dimensionality reduction?

A
  • Reduce amount of time and memory required by data processing algorithms
  • Allow data to be more easily visualised
  • Help eliminate irrelevant features or noise
8
Q

10.13 Understand the concept of dimensionality reduction of a dataset (what is the input, what is the output, and what is their relationship?)

A

Input: A dataset with N features and K objects
Output: A transformed dataset with the same K objects but n features, where n < N
Relationship: the transformed dataset should preserve as much of the structure (e.g. variability) of the original data as possible

9
Q
10.14 What methods can be used to perform dimensionality reduction?
A
  • Selecting a subset of the original features
  • Generating a small number of new features (The new features can be functions that encapsulate the data from the old features they represent)
10
Q

10.15 What is the purpose of using PCA for dimensionality reduction?

A

PCA – Principal Components Analysis

Goal is to find a projection that captures the largest amount of variation in data.

11
Q

10.16 How does PCA work? (It is not necessary to understand the mathematical formulas used for PCA.)

A
  • Find a new set of features that better captures the variability of the data
  • First dimension chosen to capture as much of the variability as possible.
  • The second dimension is orthogonal to the first and, subject to that constraint, captures as much of the remaining variability as possible.
  • The third dimension is orthogonal to the first and second, and subject to that constraint, captures as much of the remaining variability as possible.
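The procedure above can be sketched with a covariance eigen-decomposition (a minimal sketch; the synthetic data is hypothetical, and in practice a library routine such as scikit-learn's `PCA` would typically be used):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data whose variability lies mostly along the first axis.
X = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])

# 1. Centre the data so each feature has zero mean.
Xc = X - X.mean(axis=0)
# 2. Eigen-decompose the covariance matrix of the centred data.
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
# 3. Order components by decreasing captured variance;
#    they are mutually orthogonal by construction.
idx = np.argsort(eigvals)[::-1]
components = eigvecs[:, idx]
# 4. Project onto the first principal component (reduce 2-D to 1-D).
X_reduced = Xc @ components[:, :1]
```

Each successive column of `components` captures as much of the remaining variance as possible, subject to being orthogonal to the earlier ones.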
12
Q

10.5 What are the properties of a dissimilarity matrix?

A

• The diagonal of D is all zeros (because the distance from an object to itself is 0)
• D is symmetric about its leading diagonal
• In general, visualising the (raw) dissimilarity matrix may not reveal enough useful information
– Further processing is needed

13
Q

How do you interpret a dissimilarity matrix that has been reordered using the VAT algorithm?

A

View it as a heat map where the colour of each cell is based on the value in the matrix. After VAT reordering, clusters appear as dark blocks along the diagonal; the size of each block indicates the size of the corresponding cluster.

14
Q

What are the potential benefits of using PCA for data visualisation?

A

PCA addresses the problems of visualising high-dimensional data (too many axes to plot, and the higher the number of dimensions the closer objects appear to one another): projecting the data onto the first two or three principal components lets it be plotted directly while retaining as much of its variability as possible.
