CS165 Flashcards

Question

Why should visualisations be chosen carefully?

Answer 1

Because datasets with identical summary statistics can have very different distributions (Anscombe's Quartet).

Answer 2

To put features on a comparable scale so that no feature dominates the model merely because of its units.

Answer 3

Normalization: scale to [0,1]; Standardization: mean=0, std=1.
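
The two rescalings above can be sketched in a few lines of standard-library Python. The function names and the sample heights are illustrative assumptions, not course material, and the population standard deviation is used here as one possible choice.

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift to mean 0 and scale to standard deviation 1 (population std)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - mu) / sigma for v in values]

# Hypothetical sample feature (heights in cm)
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
normalized = min_max_normalize(heights)
standardized = standardize(heights)
```

Min-max output is bounded by the observed minimum and maximum, so a single extreme value compresses everything else; standardized values are unbounded.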

Answer 4

To predict a continuous variable based on one or more inputs.

Answer 5

A regression model that uses a sigmoid function to classify between two categories.
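
As a rough sketch of the prediction step only (not of training), the sigmoid turns a weighted sum into a probability, which is then thresholded. The weights, bias, and the 0.5 threshold below are hypothetical values chosen for illustration.

```python
import math

def sigmoid(z):
    """Map any real number to (0, 1); written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

def predict_class(features, weights, bias, threshold=0.5):
    """Return 1 if the predicted probability reaches the threshold, else 0."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Hypothetical one-feature model: weight 1.0, bias 0.0
label_high = predict_class([2.0], weights=[1.0], bias=0.0)
label_low = predict_class([-2.0], weights=[1.0], bias=0.0)
```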

Answer 6

When a model performs well on training data but poorly on unseen data.

Answer 7

By using simpler models, more data, regularisation, or validation techniques.

Answer 8

To evaluate performance of classification models.

Answer 9

Precision: TP / (TP + FP), Recall: TP / (TP + FN).

Answer 10

A graph showing TPR vs FPR at various thresholds.
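
One way to see where the points on such a graph come from is to sweep a threshold over the model's scores and record FPR and TPR at each step. The labels, scores, and thresholds below are made-up illustration data, and a score at or above the threshold is assumed to mean "predict positive".

```python
def roc_points(y_true, scores, thresholds):
    """Return (threshold, FPR, TPR) for each threshold."""
    points = []
    for t in thresholds:
        pairs = list(zip(y_true, scores))
        tp = sum(1 for y, s in pairs if s >= t and y == 1)
        fp = sum(1 for y, s in pairs if s >= t and y == 0)
        fn = sum(1 for y, s in pairs if s < t and y == 1)
        tn = sum(1 for y, s in pairs if s < t and y == 0)
        tpr = tp / (tp + fn) if (tp + fn) else 0.0
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        points.append((t, fpr, tpr))
    return points

# Hypothetical labels and classifier scores
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores, thresholds=[0.0, 0.3, 0.5, 1.0])
```

Lowering the threshold moves the point towards (1, 1); raising it moves the point towards (0, 0).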

Answer 11

Grouping data points so that similar points are in the same group.

Answer 12

Iteratively assigns points to clusters and updates centroids to minimise intra-cluster distance.
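
The assign-then-update loop can be sketched in plain Python as follows. This is a minimal illustration, not a library implementation: the starting centroids are passed in by the caller (an assumption made to keep the example deterministic), and the sample points are invented.

```python
import math

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]

def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_centroids.append(old)  # empty cluster: keep the old centroid
            continue
        new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centroids

def k_means(points, centroids, max_iter=100):
    """Alternate assignment and update until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D data with two obvious groups
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (9.0, 10.0)]
labels, centroids = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```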

Answer 13

Measures how similar a point is to its own cluster vs other clusters; ranges from -1 to 1.
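
For a single point, the usual formula is s = (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the smallest mean distance to any other cluster. The sketch below assumes Euclidean distance and uses invented coordinates; the function names are illustrative.

```python
import math

def mean_distance(point, others):
    """Average Euclidean distance from point to each element of others."""
    return sum(math.dist(point, o) for o in others) / len(others)

def silhouette_for_point(point, own_cluster_others, other_clusters):
    """Silhouette value of one point.

    own_cluster_others: members of the point's cluster, excluding the point.
    other_clusters: list of the remaining clusters (each a list of points).
    """
    if not own_cluster_others:
        return 0.0  # common convention for a single-member cluster
    a = mean_distance(point, own_cluster_others)
    b = min(mean_distance(point, cluster) for cluster in other_clusters)
    if max(a, b) == 0:
        return 0.0
    return (b - a) / max(a, b)

# Hypothetical example: a = 1.0, b = 5.0
score = silhouette_for_point(
    (0.0, 0.0),
    own_cluster_others=[(0.0, 1.0)],
    other_clusters=[[(0.0, 4.0), (0.0, 6.0)]],
)
```

Values near 1 suggest the point sits well inside its cluster, values near 0 suggest it lies between clusters, and negative values suggest it may be in the wrong cluster.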

Answer 14

A technique to choose optimal K in K-means by plotting inertia vs K.

Answer 15

Principal Component Analysis reduces dimensionality while preserving variance.

Answer 16

By projecting data onto new axes of maximum variance, given by the eigenvectors of the covariance matrix.
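
A minimal sketch of this idea, assuming the first principal component is taken as the dominant eigenvector of the covariance matrix and found by power iteration (a swapped-in technique chosen to avoid external libraries; in practice a library routine would normally be used). The data and function names are illustrative.

```python
import math

def covariance_matrix(rows):
    """Return (sample covariance matrix, mean-centred rows)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    return cov, centered

def first_principal_component(cov, iterations=200):
    """Approximate the dominant eigenvector by power iteration.

    Assumes cov is non-zero and the start vector is not orthogonal
    to the dominant eigenvector.
    """
    d = len(cov)
    v = [1.0 + 0.1 * i for i in range(d)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical data lying exactly on the line y = 2x
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
cov, centered = covariance_matrix(rows)
pc1 = first_principal_component(cov)
# One-dimensional coordinates of each sample along the first component
projected = [sum(c * p for c, p in zip(row, pc1)) for row in centered]
```

Because the sample data lie on a single line, one component captures all of the variance and the projection loses no information.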

Answer 17

NumPy, pandas, scikit-learn, matplotlib, seaborn.

Answer 18

Ensures quality input to models; handles missing data, scaling, and formatting.

Answer 19

Supervised uses labeled data; unsupervised uses unlabeled data.

Answer 20

A correct prediction where the model predicts the positive class, and it is actually positive.

Answer 21

The model predicts negative when the actual value is positive.

Answer 22

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Answer 23

Precision = TP / (TP + FP)

Answer 24

Recall = TP / (TP + FN)
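
The accuracy, precision, and recall formulas can be checked against confusion-matrix counts with a short helper. The counts below are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# Hypothetical counts for 100 predictions
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

With these counts the model is right on 85 of 100 cases, 40 of its 45 positive predictions are correct, and it finds 40 of the 50 actual positives.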

Answer 25

True Positive Rate vs False Positive Rate at various threshold settings.

Answer 26

Area Under the (ROC) Curve; a single number summarising a classifier's performance across all thresholds.
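
Given a set of (FPR, TPR) points, the area can be approximated with the trapezoidal rule. The points below are hypothetical and the function name is illustrative.

```python
def auc_trapezoid(roc_points):
    """Approximate the area under a ROC curve given (FPR, TPR) points."""
    pts = sorted(roc_points)  # order by FPR, then TPR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid between neighbours
    return area

# Hypothetical ROC points, including the (0, 0) and (1, 1) end points
roc = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
auc = auc_trapezoid(roc)
```

An area of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of positives above negatives.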

Answer 27

IQR = Q3 - Q1; it shows the range of the middle 50% of data.

Answer 28

Outliers are typically values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
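
The 1.5*IQR rule can be sketched with the standard library. Note that quartile definitions differ between tools; the "inclusive" method of `statistics.quantiles` is an assumption made here, and the data are invented.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return (IQR, (lower fence, upper fence), outliers) using the k*IQR rule."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return iqr, (lower, upper), outliers

# Hypothetical data with one extreme value
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
iqr, fences, outliers = iqr_outliers(data)
```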

Answer 29

The left tail is longer; the mean is typically less than the median.

Answer 30

When the data contains outliers or is skewed.

Answer 31

It centres the data and reduces the influence of differing scales.

Answer 32

When data is bounded or you want values strictly in [0,1].

Answer 33

Distance-based models like K-means and KNN, and models trained with gradient descent.

Answer 34

To reduce the number of features while retaining as much variance as possible.

Answer 35

By calculating principal components with the highest variance.

Answer 36

PCA is sensitive to the scale of features.

Answer 37

A direction along which data varies the most.

Answer 38

To minimise the intra-cluster sum of squared distances to the centroid.

Answer 39

Randomly selects K centroids from the data.

Answer 40

It measures how similar a point is to its own cluster vs others.

Answer 41

The total sum of squared distances of samples to their closest cluster centre.
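
This quantity can be computed directly from the points and the cluster centres. The sketch below assumes squared Euclidean distance; the points and centres are invented for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def inertia(points, centroids):
    """Sum over all points of the squared distance to the nearest centroid."""
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)

# Hypothetical clustered data and its two cluster centres
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (9.0, 10.0)]
centroids = [(1.0, 1.5), (9.0, 9.5)]
total = inertia(points, centroids)
```

Each point here is 0.5 away from its centre, contributing 0.25, so the total is 1.0.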

Answer 42

Find the 'knee' (elbow) point, beyond which increasing K reduces inertia only slowly.

Answer 43

A clustering method that builds a hierarchy of clusters using merges or splits.

Answer 44

A tree diagram used to illustrate hierarchical clustering.

Answer 45

A model that performs well on training data but poorly on new data.

Answer 46

A model that is too simple to capture patterns in the training data.

Answer 47

By splitting data into training and test sets, or using cross-validation.
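
Both approaches can be sketched with the standard library. The function names, the 20% test fraction, and the fixed seed are assumptions for illustration; the k-fold helper uses a simple unshuffled interleaved assignment.

```python
import random

def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle indices reproducibly and hold out a fraction for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(items) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, validation

# Hypothetical dataset of ten samples
train, test = train_test_split(list(range(10)))
folds = list(k_fold_indices(n_samples=6, k=3))
```

In cross-validation every sample is used for validation exactly once, which gives a less noisy performance estimate than a single split.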

Answer 48

Balancing the error from an overly simple model (bias) against the error from sensitivity to the training data (variance).

Answer 49

A statement asserting no effect or relationship exists.

Answer 50

Predicts the direction of the effect (e.g., A > B).

Answer 51

To control variables and test specific hypotheses.

Answer 52

The known output value we want to predict.

Answer 53

An individual measurable property of the phenomenon being observed.

Answer 54

The variance explained by each component, which helps identify the optimal number of principal components to keep by locating the "elbow" point.

Answer 55

To reduce dimensionality while preserving key information.

(79 cards)