CS165 Flashcards

(79 cards)

1
Q

What is the PPDAC model?

A

A structured approach to carrying out investigative research; exploratory and inquisitive.

2
Q

What is the 1st stage of the PPDAC model?

A

Problem - requires a preliminary understanding of the area in order to identify the kind of question to solve.

3
Q

What is the 2nd stage of the PPDAC model?

A

Plan - What you need to do to address that problem

4
Q

What is the 3rd stage of the PPDAC model?

A

Data - Collection, processing, management

5
Q

Tabular

A

2D table, each row an observation and each column a measurement

6
Q

Structured

A

Each observation represented by a dictionary of keys and values

7
Q

Semi-structured

A

Not all records are represented by the same keys
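
As a rough illustration of the three data shapes on the cards above (tabular, structured, semi-structured), the sketch below uses plain Python lists and dictionaries. The field names, the values, and the helper `shares_keys` are invented for the example.

```python
# Tabular: a 2D table; each row is an observation, each column a measurement.
columns = ["name", "age", "city"]
table = [
    ["Ada", 36, "London"],
    ["Linus", 28, "Helsinki"],
]

# Structured: each observation is a dictionary with the same keys.
structured = [dict(zip(columns, row)) for row in table]

# Semi-structured: records need not share the same keys.
semi_structured = [
    {"name": "Ada", "age": 36, "city": "London"},
    {"name": "Linus", "languages": ["C", "Python"]},  # no age or city
]

def shares_keys(records):
    """Return True when every record has exactly the same set of keys."""
    key_sets = {frozenset(r) for r in records}
    return len(key_sets) <= 1

print(shares_keys(structured))       # True
print(shares_keys(semi_structured))  # False
```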

8
Q

What is the 4th stage of the PPDAC model?

A

Analysis - visualise the data, develop initial questions, communicate findings.

9
Q

Bar charts

A

Use bars to represent counts of categorical features.

10
Q

Histogram

A

Shows distribution (frequency of occurrence) across a range, with values binned into brackets.
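
The binning step behind a histogram can be sketched in a few lines. This is a minimal illustration, assuming equal-width, half-open brackets; the function name `bin_counts` and the sample ages are made up for the example.

```python
def bin_counts(values, start, width, n_bins):
    """Count how many values fall into each half-open bracket [lo, lo + width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:  # values outside the range are ignored
            counts[index] += 1
    return counts

ages = [3, 7, 12, 15, 18, 21, 22, 29, 34]  # invented sample data
counts = bin_counts(ages, start=0, width=10, n_bins=4)
print(counts)  # [2, 3, 3, 1]

# A crude text histogram: one '#' per observation in each bracket.
for i, c in enumerate(counts):
    print(f"[{i * 10:>2}, {(i + 1) * 10:>2}): {'#' * c}")
```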

11
Q

Scatter Plot

A

Shows relationships between two variables within multivariate data.

12
Q

Top-down

A

Applies previous knowledge to data, commonly via rules or choices.

13
Q

Bottom-up

A

Builds knowledge from the data, allowing a system to learn its own behaviour based on what it observes.

14
Q

What is the 5th stage of the PPDAC model?

A

Conclusion - summarise and communicate, reflect and look forward.

15
Q

What is the PPDAC model used for in data science?

A

To structure investigative research: Problem, Plan, Data, Analysis, Conclusion.

16
Q

How does the CRISP-DM model differ from PPDAC?

A

CRISP-DM is more business-centric, includes deployment and business understanding.

17
Q

What is data science?

A

An interdisciplinary field combining statistics, computing, and domain expertise to extract insights from data.

18
Q

What are the two main categories of data?

A

Quantitative (numerical) and Qualitative (categorical).

19
Q

What is the difference between nominal and ordinal data?

A

Nominal has no order (e.g. color), ordinal has order (e.g. ratings).

20
Q

Define a scalar, vector, matrix, and tensor in terms of data representation.

A

Scalar: single value, Vector: 1D array, Matrix: 2D array, Tensor: multi-dimensional array.

21
Q

What are the three measures of central tendency?

A

Mean, Median, and Mode.

22
Q

What is standard deviation and why is it useful?

A

Measures spread of data around the mean; useful for understanding variability.

23
Q

How do you calculate the IQR?

A

IQR = Q3 - Q1, where Q1 and Q3 are the first and third quartiles.

24
Q

When is a boxplot useful?

A

For visualising the distribution, spread, and outliers of a dataset.

25
Why should visualisations be chosen carefully?
Because datasets with identical summary statistics can have very different distributions (Anscombe's Quartet).
26
What is feature scaling and why is it important?
To standardise the range of features so models treat them equally.
27
Differentiate between normalization and standardization.
Normalization: scale to [0,1]; Standardization: mean=0, std=1.
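
A minimal sketch of the two scalings named above, using only the standard library. The function names and the sample heights are illustrative, and the population standard deviation is an assumption (some tools use the sample version instead).

```python
import statistics

def normalise(values):
    """Min-max scaling to [0, 1]. Assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardise(values):
    """Z-score scaling: subtract the mean, divide by the population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

heights_cm = [150, 160, 170, 180, 190]  # invented sample data
print(normalise(heights_cm))    # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardise(heights_cm))  # values centred on 0 with unit spread
```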
28
What is linear regression used for?
To predict a continuous variable based on one or more inputs.
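
For a single input, the prediction idea can be sketched with ordinary least squares written out by hand. The helper `fit_line` and the data are invented for illustration; in practice a library such as scikit-learn would normally be used.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)  # assumes xs are not all equal
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

hours = [1, 2, 3, 4, 5]     # invented input
scores = [3, 5, 7, 9, 11]   # invented target, exactly 2 * hours + 1
slope, intercept = fit_line(hours, scores)
print(slope, intercept)       # 2.0 1.0
print(slope * 6 + intercept)  # prediction for 6 hours: 13.0
```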
29
Define logistic regression.
A classification model that passes a weighted sum of the inputs through a sigmoid function to estimate the probability of one of two categories.
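
The sigmoid step can be shown on its own. This is only a sketch of the prediction side, assuming one input feature; the `weight` and `bias` values are made up rather than fitted to data.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, weight, bias, threshold=0.5):
    """Return (probability of the positive class, predicted label)."""
    p = sigmoid(weight * x + bias)
    return p, int(p >= threshold)

# weight and bias are made-up values, not fitted to any data
print(predict(0.0, weight=1.5, bias=0.0))   # (0.5, 1)
print(predict(-2.0, weight=1.5, bias=0.0))  # probability below 0.5, label 0
```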
30
What is overfitting?
When a model performs well on training data but poorly on unseen data.
31
How can overfitting be prevented?
By using simpler models, more data, regularisation, or validation techniques.
32
What is the confusion matrix used for?
To evaluate performance of classification models.
33
Define precision and recall.
Precision: TP / (TP + FP), Recall: TP / (TP + FN).
34
What is the ROC curve?
A graph showing TPR vs FPR at various thresholds.
35
What is clustering?
Grouping data points so that similar points are in the same group.
36
Describe the K-means clustering algorithm.
Iteratively assigns points to clusters and updates centroids to minimise intra-cluster distance.
37
What is the Silhouette Score?
Measures how similar a point is to its own cluster vs other clusters; ranges from -1 to 1.
38
What is the elbow method?
A technique to choose optimal K in K-means by plotting inertia vs K.
39
What is PCA and when is it used?
Principal Component Analysis reduces dimensionality while preserving variance.
40
How is PCA implemented?
By projecting data onto new axes with maximum variance, derived from eigenvectors.
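
To make the eigenvector idea concrete, here is a sketch restricted to two features, where the eigenvalues of the covariance matrix have a closed form. The function `pca_2d` and the data are hypothetical; real work would typically rely on NumPy or scikit-learn instead.

```python
import math

def pca_2d(points):
    """Return [(variance, unit direction), ...] for 2D data, largest variance first."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance matrix [[a, b], [b, c]] of the centred data.
    a = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    mid = (a + c) / 2
    half_gap = math.hypot((a - c) / 2, b)
    lam1, lam2 = mid + half_gap, mid - half_gap
    if b != 0:
        v1 = (b, lam1 - a)  # eigenvector belonging to lam1
    else:  # covariance is already diagonal
        v1 = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(*v1)
    v1 = (v1[0] / norm, v1[1] / norm)
    v2 = (-v1[1], v1[0])  # perpendicular to the first direction
    return [(lam1, v1), (lam2, v2)]

points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]  # invented data
(var1, dir1), (var2, dir2) = pca_2d(points)
print("first direction:", dir1)
print("share of variance kept:", var1 / (var1 + var2))
```

Keeping only the first direction reduces the two features to one while retaining most of the variance in this made-up dataset.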
41
What Python libraries are common in data science?
NumPy, pandas, scikit-learn, matplotlib, seaborn.
42
Why is data pre-processing crucial?
Ensures quality input to models; handles missing data, scaling, and formatting.
43
What is the difference between supervised and unsupervised learning?
Supervised uses labeled data; unsupervised uses unlabeled data.
44
What is a true positive (TP) in a classification task?
A correct prediction where the model predicts the positive class, and it is actually positive.
45
What is a false negative (FN)?
The model predicts negative when the actual value is positive.
46
How is accuracy calculated?
Accuracy = (TP + TN) / (TP + TN + FP + FN)
47
Define precision in classification.
Precision = TP / (TP + FP)
48
Define recall in classification.
Recall = TP / (TP + FN)
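
The three formulas above (accuracy, precision, recall) can be checked with a small helper. The function name and the counts are invented for the example.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)  # assumes at least one positive prediction
    recall = tp / (tp + fn)     # assumes at least one actual positive
    return accuracy, precision, recall

# Invented counts: 8 TP, 6 TN, 2 FP, 4 FN
accuracy, precision, recall = classification_metrics(tp=8, tn=6, fp=2, fn=4)
print(accuracy)          # 0.7
print(precision)         # 0.8
print(round(recall, 3))  # 0.667
```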
49
What does the ROC curve represent?
True Positive Rate vs False Positive Rate at various threshold settings.
50
What does AUC stand for and what does it measure?
Area Under the (ROC) Curve; it summarises a classifier's performance across all thresholds, where 1.0 is perfect and 0.5 is chance level.
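
The ROC and AUC cards above can be tied together with a short sketch: sweep the threshold over the predicted scores, record (FPR, TPR) at each step, then integrate with the trapezoid rule. The function names, labels, and scores are illustrative assumptions.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, one per threshold, starting from (0, 0).

    Assumes labels are 0/1 and that both classes are present.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]               # invented ground truth
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # invented model scores
print(round(auc(roc_points(labels, scores)), 3))  # 0.889
```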
51
What is the interquartile range (IQR)?
IQR = Q3 - Q1; it shows the range of the middle 50% of data.
52
How do you identify outliers using IQR?
Outliers are typically values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
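
A short sketch of the 1.5 * IQR rule using the standard library. Quartile conventions differ between tools; the `"inclusive"` method chosen here is an assumption, and the data are invented.

```python
import statistics

def iqr_outliers(values):
    """Return the values lying outside the 1.5 * IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # invented sample with one extreme value
print(iqr_outliers(data))  # [100]
```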
53
What does a negative skew in a distribution indicate?
The left tail is longer; the mean is typically less than the median.
54
When is the mean not a good measure of central tendency?
When the data contains outliers or is skewed.
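
A small made-up example shows the effect: one extreme value drags the mean far from the bulk of the data, while the median barely moves.

```python
import statistics

salaries = [30, 32, 35, 31, 400]  # invented, in thousands; one extreme value
mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)
print(mean_salary)    # 105.6
print(median_salary)  # 32
```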
55
Why is standardisation useful in modelling?
It centres the data and reduces the influence of differing scales.
56
When should you prefer normalisation over standardisation?
When data is bounded or you want values strictly in [0,1].
57
Which types of models are sensitive to feature scaling?
Distance-based models such as K-means and KNN, and models trained with gradient descent.
58
What is the main goal of PCA?
To reduce the number of features while retaining as much variance as possible.
59
How does PCA decide which directions to keep?
By ranking the principal components by the variance they explain and keeping those with the highest.
60
Why must data be standardised before PCA?
PCA is sensitive to the scale of features, so features with larger ranges would otherwise dominate the components.
61
What is an eigenvector in PCA?
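A principal direction of the data's covariance matrix; the eigenvector with the largest eigenvalue is the direction along which the data varies the most.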
62
What is the objective of K-means clustering?
To minimise the intra-cluster sum of squared distances to the centroid.
63
How does K-means clustering initialise?
Randomly selects K centroids from the data.
64
How does the silhouette score evaluate clustering?
It measures how similar a point is to its own cluster vs others.
65
What is inertia in K-means?
The total sum of squared distances of samples to their closest cluster centre.
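
The K-means cards above (random initialisation from the data, the assign and update loop, and inertia) can be sketched in plain Python. This is a minimal illustration rather than a production implementation; the function names, the `seed` parameter, and the sample points are assumptions.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, seed=0, max_iter=100):
    """Return (centroids, labels, inertia) for a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initialisation from the data
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Inertia: total squared distance from each point to its own centroid.
    inertia = sum(squared_distance(p, centroids[lab])
                  for p, lab in zip(points, labels))
    return centroids, labels, inertia

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]  # invented data
centroids, labels, inertia = k_means(points, k=2)
print(labels)
print(round(inertia, 3))  # 2.667
```

Which cluster receives label 0 depends on the random initialisation, so only the grouping is meaningful, not the label values themselves.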
66
How does the elbow method help determine K?
Find the 'knee' point where inertia starts decreasing slowly.
67
What is hierarchical clustering?
A clustering method that builds a hierarchy of clusters using merges or splits.
68
What is a dendrogram?
A tree diagram used to illustrate hierarchical clustering.
69
What is overfitting?
A model that performs well on training data but poorly on new data.
70
What is underfitting?
A model that is too simple to capture patterns in the training data.
71
How can you guard against overfitting?
By splitting data into training and test sets, or using cross-validation.
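
Cross-validation can be sketched as repeatedly holding out one fold for testing and training on the rest. The generator below is a simplified illustration with an invented name; it does not shuffle, which real data would usually need first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first few folds.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

for train, test in k_fold_indices(6, 3):
    print(train, test)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```

Each sample appears in exactly one test fold, so every observation is used for evaluation once and for training in the remaining rounds.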
72
What is the bias-variance tradeoff?
Balancing the error from an overly simple model (bias) against the error from sensitivity to the training data (variance).
73
What is a null hypothesis?
A statement asserting no effect or relationship exists.
74
What is a directional hypothesis?
Predicts the direction of the effect (e.g., A > B).
75
What is the purpose of experiment design in data science?
To control variables and test specific hypotheses.
76
What is a label in supervised learning?
The known output value we want to predict.
77
What is a feature in machine learning?
An individual measurable property of the phenomenon being observed.
78
What is a scree plot?
A plot of the variance explained by each component; locating the “elbow” point helps identify the optimal number of principal components to keep.
79
What is the main point of PCA?
To reduce dimensionality while preserving key information.