Data Science Flashcards

(64 cards)

1
Q

What are the 2 types of data?

A
  • Ordinal data has a natural order
  • Nominal data cannot be ranked or measured in any way
2
Q

What is big data?

A
  • Big data is data that is too large to fit on a single computer at once
  • Data with large volume, variety and velocity
  • Big data science addresses issues with big data
3
Q

What is a hypothesis?

A

A hypothesis is a statement which is either true or false and must be disprovable

4
Q

What is the 7-step data science process?

A
  1. Frame the problem
  2. Get the raw data
  3. Pre-process and clean data
  4. Data exploration
  5. Analyse and model data
  6. Validate/evaluate results
  7. Use and communicate results
5
Q

What does pre-processing and cleaning data consist of?

A
  • Handling missing data (interpolation)
  • Deleting incomplete data
  • Data wrangling
6
Q

What 3 things does descriptive statistics do?

A
  • Summarises data (makes it manageable)
  • Extracts insights from data (underlying trends)
  • Gathers knowledge (make targeted decisions about data)
7
Q

What are the 4 measures of central tendency?

A
  • Arithmetic mean
  • Weighted arithmetic mean
  • Median
  • Mode
8
Q

When can you use mean and what is it mathematically?

A
  • Can only use when data is symmetrical with no outliers
  • The mean is the point closest to all data in squared-Euclidean distance (gives larger values higher weight)
9
Q

What is weighted arithmetic mean and what are its benefits?

A
  • Values have different weightings to them
  • More representative
10
Q

When is it best to use the median?

A

When data has outliers or is skewed

11
Q

What is mode best for?

A
  • Skewed distribution
  • Best for categorical data
12
Q

What are the 2 types of measures of spread?

A
  • Empirical (sample) measures for when there is a subset of data
  • True (population) measures for when there is data for entire population
13
Q

What is the variance and what does it mean to be biased?

A
  • Variance is spread around the mean
  • Biased when we don’t have the entire data set (empirical mean)
  • Dividing by n − 1 instead of n (Bessel’s correction) removes the bias
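The n − 1 correction above can be sketched in a few lines of Python (a minimal illustration; the function name and sample values are invented):

```python
# Illustrative sketch (made-up sample): empirical variance with and
# without Bessel's correction (dividing by n - 1 instead of n).
def variance(data, unbiased=True):
    n = len(data)
    mean = sum(data) / n
    squared_devs = sum((x - mean) ** 2 for x in data)
    return squared_devs / (n - 1) if unbiased else squared_devs / n

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # mean = 5.0
biased = variance(sample, unbiased=False)    # 32 / 8 = 4.0
unbiased = variance(sample)                  # 32 / 7 ≈ 4.571
```

The unbiased estimate is always the larger of the two, because dividing by a smaller denominator compensates for the spread lost by measuring deviations from the sample mean rather than the true mean.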
14
Q

Give the definition of experiment, sample space and event

A
  • Experiment: A procedure which yields one of a set of possible outcomes
  • Sample space: The set of possible outcomes of an experiment
  • Event: A specified subset of the set of outcomes of an experiment
15
Q

What is the probability and complement of an event?

A
  • Probability: The sum of the probabilities of the outcomes that make up the event
  • Complement: 1-P(E)
16
Q

Give the definition of random variable and expected value

A
  • Random variable: A numerical function of the outcomes of a probability space
  • Expected value: The sum of the values of the random variable multiplied by their probabilities
17
Q

What is the probability mass function?

A
  • Used for discrete random variables
  • Sums over the specific values of the variable and gives exact probabilities
18
Q

What is the probability density function?

A
  • Used for continuous random variables
  • Integrated to get probabilities over intervals
19
Q

What is the cumulative distribution function?

A
  • Used for cumulative probability
  • Gives the probability that the variable takes a value less than or equal to a given point
20
Q

Explain objective vs subjective probability

A

Objective probability:
- Repeatable events
Subjective probability:
- Unrepeatable events
- Used in the Bayesian interpretation
- Degree of plausibility

21
Q

What is the central limit theorem?

A
  • States that the sampling distribution of a sample mean is well-approximated by a Gaussian/normal distribution as the sample size gets large
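The theorem can be illustrated with a short simulation (not part of the original cards; sample size, trial count and seed are arbitrary): individual draws come from a flat uniform distribution, yet the sample means cluster tightly and symmetrically around the true mean of 0.5.

```python
import random

random.seed(0)  # arbitrary seed for reproducibility

# Draw many sample means from a (non-normal) uniform distribution;
# by the CLT their distribution is approximately normal around 0.5.
n = 100         # size of each sample
trials = 2000   # number of sample means to draw
means = [sum(random.random() for _ in range(n)) / n for _ in range(trials)]
grand_mean = sum(means) / trials
```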
22
Q

Give 4 assumptions of the central limit theorem

A
  • Variables are independent
  • **Identical distribution** (same mean and var)
  • Finite mean and variance
  • Sufficiently large sample size
23
Q

What are the 3 types of uncertainty?

A
  • Epistemic uncertainty
  • Aleatoric uncertainty
  • Ontological uncertainty
24
Q

Explain epistemic uncertainty

A
  • Predictable randomness
  • Reducible
  • Reduced by taking more measurements
25
Q

Explain aleatoric uncertainty

A
  • **Inherent/intrinsic** randomness
  • **Irreducible**
26
Q

Explain ontological uncertainty

A
  • Model is **incomplete**
  • Unaware of **factors** affecting the system
27
Q

What are the 2 main **sources** of uncertainty we deal with in data science, and what do they mean?

A
  • **Measurement uncertainty**: related to the accuracy of the tool or method used
  • **Sampling uncertainty**: arises from the representativeness and size of the subset
28
Q

What are the 3 different types of behaviours? Explain them

A
  • **Deterministic**: behaviours are **pre-determined** and always the same
  • **Stochastic**: **largely the same** with random components
  • **Random**: usually **pseudo-random**, as computers cannot produce true randomness
29
Q

Explain accuracy and the term for the lack thereof

A
  • Degree to which values are **arranged around the true value**
  • Lack of accuracy = **bias** (systematic error)
30
Q

Explain precision and the term for the lack thereof

A
  • Degree to which values are **close to each other** (high repeatability)
  • Lack of precision = **variability**
31
Q

Give the 5 sampling techniques with short explanations

A
  • **Random sampling**: random subset
  • **Systematic sampling**: select values at regular intervals
  • **Stratified sampling**: takes relative samples from different strata
  • **Cluster sampling**: uses all data in a random cluster
  • **Weighted sampling**: assigns probabilities based on volume
32
Q

Give the definition of and 4 advantages of bootstrap sampling

A
  • Sample with replacement to estimate a distribution
  **Advantages**:
  • Can be used for small datasets
  • No distributional assumptions needed
  • Deals with non-normal data
  • Can be applied to any measurable quantity
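A minimal bootstrap sketch (not part of the original cards; the dataset and function name are invented), estimating the sampling distribution of the median, a quantity with no simple closed-form standard error:

```python
import random

random.seed(1)  # arbitrary seed for reproducibility

# Bootstrap: resample with replacement, recompute the statistic each time.
def bootstrap_medians(data, n_resamples=1000):
    medians = []
    for _ in range(n_resamples):
        resample = [random.choice(data) for _ in data]
        medians.append(sorted(resample)[len(resample) // 2])
    return medians

data = [3, 7, 8, 12, 13, 14, 18, 21, 22, 30]
medians = sorted(bootstrap_medians(data))
ci_low, ci_high = medians[25], medians[974]   # rough 95% interval
```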
33
Q

Classical statistics vs computational statistics

A
  • **Classical**: asymptotic distributions and frequentist probability
  • **Computational**: uses computation to make decisions about data
34
Q

What is the null hypothesis, alpha value and p-value?

A
  • **Null hypothesis**: there are no differences, and any observed differences are due to chance
  • **Alpha level**: probability level at which you consider a difference to be real
  • **P-value**: probability of observing a difference at least this large if both means came from the null distribution
35
Q

What does the z-score show?

A
  • How many standard deviations an observation lies from the expected value
  • The z-score is **negative** if the observed proportion is **less** than the expected proportion
36
Q

If we want to compare 3 means, what do we need to assume?

A
  • Independently sampled
  • Free from outliers
  • CLT applies, so empirical means are approx. normal
37
Q

What is joint probability and conditional probability?

A
  • **Joint probability**: intersection between x and y
  • **Conditional probability**: probability of x given y
38
Q

What is the Posterior, Likelihood, Prior and Normalisation?

A
  θ is the parameter
  • **Posterior**: P(θ|Data), updated belief about θ after seeing the data
  • **Likelihood**: P(Data|θ), probability of seeing the data given θ
  • **Prior**: P(θ), our belief about θ before we acquired the data
  • **Normalisation**: P(Data), the evidence/probability of observing the data
39
Q

What is Bayes' rule?

A

Posterior = (likelihood × prior) / evidence
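The rule can be applied with a short worked example (the numbers are made up for illustration): a diagnostic test that is 99% sensitive and 95% specific, for a condition with 1% prevalence.

```python
# Bayes' rule: posterior = (likelihood x prior) / evidence
prior = 0.01                # P(condition)
sensitivity = 0.99          # likelihood: P(positive | condition)
false_positive = 0.05       # P(positive | no condition)

# evidence: total probability of observing a positive result
evidence = sensitivity * prior + false_positive * (1 - prior)

# posterior: P(condition | positive)
posterior = sensitivity * prior / evidence
```

Despite the accurate test, the posterior is only about 1/6, because the prior is so low: most positives come from the large healthy population.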
40
Q

What is a parameter?

A
  • A parameter is a number that defines how a probability distribution behaves
  • Parameters of the **normal**: mean and variance
  • Parameters of the **binomial**: success probability and number of trials
41
Q

What is the MLE and its properties?

A
  • Maximum likelihood estimation
  • The parameter that maximises the likelihood
  **Properties**:
  • Not Bayesian
  • Widely used
  • Returns a single best estimate
  • Can fail on small datasets (e.g. no observed successes gives an estimate of 0)
42
Q

Give the method for MLE

A
  1. Take the log of the likelihood
  2. Differentiate with respect to theta
  3. Set the derivative to zero and solve
  4. Check the second derivative is negative (maximum likelihood)
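The steps above can be sketched on a coin-flip example (illustrative only; the data and function name are invented). Setting the derivative of the log-likelihood to zero gives the closed form θ* = k/n, which a numerical grid search confirms:

```python
import math

# Log-likelihood for k heads in n flips of a coin with success
# probability theta (the constant n-choose-k term is dropped).
def log_likelihood(theta, k, n):
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

k, n = 7, 10
analytic = k / n                               # closed-form MLE
grid = [i / 1000 for i in range(1, 1000)]      # avoid theta = 0 and 1
numeric = max(grid, key=lambda t: log_likelihood(t, k, n))
```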
43
Q

What is the MAP and its properties?

A
  • Maximum a Posteriori estimation
  • Find the parameter that maximises the posterior
  **Properties**:
  • Bayesian version of MLE
  • Incorporates a prior
  • Gives the most probable parameter after seeing the data
  • Uses Laplace smoothing to ensure the estimate is never 0 or 1
44
Q

Give 2 methods for approximating the full/normalised posterior

A
  • **Laplace's method**: approximates a **sharply peaked posterior** by a normal centred at the MAP estimate
  • **Markov chain Monte Carlo (MCMC)**: sample from a complex posterior distribution using the **Metropolis-Hastings algorithm**
45
Q

What are the steps for Laplace's method?

A
  1. Compute the **unnormalised** posterior (likelihood × prior)
  2. Take the log: L(θ) = log(likelihood) + log(prior)
  3. Find the MAP estimate θ* such that L′(θ*) = 0
  4. Compute L″(θ*)
  5. Use the **Taylor expansion** formula for the normal approximation (mean and variance)
46
Q

What are the mean and variance in Laplace's method?

A
  • **Mean**: peak θ* = MAP estimate = mode
  • **Variance**: negative inverse of the second derivative of log(posterior) = −1/L″(θ*), which is positive because L″(θ*) < 0 at a maximum
47
Q

Explain the general steps behind MCMC

A
  • Generates a Markov chain that eventually converges to the posterior
  • Uses the **Metropolis-Hastings algorithm**
48
Q

Why is MCMC used in Bayesian inference?

A
  • To generate samples from the posterior where exact computation is difficult
49
Q

In a Markov chain, what determines the next state?

A
  • Only the current state
  • Transitions are memoryless
50
Q

What components are needed for Metropolis-Hastings?

A
  1. Parameter space θ
  2. Unnormalised posterior
  3. Proposal distribution T(θ′∣θ)
  4. Accept/reject based on the posterior ratio
51
Q

What are the exact steps for Metropolis-Hastings?

A
  • Propose a new θ′ using T(θ′∣θ)
  • Accept θ′ with probability based on the posterior ratio
  • If accepted, the next θ is θ′; if not, θ stays the same
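These steps can be sketched on the coin-flip posterior (a minimal illustration, not a reference implementation; the data, seed and step size are invented). With 7 heads in 10 flips and a uniform prior, the true posterior is Beta(8, 4) with mean 8/12 ≈ 0.667:

```python
import random

random.seed(2)  # arbitrary seed for reproducibility

# Unnormalised posterior: binomial likelihood x uniform prior.
def unnorm_posterior(theta):
    if not 0.0 < theta < 1.0:
        return 0.0          # outside the parameter space
    return theta ** 7 * (1.0 - theta) ** 3

theta = 0.5                 # initial state
samples = []
for _ in range(20000):
    proposal = theta + random.gauss(0.0, 0.1)        # symmetric proposal T(θ'|θ)
    ratio = unnorm_posterior(proposal) / unnorm_posterior(theta)
    if random.random() < ratio:                      # accept with prob min(1, ratio)
        theta = proposal
    samples.append(theta)                            # if rejected, theta repeats

burned = samples[2000:]     # discard burn-in
posterior_mean = sum(burned) / len(burned)
```

Note that the normalising constant cancels in the ratio, which is exactly why Metropolis-Hastings only needs the unnormalised posterior.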
52
Q

In machine learning, explain **examples**, **features**, **class/labels** and **inputs/outputs**

A
  • **Examples**: observations, data points/entries
  • **Features**: independent variables or predictors
  • **Class/label**: dependent variables/outcomes being predicted
  • **Inputs/outputs**: features are inputs, results are outputs
53
Q

What are the 3 dominant learning paradigms in ML?

A
  • **Supervised learning**: uses labelled input-output pairs for training
  • **Reinforcement learning**: agent learns by receiving rewards/penalties
  • **Unsupervised learning**: finds patterns in unlabelled data with no outputs
54
Q

What is supervised learning?

A
  • Uses examples with **known inputs and outputs** for training
  • The algorithm learns to map inputs to the correct outputs
  • Used for **classification** and **regression**
  • Outputs could be labels, probabilities or predictions
55
Q

What is reinforcement learning?

A
  • Agent learns by **interacting** with an environment
  • Agent receives **rewards** or **penalties** for actions
  • Learns a policy mapping **states to actions**
  • **Maximises** cumulative reward over time
56
Q

What is unsupervised learning?

A
  • Works with **unlabelled** data (no pre-defined outputs)
  • Discovers hidden patterns
  • **No prior knowledge** or labelled data is used for learning
57
Q

What are the two main problem types in ML?

A
  • **Classification**: assigns an input to a specific category
  • **Regression**: predicts a continuous value based on the input
58
Q

What is generalisation in ML?

A
  • The ability of a model to perform well on new data
  • Tested using a separate test set
  • Helps **avoid overfitting** to the training set
  • Ensures **unbiased** evaluation
59
Q

What is the difference between the training set and the testing set?

A
  Training set:
  • Used to **train** the model and learn patterns
  • Performance on it can be overly **optimistic**
  Testing set:
  • **Evaluates** the model's performance on unseen data
  • Provides an **unbiased** estimate of accuracy
60
Q

What are the 2 measures of performance in classification?

A
  • **Classification accuracy**: percentage of correct predictions
  • **Misclassification error**: percentage of incorrect predictions
  • A **confusion matrix** helps visualise and evaluate both
61
Q

What is the measure of performance in regression?

A
  • Mean squared error (**MSE**)
  • Average **squared difference** between the predicted and actual values
  • Always computed on the **test set**
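MSE is a one-liner in practice (a minimal sketch; the predictions and function name are invented):

```python
# MSE: average squared difference between predicted and actual values,
# computed on a held-out test set.
def mse(predicted, actual):
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
error = mse(y_pred, y_true)   # (0.25 + 0.25 + 0.0 + 1.0) / 4 = 0.375
```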
62
Q

What are the two approaches to classification in ML?

A
  • **Discriminative approach**: focuses on decision boundaries between classes
  • **Generative approach**: models how the data is generated for each class
63
Q

What is Bayes' rule in classification?

A
  • **Posterior**: updated probability of a class after observing the data
  • **Likelihood**: probability of observing the data given a class
  • **Prior**: initial belief about the probability of each class
  • **Evidence**: overall probability of the data across all classes
64
Q

What are the steps in Bayesian classification?

A
  1. Estimate **priors** based on class frequency
  2. Model feature **distributions**
  3. Compute the **likelihood** for the new data point
  4. Apply Bayes' rule to compute the **posterior**
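The four steps above can be sketched on a made-up 1-D, two-class problem (illustrative only; the dataset, the test point x and the helper names are invented), modelling each class's feature distribution as a normal:

```python
import math

data = {"A": [1.0, 1.2, 0.8, 1.1], "B": [3.0, 3.2, 2.9, 3.1]}

def fit(values):
    # sample mean and (Bessel-corrected) variance of one class
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, var

def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

total = sum(len(v) for v in data.values())
priors = {c: len(v) / total for c, v in data.items()}             # step 1
params = {c: fit(v) for c, v in data.items()}                     # step 2

x = 1.05                                                          # new data point
joint = {c: normal_pdf(x, *params[c]) * priors[c] for c in data}  # step 3
evidence = sum(joint.values())
posterior = {c: joint[c] / evidence for c in data}                # step 4: Bayes' rule
prediction = max(posterior, key=posterior.get)
```

The point x = 1.05 sits well inside class A's distribution, so its posterior dominates and the classifier predicts "A".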