Data Science Flashcards
(64 cards)
What are the 2 types of data?
- Ordinal data has a natural order
- Nominal data cannot be ranked or measured in any way
What is big data?
- Big data is data that is too large to fit on a single computer all at once
- Data with large volume, variety and velocity
- Big data science addresses issues with big data
What is a hypothesis?
A hypothesis is a statement that is either true or false and must be falsifiable (disprovable)
What is the 7-step data science process?
- Frame the problem
- Get the raw data
- Pre-process and clean data
- Data exploration
- Analyse and model data
- Validate/evaluate results
- Use and communicate results
What does pre-processing and cleaning data consist of?
- Handling missing data (interpolation)
- Deleting incomplete data
- Data wrangling
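As a rough illustration of the first two options, the sketch below fills gaps by linear interpolation and, alternatively, deletes the incomplete entries. The helper `interpolate_missing` and the `readings` list are made up for this example, and it only fills gaps that sit between two known values.

```python
def interpolate_missing(values):
    """Linearly fill None gaps that lie between two known values."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled

readings = [1.0, None, 3.0, None, None, 6.0]  # hypothetical data with gaps

interpolated = interpolate_missing(readings)               # option 1: fill the gaps
complete_only = [v for v in readings if v is not None]     # option 2: delete them

print(interpolated)
print(complete_only)
```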
What 3 things does descriptive statistics do?
- Summarises data (makes it manageable)
- Extracts insights from data (underlying trends)
- Gathers knowledge (make targeted decisions about data)
What are the 4 measures of central tendency?
- Arithmetic mean
- Weighted arithmetic mean
- Median
- Mode
When can you use mean and what is it mathematically?
- Best used when data is roughly symmetrical with no extreme outliers
- Mean is the point that minimises the sum of squared Euclidean distances to all data points (so values far from it carry more weight)
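A small check of the squared-distance property, using made-up numbers: moving the candidate point away from the mean in either direction should increase the total squared distance.

```python
from statistics import mean

data = [1.0, 2.0, 6.0]  # hypothetical data

def sum_squared_distance(centre):
    """Total squared distance from `centre` to every data point."""
    return sum((x - centre) ** 2 for x in data)

m = mean(data)
at_mean = sum_squared_distance(m)
below = sum_squared_distance(m - 0.5)
above = sum_squared_distance(m + 0.5)

print(m, at_mean, below, above)  # 3.0 14.0 14.75 14.75
```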
What is weighted arithmetic mean and what are its benefits?
- Values have different weightings to them
- More representative
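A minimal sketch of the difference, assuming three hypothetical exam scores where the later exams count double. The weighted mean is the sum of weight times value, divided by the sum of the weights.

```python
from statistics import mean

scores = [70, 80, 90]  # hypothetical exam scores
weights = [1, 2, 2]    # hypothetical weights: later exams count double

plain_mean = mean(scores)
weighted_mean = sum(w * x for w, x in zip(weights, scores)) / sum(weights)

print(plain_mean, weighted_mean)  # 80 82.0
```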
When is it best to use the median?
When data has outliers or is skewed
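To see why, the sketch below adds one extreme value to a small made-up data set. The mean moves a long way, while the median barely changes.

```python
from statistics import mean, median

data = [2, 3, 3, 4, 5]       # hypothetical data
with_outlier = data + [100]  # same data plus one extreme value

print(mean(data), median(data))                  # 3.4 3
print(mean(with_outlier), median(with_outlier))  # 19.5 3.5
```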
What is mode best for?
- Skewed distribution
- Best for categorical data
What are the 2 types of measures of spread?
- Empirical (sample) measures for when there is a subset of data
- True (population) measures for when there is data for entire population
What is the variance and what does it mean to be biased?
- Variance is spread around the mean
- The sample variance is biased when it is computed around the empirical (sample) mean because we don’t have the entire data set
- Dividing by n - 1 instead of n (Bessel’s correction) removes the bias
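The two estimators can be compared on a small made-up sample. In Python's standard library, `statistics.pvariance` divides by n and `statistics.variance` divides by n - 1.

```python
from statistics import pvariance, variance

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
n = len(sample)
sample_mean = sum(sample) / n

squared_deviations = sum((x - sample_mean) ** 2 for x in sample)
biased = squared_deviations / n          # divides by n
unbiased = squared_deviations / (n - 1)  # divides by n - 1

print(biased, pvariance(sample))
print(unbiased, variance(sample))
```

The version that divides by n - 1 is always slightly larger, which compensates for measuring spread around the sample mean rather than the true mean.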
Give the definition of experiment, sample space and event
- Experiment: A procedure which yields one of a set of possible outcomes
- Sample space: The set of possible outcomes of an experiment
- Event: A specified subset of the set of outcomes of an experiment
What is the probability and complement of an event?
- Probability: The sum of the probabilities of the outcomes that make up the event
- Complement: The event that E does not occur, with probability 1 - P(E)
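A sketch using one roll of a fair six-sided die as the experiment (the die and the chosen event are assumptions for illustration). Exact fractions avoid floating-point rounding.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}                          # outcomes of one roll
probability = {o: Fraction(1, 6) for o in sample_space}    # fair die

event = {1, 2}  # "roll a 1 or a 2", a subset of the sample space

p_event = sum(probability[o] for o in event)
p_complement = 1 - p_event

print(p_event, p_complement)  # 1/3 2/3
```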
Give the definition of random variable and expected value
- Random variable: A numerical function of the outcomes of a probability space
- Expected value: The sum of the values of the random variable, each multiplied by its probability
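As an illustration, assume a fair die and a made-up betting game: the function `winnings` is a random variable because it maps each outcome to a number.

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]
probability = Fraction(1, 6)  # fair die: every outcome equally likely

def winnings(outcome):
    """Hypothetical game: win 10 on a six, lose 2 otherwise."""
    return 10 if outcome == 6 else -2

expected_face = sum(o * probability for o in outcomes)
expected_winnings = sum(winnings(o) * probability for o in outcomes)

print(expected_face, expected_winnings)  # 7/2 0
```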
What is the probability mass function?
- Used for discrete random variables
- Summed over specific values of the variable to give exact probabilities
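One way to picture this is the total of two fair dice (an assumed example). The PMF is built by counting the equally likely rolls, and event probabilities come from summing PMF values.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))  # all 36 equally likely rolls
counts = Counter(a + b for a, b in rolls)
pmf = {total: Fraction(count, len(rolls)) for total, count in counts.items()}

p_seven = pmf[7]                                                     # exact P(total = 7)
p_at_least_ten = sum(p for total, p in pmf.items() if total >= 10)   # sum over values

print(p_seven, p_at_least_ten, sum(pmf.values()))  # 1/6 1/6 1
```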
What is the probability density function?
- Used for continuous random variables
- Integrated to get probabilities over intervals
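A sketch of the integration step, assuming a standard normal distribution and a simple trapezoidal rule (the helper `integrate` is written for this example). The area under the density between -1 and 1 should come out at roughly 0.68.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def integrate(f, a, b, steps=1000):
    """Trapezoidal approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * width)
    return total * width

# Probability that the variable falls within one standard deviation of the mean
p_within_one_sd = integrate(standard_normal.pdf, -1.0, 1.0)

print(round(p_within_one_sd, 4))
```

Note that the density at a single point, such as `standard_normal.pdf(0.0)`, is not itself a probability; only areas under the curve are.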
What is the cumulative distribution function?
- Used for cumulative probability
- Gives the probability that the variable takes a value less than or equal to a certain point
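For a discrete variable the CDF is a running total of the PMF. The sketch below assumes a fair die; differences of CDF values give the probability of a range.

```python
from fractions import Fraction

pmf = {face: Fraction(1, 6) for face in range(1, 7)}  # fair die

def cdf(x):
    """P(X <= x): add up the PMF for every face not exceeding x."""
    return sum(p for face, p in pmf.items() if face <= x)

p_at_most_four = cdf(4)
p_three_or_four = cdf(4) - cdf(2)  # P(2 < X <= 4)

print(p_at_most_four, p_three_or_four, cdf(6))  # 2/3 1/3 1
```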
Explain objective vs subjective probability
Objective probability:
- Repeatable events
Subjective probability:
- Unrepeatable events
- Used in Bayesian interpretation
- Degree of plausibility
What is the central limit theorem?
- States that the sampling distribution of a sample mean is well-approximated by a Gaussian/normal distribution as the sample size gets large
Give 4 assumptions of the central limit theorem
- Variables are independent
- Identical distribution (same mean and variance)
- Finite mean and variance
- Sufficiently large sample size
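A small simulation can make the theorem concrete. The sketch below assumes independent draws from a uniform distribution on [0, 1), which is far from normal but has finite mean 1/2 and variance 1/12. The sample size, repetition count, and seed are arbitrary choices for this example.

```python
import random
from statistics import mean, stdev

random.seed(0)  # arbitrary seed so the run is repeatable

sample_size = 30
repetitions = 5000

# Each entry is the mean of one independent sample of uniform draws
sample_means = [
    sum(random.random() for _ in range(sample_size)) / sample_size
    for _ in range(repetitions)
]

# Theory: the sample means centre on 0.5 with standard deviation sqrt((1/12) / n)
theoretical_sd = (1 / 12 / sample_size) ** 0.5

# For a normal distribution, about 68% of values lie within one standard deviation
within_one_sd = sum(abs(m - 0.5) <= theoretical_sd for m in sample_means) / repetitions

print(mean(sample_means), stdev(sample_means), theoretical_sd, within_one_sd)
```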
What are the 3 types of uncertainty?
- Epistemic uncertainty
- Aleatoric uncertainty
- Ontological uncertainty
Explain epistemic uncertainty
- Predictable randomness
- Reducible
- Reduced by taking more measurements