DS Foundations Part 1 Flashcards
(30 cards)
What is the difference between structured and unstructured data?
Structured data fits into tables with rows and columns; unstructured data includes text, images, audio, etc.
What are examples of categorical data?
Gender, product category, ZIP code (numeric in form, but a label rather than a quantity).
What is ordinal data?
Categorical data with a meaningful order, but no fixed spacing (e.g., rankings, Likert scales).
What is nominal data?
Categorical data without an inherent order (e.g., colors, names).
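The ordinal/nominal distinction determines which operations make sense. A minimal sketch, assuming a hypothetical three-point Likert scale and a hypothetical list of colors: ordinal labels can be ranked and sorted, while nominal labels can only be counted or compared for equality.

```python
from collections import Counter

# Ordinal: the labels carry an order, so each one maps to a rank.
# The scale and the responses are made-up illustration data.
likert_rank = {"disagree": 1, "neutral": 2, "agree": 3}
responses = ["agree", "disagree", "neutral", "agree"]
sorted_responses = sorted(responses, key=likert_rank.get)

# Nominal: no order exists, so counting is the meaningful summary.
colors = ["red", "blue", "red", "green"]
color_counts = Counter(colors)
most_common_color = color_counts.most_common(1)[0][0]
```

Note that the ranks 1, 2, 3 only encode order; the gaps between them carry no meaning.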
What is the difference between discrete and continuous data?
Discrete data has countable values; continuous data can take any value in a range.
What is Exploratory Data Analysis (EDA)?
A process to summarize data characteristics using visualization and statistics.
What is the purpose of EDA?
To understand data distributions, spot anomalies, and form hypotheses.
What plot is best for showing distribution of a numerical variable?
Histogram or boxplot.
What plot is best for identifying relationships between two numerical variables?
Scatter plot.
What does a boxplot show?
Median, quartiles, and potential outliers of a dataset.
What is the mean?
The arithmetic average: the sum of all values divided by the number of values.
What is the median?
The middle value when the data is sorted; with an even number of values, the average of the two middle values.
When is the median better than the mean?
When the data is skewed or contains outliers.
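A small sketch of why the median is more robust, using Python's standard `statistics` module. The salary figures are hypothetical illustration data; adding one extreme value pulls the mean far more than the median.

```python
import statistics

salaries = [40, 42, 45, 47, 50]    # hypothetical values, in thousands
with_outlier = salaries + [500]    # the same data plus one extreme value

mean_before = statistics.mean(salaries)            # 44.8
median_before = statistics.median(salaries)        # 45

mean_after = statistics.mean(with_outlier)         # roughly 120.7
median_after = statistics.median(with_outlier)     # 46.0, average of 45 and 47
```

The single outlier nearly triples the mean, while the median moves by one unit.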
What is standard deviation?
A measure of how much values typically deviate from the mean; it is the square root of the variance.
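As an illustration, the standard `statistics` module offers both the population and the sample form of the standard deviation. The data below is hypothetical, chosen so the population result is a round number.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data; the mean is 5

# Population standard deviation: squared deviations are averaged over n.
population_sd = statistics.pstdev(values)   # 2.0

# Sample standard deviation: squared deviations are divided by n - 1,
# which gives a slightly larger value.
sample_sd = statistics.stdev(values)
```

Which form applies depends on whether the data is the whole population or a sample drawn from it.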
What is IQR?
The interquartile range, Q3 − Q1: the spread of the middle 50% of the data.
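A minimal sketch of computing the IQR with `statistics.quantiles` (Python 3.8+). The data is hypothetical, and the 1.5 × IQR outlier fences are an assumption here: they are a common boxplot convention, not something this deck specifies. Quartile values can also differ slightly between interpolation methods.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50]   # hypothetical data

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Assumed convention: flag points beyond 1.5 * IQR from the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
```

With this data the value 50 falls above the upper fence and is the only point flagged.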
What is skewness?
A measure of the asymmetry of a distribution.
What is kurtosis?
A measure of the ‘tailedness’ of a distribution.
What does a right-skewed distribution look like?
It has a long tail on the right; the mean is typically greater than the median.
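Skewness and kurtosis can be sketched from central moments. The helper name `moments_shape` and the data are hypothetical, and this uses the simple population (moment-based) definitions; statistical libraries may apply bias corrections and report slightly different numbers.

```python
import statistics

def moments_shape(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # variance
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3   # 0 for a normal distribution
    return skewness, excess_kurtosis

right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]   # hypothetical data, long right tail
skew, excess_kurt = moments_shape(right_skewed)
mean_value = statistics.mean(right_skewed)       # 4.7
median_value = statistics.median(right_skewed)   # 3.0

symmetric_skew, _ = moments_shape([1, 2, 3, 4, 5])   # symmetric data: skewness 0
```

Here the skewness is positive and the mean sits above the median, matching the right-skewed pattern described in the card.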
What is a uniform distribution?
A distribution where all outcomes (or, for a continuous variable, all equal-width intervals in the range) are equally likely.
What does a bimodal distribution suggest?
There may be two subgroups or populations in the data.
What are common data quality issues?
Missing values, duplicates, outliers, inconsistent formats.
What is a missing value?
An entry in a dataset that has no recorded value.
How can you handle missing data?
Methods include deleting the affected rows or columns, imputing the mean or median, or predicting the missing values with a model.
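The deletion and imputation options can be sketched in plain Python. This assumes missing entries are represented as `None`, and the ages are hypothetical; model-based imputation is left out of the sketch.

```python
import statistics

ages = [25, None, 31, 40, None, 28]   # hypothetical column with two missing values

# Deletion: keep only the observed entries.
observed = [a for a in ages if a is not None]

# Mean and median imputation: fill each gap with a summary of the observed values.
fill_mean = statistics.mean(observed)       # 31
fill_median = statistics.median(observed)   # 29.5
mean_filled = [a if a is not None else fill_mean for a in ages]
median_filled = [a if a is not None else fill_median for a in ages]
```

In pandas, the roughly equivalent calls are `DataFrame.dropna` and `DataFrame.fillna`.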
What is data duplication?
When the same observation appears more than once unnecessarily.
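A minimal sketch of detecting and removing exact duplicates, assuming each observation is a tuple and two rows count as duplicates only when every field matches. The records are hypothetical.

```python
# Hypothetical records: (id, name, age). The third row repeats the first.
rows = [
    ("a01", "Ana", 30),
    ("a02", "Ben", 25),
    ("a01", "Ana", 30),
]

# dict.fromkeys keeps the first occurrence of each row and preserves order.
unique_rows = list(dict.fromkeys(rows))
n_duplicates = len(rows) - len(unique_rows)
```

In pandas, `DataFrame.duplicated` and `DataFrame.drop_duplicates` cover the same task.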