DS Foundations Part 1 Flashcards by James Clyne

What is the difference between structured and unstructured data?

Structured data fits into tables with rows and columns; unstructured data includes text, images, audio, etc.

What are examples of categorical data?

Gender, product category, ZIP code (a ZIP code looks numeric but is a label, not a quantity).

What is ordinal data?

Categorical data with a meaningful order, but no fixed spacing (e.g., rankings, Likert scales).

What is nominal data?

Categorical data without an inherent order (e.g., colors, names).

What is the difference between discrete and continuous data?

Discrete data has countable values; continuous data can take any value in a range.

What is Exploratory Data Analysis (EDA)?

A process to summarize data characteristics using visualization and statistics.

What is the purpose of EDA?

To understand data distributions, spot anomalies, and form hypotheses.

What plot is best for showing distribution of a numerical variable?

Histogram or boxplot.
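
A histogram is built by counting how many values fall into each bin. A minimal sketch of that counting step, assuming equal-width bins; the helper name `histogram_counts` and the sample values are made up for illustration:

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Count values per bin, keyed by the left edge of each bin."""
    # Floor division maps each value to the left edge of its bin.
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(counts)

data = [1, 3, 4, 7, 8, 8, 12]  # made-up sample
counts = histogram_counts(data, bin_width=5)

# Print a rough text histogram, one row per bin.
for left_edge in sorted(counts):
    print(f"{left_edge:>3} to {left_edge + 5:<3} {'#' * counts[left_edge]}")
```

Plotting libraries do this binning for you; the sketch only shows what the bar heights represent.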

What plot is best for identifying relationships between two numerical variables?

Scatter plot.

What does a boxplot show?

Median, quartiles, and potential outliers of a dataset.
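
The box and whiskers are drawn from a five-number summary. A minimal sketch using Python's standard `statistics` module; the function name `five_number_summary` and the data are illustrative, and the "inclusive" quartile method is one of several conventions, so other tools may report slightly different quartiles:

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # n=4 asks for the three cut points that split the data into quarters.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

data = [2, 4, 4, 5, 7, 9, 10, 12, 30]  # made-up sample
summary = five_number_summary(data)
print(summary)  # (2, 4.0, 7.0, 10.0, 30)
```

The box spans Q1 to Q3 with a line at the median. Points far beyond the box, such as 30 here, are the ones a boxplot typically draws as individual outlier markers.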

What is the mean?

The arithmetic average: the sum of all data values divided by the number of values.

What is the median?

The middle value when the data is sorted; with an even number of values, the average of the two middle values.

When is the median better than the mean?

When the data is skewed or contains outliers.
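
A small illustration of why, using made-up salary figures: adding one extreme value moves the mean a long way but barely shifts the median.

```python
import statistics

salaries = [40_000, 42_000, 45_000, 47_000, 50_000]  # made-up values
with_outlier = salaries + [1_000_000]                # add one extreme value

mean_before = statistics.mean(salaries)        # 44800
median_before = statistics.median(salaries)    # 45000
mean_after = statistics.mean(with_outlier)     # 204000
median_after = statistics.median(with_outlier) # 46000.0

print(mean_before, "->", mean_after)
print(median_before, "->", median_after)
```

After the outlier is added, the mean no longer resembles any typical salary in the list, while the median still does.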

What is standard deviation?

A measure of how much values vary around the mean.
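
As a sketch, the population standard deviation is the square root of the average squared distance from the mean. The function name `population_std` and the data are illustrative; for a sample rather than a full population, the usual convention divides by n - 1 instead (`statistics.stdev`).

```python
import math
import statistics

def population_std(values):
    """Square root of the mean squared deviation from the mean."""
    mean = sum(values) / len(values)
    variance = sum((x - mean) ** 2 for x in values) / len(values)
    return math.sqrt(variance)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample with mean 5
std = population_std(data)
print(std)  # 2.0

# The standard library offers the same calculation as statistics.pstdev.
std_builtin = statistics.pstdev(data)
```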

What is IQR?

The interquartile range, Q3 − Q1: the spread of the middle 50% of the data.
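
A minimal sketch of the calculation with the standard `statistics` module. The helper name `iqr` and the data are illustrative, and the "inclusive" quartile method is an assumption; other quartile conventions can give slightly different values.

```python
import statistics

def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

data = [1, 3, 5, 7, 9, 11, 13]  # made-up sample
print(iqr(data))  # 6.0  (Q1 = 4.0, Q3 = 10.0)
```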

What is skewness?

A measure of the asymmetry of a distribution.
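
One common way to put a number on it is the moment coefficient of skewness: the third central moment divided by the variance raised to the power 1.5. This is a sketch of that population version only; the function name and data are illustrative, and statistical packages often apply a sample-size correction.

```python
def skewness(values):
    """Population moment coefficient of skewness (no bias correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # variance
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]       # made-up, symmetric about 3
right_tail = [1, 1, 2, 2, 3, 10]  # made-up, one large value on the right

print(skewness(symmetric))   # 0.0
print(skewness(right_tail))  # positive
```

A value near zero suggests symmetry, a positive value a longer right tail, and a negative value a longer left tail.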

What is kurtosis?

A measure of the ‘tailedness’ of a distribution.
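
A sketch of one common definition, excess kurtosis: the fourth central moment divided by the squared variance, minus 3 so that a normal distribution scores 0. The function name and data are illustrative, and this is the uncorrected population version.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3 (no bias correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # variance
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

flat = [1, 2, 3, 4, 5]                          # evenly spread, light tails
heavy = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]       # mostly central, two extremes

print(excess_kurtosis(flat))   # about -1.3
print(excess_kurtosis(heavy))  # 2.0
```

Positive values point to heavier tails than a normal distribution, negative values to lighter ones.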

What does a right-skewed distribution look like?

It has a long tail on the right, with most values clustered on the left; typically mean > median.

What is a uniform distribution?

A distribution where all outcomes have equal probability.

What does a bimodal distribution suggest?

There may be two subgroups or populations in the data.

What are common data quality issues?

Missing values, duplicates, outliers, inconsistent formats.

What is a missing value?

An entry in a dataset that has no recorded value.

How can you handle missing data?

Methods include deletion, mean/median imputation, or predictive models.
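
A minimal sketch of the first two options in plain Python, assuming missing entries are stored as `None`; the variable names and ages are made up for illustration.

```python
import statistics

ages = [34, None, 29, 41, None, 38]  # None marks a missing value

# Option 1: deletion. Keep only the entries that were recorded.
observed = [a for a in ages if a is not None]

# Option 2: median imputation. Fill each gap with the median of the
# observed values, which is less sensitive to outliers than the mean.
fill_value = statistics.median(observed)
imputed = [fill_value if a is None else a for a in ages]

print(observed)  # [34, 29, 41, 38]
print(imputed)   # [34, 36.0, 29, 41, 36.0, 38]
```

Deletion shrinks the dataset, while imputation keeps every row but understates the true variability, so the choice depends on how much data is missing and why.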

What is data duplication?

When the same observation appears more than once unnecessarily.
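
A sketch of removing exact duplicates while keeping the first occurrence of each row. The helper name `drop_duplicates` and the records are hypothetical, and rows are assumed to be hashable tuples.

```python
def drop_duplicates(rows):
    """Return rows with exact repeats removed, preserving original order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:  # first time this exact row appears
            seen.add(row)
            unique.append(row)
    return unique

records = [
    ("alice", "2024-01-05", 19.99),
    ("bob", "2024-01-05", 5.00),
    ("alice", "2024-01-05", 19.99),  # exact repeat of the first row
]
deduplicated = drop_duplicates(records)
print(len(records), "->", len(deduplicated))  # 3 -> 2
```

This only catches exact matches. Near-duplicates, such as the same name with different capitalization, need the inconsistent formats cleaned up first.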

What is outlier detection?

The process of identifying data points that are significantly different from others.
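
One widely used rule of thumb, and not the only method, flags values more than 1.5 × IQR below Q1 or above Q3. A minimal sketch under that assumption; the function name `iqr_outliers`, the multiplier default, and the data are illustrative.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [x for x in values if x < low or x > high]

data = [2, 4, 4, 5, 7, 9, 10, 12, 30]  # made-up sample
print(iqr_outliers(data))  # [30]
```

Here Q1 = 4 and Q3 = 10, so the fences sit at -5 and 19 and only 30 falls outside them. A flagged point is a candidate for review, not automatically an error.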

Why is data ethics important?

To protect individual rights, ensure fairness, and prevent misuse of data.

What is informed consent in data collection?

When individuals agree to data use with full knowledge of risks and purpose.

What is algorithmic bias?

When data or models systematically favor certain groups over others.

What is the difference between correlation and causation?

Correlation shows that two variables move together; causation means a change in one variable actually produces a change in the other.
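
A sketch of measuring association with the Pearson correlation coefficient. The function name `pearson_r` and the figures are invented for illustration: ice cream sales and sunburn cases rise together, yet neither causes the other, since both plausibly depend on hot weather.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

ice_cream_sales = [20, 35, 50, 65, 80]  # made-up weekly figures
sunburn_cases = [3, 5, 8, 11, 12]       # made-up weekly figures

r = pearson_r(ice_cream_sales, sunburn_cases)
print(round(r, 2))  # close to 1: strong association, no causal claim
```

A coefficient near 1 or -1 only says the two series move together. Establishing causation needs a controlled experiment or a careful causal analysis.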

Why is context important in data interpretation?

Because the same metric can mean different things in different domains.

(30 cards)