T4: EDA Flashcards

1
Q

What is EDA?

A
  • Exploratory Data Analysis (EDA): a crucial first step in the data analysis process.
  • Helps you understand the data’s main characteristics before modeling.
2
Q

Why EDA?

A
  • Offers a clear insight into the underlying structure of the data.
  • Helps identify obvious errors and outliers.
  • Provides a foundation for subsequent analysis.
3
Q

IMPORTANCE OF EDA

A
  • Uncovering patterns: Helps detect and visualize patterns in the data.
  • Identifying anomalies: Spot potential outliers or mistakes in the data.
  • Informing model selection: Understand which models might work best.
  • Validating assumptions: Ensure data meets assumptions required by
    modeling techniques.
4
Q

STEPS IN EDA

A

1) Data Collection: Gathering relevant data from various sources (last time)
2) Data Cleaning: Preparing the data for analysis (last time)
3) Data Visualization: Using plots and charts to understand data (today)
4) Statistical Analysis: Applying stats to derive insights (next time)

5
Q

1) DATA COLLECTION (RECAP)

A
  • Sources of Data: Surveys, databases, logs, etc.
  • Diverse and Accurate Data: Use varied, reliable sources to reduce bias in the results.
  • Initial Observations: First look at raw data for obvious issues or patterns.
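
A first look at raw data can be sketched in base R as follows. This is only an illustration: the built-in airquality dataset stands in for newly collected data, and raw_data is a placeholder name that does not come from the original notes.

```r
# Assumption: airquality (ships with R) stands in for freshly collected data
raw_data <- airquality

dim(raw_data)      # number of rows and columns
str(raw_data)      # column names and types
head(raw_data)     # first six rows
summary(raw_data)  # per-column ranges, quartiles, and NA counts

n_rows <- nrow(raw_data)
n_cols <- ncol(raw_data)
```

The summary() output is often enough to spot obvious issues such as impossible values or columns with many NAs.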
6
Q

2) DATA CLEANING (HANDLING MISSING VALUES)

A

R Code:
# Identify missing values (logical matrix, TRUE where a value is NA)
missing_values <- is.na(data)

# Remove rows with missing values
cleaned_data <- na.omit(data)
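
Before dropping rows it usually helps to see where the missing values are. The sketch below assumes the built-in airquality dataset as example data; na_per_column and rows_complete are illustrative names, not part of the original card.

```r
# Assumption: airquality is used only because it contains NAs out of the box
data <- airquality

na_per_column <- colSums(is.na(data))  # count of missing values in each column
rows_complete <- complete.cases(data)  # TRUE for rows with no NA at all
cleaned_data  <- na.omit(data)         # keeps only the complete rows

na_per_column
nrow(data) - nrow(cleaned_data)        # number of rows dropped by na.omit()
```

If a large share of rows would be dropped, removing them may discard too much information, and imputation may be worth considering instead.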

7
Q

2) DATA CLEANING (DEALING WITH OUTLIERS)

A

Definition: Data points that differ significantly from others.
Types: Point outliers, contextual outliers, and collective outliers.

R Code:
# Boxplot to visualize outliers
boxplot(data$column_name)

# IQR method to identify outliers
iqr_value <- IQR(data$column_name)  # named iqr_value so it does not mask IQR()
upper_bound <- quantile(data$column_name, 0.75) + 1.5 * iqr_value
lower_bound <- quantile(data$column_name, 0.25) - 1.5 * iqr_value
outliers <- data$column_name[data$column_name > upper_bound |
                             data$column_name < lower_bound]
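
As a worked example of the IQR rule, here is a small made-up vector with one obvious point outlier. The values are invented purely for illustration.

```r
# Assumption: toy data, with 95 planted as a point outlier
values <- c(10, 12, 11, 13, 12, 14, 11, 95)

q1 <- unname(quantile(values, 0.25))  # first quartile
q3 <- unname(quantile(values, 0.75))  # third quartile
iqr_value <- q3 - q1                  # same as IQR(values)

upper_bound <- q3 + 1.5 * iqr_value
lower_bound <- q1 - 1.5 * iqr_value

# Keep only the points that fall outside the two fences
outliers <- values[values > upper_bound | values < lower_bound]
outliers
```

Only the planted value 95 falls outside the fences here; the other values sit well inside them.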

8
Q

2) DATA TRANSFORMATION AND NORMALIZATION

A

Why Transform Data? Enhance model performance; meet assumptions of certain
algorithms.

Common Transformations: Log, square root, z-score.

R Code:
# Log transformation
# (e.g., reduce right skew so the data look closer to a normal distribution;
# requires positive values)
log_data <- log(data$column_name)

# Square root transformation
# (e.g., reduce skewness and stabilize variance in count data)
sqrt_data <- sqrt(data$column_name)

# Z-score normalization
# (e.g., rescale to mean = 0 and sd = 1)
z_score <- scale(data$column_name)
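
To see what these transformations do, the sketch below applies them to simulated right-skewed data. The log-normal sample and the variable names are assumptions made for illustration; the mean-minus-median gap is used here only as a rough indicator of skew.

```r
set.seed(42)  # assumption: simulated data, seeded so the run is reproducible
x <- rlnorm(1000, meanlog = 0, sdlog = 1)  # strictly positive, right-skewed

log_x <- log(x)               # log of log-normal data is normally distributed
z_x   <- as.numeric(scale(x)) # scale() returns a matrix, so flatten it

# After z-scoring, the mean is ~0 and the standard deviation is ~1
c(mean = mean(z_x), sd = sd(z_x))

# In skewed data the mean is pulled away from the median
skew_gap_raw <- mean(x) - median(x)
skew_gap_log <- mean(log_x) - median(log_x)
c(raw = skew_gap_raw, logged = skew_gap_log)
```

Note that z-scoring changes only location and scale, not the shape of the distribution, so it does not remove skew on its own.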

9
Q

3) DATA VISUALIZATION

A

Histograms and Box Plots: Understand data distribution.

Scatter Plots: Visualize bivariate relationships.

Heatmaps: Show correlations.

R Code (for a simple scatter plot):

plot(data$column1, data$column2, main = "Scatter Plot of Column1 vs Column2",
     xlab = "Column1", ylab = "Column2")
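
The other plot types named on this card can be sketched in base R as well. This assumes a few numeric columns of the built-in mtcars dataset as stand-in data; the column choice is arbitrary.

```r
# Assumption: four numeric mtcars columns stand in for real data
data <- mtcars[, c("mpg", "hp", "wt", "disp")]

# Distribution of a single variable
hist(data$mpg, main = "Histogram of mpg", xlab = "mpg")
boxplot(data$mpg, main = "Box plot of mpg", ylab = "mpg")

# Pairwise Pearson correlations, drawn as a heatmap
cor_matrix <- cor(data)
heatmap(cor_matrix, symm = TRUE, scale = "none",
        main = "Correlation heatmap")
```

Here scale = "none" keeps the colors tied to the raw correlation values rather than to row-standardized ones.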

10
Q

4) STATISTICAL ANALYSIS (NEXT TIME)

A
  • Descriptive Statistics: Summarize main features of data.
  • Inferential Statistics: Make predictions or inferences about a population from sample data.
  • Testing Hypotheses: Determine validity of certain claims.
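
As a preview, the three ideas above might look like this in base R. This is a sketch on the built-in mtcars dataset; the choice of variables and of a Welch t-test is an assumption for illustration, not something the notes specify.

```r
# Descriptive statistics: summarize fuel efficiency
mpg_mean <- mean(mtcars$mpg)
mpg_sd   <- sd(mtcars$mpg)
summary(mtcars$mpg)

# Inference and hypothesis testing: does mean mpg differ between
# transmission types (am = 0 automatic, am = 1 manual)?
# t.test() runs Welch's two-sample t-test by default.
test_result <- t.test(mpg ~ am, data = mtcars)
test_result$p.value   # a small p-value is evidence against equal means
test_result$conf.int  # confidence interval for the difference in means
```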
11
Q

DATA VISUALIZATION

A
  • Line plot
  • Bar plots
  • Box plots
  • Density plots
  • Scatter plots
  • Word clouds
  • Pie chart
  • Raincloud plot
  • Heatmap
  • Animated plots
  • (Interactive plots)
12
Q

LINE PLOT

A

library(ggplot2)
ggplot(data = df, aes(x = date, y = unemploy)) +
  geom_line()

13
Q

BAR PLOT

A

library(ggplot2)
ggplot(data = df, aes(x = class)) +
  geom_bar()  # bar height = number of rows in each class

14
Q

BOX PLOT

A

library(ggplot2)
# Backticks let aes() refer to a column whose name contains a space
ggplot(data = df, aes(x = `Distance measure`, y = temperature)) +
  geom_boxplot()

15
Q

DENSITY PLOT

A

library(ggplot2)
ggplot(data = df, aes(x = X, fill = cut)) +
  geom_density(alpha = 0.5)  # alpha < 1 keeps overlapping densities visible

16
Q

SCATTER PLOT

A

library(ggplot2)
ggplot(data = df, aes(x = dist, y = speed)) +
  geom_point()

17
Q

WORD CLOUDS

A

library(wordcloud)     # provides wordcloud()
library(RColorBrewer)  # provides brewer.pal()
wordcloud(words = df$word, freq = df$freq,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))

18
Q

PIE CHART

A

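library(ggplot2)
ggplot(data = df, aes(x = factor(1), fill = as.factor(cyl))) +
  geom_bar(width = 1) +
  coord_polar(theta = "y")  # polar coordinates turn the stacked bar into a pie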

19
Q

RAINCLOUD PLOT

A
20
Q

HEATMAP

A
21
Q

ANIMATED PLOTS

A
22
Q

INTERACTIVE PLOTS

A