R - 7. Exploratory Data Analysis Flashcards

1
Q

What does EDA stand for?

A

exploratory data analysis

2
Q

What is EDA?

A

EDA is the practice of using visualisation and transformation to explore your data in a systematic way.
EDA is an iterative cycle. You:

  1. Generate questions about your data.
  2. Search for answers by visualising, transforming, and modelling your data.
  3. Use what you learn to refine your questions and/or generate new questions.

There are no strict rules. Keep asking questions until you get to the core of the dataset.

3
Q

What two types of questions will always be useful for making discoveries within your data?

A

You can loosely word these questions as:

What type of variation occurs within my variables?

What type of covariation occurs between my variables?

4
Q

What is a variable?

A

A variable is a quantity, quality, or property that you can measure.

5
Q

What is a value?

A

A value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.

6
Q

What is an observation?

A

An observation is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. An observation is sometimes also called a data point.

7
Q

What is tabular data?

A

Tabular data is a set of values, each associated with a variable and an observation. Tabular data is tidy if each value is placed in its own “cell”, each variable in its own column, and each observation in its own row.
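
As a rough illustration of the tidy layout, here is a minimal sketch in base R. The data frame and its column names (person, eye_colour, height_cm) are invented for this example and are not part of the original card.

```r
# Hypothetical toy data: each row is one observation (a person),
# each column is one variable, and each cell holds a single value.
tidy <- data.frame(
  person     = c("Ana", "Ben", "Cy"),
  eye_colour = c("brown", "blue", "green"),
  height_cm  = c(162.5, 180.1, 174.0)
)

# One value: the height (variable) recorded for Ben (observation).
tidy$height_cm[tidy$person == "Ben"]
```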

8
Q

What is variation?

A

Variation is the tendency of the values of a variable to change from measurement to measurement. If you measure any continuous variable twice, you will get two different results. This is true even if you measure quantities that are constant, like the speed of light, because each of your measurements will include a small amount of error. Categorical variables can also vary if you measure across different subjects (e.g. the eye colors of different people). Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualise the distribution of the variable’s values.

9
Q

What is a categorical variable?

A

A variable is categorical if it can only take one of a small set of values.

10
Q

What is a continuous variable?

A

A variable is continuous if it can take any of an infinite set of ordered values.

11
Q

What do you use to examine the distribution of a continuous variable?

A

A histogram.
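
A minimal sketch of such a histogram with ggplot2. It assumes the diamonds dataset that ships with ggplot2 and a bin width of 0.5; both are choices made for this example, not something the card specifies.

```r
library(ggplot2)

# carat is a continuous variable in the built-in diamonds dataset.
# binwidth is measured in the units of the x variable (here: carats).
p_hist <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.5)

print(p_hist)
```

It is usually worth trying several bin widths, since different widths can reveal different patterns.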

12
Q

What do you use to examine the distribution of a categorical variable?

A

A bar chart.
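
A minimal sketch of a bar chart with ggplot2, again assuming the built-in diamonds dataset, whose cut column is categorical with five levels.

```r
library(ggplot2)

# geom_bar() counts the number of rows at each level of cut;
# the height of each bar is that count.
p_bar <- ggplot(diamonds, aes(x = cut)) +
  geom_bar()

print(p_bar)
```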

13
Q

What is geom_freqpoly() and what is it used for?

A

geom_freqpoly() performs the same calculation as geom_histogram(), but displays the counts with lines instead of bars.
Use it instead of geom_histogram() when you wish to overlay multiple histograms in the same plot: it’s much easier to understand overlapping lines than overlapping bars.
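
A minimal sketch of overlaid frequency polygons, assuming the built-in diamonds dataset and a bin width of 0.1 (both chosen for this example). Mapping colour to cut draws one line per level of cut.

```r
library(ggplot2)

# One frequency polygon per cut, drawn on shared axes.
# The binning is the same as geom_histogram() would use for this binwidth.
p_freq <- ggplot(diamonds, aes(x = carat, colour = cut)) +
  geom_freqpoly(binwidth = 0.1)

print(p_freq)
```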

14
Q

What do you do with unusual values that don’t make sense?

A

You should zoom in on them and work out what they represent and whether the data makes sense. If you are sure that the data is wrong, you can replace the unusual values with missing values, but you must be careful.
If you replace them, do so with mutate(), which replaces the variable with a modified copy.
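
A minimal sketch of the replacement step with dplyr. It assumes the diamonds dataset from ggplot2 and treats y values below 3 or above 20 as implausible; those cut-offs are an assumption for this example and should come from inspecting your own data.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Replace implausible y values with NA instead of dropping whole rows.
# The thresholds 3 and 20 are illustrative assumptions.
diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

# Count how many values were replaced.
sum(is.na(diamonds2$y))
```

Replacing values rather than filtering rows keeps the other measurements in each observation available for analysis.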

15
Q

What happens with missing values?

A

Missing values should never silently go missing. It’s not obvious where you should plot missing values, so ggplot2 doesn’t include them in the plot, but it does warn that they’ve been removed.
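
A minimal sketch of this behaviour, using a small invented data frame with one missing y value. Setting na.rm = TRUE on the geom is the usual way to suppress the warning once you have decided the missing values can be ignored.

```r
library(ggplot2)

# Hypothetical toy data with one missing value in y.
df <- data.frame(x = c(1, 2, 3, 4), y = c(2, NA, 6, 8))

# Printing this plot drops the incomplete row and emits a warning.
p_warn <- ggplot(df, aes(x = x, y = y)) +
  geom_point()

# na.rm = TRUE drops the same row but without the warning.
p_quiet <- ggplot(df, aes(x = x, y = y)) +
  geom_point(na.rm = TRUE)

print(p_quiet)
```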

16
Q

What is covariation?

A

Covariation describes the behavior between variables: it is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualise the relationship between two or more variables.
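
A minimal sketch of visualising covariation between two continuous variables with a scatterplot. It assumes the built-in diamonds dataset and the carat and price columns; the alpha value is an arbitrary choice to reduce overplotting.

```r
library(ggplot2)

# Each point is one diamond; a visible trend suggests that
# carat and price vary together.
p_cov <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1)

print(p_cov)
```

For a categorical and a continuous variable, a boxplot per category (geom_boxplot()) is a common alternative.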