Introduction to data types and quality Flashcards

1
Q

In data science, data means:

A

collection of organized observations

2
Q

There are two types of data organization:

A

methodology and shape.

3
Q

The methodology is

A

how the data was collected

4
Q

The most common shape for data is

A

spreadsheet or table.

5
Q

The most common shape for data is a spreadsheet or table.

A

The things we are measuring (variables) are in the columns, and the individual instances (observations) are in the rows.
We can read each column “down” the table (viewing multiple observations), and each row “across” the table (viewing multiple variables).
This isn’t the only way to organize data, but it is the most common.
Our example uses Imperial measurements (the American system), so we will be collecting the data in feet and miles.
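
As a rough illustration of this table shape, the sketch below stores a tiny table as a list of rows. The tree data and the helper names `read_column` and `read_row` are made up for this example and are not part of the original card.

```python
# Hypothetical tree census: each dict is one observation (a row),
# and each key is one variable (a column).
trees = [
    {"Species": "Oak", "Height (ft)": 62.5},
    {"Species": "Maple", "Height (ft)": 48.0},
    {"Species": "Pine", "Height (ft)": 71.2},
]


def read_column(table, variable):
    """Read "down" the table: one variable across all observations."""
    return [row[variable] for row in table]


def read_row(table, index):
    """Read "across" the table: all variables for one observation."""
    return table[index]


heights = read_column(trees, "Height (ft)")
first_tree = read_row(trees, 0)
print(heights)
print(first_tree)
```

Reading a column returns one value per observation, while reading a row returns one value per variable.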

6
Q

The shape of data:
Each individual is called

A

an entity, observation, or instance.
These three terms are used interchangeably.

7
Q

In a well-organized dataset, the variables describe ...

A

a characteristic of our entities.

8
Q

Good variables measure

A

only one characteristic and should not be a characteristic themselves.

9
Q

Variable Types
The difference between measuring and categorizing is so important that the data itself is termed differently:

A
  • Variables that are measured are Numerical variables
  • Variables that are categorized are Categorical variables
10
Q

Numerical variables

A

Numerical variables are a combination of the measurement and the unit. Without the unit, a numerical variable is just a number.
Imagine I go into a cafe and ask the barista for 3. Three what? ☕? 🍩? 💵? Or my friend asks how far Toledo is and I say 300. 300 miles? Kilometers? Minutes? Without units, numbers don’t mean anything.
There are two ways to get a number: by counting or by measuring. Counting gives us whole numbers and discrete variables. Measuring gives us potentially partial values and continuous variables.
In our tree census, we are measuring the height of our trees in feet (indicated in the variable name, ‘Height (ft)’), a continuous variable.
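
One way to keep a number and its unit together is sketched below. The `Measurement` class, the `is_discrete` helper, and the branch count are hypothetical names and values chosen for illustration. Treating a Python `int` as "counted" and a `float` as "measured" is a simplifying assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Measurement:
    """A numerical value paired with the unit that gives it meaning."""
    value: Union[int, float]
    unit: str

    def __str__(self):
        return f"{self.value} {self.unit}"


def is_discrete(measurement):
    """Assumption of this sketch: counted values are stored as int."""
    return isinstance(measurement.value, int)


# Counting gives whole numbers (a discrete variable).
branch_count = Measurement(12, "branches")
# Measuring can give partial values (a continuous variable).
height = Measurement(62.5, "ft")

print(branch_count, is_discrete(branch_count))
print(height, is_discrete(height))
```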

11
Q

Categorical variables

A

Categorical variables describe characteristics with words or relative values.

12
Q

A categorical variable that simply names a category is a nominal variable, which literally means

A

a named value.

13
Q

Categorical variables:
Dichotomous variables

A

have only two logical possibilities: “on/off”, “yes/no”, “true/false”, or “0/1”. There is no middle ground and no third option. If there is a logical third option, it is not a dichotomous variable.
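
The rule can be expressed as a small check, sketched here with a hypothetical helper named `is_dichotomous`. It expects the full set of logically possible values for a variable, not just the values that happen to appear in a sample.

```python
def is_dichotomous(possible_values):
    """True when a variable has exactly two logical possibilities."""
    return len(set(possible_values)) == 2


print(is_dichotomous(["yes", "no"]))
print(is_dichotomous(["yes", "no", "maybe"]))
```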

14
Q

Categorical variables:
ordinal variable

A

Let’s say that we wanted to capture how “pretty” we thought each tree was. This isn’t really a thing we can measure, but we can subjectively say, on a scale of 1 to 5, how pretty we think each tree is. The prettiest trees are a 5, and the least pretty trees are a 1.
That ranking is inherently ordered and therefore called an ordinal variable.

15
Q

Ordinal variables are really popular in survey design: “On a scale of 1-5, how much do you agree with this statement?” This is called a

A

Likert scale. Ordinal variables also show up in the Olympics and other competitions where someone wins 1st, 2nd, or 3rd place.

16
Q

Ordinal variables can get a little confusing because they are often represented as numbers. But they don’t represent measurements or counts; they represent

A

categories
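
A minimal sketch of treating numeric-looking ratings as ordered categories follows. The ratings are invented for illustration, on the 1-to-5 "prettiness" scale from the earlier card. The code tallies how often each category appears and lists the categories in rank order, rather than doing arithmetic on the ratings.

```python
from collections import Counter

# Hypothetical prettiness ratings, one per tree (1 = least pretty, 5 = prettiest).
ratings = [5, 3, 4, 5, 1, 3, 5]

# Tally how many trees fall into each category.
counts = Counter(ratings)

# The categories have a meaningful order, so sorting them is well defined.
categories_in_order = sorted(counts)

# The most frequently assigned category and how often it appears.
most_common_rating, times_seen = counts.most_common(1)[0]

print(categories_in_order)
print(most_common_rating, times_seen)
```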

17
Q

Cleaning data involves a lot of

A

critical thinking considering the nuances of the dataset you are working with.

18
Q

Accuracy

A

is a measure of how well records reflect reality.

19
Q

essential for accuracy

A

Standardization is essential for accuracy – but it’s not the only way that accuracy can be compromised.

20
Q

There are a lot of ways a dataset can have low accuracy, but it all comes down to the question of: “are these measurements (or categorizations) correct?” It requires a critical evaluation of your specific dataset to identify what the issues are, but there are a few ways to think about it.

A
  • First, thinking about the data against expectations and common sense is crucial for spotting issues with accuracy. You can do this by inspecting the distribution and outliers to get clues about what the data looks like.
  • Second, critically considering how error could have crept in during the data collection process will help you group and evaluate the data to uncover systematic inconsistencies.
  • Finally, identifying ways that duplicate values could have been created goes a long way toward ensuring that reality is only represented once in your data. A useful technique is to distinguish between what was human-collected and what was programmatically generated, and to use that distinction to segment the data.
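
The three checks above can be sketched roughly as follows. The records, the collector names, the plausible height range of 1 to 400 feet, and the helper names are all assumptions made for this example. Real thresholds and groupings depend on the dataset at hand.

```python
from collections import Counter, defaultdict
from statistics import median

# Hypothetical records: (tree_id, collector, height in feet).
records = [
    ("T1", "alice", 62.5),
    ("T2", "alice", 48.0),
    ("T3", "bob", 710.0),   # far outside what we expect for a tree
    ("T4", "bob", 55.0),
    ("T2", "alice", 48.0),  # exact repeat of an earlier record
]


def flag_implausible(rows, low, high):
    """Check 1: compare values against expectations and common sense."""
    return [row for row in rows if not (low <= row[2] <= high)]


def median_by_collector(rows):
    """Check 2: group by collection source to look for systematic differences."""
    groups = defaultdict(list)
    for _, collector, height in rows:
        groups[collector].append(height)
    return {collector: median(heights) for collector, heights in groups.items()}


def find_duplicates(rows):
    """Check 3: find records that appear more than once."""
    counts = Counter(rows)
    return [row for row, n in counts.items() if n > 1]


suspicious = flag_implausible(records, low=1.0, high=400.0)
per_collector = median_by_collector(records)
duplicates = find_duplicates(records)

print(suspicious)
print(per_collector)
print(duplicates)
```

Each helper only surfaces candidates for review. Deciding whether a flagged record is actually wrong still takes judgment about the specific dataset.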
21
Q

It’s not just typos, mistakes, missing data, poor measurement, and duplicated observations that make a dataset low quality. We also have to make sure that our data actually measures what we think it is measuring.

A

This is the validity of our dataset.

22
Q

Validity is a special kind of quality measure because it’s not just about the dataset

A

it’s about the relationship between the dataset and its purpose. A dataset can be valid for one question and invalid for another.