Unit 5: Data Flashcards
citizen science
scientific research conducted in whole or in part by distributed individuals, many of whom may not be scientists, who contribute relevant data to research using their own computing devices.
cleaning data
a process that makes data uniform without changing its meaning (for example, converting numbers spelled out as words to their numeric values)
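As a minimal sketch of this idea, the Python snippet below converts numbers written as words into digits so that every entry uses the same form. The `WORD_TO_NUMBER` table and the `clean_value` helper are hypothetical names made up for illustration, and the table covers only a few words.

```python
# Hypothetical lookup table; a real one would cover many more words.
WORD_TO_NUMBER = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

def clean_value(raw):
    """Return the numeric value of an entry written as digits or as a word."""
    text = raw.strip().lower()
    if text in WORD_TO_NUMBER:
        return WORD_TO_NUMBER[text]
    return int(text)  # raises ValueError for anything it cannot interpret

# Made-up survey responses with inconsistent formatting.
responses = ["3", "Two", " five ", "4"]
cleaned = [clean_value(r) for r in responses]
print(cleaned)
```

The meaning of each response is unchanged; only its form is made uniform.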
correlation
A relationship between two pieces of data, typically describing how the amount of one changes in relation to the other.
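One common way to put a number on a correlation is the Pearson correlation coefficient, which ranges from -1 to 1. The flashcard does not name a specific measure, so this choice is an assumption, and the `pearson_r` helper and the sample values below are made up for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)

# Made-up data: the second list rises in step with the first.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 6, 8, 10]
r = pearson_r(hours, scores)
print(r)
```

A value near 1 means the two columns rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is little linear relationship. A correlation by itself does not show that one thing causes the other.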
Crowdsourcing
the practice of obtaining input or information from a large number of people via the Internet.
information
a collection of facts or patterns extracted from data
data bias
data that does not accurately reflect the full population or phenomenon being studied
What is data filtering and how is it done?
choosing a smaller subset of a data set to use for analysis, for example by eliminating / keeping only certain rows in a table
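A short sketch of filtering in Python, assuming the table is stored as a list of dictionaries with one dictionary per row. The `pets` table and its column names are made up for illustration.

```python
# Made-up table: each dictionary is one row.
pets = [
    {"name": "Rex", "species": "dog", "age": 4},
    {"name": "Tom", "species": "cat", "age": 7},
    {"name": "Fido", "species": "dog", "age": 11},
    {"name": "Kiwi", "species": "bird", "age": 2},
]

# Keep only the rows whose species is "dog".
dogs = [row for row in pets if row["species"] == "dog"]

# Eliminate the rows for pets younger than 5.
older_pets = [row for row in pets if row["age"] >= 5]

print([row["name"] for row in dogs])
print([row["name"] for row in older_pets])
```

The original table is left untouched; each filter produces a smaller subset to analyze.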
bar chart
Graph of bars that shows the number of times each value in a column of data appears
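The counting behind a bar chart can be sketched with `collections.Counter` from the Python standard library. The `colors` column is made-up data, and the text bars are only a rough stand-in for a real chart.

```python
from collections import Counter

# Made-up column of data: one favorite color per survey response.
colors = ["red", "blue", "red", "green", "blue", "red"]

# Count how many times each value appears in the column.
counts = Counter(colors)

# Draw one text bar per value, longest first.
for value, count in counts.most_common():
    print(f"{value:6} {'#' * count}")
```

Because the values here are strings, this also illustrates that a bar chart works with qualitative data as well as numbers.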
Histogram
Similar to a bar chart, but all numbers within a range (bucket) are grouped together.
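The grouping step can be sketched as follows, assuming a fixed bucket width of 10. `BUCKET_WIDTH` and the `ages` list are made-up values for illustration.

```python
from collections import Counter

BUCKET_WIDTH = 10  # assumed bucket size; wider buckets give fewer bars

# Made-up numeric column.
ages = [3, 7, 12, 15, 18, 21, 29, 34]

# Map each number to the start of its bucket (0, 10, 20, ...) and count.
buckets = Counter((age // BUCKET_WIDTH) * BUCKET_WIDTH for age in ages)

# Draw one text bar per bucket, in numeric order.
for start in sorted(buckets):
    end = start + BUCKET_WIDTH - 1
    print(f"{start:2}-{end:2} {'#' * buckets[start]}")
```

Eight unique values collapse into four bars, which is why a histogram stays readable when a numeric column has many distinct values.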
Crosstab chart
counts the number of times combinations of values appear (similar to a frequency table)
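A minimal sketch of the counting behind a crosstab, assuming each row holds the values from two columns. The survey rows below are made up for illustration.

```python
from collections import Counter

# Made-up survey rows: (grade level, favorite subject).
rows = [
    ("9th", "math"),
    ("9th", "art"),
    ("10th", "math"),
    ("9th", "math"),
    ("10th", "art"),
    ("10th", "math"),
    ("10th", "math"),
]

# Count how many times each combination of values appears.
crosstab = Counter(rows)
for (grade, subject), count in sorted(crosstab.items()):
    print(f"{grade:5} {subject:5} {count}")

# The most common combination and how often it occurs.
most_common_pair, most_common_count = crosstab.most_common(1)[0]
```

Both columns here are strings, which is a case where a crosstab is easier to read than a scatterplot.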
scatterplot
graph that shows the relationship between 2 sets of data
Open data
publicly available data shared by governments, organizations, and others so that anyone can analyze it.
Big data
a collection of huge amounts of data gathered so that we can learn from it, often requiring cloud computing or parallel processing systems
What is metadata, what is it used for, and why is it important?
Data about data that is used to organize, find, and manage information. It is important because it increases the effectiveness of data by providing extra information.
What is crowdsourcing?
The practice of obtaining input or information from a large number of people via the internet.
What are the advantages of a histogram? What are the advantages of a bar chart?
Histogram advantages:
- useful when many unique values must be grouped (numbers only)
- easier to read with wider buckets
Bar chart advantages:
- can work with both numeric and qualitative data
- good at finding the frequency of a value
What are the disadvantages of a histogram? What are the disadvantages of a bar chart?
Histogram disadvantages:
- only works with numerical values
Bar chart disadvantages:
- not useful when the data has too many unique values (especially if the data has small increments and each input has the same or a similar output)
What is two-column data?
data that uses 2 variables
(for example, the height and maximum lifespan of dogs)
What is one-column data?
data that uses 1 variable (for example, the population across states, where states is the variable)
What are the pros and cons of cross-tabs?
Pros/Useful for:
Finding the most / least common combinations of values in two columns
Finding patterns across two columns
Exploring two columns when one or both are strings.
Cons/Not useful:
If either column has too many values (the chart would be enormous)
When are scatter plots useful? When are they not useful?
Scatter plots are useful for:
Seeing patterns and trends between two values
Numeric data with lots of different values
Scatter plots are not useful when a specific combination of values occurs many times, because the overlapping points are hard to see. In that situation, a cross-tab is more helpful because it counts how often each combination appears.