Midterm Flashcards

(79 cards)

1
Q

FILTER + REPRESENT

A

Reorganize your data and take only what you need

The pro of mining before filtering is that you know exactly what you want to filter. The con is that you don’t know whether there is enough data to answer your questions

Filter and Represent have an iterative nature. How you represent data can influence what you acquire

This stage could lead you back to acquire

2
Q

ACQUIRE

A

Locate and download the data from a source

3
Q

Primary Data

A

information collected for specific purpose at hand

4
Q

Secondary Data

A

information that already exists somewhere, having been collected for another purpose

5
Q

PARSE

A

Look through the data columns, identify their types, and check their correctness

Modify columns by splitting if needed

Each piece of data needs to be converted to a useful format
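
A minimal Python sketch of the parse step, assuming a hypothetical row in which every value arrives as a string (the field names and values are invented for illustration). Each field is converted to a more useful type, and one combined column is split in two:

```python
from datetime import date

# Hypothetical raw row, as it might arrive from a CSV file: every value is a string.
raw_row = {"name": "Oat Bar", "calories": "190", "fat_g": "7.5",
           "vegan": "true", "serving": "40 g", "tested_on": "2024-03-15"}


def parse_row(row):
    """Convert each string field to a more useful type."""
    # Split the combined "40 g" column into an amount and a unit.
    amount, unit = row["serving"].split()
    return {
        "name": row["name"],                      # string
        "calories": int(row["calories"]),         # integer
        "fat_g": float(row["fat_g"]),             # float
        "vegan": row["vegan"].lower() == "true",  # boolean
        "serving_amount": float(amount),
        "serving_unit": unit,
        "tested_on": date.fromisoformat(row["tested_on"]),
    }


parsed = parse_row(raw_row)
print(parsed)
```

A value that cannot be converted (for example "n/a" in the calories column) raises a ValueError here, which is one way to notice incorrect entries while parsing.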

6
Q

String

A

a set of characters that forms a word or sentence

7
Q

Float

A

a number with a decimal point

8
Q

Character

A

a single letter or other symbol

9
Q

Integer

A

a number with no fractional part

10
Q

Alphanumeric

A

consists of both letters and numbers

11
Q

Boolean

A

True or False

12
Q

MINE

A

Determine basic descriptors and statistics for your data, categorize it, and figure out the range and spread, as well as patterns

Categorize your data into groups such as nutrient facts

Should also start asking questions

Figure out if temporal data needs to be reorganized

A range check is important to see whether there are null/NA values or negative numbers
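
A rough sketch of the mine step in Python, using only the standard library. The calorie values are hypothetical, and None stands in for a null/NA entry:

```python
import statistics

# Hypothetical calorie readings; None stands in for a null/NA entry.
calories = [190, 250, None, 120, -5, 310, 250]

# Range check first: set aside nulls and negative numbers.
valid = [c for c in calories if c is not None and c >= 0]
suspect = [c for c in calories if c is None or c < 0]

# Basic descriptors: range and spread of the values that passed the check.
summary = {
    "count": len(valid),
    "min": min(valid),
    "max": max(valid),
    "range": max(valid) - min(valid),
    "mean": statistics.mean(valid),
    "median": statistics.median(valid),
    "stdev": statistics.stdev(valid),
}
print(summary)
print("suspect values:", suspect)
```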

13
Q

FILTER + REPRESENT

A

Reorganize your data and take only what you need

The pro of mining before filtering is that you know exactly what you want to filter. The con is that you don’t know whether there is enough data to answer your question

Filter & Represent have an iterative nature. How you represent data can influence what you acquire

This stage could lead you back to acquire
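
One way the filter step might look in Python, assuming hypothetical rows that have already been parsed (the column names and the "enough data" threshold are invented for this sketch):

```python
# Hypothetical parsed rows; the column names are illustrative only.
rows = [
    {"name": "Oat Bar", "category": "snack", "calories": 190, "sodium_mg": 95},
    {"name": "Cola", "category": "drink", "calories": 140, "sodium_mg": 45},
    {"name": "Trail Mix", "category": "snack", "calories": 310, "sodium_mg": 60},
]

# Filter: keep only the rows and columns the question needs.
snacks = [
    {"name": r["name"], "calories": r["calories"]}
    for r in rows
    if r["category"] == "snack"
]

# Reorganize: sort so the representation (e.g. a bar chart) reads well.
snacks.sort(key=lambda r: r["calories"], reverse=True)
print(snacks)

# If too few rows survive the filter, that is a cue to go back to acquire.
enough_data = len(snacks) >= 2  # threshold chosen arbitrarily for this sketch
```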

14
Q

CHRTS

A

categorical, hierarchical, relational, temporal, spatial

15
Q

Categorical

A

compare categories of quantitative data

16
Q

Hierarchical

A

visualize relationships and hierarchies

17
Q

Relational

A

charts relationships to explore correlations

18
Q

Temporal

A

data that happens over time

19
Q

Spatial

A

data pertaining to a location

20
Q

CRITIQUE + REFINE

A

Get feedback on your charts and refine them based on that feedback

This stage could lead you back to acquire, mine, or filter & represent

21
Q

Data Product

A

translate the records of a data source into an easily understandable format

ex:
Raw vs Processed
Granular vs Summarized
Textual vs Quantitative
Static vs Dynamic
Small vs Massive

22
Q

Structured Data

A

easily searchable

23
Q

Unstructured Data

A

not easily searchable

ex:
audio, video, reviews

24
Q

Quantitative

A

numerical data that is either discrete or continuous

25
Qualitative Data Types
nominal, ordinal
26
Nominal
label for a field ex: M/F, color, names
27
Ordinal
order matters
28
Anatomy of a graphic
Chart title, data label, legend, horizontal axis title, left vertical axis title, category labels
29
Bar Charts vs Histograms
bar charts compare categories, while histograms show how data is distributed across ranges of values
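
A small Python sketch of the difference in how the two charts are built, using invented data and an assumed bin width of 10. The bar chart counts items per category, while the histogram counts values per numeric range:

```python
from collections import Counter

# Bar chart input: one count per category (categories have no numeric order).
fruits = ["apple", "pear", "apple", "kiwi", "apple", "pear"]
bar_counts = Counter(fruits)

# Histogram input: counts per numeric bin (bins are ordered ranges).
ages = [3, 8, 12, 15, 21, 22, 27, 34, 38, 39]
bin_width = 10  # assumed bin width for this sketch
hist_counts = Counter((age // bin_width) * bin_width for age in ages)

# Text rendering: bars ordered by length, bins ordered by range.
for category, count in bar_counts.most_common():
    print(f"{category:<6} {'#' * count}")
for start in sorted(hist_counts):
    print(f"{start:>2}-{start + bin_width - 1:<2} {'#' * hist_counts[start]}")
```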
30
Bar Chart
categories don't have an order, so order the bars by length for easier comparison. Use horizontal bar charts for long category labels. (Categorical)
31
Clustered Bar Chart
comparison between subcategories (Categorical)
32
Pictogram
uses point marks, in the form of symbols or pictures, to represent an associated quantitative count (Categorical)
33
Proportional Symbol Chart
works best when you have a diverse range of quantitative value sizes (Categorical)
34
Word Cloud
shows the frequency of individual word items (Categorical)
35
Matrix Chart and Heat Map
displays quantitative values across the intersection of two categorical and/or discrete quantitative dimensions (Categorical)
36
Histogram and Density Plot
displays the frequency and distribution of quantitative measurements across grouped values for data items (Categorical)
37
Box and Whisker Plot
displays the distribution and shape of quantitative values for different categories (Categorical)
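
The numbers a box and whisker plot draws can be sketched in Python with the standard library. The scores are hypothetical, and the "inclusive" quartile method is one of several conventions; plotting tools may use a different one and may also draw outliers separately:

```python
import statistics

# Hypothetical scores for two categories.
scores = {
    "class_a": [55, 61, 64, 70, 72, 75, 81, 90],
    "class_b": [40, 58, 59, 60, 62, 63, 65, 99],
}


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a basic box plot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


for name, values in scores.items():
    print(name, five_number_summary(values))
```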
38
Pie Chart and Donut Chart
shows how proportions of quantities for different constituent categories make up a whole (Categorical)
39
Treemap
an enclosure diagram providing a hierarchical display that shows how quantitative values for different constituent categorical parts make up a whole (Hierarchical)
40
Venn Diagram
shows collections of and relationships between multiple sets (Hierarchical)
41
Scatter Plot
displays the relationship between two quantitative variables for different category items (Relational)
42
Bubble Plot
displays the relationship between three quantitative variables for different category items (Relational)
43
Network Diagram
displays relationships through the connections between data items (Relational)
44
Line Chart
shows how quantitative values have changed over time for different categorical items (Temporal)
45
Bump Chart/Ribbon Chart/Rank Chart
shows how quantitative values have changed over time for categorical items, where the quantitative values are ranking measurements (Temporal)
46
Slope Graph
shows how quantitative values have changed over two points in time for different category items (Temporal)
47
Area Chart
shows how quantitative values have changed over time for a single categorical item (Temporal)
48
Stacked Area Chart
shows how quantitative values have changed over time for multiple categorical items (Temporal)
49
Gantt Chart
shows time-based intervals for different categorical items (Temporal)
50
Instance Chart
displays time-based events for different categorical items (Temporal)
51
Choropleth
displays quantitative values for distinct, definable spatial regions (Spatial)
52
Isarithmic Map/Contour Map
displays distinct spatial surfaces on a map that share the same quantitative classification (Spatial)
53
Proportional Symbol Map
displays quantitative values for locations on a map; ideal for highlighting the magnitude of data at specific locations through varying symbol sizes (Spatial)
54
Dot Map
displays the distribution of phenomena on a map (Spatial)
55
Flow Map
shows the characteristics of movement or connections between phenomena across spatial regions (Spatial)
56
Area Cartogram
displays the quantitative values associated with distinct, definable spatial regions on a map by proportionately distorting (inflating or deflating) the relative size and, to some degree, the shape of the respective regional areas (Spatial)
57
Dorling Cartogram
displays the quantitative values associated with distinct, definable spatial regions on a map with marks that are proportionally sized to represent the quantitative values (Spatial)
58
Grid Map
displays the quantitative values associated with distinct, definable spatial regions on a map. Each geographic region is represented by a fixed-size uniform shape, sometimes termed a tile. Attributes of color are applied to each regional tile to represent a quantitative measurement (Spatial)
59
Projections
Flattening the globe onto a map introduces distortion. ex: the Mercator projection preserves local angles but introduces severe distortions in areas near the poles (Spatial)
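
The card's description matches a conformal projection such as Mercator; that reading is an assumption, since the card does not name a projection. Under it, a short Python sketch on a unit sphere shows how quickly areas inflate toward the poles:

```python
import math


def mercator(lat_deg, lon_deg):
    """Project latitude/longitude (degrees) onto a unit-sphere Mercator plane."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    x = lon
    y = math.log(math.tan(math.pi / 4 + lat / 2))
    return x, y


def area_inflation(lat_deg):
    """Mercator stretches both axes by 1/cos(lat), so areas grow by 1/cos^2(lat)."""
    return 1 / math.cos(math.radians(lat_deg)) ** 2


for lat in (0, 30, 60, 80):
    print(lat, mercator(lat, 0), round(area_inflation(lat), 2))
```

Because both axes are stretched by the same factor at any given point, local angles are preserved even though the sizes of regions are not.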
60
Logarithmic Transformation
Useful when data spans multiple orders of magnitude or is skewed (right-skewed)
61
Square Root Transformation
Appropriate for moderately skewed data or data with moderate outliers (right-skewed)
62
Reciprocal Transformation
Effective when large values disproportionately influence the dataset or the data is right-skewed
63
Squaring/Cubing
Effective for left-skewed data
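
To see the four transformations above at work, here is a hedged Python sketch that measures skewness before and after each one. The datasets are invented, and the skewness helper is a plain population-moment formula written for this example, not a library function:

```python
import math
import statistics


def skewness(values):
    """Population skewness: positive is right-skewed, negative is left-skewed."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical right-skewed data spanning several orders of magnitude.
right_skewed = [1, 2, 2, 3, 5, 8, 20, 100, 1000]

transforms = {
    "raw": lambda v: v,
    "sqrt": math.sqrt,
    "log10": math.log10,            # needs strictly positive values
    "reciprocal": lambda v: 1 / v,  # needs non-zero values; reverses the order
}
right_results = {
    name: skewness([fn(v) for v in right_skewed])
    for name, fn in transforms.items()
}

# Hypothetical left-skewed data; squaring stretches the upper end of the scale.
left_skewed = [1, 6, 8, 9, 9, 10, 10, 10]
left_results = {
    "raw": skewness(left_skewed),
    "squared": skewness([v ** 2 for v in left_skewed]),
}

for label, results in (("right", right_results), ("left", left_results)):
    for name, value in results.items():
        print(f"{label:<5} {name:<10} skewness = {value:.2f}")
```

On this sample the square root reduces the right skew a little and the logarithm reduces it more, while squaring moves the left-skewed sample closer to zero skew. Other datasets may behave differently.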
64
Currency (Verifying Data)
Is the information up to date? When was it collected/published/updated?
65
Relevancy (Verifying Data)
Is the information suitable for your intended use? Does it address your research question? Is there other (better) information?
66
Authority (Verifying Data)
Is the creator of the information reputable, with the necessary credentials? Can you trust the information?
67
Accuracy (Verifying Data)
Do you spot any errors? What is the source of the information? Can other data or research support this information?
68
Purpose (Verifying Data)
What was the intended purpose of the information collected? Are other potential uses identified?
69
Data Type Checking (Data Cleaning)
Checking to see if all the values in a field have the same data type. ex: all inputs for ages should be integers
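
A minimal data type check in Python, assuming a hypothetical age column in which some entries arrived with the wrong type:

```python
# Hypothetical age column as it might arrive from a spreadsheet export.
ages = [34, "27", 41.0, "forty", None, 19]

# Flag every entry whose type is not exactly int.
# (bool is a subclass of int in Python, so it is excluded explicitly.)
wrong_type = [
    (index, value)
    for index, value in enumerate(ages)
    if not isinstance(value, int) or isinstance(value, bool)
]
print(wrong_type)
```

Some flagged entries, such as "27" or 41.0, can be converted safely; others, such as "forty" or None, need a decision about correction or removal.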
70
Range Check (Data Cleaning)
Checking to make sure that the information is within a reasonable range. ex: an age shouldn't be negative, zero, or over a hundred
Missing or incorrect values should be replaced with an estimate (e.g. the median age of the dataset) or marked as "Missing" or "Unknown"
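
A short range check in Python under the limits the card gives for ages (greater than zero and at most one hundred). The ages are hypothetical, and both replacement options from the card are shown:

```python
import statistics

# Hypothetical ages; the limits follow the card (greater than 0, at most 100).
ages = [34, 27, -3, 41, 0, 250, 19]

in_range = [a for a in ages if 0 < a <= 100]
median_age = statistics.median(in_range)

# Option 1: replace out-of-range values with an estimate (the median).
estimated = [a if 0 < a <= 100 else median_age for a in ages]

# Option 2: mark them explicitly instead of guessing.
flagged = [a if 0 < a <= 100 else "Unknown" for a in ages]

print(estimated)
print(flagged)
```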
71
Format Check (Data Cleaning)
Making sure the format is uniform
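
As an illustration of a format check, the Python sketch below assumes the target format is a YYYY-MM-DD date; the pattern and the sample values are choices made for this example only:

```python
import re

# Assumed target format for this sketch: ISO-style dates, YYYY-MM-DD.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")

dates = ["2024-03-15", "15/03/2024", "2024-3-5", "2023-12-01"]

# fullmatch requires the whole string to fit the pattern.
# It checks the shape only, not whether the date actually exists.
non_uniform = [d for d in dates if not DATE_PATTERN.fullmatch(d)]
print(non_uniform)
```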
72
Handling Missing Data (Data Cleaning)
< 5% of data missing: delete those entries and make a note of how this impacts the data analysis and sample size
> 5% of data missing:
  Categorical data: use a placeholder like "Unknown"
  Numerical data: replace with the mean of the data
  Temporal/interval data: use interpolation or a placeholder like "Unknown"
Check for patterns of missing data
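
The rules on this card could be sketched in Python roughly as follows. The function name and the column kinds are hypothetical, None marks a missing entry, exactly 5% is treated like the "more than 5%" case, and interpolation for temporal data is left out to keep the sketch short:

```python
import statistics

MISSING_THRESHOLD = 0.05  # the 5% rule of thumb from the card


def handle_missing(values, kind):
    """Apply the card's rules to one column; None marks a missing entry."""
    missing_share = values.count(None) / len(values)
    if missing_share < MISSING_THRESHOLD:
        # Few gaps: drop the entries (and note the smaller sample size).
        return [v for v in values if v is not None]
    if kind == "categorical":
        return ["Unknown" if v is None else v for v in values]
    if kind == "numerical":
        mean = statistics.fmean([v for v in values if v is not None])
        return [mean if v is None else v for v in values]
    raise ValueError(f"no rule sketched for kind={kind!r}")


# Hypothetical columns with 1 of 4 values missing (25%, above the threshold).
print(handle_missing(["red", None, "blue", "red"], "categorical"))
print(handle_missing([10.0, None, 20.0, 30.0], "numerical"))
```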
73
Duplication (Data Cleaning)
Making sure that there are no duplicates in your data and removing any entries that are duplicated
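
A small Python sketch of duplicate removal on hypothetical records. It assumes that the first copy of each record should be kept and only the repeats removed:

```python
# Hypothetical records; two rows are exact duplicates of earlier ones.
records = [
    ("Ana", "2024-03-15", 34),
    ("Ben", "2024-03-16", 27),
    ("Ana", "2024-03-15", 34),
    ("Cy", "2024-03-17", 41),
    ("Ben", "2024-03-16", 27),
]

# dict.fromkeys keeps the first occurrence of each key and preserves order.
unique_records = list(dict.fromkeys(records))
removed = len(records) - len(unique_records)
print(unique_records, removed)
```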
74
Spelling Check (Data Cleaning)
Detect and correct any spelling errors
75
Data Standardization
Ensure consistency in text entries, formats, and measurement units
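
Standardization might look like this in Python. The conversion table, the choice of grams as the common unit, and the sample entries are all assumptions made for this sketch:

```python
# Assumed conversion table for this sketch: everything is stored in grams.
TO_GRAMS = {"g": 1.0, "kg": 1000.0, "mg": 0.001}

# Hypothetical entries with inconsistent case, spacing, and units.
raw = [("  Oat bar", "40 g"), ("OAT BAR ", "0.04 kg"), ("trail mix", "30000 mg")]


def standardize(name, weight):
    """Return a (name, grams) pair in one consistent form."""
    clean_name = " ".join(name.split()).title()
    amount, unit = weight.split()
    return clean_name, round(float(amount) * TO_GRAMS[unit.lower()], 3)


standardized = [standardize(name, weight) for name, weight in raw]
print(standardized)
```

After standardization the first two entries become identical, so a duplication check run afterwards can catch them.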
76
Design Principles
Trustworthy: data should be accurate, consistent, complete, and reliable, with no misleading data representation
Accessible: data should be relevant and understandable
Elegant: eliminate the arbitrary and be thorough
77
Interval
quantitative data that's measured on a scale with equal intervals between values
78
Ratio
quantitative data that has a true zero point
79
Textual
stores any kind of text data