Final Exam Material Flashcards

(100 cards)

1
Q

what is data sourcing?

A

(also known as data collection) is the process of extracting data from external or internal sources

data sources include:
enterprise databases (historical data, customer sign-up information), web data (web pages, social media), mobile data (apps, GPS), government data, and survey data
2
Q

why use surveys as a data sourcing model?

A

efficient way to collect information about a large group of people, flexible medium that can measure attitudes/knowledge/preferences/etc., standardized–so less susceptible to error, easy to administer, can be tailored exactly to the topic you wish to study

3
Q

keys to effective surveying

A

begin with clear purpose, know what you want to be able to do with the data ahead of time, identify the most logical group to survey

4
Q

parts of a survey: title

A

should reflect the content of the survey, be easy to understand, and be concise

5
Q

parts of a survey: introduction statement

A

provides brief summary of survey’s purpose, includes information about the respondent’s confidentiality, motivates the respondent to complete the survey, provides an estimate of the time required to complete, should be clear and concise

6
Q

parts of a survey: questions

A

include directions for completing, each question should have a defined objective, pay attention to question wording, lead with high-interest questions, close with demographic questions, and keep it brief by eliminating unnecessary questions

7
Q

parts of a survey: survey logic

A

respondent should only be asked questions that apply to them, asking respondents to reply to questions that do not apply to them can lead to confusion and unreliable results (skip and display)

8
Q

parts of a survey: closing statement

A

thank the respondent for participating, provide contact information for questions, explain how the survey results will be disseminated, if any incentive is offered–provide relevant information

9
Q

double barreled questions

A

questions that attempt to get at multiple issues at once, and so tend to receive incomplete or confusing answers (ex. do you like pizza and ice cream?)

10
Q

high-interest questions

A

should be at beginning of survey, most important

11
Q

demographic (sensitive) questions

A

should be at end of survey, not as important but very helpful

12
Q

question types: open-ended

A

provides respondents the opportunity to express themselves in their own words, no correct answers, often elicit unanticipated responses which provide new directions for research, can be difficult to interpret/analyze if clear themes do not emerge, short answer text or essay format

13
Q

question types: closed-ended

A

more difficult to write than open-ended questions, have a finite set of answers, responses are easy to standardize and analyze statistically, may miss pertinent information if a key answer is not provided to respondents (can be corrected by using “other” response option)

14
Q

advantages and disadvantages of open-ended questions

A

advantages:
respondents can define central issues, address the issue of “why”

disadvantages:
can be time consuming, results can be more challenging to analyze, leading questions can lead to less reliable results

15
Q

advantages and disadvantages of closed-ended questions

A

advantages:
easy to answer, easier to analyze results

disadvantages:
cannot address the issue of “why,” limited options available to respondents, can be hard to gauge results (ex a 2 on a ranking can mean different things to different respondents)

16
Q

types of survey logic (skip vs display)

A

skip logic: allows you to send respondents to a future point in the survey based on how they answer a question. (ex. if a respondent indicates that they don’t fit your respondent criteria, they could immediately be skipped to the end of the survey.)

display logic: allows you to display questions conditionally based on the respondent’s answers to previous questions.

17
Q

survey administration: population vs sample

A

population: the larger set of individuals you wish to understand
sample: a subset selected from a population to survey

18
Q

sampling techniques: simple random sample

A

members of the subset are chosen completely at random so that every member of the population has an equal probability of being selected

19
Q

sampling techniques: stratified sample

A

the population is divided up into relatively homogeneous groups; then, a proportionate probability sample is drawn from the groups
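
A minimal sketch of a proportionate stratified sample, using only Python's standard library. The `stratified_sample` helper, the student data, and the 10% sampling fraction are hypothetical and chosen only for illustration:

```python
import random
from collections import defaultdict

def stratified_sample(population, stratum_of, fraction, seed=0):
    """Draw the same fraction at random from each stratum (group)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    strata = defaultdict(list)
    for member in population:
        strata[stratum_of(member)].append(member)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)  # proportionate share of this stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 60 freshmen and 40 seniors.
students = [("freshman", i) for i in range(60)] + [("senior", i) for i in range(40)]
picked = stratified_sample(students, stratum_of=lambda s: s[0], fraction=0.1)
```

Here each group contributes in proportion to its size (6 freshmen and 4 seniors), whereas a simple random sample of 10 would not guarantee that split.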

20
Q

sampling techniques: convenience sample

A

members of the subset are selected according to their availability

21
Q

survey analysis: reporting the results

A

a final report should include: purpose, design of survey, administration process, data analysis, and findings

22
Q

primary data

A

data collected from the original source by the investigator himself/herself for a specific purpose

23
Q

secondary data

A

data collected by someone else for some other purpose (but being utilized by the investigator for another purpose) or not from the original source

24
Q

advantages and disadvantages of primary data

A

advantages:
data collected is specific to the problem, quality of data can be ensured, may be possible to obtain additional data

disadvantages:
expensive, time consuming, requires setup and manpower

25
advantages and disadvantages of secondary data
advantages: cost-effective, quicker to gather
disadvantages: you cannot decide what is collected (may be out of date or inaccurate), no control over quality, hard to obtain additional data
26
robots.txt file
A text file that provides special instructions (e.g. privacy information) about a Web site to Web crawlers. Web site owners use the robots.txt file to give instructions to web robots (e.g., scrapers) about their site. The file is structured to specify which parts of the site robots are DISALLOWED to examine.
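
A small sketch of how a scraper might check robots.txt rules before visiting a page, using Python's standard `urllib.robotparser` module. The file contents, the `my-scraper` user agent, and the example.com URLs are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; a real file lives at <site>/robots.txt.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse the rules without making a network request

# Ask whether a (hypothetical) scraper may examine each URL.
allowed_public = parser.can_fetch("my-scraper", "https://example.com/products")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data")
```

With these rules, the `/private/` path is disallowed for every robot while other pages remain open.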
27
API
Application Programming Interface: intermediary software that allows two applications to talk to each other. Through a Web API, a sourcing application can talk to a website (i.e., extract information from the website). Most websites require developer accounts to access their Web API.
28
transactional information
encompasses all of the information contained within a single business process or unit of work, and its primary purpose is to support the performing of daily operational tasks
29
analytical information
encompasses all organizational information, and its primary purpose is to support performing of managerial analysis tasks
30
examples of transactional information
airline ticket, sales receipt, packing slip
31
examples of analytical information
product statistics, sales projections, future growth, trends
32
data quality
data that are fit for use by data consumers and satisfy the requirements of their intended use (depends on what one needs to know)
33
high-quality data
data that are relevant and accurately represent their corresponding concepts
34
high-quality information
information that is relevant and a faithful representation of what is being reported
35
characteristics and examples of high-quality information
accurate: is there an incorrect value in the information? (is the name spelled correctly? is the dollar amount recorded properly?)
complete: is a value missing from the information? (is the address complete?)
consistent: is aggregate or summary information in agreement with detailed information? (do all columns equal the true total of the individual item?)
timely: is the information current with respect to business needs? (is information updated weekly, daily, or hourly?)
unique: is each transaction and event represented only once in the information? (are there any duplicate customers?)
36
benefits of high-quality information
Information is everywhere in an organization. Employees must be able to obtain and analyze the many different levels, formats, and granularities of organizational information to make decisions. Successfully collecting, compiling, sorting, and analyzing information can provide tremendous insight into how an organization is performing.
37
examples of low information quality
missing information, incomplete information, probable duplicate information, potential wrong information, inaccurate information
38
sources of low-quality information
primary sources:
- customers intentionally enter inaccurate information to protect their privacy
- different entry standards and formats
- operators enter abbreviated or erroneous information by accident or to save time
- third party and external information contains inconsistencies, inaccuracies, and errors
- parallel data entry (duplicates)
39
costs/consequences of low-quality information
potential business effects resulting from low-quality information include:
- inability to accurately track customers
- difficulty identifying valuable customers
- inability to identify selling opportunities
- marketing to nonexistent customers
- difficulty tracking revenue
- inability to build strong customer relationships
40
why is data cleaning important?
improves your data quality and in doing so, increases overall productivity. When you clean your data, all outdated or incorrect information is gone - leaving you with the highest quality information
41
data cleaning strategies and activities
replace values (missing values in a column--null values, need to change a value everywhere it appears), remove duplicate rows (redundant data/repeated rows--dups), split column by delimiter (multiple values in one cell), trim (extra spaces before or after text), lowercase/uppercase/capitalize each word (text is mis-capitalized), create custom column-"glue" (values spread across multiple columns), create conditional column as a test (invalid data due to incorrect format), extract text using delimiter (text embedded in cell)
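
Several of these activities can be sketched in plain Python. This is only an illustration of the ideas, not the menu-driven tool the cards describe; the rows, the column names, and the `label` column are hypothetical:

```python
# Hypothetical raw rows: "name" has stray spaces and bad capitalization,
# "city_state" packs two values into one cell, and one row is a duplicate.
raw_rows = [
    {"name": "  alice SMITH ", "city_state": "Austin, TX"},
    {"name": "Bob Jones", "city_state": "Denver, CO"},
    {"name": "bob jones", "city_state": "Denver, CO"},
]

cleaned = []
seen = set()
for row in raw_rows:
    name = row["name"].strip().title()           # trim, then capitalize each word
    city, state = row["city_state"].split(", ")  # split column by delimiter
    key = (name, city, state)
    if key in seen:                              # remove duplicate rows
        continue
    seen.add(key)
    label = name + " (" + state + ")"            # "glue" values from two columns
    cleaned.append({"name": name, "city": city, "state": state, "label": label})
```

Note that the duplicate only becomes detectable after trimming and capitalization are fixed, which is why the order of cleaning steps can matter.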
42
replacing values
transform tab, replace values
43
removing duplicates
home tab, remove rows, remove duplicates
44
removing blank rows
sort, delete by number of blank rows
45
trim
transform tab, format, trim
46
split column by delimiter (i.e. how to choose delimiter)
home, split column, by delimiter
47
glue/concatenate multiple values
add column tab, custom column, & " " &
48
capitalization
add column, format, whichever option needed
49
conditional column as a test
add column, conditional column, parameters
50
extract text using delimiter
add column, extract, feature needed
51
what is a relational database?
databases that store information in related two-dimensional tables
52
what is the purpose of a database?
- to store data
- to provide an organizational structure for data
- to provide a mechanism for querying, creating, modifying, and deleting data
53
what can be stored in a database?
specific details about each type of object
54
how is the design of a database communicated?
through data models
55
what is a data model?
graphical, logical structures that detail the relationships among data fields
56
what is included in a data model?
tables, data fields/attributes, keys (primary and foreign), relationships
57
what is a database table (entity)?
a representation of one type of object
58
two views of database table
model (an abstract representation of the structure) or with records in a two-dimensional grid
59
what is a record?
A record is a set of data for the fields in a table
60
what is a data field (attribute)?
the smallest or most basic unit of information that is stored about an object
61
well-structured vs ill-structured data fields
only one type of information should be stored in each field
62
database keys
create a logical relationship between two tables
63
what is a primary key?
a data field that uniquely identifies each row in a table (within each table, the values in the PK column can never repeat or be duplicated)
64
which attribute should be selected as a primary key?
typically a UNIQUE value (like customer_id)
65
what is a foreign key?
a primary key of one table that appears as an attribute in another table and acts to provide a logical relationship between the two tables
66
characteristics of foreign keys
foreign key and crow's foot are always together, primary key featured in other data tables
67
parent table and child table
To determine which table should be the child and which should be the parent, determine which makes more sense based on the business context
68
how to determine parent and child tables
Parent: ONE Child: MANY
69
crow's foot notation
When creating a logical relationship between two tables, one table gives its primary key to the other. The table that gives its primary key is the parent table, and it has a perpendicular line (|) on the outer edge of its end of the line. The other table is the child table and it has a crow's foot on the outer edge of its end of the line.
70
referential integrity
a property of data stating that all its references are valid
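
A minimal sketch of referential integrity using Python's built-in `sqlite3` module. The `customer` and `orders` tables and their columns are hypothetical; the point is that a child row whose foreign key has no matching parent primary key is rejected:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when this is on
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders ("
    " order_id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customer(customer_id))"
)
conn.execute("INSERT INTO customer VALUES (1, 'Alice')")  # parent row (ONE)
conn.execute("INSERT INTO orders VALUES (100, 1)")        # child row: customer 1 exists

try:
    conn.execute("INSERT INTO orders VALUES (101, 99)")   # there is no customer 99
    rejected = False
except sqlite3.IntegrityError:
    rejected = True                                       # invalid reference refused

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Here `customer` is the parent table and `orders` is the child table that carries the parent's primary key as a foreign key.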
71
database vs. lists (spreadsheets)
with lists: storing multiple objects in the same row creates modification problems (update problems, deletion problems, insertion problems)
72
advantages of using a database
increased flexibility, increased scalability and performance, reduced information redundancy, increased information integrity (quality), increased information security
73
What is a query?
the identification and transformation of data to answer a question
74
How are queries used in organizations?
to answer business questions and generate reports for decision-making (ex. sales reports by region)
75
How to determine query requirements (e.g. which tables need to be used)
dissect key parameters (important elements of the questions that need to be part of the query)
76
Single table queries
use only one table as their data source, PKs and FKs must be kept in mind (different meanings)
77
Aggregation (sum, min, max, median, average, count values, count distinct values)
sum: returns the total of the column
min: returns the lowest value in the column
max: returns the highest value in the column
median: returns the median (middle) of the column's values
average: returns the average of the column's values
count values: returns the number of values in the column
count distinct values: returns the number of different and unique values in the column
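
As a rough illustration, each aggregation can be computed on a small column of values with Python's standard library. The `order_totals` list is hypothetical:

```python
import statistics

order_totals = [20, 35, 20, 50, 75]  # hypothetical column of values

summary = {
    "sum": sum(order_totals),
    "min": min(order_totals),
    "max": max(order_totals),
    "median": statistics.median(order_totals),
    "average": statistics.mean(order_totals),
    "count": len(order_totals),                # counts every value, repeats included
    "count_distinct": len(set(order_totals)),  # counts each different value once
}
```

The value 20 appears twice, so the count is 5 while the distinct count is 4.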
78
When to use aggregation?
when one wants an entire column to be summarized into a single value
79
Aggregating within groups (group by)
Using the Group By menu, specify the grouping column, which column should be aggregated, and how. Any type of aggregation that can be performed on an entire column can be performed within groups. The query will automatically determine how many groups exist by examining the grouping column's distinct values.
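
The same idea can be sketched in plain Python: one running sum per distinct value of the grouping column. The `sales` rows and region names are hypothetical:

```python
from collections import defaultdict

# Hypothetical sales rows: (region, amount). "region" is the grouping column.
sales = [("East", 100), ("West", 250), ("East", 50), ("West", 25), ("North", 80)]

totals_by_region = defaultdict(int)
for region, amount in sales:
    totals_by_region[region] += amount  # sum "amount" within each group

group_count = len(totals_by_region)  # number of groups = distinct regions
```

The number of groups is never stated up front; it falls out of the distinct values found in the grouping column.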
80
Sorting (ascending, descending)
arranges the rows in the query by examining the values in a specified column
Ascending: A to Z, Lowest Number to Highest Number
Descending: Z to A, Highest Number to Lowest Number
81
Filtering (text filter, number filter)
keeping only the rows that meet specified criteria and removing the rest. Text filters can specify the entire cell contents ("is"), partial cell contents ("contains"), or starting with a specific character ("begins with"). Number filters can specify operators: equal to, not equal to, greater than, less than, greater than or equal to, less than or equal to.
82
Multiple filtering criteria (AND, OR)
When combining multiple clauses, you must specify whether they are connected via AND (more conservative) or OR (less conservative).
AND: both tests must be true
OR: only one test has to be true
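
A short sketch of why AND is the more conservative connector, using hypothetical `orders` rows and two tests (region is East, amount is over 100):

```python
# Hypothetical rows to filter.
orders = [
    {"region": "East", "amount": 120},
    {"region": "East", "amount": 40},
    {"region": "West", "amount": 300},
    {"region": "West", "amount": 15},
]

# AND: a row is kept only if both tests are true.
both = [o for o in orders if o["region"] == "East" and o["amount"] > 100]

# OR: a row is kept if at least one test is true.
either = [o for o in orders if o["region"] == "East" or o["amount"] > 100]
```

On this data AND keeps one row and OR keeps three.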
83
Multi-table queries
Goal: Moving information into a single table so that single table queries can be applied
84
When to use append
If multiple tables have the exact same columns and store similar information, the tables can be appended to form a single table
85
When to use merge
For queries that require data from multiple different tables, the tables must first be merged together using a join. Merge selects all rows from both participating tables or queries as long as there is a match between the specified columns.
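
A minimal sketch of a merge that keeps only matching rows, written in plain Python. The `customers` and `orders` tables and the `customer_id` matching column are hypothetical:

```python
# Hypothetical parent table.
customers = [
    {"customer_id": 1, "name": "Alice"},
    {"customer_id": 2, "name": "Bob"},
    {"customer_id": 3, "name": "Cara"},
]
# Hypothetical child table; order 103 points at a customer that does not exist.
orders = [
    {"order_id": 100, "customer_id": 1},
    {"order_id": 101, "customer_id": 1},
    {"order_id": 102, "customer_id": 2},
    {"order_id": 103, "customer_id": 9},
]

names_by_id = {c["customer_id"]: c["name"] for c in customers}
merged = [
    {**o, "name": names_by_id[o["customer_id"]]}  # order columns plus customer name
    for o in orders
    if o["customer_id"] in names_by_id            # keep only rows with a match
]
```

Order 103 (no matching customer) and Cara (no orders) both drop out, leaving a single three-row table that single table queries can then be applied to.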
86
What is data visualization?
The presentation of data in a pictorial or graphical format. Human brains process information more easily graphically than analytically (tables). Allows trends or patterns in the data to be identified, more difficult concepts to be easily grasped, and analysis results to be presented.
87
Types of variables (numerical)
Variables to which a number is assigned as a quantitative value
88
Types of variables (categorical)
Variables defined by the classes or categories into which an individual member falls
89
Types of variables (numerical, discrete)
Reflects a number obtained by counting—no decimal. Gaps between possible values (e.g. number of orders 1, 5, 7 etc. No 1.5 orders)
90
Types of variables (numerical, continuous)
Reflects a measurement; the number of decimal places depends on the precision of the measuring device. (e.g. money spent 228.58 dollars)
91
Types of variables (categorical, nominal)
Name only (e.g. gender: female, male; hair color: black, brown, red, etc.).
92
Types of variables (categorical, ordinal)
Nominal categories with an implied order (e.g. low, medium, high).
93
Determining what type of chart to use
requirements, content you are trying to visualize, attributes available, does the data need to be aggregated or filtered, data needed for the chart
94
Elements of a graph (title, axis titles, legend, data labels)
Title: a descriptive text that uniquely identifies the graph. The title should not just repeat the labels, but add information specific to what the data represents.
Axis titles: a short descriptive label that represents each axis.
Legend: many charts will use different visual properties such as colors or shapes to represent different values of data. A legend identifies what these associations mean. Not every chart has a legend.
Data labels: numerical values for each data point visualized in the graph. Data labels are not applicable to every graph (e.g. map, word cloud).
95
Filters
Remove all but the data you want to focus on (visual, page, report filters); additional attributes can also be added as a filter (even when they are not one of the chart fields)
96
Slicers
An alternate way of filtering that is displayed on the report canvas, can add onto report just like any other visualization, can be used to display commonly-used or important filters on the report canvas for easier access, and make it easier to see the current filtered state without having to open the filter menu
97
What is a dashboard?
Set of visualizations (usually interactive) that allow the reader to draw their own conclusion by looking at the data
98
Purpose of dashboards
Help to summarize and monitor events or activities at a glance by providing key insights and analysis about data on one or more pages or screens
99
What is an infographic?
Data visualization tools that present complex data and information in many visualizations on one page. Static set of images that lead the reader to a conclusion that is pre-ordained by the author. Information + Graphic. Simplify, condense, engage, and enhance.
100
Design guidelines
Consistent, complementary colors across visualizations, but use contrasting colors within. Color text so it is visible (contrast). Use both text and graphics. Maximum of 3 fonts. Include a title, icons, lines/arrows, "whitespace," alignment, repetition, proximity.