Google Data Analysis Flashcards

1
Q

A/B testing

A

The process of testing two variations of the same web page to determine which page is more successful at attracting user traffic and generating revenue

2
Q

Compatibility

A

How well two or more datasets are able to work together

3
Q

Data analysis process

A

The six phases of ask, prepare, process, analyze, share, and act whose purpose is to gain insights that drive informed decision-making

4
Q

Data analysis

A

The collection, transformation, and organization of data in order to draw conclusions, make predictions, and drive informed decision-making

5
Q

Data life cycle

A

The sequence of stages that data experiences, which include plan, capture, manage, analyze, archive, and destroy

6
Q

First-party data

A

Data collected by an individual or group using their own resources

7
Q

Gap analysis

A

A method for examining and evaluating the current state of a process in order to identify opportunities for improvement in the future

8
Q

Problem types

A

The various problems that data analysts encounter, including categorizing things, discovering connections, finding patterns, identifying themes, making predictions, and spotting something unusual

9
Q

Statistical power

A

The probability that a test of significance will recognize an effect that is present

10
Q

Statistical significance

A

The probability that sample results are not due to random chance

11
Q

Wide data

A

A dataset in which every data subject has a single row with multiple columns to hold the values of various attributes of the subject
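
For example, a minimal sketch using pandas (an assumption; any spreadsheet or SQL tool can do the same), with made-up subject and score columns, showing wide data reshaped into long data (compare the Long data card later in this deck):

```python
import pandas as pd

# Wide format: one row per subject, one column per year's value
wide = pd.DataFrame({
    "subject": ["A", "B"],
    "score_2021": [10, 7],
    "score_2022": [12, 9],
})

# Long format: one row per subject per time point
long = wide.melt(id_vars="subject", var_name="year", value_name="score")
print(long)
```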

12
Q

Administrative metadata

A

Metadata that indicates the technical source of a digital asset

13
Q

Descriptive metadata

A

Metadata that describes a piece of data and can be used to identify it at a later point in time

14
Q

Structural metadata

A

Metadata that indicates how a piece of data is organized and whether it is part of one or more than one data collection

15
Q

3 types of metadata

A
  • descriptive
  • structural
  • administrative
16
Q

Foreign key

A

A field within a database table that is a primary key in another table (Refer to primary key)

17
Q

Primary key

A

An identifier in a database that references a column in which each value is unique (Refer to foreign key)
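
For illustration, a minimal sketch using Python's built-in sqlite3 module (the table and column names are invented): customer_id is the primary key of one table and appears as a foreign key in another.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when enabled

# customer_id is the primary key of the customers table ...
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")

# ... and a foreign key in the orders table, linking each order to one customer
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers (customer_id),
        amount      REAL
    )
""")
```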

18
Q

Metadata

A

Data about data

19
Q

Elements of metadata

A
  • title and description
  • tags and categories
  • who created it and when
  • who last modified it and when
  • who can access or update it
20
Q

Metadata repository

A

A database created to store metadata

21
Q

Metadata repositories

A
  • describe the state and location of the metadata
  • describe the structure of the tables inside
  • describe how data flows through the repository
  • keep track of who accesses the metadata and when
22
Q

Hypothesis testing

A

A process to determine if a survey or experiment has meaningful results
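
As an illustration only (not the course's own procedure), a two-sample t-test with SciPy, assuming SciPy is installed; the metric values for the two page variants are made up:

```python
from scipy import stats

# Made-up conversion rates observed under two page variants
variant_a = [0.12, 0.15, 0.11, 0.14, 0.13]
variant_b = [0.16, 0.18, 0.17, 0.15, 0.19]

# Null hypothesis: the two variants have the same mean conversion rate
t_stat, p_value = stats.ttest_ind(variant_a, variant_b)
print(f"p = {p_value:.4f}")  # a small p-value suggests the result is meaningful
```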

23
Q

Confidence level

A

The probability that a sample size accurately reflects the greater population

24
Q

Margin of error

A

The maximum amount that sample results are expected to differ from those of the actual population

25
To calculate margin of error, you need:
- population size - sample size - confidence level
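
A rough sketch of the common proportion-based calculation, assuming the usual z-scores for each confidence level and an optional finite-population correction; the formula a given survey calculator uses may differ slightly:

```python
import math

def margin_of_error(sample_size, confidence_level=0.95, proportion=0.5,
                    population_size=None):
    """Margin of error for a sample proportion, with an optional
    finite-population correction when the population size is known."""
    z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}[confidence_level]
    moe = z * math.sqrt(proportion * (1 - proportion) / sample_size)
    if population_size:
        moe *= math.sqrt((population_size - sample_size) / (population_size - 1))
    return moe

print(f"{margin_of_error(sample_size=100):.1%}")  # about ±9.8% at 95% confidence
```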
26
Dirty data
Data that is incomplete, incorrect, or irrelevant to the problem to be solved
27
Clean data
Data that is complete, correct, and relevant to the problem being solved
28
Data engineer
A professional who transforms data into a useful format for analysis and gives it a reliable infrastructure
29
Data warehousing specialist
A professional who develops processes and procedures to effectively store and organize data
30
Confidence interval
A range of values that conveys how likely a statistical estimate reflects the population
31
Statistical significance
The probability that sample results are not due to random chance
32
Why a minimum sample of 30?
This recommendation is based on the central limit theorem (CLT) from probability and statistics. A sample of 30 is generally considered the smallest size for which the CLT still holds.
33
Central limit theorem (CLT)
The central limit theorem is an important theorem of probability theory. It states that the distribution of the sum of independent, identically distributed random variables approaches a normal distribution as the number of summands grows large. This means that many random variables that occur in practice can be approximated by a normal distribution. The theorem is of great importance in statistics because it makes many statistical tests and estimates possible even when the underlying distribution is unknown, and it underlies many statistical methods with applications in fields such as financial mathematics, physics, and engineering.
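
A small simulation of the theorem, assuming NumPy is available: means of samples of size 30 drawn from a clearly skewed population already cluster in a roughly normal shape around the population mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# A clearly non-normal (right-skewed, exponential) population with mean 2.0
population = rng.exponential(scale=2.0, size=100_000)

# Means of many samples of size 30 are already approximately normally distributed
sample_means = [rng.choice(population, size=30).mean() for _ in range(5_000)]
print(f"mean of sample means: {np.mean(sample_means):.2f} (population mean: 2.00)")
print(f"std of sample means:  {np.std(sample_means):.2f} (theory: 2/sqrt(30) = 0.37)")
```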
34
Random sampling
A way of selecting a sample from a population so that every possible type of the sample has an equal chance of being chosen
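
A minimal sketch with Python's standard library, using a made-up list of customer IDs; random.sample gives every member of the population the same chance of being chosen.

```python
import random

population = [f"cust_{i}" for i in range(1, 501)]  # hypothetical customer IDs

random.seed(42)                            # for reproducibility
sample = random.sample(population, k=30)   # simple random sample without replacement
print(sample[:5])
```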
35
Sampling bias
Overrepresenting or underrepresenting certain members of a population as a result of working with a sample that is not representative of the population as a whole
36
Sample
In data analytics, a segment of a population that is representative of the entire population
37
types of insufficient data
- data from only one source - data that keeps updating - outdated data - geographically limited data
38
SMART methodology
A tool for determining a question’s effectiveness based on whether it is specific, measurable, action-oriented, relevant, and time-bound
39
Critical questions about predictive analytics models
- Why is it taking so long to put new or updated models into production?
- Who created the model and why?
- What input variables are used to make predictions?
- How are the models used?
- How are the models performing, and when were they last updated?
- Where is the supporting documentation?
No answers means no real value.
40
making predictions
using data to make an informed decision about how things may be in the future
41
6 problem types that data analysts typically face
1. making predictions 2. categorizing things 3. spotting something unusual 4. identifying themes 5. discovering connections 6. finding patterns
42
data analysis process (google)
The six phases of ask, prepare, process, analyze, share, and act whose purpose is to gain insights that drive informed decision-making
43
Data-driven decision-making
Using facts to guide business strategy
44
Algorithm
A process or set of rules followed for a specific task
45
how data is collected
- interviews - observations - forms - questionnaires - surveys - cookies
46
Metric
A single, quantifiable type of data that is used for measurement
47
Problem domain
The area of analysis that encompasses every activity affecting or affected by a problem
48
Structured thinking
The process of recognizing the current problem or situation, organizing available information, revealing gaps and opportunities, and identifying options
49
Scope of work (SOW)
An agreed-upon outline of the tasks to be performed during a project
50
Report
A static collection of data periodically given to stakeholders
51
Quantitative data
A specific and objective measure, such as a number, quantity, or range
52
Quantitative data tools
- structured interviews - surveys - polls
53
Qualitative data
A subjective and explanatory measure of a quality or characteristic
54
Qualitative data tools
- focus groups - social media - in-person interviews
55
Data life cycle
The sequence of stages that data experiences, which include plan, capture, manage, analyze, archive, and destroy
56
Best practices when organizing data
- naming conventions - foldering - archiving older files
57
Data life cycle 5) Archive
keep relevant data stored long-term for future reference
58
Bias
A conscious or subconscious preference in favor of or against a person, group of people, or thing
59
Confirmation bias
The tendency to search for or interpret information in a way that confirms pre-existing beliefs
60
Interpretation bias
The tendency to interpret ambiguous situations in a positive or negative way
61
Data integrity
The accuracy, completeness, consistency, and trustworthiness of data throughout its life cycle
62
Data replication
The process of storing data in multiple locations
63
Data transfer
The process of copying data from a storage device to computer memory or from one computer to another
64
Data manipulation
The process of changing data to make it more organized and easier to read
65
Statistical significance
The probability that sample results are not due to random chance
66
Data bias
When a preference in favor of or against a person, group of people, or thing systematically skews data analysis results in a certain direction
67
types of data bias
- observer bias - interpretation bias - confirmation bias - sampling bias
68
Continuous data
Data that is measured and can have almost any numeric value
69
Discrete data
Data that is counted and has a limited number of values
70
Ordinal data
Qualitative data with a set order or scale
71
External data
Data that lives, and is generated, outside of an organization
72
Nominal data
A type of qualitative data that is categorized without a set order
73
Internal data
Data that lives within a company’s own systems
74
Qualitative data
A subjective and explanatory measure of a quality or characteristic
75
Second-party data
Data collected by a group directly from its audience and then sold
76
Population
In data analytics, all possible data values in a dataset
77
Third-party data
Data provided from outside sources who didn’t collect it directly
78
Structured data
Data organized in a certain format such as rows and columns
79
Unstructured data
Data that is not organized in any easily identifiable manner
80
Long data
A dataset in which each row is one time point per subject, so each subject has data in multiple rows
81
Dataset
A collection of data that can be manipulated or analyzed as one unit
82
Attribute
A characteristic or quality of data used to label a column in a table
83
Fairness
A quality of data analysis that does not create or reinforce bias
84
Query
A request for data or information from a database
85
Data governance
A process for ensuring the formal management of a company’s data assets
86
Naming conventions
Consistent guidelines that describe the content, creation date, and version of a file in its name
87
Data-inspired decision-making
Exploring different data sources to find out what they have in common
88
Data analysis
The collection, transformation, and organization of data in order to draw conclusions, make predictions, and drive informed decision-making
89
Data science
A field of study that uses raw data to create new ways of modeling and understanding the unknown
90
Data analysis process
The six phases of ask, prepare, process, analyze, share, and act whose purpose is to gain insights that drive informed decision-making
91
Formula
A set of instructions used to perform a calculation using the data in a spreadsheet
92
Observation
The attributes that describe a piece of data contained in a row of a table
93
Data ecosystem
The various elements that interact with one another in order to produce, manage, store, organize, analyze, and share data
94
Data
A collection of facts
95
Data validation
A tool for checking the accuracy and quality of data
96
Analytical skills
Qualities and characteristics associated with using facts to solve problems
97
Observer bias
The tendency for different people to observe things differently (also called experimenter bias)
98
Unbiased sampling
When the sample of the population being measured is representative of the population as a whole
99
Data interoperability
The ability to integrate data from multiple sources and a key factor leading to the successful use of open data among companies and governments
100
Data anonymization
The process of protecting people's private or sensitive data by eliminating identifying information
101
Openness
The aspect of data ethics that promotes the free access, usage, and sharing of data
102
Currency
The aspect of data ethics that presumes individuals should be aware of financial transactions resulting from the use of their personal data and the scale of those transactions
103
Consent
The aspect of data ethics that presumes an individual’s right to know how and why their personal data will be used before agreeing to provide it
104
Transaction transparency
The aspect of data ethics that presumes all data-processing activities and algorithms should be explainable and understood by the individual who provides the data
105
Ownership
The aspect of data ethics that presumes individuals own the raw data they provide and have primary control over its usage, processing, and sharing
106
Data ethics
Well-founded standards of right and wrong that dictate how data is collected, shared, and used
107
Data model
A tool for organizing data elements and how they relate to one another
108
Data element
A piece of information in a dataset
109
Open data
Data that is available to the public
110
threats to data integrity
- human error - viruses - malware - hacking - system failures
111
Data formats
1. internal - external
2. continuous - discrete
3. structured - unstructured
4. nominal - ordinal
5. qualitative - quantitative
6. primary - secondary
112
Primary data
collected by a researcher from first-hand sources
113
Data type
An attribute that describes a piece of data based on its values, its programming language, or the operations it can perform
114
2 common methods to develop data models
- entity relationship diagram (ERD) - unified modeling language (UML)
115
5) Share DA-process
- understand visualization - create effective visuals - bring data to life - use data storytelling - communicate to help others understand results
116
Sorting
The process of arranging data into a meaningful order to make it easier to understand, analyze, and visualize
117
Decision intelligence
A combination of applied data science and the social and managerial sciences
118
Data life cycle 6) Destroy
remove data from storage and delete any shared copies of the data
119
3) Process DA-Process
- create and transform data - maintain data integrity - test data - clean data - verify and report on cleaning results
120
4) Analyse DA-Process
- use tools to format and transform data - sort and filter data - identify patterns and draw conclusions - make predictions and recommendations - make data-driven decisions
121
2) Prepare DA-Process
- understand how data is generated and collected - identify and use different data formats, types, and structures - make sure data is unbiased and credible - organize and protect data
122
Step 1 - Ask
- define the problem you are trying to solve
- make sure you fully understand the stakeholders' expectations
- focus on the actual problem and avoid any distractions
- collaborate with stakeholders and keep an open line of communication
- take a step back and see the whole situation in context
123
Ask DA-Process
- ask effective questions - define the problem - use structured thinking - communicate with others
124
Dashboards: pros & cons
Pros: dynamic, automatic, and interactive; more stakeholder access; low maintenance
Cons: labor-intensive design; can be confusing; potentially uncleaned data
125
Reports: pros & cons
Pros: high-level historical data; easy to design; pre-cleaned and sorted data
Cons: continual maintenance; less visually appealing; static
126
Step 4. Analyse
- think analytically about your data
- perform calculations
- combine data from multiple sources
- create tables with your results
Questions: 1) What story is my data telling me? 2) How will my data help me solve this problem?
127
Step 3. Process
Clean data of any possible errors, inaccuracies, or inconsistencies:
- using spreadsheet functions to find incorrectly entered data
- using SQL functions to check for extra spaces
- removing repeated entries
- checking for bias in the data
Questions: 1. What data errors or inaccuracies might get in the way of the best possible answer to the problem I'm trying to solve? 2. How can I clean my data so the information I have is more consistent?
128
Data analytics
The science of data
129
Confidence interval
A range of values that conveys how likely a statistical estimate reflects the population
130
A/B testing
The process of testing two variations of the same web page to determine which page is more successful at attracting user traffic and generating revenue
131
Access control
Features such as password protection, user permissions, and encryption that are used to protect a spreadsheet
132
Accuracy
The degree to which data conforms to the actual entity being measured or described
133
Action-oriented question
A question whose answers lead to change
134
Analytical thinking
The process of identifying and defining a problem, then solving it by using data in an organized, step-by-step manner
135
Bad data source
A data source that is not reliable, original, comprehensive, current, and cited (ROCCC)
136
Big data
Large, complex datasets typically involving long periods of time, which enable data analysts to address far-reaching business problems
137
Boolean data
A data type with only two possible values, usually true or false
138
Changelog
A file containing a chronologically ordered list of modifications made to a project
139
Compatibility
How well two or more datasets are able to work together
140
Completeness
The degree to which data contains all desired components or measures
141
Consistency
The degree to which data is repeatable from different points of entry or collection
142
Context
The condition in which something exists or happens
143
Cross-field validation
A process that ensures certain conditions for multiple data fields are satisfied
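
A tiny sketch of the idea in Python (the field names and the rule are hypothetical): the check spans two fields at once rather than validating each field in isolation.

```python
from datetime import date

record = {"order_date": date(2023, 5, 4), "ship_date": date(2023, 5, 1)}

# Cross-field rule: an order cannot ship before it was placed
if record["ship_date"] < record["order_date"]:
    print("cross-field validation failed: ship_date precedes order_date")
```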
144
Dashboard
A tool that monitors live, incoming data
145
Data analyst
Someone who collects, transforms, and organizes data in order to draw conclusions, make predictions, and drive informed decision-making
146
Data constraints
The criteria that determine whether a piece of data is clean and valid
147
Data design
How information is organized
148
Data mapping
The process of matching fields from one data source to another
149
Data merging
The process of combining two or more datasets into a single dataset
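
For example, a pandas sketch with invented column names, merging two datasets on a shared customer_id key:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3], "region": ["EU", "US", "EU"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3], "amount": [20.0, 35.5, 12.0]})

# Combine the two datasets into one, keeping every order and attaching its customer's region
merged = orders.merge(customers, on="customer_id", how="left")
print(merged)
```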
150
Data range
Numerical values that fall between predefined maximum and minimum values
151
Data security
Protecting data from unauthorized access or corruption by adopting safety measures
152
Data strategy
The management of the people, processes, and tools used in data analysis
153
Data visualization
The graphical representation of data
154
Estimated response rate
The average number of people who typically complete a survey
155
Experimenter bias
The tendency for different people to observe things differently (Refer to Observer bias)
156
Gap analysis
A method for examining and evaluating the current state of a process in order to identify opportunities for improvement in the future
157
General Data Protection Regulation of the European Union (GDPR)
A regulation enacted by the European Union to help protect people and their data
158
Good data source
A data source that is reliable, original, comprehensive, current, and cited (ROCCC)
159
Incomplete data
Data that is missing important fields
160
Inconsistent data
Data that uses different formats to represent the same thing
161
Incorrect/inaccurate data
Data that is complete but inaccurate
162
Mandatory
A data value that cannot be left blank or empty
163
Normalized database
A database in which only related data is stored in each table
164
Outdated data
Any data that has been superseded by newer and more accurate information
165
Problem types
The various problems that data analysts encounter, including categorizing things, discovering connections, finding patterns, identifying themes, making predictions, and spotting something unusual
166
Redundancy
When the same piece of data is stored in two or more places
167
Reframing
The process of restating a problem or challenge, then redirecting it toward a potential resolution
168
Regular expression (RegEx)
A rule that says the values in a table must match a prescribed pattern
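
A small illustration with Python's re module; the YYYY-MM-DD date pattern is just one example of a prescribed format a column might be required to match.

```python
import re

pattern = re.compile(r"\d{4}-\d{2}-\d{2}")  # prescribed format: YYYY-MM-DD

for value in ["2023-07-14", "14/07/2023"]:
    ok = bool(pattern.fullmatch(value))
    print(value, "matches" if ok else "does not match")
```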
169
Small data
Small, specific data points typically involving a short period of time, which are useful for making day-to-day decisions
170
Stakeholders
People who invest time and resources into a project and are interested in its outcome
171
Technical mindset
The ability to break things down into smaller steps or pieces and work with them in an orderly and logical way
172
Transferable skills
Skills and qualities that can transfer from one job or industry to another
173
Typecasting
Converting data from one type to another
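
For instance, in Python (spreadsheets and SQL offer equivalents such as VALUE() and CAST):

```python
raw = "42"             # a number stored as text
as_int = int(raw)      # typecast text -> integer
as_float = float(raw)  # typecast text -> floating point
back = str(as_float)   # and back to text
print(as_int + 1, as_float / 2, back)
```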
174
Validity
The degree to which data conforms to constraints when it is input, collected, or created
175
Verification
A process to confirm that a data-cleaning effort was well executed and the resulting data is accurate and reliable
176
aspects of data ethics
-ownership -transaction transparency -consent -currency -privacy -openness
177
PII
Personally identifiable information: information that can be used by itself or with other data to track down a person's identity
178
privacy
Preserving a data subject's information and activity any time a data transaction occurs
179
structured data
- defined data types
- most often quantitative data
- easy to organize
- easy to search
- easy to analyze
- stored in relational databases & data warehouses
- contained in rows and columns
180
Re-identification
The process of piecing together supposedly anonymized data to re-establish the identity of the person it describes (the risk that data anonymization is meant to prevent)
181
confidence level
The confidence level is targeted before you start your study, because it will affect how big your margin of error is at the end of the study. It expresses how confident you are in the survey results. For example, a 95% confidence level means that if you were to run the same survey 100 times, you would get similar results 95 of those 100 times.
182
data collection considerations
- how the data will be collected
- choose data sources
- decide what data to use
- how much data to collect
- select the right data type
- determine the time frame
183
6) act DA process
- apply your insights - solve problems - make decisions - create something new
184
good data sources
Reliable, Original, Comprehensive, Current, Cited (ROCCC)
185
spotting something unusual
identifying data that is different from the norm
186
4) analyse DA process
use data to solve problems, make decisions, and support business goals
187
Data analyst skills (qualities)
- curiosity - understanding context - having a technical mindset - data design - data strategy
188
SAS's iterative life cycle
ask-prepare-explore-model-implement-act-evaluate