Privacy + Data Sci/Machine Learning Basics Flashcards
Test 1 Prep (42 cards)
What is Artificial Intelligence?
enabling computers with intelligence to solve complex problems
e.g.
robots
chatbots
online gaming
voice assistants
What is machine learning?
extracting knowledge from data: learning from the data in order to make predictions
e.g.
recommendations
search algorithms
classification
What is Data Science?
Business and Problem Solving using descriptive and predictive analysis
e.g.
retail trends
financial analysis
transit development
According to the Harvard Business Review, what is data science about?
Data science is about infrastructure, testing, using machine learning for decision making, and data products.
Use data analysis to get insights
What can DS work help reveal?
hidden impacts
What are the 5 stages of the Data Science Lifecycle?
Problem definition
Data Collection/curation
Data Analysis
Context
Decision Making
What questions do we try to answer in problem definition/formulation?
What problem are we trying to solve?
What do we want to find out that we don’t know now?
What are our assumptions and hypotheses?
How will we measure success?
What are the 2 types of questions we ask in problem definition/formulation?
Exploratory –> Relationship between different elements/variables/features
Descriptive –> describe statistically, or through data, the situation
Helpful in determining what data you need
What questions do we answer in Data Collection and Curation?
What data do we need?
How do we collect that data?
Do we have enough data?
Does the data contain what we are looking for?
What is the scope of the data?
Is the data clean?
What do we do in the Data Analysis step?
from simple statistical understanding to complex models for prediction or inference.
- statistical summaries
- discover patterns
- machine learning algorithms
- supervised vs unsupervised vs semi-supervised learning
- Classification - predict a label (discrete)
- Regression - predict a continuous value
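The difference between classification and regression can be sketched on a tiny example. This is only an illustration under stated assumptions: the data values are made up, and the 1-nearest-neighbour classifier and least-squares line are stand-ins chosen for brevity, not methods named in the cards.

```python
# Toy labelled data (made-up values, purely illustrative).
hours = [1.0, 2.0, 3.0, 4.0, 5.0]            # feature: hours studied
passed = ["no", "no", "yes", "yes", "yes"]   # discrete label -> classification
scores = [52.0, 58.0, 65.0, 71.0, 79.0]      # continuous value -> regression


def classify_1nn(x, xs, labels):
    """Classification: return the label of the closest training point."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
    return labels[nearest]


def fit_line(xs, ys):
    """Regression: ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(hours, scores)
predicted_label = classify_1nn(3.6, hours, passed)   # a discrete label
predicted_score = slope * 3.6 + intercept            # a continuous value
print(predicted_label, round(predicted_score, 1))
```

Both models are supervised: they learn from examples that already carry the answer (the label or the score).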
What kinds of questions do we answer in the data analysis step?
Are we able to answer our question?
Do we have the right form of data and answer?
Are we allowed to use this data?
Are there biases in the data or result?
Can we explain the results?
What kind of questions do we answer in the context step?
What world or context are we in?
Does this analysis generalize to other data?
What do we do in the context step of the data science lifecycle?
Assess whether inferences carry over from samples to populations
Understand likelihood of predictions with intervals
Analyze Correlation vs Causation
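Two of the context-step ideas above can be sketched in a few lines: an interval around a sample mean, and a correlation coefficient. The numbers are invented, and the interval uses a normal approximation (z = 1.96) as a simplifying assumption; for a sample this small a t-distribution would usually be preferred.

```python
import math
import statistics

sample = [4.1, 3.8, 4.4, 4.0, 3.9, 4.3, 4.2, 3.7]  # made-up measurements


def mean_confidence_interval(values, z=1.96):
    """Approximate 95% interval for the population mean (normal approximation)."""
    mean = statistics.mean(values)
    std_err = statistics.stdev(values) / math.sqrt(len(values))
    return mean - z * std_err, mean + z * std_err


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


low, high = mean_confidence_interval(sample)

# Passively observed (made-up) data: temperature vs ice cream sales.
temperature = [20, 22, 25, 27, 30]
sales = [15, 18, 24, 27, 33]
r = pearson_r(temperature, sales)
print(f"mean in [{low:.2f}, {high:.2f}], r = {r:.3f}")
```

A value of r close to 1 only describes an association in observed data; establishing causation would need something like a randomized experiment.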
What are examples of correlation?
discovery of patterns
passive data collection
What are examples of causation?
randomized experiments as needed
active data - interactive data collection
What do we do in the decision making step of the data science lifecycle?
What is the data telling us?
Does it answer the right question?
Do we trust our conclusions and decisions?
Will this be valid tomorrow?
How do we act on the analysis?
Where should we consider privacy?
Input: data collection/sharing
Analysis: how data is handled and shared, and which algorithms are used
Output: how analysis is presented, how models are published
What are the types of information disclosure?
Attribute disclosure
Identity Disclosure
Membership Disclosure
What is Attribute Disclosure?
Disclosure of some information about a known person
e.g. healthcare data: diagnosis, test results, etc.
Disclosure is through linkage with other data
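Disclosure through linkage can be sketched as a join between a released table and some other data about known people. Everything here is hypothetical: the records, the names, and the choice of zip, birth_year, and sex as the linking fields (often called quasi-identifiers, a term not used in the cards).

```python
# Released health records: names removed, other fields kept (made-up data).
health_records = [
    {"zip": "12345", "birth_year": 1985, "sex": "F", "diagnosis": "asthma"},
    {"zip": "12345", "birth_year": 1990, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "67890", "birth_year": 1985, "sex": "F", "diagnosis": "flu"},
]

# Other data about known people, e.g. a public list (made-up data).
public_records = [
    {"name": "Alice Example", "zip": "12345", "birth_year": 1985, "sex": "F"},
    {"name": "Bob Example", "zip": "12345", "birth_year": 1990, "sex": "M"},
]

LINK_FIELDS = ("zip", "birth_year", "sex")  # hypothetical linking fields


def link(released, auxiliary, keys=LINK_FIELDS):
    """Return {name: diagnosis} for every person matching exactly one released row."""
    index = {}
    for row in released:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    disclosed = {}
    for person in auxiliary:
        matches = index.get(tuple(person[k] for k in keys), [])
        if len(matches) == 1:  # a unique match reveals the attribute
            disclosed[person["name"]] = matches[0]["diagnosis"]
    return disclosed


disclosed = link(health_records, public_records)
print(disclosed)
```

No name appears in the released table, yet the diagnosis of each known person is recovered because the combination of linking fields is unique.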
What is identity disclosure?
Identification of a person
De-anonymization
Re-identification
Reconstruction of dataset
What is membership disclosure?
Membership in a dataset revealed
Identity and attribute disclosure
What is anonymization?
A common tool used to claim privacy, especially in data sharing.
Personally Identifiable Information (PII) is masked or hidden.
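A minimal sketch of masking, assuming a record is a dictionary and that name and id_number are the fields treated as identifying. The field names, the placeholder, and the record itself are hypothetical.

```python
PII_FIELDS = {"name", "id_number"}  # hypothetical choice of fields to mask


def mask_pii(record, pii_fields=PII_FIELDS, placeholder="***"):
    """Return a copy of the record with the chosen fields replaced by a placeholder."""
    return {
        key: (placeholder if key in pii_fields else value)
        for key, value in record.items()
    }


record = {"name": "Alice Example", "id_number": "A-001", "age": 38, "blood_type": "O+"}
masked = mask_pii(record)
print(masked)
```

Masking these fields does not by itself guarantee privacy: the remaining attributes may still be linked with other data.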
What is PII?
Personally Identifiable Information
Personal information includes any factual or subjective information, recorded or not, about an identifiable individual.
e.g. age, name, ID numbers, income, ethnic origin, or blood type
What is a dataset?
database: the form in which data is presented.
a table, “matrix”.
rows correspond to data points (depending on the data, a row is about an individual if the data is about people, an event for sensor data, or a transaction for bank data).
columns correspond to features or attributes
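The table view of a dataset can be sketched with plain lists. The column names and values are made up for illustration.

```python
# Columns are features/attributes; each row is one data point (here, one person).
columns = ["age", "income", "blood_type"]
rows = [
    [34, 52000, "A+"],
    [51, 61000, "O-"],
    [27, 43000, "B+"],
]


def column(name):
    """Return all values of one feature, one per row."""
    position = columns.index(name)
    return [row[position] for row in rows]


n_rows, n_cols = len(rows), len(columns)
ages = column("age")
print(n_rows, n_cols, ages)
```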