AI for Business Specialization (Wharton, University of Pennsylvania) Flashcards
(129 cards)
What are the criteria for AI to be considered a general-purpose technology (GPT)?
1) has pervasive use in a wide range of industries and sectors
2) stimulates innovation and economic growth
3) supports a significant number of research jobs across industries
What are the 4 Vs of Big Data?
VOLUME - Terabytes to petabytes of existing data to process
VARIETY - Structured, unstructured, text, video,…
VELOCITY - Streaming data, milliseconds to seconds to process
VERACITY - Uncertainty due to data inconsistency and incompleteness, ambiguities, latency, …
What is the difference between traditional analytics and big data analytics?
Traditional analytics is hypothesis driven, i.e. question > hypothesis > analyzed info > answer. It is structured and repeatable.
Big data analytics is data driven, i.e. Data > exploration > correlation > actionable insight. It is iterative and explorative.
What are the new skillsets required in Big Data?
1) Managing data > tool development, data expertise
2) Understanding data > data science, visualization
3) Acting on data > decision making, applying data to problem solving
Types of tools in Big Data…
1) DATA MANAGEMENT TOOLS: Data warehouse and Hadoop/Spark
2) DATA ANALYSIS TOOLS: Clustering, association rule mining, machine learning
How would you define a Data Warehouse?
It is a particular kind of DB management system, specialized in historical data from many sources, whose purpose is to enable analytics (e.g. reporting, visualization or BI).
Examples of Data Warehouse
Azure SQL Data Warehouse
Google BigQuery
Snowflake
Amazon Redshift
What does the term ETL stand for and what does it do?
ETL = Extract > Transform > Load
It is a function that takes data from different sources (CRM, ERP, billing, supply chain, etc.) and builds a Data Warehouse from which you can generate different analytics as an outcome (e.g. reporting, visualization, BI)
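The three ETL steps can be sketched end to end. Everything here (the in-memory source records, the field names, and the plain list standing in for a warehouse table) is hypothetical, purely to illustrate the flow:

```python
# Minimal ETL sketch (hypothetical in-memory "sources"; a real pipeline
# would read from CRM/ERP/billing systems and load a data warehouse).

def extract():
    # EXTRACT: pull raw records from two made-up source systems.
    crm = [{"customer": "Acme", "region": "emea"}]
    billing = [{"customer": "Acme", "amount": 120.0}]
    return crm, billing

def transform(crm, billing):
    # TRANSFORM: normalize fields and join the sources on customer.
    totals = {}
    for row in billing:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return [
        {"customer": c["customer"], "region": c["region"].upper(),
         "total_billed": totals.get(c["customer"], 0.0)}
        for c in crm
    ]

def load(rows, warehouse):
    # LOAD: append the cleaned rows to the (here, in-memory) warehouse table.
    warehouse.extend(rows)

warehouse = []
crm, billing = extract()
load(transform(crm, billing), warehouse)
print(warehouse)
```

The single cleaned, joined table is what downstream reporting or BI tools would then query.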
What is the difference between data for operations and data for analytics?
Data for operations requires real-time processing in order to take immediate action, while data for analytics does not. The other difference is that analytics considers historical information, while operations does not.
What are the main open source Big Data tools?
HADOOP and SPARK (an evolution of Hadoop). Inspired by Google's papers on distributed storage and processing, and developed as Apache open source projects, they store and process massive amounts of data in a distributed fashion on low-cost server architecture.
(SNOWFLAKE, by contrast, is a proprietary cloud data warehouse, not an open source tool.)
What is data mining?
It is a term encompassing tools for discovering patterns in large datasets. The main difference from traditional regression analysis is that it is data driven (predictive analytics), not hypothesis driven.
What are two of the most popular techniques for data mining?
1) Clustering - grouping data, e.g. customer segmentation
2) Association rule mining - finding common co-occurrences in data
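The counting behind association rule mining can be sketched in a few lines; the toy "baskets" below are made up for illustration:

```python
from collections import Counter
from itertools import combinations

# Count pairwise co-occurrences across toy transactions ("baskets") —
# the counting step that underlies association rule mining.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support of a pair = fraction of baskets containing both items.
support = {pair: n / len(baskets) for pair, n in pair_counts.items()}
print(support[("bread", "butter")])  # bread & butter co-occur in 2 of 3 baskets
```

Rules like "customers who buy bread also buy butter" are then ranked by support and related metrics.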
AI spectrum (types)…
WEAK AI - artificial narrow intelligence, good at one specific task
STRONG AI - artificial general intelligence, does things similarly to humans (as quickly and easily)
ARTIFICIAL SUPER INTELLIGENCE - does human things faster and better
In which ways can you build AI?
There are two approaches:
1) Expert systems approach: capturing and transferring knowledge using rules. Cannot beat humans, since it has the limitation that tacit knowledge is not transferred.
2) Machine Learning: a subset of AI, used for predictions, that has the ability to learn from data without being explicitly programmed with rules.
What are the 3 most common techniques for ML?
1) Supervised learning
2) Unsupervised learning
3) Reinforcement learning
What are the main characteristics of Supervised (Machine) Learning?
- learns from past data, which comes down to approximating the function f(x)=y with high fidelity and accuracy
- inputs (x) = features/covariates (labeling and annotations)
- outputs (y) = targets
- uses classification and regression methods
- requires high-quality training data sets
- ~90% of practical AI cases use ML, and of that, ~90% is supervised ML
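Approximating f(x)=y from labeled examples can be illustrated with a deliberately minimal classifier: a 1-nearest-neighbour rule on a made-up toy dataset (real supervised ML would use the methods covered in the later cards):

```python
# Supervised learning in miniature: approximate f(x) = y from labeled
# (x, y) pairs with a 1-nearest-neighbour classifier (stdlib only).
# The features and labels below are invented for illustration.

train = [  # (features x, target y) — the labeled training set
    ((1.0, 1.0), "small"),
    ((1.2, 0.8), "small"),
    ((8.0, 9.0), "large"),
    ((9.0, 8.5), "large"),
]

def predict(x):
    # Predict y for a new x by copying the label of the closest training point.
    def dist2(p):
        return sum((a - b) ** 2 for a, b in zip(x, p))
    nearest = min(train, key=lambda pair: dist2(pair[0]))
    return nearest[1]

print(predict((1.1, 0.9)))  # nearest neighbours are the "small" examples
```

The key supervised-learning ingredients are all present: labeled training data in, a learned mapping out.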
What are the main characteristics of Unsupervised (Machine) Learning?
- There is not a fixed set of outputs predefined
- the goal is to cluster and identify important features and patterns so the system can learn by itself
- requires large training data sets
What are the main characteristics of Reinforcement (Machine) Learning?
- lets algorithms learn by testing various actions and strategies to decide which one works best… they do not begin with large training datasets but learn by taking actions and observing the results
- bandit algorithms: trade off between EXPLORATION (gathering more info about the decision environment) and EXPLOITATION (making the best decision based on the info available)… a specific example is the multi-armed bandit problem, where a finite set of resources must be allocated among multiple choices
- applications e.g. in gaming or online personalization
- this type of ML is not widely used
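The exploration/exploitation trade-off can be sketched with a simple epsilon-greedy strategy on a made-up three-armed bandit (the payout probabilities, epsilon, and iteration count are all arbitrary illustration choices):

```python
import random

random.seed(0)  # fixed seed so the toy run is reproducible

# Epsilon-greedy multi-armed bandit sketch: three "arms" with hidden
# payout probabilities; the learner balances exploration (random arm)
# against exploitation (the best arm according to current estimates).
true_probs = [0.2, 0.5, 0.8]   # hidden from the learner
counts = [0, 0, 0]
values = [0.0, 0.0, 0.0]       # running estimate of each arm's payout

epsilon = 0.1
for _ in range(5000):
    if random.random() < epsilon:
        arm = random.randrange(3)        # EXPLORE: try a random arm
    else:
        arm = values.index(max(values))  # EXPLOIT: best arm so far
    reward = 1.0 if random.random() < true_probs[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

print(values.index(max(values)))  # should settle on arm 2, the best payer
```

Note that there is no training dataset: all the learner ever sees is the reward of each action it takes.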
What drives accuracy in Supervised ML?
1) Quantity of data (number of observations)
2) Quality of data (number of characteristics of the observations)
Others: relevance of the info, complexity of the model, feature engineering, etc.
What are some of the most common methods in ML to approximate f(x)=y?
- Logistic regression
- Decision trees and random forest
- Neural networks
- others: boosting, SVMs (support vector machines), neural networks more complex than the ones explained, etc.
Explain Logistic Regression in ML…
It is the most popular method for binary classification, where the outcome can take only 1 of 2 values. The logit function constrains predicted probabilities to between 0 and 1… it is equivalent to finding the 'best fit' line/plane that separates the data
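A minimal sketch of the idea, fitting a logistic model to a made-up 1-D dataset by gradient descent (the dataset, learning rate, and iteration count are arbitrary illustration choices):

```python
import math

# Logistic regression sketch (stdlib only): fit w and b on a toy 1-D
# binary dataset, squashing w*x + b through the sigmoid so the
# predicted probability always lies between 0 and 1.

data = [(0.5, 0), (1.0, 0), (1.5, 0), (3.0, 1), (3.5, 1), (4.0, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b, lr = 0.0, 0.0, 0.5
for _ in range(2000):
    gw = gb = 0.0
    for x, y in data:
        err = sigmoid(w * x + b) - y   # gradient of the log-loss
        gw += err * x
        gb += err
    w -= lr * gw / len(data)
    b -= lr * gb / len(data)

# Predicted probability of class 1 on each side of the learned boundary:
print(round(sigmoid(w * 0.5 + b), 2), round(sigmoid(w * 4.0 + b), 2))
```

The fitted boundary is the 1-D analogue of the 'best fit' separating line/plane mentioned above.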
Explain Decision Trees in ML…
It is an easy-to-interpret model built iteratively by looking for the features in your data that are most predictive. In essence, it is about choosing the variable/split that provides the most predictive power at each step.
Explain Random Forest in ML…
It is an 'ensemble' algorithm that harnesses the power of multiple decision trees. It is popular and relatively simple. It takes many random samples of your dataset, trains a decision tree on each one, and chooses the prediction with the most votes. Each individual tree is less accurate than a single decision tree built with the entire dataset… however, the combination is better!
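The sample-and-vote idea can be sketched with one-split "stumps" standing in for full decision trees (the dataset, the stump learner, and the ensemble size are all made up for illustration):

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the toy run is reproducible

# Random-forest-style ensemble sketch: each "tree" is just a one-split
# stump trained on a bootstrap sample; predictions are majority votes.

# Toy 1-D data: class 0 at x in [0.0, 0.9], class 1 at x in [1.2, 2.1].
data = [(x / 10, 0) for x in range(10)] + [(x / 10, 1) for x in range(12, 22)]

def train_stump(sample):
    # Pick the threshold that best separates the two classes in the sample.
    best_t, best_acc = 0.0, -1.0
    for t, _ in sample:
        acc = sum((x > t) == (y == 1) for x, y in sample) / len(sample)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Bootstrap: random samples with replacement, one stump per sample.
stumps = [train_stump(random.choices(data, k=len(data))) for _ in range(25)]

def forest_predict(x):
    votes = Counter(int(x > t) for t in stumps)
    return votes.most_common(1)[0][0]   # class with the most votes

print(forest_predict(0.3), forest_predict(1.8))
```

Each stump is a weak learner, but the majority vote across the bootstrap samples is a reliable classifier, which is the core random-forest insight.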
Explain Neural Networks in ML…
Loosely inspired by biological neurons: each neuron takes inputs from other neurons, applies a transformation, and passes the signal on. A network normally has several layers (a deep neural network), with an input layer, an output layer, and hidden layers. Neural networks are often the best algorithms for audio, images, video, etc. due to their ability to build very complex models. Recent advances in GPUs (Graphics Processing Units) and backpropagation algorithms have allowed building more layers… the main disadvantage is that they are hard to understand and interpret. Much work is being done to open up the black box and understand what they do across the different intermediate layers.
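A forward pass through a tiny network shows the "sum inputs, transform, pass on" idea; all the weights below are made-up illustration values, not trained ones:

```python
import math

# Tiny feed-forward pass (stdlib only): input layer -> one hidden layer
# of two neurons -> one output neuron. Each neuron sums its weighted
# inputs, adds a bias, and applies a sigmoid non-linearity.

def neuron(inputs, weights, bias):
    z = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))   # sigmoid activation

x = [0.5, -1.0]                          # input layer
h = [neuron(x, [1.0, 0.5], 0.0),         # hidden layer (2 neurons)
     neuron(x, [-0.5, 1.0], 0.1)]
y = neuron(h, [1.5, -1.5], -0.2)         # output layer (1 neuron)
print(round(y, 3))
```

Training (e.g. by backpropagation) would adjust the weights and biases; the forward pass itself is just this chain of weighted sums and non-linearities, repeated layer by layer.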