AI for Business Specialization (Wharton, University of Pennsylvania) Flashcards
(129 cards)
What are the criteria for AI to be considered a general-purpose technology (GPT)?
1) has pervasive use in a wide range of industries and sectors
2) stimulates innovation and economic growth
3) supports a significant number of research jobs across industries
What are the 4 Vs of Big Data?
VOLUME - Terabytes to petabytes of existing data to process
VARIETY - Structured, unstructured, text, video,…
VELOCITY - Streaming data, milliseconds to seconds to process
VERACITY - Uncertainty due to data inconsistency and incompleteness, ambiguities, latency, …
What is the difference between traditional analytics and big data analytics?
Traditional analytics is hypothesis driven, i.e. question > hypothesis > analyzed info > answer. It is structured and repeatable.
Big data analytics is data driven, i.e. Data > exploration > correlation > actionable insight. It is iterative and explorative.
What are the new skillsets required in Big Data?
1) Managing data > tool development, data expertise
2) Understanding data > data science, visualization
3) Acting on data > decision making, applying data to problem solving
Types of tools in Big Data…
1) DATA MANAGEMENT TOOLS: Data warehouse and Hadoop/Spark
2) DATA ANALYSIS TOOLS: Clustering, association rule mining, machine learning
How would you define a Data Warehouse?
It is a particular kind of DB management system, specialized in historical data from many sources, whose purpose is to enable analytics (e.g. reporting, visualization or BI).
Examples of Data Warehouse
Azure SQL Data Warehouse
Google BigQuery
Snowflake
Amazon Redshift
What does the term ETL stand for and what does it do?
ETL = Extract > Transform > Load
It is a function that takes data from different sources (CRM, ERP, billing, supply chain, etc.) and builds a Data Warehouse from which you can generate different analytics as an outcome (e.g. reporting, visualization, BI)
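The three ETL steps can be sketched end to end. Everything here (the in-memory source records, the field names, and the plain list standing in for a warehouse table) is hypothetical, purely to illustrate the flow:

```python
# Minimal ETL sketch (hypothetical in-memory "sources"; a real pipeline
# would read from CRM/ERP/billing systems and load a data warehouse).

def extract():
    # EXTRACT: pull raw records from two made-up source systems.
    crm = [{"customer": "Acme", "region": "emea"}]
    billing = [{"customer": "Acme", "amount": 120.0}]
    return crm, billing

def transform(crm, billing):
    # TRANSFORM: normalize fields and join the sources on customer.
    totals = {}
    for row in billing:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return [
        {"customer": c["customer"], "region": c["region"].upper(),
         "total_billed": totals.get(c["customer"], 0.0)}
        for c in crm
    ]

def load(rows, warehouse):
    # LOAD: append the cleaned rows to the (here, in-memory) warehouse table.
    warehouse.extend(rows)

warehouse = []
crm, billing = extract()
load(transform(crm, billing), warehouse)
print(warehouse)
```

The single cleaned, joined table is what downstream reporting or BI tools would then query.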
What is the difference between data for operations and data for analytics?
Data for operations requires real-time processing in order to take immediate action, while data for analytics does not. The other difference is that analytics considers historical information, while operations does not.
What are the main open source Big Data tools?
HADOOP and SPARK (an evolution of Hadoop). Inspired by Google's papers on distributed storage and processing, and developed as Apache open source projects, they store and process massive amounts of data in a distributed fashion on low-cost server architecture.
(SNOWFLAKE, by contrast, is a proprietary cloud data warehouse, not an open source tool.)
What is data mining?
It is a term encompassing tools for discovering patterns in large datasets. The main difference from traditional regression analysis is that it is data driven (predictive analytics), not hypothesis driven.
What are two of the most popular techniques for data mining?
1) Clustering - grouping data, e.g. customer segmentation
2) Association rule mining - finding common co-occurrences in data
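The counting behind association rule mining can be sketched in a few lines; the toy "baskets" below are made up for illustration:

```python
from collections import Counter
from itertools import combinations

# Count pairwise co-occurrences across toy transactions ("baskets") —
# the counting step that underlies association rule mining.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support of a pair = fraction of baskets containing both items.
support = {pair: n / len(baskets) for pair, n in pair_counts.items()}
print(support[("bread", "butter")])  # bread & butter co-occur in 2 of 3 baskets
```

Rules like "customers who buy bread also buy butter" are then ranked by support and related metrics.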
AI spectrum (types)…
WEAK AI - artificial narrow intelligence, good at one specific task
STRONG AI - artificial general intelligence, does things similarly to humans (as quickly and easily)
ARTIFICIAL SUPER INTELLIGENCE - does human things faster and better
In which ways can you build AI?
There are two approaches:
1) Expert systems approach: capturing and transferring knowledge using rules. Cannot beat humans, since it has the limitation that tacit knowledge is not transferred.
2) Machine Learning: a subset of AI, used for predictions, that has the ability to learn from data without being explicitly programmed with rules.
What are the 3 most common techniques for ML?
1) Supervised learning
2) Unsupervised learning
3) Reinforcement learning
What are the main characteristics of Supervised (Machine) Learning?
- learns from past data, which comes down to approximating the function f(x)=y with high fidelity and accuracy
- inputs (x) = features/covariates (labeling and annotations)
- outputs (y) = targets
- uses classification and regression methods
- requires high-quality training data sets
- ~90% of practical AI cases use ML, and of that, ~90% is supervised ML
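Approximating f(x)=y from labeled examples can be illustrated with a deliberately minimal classifier: a 1-nearest-neighbour rule on a made-up toy dataset (real supervised ML would use the methods covered in the later cards):

```python
# Supervised learning in miniature: approximate f(x) = y from labeled
# (x, y) pairs with a 1-nearest-neighbour classifier (stdlib only).
# The features and labels below are invented for illustration.

train = [  # (features x, target y) — the labeled training set
    ((1.0, 1.0), "small"),
    ((1.2, 0.8), "small"),
    ((8.0, 9.0), "large"),
    ((9.0, 8.5), "large"),
]

def predict(x):
    # Predict y for a new x by copying the label of the closest training point.
    def dist2(p):
        return sum((a - b) ** 2 for a, b in zip(x, p))
    nearest = min(train, key=lambda pair: dist2(pair[0]))
    return nearest[1]

print(predict((1.1, 0.9)))  # nearest neighbours are the "small" examples
```

The key supervised-learning ingredients are all present: labeled training data in, a learned mapping out.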
What are the main characteristics of Unsupervised (Machine) Learning?
- There is not a fixed set of outputs predefined
- the goal is to cluster and identify important features and patterns so the system can learn by itself
- requires large training data sets
What are the main characteristics of Reinforcement (Machine) Learning?
- lets algorithms learn by testing various actions and strategies to decide which one works best… they do not begin with large training datasets but learn by taking actions and observing the results
- bandit algorithms: trade off between EXPLORATION (gathering more info about the decision environment) and EXPLOITATION (making the best decision based on the info available)… a specific example is the multi-armed bandit problem, where a finite set of resources must be allocated among multiple choices
- applications e.g. in gaming or online personalization
- this type of ML is not widely used
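The exploration/exploitation trade-off can be sketched with a simple epsilon-greedy strategy on a made-up three-armed bandit (the payout probabilities, epsilon, and iteration count are all arbitrary illustration choices):

```python
import random

random.seed(0)  # fixed seed so the toy run is reproducible

# Epsilon-greedy multi-armed bandit sketch: three "arms" with hidden
# payout probabilities; the learner balances exploration (random arm)
# against exploitation (the best arm according to current estimates).
true_probs = [0.2, 0.5, 0.8]   # hidden from the learner
counts = [0, 0, 0]
values = [0.0, 0.0, 0.0]       # running estimate of each arm's payout

epsilon = 0.1
for _ in range(5000):
    if random.random() < epsilon:
        arm = random.randrange(3)        # EXPLORE: try a random arm
    else:
        arm = values.index(max(values))  # EXPLOIT: best arm so far
    reward = 1.0 if random.random() < true_probs[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

print(values.index(max(values)))  # should settle on arm 2, the best payer
```

Note that there is no training dataset: all the learner ever sees is the reward of each action it takes.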
What drives accuracy in Supervised ML?
1) Quantity of data (number of observations)
2) Quality of data (number of characteristics of the observations)
Others: relevance of the info, complexity of the model, feature engineering, etc.
What are some of the most common methods in ML to approximate f(x)=y?
- Logistic regression
- Decision trees and random forest
- Neural networks
- others: boosting, SVMs (support vector machines), neural networks more complex than the ones explained, etc.
Explain Logistic Regression in ML…
It is the most popular method for binary classification, where the outcome can take only 1 of 2 values. The logit function constrains predicted probabilities to between 0 and 1… it is equivalent to finding the 'best fit' line/plane that separates the data
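A minimal sketch of the idea, fitting a logistic model to a made-up 1-D dataset by gradient descent (the dataset, learning rate, and iteration count are arbitrary illustration choices):

```python
import math

# Logistic regression sketch (stdlib only): fit w and b on a toy 1-D
# binary dataset, squashing w*x + b through the sigmoid so the
# predicted probability always lies between 0 and 1.

data = [(0.5, 0), (1.0, 0), (1.5, 0), (3.0, 1), (3.5, 1), (4.0, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b, lr = 0.0, 0.0, 0.5
for _ in range(2000):
    gw = gb = 0.0
    for x, y in data:
        err = sigmoid(w * x + b) - y   # gradient of the log-loss
        gw += err * x
        gb += err
    w -= lr * gw / len(data)
    b -= lr * gb / len(data)

# Predicted probability of class 1 on each side of the learned boundary:
print(round(sigmoid(w * 0.5 + b), 2), round(sigmoid(w * 4.0 + b), 2))
```

The fitted boundary is the 1-D analogue of the 'best fit' separating line/plane mentioned above.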
Explain Decision Trees in ML…
It is an easy-to-interpret model built iteratively by looking for the features in your data that are most predictive. In essence, it is about choosing the variable/split that provides the most predictive power at each step.
Explain Random Forest in ML…
It is an 'ensemble' algorithm that harnesses the power of multiple decision trees. It is popular and relatively simple. It takes many random samples of your dataset, trains a decision tree on each one, and chooses the prediction with the most votes. Each individual tree is less accurate than a single decision tree built with the entire dataset… however, the combination is better!
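The sample-and-vote idea can be sketched with one-split "stumps" standing in for full decision trees (the dataset, the stump learner, and the ensemble size are all made up for illustration):

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the toy run is reproducible

# Random-forest-style ensemble sketch: each "tree" is just a one-split
# stump trained on a bootstrap sample; predictions are majority votes.

# Toy 1-D data: class 0 at x in [0.0, 0.9], class 1 at x in [1.2, 2.1].
data = [(x / 10, 0) for x in range(10)] + [(x / 10, 1) for x in range(12, 22)]

def train_stump(sample):
    # Pick the threshold that best separates the two classes in the sample.
    best_t, best_acc = 0.0, -1.0
    for t, _ in sample:
        acc = sum((x > t) == (y == 1) for x, y in sample) / len(sample)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Bootstrap: random samples with replacement, one stump per sample.
stumps = [train_stump(random.choices(data, k=len(data))) for _ in range(25)]

def forest_predict(x):
    votes = Counter(int(x > t) for t in stumps)
    return votes.most_common(1)[0][0]   # class with the most votes

print(forest_predict(0.3), forest_predict(1.8))
```

Each stump is a weak learner, but the majority vote across the bootstrap samples is a reliable classifier, which is the core random-forest insight.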
Explain Neural Networks in ML…
Loosely inspired by biological neurons: each neuron takes inputs from other neurons, applies a transformation, and passes the signal on. A network normally has several layers (a deep neural network), with an input layer, an output layer, and hidden layers. Neural networks are often the best algorithms for audio, images, video, etc. due to their ability to build very complex models. Recent advances in GPUs (Graphics Processing Units) and backpropagation algorithms have allowed building more layers… the main disadvantage is that they are hard to understand and interpret. Much work is being done to open up the black box and understand what they do across the different intermediate layers.
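A forward pass through a tiny network shows the "sum inputs, transform, pass on" idea; all the weights below are made-up illustration values, not trained ones:

```python
import math

# Tiny feed-forward pass (stdlib only): input layer -> one hidden layer
# of two neurons -> one output neuron. Each neuron sums its weighted
# inputs, adds a bias, and applies a sigmoid non-linearity.

def neuron(inputs, weights, bias):
    z = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))   # sigmoid activation

x = [0.5, -1.0]                          # input layer
h = [neuron(x, [1.0, 0.5], 0.0),         # hidden layer (2 neurons)
     neuron(x, [-0.5, 1.0], 0.1)]
y = neuron(h, [1.5, -1.5], -0.2)         # output layer (1 neuron)
print(round(y, 3))
```

Training (e.g. by backpropagation) would adjust the weights and biases; the forward pass itself is just this chain of weighted sums and non-linearities, repeated layer by layer.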