Topic 1: Introduction to Data Science & Alternative Data Flashcards

1
Q

What gave rise to the realm of data science?

A
  1. Investments in business infrastructure
  2. Volume and variety of data
  3. Powerful computers
2
Q

Why is data mining used for customer relationship management?

A

To manage attrition and maximize expected customer value

3
Q

Type I and Type II data-driven decision-making (DDD) problems

A

Type I: Decisions for which discoveries need to be made within the data

Type II: Decisions that repeat, especially at massive scale, so that even small increases in decision-making accuracy from data analysis are valuable

4
Q

Why view data and data science capability as a strategic asset?

A

Viewing these as assets allows us to think explicitly about the extent to which one should invest in them

5
Q

How do you transition a business problem into a data mining problem?

A

Convert the business problem into subtasks and match the subtasks to known tasks for which tools are available.

6
Q

Difference between regression and classification

A

Classification predicts whether something will happen; regression predicts how much something will happen.
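A minimal sketch of the distinction, using a made-up churn example. The customer records, the 12-month threshold rule, and the helper names are illustrative assumptions, not part of the source material. The classifier returns a class (yes/no); the regressor returns a number.

```python
# Hypothetical toy data: (months_as_customer, monthly_spend)
customers = [(2, 20.0), (3, 25.0), (24, 60.0), (36, 80.0)]


def classify_churn(months, threshold=12):
    """Classification: predict WHETHER the customer will churn (a class).

    The threshold rule is an assumption for illustration only.
    """
    return months < threshold


def fit_line(xs, ys):
    """Ordinary least squares with one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


months = [c[0] for c in customers]
spend = [c[1] for c in customers]
slope, intercept = fit_line(months, spend)


def predict_spend(m):
    """Regression: predict HOW MUCH the customer will spend (a number)."""
    return slope * m + intercept


print(classify_churn(5))    # a class label
print(predict_spend(12))    # a numeric estimate
```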

7
Q

Define Classification

A

Attempts to predict, for each individual in a population, which of a (small) set of classes that individual belongs to

8
Q

What does a regression attempt to do?

A

attempts to estimate or predict, for each individual, the numeric value of some variable for that individual

9
Q

Similarity matching

A

attempts to identify similar individuals based on data known about them
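One common way to do this is a nearest-neighbor lookup under a distance measure. The sketch below assumes Euclidean distance and a hypothetical `profiles` table; both are illustrative choices, not something the source prescribes.

```python
import math

# Hypothetical customer profiles: (age, annual_spend_in_thousands)
profiles = {
    "alice": (34, 12.0),
    "bob": (36, 11.5),
    "carol": (61, 3.0),
}


def euclidean(a, b):
    """Straight-line distance between two feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(name, data):
    """Return the other individual whose features are closest to `name`."""
    target = data[name]
    others = (n for n in data if n != name)
    return min(others, key=lambda n: euclidean(target, data[n]))


print(most_similar("alice", profiles))
```

In practice the features would usually be rescaled first, so that a variable with a large numeric range does not dominate the distance.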

10
Q

Clustering

A

attempts to group individuals in a population together by their similarity, without being driven by any specific purpose.
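As an illustration, here is a bare-bones k-means on one-dimensional data. k-means is only one of many clustering algorithms, and the spend values and starting centers are assumptions chosen for the example. Note that no target variable is supplied anywhere.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on 1-D data; `centers` are the starting guesses."""
    centers = list(centers)
    groups = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Hypothetical monthly spend values with two natural groups.
spend = [10, 12, 11, 95, 100, 105]
centers, groups = kmeans_1d(spend, centers=[0, 50])
print(centers)
print(groups)
```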

11
Q

Co-occurrence grouping

A

attempts to find associations between entities based on transactions involving them

“what items are often purchased together”
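A minimal sketch of the idea: count how often each pair of items appears in the same transaction. The `baskets` data is made up for illustration, and real market-basket analysis would typically also compute measures such as support and lift.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each basket is the set of items bought together.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
]

pair_counts = Counter()
for basket in baskets:
    # Sort so that each pair is always counted under the same key.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

top_pair, count = pair_counts.most_common(1)[0]
print(top_pair, count)
```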

12
Q

Profiling (behavior description)

A

attempts to characterize the typical behavior of an individual, group, or population.
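A simple sketch of profiling for anomaly detection: summarize a customer's typical spend, then flag amounts that fall far outside it. The spend history and the three-standard-deviation rule are assumptions made for illustration.

```python
import statistics

# Hypothetical daily card spend for one customer.
history = [42.0, 38.5, 45.0, 40.0, 39.5, 44.0]

# The "profile" here is just the typical level and spread of spend.
mean = statistics.mean(history)
stdev = statistics.stdev(history)


def is_unusual(amount, k=3.0):
    """Flag amounts more than k standard deviations from typical spend."""
    return abs(amount - mean) > k * stdev


print(is_unusual(500.0))
print(is_unusual(43.0))
```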

13
Q

Link prediction

A

attempts to predict connections between data items
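One simple heuristic is to score each unconnected pair by the number of neighbors the two share. The friendship graph below and the common-neighbors score are illustrative assumptions; many other link-prediction scores exist.

```python
from itertools import combinations

# Hypothetical friendship graph stored as adjacency sets.
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"bob", "cat"},
}


def predict_links(g):
    """Rank currently unconnected pairs by their number of shared neighbors."""
    scores = {}
    for u, v in combinations(sorted(g), 2):
        if v not in g[u]:
            scores[(u, v)] = len(g[u] & g[v])
    return sorted(scores.items(), key=lambda kv: -kv[1])


print(predict_links(graph))
```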

14
Q

Data reduction

A

attempts to take a large dataset and replace it with a smaller set of data that contains much of the important information in the larger set

15
Q

Causal modeling

A

helps us understand what events or actions actually influence others

16
Q

Difference between supervised and unsupervised methods

A

Supervised methods: a specific target is defined, and data on that target are available

Unsupervised methods: no specific purpose or target has been specified for grouping

17
Q

Two main subclasses of supervised data mining

A

Classification (categorical, often binary, target) and regression (numerical target)

18
Q

CRISP-DM process

A
  1. Business understanding
  2. Data Understanding
  3. Data Preparation
  4. Modeling
  5. Evaluation
  6. Deployment
19
Q

Leakage in data preparation

A

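Including a variable from historical data that gives information on the target variable but would not actually be available at the time the decision has to be made (in effect, using the future to predict the future)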
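A small sketch of guarding against a leak: keep only the features that would be known when the decision is made. The column names, including the leaky `cancellation_fee_paid`, are hypothetical.

```python
# Hypothetical churn records. `cancellation_fee_paid` is only recorded AFTER
# a customer has churned, so it leaks the target into the features.
records = [
    {"tenure_months": 3, "support_calls": 5, "cancellation_fee_paid": 1, "churned": 1},
    {"tenure_months": 30, "support_calls": 0, "cancellation_fee_paid": 0, "churned": 0},
]

# Assumed whitelist of features that exist at decision time.
AVAILABLE_AT_DECISION_TIME = {"tenure_months", "support_calls"}
TARGET = "churned"


def strip_leaky_features(rows, allowed, target):
    """Keep only whitelisted features plus the target column."""
    return [
        {k: v for k, v in row.items() if k in allowed or k == target}
        for row in rows
    ]


clean = strip_leaky_features(records, AVAILABLE_AT_DECISION_TIME, TARGET)
print(clean[0])
```

A model trained on the unfiltered records would look excellent in evaluation and then fail in deployment, because the leaky column does not exist yet when the prediction is needed.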

20
Q

Purpose of the evaluation stage (in CRISP-DM)

A
  1. Assess the data mining results rigorously
  2. Help ensure that the model satisfies the original business goals

21
Q

Difference between Statistics and the process of Data Mining

A

Data mining focuses on hypothesis generation (it may produce numerical estimates), while statistics focuses mainly on hypothesis testing (can we have confidence in these estimates?)

22
Q

Define a query

A

A specific request for a subset/statistics of data, formulated in a technical language and posed to a database system.
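For illustration, the sketch below poses two SQL queries to an in-memory SQLite database through Python's standard `sqlite3` module: one returns a subset of the data, the other a statistic. The `customers` table and its rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, region TEXT, spend REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("ann", "north", 120.0), ("bob", "south", 80.0), ("cat", "north", 200.0)],
)

# Query for a subset: customers in the north region.
north = conn.execute(
    "SELECT name FROM customers WHERE region = ? ORDER BY name", ("north",)
).fetchall()

# Query for a statistic: average spend per region.
avg_by_region = conn.execute(
    "SELECT region, AVG(spend) FROM customers GROUP BY region ORDER BY region"
).fetchall()

print(north)
print(avg_by_region)
conn.close()
```

Unlike data mining, a query finds nothing new by itself: the analyst must already know what to ask for.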

23
Q

Difference between Knowledge Discovery and Data Mining (“KDD”) and Machine Learning

A

KDD is more focused on real-world problems. KDD also tends to be more concerned with the entire process of data analytics.

24
Q

Discuss investment managers in terms of their places on the diffusion of innovations curve.

A

Innovators - mostly hedge funds
Early adopters - aggressive long-only and private equity (PE) managers
Early majority - tech-savvy large, complex investment management (IM) firms
Late majority - traditional large, complex IM firms
Laggards - reluctant firms

25
Q

Name the unique challenge of alternative data, as described in the paper Alternative Data for Investment Decisions.

A

Standard historical data may not exist

26
Q

Name the risk exposures of early adoption of alternative data, as described in the paper Alternative Data for Investment Decisions.

A
  1. Model risk
  2. Regulatory risk
  3. Data risk
  4. Talent risk
27
Q

What are the four types of data risk described in the paper Alternative Data for Investment Decisions?

A
  1. Data provenance risk - risk related to the origin of the data
  2. Accuracy or validity risk - bad trading signals
  3. Material nonpublic information (MNPI) risk
  4. Privacy risk - possibility of personally identifiable information (PII) in the data set
28
Q

Name the risk exposures of late adoption of alternative data, as described in the paper Alternative Data for Investment Decisions.

A
  1. Positioning risk
  2. Execution risk
  3. Consequence risk
29
Q

Categories among the Big Data Analytic groups that are most aligned to support alpha generation

A
  1. Content analytics - extract value from text
  2. Advanced and predictive analytics software tools
  3. Spatial information analytics (SIA) tools - geographic information software and tools
30
Q

Describe the term collective intelligence investing

A

The process of gathering insights from online communities and crowdsourcing.

31
Q

What are the four key platform types offering Collective Intelligence Investing (CII)?

A
  1. Open communities
  2. Digital expert contribution networks
  3. Digital expert communication networks
  4. Crowdsourcing platforms
32
Q

What are the risk exposures of Collective Intelligence Investing and their mitigants?

A
  1. Community engagement risk - adopt gamification
  2. Material nonpublic information (MNPI) risk - rigorous due diligence
  3. Model risk - sufficient testing and sturdiness checks
  4. Information security risk - better security
  5. Data integrity risk
33
Q

Steps to a potentially smooth takeoff for Collective Intelligence Investing

A
  1. Vendor review
  2. Thorough risk assessment
  3. Customized technology architecture
34
Q

Key considerations when you want Wall Street to take notice of your data (in case you want to sell it)

A
  - Data productization (know how the client uses the data)
  - Infrastructure and delivery (how you will deliver the data)
  - Distribution (you need an 'in' to get them to look at your data)
35
Q

What are some of the factors that determine the commercial value of a dataset?

A
  1. Data Edge
  2. Monetization Strategy
  3. Deep Market
  4. Uniqueness and Replicability
  5. Exclusive Access
  6. Table Stakes Potential
36
Q

What is the single strongest indicator of how much a client will pay for a dataset? And what is the expected return on a data investment?

A

The size of the client in terms of assets under management (AuM). The expected return on a data investment is 10-20x.