Chapters 1, 2, 3 Flashcards

1
Q

What is Data Science?

A

A set of fundamental principles that guide the extraction of knowledge from data.

Data science is not the same as data processing and engineering; they complement each other.

2
Q

What is Data Mining?

A

The extraction of knowledge from data, via technologies.

Data mining techniques provide some of the clearest illustrations of the principles of data science.

3
Q

What is data-driven decision making?

A

Practice of basing decisions on the analysis of data rather than purely on intuition.

Firms engage in DDD in varying degrees.

Firms that are data driven are more productive.

4
Q

What are the two types of Data-Driven Decision making problems?

A

1) Decisions for which “discoveries” need to be made within data
2) Decisions that repeat, especially at massive scale, so decision making can benefit from even small increases in decision-making accuracy based on data analysis.

5
Q

What is Big Data?

A

Datasets that are too large for traditional data processing systems and therefore require new processing technologies.

Big data techniques are most frequently used for data processing in support of data mining techniques.

6
Q

What is the fundamental principle of Data Science?

A

Data and the capability to extract useful knowledge from data should be regarded as key strategic assets.

Within a company, it is necessary to have a close connection between data scientists and business people.

7
Q

What is overfitting?

A

Looking so hard for something in the data that you eventually find it, but the pattern may be a chance occurrence that does not generalize beyond the given data.

8
Q

What is Classification?

A
  • Predicts, for each individual (item) in a population, which of a (small) set of classes that individual (item) belongs to.
  • The list of classes must be exhaustive and mutually exclusive.
  • A related task is scoring and class probability estimation.
    • A scoring task gives each individual a score representing the probability that that individual belongs to a class.
  • Example:
    • “Among all the customers of MegaTelCo, which are likely to respond to a given offer?”
    • In this example the two classes could be called will respond and will not respond.
9
Q

What is Regression?

A
  • Attempts to estimate or predict, for each individual, the numerical value of some variable for that individual.
  • Example:
    • “How much will a given customer use the service?”
    • The property (variable) to be predicted here is service usage, and a model could be generated by looking at other, similar individuals in the population and their historical usage.
  • Regression is related to classification, but the two are different. Informally, classification predicts *whether* something will happen, whereas regression predicts *how much* something will happen.
  • Regression has a numerical target.
10
Q

What is similarity matching?

A
  • Attempts to identify similar individuals based on data known about them.
  • Similarity matching can be used directly to find similar entities.
  • Example:
    • IBM is interested in finding companies similar to their best business customers, in order to focus their sales force on the best opportunities.
    • Making product recommendations based on people who are similar to you in terms of the products they have liked or purchased.
11
Q

What is Clustering?

A
  • Attempts to group individuals in a population by their similarity, but is not driven by any specific purpose.
  • Why is it useful? Useful in preliminary domain exploration to see which natural groups exist.
  • Example:
    • “Do our customers form natural groups or segments?”
12
Q

What is Co-occurrence grouping?

A
  • Attempts to find associations between entities based on transactions involving them.
  • Example:
    • What items are commonly purchased together in a supermarket?
    • Analyzing purchase records from a supermarket may uncover that ground meat is purchased together with hot sauce much more frequently than we might expect.
  • While clustering looks at similarity between objects based on the objects’ attributes, co-occurrence grouping considers similarity of objects based on their appearing together in transactions.
  • In short, co-occurrence grouping is more specific: clustering looks at general similarity, while co-occurrence grouping focuses on objects that appear together in transactions.
13
Q

What is profiling?

A
  • Attempts to characterise the typical behaviour of an individual, group or population.
  • Example:
    • “What is the typical cell phone usage of this customer segment?” (note: this refers to a customer segment, which makes it different from the regression question)
    • Behavior may not have a simple description; profiling cell phone usage might require a complex description of night and weekend airtime averages, international usage, roaming charges, text minutes, and so on.
  • Often used to establish behavioral norms (baseline) for anomaly detection applications such as fraud detection and monitoring intrusions
14
Q

What is link prediction?

A
  • Attempts to predict connections between data items, usually by suggesting that a link should exist, and possibly estimating the strength of the link.
  • Example:
    • Often used in social networking systems
    • “Since you and Karen share 10 friends, maybe you’d like to be Karen’s friend?”
  • Link prediction can also estimate the strength of a link. For example, for recommending movies to customers one can think of a graph between customers and the movies they’ve watched or rated. Within the graph, we search for links that do not exist between customers and movies, but that we predict should exist and should be strong. These links form the basis for recommendations.
15
Q

What is Data Reduction?

A
  • Attempts to take a large set of data and replace it with a smaller set of data that contains much of the important information of the large set.
  • Example
    • For example, a massive dataset on consumer movie-viewing preferences may be reduced to a much smaller dataset revealing the consumer taste preferences that are latent in the viewing data (for example, viewer genre preferences).
  • The smaller dataset may make the important information easier to see, but reduction involves a loss of information.
16
Q

What is Causal Modelling?

A
  • Attempts to help us understand what events or actions actually influence others.
  • Can be done using experimental and observational methods (“counterfactual analysis”):
    • They attempt to understand what would be the difference between the situations—which cannot both happen —where the “treatment” event (e.g., showing an advertisement to a particular individual) were to happen, and were not to happen.
17
Q

What are unsupervised methods?

A
  • The data mining task has no specific target or purpose.
  • Clustering, co-occurrence grouping and profiling are unsupervised methods.
  • Risk is that it forms groups that are not meaningful.
18
Q

What are supervised methods?

A
  • The data mining task has a specific purpose or target. Hence it is necessary to have data on the target.
  • Supervised tasks require different techniques than unsupervised ones, and the results are often more useful.
  • Examples: Classification, regression, and causal modeling are solved with supervised methods.
19
Q

What type of overarching method are similarity matching, link prediction, and data reduction?

A

They could be either supervised or unsupervised.

20
Q

What is the CRISP-DM framework?

A

Cross Industry Standard Process for Data Mining: one codification of the data mining process.

The process diagram makes explicit the fact that iteration is the rule rather than the exception. Going through the process once without having solved the problem is, generally speaking, not a failure.

21
Q

How do the different parts of the CRISP-DM framework relate to each other?

A

Business Understanding and Data Understanding link to each other.

Data Understanding → Data Preparation.

Data Preparation and Modelling link to each other.

Modelling → Evaluation.

Evaluation → Deployment; Evaluation also links back to Business Understanding.

22
Q

What are the different parts of the CRISP-DM framework and how are they defined?

A
  • Business understanding
    • Design team should think about the use scenario.
    • What exactly do we want to do?
    • Start off with a simplified use scenario.
  • Data Understanding
    • Available data rarely matches the problem at hand.
      • Understand strengths & limitations of the data
      • Cost of data varies: cost/ benefits analysis of acquiring additional data
      • Cleaning data for subsequent analysis
  • Data Preparation
    • Convert data into a usable format.
    • Data are manipulated and converted into forms that yield better results.
    • Leakage must be considered → a situation in which a variable collected in historical data gives information on the target variable, but this information is not available when the decision needs to be made.
  • Modelling
    • Primary place where data mining techniques are applied to the data.
    • Understand techniques and algorithms that can be used
  • Evaluation
    • Assess the data mining results & gain confidence that they are valid and reliable before moving on
      • Ensures that the model satisfies the original business goal.
      • Goal: prove that detected patterns are truly regular
      • Assessment is qualitative & quantitative, using a comprehensive evaluation framework.
      • Model needs to be comprehensible for other stakeholders (non-data scientists)
      • Evaluation may be extended into the development environment.
  • Deployment
    • Putting the results of data mining into real use to realize some ROI.
      • Use case: implementing a predictive model in a business process
      • Increasingly the data mining techniques themselves are deployed
        • The world changes faster than the data science team can adapt the model
        • A business has too many modeling tasks to manually curate each model
        • Systems automatically build models (for the associated process)
      • Typically requires that the model is recoded for the production environment, e.g. to accommodate greater speed or compatibility
      • After mining the data (successfully or not) the process often returns back to the initial business problem.
23
Q

What is the Business Understanding step in the CRISP-DM model?

A
  • Business understanding
    • Design team should think about the use scenario.
    • What exactly do we want to do?
    • Start off with a simplified use scenario.
24
Q

What is the Data Understanding step?

A
  • Data Understanding
    • Available data rarely matches the problem at hand.
      • Understand strengths & limitations of the data
      • Cost of data varies: perform a cost/benefit analysis of acquiring additional data
      • Cleaning data for subsequent analysis
25
Q

What is the Data Preparation step?

A
  • Data Preparation
    • Convert data into a usable format.
    • Data are manipulated and converted into forms that yield better results.
    • Leakage must be considered → a situation in which a variable collected in historical data gives information on the target variable, but this information is not available when the decision needs to be made. (Discrepancy)
26
Q

What is the Modelling step?

A
  • Modelling
    • Primary place where data mining techniques are applied to the data.
    • Understand techniques and algorithms that can be used
27
Q

What is the Evaluation step?

A
  • Evaluation
    • Assess the data mining results & gain confidence that they are valid and reliable before moving on
      • Ensures that the model satisfies the original business goal.
      • Goal: prove that detected patterns are truly regular
      • Assessment is qualitative & quantitative, using a comprehensive evaluation framework.
      • Model needs to be comprehensible for other stakeholders (non-data scientists)
      • Evaluation may be extended into the development environment.
28
Q

What is the Deployment step?

A
  • Deployment
    • Putting the results of data mining into real use to realize some ROI.
      • Use case: implementing a predictive model in a business process
      • Increasingly the data mining techniques themselves are deployed
        • More fundamental than creating a model.
        • The world changes faster than the data science team can adapt the model
        • A business has too many modeling tasks to manually curate each model
        • Systems automatically build models (for the associated process)
      • Typically requires that the model is recoded for the production environment, e.g. to accommodate greater speed or compatibility
      • After mining the data (successfully or not) the process often returns back to the initial business problem.
29
Q

What are the other analytic techniques that you should be aware of?

A
  • “Statistics”
    • In the context of data science, often refers to summary statistics as the basic building blocks of much data science theory and practice
      • Also denotes the field of study; “Statistics” provides data science with knowledge that underlies the analytics (e.g. hypothesis testing)
  • Query
    • A specific request for a subset of data or for statistics about data, formulated in a technical language and posed to a database system.
      • Does not discover any patterns or models, in contrast to data mining.
      • Data mining can be used to come up with a query in the first place.
      • Query tools have the ability to execute sophisticated logic, including computing summary statistics over subpopulations, sorting, joining together multiple tables with related data, etc.
  • Data warehousing
    • Collect and coalesce data from across an enterprise, often from multiple transaction-processing systems, each with its own database
      • Facilitating technology of data mining
  • Regression analysis
    • Data mining focuses more on predictive modeling than on explanatory modeling
      • Two forms of modeling overlap, but not all lessons learned from explanatory modeling apply to predictive modeling
  • Machine learning methods
    • Collection of methods for extracting (predictive) models from data.
      • Data mining (and KDD → knowledge discovery and data mining) started as an offshoot of Machine learning
      • Concerned with many types of performance improvement and the issues of agency and cognition
      • KDD tends to be more concerned with the entire process of data analytics: data preparation, model learning, evaluation, etc.
      • Machine learning overarches data mining.
30
Q

What is tree induction?

A

A modelling technique that incorporates the idea of supervised segmentation in an elegant manner, repeatedly selecting informative attributes.

31
Q

What is a predictive model?

A
  • A formula for estimating the unknown value of interest: the target
    • Prediction → an estimate of an unknown value
    • Judged on its predictive performance
32
Q

What is descriptive modelling?

A

A model that gains insight into the underlying phenomenon or process.

33
Q

What is an instance?

A

A fact or a data point that is described by a set of attributes.

Also called a feature vector or row.

34
Q

What is the goal of supervised segmentation?

What is the most common splitting criterion?

A

Goal of segmentation is to create groups that are as homogeneous as possible within the group with respect to the target variable.

If everyone has the same target value, the group is pure.

Most common splitting criterion: information gain, which is based on entropy.

35
Q

What is entropy and its equation?

A

A measure of disorder that can be applied to a set.

entropy = -p1 log2(p1) - p2 log2(p2) - ...

Disorder corresponds to how mixed (impure) the segment is with respect to the properties of interest.

Each pi is the probability of property i within the set, ranging from pi = 1 when all members of the set have property i, to pi = 0 when no members of the set have property i.

Entropy is at its maximum (1 for a two-class set) when the classes are balanced (the same number of each class in the group), and 0 when the set is pure.
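The definition above can be sketched directly in code; a minimal illustration (the function name and example proportions are my own):

```python
import math

def entropy(proportions):
    """Entropy of a set, given the probability p_i of each property/class.
    Terms with p = 0 are skipped, since 0 * log2(0) is taken as 0."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([1.0, 0.0]))  # 0.0: a pure set has no disorder
print(entropy([0.5, 0.5]))  # 1.0: a balanced two-class set is maximally mixed
print(entropy([0.9, 0.1]))  # about 0.47: mostly pure, slightly mixed
```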

36
Q

How does entropy vary in a two class set when a set moves from having all members the same to a set where properties are mixed? What does the figure look like if you plot this out.

A

Entropy is 0 when all members of the set are the same class (pure), rises as the classes become more mixed, reaches its maximum of 1 at a 50/50 split, and falls back to 0 as the set becomes pure in the other class. Plotted against the proportion of one class, the figure is a symmetric arch peaking at p = 0.5.
37
Q

What is information gain?

What does it value tell us?

A

Measure of how much an attribute improves (decreases) entropy over the whole segmentation it creates (e.g. one large group into 3 subgroups based on variable x).

Measures the change in entropy due to any amount of new information being added.

The higher the information gain, the lower the resulting entropy of the newly created segments. If you find a variable with a high IG, it reduces entropy substantially, meaning it is very characteristic of the target variable.
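As a sketch, information gain is the parent's entropy minus the size-weighted average entropy of the child segments (the function names and toy split below are my own):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child segments."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5            # maximally mixed: entropy 1.0
children = [["yes"] * 4 + ["no"],            # two purer subgroups
            ["yes"] + ["no"] * 4]
print(information_gain(parent, children))    # about 0.28
```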

38
Q

How do you measure impurity?

A
  • Variance: measure of impurity for numerical values (when regression is used)
  • Pure set / variance of 0: set has all the same values for the target variable
  • Impure / high variance: numeric target values are very different
  • General idea: find a variable with the highest correlation with the target variable
  • Calculating the single attribute with the highest information gain
    • Calculate the information gain achieved by splitting on each attribute individually
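For a numeric target, the variance-as-impurity idea above can be sketched as follows (the function name and example values are my own):

```python
def variance(values):
    """Variance of a numeric target as an impurity measure: 0 means the set is pure."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

print(variance([5.0, 5.0, 5.0]))  # 0.0: all target values identical (pure set)
print(variance([1.0, 5.0, 9.0]))  # about 10.67: very different target values (impure)
```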
39
Q

What is a decision tree?

A
  • A type of tree-structured model
    • Consists of interior (“decision nodes”) and terminal nodes (“leaf”)
    • Each branch from the node represents a distinct value for that attribute
    • Each leaf corresponds to a segment
    • No two parents share descendants and there are no cycles
    • First feature to use is the one with the highest information gain, the order of the remaining features depends on the set of instances against which it is evaluated
40
Q

How can numerical values be applied to a classification tree?

A

Numeric variables can be “discretised” by choosing a split point (or many split points) and then treating the result as a categorical attribute.
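One way to pick such a split point, sketched with my own names and toy data: try a threshold between each pair of adjacent sorted values and keep the split whose two segments have the lowest weighted entropy.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Return the threshold (midpoint of adjacent sorted values) whose two
    resulting segments have the lowest size-weighted entropy."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold, best_weighted = None, float("inf")
    for i in range(1, n):
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if weighted < best_weighted:
            best_threshold, best_weighted = threshold, weighted
    return best_threshold

ages = [22, 25, 31, 47, 52, 60]
responded = ["yes", "yes", "yes", "no", "no", "no"]
print(best_split_point(ages, responded))  # 39.0: treat age as 'under 39' vs 'over 39'
```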

41
Q

When regression is used, what is a natural measure of impurity?

A

Variance. Information gain (based on entropy) is not the right measure for a numerical target.

42
Q

How do you find the attribute with the most information gain?

A

Calculate the information gain achieved by splitting on each attribute individually.

43
Q

Why is a decision tree a supervised segmentation?

A

The tree is a supervised segmentation, because each leaf contains a value for the target variable.

44
Q

What is the instance space?

And what are decision lines?

A
  • The space described by the data features.
  • Often visualized using a scatterplot on some pair of features.
  • The lines separating the regions in an instance space.
  • Also called decision surfaces or decision boundaries.
  • In the scatterplot visualization, the instance space shown is two-dimensional (one dimension per feature plotted).
45
Q

How does a decision tree node and leaf relate to the instance space?

A
  • Each internal (decision) node corresponds to a split of the instance space
  • Each leaf node corresponds to an unsplit region of the space (a segment of the population)
46
Q

What can you get from tracing down a single path from the root node to a leaf?

A

From the information you collect you can derive a rule.

47
Q

What is the Frequency-based estimate equation?

A

p = n / (n + m)

where n is the number of positive instances and m the number of negative instances in the leaf.

48
Q

What is the Laplace correction of the binary class probability estimation?

A
  • A “smoothed” version of the frequency-based estimate: p(c) = (n + 1) / (n + m + 2)
  • Where n is the number of examples in the leaf belonging to class c, and m is the number of examples not belonging to class c
  • As the number of instances increases, the Laplace correction converges to the frequency-based estimate.
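Both estimates side by side, using the standard binary Laplace correction (n + 1) / (n + m + 2); the function names are my own:

```python
def frequency_estimate(n, m):
    """Frequency-based estimate: n positives out of n + m instances in the leaf."""
    return n / (n + m)

def laplace_estimate(n, m):
    """Laplace-corrected ('smoothed') estimate for the binary case."""
    return (n + 1) / (n + m + 2)

# With few instances the correction pulls the estimate away from the extremes...
print(frequency_estimate(2, 0))    # 1.0
print(laplace_estimate(2, 0))      # 0.75
# ...and with many instances the two converge.
print(frequency_estimate(200, 0))  # 1.0
print(laplace_estimate(200, 0))    # about 0.995
```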
49
Q

What can a decision tree be viewed as?

A

A bunch of nested if-else statements.

Remember, though, that there are many potential splitting decisions. The model will choose the split that maximises information gain.
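For instance, a tiny two-level tree for the MegaTelCo-style example might read as nested if-else statements like these (the attributes, thresholds, and class labels are invented for illustration):

```python
def predict_response(age, income):
    """Hand-written stand-in for a small decision tree: each nested if/else is an
    internal (decision) node, each return is a leaf."""
    if age < 30:                # root split
        if income < 50_000:     # second-level split
            return "will respond"
        return "will not respond"
    return "will not respond"

print(predict_response(25, 40_000))  # will respond
print(predict_response(45, 40_000))  # will not respond
```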

50
Q

If you have a set of attributes, how do you determine which attribute has the highest information gain?

A

You calculate the information gain by splitting on each attribute.

51
Q

What does the information gain actually tell you?

A

How much an attribute and its split reduce the entropy. In other words, the closer the information gain is to 1, the better.

52
Q

What is the process of multivariate supervised segmentation through classification tree induction?

A

The procedure of classification tree induction is a recursive process of divide and conquer: the goal at each step is to select an attribute that partitions the current group into subgroups that are as pure as possible with respect to the target variable. This partitioning is performed recursively, splitting further and further; the attributes to split upon are chosen by testing all of them and selecting whichever yields the purest subgroups. We continue until the leaf nodes are pure, we run out of variables to split on, or we decide to stop early before either condition is met.

  1. Test for the individual information gain of each variable
  2. Select the variable with the highest IG as the root
  3. At each subsequent node, evaluate which remaining variable gives the highest IG
    1. Variables are not applied in a fixed order based on their individual IG
    2. Subsequent IG depends on the preceding split (e.g. at the root)
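The recursive divide-and-conquer procedure above can be sketched as follows. This is a bare-bones illustration with my own names and toy data, stopping only on purity or attribute exhaustion, not a production algorithm:

```python
import math
from collections import Counter

def entropy(rows, target):
    n = len(rows)
    counts = Counter(r[target] for r in rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def info_gain(rows, attr, target):
    """Entropy of the group minus the weighted entropy after splitting on attr."""
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r)
    n = len(rows)
    child = sum(len(g) / n * entropy(g, target) for g in groups.values())
    return entropy(rows, target) - child

def induce_tree(rows, attributes, target):
    labels = {r[target] for r in rows}
    if len(labels) == 1:                       # pure leaf: done
        return labels.pop()
    if not attributes:                         # out of attributes: majority class
        return Counter(r[target] for r in rows).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    groups = {}
    for r in rows:
        groups.setdefault(r[best], []).append(r)
    return {best: {value: induce_tree(subset, remaining, target)
                   for value, subset in groups.items()}}

# Toy data: whether a customer responds, by contract type and usage level.
data = [
    {"contract": "monthly", "usage": "high", "respond": "yes"},
    {"contract": "monthly", "usage": "low",  "respond": "yes"},
    {"contract": "yearly",  "usage": "high", "respond": "no"},
    {"contract": "yearly",  "usage": "low",  "respond": "no"},
]
print(induce_tree(data, ["contract", "usage"], "respond"))
# {'contract': {'monthly': 'yes', 'yearly': 'no'}}
```

Here "contract" perfectly separates the classes (IG = 1), so it is chosen as the root and both resulting subgroups are already pure.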
53
Q

Classification Decision Trees: Can you split on different variables on the same level?

A

Yes. If you split a tree on an attribute you can split the new nodes on different attributes.

54
Q

What’s an important thing to remember when doing information gain questions?

A

The proportions need to add up: the class counts in the subgroups must sum to those of the parent group.

55
Q

What happens to the laplace correction as the number of instances increases?

A

As the number of instances increases, the Laplace correction converges to the frequency-based estimate.

56
Q

What do you need to remember about leaf nodes?

A

You can have multiple leaf nodes leading to the same class.