Data Analytics Flashcards

1
Q

What is data analytics?

A

Data analytics is the process of inspecting, cleansing, transforming and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.

In short: extract actionable but non-obvious information from data.

2
Q

What is statistics?

A

Statistics is about hypothesis testing. You assume a relation, propose a model, collect data to test the model, perform statistical analysis and evaluate the results.

As such, we are backing up an assumed relation with data.

3
Q

What is machine learning?

A

Machine learning is the science of teaching machines how to learn from data, without being explicitly programmed to do so.

4
Q

What is the difference between statistics and machine learning?

A

Statistics starts from a proposed model, whereas machine learning builds a model from data. Statistical tests typically assume a particular data distribution (often the normal distribution) in order to validate results. Machine learning does not always rely on the distribution characteristics of the data.

Statistics has implicit validation via the significance level. Machine learning performs explicit validation by counting errors using labeled cases.

5
Q

What are the advantages of statistics?

A
  • Quantification of effects (estimations for intercept and slope).
  • Implicit testing of significance (likelihood of finding a pattern by coincidence)
6
Q

What are the disadvantages of statistics?

A
  • Starts from a proposed model (hypothesis) (confirmatory analysis)
  • Makes assumptions on data distribution (otherwise no correct estimation of significance)
  • Choice of significance level is not straightforward.
    • A significance level that is too high (too lenient) increases the risk of wrongly concluding that the pattern exists.
    • A significance level that is too low (too strict) increases the risk of wrongly concluding that the pattern does not exist.
7
Q

What are the advantages of machine learning?

A

Does not always rely on the distribution of your data. Derives a model from your data, instead of proving a model with data. Does explicit validation by counting errors.

8
Q

What are the disadvantages of machine learning?

A

Requires labeled data to perform explicit validation. There is a risk of overfitting.

9
Q

What are the essential points of statistics?

A

. . . . .

10
Q

What is significance?

A

Significance answers the question: what is the probability that the pattern my model found is due to chance? It is calculated from the distribution of your data, which is assumed to be normal. If the data are not normally distributed, the significance is calculated incorrectly.

Low significance means there is a large probability that your pattern arose by chance, so the result cannot be trusted.

High significance means there is a small probability that your pattern arose by chance, so the result is more trustworthy.

11
Q

How do you calculate precision?

A

TP/(TP+FP)

12
Q

What are the essential points of machine learning?

A
  • Derive model from data.
  • Explicit validation by counting errors.
  • Beware of overfitting.
13
Q

What is a model?

A

A combination of formulas that transforms input data into an output (a classification or a prediction).

14
Q

How do you detect/check for overfitting?

A

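By using a test set: labeled data that is kept out of training. A model that scores much better on its training set than on the test set is overfitting.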
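
A minimal sketch of this check, assuming a toy 1-nearest-neighbour model and made-up data (the function names and the numbers are illustrative only). The model memorises its training set, including two noisy labels, so it looks perfect on the training data and worse on held-back data:

```python
def predict_1nn(train_x, train_y, x):
    """Return the label of the training point closest to x (1-nearest neighbour)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def accuracy(train_x, train_y, xs, ys):
    """Fraction of the points in xs whose predicted label matches ys."""
    hits = sum(predict_1nn(train_x, train_y, x) == y for x, y in zip(xs, ys))
    return hits / len(xs)


# Toy data: the true rule is "label 1 when x >= 10".
train_x = list(range(20))
train_y = [1 if x >= 10 else 0 for x in train_x]
train_y[3], train_y[14] = 1, 0  # two noisy (wrong) training labels

# Held-back test set, labeled with the true rule.
test_x = [x + 0.4 for x in range(20)]
test_y = [1 if x >= 10 else 0 for x in test_x]

train_acc = accuracy(train_x, train_y, train_x, train_y)
test_acc = accuracy(train_x, train_y, test_x, test_y)
print(train_acc, test_acc)  # 1.0 0.9
```

The gap between the two scores is the warning sign: the model has learned the noise in the training data, not just the underlying rule.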

15
Q

What is meant by training set?

A

This is the dataset that is used to train the model.

16
Q

How do you validate a model?

A

Using a test or validation set: labeled data that was not used for training. You calculate the performance by counting the errors your model makes on it. From these counts you can derive useful metrics like precision, recall and accuracy.

17
Q

What is a confusion matrix?

A

A confusion matrix is a matrix that shows the types of errors a model makes.

It shows true positives, false positives, true negatives and false negatives. These can be used to calculate performance metrics.
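
As a rough illustration, the four cells can be counted directly from the actual and predicted labels. The function name `confusion_counts` and the example labels are made up for this sketch:

```python
from collections import Counter


def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, TN and FN for a binary classification problem."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts


actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

c = confusion_counts(actual, predicted)
print(dict(c))
```

For these labels the model makes one error of each type: one positive case is missed (FN) and one negative case is flagged as positive (FP).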

18
Q

How do you interpret a confusion matrix?

A

A confusion matrix tells us how well the model performs. It shows the correct classifications on the main diagonal and the incorrect classifications off the diagonal (for a binary problem, on the other diagonal).

This can give us metrics such as accuracy, precision and recall.

19
Q

What can you learn from a confusion matrix?

A

How well a model performs and what types of errors it makes.

20
Q

What is accuracy? How do you interpret it? What can you learn from it?

A

Accuracy is the proportion of predictions that a model gets right. Be careful when using accuracy on unbalanced datasets: a simple model that always predicts the majority class will also score very well on this metric.
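
The majority-class pitfall can be shown with a small made-up dataset (5 positive cases out of 100; the numbers are assumptions for illustration):

```python
actual = [1] * 5 + [0] * 95       # 5 positives, 95 negatives
predicted = [0] * len(actual)     # a "model" that always predicts the majority class

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
positives_found = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
recall = positives_found / sum(actual)

print(accuracy)  # 0.95
print(recall)    # 0.0
```

The accuracy looks excellent, yet the model never finds a single positive case, which is why accuracy should be read together with metrics such as precision and recall.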

21
Q

What is precision? How do you interpret it? What can you learn from it?

A

Precision shows how trustworthy the model's positive predictions are: of all the cases it predicts as positive, which fraction really is positive. The higher the number the better: the higher the precision, the fewer negative cases are misclassified as positive.

22
Q

What is recall? How do you interpret it? What can you learn from it?

A

Recall shows how good the model is at identifying positive cases: TP/(TP+FN), the fraction of actual positive cases that the model finds. A high recall means it is very good at identifying positive cases; a low recall means it misses many of them.

You can learn from it how many of the positive cases your model catches.

23
Q

What kind of problems can you solve with machine learning?

A

Regression. Classification. Clustering. Association Rule Discovery.

24
Q

Why is data analytics relevant for managers?

A

Money, money, money. Because it will help you make faster, better decisions. It will help you reduce costs. It will lead you to new products and services.

25
Q

How can value be created from data analytics (multiple ways)?

A

Marketing: churn prediction, sentiment analysis.
Banking & Insurance: fraud detection, credit scoring.
Retail: recommender systems, shop behaviour.
Production: maintenance optimization.
Logistics: replenishment planning.
HR: CV matching.
Health: imaging, diabetes control, air quality monitoring.
Security: intelligence, smart cameras, crowd monitoring.

26
Q

What is so new about data & analytics?

A

Nothing in particular. What is new is the mass availability of data and computing power at low prices, which has made data analytics viable for the modern market.

27
Q

What is meant by the trade-off between precision and recall?

A

It is typically hard to get both good precision and good recall. Usually you trade one for the other: as you push precision higher, recall tends to drop, and vice versa.
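
One common way this trade-off shows up is through the decision threshold of a model that outputs scores. The sketch below uses made-up scores and labels, and the helper name `precision_recall` is an assumption for illustration:

```python
def precision_recall(scores, labels, threshold):
    """Predict positive when score >= threshold, then compute both metrics."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predicted, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else None  # undefined without positive predictions
    recall = tp / (tp + fn) if tp + fn else None     # undefined without actual positives
    return precision, recall


scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]

for threshold in (0.85, 0.55, 0.25):
    print(threshold, precision_recall(scores, labels, threshold))
# 0.85 (1.0, 0.4)
# 0.55 (0.8, 0.8)
# 0.25 (0.625, 1.0)
```

A strict threshold gives perfect precision but finds only 40% of the positives; a lenient threshold finds all of them at the cost of more false positives.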

28
Q

What is the precision and recall if the model always says yes in a binary classification model?

A

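In a binary classification model where the model always says yes, the recall will be perfect, but the precision will only equal the share of positive cases in the data, which is very low when positives are rare.

(It produces no false negatives, but every negative case becomes a false positive.)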

29
Q

What is the precision and recall if the model always says no in a binary classification model?

A

In a binary classification model where the model always says no, the recall will be 0 and the precision is undefined: there are no positive predictions, so TP/(TP+FP) is 0/0. Some tools report it as 0.

(It always says no, so there are no false positives, but every positive case becomes a false negative.)
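
The two constant classifiers (always yes, always no) can be checked on a made-up dataset in which 10% of the cases are positive. The helper below is a sketch; returning `None` for a 0/0 division is a choice made for this illustration:

```python
def precision_recall(actual, predicted):
    """Return (precision, recall); None marks an undefined 0/0 value."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall


actual = [1] * 10 + [0] * 90  # 10% of the cases are positive

always_yes = precision_recall(actual, [1] * 100)
always_no = precision_recall(actual, [0] * 100)

print(always_yes)  # (0.1, 1.0)
print(always_no)   # (None, 0.0)
```

Always saying yes gives a precision equal to the share of positives (0.1 here) with perfect recall; always saying no gives a recall of 0 and no positive predictions from which to compute a precision.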

30
Q

What is meant by creative destruction?

A

The process of industrial mutation that continuously revolutionizes the economic structure from within, incessantly destroying the old one, incessantly creating a new one.

31
Q

What is disruptive technology?

A

Innovations that significantly alter the ways that consumers, industries and businesses operate. A disruptive technology sweeps away the systems or habits it replaces because it has attributes that are recognizably superior.

32
Q

What is the danger of disruptive technology (from the perspective of society / from the perspective of a company)?

A

As entire industries are shaken to their core and established giants lose their upper hand, many people lose their jobs. Society might not be able to adapt to disruptive technologies as quickly as they appear. People who have worked their entire lives in industries that suddenly become irrelevant might have little chance of finding another job or of moving into the new technologies.

Companies that are established in these traditional technologies and that can’t or won’t change will succumb.

33
Q

What is the opportunity of disruptive technology (from the perspective of society / from the perspective of a company)?

A

We continuously improve ourselves and the world around us. Society as a whole benefits as technology improves significantly, removing the inferior qualities of prior products and services. Companies also benefit from this as they unlock more opportunities to monetize and create value from these new disruptive technologies.

34
Q

What are the characteristics of disruptive technology?

A
  • Radical new products, services, business models
  • Shake the market, reset the rules
  • Fast growing new entrants challenging incumbents
  • Level the playing field
35
Q

What is meant by the data analytical cycle?

A

Data analytics is a continuous improvement cycle.

36
Q

What are the typical steps of the data analytical cycle, what do we mean with these steps, and what are typical activities within these steps?

A
  1. Business Case: questions, threats, opportunities, optimizations, new products, new services.
  2. Data Selection & Collection:
    - Selection: sources, natural experiment, ...
    - Collection: experiment, enterprise information system, external data/databases, web crawling/scraping, web services/APIs
  3. Data Preparation: cleaning, outlier removal, missing values, wrangling, reduction, feature selection, feature extraction.
  4. Explorative Analytics: visual exploration.
    - unsupervised machine learning: clustering, association rule mining.
  5. Predictive Modelling: supervised machine learning.
  6. Interpretation & Action: insights, decisions, operational deployment.
37
Q

What is meant by the art of data analytics?

A

Training a model is easy, but garbage in is garbage out. The selection of training data and the selection of models and parameters are crucial, and this requires a more creative approach.

38
Q

Why is data science not only science but also art?

A

Training a model is easy, but garbage in is garbage out. The selection of training data and the selection of models and parameters are crucial, and this requires a more creative approach.

39
Q

Why can you use more techniques to find patterns with machine learning compared to statistics?

A

Because machine learning is not always dependent on the data being normally distributed.

41
Q

Which types of problems are supervised?

A

Regression and classification.

42
Q

Which types of problems are unsupervised?

A

Clustering and association rule discovery.

43
Q

How do you decide to choose for supervised or unsupervised techniques?

A

It depends on the business case, that is, on the result we are trying to achieve. Supervised techniques also require labeled data.

44
Q

What is clustering and when to use it?

A

Clustering is a technique used to segment your data into groups. You use it when you don’t know the labels of your data, or when you’re trying to identify groups…

45
Q

What is clustering and when to use it?

A

Split or group cases/observations.

46
Q

What is association rule discovery and when to use it?

A

Discover events that happen together, e.g. for recommendation systems or market basket analysis.

47
Q

What is estimation/prediction and when to use it?

A

Predict a continuous value: e.g. a house price.

48
Q

What is classification and when to use it?

A

Predict a categorical value: e.g. male/female.

49
Q

What is a recommender system?

A

Recommender systems are used to determine which items a customer would be interested in, based on items they previously liked or frequently buy. For example, Netflix uses this to recommend movies you might enjoy, to keep you on its platform for longer.

50
Q

What is market basket analysis?

A

This is a technique used by retailers to uncover associations between items.

It works by looking for combinations of items that frequently occur together in transactions. To put it another way, it allows retailers to identify relationships between the items that people buy.
Association rules are widely used to analyze retail basket or transaction data. They are intended to identify strong rules in the data, using measures of interestingness such as support, confidence and lift.
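
As a sketch of how the strength of a rule can be measured, the example below computes three commonly used measures (support, confidence and lift) for the rule "bread implies butter". The transactions and function names are made up for illustration:

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"milk"},
    {"jam", "milk"},
    {"jam"},
]
n = len(transactions)


def count(items):
    """Number of transactions that contain every item in `items`."""
    return sum(items <= basket for basket in transactions)


def support(items):
    """Fraction of all transactions that contain the item set."""
    return count(items) / n


def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)


def lift(antecedent, consequent):
    """How much more often the items occur together than expected if they were independent."""
    return count(antecedent | consequent) * n / (count(antecedent) * count(consequent))


bread, butter = {"bread"}, {"butter"}
print(support(bread | butter))    # 0.375
print(confidence(bread, butter))  # 0.75
print(lift(bread, butter))        # 1.5
```

A lift above 1 suggests the two items are bought together more often than chance alone would explain.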

51
Q

Why is parallel processing the solution to the performance problems in big data?

A

Because we are reaching the limits of how far hardware can be optimized: individual computers and processors are no longer getting much faster or smaller. We therefore need other ways to improve performance, and parallel processing is one of them.

52
Q

Why do we need to adapt software to make use of parallel processing?

A

Because most software is written to execute its instructions one after another. To run in parallel, it has to be explicitly programmed to deal with the intricacies of parallel processing: the instructions have to be laid out specifically, and the data has to be managed more carefully.

53
Q

How can we make use of parallel processing in data analytics without having to adapt software?

A

We can train multiple models side by side, for example the same model with different parameters (e.g. K-Means clustering with different numbers of clusters and different start positions). This reduces the lead time of the task.

We can also split our data into multiple parts and train separate models on the separate subsets. For classification problems, the resulting models can be recombined. For clustering this generally does not work, because we might discover different clusters on different data subsets.

55
Q

Why will a program adapted for parallelization not run N times faster on N processors?

A

Because of the computing overhead of splitting the data, moving/copying data across different processes, reassembling the data, etc.
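
A related, standard way to reason about this is Amdahl's law (not named on the card): if only a fraction of the run time can be parallelized, the remainder, which includes work such as splitting and reassembling the data, caps the achievable speed-up. A minimal sketch, assuming 90% of the work is parallelizable:

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Theoretical speed-up when only part of the work can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


for n in (2, 8, 64):
    print(n, round(amdahl_speedup(0.9, n), 2))
# 2 1.82
# 8 4.71
# 64 8.77
```

Even with 64 processors the speed-up stays below 10, because the 10% that runs serially does not get any faster.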

56
Q

Why will a computer that is 100 times faster not solve all performance problems?

A

Because some problems scale worse than linearly with the amount of data. If the work grows quadratically, for example, a 10-fold increase in data means 100 times as much work. In these cases, better software/algorithms are the answer.
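
The "10 times the data, 100 times the work" pattern is what quadratic growth looks like. As a sketch, assume an algorithm that compares every item with every other item (the function name is made up for illustration):

```python
import math


def pairwise_comparisons(n):
    """Work done by an algorithm that compares every item with every other item."""
    return n * (n - 1) // 2


small, large = 1_000, 10_000
growth = pairwise_comparisons(large) / pairwise_comparisons(small)
print(round(growth, 1))  # roughly 100 times the work for 10 times the data

# For quadratic work, a computer that is 100 times faster can only
# handle sqrt(100) = 10 times as much data in the same amount of time.
extra_data_factor = math.sqrt(100)
print(extra_data_factor)
```

This is why faster hardware alone does not help much here, while an algorithm with better scaling behaviour does.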

57
Q

What is meant by selection bias?

A

Drawing conclusions from the wrong or unrepresentative data.

Example: you could say that hospitals are inherently dangerous because most people who die, die in a hospital. This is not true: people who are seriously injured or ill are simply the ones who go to a hospital.

58
Q

What kind of role can managers play in the data analytics cycle?

A

Providing the business case, aiding in the data selection and assisting in the interpretation and decision-making steps. Acting as the translator between the technology side and the business side.

59
Q

What is meant by spurious correlation?

A

A correlation that is caused by random chance or by a third (unseen) factor.

60
Q

What is process mining?

A

Process mining takes existing data records as a starting point, extracts different variations of the process and automatically turns them into understandable visualizations.

This can show remarkable deviations, unnecessary rework and the real bottlenecks.

61
Q

What kind of data does process mining use?

A

Existing data records/event logs.

62
Q

What can you do with discovery process mining?

A
  • Reverse Engineering: derive process model from event log.
  • Decision Mining: check how decisions are made.
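
A common first step when deriving a process model from an event log is to count which activity directly follows which (a directly-follows graph). The sketch below assumes a made-up event log of (case id, activity) pairs that is already ordered by time:

```python
from collections import Counter, defaultdict

# Hypothetical event log: (case id, activity), already ordered by time.
event_log = [
    ("order-1", "receive"), ("order-1", "check"), ("order-1", "ship"),
    ("order-2", "receive"), ("order-2", "check"), ("order-2", "rework"),
    ("order-2", "check"), ("order-2", "ship"),
    ("order-3", "receive"), ("order-3", "ship"),
]


def directly_follows(log):
    """Count how often activity A is directly followed by activity B within a case."""
    traces = defaultdict(list)
    for case_id, activity in log:
        traces[case_id].append(activity)
    edges = Counter()
    for trace in traces.values():
        edges.update(zip(trace, trace[1:]))
    return edges


edges = directly_follows(event_log)
for (a, b), n in sorted(edges.items()):
    print(f"{a} -> {b}: {n}")
```

The resulting edges already reveal the process variants in the log: the normal path, a rework loop, and one case that skipped the check.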

63
Q

What can you do with conformance checking process mining?

A
  • Auditing/Testing: compare real process flows with intended process flows.
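
As a minimal sketch of such a comparison, assume the intended process is described as a set of allowed steps and the real flows as one list of activities per case (all names and data are made up for illustration):

```python
# Intended process, as the set of allowed "A directly followed by B" steps.
allowed_steps = {("receive", "check"), ("check", "ship")}

# Real process flows, one ordered list of activities per case.
traces = {
    "order-1": ["receive", "check", "ship"],
    "order-2": ["receive", "check", "rework", "check", "ship"],
    "order-3": ["receive", "ship"],
}


def deviations(trace):
    """Return the steps in a trace that the intended process does not allow."""
    return [step for step in zip(trace, trace[1:]) if step not in allowed_steps]


report = {case: deviations(trace) for case, trace in traces.items()}
for case, steps in report.items():
    print(case, "conforms" if not steps else steps)
```

Real conformance checking techniques are more elaborate, but the idea is the same: every observed step is tested against the intended model and the deviations are reported per case.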
64
Q

What can you do with performance mining?

A
  • Optimization: add additional data (e.g. waiting and process times) to get additional insights (e.g. bottlenecks).
65
Q

What is network mining?

A

Network mining is analysing the connections between people and/or objects to uncover their relations.

66
Q

What kind of data does network mining use?

A
  • Contacts (emails, phone calls, Facebook)
  • Transactions (financial, trade, fraud)
  • Citations (scientific, …)
  • Co-occurrence
  • Collaboration
67
Q

What can you do with network mining?

A

Gain insights into communities and networks. This can reveal patterns, weaknesses, opportunities for optimization, etc.

68
Q

What kind of insights can you gain from important nodes in your network?

A

You can discover the most crucial points in your network. For example, in social networks you could identify the social influencers who can be targeted for your marketing campaigns. If you can convince these people, they will convince the people in their networks.
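
One simple way to rank nodes by importance is degree centrality: the number of connections a node has, divided by the number of other nodes. It is only one of several centrality measures; the contact network below is made up for illustration:

```python
from collections import Counter

# Hypothetical contact network: each pair is one connection.
edges = [
    ("Ann", "Bob"), ("Ann", "Cem"), ("Ann", "Dee"),
    ("Ann", "Eli"), ("Bob", "Cem"), ("Eli", "Fay"),
]

# Count the connections of every node.
degree = Counter()
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

# Normalise by the number of other nodes in the network.
n = len(degree)
centrality = {node: d / (n - 1) for node, d in degree.items()}
most_connected = max(centrality, key=centrality.get)

print(most_connected, centrality[most_connected])  # Ann 0.8
```

In a marketing setting, the highest-ranked nodes would be the first candidates to approach as influencers.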

69
Q

What kind of insights can you gain from communities in your network?

A

You can discover which kinds of groups you should be targeting, how people interact within these groups, how the groups interact with other groups, …

70
Q

What are the potential applications of network mining?

A

Marketing

  • Segmentation/ communities
  • Influencers

Bottlenecks and load balancing

  • Physical networks
  • Processes
  • People

Fraud detection

Anti-terrorism, anti-espionage, crime

Disease control

Collaboration

Behaviour analysis

  • Humans
  • Animals
  • Plants
72
Q

What is a type 1 error?

A

A type 1 error is a false positive: reporting a pattern or positive case that is not really there.

73
Q

What is a type 2 error?

A

A type 2 error is a false negative: missing a pattern or positive case that is really there.