A Crash Course in Data Science Flashcards

1
Q

What is Data Science?

A

Data science is an interdisciplinary field concerned with the processes and systems used to extract knowledge or insights from data in various forms, whether structured or unstructured. It continues data analysis fields such as statistics, data mining, and predictive analytics, and is similar to Knowledge Discovery in Databases (KDD).

2
Q

What are some key activities that define the field of Statistics?

A
  1. Descriptive statistics
  2. Inference
  3. Prediction
  4. Experimental Design
3
Q

What is Descriptive Statistics?

A

Descriptive statistics includes exploratory data analysis, unsupervised learning, clustering and basic data summaries.

Descriptive statistics have many uses, most notably helping us get familiar with a data set.

Descriptive statistics usually are the starting point for any analysis.

Often, descriptive statistics help us arrive at hypotheses to be tested later with more formal inference.
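As a minimal sketch of the "basic data summaries" mentioned above, the following uses Python's standard `statistics` module. The `summarize` helper and the sample values are made up for illustration and are not part of the original deck.

```python
import statistics

def summarize(values):
    """Return basic descriptive summaries for a list of numbers."""
    ordered = sorted(values)
    return {
        "n": len(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "stdev": statistics.stdev(ordered),  # sample standard deviation
    }

# Made-up measurements, for illustration only.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
summary = summarize(sample)
print(summary)
```

Summaries like these are typically a first look at a data set, before any formal inference.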

4
Q

What is Inference?

A

Inference is the process of making conclusions about populations from samples.

Inference includes most of the activities traditionally associated with statistics such as: estimation, confidence intervals, hypothesis tests and variability.
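To make estimation and confidence intervals concrete, here is a rough sketch that estimates a population mean from a sample and attaches an approximate 95% interval. It assumes a normal approximation (for small samples a t-based interval would usually be more appropriate), and the function name and data are hypothetical.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the population mean.

    Assumption: uses the normal approximation, not the t distribution.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * standard_error, mean + z * standard_error

# Made-up sample drawn from some population of interest.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
low, high = mean_confidence_interval(sample)
print(f"estimate={statistics.mean(sample)}, interval=({low:.2f}, {high:.2f})")
```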

5
Q

What is prediction?

A

Prediction overlaps quite a bit with inference, but modern prediction tends to have a different mindset.

Prediction is the process of trying to guess an outcome given a set of realizations of the outcome and some predictors.

6
Q

What are some prediction algorithms?

A
  • Machine learning
  • Regression
  • Deep learning
  • Boosting
  • Random forests
  • Logistic regression

are all prediction algorithms.

7
Q

What is Classification?

A

If the target of prediction is binary or categorical, prediction is often called classification.
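As one possible illustration of predicting a categorical target, the sketch below implements a one-nearest-neighbor classifier on a single numeric predictor. This technique is chosen here only for brevity; the deck does not name it, and the data and labels are invented.

```python
def nearest_neighbor_classify(train_features, train_labels, query):
    """Return the label of the training point closest to `query` (1-nearest neighbor)."""
    closest = min(
        range(len(train_features)),
        key=lambda i: abs(train_features[i] - query),
    )
    return train_labels[closest]

# Made-up data: one numeric predictor and a binary label.
features = [1.0, 1.2, 0.8, 3.9, 4.1, 4.3]
labels = ["small", "small", "small", "large", "large", "large"]

low_guess = nearest_neighbor_classify(features, labels, 1.1)
high_guess = nearest_neighbor_classify(features, labels, 4.0)
print(low_guess, high_guess)
```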

8
Q

What is the purpose of Random Sampling?

A

In random sampling, one tries to randomly sample from a population of interest to get better generalizability of the results to the population.
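A simple random sample without replacement can be sketched with the standard `random` module. The population of ID numbers and the `simple_random_sample` helper are assumptions made for illustration.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k distinct members of the population, each equally likely."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.sample(population, k)

# Hypothetical population: ID numbers 1 through 1000.
population = list(range(1, 1001))
sample = simple_random_sample(population, 50, seed=42)
print(len(sample), sum(sample) / len(sample))
```

Because every member has the same chance of selection, summaries of the sample tend to generalize to the population.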

9
Q

What are the two main activities of machine learning?

A
  1. Supervised Learning
  2. Unsupervised Learning
10
Q

What is Supervised Learning?

A

Supervised learning uses a collection of predictors and some observed outcomes to build an algorithm that predicts the outcome when it is not observed.

Some examples include: neural networks, random forests, boosting and support vector machines.

11
Q

What is Unsupervised Learning?

A

Unsupervised learning tries to uncover unobserved factors in the data. It is called “unsupervised” because there is no gold standard outcome to judge against.

Example algorithms include hierarchical clustering, principal components analysis, factor analysis and k-means.
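As a rough sketch of k-means, one of the algorithms listed above, the following clusters one-dimensional points. The starting centers are supplied by hand (an assumption made to keep the example deterministic), and the data are invented. Note that no outcome labels are used anywhere.

```python
def kmeans_1d(points, initial_centers, iterations=10):
    """Plain k-means on 1-D data: assign points to the nearest center, then re-average."""
    centers = list(initial_centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Made-up measurements with two visible groups.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centers, clusters = kmeans_1d(points, initial_centers=[1.0, 10.0])
print(centers, clusters)
```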

12
Q

What are some characteristics of Machine Learning?

A
  1. the emphasis on predictions;
  2. evaluating results via prediction performance;
  3. having concern for overfitting but not model complexity per se;
  4. emphasis on performance;
  5. obtaining generalizability through performance on novel datasets;
  6. usually no superpopulation model specified;
  7. concern over performance and robustness.
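Evaluating results via prediction performance on data the model has not seen can be sketched with a hold-out split. The helper names, the 25% test fraction, the synthetic data, and the deliberately naive mean-only "model" are all assumptions for illustration.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and hold out a fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def mean_squared_error(actual, predicted):
    """Average squared difference between outcomes and predictions."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Synthetic (predictor, outcome) pairs.
rows = [(x, 2.0 * x + 1.0) for x in range(20)]
train, test = train_test_split(rows)

# Naive baseline model: always predict the mean outcome seen in training.
baseline = sum(y for _, y in train) / len(train)
test_mse = mean_squared_error([y for _, y in test], [baseline] * len(test))
print(f"held-out MSE of the baseline: {test_mse:.2f}")
```

A model that only looks good on its own training rows is overfitting; the held-out error is the figure that speaks to generalizability.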
13
Q

What are some characteristics of traditional Statistics?

A
  1. emphasizing superpopulation inference;
  2. focusing on a-priori hypotheses;
  3. preferring simpler models over complex ones (parsimony), even if the more complex models perform slightly better;
  4. emphasizing parameter interpretability;
  5. having statistical modeling or sampling assumptions that connect data to a population of interest;
  6. having concern over assumptions and robustness.

In recent years, the distinction between the two fields has substantially faded. ML researchers have worked tirelessly to improve interpretability, while statistical researchers have improved the prediction performance of their algorithms.

14
Q

Example of Supervised Learning

A

An early example of supervised learning is the development of regression.

Francis Galton wanted to predict children’s heights from their parents’ heights, and he developed linear regression in the process.

Notice that having several children with known adult heights along with their parents allows one to build the model, then apply it to parents who are expecting.
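The build-then-apply pattern described above can be sketched with ordinary least squares on a single predictor. The heights below are invented for illustration and are not Galton's data; `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up training data: parent height and the child's adult height, in inches.
parent_heights = [64, 66, 68, 70, 72]
child_heights = [65.0, 66.5, 67.5, 69.5, 70.5]

slope, intercept = fit_line(parent_heights, child_heights)

# Apply the fitted model to a parent whose child's height is not yet observed.
predicted = intercept + slope * 69
print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction={predicted:.2f}")
```

In this toy data the slope is below 1, so predictions are pulled toward the average height.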

15
Q

Example of Unsupervised Learning

A

A famous early example of unsupervised clustering is the computation of the g-factor.

This was postulated to be a measure of intrinsic intelligence. Early factor analytic models were used to cluster scores on psychometric questions to create the g-factor.

Notice the lack of a gold standard outcome. There was no true measure of intrinsic intelligence to train an algorithm to predict it.

16
Q

What are the outputs of a data science project?

A
  1. Reports
  2. Presentations
  3. Interactive web pages
  4. Apps
17
Q

What are the hallmarks of a good report?

A
  • Be clearly written
  • Involve a narrative around the data
  • Discuss the creation of the analytic dataset
  • Have concise conclusions
  • Omit unnecessary details
  • Be reproducible
18
Q

What are some tools for producing reproducible reports?

A
  • knitr
  • IPython (Jupyter) notebooks
19
Q

How would you define the success of a data science experiment?

A
  1. New knowledge is created.
  2. Decisions or policies are made based on the outcome of the experiment.
  3. A report, presentation or app with impact is created.
  4. It is learned that the data can’t answer the question being asked of it.
20
Q

What are some negative results of data science experiments?

A
  • Decisions being made that disregard clear evidence from the data
  • Equivocal results that do not shed light in one direction or another
  • Uncertainty that prevents new knowledge from being created