Statistical Thinking for Data Science and Analytics Flashcards

1
Q

What is data science? (A professor’s definition.)

A

I really think of data science as the pairing of people who develop technology that can learn from data with people who have data and who have problems to solve.

And really, it’s interdisciplinary at heart.

To me, data science is about building tools to help solve problems with data.

Data science is about building tools to uncover patterns, to form predictions, to help us explore data to understand the world.

And this involves pushing fields like computer science and optimization and statistics in new ways.

But there has to be an application of some sort.

So the intersection of probability, statistics, computer science, and an application is an essential definition of data science, I believe.

2
Q

What questions can data science answer?

A
3
Q

Why is there an explosion of data?

A
4
Q

Why is data visualization important?

A

Data visualization is important for people who want to explore their data, to get some idea of what it contains, and therefore, perhaps, to develop some intuitions about how they would go about solving a problem, or learning from that data.

Visualization is also really important when we’re looking at the output of data science systems.

Interpretation relies heavily on data visualization. Being able to take things from a mathematical space, which is fairly abstract, convert them, speak to a clinician, and map the areas on the brain that are affected is absolutely critical.

For dashboards: people won’t be able to see the benefits that are being provided if they are not able to visualize things.

Secondly, data visualization is important for the communication of the results of data science to a general audience.

Number one, it makes it very easy to understand what’s going on.

Also, the best data visualizations, at least the ones that I’ve seen, introduce new questions.

So it’s both the initial exploration of the data set and the presentation of the results to the people that need to understand what we’ve learned.

5
Q

What skills does a data scientist need?

A
  1. Math and statistics foundation
  2. Algorithms for big data
  3. Computer Science knowledge
  4. Storing and accessing data
  5. Parallel processing of data
  6. How to apply DS to real world problems
  7. Optimization
  8. Statistical way of thinking beyond theory
  9. Machine Learning
  10. Complexity Theory
6
Q

In 2011 Peter Warden writes in “Why the term ‘data science’ is flawed but useful” that:

A

Traditional scientists choose a problem, then find data to shed light on it.

7
Q

What is the first step in any data analysis project?

A

The first step of any data analysis project is “data conditioning,” or getting data into a state where it’s usable

8
Q

When a company does a data mashup it is:

A

Using data from multiple, disparate sources to create a data product

9
Q

Data conditioning involves:

A

Getting data into a state where it is usable

10
Q

The MapReduce approach is a strategy for:

A

Processing a large data set using a large number of computers

11
Q

The term “stream processing” refers to:

A

Processing data as it arrives

12
Q

Machine learning almost always requires:

A

A training set of data

13
Q

The role of Statistics in Data Science:

A

Showing trends in the data being analyzed

14
Q

Making data tell its story:

A

Involves creating visualizations

15
Q

What is the validity of our results of data experiments based on?

A

So the validity of results depends on the validity of assumptions we make on the data generating process.

Such assumptions include assumptions on sampling, randomization, the measurements of the data, and independence between variables, and so forth.

16
Q

When we call an observed effect statistically significant, we mean that:

A

The effect is unlikely to occur purely by chance

17
Q

What is Data?

A

Data are numbers, but they’re not just simply numbers.

They’re numbers with context.

18
Q

What are the different units of measurement in data sets?

A

The units of measurement can be objects, dates, time units, events, et cetera.

So it basically comes down to what unit we’re taking measurements on.

19
Q

Why do we study variables in data sets?

A

Variables are really the central focus of analysis because we want to study the variation of variables: to gauge the trend, the randomness, and the extent of variability in a particular variable, in order to generate knowledge about the population.

20
Q

How many types of variables are there and what are they?

A

3.

  • Categorical
  • Quantitative
  • Ordinal
21
Q

What are summary statistics?

A

Summary statistics are summaries of numerical data.

They do not tell the whole story, but they’re useful and meaningful.

22
Q

Generally how are categorical and quantitative summaries visualized?

A

Categorical data - Pie Charts, Bar Plots

Quantitative data - Histograms

23
Q

Good to remember about how to visualize different data…

A

Even though technically one can make a pie chart for any numerical values, a pie chart for price values of products will not be meaningful as there are too many possible values and the values should also be arranged in an increasing order. The pie chart treats each distinct value of the variable as a category and does not use the order information of these values.

24
Q

Good to remember about standard statistical notation:

A

For a data set, we use n for the sample size, or the number of individuals in the data. The variables are represented by letters close to the end of the alphabet, such as X, Y and Z. We use the letter i to index the individuals. Therefore $X_i$ refers to the value of variable X for the ith individual.

One important notation in statistics is the summation sign, $\sum$ (the capital Greek letter sigma). For example

$\sum_{i=1}^{n} X_i$

would mean a sum of the n values from $X_1$ to $X_n$.

If we replace $X_i$ in the sum above by $(X_i - 3)^2$, then the quantity changes to a sum of $(X_1 - 3)^2$, $(X_2 - 3)^2$, …, $(X_n - 3)^2$.
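
As a rough illustration, the summation notation maps directly onto a loop over the index i. The data values below are made up for the example, and Python indexes from 0 rather than 1.

```python
# Hypothetical data: n = 4 observed values of a variable X
X = [2, 5, 7, 10]
n = len(X)

# Sum of X_i for i = 1..n (Python indices run 0..n-1)
total = sum(X[i] for i in range(n))

# Same sum with each term replaced by (X_i - 3)^2
total_shifted_squares = sum((X[i] - 3) ** 2 for i in range(n))

print(total)                  # 2 + 5 + 7 + 10
print(total_shifted_squares)  # 1 + 4 + 16 + 49
```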

25
Q

See Image for Question

A

See image for answer

26
Q

What is the first thing to consider when summarizing numerical data?

A

Center of variation.

The center of variation is the value around which the observed values are distributed.

27
Q

2 commonly used methods to show center of variation:

A

The first one is the mean, which is the numerical average of the observed values.

The second is the median, which is the midpoint of the sorted values.
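
A small sketch of the difference between the two, using made-up income figures (an assumption for illustration only). One extreme value pulls the mean up, while the median stays at the middle observation.

```python
from statistics import mean, median

# Hypothetical incomes (in thousands); the last value is an outlier
incomes = [30, 35, 40, 45, 500]

avg = mean(incomes)    # numerical average: sum of the values divided by n
mid = median(incomes)  # midpoint: the middle value after sorting

print("mean:", avg)
print("median:", mid)
```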

28
Q

Why is it always a good idea to plot the data, in addition to reporting summary statistics?

A
  • Summary statistics alone don’t necessarily provide insight into the shape of the distribution, e.g. normal or skewed.
  • A visualisation such as a box plot is not only easy to read and understand, but can also show outliers in the data.
  • A data plot is an image, and images make sense to the human brain more quickly than raw numbers.
  • Sometimes plotting the data can give additional insight into the data itself.
  • Visuals are also more compelling to people and help communicate what the data is saying.
29
Q

Define association.

A

Association is when certain values of one variable are observed more frequently with certain values of another variable.

30
Q

What does a correlation of ‘0’ mean? Does it tell the whole story?

A

A correlation of ‘0’ means there is no linear association, but this does not mean there is no association. To get the whole picture, look at the scatter plot. There could be a ‘U’-shaped pattern and hence some association.
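
A minimal sketch of the ‘U’ case, using made-up points where y is exactly x squared. The helper `pearson_r` is written out here for illustration; it is not taken from the course material.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]  # y is completely determined by x (a 'U' shape)

r = pearson_r(xs, ys)
print(r)  # essentially zero, even though y depends on x
```

The correlation is zero because the positive and negative halves of the ‘U’ cancel out, which is why the scatter plot is needed to see the association.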

31
Q

How to determine if there are cause-effect relationships?

A
  • Randomized Experiments
    • A/B Testing
    • Control Groups
    • Double Blinded Studies
  • Causal Inference from Observational Data
32
Q

Why do we need a sample?

A

To derive knowledge from sample to population, we need to have a representative sample.

33
Q

What happens if we do not have a good representative sample?

A
  • Misleading Outcomes
  • Biased Results
  • Difficult to analyze results
  • Wastage of time and money
34
Q

What are the 2 characteristics of Randomness?

A
  1. Unpredictability
  2. Trends
35
Q

What is Probability?

A

Probability is the proportion of a certain occurrence in the long run.

It is only when you have a large number of occurrences in the long run that you can use probability to accurately describe the proportion of any certain random outcome.
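
The long-run idea can be sketched with a simulated fair coin. The function name, the seed, and the toss counts below are assumptions chosen for the illustration.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

def proportion_of_heads(n_tosses, p_head=0.5):
    """Proportion of heads observed in n_tosses simulated coin tosses."""
    heads = sum(random.random() < p_head for _ in range(n_tosses))
    return heads / n_tosses

# With few tosses the proportion can be far from 0.5;
# with many tosses it settles close to the probability.
for n in (10, 100, 10_000, 100_000):
    print(n, proportion_of_heads(n))
```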

36
Q

Probability Rules : Very Important

A

Specific Addition Rule (mutually exclusive events only)

P(A or B) = P(A) + P(B)

General Addition Rule (events need not be mutually exclusive)

P(A or B) = P(A) + P(B) - P(A and B)

Specific Multiplication Rule (independent events only)

P(A and B) = P(A) * P(B)

General Multiplication Rule (conditional probability)

P(A and B) = P(A) * P(B|A)

or, equivalently, when P(A) > 0,

P(B|A) = P(A and B) / P(A)
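
These rules can be checked by enumeration on a small example. The sketch below assumes one roll of a fair six-sided die; the events A, B, C, and D are invented for the illustration.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}  # one roll of a fair six-sided die

def prob(event):
    """Probability of an event (a set of outcomes), all outcomes equally likely."""
    return Fraction(len(event & outcomes), len(outcomes))

A = {2, 4, 6}  # roll is even
B = {4, 5, 6}  # roll is greater than 3
C = {1, 3}     # cannot happen together with A
D = {1, 2}     # independent of A

# Specific addition rule: A and C are mutually exclusive
assert prob(A | C) == prob(A) + prob(C)

# General addition rule: A and B overlap, so subtract P(A and B)
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)

# Conditional probability and the general multiplication rule
p_B_given_A = prob(A & B) / prob(A)
assert prob(A & B) == prob(A) * p_B_given_A

# Specific multiplication rule holds for independent events only
assert prob(A & D) == prob(A) * prob(D)
assert prob(A & B) != prob(A) * prob(B)

print(p_B_given_A)
```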

37
Q

How can we study a sampling distribution?

A
  • by simulation
  • by experiment
  • by mathematical models
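
As a sketch of the simulation route, the code below repeatedly draws samples from a hypothetical Uniform(0, 1) population and collects the sample means. The population, sample sizes, and number of repetitions are assumptions for the example.

```python
import random
from statistics import mean, stdev

random.seed(1)  # fixed seed so the run is reproducible

def sample_means(sample_size, n_samples=2000):
    """Means of repeated samples from a hypothetical Uniform(0, 1) population."""
    return [
        sum(random.random() for _ in range(sample_size)) / sample_size
        for _ in range(n_samples)
    ]

means_small = sample_means(5)   # sampling distribution of the mean, n = 5
means_large = sample_means(50)  # sampling distribution of the mean, n = 50

spread_small = stdev(means_small)
spread_large = stdev(means_large)

# Larger samples give sample means that vary less around the population mean
print(spread_small, spread_large)
```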
38
Q

Factors influencing sampling distribution?

A
  • Sample size
  • Data generation process
  • Population distribution for the variable of interest
39
Q

What is a ‘Confidence Interval’?

A

A confidence interval is a range of values, computed from a sample, that is intended to contain a population parameter. The confidence level (most commonly 95% or 99%) is the proportion of such intervals, over repeated samples, that would contain the parameter.

40
Q

What is coverage probability?

A

In statistics, the coverage probability of a confidence interval is the proportion of the time that the interval contains the true value of interest.
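
Coverage can be estimated by simulation. The sketch below assumes a normal population with a known mean (so that coverage can be checked) and uses the approximate interval "sample mean ± 1.96 standard errors"; all names and numbers are illustrative.

```python
import random
from math import sqrt

random.seed(2)  # fixed seed so the run is reproducible

TRUE_MEAN = 10.0  # hypothetical population mean
TRUE_SD = 2.0     # hypothetical population standard deviation

def interval_covers_truth(sample_size=100, z=1.96):
    """Draw one sample, build an approximate 95% interval, check if it contains TRUE_MEAN."""
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(sample_size)]
    center = sum(sample) / sample_size
    sd = sqrt(sum((x - center) ** 2 for x in sample) / (sample_size - 1))
    half_width = z * sd / sqrt(sample_size)
    return center - half_width <= TRUE_MEAN <= center + half_width

n_runs = 2000
coverage = sum(interval_covers_truth() for _ in range(n_runs)) / n_runs
print(coverage)  # proportion of intervals that contained the true mean
```

The estimated coverage should land near the nominal 95%, which is the sense in which the confidence level describes the procedure rather than any single interval.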

41
Q

What is a hypothesis?

A

A hypothesis is a statement regarding a value of interest.

It may be true or it may be incorrect.

But it is a statement that you’re trying to collect evidence, collect information, to prove or disprove.

42
Q

Why would we need to carry out observational health studies?

A

Because randomized controlled trials may not always be possible

43
Q

What is the main issue suffered by observational health studies?

A

Systematic errors such as bias and confounding.

44
Q

What is the “negative control” used in this study?

A

A real drug-outcome pair for which there is no causal relationship.

45
Q

Among individuals who do not take supplement A, the risk (probability) of being infected by the seasonal flu is 0.2 during the flu season. What is the corresponding odds value? ( Hint: Odds is defined as P(A)/[1-P(A)] ).

A

1/4
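
The arithmetic follows directly from the hint. In the sketch below, exact fractions are used so the result prints as 1/4; the function name `odds` is an assumption.

```python
from fractions import Fraction

def odds(p):
    """Odds corresponding to a probability p, defined as p / (1 - p)."""
    return p / (1 - p)

risk = Fraction(1, 5)   # probability 0.2 of being infected
flu_odds = odds(risk)   # (1/5) / (4/5)

print(flu_odds)
```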

46
Q

Why are the characteristics of estimates derived from observational health studies unknown?

A
  1. We do not have full control or knowledge of the sampling process
  2. We do not know whether there are unmeasured confounding factors
  3. We do not know whether there are any systematic measurement errors in the observed data.
47
Q

Why do we need to have better understanding of the characteristics of estimates derived from observational health studies?

A
  1. Biased estimates will lead to misrepresented statistical significance
  2. Unmeasured confounding will lead to spurious association findings
  3. Systematic measurement errors will contribute to poor reproducibility of findings.
48
Q

What does P( B | A) denote?

A

It denotes conditional probability: the probability of ‘B’ given that ‘A’ has occurred.

49
Q

Consider the biased coin that produces heads 70% of the time from the probability calculation in week 2. One tosses the coin twice. Given that both tosses have the same outcome, what is the probability that both tosses are tails?

A

0.155

EXPLANATION

Given this biased coin, P(HH) = 0.49, P(HT)=0.21, P(TH)=0.21, P(TT)=0.09.

P(both tosses are the same) = P(HH or TT) = 0.49+0.09 = 0.58.

P(TT | both tosses are the same) = P(TT)/P(both tosses are the same) = 0.09/0.58.
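
The same calculation can be sketched by enumerating the four outcomes of two independent tosses. Variable names are illustrative.

```python
from itertools import product

p = {"H": 0.7, "T": 0.3}  # biased coin from the question

# Joint probability of each pair of independent tosses
joint = {a + b: p[a] * p[b] for a, b in product("HT", repeat=2)}

p_same = joint["HH"] + joint["TT"]      # P(both tosses are the same)
p_tt_given_same = joint["TT"] / p_same  # P(TT | both tosses are the same)

print(round(p_tt_given_same, 3))  # 0.155
```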

50
Q

In a box, there are the same number of two kinds of coins: fair coins (50% chance of heads) and biased coins (70% chance of heads). One person randomly selects a coin and tosses it twice. Both tosses are tails. What is the probability that the selected coin is a biased coin?

A

0.265

EXPLANATION

P(TT | fair coin) = 0.25

P(TT | biased coin) = 0.09

P(fair coin) = 0.5

P(biased coin) = 0.5

P(biased coin | TT) = 0.5*0.09 /(0.5*0.09+0.5*0.25) = 0.265.
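
This is Bayes’ rule with two candidate coins. A short sketch, with dictionary names chosen for the illustration:

```python
prior = {"fair": 0.5, "biased": 0.5}   # P(coin) before seeing any tosses
p_tail = {"fair": 0.5, "biased": 0.3}  # P(tail) on a single toss

# P(TT | coin): two independent tosses, both tails
likelihood = {coin: p_tail[coin] ** 2 for coin in prior}

# P(TT): total probability over both kinds of coin
evidence = sum(prior[coin] * likelihood[coin] for coin in prior)

# P(coin | TT) by Bayes' rule
posterior = {coin: prior[coin] * likelihood[coin] / evidence for coin in prior}

print(round(posterior["biased"], 3))  # 0.265
```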

51
Q

What is the chi-square formula?

A

The Chi square test applies to categorical data. This nonparametric test determines whether the observed counts for the categories differ from the expected counts. Look up the p-value for the Chi square statistic obtained in a statistical table in order to determine if the test reaches significance. Before using the table, calculate the degrees of freedom for the problem. For two independent variables, the degrees of freedom are the number of levels of the first variable minus one, times the number of levels of the second variable minus one. Hence, df = (r - 1) (s - 1), where r is the number of levels in the first variable and s is the number of levels in the second variable.
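
For reference, the usual Pearson chi-square statistic is the sum over all cells of (observed - expected)^2 / expected, where the expected count under independence is (row total × column total) / grand total. The sketch below computes it for a made-up 2×3 table; the counts are assumptions for illustration, and looking up the p-value is left to a table or a statistics library.

```python
# Hypothetical 2x3 table of observed counts
# (rows: levels of the first variable, columns: levels of the second)
observed = [
    [20, 30, 50],
    [30, 30, 40],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (r - 1)(s - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(chi_square, df)
```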

52
Q

TRUE OR FALSE

A two-way table summarizes joint occurrences of values from any two variables.

A

False.

EXPLANATION

Only categorical variables.

53
Q

Two variables are said to be associated if certain combinations of values for the two variables occur more or less frequently than expected under independence.

A

True

54
Q

Under the null hypothesis of independence, the probability of observing a given combination of values for the two variables is 1 divided by the number of possible combinations.

A

False.

EXPLANATION

Under independence, the probability of a combination is the product of the marginal probabilities of the two values.

55
Q

Rejecting the null hypothesis of independence does not mean there is a strong association between X and Y.

A

TRUE.

EXPLANATION

It only means that the association pattern between X and Y shown in the data is unlikely due to chance.

56
Q

If a quantitative variable Y is independent of a categorical variable X, the distribution of Y for individuals with one value of X differs substantially from the distribution of Y for individuals with another value of X.

A

FALSE.

EXPLANATION

The distribution of Y does not change with the value of X when Y is independent of X.

57
Q

Analysis of Variance is used to detect differences in the within-group mean of Y between groups defined by X.

A

TRUE.

58
Q

What is the importance of visualization of data?

A

It’s important to make a visualization of your data so that you can spot misinformation in your data very easily.

59
Q

The magnitude of regression coefficients is a good indicator of their importance.

A

FALSE

EXPLANATION

It depends on the scale of the X variable.
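
A small sketch of the scale effect, using invented height and weight values: the same relationship gives a slope 100 times smaller when height is recorded in centimeters instead of meters.

```python
def slope(xs, ys):
    """Least-squares slope for a simple linear regression of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data: height in meters, weight in kilograms
height_m = [1.5, 1.6, 1.7, 1.8, 1.9]
weight_kg = [50, 58, 66, 74, 82]

# Same heights expressed in centimeters
height_cm = [100 * h for h in height_m]

slope_per_m = slope(height_m, weight_kg)
slope_per_cm = slope(height_cm, weight_kg)

print(slope_per_m, slope_per_cm)  # same fit, very different coefficient sizes
```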

60
Q

The fitted regression line using a sample of data gives imperfect predictions for future observations due to only sampling variability.

A

FALSE.

EXPLANATION

Due to sampling variability and randomness in Y that is not related to X.

61
Q

Extrapolation is dangerous as the form of association between X and Y outside the range where the data were collected may be different from that within the range.

A

TRUE.

62
Q

What are the 3 types of data analytics?

A
  1. Descriptive
  2. Predictive
  3. Prescriptive
63
Q

What is a document term matrix?

A

A document-term matrix describes the counts of words in each document.

Each row is a document, and each column is a term.
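
A minimal sketch of building such a matrix by hand, with two made-up documents and whitespace tokenization (both assumptions for the example):

```python
from collections import Counter

# Hypothetical documents
docs = ["the cat sat", "the dog sat on the cat"]

# Vocabulary: every distinct word, in a fixed (sorted) column order
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per term, entries are word counts
counts = [Counter(doc.split()) for doc in docs]
dtm = [[c[term] for term in vocab] for c in counts]

print(vocab)
for row in dtm:
    print(row)
```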

64
Q

A bank would like to split loans into two groups—those that are likely to default and those that are unlikely—using a set of attributes like user FICO scores, age, and default history. Which type of data analytics is needed?

A

Predictive Analytics

65
Q

The same company would like to use a model to determine a FICO score cutoff for issuing loans. What type of model would they use?

A

Prescriptive Analytics

66
Q

Finally, the bank would like to study its customer base and group customers according to shared attributes. This is an example of which type of analytics?

A

Descriptive Analytics

67
Q

Some methods do not fit neatly into one category or another. Trees give predictions based on a series of binary splits of the attribute space. For example, a probability of default could be predicted through a series of if-then statements: if {FICO ≧ 700} then {prob default = .02}; else if {FICO

A

Descriptive Analytics and Predictive Analytics

68
Q

What does preprocessing of text for data analytics involve?

A
  • stopping
  • stemming

First, stop words are removed. Stop words are simple words, usually conjunctions (and, but, or), prepositions (in, on, to), and other words such as articles (a, an, the), that tend not to carry much information.

Secondly, words are stemmed, or trimmed to their roots.

This is so that you can gather similar terms without having to worry about verb conjugations or noun declensions.
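
The two steps can be sketched as follows. The stop-word list and the suffix-stripping rules are toy assumptions made up for this example; real stemmers (such as the Porter stemmer) use far more careful rules.

```python
# Toy stop-word list (an assumption; real lists are much longer)
STOP_WORDS = {"a", "an", "the", "and", "but", "or", "in", "on", "to"}

# Crude suffix rules, for illustration only
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters of the root."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, split on whitespace, drop stop words, then stem what remains."""
    tokens = text.lower().split()
    return [toy_stem(token) for token in tokens if token not in STOP_WORDS]

tokens = preprocess("The dogs walked and jumped in the park")
print(tokens)
```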

69
Q

Indicate whether each of the statements about a bag of words model is true or false:

The order of words in a document does not matter.

A

TRUE

70
Q

Indicate whether each of the statements about a bag of words model is true or false:

Different documents can produce the same bag of words representation.

A

TRUE

71
Q

Grammar is kept in a bag of words model.

A

FALSE

72
Q

Bag of words models are used because they are more computationally tractable than document representations with word/location pairs.

A

TRUE

73
Q

Descriptive analytics can involve metrics or loss functions.

A

TRUE

74
Q

Changing a metric can give you a completely different estimator or data summary

A

TRUE

75
Q

Human interpretability of a descriptive model is not important.

A

FALSE

76
Q

Multiple layers of descriptive analytics may make results more human interpretable.

A

TRUE

77
Q

There is always a best way to summarize a data set.

A

FALSE

78
Q

What is Exploratory data analysis?

A

Exploratory data analysis refers to the display of data (or, more generally, of any numerical information) in a way that allows us to discover patterns that we did not expect to see.

79
Q

What is visualization of data?

A

Visualization, more generally, refers to the techniques that we use to see data or to see the numerical information.

80
Q

What are the 3 challenges of exploratory data analysis?

A
  1. Displaying the data clearly and concisely
  2. Interpreting the data
  3. Modeling the data.
81
Q

When you think of graphs think in terms of _________

A

COMPARISONS

82
Q

To make a causal statement, we must compare to a

A

control group

83
Q

How to choose a chart to visualize data?

A

Choose the chart that communicates the data most effectively and communicates the story that you’re trying to show to your reader most easily and intuitively, so that it doesn’t take too much time before they’re able to see the point you’re trying to make with your data.

84
Q

Good to remember:

A

A good visualization summarizes information and organizes it in a way that enables the reader to focus on the points that are relevant to the key message being conveyed.

85
Q

What are data Dashboards?

A

Dashboards are a way of organizing data such that we can see multiple charts and how they are all linked together. Typically the dashboard is constructed of several charts in a panel.

And those charts come from one data set, but that data set can change over time.

The key thing about these dashboards is that the multiple visualizations are all organized together, such that they build up a story that one chart by itself is not sufficient to tell us.

86
Q

Why use dashboards?

A
  • Communicating results
  • Exploratory data analysis
87
Q

What should a well-designed data dashboard contain:

A

The most important information on one screen observable at a glance

88
Q

What is a calibrated probability?

A

A probability is called calibrated if it’s empirically correct on average.
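
One way to picture calibration: among all events given a forecast of, say, 70%, roughly 70% should actually happen. The sketch below simulates forecasts that are calibrated by construction and checks the observed frequency at each level; the forecast levels, seed, and counts are assumptions for the example.

```python
import random

random.seed(3)  # fixed seed so the run is reproducible

LEVELS = [0.1, 0.3, 0.7, 0.9]  # hypothetical forecast probabilities

# Each event gets a forecast, then happens with exactly that probability,
# so these forecasts are calibrated by construction.
forecasts = [random.choice(LEVELS) for _ in range(20_000)]
happened = [random.random() < p for p in forecasts]

def observed_frequency(level):
    """Share of events that happened among those forecast at the given level."""
    hits = [h for p, h in zip(forecasts, happened) if p == level]
    return sum(hits) / len(hits)

for level in LEVELS:
    print(level, round(observed_frequency(level), 3))
```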

89
Q

TRUE / FALSE

Exploratory data analysis refers to display of data– or more generally, display of any numerical information– in a way that can allow us to discover patterns that we did not expect to see. Visualization, more generally, refers to the techniques that we use to see data or to see the numerical information.

A

TRUE

90
Q

TRUE/FALSE

Dense display of information allows you to see less data and focus in on more specific information.

A

FALSE

91
Q

In marketing, there are 2 goals for modeling:

A

To understand customers’ behavior and to use that information to predict future outcomes.

92
Q

What is the essence of Bayesian data analysis?

A

The combination of information from different sources.

93
Q

TRUE/FALSE

From the statistician’s point of view, probabilities are measurements. You can measure probability just as you can measure someone’s height or weight or anything else.

A

TRUE
