Data Analysis Iteration Flashcards

Question 1

Q

What are the 5 core activities of data analysis?

Answer

A

Stating and refining the question
Exploring the data
Building formal statistical models
Interpreting the results
Communicating the results

Question 2

Q

What are the 3 core steps for data analysis activities?

Answer

A

Setting expectations
Collecting information (data) and comparing it to your expectations
If the data and your expectations don’t match, revising your expectations or fixing the data so that they do

Question 3

Q

What is Setting Expectations?

Answer

A

Developing expectations is the process of deliberately thinking about what you expect before you do anything, such as inspect your data, perform a procedure, or enter a command. For experienced data analysts, in some circumstances, developing expectations may be an automatic, almost subconscious process, but it’s an important activity to cultivate and be deliberate about.

Example:

You may have also sought out external information to develop your expectations, which could include asking your friends who will be joining you or who have eaten at the restaurant before and/or Googling the restaurant to find general cost information online or a menu with prices. This same process, in which you use any a priori information you have and/or external sources to determine what you expect when you inspect your data or execute an analysis procedure, applies to each core activity of the data analysis process.

Question 4

Q

What is Collecting Information?

Answer

A

This step entails collecting information about your question or your data. For your question, you collect information by performing a literature search or asking experts in order to ensure that your question is a good one. In the next chapter, we will discuss characteristics of a good question. For your data, after you have some expectations about what the result will be when you inspect your data or perform the analysis procedure, you then perform the operation. The results of that operation are the data you need to collect, and then you determine if the data you collected matches your expectations. To extend the restaurant metaphor, when you go to the restaurant, getting the check is collecting the data.

Question 5

Q

Comparing Expectations to Data

Answer

A

Now that you have data in hand (the check at the restaurant), the next step is to compare your expectations to the data. There are two possible outcomes: either your expectation of the cost matches the amount on the check, or it does not. If your expectations and the data match, terrific, you can move on to the next activity. If, on the other hand, your expectation was a cost of 30 dollars, but the check was 40 dollars, your expectations and the data do not match. There are two possible explanations for the discordance: first, your expectations were wrong and need to be revised, or second, the check was wrong and contains an error. You review the check and find that you were charged for two desserts instead of the one that you had, and conclude that there is an error in the data, so you ask for the check to be corrected.

One key indicator of how well your data analysis is going is how easy or difficult it is to match the data you collected to your original expectations. You want to set up your expectations and your data so that matching the two up is easy. In the restaurant example, your expectation was $30 and the data said the meal cost $40, so it’s easy to see that (a) your expectation was off by $10 and that (b) the meal was more expensive than you thought. When you come back to this place, you might bring an extra $10. If your original expectation was that the meal would be between $0 and $1,000, then it’s true that your data fall into that range, but it’s not clear how much more you’ve learned. For example, would you change your behavior the next time you came back? The expectation of a $30 meal is sometimes referred to as a sharp hypothesis because it states something very specific that can be verified with the data.
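The difference between a sharp and a vague expectation can be sketched in a few lines of Python. This is only an illustration of the restaurant example: the function name `compare_expectation` and its return values are assumptions made for this sketch, not something from the source.

```python
def compare_expectation(expected_low, expected_high, observed):
    """Compare an observed value with an expected range.

    Returns (matches, gap, width): whether the observation falls in the
    range, how far outside the range it is (0 if inside), and how wide
    the expected range was.
    """
    matches = expected_low <= observed <= expected_high
    if observed < expected_low:
        gap = observed - expected_low
    elif observed > expected_high:
        gap = observed - expected_high
    else:
        gap = 0
    return matches, gap, expected_high - expected_low


# Sharp expectation: the meal costs $30. The check says $40.
sharp = compare_expectation(30, 30, 40)
# Vague expectation: the meal costs between $0 and $1,000.
vague = compare_expectation(0, 1000, 40)

print(sharp)  # (False, 10, 0)
print(vague)  # (True, 0, 1000)
```

The sharp expectation fails, but the failure is informative: it says the meal cost $10 more than expected. The vague expectation "matches" while telling you almost nothing, because nearly any check would have fallen inside a $1,000-wide range.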

Question 6

Q

Applying the Epicycle of Analysis Process

Answer

A

Example:

Asthma prevalence in the U.S.

Let’s apply the “data analysis epicycle” to a very basic example. Let’s say your initial question is to determine the prevalence of asthma among adults, because your company wants to understand how big the market might be for a new asthma drug. You have a general question that has been identified by your boss, but need to: (1) sharpen the question, (2) explore the data, (3) build a statistical model, (4) interpret the results, and (5) communicate the results. We’ll apply the “epicycle” to each of these five core activities.

For the first activity, refining the question, you would first develop your expectations of the question, then collect information about the question and determine if the information you collect matches your expectations, and if not, you would revise the question. Your expectations are that the answer to this question is unknown and that the question is answerable. A literature and internet search, however, reveals that this question has been answered (and is continually answered by the Centers for Disease Control (CDC)), so you reconsider the question since you can simply go to the CDC website to get recent asthma prevalence data.

You inform your boss and initiate a conversation that reveals that any new drug that was developed would target those whose asthma was not controlled with currently available medication, so you identify a better question, which is “how many people in the United States have asthma that is not currently controlled, and what are the demographic predictors of uncontrolled asthma?” You repeat the process of collecting information to determine if your question is answerable and is a good one, and continue this process until you are satisfied that you have refined your question so that you have a good question that can be answered with available data.

Let’s assume that you have identified a data source that can be downloaded from a website and is a sample that represents the United States adult population, 18 years and older. The next activity is exploratory data analysis, and you start with the expectation that when you inspect your data there will be 10,123 rows (or records), each representing an individual in the US, as this is the information provided in the documentation, or codebook, that comes with the dataset. The codebook also tells you that there will be a variable indicating the age of each individual in the dataset.

When you inspect the data, though, you notice that there are only 4,803 rows, so you return to the codebook to confirm that your expectations about the number of rows are correct. Once you have confirmed this, you return to the website where you downloaded the files and discover that there were two files that contained the data you needed, with one file containing 4,803 records and the second file containing the remaining 5,320 records. You download the second file, read it into your statistical software package, and append it to the first.

Now you have the correct number of rows, so you move on to determine if the age of the population matches your expectation that everyone is 18 years or older. You summarize the age variable so you can view the minimum and maximum values, and find that all individuals are 18 years or older, which matches your expectations. Although there is more that you would do to inspect and explore your data, these two tasks are examples of the approach to take. Ultimately, you will use this data set to estimate the prevalence of uncontrolled asthma among adults in the US.
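The two checks described above (row count and minimum age) can be sketched as follows. The row counts come from the example; everything else is an assumption for illustration: the records are randomly generated stand-ins for the downloaded files, and `fake_file` is a hypothetical helper, not part of any real dataset or package.

```python
import random

EXPECTED_ROWS = 10_123   # from the codebook in the example
MIN_AGE = 18             # the sample is adults, 18 years and older

random.seed(0)  # synthetic data only; stands in for the downloaded files


def fake_file(n_rows):
    """Return n_rows made-up records with an 'age' field (18 to 90)."""
    return [{"age": random.randint(18, 90)} for _ in range(n_rows)]


first_file = fake_file(4_803)
second_file = fake_file(5_320)

# Expectation 1: the number of rows matches the codebook.
print(len(first_file) == EXPECTED_ROWS)   # False: only 4,803 rows
records = first_file + second_file        # append the second file
print(len(records) == EXPECTED_ROWS)      # True

# Expectation 2: everyone is 18 or older.
ages = [r["age"] for r in records]
print(min(ages) >= MIN_AGE)               # True
```

Writing the expectation down as a constant before looking at the data keeps the comparison honest: the check either passes or it tells you something needs to be revised.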

The third activity is building a statistical model, which is needed in order to determine the demographic characteristics that best predict that someone has uncontrolled asthma. Statistical models serve to produce a precise formulation of your question so that you can see exactly how you want to use your data, whether it is to estimate a specific parameter or to make a prediction. Statistical models also provide a formal framework in which you can challenge your findings and test your assumptions.

Now that you have estimated the prevalence of uncontrolled asthma among US adults and determined that age, gender, race, body mass index, smoking status, and income are the best predictors of uncontrolled asthma available, you move to the fourth core activity, which is interpreting the results. In reality, interpreting results happens along with model building as well as after you’ve finished building your model, but conceptually they are distinct activities.

Let’s assume you’ve built your final model and so you are moving on to interpreting the findings of your model. When you examine your final predictive model, initially your expectations are matched as age, African American/black race, body mass index, smoking status, and low income are all positively associated with uncontrolled asthma.

However, you notice that female gender is *inversely* associated with uncontrolled asthma, when your research and discussions with experts indicate that among adults, female gender should be positively associated with uncontrolled asthma. This mismatch between expectations and results leads you to pause and do some exploring to determine if your results are indeed correct and you need to adjust your expectations or if there is a problem with your results rather than your expectations. After some digging, you discover that you had thought that the gender variable was coded 1 for female and 0 for male, but instead the codebook indicates that the gender variable was coded 1 for male and 0 for female. So the interpretation of your results was incorrect, not your expectations. Now that you understand what the coding is for the gender variable, your interpretation of the model results matches your expectations, so you can move on to communicating your findings.
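A small sketch shows why a misread coding flips the apparent direction of an association. The data below are made up, and a simple difference in rates stands in for the model coefficient; the source does not say which model was used.

```python
def rate_difference(coded, outcome):
    """Outcome rate where coded == 1 minus outcome rate where coded == 0."""
    ones = [y for g, y in zip(coded, outcome) if g == 1]
    zeros = [y for g, y in zip(coded, outcome) if g == 0]
    return sum(ones) / len(ones) - sum(zeros) / len(zeros)


# Made-up data: 1 = uncontrolled asthma, 0 = controlled.
is_male = [1, 1, 1, 1, 0, 0, 0, 0]       # codebook: 1 = male, 0 = female
uncontrolled = [0, 0, 0, 1, 1, 1, 0, 1]

# Estimate using the coding the codebook actually specifies.
diff_male = rate_difference(is_male, uncontrolled)

# Same data recoded so that 1 = female.
is_female = [1 - g for g in is_male]
diff_female = rate_difference(is_female, uncontrolled)

print(diff_male)    # -0.5
print(diff_female)  # 0.5
```

The number -0.5 is the same estimate in both readings. If you wrongly believe 1 means female, it looks like an inverse association for women; read with the correct coding, it is an inverse association for men, which is a positive one for women.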

Lastly, you communicate your findings, and yes, the epicycle applies to communication as well. For the purposes of this example, let’s assume you’ve put together an informal report that includes a brief summary of your findings. Your expectation is that your report will communicate the information your boss is interested in knowing. You meet with your boss to review the findings and she asks two questions: (1) how recently the data in the dataset were collected and (2) how changing demographic patterns projected to occur in the next 5-10 years would be expected to affect the prevalence of uncontrolled asthma. Although it may be disappointing that your report does not fully meet your boss’s needs, getting feedback is a critical part of doing a data analysis, and in fact, we would argue that a good data analysis requires communication, feedback, and then actions in response to the feedback.

Although you know the answer about the years when the data were collected, you realize you did not include this information in your report, so you revise the report to include it. You also realize that your boss’s question about the effect of changing demographics on the prevalence of uncontrolled asthma is a good one since your company wants to predict the size of the market in the future, so you now have a new data analysis to tackle. You should also feel good that your data analysis brought additional questions to the forefront, as this is one characteristic of a successful data analysis.

In the next chapters, we will make extensive use of this framework to discuss how each activity in the data analysis process needs to be continuously iterated. While executing the three steps may seem tedious at first, eventually, you will get the hang of it and the cycling of the process will occur naturally and subconsciously. Indeed, we would argue that most of the best data analysts don’t even realize they are doing this!

Question 7

Q

What are the 6 Types of Questions to ask?

Answer

A

Descriptive
Exploratory
Inferential
Predictive
Causal
Mechanistic

Question 8

Q

What is a descriptive question?

Answer

A

A descriptive question is one that seeks to summarize a characteristic of a set of data. Examples include determining the proportion of males, the mean number of servings of fresh fruits and vegetables per day, or the frequency of viral illnesses in a set of data collected from a group of individuals. There is no interpretation of the result itself as the result is a fact, an attribute of the set of data that you are working with.
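As a minimal sketch, two of the descriptive summaries mentioned above can be computed like this. The records and field names are made up for illustration.

```python
# Made-up records for illustration only.
people = [
    {"sex": "male", "servings_per_day": 2},
    {"sex": "female", "servings_per_day": 5},
    {"sex": "female", "servings_per_day": 3},
    {"sex": "male", "servings_per_day": 4},
    {"sex": "female", "servings_per_day": 6},
]

# Proportion of males in this set of data.
proportion_male = sum(p["sex"] == "male" for p in people) / len(people)

# Mean number of servings of fruits and vegetables per day.
mean_servings = sum(p["servings_per_day"] for p in people) / len(people)

print(proportion_male)  # 0.4
print(mean_servings)    # 4.0
```

Both numbers are facts about these five records and nothing more; saying anything about a wider population would be an inferential question.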

Question 9

Q

What is an exploratory question?

Answer

A

An exploratory question is one in which you analyze the data to see if there are patterns, trends, or relationships between variables. These types of analyses are also called “hypothesis-generating” analyses because rather than testing a hypothesis as would be done with an inferential, causal, or mechanistic question, you are looking for patterns that would support proposing a hypothesis. If you had a general thought that diet was linked somehow to viral illnesses, you might explore this idea by examining relationships between a range of dietary factors and viral illnesses. You find in your exploratory analysis that individuals who ate a diet high in certain foods had fewer viral illnesses than those whose diet was not enriched for these foods, so you propose the hypothesis that among adults, eating at least 5 servings a day of fresh fruit and vegetables is associated with fewer viral illnesses per year.

Question 10

Q

What is an inferential question?

Answer

A

An inferential question would be a restatement of this proposed hypothesis as a question and would be answered by analyzing a different set of data, which in this example, is a representative sample of adults in the US. By analyzing this different set of data you are both determining if the association you observed in your exploratory analysis holds in a different sample and whether it holds in a sample that is representative of the adult US population, which would suggest that the association is applicable to all adults in the US. In other words, you will be able to infer what is true, on average, for the adult population in the US from the analysis you perform on the representative sample.

Question 11

Q

What is a predictive question?

Answer

A

A predictive question would be one where you ask what types of people will eat a diet high in fresh fruits and vegetables during the next year. In this type of question you are less interested in what causes someone to eat a certain diet, just what predicts whether someone will eat this certain diet. For example, higher income may be one of the final set of predictors, and you may not know (or even care) why people with higher incomes are more likely to eat a diet high in fresh fruits and vegetables, but what is most important is that income is a factor that predicts this behavior.

Question 12

Q

What is a causal question?

Answer

A

Although an inferential question might tell us that people who eat certain types of foods tend to have fewer viral illnesses, the answer to this question does not tell us if eating these foods causes a reduction in the number of viral illnesses, which would be the case for a causal question. A causal question asks about whether changing one factor will change another factor, on average, in a population. Sometimes the underlying design of the data collection, by default, allows for the question that you ask to be causal. An example of this would be data collected in the context of a randomized trial, in which people were randomly assigned to eat a diet high in fresh fruits and vegetables or one that was low in fresh fruits and vegetables. In other instances, even if your data are not from a randomized trial, you can take an analytic approach designed to answer a causal question.

Question 13

Q

What is a mechanistic question?

Answer

A

None of the questions described so far will lead to an answer that will tell us, if the diet does, indeed, cause a reduction in the number of viral illnesses, how the diet leads to a reduction in the number of viral illnesses. A question that asks how a diet high in fresh fruits and vegetables leads to a reduction in the number of viral illnesses would be a mechanistic question.

Question 14

Q

2 additional things to remember about data analysis questions

Answer

A

First, by necessity, many data analyses answer multiple types of questions.

A second point is that the type of question you ask is determined in part by the data available to you (unless you plan to conduct a study and collect the data needed to do the analysis).

Question 15

Q

What are the 5 key characteristics of a good data science question?

Answer

A

The question should be of interest to your audience
The question has not already been answered
The question should also stem from a plausible framework
The question should also, of course, be answerable
Specificity is also an important characteristic of a good question

Question 16

Q

Exploratory Data Analysis Checklist (9 points):

Answer

A

Formulate your question
Read in your data
Check the packaging
Look at the top and the bottom of your data
Check your “n”s
Validate with at least one external data source
Make a plot
Try the easy solution first
Follow up

Question 17

Q

What is the purpose of a statistical model?

Answer

A

A statistical model serves two key purposes in a data analysis, which are to provide a quantitative summary of your data and to impose a specific structure on the population from which the data were sampled.

The trivial “model” is simply no model at all.

Question 18

Q

What is the first purpose of a model?

Answer

A

The first key element of a statistical model is data reduction.

Question 19

Q

More about a statistical model:

Answer

A

At its core, a statistical model provides a description of how the world works and how the data were generated. The model is essentially an expectation of the relationships between various factors in the real world and in your dataset. What makes a model a statistical model is that it allows for some randomness in generating the data.

Question 20

Q

What is the most common model?

Answer

A

Perhaps the most popular statistical model in the world is the Normal model. This model says that the randomness in a set of data can be explained by the Normal distribution, or a bell-shaped curve. The Normal distribution is fully specified by two parameters—the mean and the standard deviation.
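As a rough sketch, Python's standard library can fit a Normal model to a sample by estimating those two parameters. The measurements below are made up for illustration.

```python
from statistics import NormalDist

# Made-up measurements for illustration only.
data = [4.0, 5.0, 6.0, 5.0, 5.0]

# Estimate the mean and standard deviation from the sample.
model = NormalDist.from_samples(data)

print(model.mean)      # 5.0
print(model.stdev)     # about 0.707 (sample standard deviation)

# Under this model, the probability of a value at or below the mean.
print(model.cdf(5.0))  # 0.5
```

The whole sample is now summarized by two numbers. Whether that summary is adequate depends on how well a bell-shaped curve actually describes the data.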

Question 21

Q

What to do when the model and the data don’t match very well?

Answer

A

Get a different model; or
Get different data

Question 22

Q

What is the most common way to look at linear relationships between variables of interest?

Answer

A

The most common statistical technique to help with this task is linear regression.
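A minimal sketch of simple linear regression (one predictor, ordinary least squares) is shown below. The helper name `fit_line` and the data are assumptions for illustration; in practice you would normally use a statistics package rather than writing this by hand.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up data that lie exactly on the line y = 1 + 2x.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

intercept, slope = fit_line(x, y)
print(intercept, slope)  # 1.0 2.0
```

The slope describes how much the outcome changes, on average, for a one-unit change in the predictor. Real data will not sit exactly on a line, so the fit should be checked against the data before it is interpreted.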

Question 23

Q

When exactly do you stop the process of data analysis?

Answer

A

Are you out of data?
Iterative data analysis will eventually begin to raise questions that simply cannot be answered with the data at hand. Either way, you need to go back out into the world and collect new data. More data analysis is unlikely to bring these answers.
Do you have enough evidence to make a decision?
It’s important to always keep in mind the purpose of the data analysis as you go along because you may over- or under-invest resources in the analysis if the analysis is not attuned to the ultimate goal.
It’s important to realize that the analysis that you perform to get yourself to the point where you can make a decision about something may be very different from the analysis you perform to achieve other goals, such as writing a report, publishing a paper, or putting out a finished product.
Can you place your results in any larger context?
Another way to ask this question is “Do the results make some sort of sense?”
Ultimately, if your analysis leads you to a place where you can definitively answer the question “Do the results make sense?” then regardless of how you answer that question, you likely need to stop your analysis and carefully check every part of it.
Are you out of time?
Ultimately, there will be both a time budget and a monetary budget that determines how many resources can be committed to a given analysis.
In such a situation, it’s useful to know when to stop the data analysis iteration and prepare whatever results you may have obtained to date in order to present a coherent argument for continuation of the analysis.
Are you out of money?

Question 24

Q

What are the key factors affecting the quality of an inference?

Answer

A

The sampling process, the model for the population, and sampling variability.

Obviously, if we cannot coherently define the population, then any “inference” that we make to the population will be similarly vaguely defined.

Question 25

Q

What should you ask before you begin analysis?

Answer

A

In any data analysis, you want to ask yourself “Am I asking an inferential question or a prediction question?” This should be cleared up before any data are analyzed, as the answer to the question can guide the entire modeling strategy.

Question 26

Q

What are some of the principles of interpreting results?

Answer

A

Revisit your original question
Start with the primary statistical model to get your bearings and focus on the nature of the result rather than on a binary assessment of the result (e.g., statistically significant or not). The nature of the result includes three characteristics: its directionality, magnitude, and uncertainty. Uncertainty is an assessment of how likely it is that the result was obtained by chance.
Develop an overall interpretation based on (a) the totality of your analysis and (b) the context of what is already known about the subject matter.
Consider the implications, which will guide you in determining what action(s), if any, should be taken as a result of the answer to your question.
It is important to note that the epicycle of analysis also applies to interpretation. At each of the steps of interpretation, you should have expectations prior to performing the step, and then see if the result of the step matches your expectations.

Question 27

Q

After coming up with primary results from your modeling what are the 3 things to consider:

Answer

A

Directionality
Magnitude
Uncertainty of results
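These three characteristics can be read off a simple comparison of two groups. The sketch below uses made-up outcome values and a normal-approximation 95% interval as one possible way to express uncertainty; the source does not prescribe a method, and with samples this small a t-based interval would be wider.

```python
from math import sqrt
from statistics import fmean, stdev

# Made-up outcome values for two groups; illustration only.
group_a = [3, 4, 5, 4, 4]
group_b = [2, 3, 2, 3, 2]

# Magnitude: the size of the difference in means.
diff = fmean(group_a) - fmean(group_b)

# Directionality: the sign of the difference.
direction = "positive" if diff > 0 else "negative" if diff < 0 else "none"

# Uncertainty: a rough 95% interval from a normal approximation.
se = sqrt(stdev(group_a) ** 2 / len(group_a) + stdev(group_b) ** 2 / len(group_b))
low, high = diff - 1.96 * se, diff + 1.96 * se

print(direction)
print(round(diff, 2), round(low, 2), round(high, 2))
```

Reporting the direction, the size of the difference, and the interval together says more than a bare "significant or not" verdict.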

Question 28

Q

What is the next step after interpreting?

Answer

A

Now that you’ve interpreted your results and have conclusions in hand, you’ll want to think about the implications of your conclusions.

Question 29

Q

What is the main purpose of routine communication?

Answer

A

The main purpose of routine communication is to gather data.

Question 30

Q

What role does communication play in data analysis?

Answer

A

Communication is both one of the tools of data analysis, and also the final product of data analysis: there is no point in doing a data analysis if you’re not going to communicate your process and results to an audience.

Question 31

Q

4 factors to consider while communicating:

Answer

A

Audience: Know your audience, and when you have control over who the audience is, select the right audience for the kind of feedback you are looking for.

Content: Be focused and concise, but provide sufficient information for the audience to understand the information you are presenting and question(s) you are asking.

Style: Avoid jargon. Unless you are communicating about a focused highly technical issue to a highly technical audience, it is best to use language and figures and tables that can be understood by a more general audience.

Attitude: Have an open, collaborative attitude so that you are ready to fully engage in a dialogue and so that your audience gets the message that your goal is not to “defend” your question or work, but rather to get their input so that you can do your best work.
