Exploratory Data Analysis Flashcards
(40 cards)
Exploratory data analysis
Getting a feel for the data: it makes it easier to find mistakes, to guess what actually happened, and to spot outliers.
- Understand and gain insights into the data before selecting analysis techniques.
- Approach data without assumptions, often using visual methods.
We need to get to know the data
- Numeric data distributions (symmetric, normal, skewed, etc.)
- Data quality problems
- Find outliers
- Search for correlations and interrelationships
- Identify subsets of interest
- Suggest functional relationships
We can ask questions
- Descriptive stats: “Who is most profitable?”
- Hypothesis Testing: “Is there a difference between the value of these two customers?”
- Classification: “What are the common characteristics of customers?”
- Prediction: “Will this new customer become profitable?”
- Ultimately we must decide which models and techniques to use, given the problem context, the data, and the underlying assumptions.
Comparison with Hypothesis Testing
- EDA: Open-ended exploration with no or incomplete prior expectations.
- Hypothesis Testing: Tests pre-defined hypotheses.
Systematic Process
- Understand Data Context:
  - Who created the dataset, when, and why?
  - Size, number of fields, and their meanings.
- Initial Exploration:
  - Inspect familiar or interpretable records.
  - Compute summary statistics (e.g., mean, min, max, quartiles, outliers); see the sketch after this list.
- Visualization:
  - Plot variable distributions (e.g., box plots, time series).
  - Examine relationships via scatterplot matrices.
  - Visualize pairwise correlations and group breakdowns (e.g., gender, age).
- Transformations:
  - Transform variables as needed to reveal patterns and outliers.
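A minimal sketch of the initial-exploration step with pandas (the file name and dataset here are hypothetical):

```python
import pandas as pd

# Load the dataset (hypothetical file name).
df = pd.read_csv("customers.csv")

# Inspect a few interpretable records and the field types/meanings.
print(df.head())
print(df.dtypes)

# Summary statistics per numeric column: mean, min, max, quartiles.
print(df.describe())

# Quick data quality check: missing values per field.
print(df.isna().sum())
```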
Descriptive Statistics
Quantitatively describe the main features of the data:
- Measures of central tendency represent a center around which measurements are distributed (mean, median)
- Measures of variability represent the spread of data from the center (standard dev.)
- Measures of relative standing represent the ‘relative position’ of specific measurements in data (quantiles)
The mean
The average of all values. Badly affected by outliers, which makes it a poor measure of central tendency when the data are skewed or contain extreme values.
The median
Middle value when the values are ranked in order; it splits the data into two halves. Also known as the 50th percentile. Unaffected by outliers, making it a more robust measure of central tendency. In skewed data, the mean lies further towards the skew than the median.
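A quick sketch (standard library only) of the last two cards: a single outlier drags the mean but barely moves the median:

```python
from statistics import mean, median

values = [12, 14, 15, 16, 18]
with_outlier = values + [180]  # one extreme value

print(mean(values), median(values))              # 15 15
print(mean(with_outlier), median(with_outlier))  # 42.5 15.5
```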
The mode
Most common data point; there may be more than one mode.
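The multi-modal case in one line, using statistics.multimode from the standard library:

```python
from statistics import multimode

print(multimode([1, 2, 2, 3, 3, 4]))  # [2, 3]: two modes
```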
Variance
The average squared deviation from the mean; measures the spread around the mean. The lower the variance, the more consistent the data.
Standard Deviation
The square root of the variance; spread around the mean in the original units. A high standard deviation means more spread, less consistency, and less clustering around the mean.
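A numpy sketch of both spread measures; ddof=1 switches to the sample versions, which divide by n - 1:

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# Population variance and standard deviation (the mean is 5 here).
print(np.var(data), np.std(data))  # 4.0 2.0

# Sample versions (divide by n - 1 instead of n).
print(np.var(data, ddof=1), np.std(data, ddof=1))
```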
Quartiles
One of the three values (Q1, Q2, Q3) that divide the ranked data into four equal parts. The median is the 2nd quartile (Q2) and divides the data in half.
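A numpy sketch of the quartiles; note that Q2 coincides with the median:

```python
import numpy as np

data = np.array([1, 3, 4, 7, 8, 10, 12, 15, 20])
q1, q2, q3 = np.percentile(data, [25, 50, 75])
print(q1, q2, q3)             # 4.0 8.0 12.0
print(q2 == np.median(data))  # True: the median is Q2
```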
Common Visualizations
- Histograms/Bar Charts
- Box Plots
- Scatterplots
Histograms/Bar Charts
Used to display a frequency distribution: counts of data falling into various ranges. A histogram is used for numeric data, a bar chart for categorical data. Bin size selection is important: if bins are too small they may show false patterns; if too large they may hide important patterns. Several variations are possible: plot relative frequencies instead of raw frequencies, or make bar height equal to relative frequency divided by bin width (a density histogram).
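A matplotlib sketch of how bin count changes the picture (the data are synthetic):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(loc=0, scale=1, size=500)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, bins in zip(axes, [5, 30, 200]):
    # density=True makes bar height relative frequency / bin width.
    ax.hist(data, bins=bins, density=True)
    ax.set_title(f"{bins} bins")
plt.show()
```

With 5 bins the shape is oversmoothed; with 200 bins noise starts to look like structure.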
Box plots
A plot of the five-number summary of the data: minimum, 1st quartile, median, 3rd quartile, maximum. Often used together with a histogram in EDA.
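A minimal sketch pairing a box plot with a histogram of the same (synthetic, skewed) data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=300)  # right-skewed sample

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.boxplot(data, vert=False)  # five-number summary at a glance
ax2.hist(data, bins=30)        # full shape of the distribution
plt.show()
```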
Scatterplots
2D graphs, useful for understanding the relationship between two attributes. The relationship is described by its strength, shape, direction, and the presence of outliers.
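A sketch of a scatterplot for two synthetic attributes, with the correlation coefficient as a numeric check on strength and direction:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=200)
y = 3 * x + rng.normal(scale=4, size=200)  # linear trend plus noise

print(np.corrcoef(x, y)[0, 1])  # close to +1: strong positive relationship

plt.scatter(x, y, alpha=0.5)
plt.xlabel("attribute x")
plt.ylabel("attribute y")
plt.show()
```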
Models Definition & Purpose
- Models encapsulate information into tools for forecasts/predictions.
- Key steps: Building, fitting, and validating.
- “All models are wrong, but some are useful.” — George Box
Philosophies of Models
- Occam’s Razor
- Bias Variance Trade-Off
Occam’s Razor
- Prefer simpler models when equally accurate, as they:
  - Make fewer assumptions, reducing overfitting risk.
  - Avoid memorizing features of the dataset.
- However, simplicity isn’t absolute:
  - Complex models like deep learning can be more predictive despite higher parameter counts.
  - Complexity comes with a trade-off between accuracy and cost.
Bias-Variance Trade-Off
- Bias: Error from overly simple assumptions (e.g., underfitting).
  - Performs poorly on both training and testing data.
- Variance: Error from excessive sensitivity to noise (e.g., overfitting).
  - Performs well on training data but generalizes poorly to new data.
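A scikit-learn sketch of the trade-off on synthetic data: a degree-1 polynomial underfits (high bias, poor everywhere), a degree-15 polynomial overfits (high variance, good on training data only):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(100, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.2, size=100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in [1, 4, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),  # training error
          mean_squared_error(y_te, model.predict(X_te)))  # test error
```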
Principles of Good Models
- Probabilistic Predictions: Assign probabilities to forecasts (e.g., a 50% chance of rain); report a probability distribution rather than a single point estimate
- Feedback Mechanism: Models should update dynamically and show how predictions evolve over time
- Consensus: Build multiple models with distinct methods for the same prediction
- Bayesian Reasoning: Update probabilities with new events. Requires prior probabilities from domain knowledge
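A sketch of a single Bayesian update; the prior and likelihoods are made-up numbers standing in for domain knowledge:

```python
# Prior belief that a customer is profitable (assumed from domain knowledge).
prior = 0.30

# Likelihood of the observed event under each hypothesis (assumed values).
p_event_given_profitable = 0.80
p_event_given_not = 0.20

# Bayes' rule: posterior = likelihood * prior / evidence.
evidence = (p_event_given_profitable * prior
            + p_event_given_not * (1 - prior))
posterior = p_event_given_profitable * prior / evidence
print(round(posterior, 3))  # 0.632: belief revised upward by the event
```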
Baseline Models Purpose
- Assess model effectiveness by comparison to simple, reasonable benchmarks.
- Only when models decisively outperform baselines can they be deemed effective.
Classification Baselines
- Random selection of labels (no prior distribution).
- Most common label in the training data.
- Best single-feature model.
- Compare against an existing, well-known model.
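A sketch of the first two classification baselines with scikit-learn's DummyClassifier (the dataset is chosen only for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# "uniform" = random labels, "most_frequent" = majority label.
for strategy in ["uniform", "most_frequent"]:
    baseline = DummyClassifier(strategy=strategy, random_state=0)
    baseline.fit(X_tr, y_tr)
    print(strategy, baseline.score(X_te, y_te))  # accuracy to beat
```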
Prediction Baselines
- Mean or median value of the target.
- Linear regression for linear relationships.
- Previous value (useful in time-series forecasting).
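A sketch of the mean baseline and the previous-value baseline on a synthetic time series:

```python
import numpy as np

# Synthetic series with a drift, split into train and test.
rng = np.random.default_rng(4)
y = np.cumsum(rng.normal(loc=0.5, scale=1.0, size=200))
train, test = y[:150], y[150:]

# Baseline 1: always predict the training mean.
mean_pred = np.full_like(test, train.mean())

# Baseline 2: predict the previous observed value (naive forecast).
prev_pred = np.concatenate(([train[-1]], test[:-1]))

for name, pred in [("mean", mean_pred), ("previous value", prev_pred)]:
    print(name, np.mean((test - pred) ** 2))  # mean squared error
```

On a drifting series the previous-value baseline is usually far harder to beat than the mean.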