Midterm Flashcards

1
Q

VARIABLES

A

VARIABLE ASPECTS OF REALITY

(In statistical research, a variable is defined as an attribute of an object of study.)

2
Q
VARIABLES CONSIST OF
A

VALUES

3
Q

σ

A

Sigma

4
Q

Sigma represents

A

population standard deviation

5
Q

Population standard deviation formula

A

σ = √( Σ(xi – µ)² / N )
6
Q

µ means

A

the population mean

7
Q

VARIABLE ASPECTS OF REALITY ARE CALLED

A

VARIABLES

8
Q

VARIABLES CONSIST OF

A

VALUES

9
Q

VALUES ARE TAKEN ON BY

A

OBSERVATIONS (SUBJECTS)

10
Q

VALUES ARE TAKEN ON BY OBSERVATIONS (SUBJECTS) IN

A

TIME AND IN SPACE

11
Q

WE MAY WANT TO DO TWO THINGS WITH VALUES OF OBSERVATIONS:

A
  1. WE MAY WANT TO KNOW IF THERE IS A PATTERN IN A LIMITED NUMBER OF VALUES AVAILABLE TO US “HERE AND NOW” (IS THERE A PATTERN OF SCORING BY A BASKETBALL TEAM OVER A SEASON?)
    • THIS GOAL CAN BE ACCOMPLISHED WITH A SET OF STATISTICAL PROCEDURES, CALLED DESCRIPTIVE STATISTICS.
  2. WE MAY ALSO WANT TO KNOW IF A PATTERN OBSERVED IN A LIMITED NUMBER OF OBSERVATIONS IS LIKELY TO HOLD WITH OTHER OBSERVATIONS UNDER SIMILAR CONDITIONS. (ARE OTHER TEAMS IN THE LEAGUE LIKELY TO DISPLAY A SIMILAR SCORING PATTERN OVER A SEASON AS THE TEAM WE HAVE OBSERVED?)
    • THIS GOAL CAN BE ACCOMPLISHED WITH A SET OF STATISTICAL PROCEDURES, KNOWN AS INFERENTIAL STATISTICS.
12
Q

DESCRIPTIVE STATISTICS

A

A means of describing the features of a data set by generating summaries of the data in a sample. The result is typically a summary that shows what the data contain.

13
Q

INFERENTIAL STATISTICS

A

Procedures that use statistics computed from samples to infer whether patterns seen in those samples, such as a difference between groups, are likely to hold in the populations the samples came from.

14
Q

OBSERVATIONS THAT WE OBSERVE “HERE AND NOW” MAKE UP A

A

SAMPLE

15
Q

TO DESCRIBE A SAMPLE WE USE

A

SAMPLE STATISTICS

16
Q

SAMPLE STATISTICS ARE REFERRED TO BY

A

LATIN LETTERS

17
Q

A SET OF ALL RELEVANT OBSERVATIONS FROM WHICH YOUR SAMPLE WAS TAKEN IS CALLED A

A

POPULATION

18
Q

TO DESCRIBE A POPULATION, WE USE

A

POPULATION PARAMETERS

19
Q

POPULATION PARAMETERS ARE REFERRED TO BY

A

GREEK LETTERS

20
Q

THE PROBLEM WITH A POPULATION IS THAT IT’S DIFFICULT TO OBSERVE. THEREFORE, WE USUALLY OBSERVE PATTERNS IN SAMPLES AND DECIDE IF THESE PATTERNS ARE LIKELY TO

A

HOLD IN POPULATIONS

21
Q

SAMPLES MUST BE

A

REPRESENTATIVE

22
Q

SAMPLES MUST BE REPRESENTATIVE:

A

THEY MUST REFLECT GENERAL COMPOSITION OF POPULATION

23
Q

SAMPLES MUST BE SELECTED VIA

A

RANDOM SAMPLING

24
Q

SAMPLES MUST BE SELECTED VIA RANDOM SAMPLING:

A

WHERE EACH OBSERVATION IN A POPULATION HAS IDENTICAL PROBABILITY OF BEING SELECTED INTO A SAMPLE.

25
SAMPLING WITH/ WITHOUT REPLACEMENT
SAY WE HAVE 5 RED BALLS, 5 WHITE ONES, AND SELECT A SAMPLE OF 2 BALLS. PROBABILITY OF A 2ND BALL BEING RED DEPENDS ON THE COLOR OF THE 1ST BALL PICKED INTO THE SAMPLE. THIS VIOLATES EQUAL PROBABILITY PRINCIPLE FOR THE SECOND BALL. TO AVOID THE VIOLATION WE REPLACE THE 1ST BALL BEFORE PICKING THE 2ND ONE. REPLACEMENT IS NOT NECESSARY WITH LARGE POPULATIONS.
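The ball example can be checked with a few lines of Python. This is only an illustrative sketch: the function name `p_second_red` and its parameters are made up for this card, and the 5000 / 5000 population at the end is an assumed example of a "large" population.

```python
from fractions import Fraction

def p_second_red(red, white, first_is_red, replace):
    """Probability that the 2nd ball is red, given the colour of the 1st ball."""
    if replace:
        # The 1st ball goes back, so the urn is unchanged.
        return Fraction(red, red + white)
    # Without replacement the urn has one ball fewer.
    red_left = red - 1 if first_is_red else red
    return Fraction(red_left, red + white - 1)

# 5 red and 5 white balls, as in the card above.
print(p_second_red(5, 5, first_is_red=True, replace=False))   # 4/9
print(p_second_red(5, 5, first_is_red=False, replace=False))  # 5/9
print(p_second_red(5, 5, first_is_red=True, replace=True))    # 1/2

# Assumed large population: the effect of not replacing almost disappears.
print(float(p_second_red(5000, 5000, first_is_red=True, replace=False)))
```

Without replacement the probability depends on the first pick (4/9 versus 5/9); with replacement it stays at 1/2.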
26
A PERFECT FIT BETWEEN A SAMPLE AND A POPULATION DOES NOT EXIST. THERE’S ALWAYS A
SAMPLING ERROR
27
A SAMPLING ERROR IS
THE DIFFERENCE BETWEEN A POPULATION PARAMETER AND A SAMPLE STATISTIC
28
TWO TYPES OF SAMPLING ERRORS:
• THE RELATIVELY “HARMLESS” SAMPLING ERROR IS UNBIASED
• THE “HARMFUL” SAMPLING ERROR IS BIASED
29
THE RELATIVELY “HARMLESS” SAMPLING ERROR IS UNBIASED:
OVER MULTIPLE SAMPLES, SOME SAMPLE STATISTICS WILL BE GREATER AND SOME SMALLER THAN THE POPULATION PARAMETER.
30
THE “HARMFUL” SAMPLING ERROR IS BIASED:
OVER MULTIPLE SAMPLES, SAMPLE STATISTICS WILL FALL SYSTEMATICALLY ON ONE SIDE OF THE POPULATION PARAMETER: EITHER GREATER OR SMALLER.
31
THE BIAS OF A SAMPLING ERROR CAN BE DETECTED BY
INVESTIGATING THE SAMPLING PROCEDURE (BECAUSE FULL POPULATIONS AND THEIR PARAMETERS ARE USUALLY UNOBSERVABLE). FOR A SAMPLING ERROR TO BE UNBIASED, THE SAMPLING PROCEDURE MUST ENSURE EQUAL PROBABILITY OF SELECTION FOR EACH OBSERVATION IN THE POPULATION.
32
TYPES OF VARIABLES
BY NATURE: DISCRETE VARIABLES AND CONTINUOUS VARIABLES
33
BY NATURE, VARIABLES CAN BE
DISCRETE OR CONTINUOUS
34
DISCRETE VARIABLES HAVE MEASUREMENT UNITS THAT ARE
CLEARLY DEFINED, WITH NO INTERIM VALUES FALLING BETWEEN TWO NEIGHBORING UNITS
35
DISCRETE VALUES ARE OFTEN USED TO
DENOTE QUALITIES (FEMALE / MALE CODED AS 1 / 2)
36
DISCRETE VARIABLES USUALLY HAVE
RELATIVELY FEW VALUES (TYPES OF A MEDAL: GOLD, SILVER, BRONZE), BUT SOME CAN HAVE A LARGER NUMBER OF VALUES (THE NUMBER OF ONE-CENT COINS IN YOUR POCKET).
37
CONTINUOUS VARIABLES DO NOT HAVE A
CLEARLY DEFINED SMALLEST UNIT (TIME, TEMPERATURE, ETC.). IN PRINCIPLE, AN INFINITE NUMBER OF VALUES CAN FALL BETWEEN ANY TWO GIVEN OBSERVATIONS.
38
CONTINUOUS VARIABLES NEVER HAVE
EXACTLY THE SAME VALUE FOR ANY TWO OBSERVATIONS. NO TWO PEOPLE ARE EXACTLY 170 CM TALL. WE ONLY HAVE SAME-SOUNDING VALUES BECAUSE OUR MEASUREMENT DEVICES CANNOT PICK UP FINER SUB-UNITS.
39
TO GIVE A VALUE OF A CONTINUOUS VARIABLE PRECISELY, YOU SHOULD
INDICATE ITS UPPER AND LOWER REAL LIMITS AT A DESIRED INTERVAL. LET’S SAY THAT A DESIRED INTERVAL IS 1 CM. A PERSON WITH A HEIGHT OF 170 CM IS THEN SAID TO BE BETWEEN LRL = 169.5 CM & URL = 170.5 CM
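A minimal sketch of the real-limits rule, assuming the limits sit half an interval below and above the reported value (as in the 169.5 / 170.5 example). The helper name `real_limits` is invented for illustration.

```python
def real_limits(value, interval=1.0):
    """Return (LRL, URL) for a reported value at the given measurement interval."""
    half = interval / 2
    return value - half, value + half

# A height reported as 170 cm with a 1 cm interval.
print(real_limits(170))  # (169.5, 170.5)
```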
40
BY DEFINITION 170.5 IS THE URL OF AN INTERVAL
“170” (AND, AT THE SAME TIME, THE LRL OF THE INTERVAL “171”)
41
THE SAME VARIABLE CAN BE MEASURED WITH DIFFERENT DEGREES OF PRECISION USING DISTINCT
MEASUREMENT SCALES
42
NOMINAL SCALE (NAMES) VALUES SIMPLY PERFORM
THE FUNCTION OF NAMES. NO MATHEMATICAL OPERATIONS CAN BE PERFORMED ON A NOMINAL SCALE. (NUMBERS ON BASKETBALL JERSEYS, RANDOMLY ASSIGNED TO PLAYERS)
43
ORDINAL SCALE (RANKINGS). VALUES CAN BE USED TO
RANK OBSERVATIONS IN ORDER OF MAGNITUDE. (NUMBERS ON BASKETBALL JERSEYS, ASSIGNED ACCORDING TO HEIGHT). NO MATHEMATICAL OPERATIONS, EXCEPT FOR RANKING, CAN BE CONDUCTED ON THIS SCALE.
44
NOMINAL AND ORDINAL SCALES CAN BE USED TO
MEASURE BOTH DISCRETE AND CONTINUOUS VARIABLES (WHILE BLOOD PRESSURE IS A CONTINUOUS VARIABLE, ITS MEASURE: SYSTOLIC OR DIASTOLIC IS A NOMINAL SCALE MEASURE, WHILE ANOTHER MEASURE: LOW, MEDIUM AND HIGH PRESSURE IS AN ORDINAL MEASURE).
45
STILL YOU SHOULD AVOID MEASURING CONTINUOUS VARIABLES ON NOMINAL OR ORDINAL SCALE, BECAUSE
THIS WAY YOU LOSE PRECISION THAT CAN BE OBTAINED WITH MORE SOPHISTICATED SCALES.
46
INTERVAL SCALE. ENABLES NOT ONLY RANKING, BUT
ALSO MEASURING MEANINGFUL DIFFERENCES BETWEEN VALUES OF A VARIABLE.
47
INTERVAL SCALE MEASUREMENTS DO NOT HAVE
AN ABSOLUTE ZERO (SOMETIMES KNOWN AS A TRUE ZERO), AT WHICH THE VARIABLE CEASES TO EXIST.
48
ALL MATHEMATICAL OPERATIONS CAN BE DONE WITH
VARIABLES MEASURED ON AN INTERVAL SCALE, EXCEPT FOR TAKING A RATIO (CANNOT DIVIDE). CONSIDER WAKING UP AT 4 AM WHILE YOU NORMALLY WAKE UP AT 8 AM. DOES THAT MEAN THAT YOU WOKE UP TWICE AS EARLY? NO (BECAUSE TIME DID NOT START AT MIDNIGHT).
49
RATIO SCALE. APPLIES TO
CONTINUOUS VARIABLES THAT HAVE AN ABSOLUTE ZERO. ALL MATHEMATICAL OPERATIONS POSSIBLE.
50
INTERVAL AND RATIO SCALES USUALLY MEASURE
CONTINUOUS VARIABLES. YOU SHOULD USE THESE TWO SCALES TO MEASURE CONTINUOUS VARIABLES, INSTEAD OF NOMINAL OR ORDINAL SCALES, BECAUSE THEY PRESERVE MORE INFORMATION.
51
VARIABLES ARE USUALLY REFERRED TO WITH
LATIN UPPER-CASE LETTERS (X, Y, Q…)
52
VALUES ARE USUALLY REFERRED TO WITH
LATIN LOWER-CASE LETTERS WITH SUBSCRIPTS (x1, x2, x3 … xn).
53
THE NUMBER OF OBSERVATIONS IN A POPULATION IS MARKED WITH
UPPER CASE N
54
A SUM OF VALUES OF A PARTICULAR VARIABLE IS DENOTED BY
UPPER CASE GREEK LETTER SIGMA: Σ. SIGMA MUST ALWAYS BE FOLLOWED BY WHATEVER IS BEING ADDED. LET'S SAY WE HAVE A SAMPLE OF n = 4 VALUES: 4, 3, 6, 7.
• Σ(X) = 20
• Σ(X – 1)² = 9 + 4 + 25 + 36 = 74
• (ΣX)² = 20² = 400
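The three sums on this card differ only in the order of operations. A short Python sketch (the variable names are illustrative) makes that order explicit:

```python
x = [4, 3, 6, 7]  # the sample from the card above, n = 4

sum_x = sum(x)                                 # Σ(X): just add the values
sum_sq_shifted = sum((v - 1) ** 2 for v in x)  # Σ(X – 1)²: subtract, square, then add
sq_of_sum = sum(x) ** 2                        # (ΣX)²: add first, then square

print(sum_x, sum_sq_shifted, sq_of_sum)  # 20 74 400
```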
55
FREQUENCY DISTRIBUTIONS
THE FIRST TOOL FOR DESCRIPTIVE STATISTICS
56
FD SHOW
WHICH VALUES IN A VARIABLE OCCUR FREQUENTLY, AND WHICH ARE RARE
57
USUALLY, FD ARE
GRAPHIC REPRESENTATIONS OF DATA, BUT THEY BEGIN WITH A FREQUENCY TABLE.
58
FREQUENCY TABLES LIST VALUES OF A VARIABLE IN THE
LEFTMOST COLUMN. ALL POSSIBLE VALUES MUST BE LISTED.
59
AN ADJACENT COLUMN CONTAINS
FREQUENCIES (f) OF EACH VALUE: NUMBERS OF OBSERVATIONS IN A SAMPLE THAT HAVE A PARTICULAR VALUE
60
A FREQUENCY TABLE MAY CONTAIN
RELATIVE FREQUENCIES (rf, %): SHARES OF OBSERVATIONS (FROM THE TOTAL n) THAT HAVE A PARTICULAR VALUE.
61
A FREQUENCY TABLE MAY CONTAIN CUMULATIVE FREQUENCIES (cf):
NUMBERS OF OBSERVATIONS THAT HAVE VALUES THAT ARE EQUAL TO OR LOWER THAN A GIVEN VALUE.
62
A FREQUENCY TABLE MAY CONTAIN CUMULATIVE RELATIVE FREQUENCIES (crf, c%):
SHARES OF OBSERVATIONS THAT HAVE VALUES THAT ARE EQUAL TO OR LOWER THAN THE VALUE.
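The four columns (f, rf, cf, crf) can be built in one pass over the sorted values. The sketch below uses made-up quiz scores and a hypothetical helper `frequency_table`. For brevity it lists only values that actually occur, whereas a full frequency table should also list possible values with f = 0.

```python
from collections import Counter

def frequency_table(values):
    """Return rows (value, f, rf, cf, crf), sorted by value."""
    counts = Counter(values)
    n = len(values)
    rows, cf = [], 0
    for value in sorted(counts):
        f = counts[value]   # frequency of this value
        cf += f             # running total: observations at or below this value
        rows.append((value, f, f / n, cf, cf / n))
    return rows

scores = [2, 3, 3, 4, 4, 4, 5, 5, 5, 5]  # hypothetical quiz scores, n = 10
for row in frequency_table(scores):
    print(row)
```

The last column, read as a percentage, is the percentile rank of each value: a score of 4 here has crf = 0.6, i.e. a percentile rank of 60.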
63
CUMULATIVE RELATIVE FREQUENCY IS USEFUL FOR
SHOWING A RELATIVE STANDING OF AN OBSERVATION WITH A PARTICULAR VALUE VIS-À-VIS OTHER OBSERVATIONS.
64
THE CONCEPT OF PERCENTILE RANK.
PERCENTILE RANK SHOWS RELATIVE STANDING OF AN OBSERVATION’S VALUE AMONG OTHER VALUES.
65
PERCENTILE RANK SHOWS
THE PERCENT OF OBSERVATIONS WITH VALUES EQUAL TO OR LOWER THAN A GIVEN VALUE. A STUDENT EARNING A GRADE WITH A PERCENTILE RANK OF 70 HAS DONE AS WELL AS OR BETTER THAN 70% OF STUDENTS.
66
FOR CONTINUOUS VARIABLES, OR DISCRETE ONES WITH MANY POSSIBLE VALUES, THE CONTENT OF THE LEFT COLUMN IN F.T. HAS TO BE
CLUSTERED INTO GROUPS (INTERVALS) OF EQUAL WIDTH, WITH APPROXIMATELY 8 – 10 SUCH GROUPS.
67
INTERPOLATION:
MAKING AN EDUCATED GUESS ABOUT THE LIKELY CRF OF A VALUE IN THE MIDDLE OF AN INTERVAL.
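One common way to make that educated guess is linear interpolation between the real limits of the interval. This is a sketch under that assumption; the function name and the numbers (an interval with real limits 19.5 and 29.5, cumulative percentages 40 and 70) are hypothetical.

```python
def interpolate_crf(x, lower_limit, upper_limit, crf_lower, crf_upper):
    """Linearly interpolate the cumulative relative frequency (in %) at x."""
    # How far into the interval x lies, as a fraction between 0 and 1.
    fraction = (x - lower_limit) / (upper_limit - lower_limit)
    return crf_lower + fraction * (crf_upper - crf_lower)

# x = 24.5 sits halfway through the interval, so its c% is halfway between 40 and 70.
print(interpolate_crf(24.5, 19.5, 29.5, 40, 70))  # 55.0
```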
68
WHAT IF YOU HAVE A CONTINUOUS VARIABLE?
• USE A POLYGON
• USE A HISTOGRAM (SHOWN FOR A SEPARATE SET OF VALUES)
69
FOR STARTERS: STEM AND LEAF DIAGRAM
AN ALTERNATIVE WAY OF VISUALIZING F.D. OF CONTINUOUS VARIABLES (JOHN TUKEY).
70
Just understand this table:
71
FREQUENCY DISTRIBUTIONS CAN HAVE A GREAT VARIETY OF SHAPES. LET'S LEARN SOME WORDS TO DESCRIBE THEM.
72
THE KEY POINT OF DEPARTURE, TALKING ABOUT SHAPES IS THE CONCEPT OF
SYMMETRY
73
ONE DEPARTURE FROM SYMMETRY IS
SKEWNESS
74
SKEWNESS
A SKEW EXISTS, WHEN F.D. HAS A “TAIL” IN ONE DIRECTION OR ANOTHER FROM THE CENTER.
75
A TAIL CONSISTS OF INFREQUENT VALUES ON THE SIDE OF A F.D., ALSO KNOWN AS
OUTLIERS
76
A TAIL STRETCHING IN THE DIRECTION OF POSITIVE NUMBERS SHOWS A
POSITIVE SKEW
77
A TAIL STRETCHING IN THE DIRECTION OF NEGATIVE NUMBERS SHOWS A
NEGATIVE SKEW
78
SKEWNESS CAN BE MEASURED:
WHEN SKEWNESS STATISTIC IS 0, WE HAVE PERFECT SYMMETRY. WHEN SKEWNESS IS > 0, WE HAVE A POSITIVE SKEW; WHEN SKEWNESS IS > 2, WE HAVE AN EXTREME POSITIVE SKEW. ANALOGOUS INTERPRETATION FOR THE NEGATIVE SKEW.
79
ANOTHER ASPECT OF A SHAPE OF F.D. IS
KURTOSIS
80
KURTOSIS:
THE RELATIVE HEAVINESS (THICKNESS) OF THE TAILS. THICK TAILS SHOW THAT EXTREME VALUES ARE RATHER FREQUENT RELATIVE TO THE MORE COMMON “CENTRAL” VALUES.
81
MEASUREMENT OF KURTOSIS:
IF KURTOSIS STATISTIC = 3, WE HAVE TAILS THAT ARE NOT TOO THICK, AND NOT TOO THIN. KURTOSIS > 3 SHOWS A DISTRIBUTION WITH FAT TAILS. KURTOSIS < 3 MEANS A DISTRIBUTION WITH THIN TAILS.
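The cards do not give formulas for the skewness and kurtosis statistics. The sketch below assumes the common moment-based definitions (third and fourth central moments divided by powers of the second); statistical packages may apply additional corrections. Both data sets are made up.

```python
def central_moment(values, k):
    """k-th central moment: average of (value - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)

def skewness(values):
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    return central_moment(values, 4) / central_moment(values, 2) ** 2

symmetric = [1, 2, 3, 4, 5]        # no tail on either side
right_tail = [1, 1, 2, 2, 3, 20]   # one extreme value on the positive side

print(skewness(symmetric))                         # 0.0
print(skewness(right_tail) > 0)                    # True: positive skew
print(kurtosis(right_tail) > kurtosis(symmetric))  # True: heavier tail
```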
82
CENTRAL TENDENCY REPRESENTS THE VALUE (USUALLY A SINGLE VALUE) THAT IS
MOST COMMON IN A FREQUENCY DISTRIBUTION. CLEARLY C.T. IS NOT ALWAYS AT THE CENTER OF SAMPLE VALUES. THEREFORE WE HAVE SEVERAL ALTERNATIVE MEASURES OF C.T.
83
THE MEAN AKA AVERAGE (MARKED µ FOR A POPULATION, M FOR A SAMPLE) –
THE MOST COMMON, AND, WHEN POSSIBLE, PREFERRED MEASURE OF C.T., BECAUSE IT TAKES INTO CONSIDERATION VALUES OF EACH OBSERVATION IN A F.D.
84
POPULATION AND SAMPLE MEANS ARE GIVEN BY THE FOLLOWING FORMULAS:
µ = ΣX / N (POPULATION), M = ΣX / n (SAMPLE)
85
MEAN IS
THE PREFERRED MEASURE OF C.T. FOR ANY BELL-SHAPED (I.E. MORE OR LESS SYMMETRIC) F.D. WITH SKEWED DISTRIBUTIONS, THE MEAN IS AN UNRELIABLE REPRESENTATION OF CENTRAL TENDENCY, BECAUSE IT TENDS TO “MOVE” IN THE DIRECTION OF EXTREME VALUES IN A TAIL. LIKEWISE, IN A DISTRIBUTION WITH SEVERAL “PEAKS” THE MEAN DOES NOT CONVEY INFORMATION ABOUT THE MOST COMMON VALUES.
86
BASIC FEATURES OF A MEAN:
• IF YOU CHANGE ONE VALUE IN A SAMPLE, THE MEAN CHANGES.
• IF YOU ADD / REMOVE AN OBSERVATION TO / FROM A SAMPLE, THE MEAN CHANGES, UNLESS THAT OBSERVATION HAS THE VALUE OF THE MEAN.
• IF YOU MULTIPLY / DIVIDE ALL VALUES IN A SAMPLE BY A CONSTANT, THE MEAN WILL ALSO BE MULTIPLIED / DIVIDED BY THAT CONSTANT.
• IF YOU ADD / SUBTRACT A CONSTANT TO / FROM ALL VALUES IN A SAMPLE, YOU ADD / SUBTRACT THAT SAME CONSTANT TO / FROM THE MEAN.
87
THE MEDIAN (MD):
IS A VALUE FROM A SAMPLE OR A POPULATION DIVIDING ALL OBSERVATIONS INTO TWO EQUAL HALVES.
88
MEDIAN IS USEFUL UNDER FOLLOWING CONDITIONS:
• WITH SKEWED FREQUENCY DISTRIBUTIONS.
• WHEN A SAMPLE HAS VALUES THAT ARE INCOMPLETE (“DID NOT FINISH” IN A RACE).
• WHEN YOU HAVE OPEN-ENDED DISTRIBUTIONS (“FIVE OR MORE” AS AN ANSWER TO THE QUESTION “HOW MANY PIZZAS DO YOU EAT IN A WEEK?”).
89
THE MODE:
THE MOST COMMON VALUE IN YOUR SAMPLE / POPULATION.
90
A MODE IS USEFUL WITH:
• MULTIMODAL DISTRIBUTIONS (THE ONES WITH SEVERAL “PEAKS”)
• NOMINAL / ORDINAL VALUES
• WHEN YOU WANT TO USE A WHOLE NUMBER, AND NOT A FRACTION, TO SHOW C.T.
91
IN A SYMMETRIC UNIMODAL F.D. MEAN, MEDIAN AND MODE COINCIDE. IN A SKEWED F.D. MEAN MOVES
TOWARDS OUTLIERS, WHILE MEDIAN AND MODE STAY CLOSER TO COMMON VALUES. IN A MULTIMODAL F.D. MEAN AND MEDIAN TEND TOWARDS THE MIDDLE VALUES OF ALL OBSERVATIONS, WHILE MODES SHOW THE MOST FREQUENT ONES.
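A small positively skewed sample shows the three measures pulling apart. The data are invented for illustration; the functions come from Python's standard `statistics` module.

```python
import statistics

# Hypothetical values with one extreme observation in the right tail.
values = [20, 22, 22, 25, 27, 30, 200]

mean = statistics.mean(values)      # pulled towards the outlier (about 49.4)
median = statistics.median(values)  # middle value of the sorted sample
mode = statistics.mode(values)      # most frequent value

print(median, mode)   # 25 22
print(mean > median)  # True: the mean has moved towards the tail
```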
92
VARIABILITY:
ARE OBSERVATIONS WITHIN A F.D. SIMILAR TO ONE ANOTHER OR NOT?
93
VARIABILITY: ARE OBSERVATIONS WITHIN A F.D. SIMILAR TO ONE ANOTHER OR NOT?
1. THE ANSWER TO THIS QUESTION IS CENTRAL TO ISSUES OF INFERENTIAL STATISTICS. (THE LOWER THE VARIABILITY IN A POPULATION, THE MORE REPRESENTATIVE A RANDOM SAMPLE FROM THAT POPULATION WILL BE, AND THE MORE LIKELY PATTERNS IN A SAMPLE ARE TO HOLD IN THE POPULATION.)
2. A BASIC MEASURE OF VARIABILITY IS THE RANGE: xMAX – xMIN. SADLY, IT GIVES NO INFORMATION ABOUT VARIABILITY “INSIDE” THE RANGE. ARE VALUES DISTRIBUTED EVENLY BETWEEN xMAX AND xMIN, OR ARE THEY CLUSTERED SOMEWHERE IN THE CENTER?
3. STANDARD DEVIATION AND VARIANCE.
94
STANDARD DEVIATION
IS AN AVERAGE DISTANCE OF ALL OBSERVATIONS FROM THE MEAN.
95
A DEVIATION SCORE IS
A DISTANCE FROM AN INDIVIDUAL OBSERVATION TO THE MEAN.
96
DIVIDING SS (THE SUM OF SQUARED DEVIATION SCORES) BY THE NUMBER OF OBSERVATIONS IN A POPULATION GIVES A POPULATION
VARIANCE (σ² FOR A POPULATION, s² FOR A SAMPLE)
97
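VARIANCE
σ² = SS / N FOR A POPULATION; s² = SS / (n – 1) FOR A SAMPLE, WHERE SS IS THE SUM OF SQUARED DEVIATION SCORES.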
98
WHY MUST WE DIVIDE BY n – 1 FOR SAMPLE VARIANCE?
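BECAUSE SAMPLE VARIABILITY TENDS TO BE SMALLER THAN POPULATION VARIABILITY: DEVIATIONS ARE MEASURED FROM THE SAMPLE MEAN M, WHICH LIES CLOSER TO THE SAMPLE VALUES THAN µ DOES. DIVIDING SS BY n WOULD THEREFORE GIVE A BIASED (TOO SMALL) ESTIMATE OF POPULATION VARIANCE. TO CORRECT THIS, WE DECREASE THE DENOMINATOR TO n – 1, WHICH INCREASES THE VARIANCE.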
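The effect of the n – 1 correction can be seen by listing every possible sample from a tiny population. This is a sketch with a made-up population of three values and samples of n = 2 drawn with replacement; exact fractions avoid rounding noise.

```python
from fractions import Fraction
from itertools import product

population = [1, 3, 5]  # hypothetical tiny population

def sum_of_squares(values):
    """SS: sum of squared deviations from the mean of the given values."""
    mean = Fraction(sum(values), len(values))
    return sum((v - mean) ** 2 for v in values)

sigma_sq = sum_of_squares(population) / len(population)  # σ² = SS / N

# Every possible sample of n = 2, drawn with replacement (9 samples).
samples = list(product(population, repeat=2))
avg_div_n = sum(sum_of_squares(s) / 2 for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_of_squares(s) / 1 for s in samples) / len(samples)

print(sigma_sq, avg_div_n, avg_div_n_minus_1)  # 8/3 4/3 8/3
```

Averaged over all samples, SS / n comes out at half of σ² here, while SS / (n – 1) matches σ² exactly.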
99
MATHEMATICAL PROPERTIES OF STANDARD DEVIATION:
- ADDING / SUBTRACTING A CONSTANT TO / FROM EACH VALUE DOES NOT AFFECT STD.
- WHEN EACH VALUE IS MULTIPLIED / DIVIDED BY A (POSITIVE) CONSTANT, STD IS ALSO MULTIPLIED / DIVIDED BY THAT CONSTANT.
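Both properties are easy to verify numerically. The sample below is invented; `statistics.stdev` is the standard-library sample standard deviation (n – 1 in the denominator).

```python
import statistics

x = [2, 4, 4, 6, 9]  # hypothetical sample
s = statistics.stdev(x)

shifted = statistics.stdev([v + 10 for v in x])  # add a constant to each value
scaled = statistics.stdev([v * 3 for v in x])    # multiply each value by a constant

print(abs(shifted - s) < 1e-12)     # True: STD unchanged by adding 10
print(abs(scaled - 3 * s) < 1e-12)  # True: STD multiplied by 3
```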
100
THE CONCEPT OF DEGREES OF FREEDOM
THIS n – 1 IN THE DENOMINATOR OF SAMPLE VARIANCE AND STD CAN BE INTERPRETED AS DEGREES OF FREEDOM. D.F. SHOW HOW MANY OBSERVATIONS IN A SAMPLE ARE FREE TO VARY (INDEPENDENT OF OTHER VALUES AND STATISTICS).
101
Look at these rocks (this table)
102
WE CAN SAY THAT D.F. SHOW THE
EXTENT (THE SIZE) OF THE PROBLEM OF NON-INDEPENDENT SAMPLING. THE LARGER YOUR n, THE SMALLER THE PROBLEM (THE GREATER YOUR D.F.).
103
ALTERNATIVELY, WE CAN THINK OF DEGREES OF FREEDOM AS AN ANSWER TO THE QUESTION: HOW FREE ARE YOU TO ESTIMATE THE VARIABILITY OF A POPULATION WELL?
WITH n = 1 YOU’RE NOT FREE AT ALL. WITH n = 2 YOU HAVE THE MINIMAL AMOUNT OF FREEDOM. WITH A LARGE N YOU ARE MORE FREE (MORE CONFIDENT) TO OBTAIN A GOOD MEASURE OF VARIABILITY IN POPULATION.
104
IN MOST SYMMETRIC UNIMODAL F.D.S APPROXIMATELY
70% OF ALL OBSERVATIONS FALL WITHIN + / - ONE STD AROUND THE MEAN, AND APPROXIMATELY 95% OF ALL OBSERVATIONS FALL WITHIN + / - TWO STD AROUND THE MEAN. OBSERVATIONS THAT ARE REMOVED FROM THE MEAN BY MORE THAN TWO STD ARE CONSIDERED OUTLIERS.
105
IN MOST SYMMETRIC UNIMODAL F.D.S APPROXIMATELY 70% OF ALL OBSERVATIONS FALL WITHIN + / - ONE STD AROUND THE MEAN, AND APPROXIMATELY 95% OF ALL OBSERVATIONS FALL WITHIN + / - TWO STD AROUND THE MEAN. OBSERVATIONS THAT ARE REMOVED FROM THE MEAN BY MORE THAN TWO STD ARE CONSIDERED OUTLIERS.
106
IF YOUR EXAM GRADE, COMPARED TO YOUR CLASSMATES IS ONE STD ABOVE AVERAGE, THEN YOU DID BETTER THAN
50% + (70% / 2) = 85% OF YOUR FRIENDS.
107
SEVERAL WAYS TO DETERMINE LOCATION OF AN OBSERVATION IN A F.D.
a. INTERPOLATION
b. TUKEY’S STEM AND LEAF DIAGRAM
c. STANDARD DEVIATION
108
AN EXAMPLE, USING STD:
SUPPOSE A THREE-SEASON SCORING AVERAGE FOR A BASKETBALL TEAM IS 85 POINTS WITH s = 13. HOW DOES A SCORE OF 72 POINTS COMPARE TO OTHER SCORES BY THE TEAM? ASSUME THAT SCORES ARE DISTRIBUTED IN A SYMMETRIC BELL-SHAPED FORM.
1. FIND OUT BY HOW MANY STANDARD DEVIATIONS THE SCORE DIFFERS FROM THE TEAM AVERAGE: (72 – 85) / 13 = –1.
2. LOCATE –1 STANDARD DEVIATION ON THE BELL-SHAPED F.D.
3. CALCULATE THE SHARE OF OBSERVATIONS THAT LIE MORE THAN 1 STANDARD DEVIATION BELOW THE MEAN: 50% – (70% / 2) = 15% OF ALL SCORES FALL BETWEEN THE MINIMUM AND 72 POINTS.
IN OTHER WORDS, BY SCORING 72 POINTS THE TEAM PLAYED A GAME THAT IS BETTER THAN 15% AND WORSE THAN 85% OF ITS GAMES.
109
A DISTANCE BETWEEN VALUE 72 AND THE MEAN IN TERMS OF STANDARD DEVIATIONS. SUCH A DISTANCE IS CALLED A
z SCORE
110
WE CAN CALCULATE z SCORES FOR ALL VALUES IN A POPULATION USING THIS FORMULA:
z = (X – µ) / σ
111
IF WE z-STANDARDIZED AN ENTIRE POPULATION, IT WOULD HAVE FOLLOWING FEATURES:
a. THE SAME SHAPE AS THE ORIGINAL UNSTANDARDIZED POPULATION.
b. A MEAN EQUAL TO ZERO.
c. A STD EQUAL TO ONE.
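Features b and c can be checked by standardizing a small population with z = (X – µ) / σ. The population below is made up so that µ = 5 and σ = 2; the helper name `z_scores` is illustrative.

```python
import statistics

def z_scores(values):
    """Standardize every value: z = (X - mu) / sigma, using population formulas."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

population = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical population, mu = 5, sigma = 2
z = z_scores(population)

print(z)                                       # each value's distance from the mean in STD units
print(abs(statistics.fmean(z)) < 1e-12)        # True: mean of z scores is 0
print(abs(statistics.pstdev(z) - 1) < 1e-12)   # True: STD of z scores is 1

# The basketball card above: a score of 72 with mean 85 and s = 13.
print((72 - 85) / 13)  # -1.0
```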
112
PROBABILITY OF SAMPLING (I.E. OBTAINING) AN OBSERVATION WITH A PARTICULAR VALUE FROM A POPULATION IS EQUAL TO
THE SHARE OF OBSERVATIONS WITH THAT VALUE IN THE POPULATION.
113
Probability of a value=
(number of observations with that value)÷ (total number of observations)
114
CALCULATE PROBABILITIES OF RANDOMLY SELECTING OBSERVATIONS FROM A CERTAIN REGION OF A F.D.
115
A NORMAL F.D. IS A SYMMETRIC UNIMODAL F.D. WITH CERTAIN SHARES OF OBSERVATIONS IN ITS REGIONS.
116
CONNECTING NORMAL F.D. WITH PROBABILITY. SUPPOSE THAT WINTER TEMPERATURES ARE NORMALLY DISTRIBUTED. ASSUME AN AVERAGE WINTER TEMPERATURE µ = –3C AND σ = 8C. GIVEN THIS INFORMATION, WHAT IS A PROBABILITY OF OBSERVING A WINTER DAY COLDER THAN MINUS 19C?
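The card leaves the answer open. As a sketch, Python's standard-library `statistics.NormalDist` can stand in for the unit normal table (the variable names are illustrative):

```python
from statistics import NormalDist

winter = NormalDist(mu=-3, sigma=8)  # temperatures from the card above

z = (-19 - winter.mean) / winter.stdev  # (-19 - (-3)) / 8
p_colder = winter.cdf(-19)              # share of days colder than -19C

print(z, round(p_colder, 4))  # -2.0 0.0228
```

So a day colder than –19C lies two standard deviations below the mean and is expected on roughly 2% of winter days, consistent with the "95% within two STD" rule on the earlier card.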
117
UNIT NORMAL TABLE
a. THE LEFT COLUMN OF U.N.T. CONTAINS THE WHOLE NUMBER AND THE FIRST DECIMAL OF A z SCORE.
b. THE TOP ROW OF U.N.T. CONTAINS THE SECOND DECIMAL OF A z SCORE.
c. A CELL AT THE INTERSECTION OF A COLUMN AND A ROW GIVES THE PROBABILITY OF SELECTING AN OBSERVATION FROM A SHADED AREA UNDER A NORMAL F.D. (SAME AS THE SHARE OF OBSERVATIONS IN THAT SHADED AREA).
d. U.N.T. EXPRESSES PROBABILITIES (SHARES) AS PROPORTIONS, NOT PERCENTAGES.
118
LET'S SAY THAT WE HAVE A NORMAL F.D. WITH µ = 10 AND σ = 2. WHAT IS THE PROBABILITY OF OBTAINING AN OBSERVATION THAT IS GREATER THAN 7?
119
ASSUME THE SAME NORMAL F.D. AS ABOVE. NOW YOU WANT TO KNOW PROBABILITY OF RANDOMLY SELECTING AN OBSERVATION WITH VALUE THAT FALLS BETWEEN 8 AND 13.
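Both questions (greater than 7, and between 8 and 13) reduce to areas under the same normal curve. A sketch using `statistics.NormalDist` in place of the unit normal table:

```python
from statistics import NormalDist

dist = NormalDist(mu=10, sigma=2)

# P(X > 7): z = (7 - 10) / 2 = -1.5, so take everything above that point.
p_greater_than_7 = 1 - dist.cdf(7)

# P(8 < X < 13): z runs from -1.0 to +1.5, so subtract the two cumulative areas.
p_between_8_and_13 = dist.cdf(13) - dist.cdf(8)

print(round(p_greater_than_7, 4))    # 0.9332
print(round(p_between_8_and_13, 4))  # 0.7745
```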
120
CONSIDER THE FOLLOWING EXERCISE:
A. OBTAIN A LARGE NUMBER OF SAMPLES FROM A POPULATION (SAME n).
B. FOR EACH SAMPLE, CALCULATE M.
C. ARRANGE THESE MEANS INTO A F.D. FROM MMIN TO MMAX.
D. WHAT SHAPE, µ AND σ WILL THE F.D. OF THIS NEW VARIABLE HAVE?
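The exercise can be simulated. The sketch below assumes a made-up, deliberately non-normal population (uniform between 0 and 100), n = 25, and 5,000 samples; all of these numbers are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 values spread evenly between 0 and 100.
population = [random.uniform(0, 100) for _ in range(10_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

n = 25
# Steps A and B: draw many samples of the same size and record each sample mean M.
sample_means = [statistics.fmean(random.choices(population, k=n)) for _ in range(5_000)]

# Steps C and D: compare the distribution of means with the population.
print(round(mu, 1), round(statistics.fmean(sample_means), 1))                 # centres nearly equal
print(round(sigma / n ** 0.5, 1), round(statistics.pstdev(sample_means), 1))  # spread near sigma / sqrt(n)
```

In runs like this the mean of the sample means lands close to µ, and their standard deviation lands close to σ / √n (the standard error), even though the population itself is not bell-shaped.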
121
HYPOTHESIS TESTS ARE A PART OF
SCIENTIFIC METHODOLOGY
122
HYPOTHESIS IS A STATEMENT IN THE FORM:
X (INDEPENDENT VARIABLE) CAUSES Y (DEPENDENT VARIABLE).
123
HYPOTHESIS TEST VERIFIES IF
THIS PROPOSED RELATIONSHIP IS LIKELY TO HOLD IN REALITY
124
INFERENTIAL STATISTICS IS A KEY COMPONENT OF A HYPOTHESIS TEST AS IT ALLOWS US TO
DETERMINE IF PATTERNS OF DEPENDENT VARIABLE (Y) AFTER EXPOSURE TO AN INDEPENDENT VARIABLE (X) IN A LIMITED AMOUNT OF DATA ARE LIKELY TO BE SEEN IN A LARGE POPULATION.
125
HYPOTHESIS TESTS CAN FOLLOW MANY DIFFERENT RESEARCH STRATEGIES OR DESIGNS. WE BEGIN STUDY OF H.T. WITH THE SIMPLEST POSSIBLE DESIGN:
ONE SAMPLE HYPOTHESIS TEST
126
ONE SAMPLE HYPOTHESIS TEST
A. SUPPOSE WE WANT TO KNOW IF EXPOSURE TO SOME VALUE OF VARIABLE X CHANGES VALUES OF VARIABLE Y.
B. SAY X IS AN EXPERIMENTAL MEDICINE TO REDUCE BLOOD PRESSURE, AND Y IS BLOOD PRESSURE.
C. WE KNOW THE MEAN AND THE STD FOR A POPULATION OF VARIABLE Y.
D. WE SELECT A SAMPLE FROM THAT POPULATION AND EXPOSE IT TO VARIABLE X (MEDICATION).
• NOTE THAT AFTER EXPOSURE TO X, THE SAMPLE NO LONGER REPRESENTS THE ORIGINAL POPULATION FROM WHICH IT CAME, BUT RATHER IT REPRESENTS A NEW POPULATION THAT WOULD EXIST IF THE ENTIRE ORIGINAL POPULATION WERE EXPOSED TO X.
E. WE OBTAIN A MEAN FOR THE SAMPLE, MY-NEW. SUPPOSE THAT MY-NEW < µY-OLD. HERE MY-NEW IS THE OBSERVED SAMPLE MEAN REPRESENTING THE IMAGINARY POPULATION WHERE EVERYONE HAS TAKEN THE MEDICINE, AND µY-OLD IS THE EXPECTED VALUE OF A SAMPLING DISTRIBUTION OF MEANS THAT COULD BE TAKEN FROM THE ORIGINAL POPULATION UNEXPOSED TO X (MEDICATION).
F. ULTIMATELY WE WANT TO KNOW IF THE DIFFERENCE BETWEEN MY-NEW AND µY-OLD IS SUFFICIENTLY LARGE THAT WE CAN CONCLUDE THAT MEDICATION (AND NOT A SAMPLING ERROR) WAS BEHIND THE REDUCTION IN BLOOD PRESSURE IN THE SAMPLE.
127
A STATISTICAL HYPOTHESIS TEST PROCEEDS THROUGH
FIVE STEPS
128
A STATISTICAL HYPOTHESIS TEST PROCEEDS THROUGH FIVE STEPS:
A. STATE NULL AND ALTERNATIVE HYPOTHESES
B. CHOOSE A CRITICAL LEVEL (AKA ALPHA LEVEL)
C. OBTAIN CRITICAL VALUES OF z
D. CALCULATE TEST VALUE OF z
E. COMPARE
129
A. STATE NULL AND ALTERNATIVE HYPOTHESES.
• A NULL HYPOTHESIS STATES THAT THERE IS NO RELATIONSHIP BETWEEN X AND Y AS PROPOSED BY THE RESEARCHER.
• STATISTICALLY, THIS MEANS THAT µY-NEW IS EQUAL TO µY-OLD.
• H0 CAN BE DIRECTIONAL (ONE-TAILED) OR NON-DIRECTIONAL (TWO-TAILED):
- NON-DIRECTIONAL H0: µY-NEW = µY-OLD
- DIRECTIONAL H0: µY-NEW ≥ µY-OLD OR µY-NEW ≤ µY-OLD
- THERE MUST ALWAYS BE AN “EQUAL” SIGN IN A NULL HYPOTHESIS.
• AN ALTERNATIVE HYPOTHESIS STATES THAT THERE IN FACT IS A RELATIONSHIP BETWEEN X AND Y AS PROPOSED BY THE RESEARCHER.
• STATISTICALLY, THIS MEANS THAT µY-NEW IS NOT EQUAL TO µY-OLD.
• LIKE H0, HA CAN BE DIRECTIONAL (ONE-TAILED) OR NON-DIRECTIONAL (TWO-TAILED):
- NON-DIRECTIONAL HA: µY-NEW ≠ µY-OLD
- DIRECTIONAL HA: µY-NEW < µY-OLD OR µY-NEW > µY-OLD
- THERE IS NEVER AN “EQUAL” SIGN IN AN ALTERNATIVE HYPOTHESIS.
130
H0, HA CAN BE
DIRECTIONAL (ONE-TAILED) OR NON-DIRECTIONAL (TWO-TAILED)
131
B. CHOOSE A CRITICAL LEVEL (AKA ALPHA LEVEL)
THIS α LEVEL DETERMINES HOW FAR MY-NEW HAS TO BE REMOVED FROM µY-OLD TO COUNT AS EVIDENCE AGAINST H0 AND IN FAVOR OF HA.
• IN THE CHART BELOW, SHADED AREAS REPRESENT THE ALPHA LEVEL.
• SPECIFICALLY, α IS THE SHARE OF SAMPLE MEANS THAT ARE SO FAR REMOVED FROM THE EXPECTED VALUE (µY-OLD) AS TO BE CONSIDERED EVIDENCE FOR REJECTING H0.
• A SAMPLE MEAN REPRESENTED BY THE ORANGE LINE IN THE CHART BELOW WOULD COUNT AS EVIDENCE THAT µY-NEW ≠ µY-OLD, WHILE A SAMPLE MEAN REPRESENTED BY THE BLUE LINE WOULD INDICATE THAT µY-NEW = µY-OLD.
• TYPICALLY ALPHA LEVELS ARE 0.01, 0.05 OR (IN SOME SOCIAL SCIENCES) 0.1.
• NOTE THAT AN ALPHA LEVEL NEEDS TO BE SPLIT IN HALF (AND “PLACED” ON EACH TAIL) FOR A NON-DIRECTIONAL H.T., WHILE THE ENTIRE CRITICAL AREA MUST BE CONCENTRATED IN THE TAIL GIVEN BY THE ALTERNATIVE HYPOTHESIS FOR A DIRECTIONAL H.T.
132
C. OBTAIN CRITICAL VALUES OF z
THESE ARE z SCORES MARKING THE BOUNDARIES BETWEEN THE CRITICAL AREA(S) IN THE TAIL(S) AND THE REMAINING BODY OF THE SAMPLING DISTRIBUTION.
133
D. CALCULATE TEST VALUE OF z
ASSOCIATED WITH SAMPLE MY-NEW
134
E. COMPARE
TEST z AGAINST CRITICAL z. IF |TEST z| > |CRITICAL z|, REJECT H0
135
A SAMPLE PROBLEM.
A HISTORIC, AVERAGE GRADE IN A COURSE, EARNED BY LOCAL STUDENTS IS µ = 7, WITH σ = 2. WE SELECT A RANDOM SAMPLE OF n = 9 ERASMUS STUDENTS, AND FIND OUT THAT THEIR M = 6. DOES THIS MEAN, THAT ERASMUS STUDENTS EARN ON AVERAGE DIFFERENT GRADES THAN LOCAL STUDENTS?
136
A HISTORIC, AVERAGE GRADE IN A COURSE, EARNED BY LOCAL STUDENTS IS µ = 7, WITH σ = 2. WE SELECT A RANDOM SAMPLE OF n = 9 ERASMUS STUDENTS, AND FIND OUT THAT THEIR M = 6. DOES THIS MEAN, THAT ERASMUS STUDENTS EARN ON AVERAGE DIFFERENT GRADES THAN LOCAL STUDENTS?
137
ASKING IF ERASMUS STUDENTS EARN LOWER GRADES THAN LOCAL STUDENTS?
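The Erasmus problem can be walked through the five steps in a few lines. The cards do not state an alpha level, so α = 0.05 below is an assumption; `NormalDist().inv_cdf` plays the role of the unit normal table.

```python
from statistics import NormalDist

mu, sigma, n, m = 7, 2, 9, 6   # values from the sample problem above
alpha = 0.05                   # ASSUMED: the cards do not give an alpha level

standard_error = sigma / n ** 0.5    # sigma / sqrt(n) = 2 / 3
z_test = (m - mu) / standard_error   # test value of z

std_normal = NormalDist()
z_crit_two_tailed = std_normal.inv_cdf(1 - alpha / 2)  # about 1.96, alpha split over both tails
z_crit_one_tailed = std_normal.inv_cdf(alpha)          # about -1.645, whole alpha in the lower tail

print(round(z_test, 2))                 # -1.5
print(abs(z_test) > z_crit_two_tailed)  # False: "different grades?" -> do not reject H0
print(z_test < z_crit_one_tailed)       # False: "lower grades?" -> do not reject H0
```

Under this assumed α, the test z of –1.5 falls short of the critical value in both the two-tailed and the one-tailed version, so H0 is not rejected in either case.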
138
A HYPOTHESIS TEST CAN COMMIT TWO TYPES OF ERRORS
TYPE ONE ERROR (T1E): REJECTING A CORRECT H0 (STATING AN EFFECT WHEN THERE IS NONE).
• A RATHER DANGEROUS ERROR (PRESCRIBING MEDICINE WHEN IT DOESN’T WORK).
• T1E IS DIRECTLY RELATED TO THE SIZE OF YOUR ALPHA LEVEL. THE LARGER THE CRITICAL AREA, THE MORE LIKELY OUR SAMPLE M IS TO “JUMP” INTO IT, EVEN IF ANY DIFFERENCE BETWEEN M AND µ IS DUE ONLY TO SAMPLING ERROR.
• FOR THIS REASON, CHOOSE THE SMALLER OF THE TWO ELIGIBLE CRITICAL AREAS IN THE UNIT NORMAL TABLE.
TYPE TWO ERROR (T2E): FAILING TO REJECT A WRONG H0.
• T2E IS RELATED TO:
- A SMALL EFFECT SIZE OF X ON Y.
- A SMALL n.
• THERE IS NOT MUCH A RESEARCHER CAN DO ABOUT T2E, EXCEPT FOR INCREASING n.
139
WHAT DETERMINES THE LIKELIHOOD OF REJECTING H0?
A. EFFECT SIZE: THE MORE MY-NEW DIFFERS FROM µY-OLD, THE MORE LIKELY MY-NEW IS TO FALL INTO THE CRITICAL AREA.
B. SAMPLE SIZE: THE LARGER n, THE SMALLER THE STANDARD ERROR, THE LARGER THE TEST z SCORE, AND THE MORE LIKELY THAT z SCORE IS TO FALL INTO THE CRITICAL AREA.
C. ALPHA LEVEL (NOT TO BE INCREASED FOR THE PURPOSE OF REJECTING H0).
140
Mean (μ) =
Σxi / N
141
Σ is
the summation (addition) sign
142
xi is
each individual number
143
N is
the population size
144
A sampling distribution is a
probability distribution of a statistic obtained from a large number of samples drawn from a specific population.
145
MEAN OF A SAMPLING DISTRIBUTION IS CALLED
THE EXPECTED VALUE
146
AN ADJACENT COLUMN CONTAINS FREQUENCIES (f) OF EACH VALUE:
NUMBERS OF OBSERVATIONS IN A SAMPLE THAT HAVE A PARTICULAR VALUE
147
FREQUENCY TABLE MAY CONTAIN RELATIVE FREQUENCIES (rf, %):
SHARES OF OBSERVATIONS (FROM THE TOTAL n) THAT HAVE A PARTICULAR VALUE.
148
A FREQUENCY TABLE MAY CONTAIN CUMULATIVE FREQUENCIES (cf):
NUMBERS OF OBSERVATIONS THAT HAVE VALUES THAT ARE EQUAL TO OR LOWER THAN A GIVEN VALUE.
149
A FREQUENCY TABLE MAY CONTAIN CUMULATIVE RELATIVE FREQUENCIES (crf, c%):
SHARES OF OBSERVATIONS THAT HAVE VALUES THAT ARE EQUAL TO OR LOWER THAN THE VALUE.
150
STANDARD DEVIATION IS
AN AVERAGE DISTANCE OF ALL OBSERVATIONS FROM THE MEAN.
151
s^2=
SS / (n – 1)
152
IF YOU CHANGE ONE VALUE IN A SAMPLE,
MEAN CHANGES.
153
IF YOU ADD / REMOVE AN OBSERVATION TO / FROM A SAMPLE, MEAN
CHANGES, UNLESS THAT OBSERVATION HAS THE VALUE OF THE MEAN.
154
IF YOU MULTIPLY/DIVIDE ALL VALUES IN A SAMPLE BY A CONSTANT, THE MEAN WILL
ALSO BE MULTIPLIED/DIVIDED BY THAT CONSTANT.
155
IF YOU ADD / SUBTRACT A CONSTANT TO ALL VALUES IN A SAMPLE,
YOU ADD / SUBTRACT THAT SAME CONSTANT TO THE MEAN.
156
What is Mean?
The mean is the average of a collection of numbers: the sum of the values divided by how many values there are. (The most common value is the mode, not the mean.)
157
WITH SKEWED FREQUENCY DISTRIBUTIONS. A MEAN SHIFTS
QUITE STRONGLY IN THE DIRECTION OF OUTLIERS.
158
MEDIAN CAN BE AN AVERAGE OF TWO VALUES WHEN A SAMPLE OR A POPULATION HAS
AN EVEN NUMBER OF OBSERVATIONS.
159
GOOD EXAMPLES OF INTERVAL-SCALE VARIABLES ARE
CLOCK TIME AND TEMPERATURE (IN CELSIUS OR FAHRENHEIT).
160
VARIANCE (σ² FOR A POPULATION, s² FOR A SAMPLE):
AN AVERAGE SQUARED DISTANCE OF ALL OBSERVATIONS FROM THE MEAN .
161
Mean and median are close to each other, because
distribution is symmetric.
162
CENTRAL TENDENCY REPRESENTS THE VALUE (USUALLY A SINGLE VALUE) THAT IS
MOST COMMON IN A FREQUENCY DISTRIBUTION.
163
with symmetric distributions mean is a preferred measure of
Central tendency
164
MEAN IS THE PREFERRED MEASURE OF C.T. FOR
ANY BELL-SHAPED F.D.
165
WITH SKEWED FREQUENCY DISTRIBUTIONS. A MEAN SHIFTS QUITE STRONGLY IN THE DIRECTION OF
OUTLIERS
166
IF YOU CHANGE ONE VALUE IN A SAMPLE, MEAN
CHANGES
167
IF YOU ADD / REMOVE AN OBSERVATION TO / FROM A SAMPLE, MEAN
CHANGES, UNLESS THAT OBSERVATION HAS THE VALUE OF THE MEAN.
168
IF YOU MULTIPLY/DIVIDE ALL VALUES IN A SAMPLE BY A CONSTANT, THE MEAN WILL
ALSO BE MULTIPLIED/DIVIDED BY THAT CONSTANT.
169
IF YOU ADD / SUBTRACT A CONSTANT TO ALL VALUES IN A SAMPLE, YOU ADD / SUBTRACT THAT SAME CONSTANT TO THE
MEAN
170
IN A SYMMETRIC UNIMODAL F.D. MEAN, MEDIAN, AND MODE
COINCIDE
171
IN A SKEWED F.D. MEAN MOVES TOWARDS OUTLIERS, WHILE MEDIAN AND MODE
STAY CLOSER TO COMMON VALUES.
172
IN A MULTIMODAL F.D. MEAN AND MEDIAN TEND TOWARDS THE MIDDLE VALUES OF ALL OBSERVATIONS, WHILE MODES SHOW
THE MOST FREQUENT ONES.
173
The population would have greater variability than a sample, because of
biased sampling error
174
df =
n – 1, where n is the sample size
175
DF means
degrees of freedom
176
What does df = n – 1 mean?
This means that n – 1 observations in a sample could have taken on any value, while one observation would be predetermined by the values of the others and of the mean.
177
OBSERVATIONS THAT ARE REMOVED FROM THE MEAN BY MORE THAN TWO STD ARE
CONSIDERED OUTLIERS.
178
When it decreases or increases, less / more ->
one tail
179
If simply different, change ->
two tail
180
181
ESTIMATED STANDARD ERROR:
sM = s / √n