DSE1101 Flashcards

(104 cards)

1
Q

What is a variable?

A

A characteristic observed in a study.

2
Q

When is a variable categorical?

A

When each observation belongs to one of a set of categories.

3
Q

When is a variable quantitative?

A

When observations take on numerical values that represent different magnitudes.

4
Q

What is another name for the independent variable?

A

Explanatory variable

5
Q

What is another name for the dependent variable?

A

Response variable

6
Q

What is the mean?

A

The average; it is one way to measure the center of a distribution.

7
Q

What is the sample mean?

A

The sample mean is a sample statistic and serves as a point estimate of the population mean.

8
Q

What kind of variable does a histogram show?

A

The distribution of a continuous variable.

9
Q

What is modality?

A

The number of peaks in the distribution of the data. Only prominent peaks in the general pattern count; data with one peak are called unimodal.

10
Q

What is unimodal?

A

1 peak

11
Q

What is data with 2 peaks called?

A

bimodal

12
Q

What is data with more than 2 peaks called?

A

multimodal data

13
Q

What is it called when all bins have about the same height (no distinct peak)?

A

uniform data

14
Q

Where is the peak on negatively skewed data?

A

“Long tail on left
Peak on right”

15
Q

Give an example of negatively skewed data

A

“GPA
Age of death”

16
Q

Where is the peak on positively skewed data?

A

“Longer tail on right
Peak on left”

17
Q

If a question asks whether data is left or right skewed, do we remove outliers first?

A

Yes

18
Q

If the data show some people spending $1000 at a supermarket, is it an error?

A

No; set them aside to be analysed separately.

19
Q

Why use median over mean?

A

More robust to outliers

20
Q

What is the downside of using the median?

A

The median requires more computing power than the mean because the data must be sorted first.

The mean is easier to compute: no sorting is needed.

21
Q

If a question asks whether data is left or right skewed, do we remove outliers first?

A

YES

22
Q

If a distribution is skewed or has some extreme values, which measure of center should we use?

A

median

23
Q

If a distribution is left skewed, where is the median in relation to the mean?

A

The mean is typically smaller than the median.

The median is usually closer to the PEAK.

24
Q

What is variance?

A

the average squared deviation from the sample mean.

25
What is the formula for variance?
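The answer to this card is missing from the deck. Assuming the usual sample variance, which matches the "average squared deviation from the sample mean" definition in card 24 and divides by n - 1 rather than n, it can be written as:

```latex
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2
```

Here y_i are the observations, \bar{y} is the sample mean and n is the sample size; the standard deviation is s, the square root of this quantity.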
26
Why do we use the square rather than the absolute value for variance?
Squaring needs less computational power and still gets rid of negative values.
27
What is the interquartile range?
The range from Q1 to Q3, i.e. IQR = Q3 - Q1.
28
How far do the whiskers of a box plot extend?
Up to 1.5 x IQR beyond the lower and upper quartiles.
29
What is Tukey's rule?
Outliers are values more than 1.5 times the IQR from the quartiles: either below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR.
30
Where are outliers?
More than 1.5 times the IQR from the quartiles: either below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR.
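As a rough illustration of Tukey's rule, the sketch below computes the two fences in R. The spending values are made up for this example (they are not from the deck), and `quantile()` uses R's default quartile definition, which may differ slightly from a hand calculation.

```r
# Hypothetical supermarket spending data; 1000 is the extreme value
spend <- c(20, 35, 40, 42, 50, 55, 60, 65, 80, 1000)

q1  <- unname(quantile(spend, 0.25))  # lower quartile
q3  <- unname(quantile(spend, 0.75))  # upper quartile
iqr <- q3 - q1                        # same value as IQR(spend)

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers <- spend[spend < lower_fence | spend > upper_fence]
print(outliers)
```

Only the value 1000 falls outside the fences here, which is why the median and IQR describe this sample better than the mean and variance.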
31
What are the robust statistics for center and spread?
Median (center) and IQR (spread)
32
What can we do with extremely skewed data?
Apply a natural log transformation.
33
The horizontal axis of a histogram is ____
discrete (the variable is divided into bins)
34
What is denoted by omega (Ω)?
sample space
35
What does a probability model describe?
the uncertainty of a random process.
36
What is an outcome?
Outcomes are the mutually exclusive and collectively exhaustive results of a random process.
37
What is an event?
A collection of one or more outcomes. It is a subset of the sample space.
38
What is a probability distribution?
lists all possible outcomes and the probabilities with which each of them occurs.
39
What is a cumulative probability distribution?
The probability that a random variable is less than or equal to a particular value, e.g. P(X <= 2).
40
What are disjoint outcomes?
Outcomes that cannot happen at the same time.
41
What does it mean for 2 events A and B to be independent?
The occurrence of B provides no information about A, i.e. P(A|B) = P(A).
42
What is P(A ∩ B)?
P (B) × P (A|B)
43
In 2013, SurveyUSA interviewed a random sample of 500 residents in North Carolina, asking them whether they think widespread gun ownership protects law-abiding citizens from crime or makes society more dangerous. 58% of all respondents said it protects citizens. 67% of White respondents, 28% of Black respondents, and 64% of Hispanic respondents shared the same view. Based on the probabilities above, opinion on gun ownership and race/ethnicity are most likely: complementary, disjoint, independent, or dependent?
Dependent (need to calculate using the given that…)
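One way to see this, using only the percentages quoted in the question, is to compare each conditional probability with the overall one. If opinion and race/ethnicity were independent, every conditional probability would equal P(protects):

```latex
P(\text{protects}) = 0.58, \qquad
P(\text{protects} \mid \text{White}) = 0.67, \quad
P(\text{protects} \mid \text{Black}) = 0.28, \quad
P(\text{protects} \mid \text{Hispanic}) = 0.64
```

The conditional probabilities differ from 0.58 (most clearly 0.28 for Black respondents), so the two variables are most likely dependent.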
44
How do we express a joint probability in X and Y?
P(X = x, Y = y), e.g. P(Rain, Long commute) = P(X = 0, Y = 0) = 0.15
45
What is a random variable?
A numeric quantity whose value depends on the outcome of a random process. Lowercase letters denote the values of the variable.
46
What is the difference between a discrete random variable and a continuous random variable?
Discrete: takes countable values such as integers. Continuous: takes real (decimal) values.
47
What is covariance?
The extent to which 2 variables move in the same direction.
48
What is correlation?
covariance between two variables divided by the product of their standard deviations.
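The relationship between covariance and correlation can be checked numerically. This is a minimal sketch that assumes R's built-in `cars` dataset (columns `speed` and `dist`) as stand-in data; it is not data from the deck.

```r
# Built-in example data, used here only for illustration
x <- cars$speed
y <- cars$dist

cov_xy <- cov(x, y)                       # sample covariance
cor_by_hand <- cov_xy / (sd(x) * sd(y))   # covariance scaled by both standard deviations

print(c(covariance = cov_xy, correlation = cor_by_hand))
```

The hand-computed value should agree with `cor(x, y)`; unlike covariance, it always lies between -1 and 1.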
49
What is the Bernoulli distribution?
- for discrete variables
- binary, with only 2 possible outcomes (0 or 1)
50
How do we express a Bernoulli distribution?
X ∼ Bernoulli(p), where p is the probability that the value is 1
51
How do we express a normal distribution?
X ∼ N(µ, σ²)
52
What is error?
= true value of population parameter - point estimate
53
What is bias?
the systematic tendency to over or under-estimate the true population parameter.
54
What is sampling variability?
how much an estimate will tend to vary from one sample to the next.
55
The sample average is…
an estimator of the population mean
56
What does Y bar stand for?
The sample mean; Y bar is itself a random variable.
57
What is population parameter?
fixed feature of a particular population - usually unknown in real life
58
What is a sample statistic?
A quantity that varies from one sample to another. It is easy to compute, as it is a statistic of a sample obtained by simple random sampling.
59
What kind of distribution do we use when the parameters and exact distribution are not known?
Asymptotic distribution (use an approximation based on the sample): "tending to a distribution"
60
What do we rely on when using an asymptotic distribution?
The law of large numbers and the central limit theorem
61
What is law of large numbers?
sample mean approaches population mean as the sample size increases
62
What is the central limit theorem?
A result that lets us approximate the distribution of the sample mean (as normal, for large n) using the sample mean and sample variance.
63
What does the central limit theorem state if the population variance sigma^2 is known?
When n is large, the sampling distribution of Ȳ is approximately normal, regardless of the distribution of the underlying population. For a random sample of size n, the sample mean is approximately normally distributed with mean mu and variance sigma^2/n.
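A small simulation can make this concrete. The sketch below is only an illustration: the exponential population (mean 1, variance 1), the sample size of 50 and the 5000 repetitions are all arbitrary choices for this example, not values from the deck.

```r
set.seed(1)    # fixed seed so the run is reproducible

n    <- 50     # size of each random sample
reps <- 5000   # number of repeated samples

# Draw many samples from a skewed population (exponential: mu = 1, sigma^2 = 1)
# and record the sample mean of each
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

mean(sample_means)  # should be close to mu = 1
var(sample_means)   # should be close to sigma^2 / n = 0.02

# The histogram should look roughly bell-shaped even though the population is skewed
hist(sample_means, breaks = 40, main = "Sampling distribution of the sample mean")
```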
64
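If the population variance is unknown, what does the sample mean follow?
The standardised sample mean follows a Student's t distribution with n-1 degrees of freedom. Its tails are heavier than those of the normal distribution, and the variance of the sample mean is estimated by s^2/n.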
65
If you want to conduct a hypothesis test on whether a coin is fair, what is the variance?
sigma^2 = p(1-p) = 0.25 (assuming the coin is fair, p = 0.5). By the CLT, the sample mean is approximately normally distributed with var(p hat) = sigma^2/n = 0.0025 (which corresponds to n = 100). Use a two-tailed test.
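The calculation can be sketched in R as follows. The sample size n = 100 is implied by 0.25 / n = 0.0025; the observed count of 60 heads is a hypothetical value chosen only to have something to test.

```r
n     <- 100   # implied by sigma^2 / n = 0.25 / n = 0.0025
heads <- 60    # hypothetical observed number of heads (not from the deck)
p0    <- 0.5   # null hypothesis: the coin is fair

p_hat <- heads / n
se0   <- sqrt(p0 * (1 - p0) / n)        # sqrt(0.0025) = 0.05 under the null

z       <- (p_hat - p0) / se0           # test statistic, approx N(0, 1) by the CLT
p_value <- 2 * (1 - pnorm(abs(z)))      # two-tailed p-value

print(c(z = z, p_value = p_value))
```

With these made-up numbers z is 2, which is just beyond 1.96, so the null hypothesis of a fair coin would be rejected at the 5% level.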
66
What is a confidence interval?
A plausible range of values for the population parameter.
67
What is a 95% confidence interval?
Point estimate +/- 1.96 x standard error. Suppose we take many samples and build a confidence interval from each sample; then about 95% of these intervals would contain the true population parameter.
68
What is the standard error?
The standard deviation of an estimator's sampling distribution.
69
What is the margin of error?
Half the width of the CI (e.g. 1.96 x standard error for a 95% CI)
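A minimal sketch of a 95% confidence interval for a mean, taking the margin of error as the half-width 1.96 x standard error (the usual convention). It assumes the built-in `cars$speed` column as example data and a sample large enough for the normal approximation.

```r
y <- cars$speed             # built-in example data, for illustration only
n <- length(y)

y_bar  <- mean(y)           # point estimate of the population mean
se     <- sd(y) / sqrt(n)   # standard error of the sample mean
margin <- 1.96 * se         # margin of error for a 95% interval

ci <- c(lower = y_bar - margin, upper = y_bar + margin)
print(ci)
```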
70
Linear regression is ____ learning. (supervised / unsupervised)
supervised learning
71
What is a characteristic of the y variable for linear regression?
It is a continuous dependent variable.
72
Can linear regression be used to predict discrete outcomes?
Yes (e.g. credit card default)
73
What does a hat denote?
estimate, a predicted value
74
What is the typical equation of a linear regression model?
Y = β0 + β1X + ϵ
75
What does ϵ represent in the linear regression model?
The error term (residual term): the difference between the regression line and the actual observed data.
76
What is the equation for a residual?
ei = yi − ŷi = yi − (b0 + b1 xi), where b0 and b1 are the estimated coefficients; it is the vertical distance from each point to the fitted line.
77
What is the residual sum of squares?
The sum of the squared residuals over all observations: RSS = SUM(ei^2). Minimising it is called LEAST SQUARES. It is the variation in Y that is left unexplained after fitting the regression model.
78
What is the model supposed to minimise in linear regression? How?
RSS
1. Sum all squared residuals, written in terms of b0 and b1.
2. Take the derivatives with respect to b0 and b1 and set them to zero.
79
The regression line always passes through which point?
(x bar, y bar). Since b0 = y bar - b1(x bar), substitute x = x bar into y = b0 + b1 x: y = y bar - b1 x bar + b1 x bar = y bar. The b1 x bar terms CANCEL OFF!
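This property can be checked numerically. The sketch below assumes the built-in `cars` dataset (`dist` regressed on `speed`) as a stand-in for the course data.

```r
fit <- lm(dist ~ speed, data = cars)   # built-in example data, for illustration only

x_bar <- mean(cars$speed)
y_bar <- mean(cars$dist)

# Predicted y at x = x bar
pred_at_x_bar <- unname(predict(fit, newdata = data.frame(speed = x_bar)))

# Intercept recovered from b0 = y bar - b1 * x bar
b1 <- unname(coef(fit)["speed"])
b0_by_hand <- y_bar - b1 * x_bar

print(c(prediction = pred_at_x_bar, y_bar = y_bar))
```

The prediction at x bar should match y bar, and `b0_by_hand` should match the fitted intercept, up to floating-point rounding.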
80
What does the best fit line do?
Minimises the sum of squared deviations from the line (the least squares fit for the regression line)
81
How do we interpret the y-intercept?
When x is 0, y is ON AVERAGE equal to the intercept.
82
How do we interpret the slope of a regression line?
The average change in Y when X increases by one unit.
83
What is the residual standard error?
An estimate of the standard deviation of the residual terms. It measures the lack of fit of a model to the data.
84
How many degrees of freedom are there for RSE?
n - 2 (RSS is divided by n - 2 rather than n)
85
What is TSS?
The total variation in Y: the part that can be explained by the model (TSS - RSS) plus the part that cannot be explained (RSS).
86
What is R^2?
Measures the goodness of fit: the proportion of the variance in Y that can be explained by the model (the larger the R^2, the better the fit). Formula: (TSS - RSS) / TSS
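RSS, TSS, RSE and R^2 can all be computed by hand from a fitted model. This is a sketch under the assumption that the built-in `cars` dataset stands in for the course data; the hand-computed values are then comparable with what `summary()` reports.

```r
fit <- lm(dist ~ speed, data = cars)   # built-in example data, for illustration only
n   <- nrow(cars)

rss <- sum(residuals(fit)^2)                  # unexplained variation
tss <- sum((cars$dist - mean(cars$dist))^2)   # total variation in Y

rse       <- sqrt(rss / (n - 2))   # residual standard error, n - 2 degrees of freedom
r_squared <- (tss - rss) / tss     # proportion of variation explained

print(c(RSS = rss, TSS = tss, RSE = rse, R2 = r_squared))
```

These should agree with `summary(fit)$sigma` and `summary(fit)$r.squared`.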
87
What is the purpose of hypothesis testing for linear regression?
To assess how close the estimates b0 hat and b1 hat are to the true values of b0 and b1.
88
How do we find the standard error of an estimator?
Conceptually, by repeated sampling: see how the values of b0 hat and b1 hat vary from sample to sample.
89
How do we conduct hypothesis testing for b0 and b1?
T test with n-2 degrees of freedom, where n is the sample size (2 are lost because b0 and b1 are estimated): t = (b1 hat - 0) / se(b1 hat)
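The t statistic can be reproduced by hand from the coefficient table. The sketch assumes the built-in `cars` dataset as example data and tests the null hypothesis b1 = 0.

```r
fit   <- lm(dist ~ speed, data = cars)   # built-in example data, for illustration only
coefs <- summary(fit)$coefficients

b1_hat <- coefs["speed", "Estimate"]
se_b1  <- coefs["speed", "Std. Error"]

t_stat   <- (b1_hat - 0) / se_b1   # t = (b1 hat - 0) / se(b1 hat)
df_resid <- nrow(cars) - 2         # n - 2 degrees of freedom

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

print(c(t = t_stat, p_value = p_value))
```

Both values should match the "t value" and "Pr(>|t|)" columns of `summary(fit)$coefficients`.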
90
What are the assumptions for the least squares line?
1. The relationship between X and Y should be linear
2. Residuals are nearly normal
3. Residuals have constant variability (homoscedasticity)
91
What graph should we use to check whether the relationship between X and Y is linear?
Residuals vs Fitted plot: the RED LINE SHOULD BE roughly HORIZONTAL
92
How do we check whether the residuals are nearly normal?
Normal Q-Q plot: points should lie roughly along a straight diagonal line
93
What is the formula for a standardised residual?
(ei - e bar) / SE(e), where e bar is the mean of the residuals
94
How do we check for constant variability?
Scale-Location plot (YOU WANT TO HAVE NO PATTERN IN THE RESIDUALS): the red line is roughly horizontal
95
How do we check for influential values?
Residuals vs Leverage plot: check for outlying values at the upper-right or lower-right. If a point falls outside the Cook's distance lines, it is influential (should remove the point).
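The four diagnostic plots named in cards 91 to 95 are available through `plot()` on a fitted `lm` object. This sketch assumes the built-in `cars` dataset as example data; the `which` argument selects the individual plot.

```r
fit <- lm(dist ~ speed, data = cars)   # built-in example data, for illustration only

plot(fit, which = 1)   # Residuals vs Fitted: checks linearity
plot(fit, which = 2)   # Normal Q-Q: checks that residuals are nearly normal
plot(fit, which = 3)   # Scale-Location: checks constant variability
plot(fit, which = 5)   # Residuals vs Leverage: flags influential points

# Cook's distance for each observation; the largest values deserve a closer look
cooks <- cooks.distance(fit)
head(sort(cooks, decreasing = TRUE), 3)
```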
96
How can we improve the model?
- Transforming variables (scaling)
- Seeking additional variables to explain Y
- Using more advanced methods
97
How do we read data in R?
read.csv("file", header = TRUE)
98
How do we create a linear model in R?
lm1 <- lm(y_var ~ x_var, data = Advertising)
99
How to show the coefficients?
summary(lm1)$coefficients
100
When do we reject the null hypothesis that b1 = 0 at the 5% significance level?
When |t| for b1 is greater than 1.96 (large n). There is then evidence of a relationship between the variables.
101
How do we obtain confidence intervals for b0 and b1 in R?
confint(lm1). By default 95%
102
How do we find 90% confidence intervals for b0 and b1 in R?
confint(lm1, level = 0.90)
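Putting the last few cards together, here is a short end-to-end sketch. It assumes the built-in `cars` dataset in place of the deck's `Advertising` data, so the variable names `dist` and `speed` are specific to this example.

```r
lm1 <- lm(dist ~ speed, data = cars)   # built-in example data, for illustration only

summary(lm1)$coefficients              # estimates, standard errors, t values, p-values

ci_95 <- confint(lm1)                  # 95% intervals by default
ci_90 <- confint(lm1, level = 0.90)    # 90% intervals

print(ci_95)
print(ci_90)
```

The 90% intervals are narrower than the 95% ones, since a lower confidence level needs a smaller margin of error.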
103
How do we refer to a column in a dataset?
data$column_name
104
For normally distributed data, about how much lies within 1, 2 and 3 standard deviations of the mean?
1 sd: 68%, 2 sd: 95%, 3 sd: 99.7%
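These percentages come from the normal distribution and can be recovered with `pnorm()`, the standard normal cumulative distribution function. A minimal sketch:

```r
k <- 1:3                             # number of standard deviations from the mean

# P(-k < Z < k) for a standard normal Z
coverage <- pnorm(k) - pnorm(-k)

round(100 * coverage, 1)             # 68.3 95.4 99.7
```

The rule only applies to data that are approximately normal; skewed data can have quite different proportions.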