R-Code for Exam, Modules 9-13 Flashcards

1
Q

query_X = DATASET_NAME$CAT_VARIABLE == "X"
index_X = which(query_X)
quantities_X = DATASET_NAME$QUANT_VARIABLE[index_X]
m_quantities_X = mean(quantities_X)
sum_squares_X = sum((quantities_X - m_quantities_X)^2)

A

we have a data set with a categorical variable that has three categories (levels) – X, Y and Z – and we want to find the sum of squares of the quantities within category X
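To make the pattern concrete, here is a minimal sketch on a toy data frame (the names df, grp and val are made up for illustration):

```r
# Toy stand-in for DATASET_NAME; grp = categorical column, val = quantitative column
df = data.frame(grp = c("X", "X", "X", "Y", "Y", "Z"),
                val = c(2, 4, 6, 1, 3, 5))
query_X = df$grp == "X"
index_X = which(query_X)
quantities_X = df$val[index_X]            # 2 4 6
m_quantities_X = mean(quantities_X)       # 4
sum_squares_X = sum((quantities_X - m_quantities_X)^2)   # 4 + 0 + 4 = 8
```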

2
Q

query_Y = DATASET_NAME$CAT_VARIABLE == "Y"
index_Y = which(query_Y)
quantities_Y = DATASET_NAME$QUANT_VARIABLE[index_Y]
m_quantities_Y = mean(quantities_Y)
sum_squares_Y = sum((quantities_Y - m_quantities_Y)^2)

A

we have a data set with a categorical variable that has three categories (levels) – X, Y and Z – and we want to find the sum of squares of the quantities within category Y

3
Q

query_Z = DATASET_NAME$CAT_VARIABLE == "Z"
index_Z = which(query_Z)
quantities_Z = DATASET_NAME$QUANT_VARIABLE[index_Z]
m_quantities_Z = mean(quantities_Z)
sum_squares_Z = sum((quantities_Z - m_quantities_Z)^2)

A

we have a data set with a categorical variable that has three categories (levels) – X, Y and Z – and we want to find the sum of squares of the quantities within category Z

4
Q

within_group_ss = sum_squares_X + sum_squares_Y + sum_squares_Z

A

calculates the within-group sum of squares ("within_group_ss") for the three categories (X, Y and Z), whose individual sums of squares have already been calculated

5
Q

grand_mean = mean(DATASET_NAME$QUANT_VARIABLE)

A

calculates the grand mean of the quantitative variable of the dataset (saved as "grand_mean"), which is the mean across all observations regardless of their category (X, Y, Z, etc.)

6
Q

among_group_ss =
N_REPS_X*(m_quantities_X - grand_mean)^2 +
N_REPS_Y*(m_quantities_Y - grand_mean)^2 +
N_REPS_Z*(m_quantities_Z - grand_mean)^2

A

code to find the among-group sum of squares for the three categories (X, Y and Z): each squared difference between a category mean and the grand mean is weighted by that category's number of replicates in the dataset (the N_REPS placeholders, since "#" would start a comment in R)

7
Q

total_ss = within_group_ss + among_group_ss

A

the total sum of squares equals the sum of the within group and the among group sum of squares
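A quick toy check of this identity (hypothetical data; tapply is used here as a shortcut for the per-category sums of squares):

```r
df = data.frame(grp = c("X", "X", "Y", "Y", "Z", "Z"),
                val = c(1, 3, 4, 6, 8, 10))
grand_mean = mean(df$val)
# total SS around the grand mean
total_ss = sum((df$val - grand_mean)^2)
# within-group SS: each category's SS around its own mean
within_group_ss = sum(tapply(df$val, df$grp,
                             function(v) sum((v - mean(v))^2)))
# among-group SS: squared gap between each category mean and the grand mean,
# weighted by category size
among_group_ss = sum(tapply(df$val, df$grp,
                            function(v) length(v) * (mean(v) - grand_mean)^2))
all.equal(total_ss, within_group_ss + among_group_ss)   # TRUE
```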

8
Q

table_ss =
as.table(rbind(c("total_ss",
"within_group_ss",
"among_group_ss"),
c(total_ss,
within_group_ss,
among_group_ss)))

A

code which establishes a table to compare the within, among, and total sum of squares for a set of data

9
Q

anova_model =
aov(QUANTITATIVE~CATEGORICAL,
data = DATASET_NAME)

A

code to establish an ANOVA model for the data (“anova_model”) which relates a quantitative and categorical variable (“QUANTITATIVE”; “CATEGORICAL”) to a dataset (“DATASET_NAME”)
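For a concrete, runnable version, R's built-in PlantGrowth data set has a quantitative column (weight) and a categorical one (group):

```r
# weight ~ group: does mean plant weight differ among the treatment groups?
anova_model = aov(weight ~ group, data = PlantGrowth)
summary(anova_model)   # shows the F statistic and Pr(>F) for group
```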

10
Q

plot(factored_variable, DATASET_NAME$QUANT_VARIABLE,
ylim = c(LOW, HIGH), ylab = "BLAH BLAH",
xlab = "BLAH BLAH",
col = "red")   # or "blue", "green", etc.

A

creates a box-plot of the quantitative variable grouped by a categorical variable which has been coded as a factor, giving a visual first impression of whether the categories differ from one another

11
Q

anova_model_1 = aov(quantitative_variable ~
factored_variable,
data = DATASET_NAME)

A

first method to establish an ANOVA model of a dataset, assuming we have already created the object "factored_variable" for the categories "X", "Y" and "Z"

12
Q

anova_model_2 = aov(quantitative_variable ~
factor(DATASET_NAME$CAT_VARIABLE, levels = c("X", "Y", "Z")),
data = DATASET_NAME)

A

second method to establish an ANOVA model of a dataset, assuming we have not yet created the object "factored_variable" for the categories "X", "Y" and "Z"

13
Q

residuals = quantitative_variable - predict.lm(anova_model)
plot(predict.lm(anova_model), residuals, cex.lab = 1.15,
ylim = c(LOW, HIGH),
xlim = c(LOW, HIGH),
ylab = "BLAH BLAH",
xlab = "YADA YADA")
abline(a = 0, b = 0, lwd = 2, lty = "dashed", col = "red")   # lwd = 2 or 3

A

code to plot the residuals versus the predicted values for the data set, in order to check for homogeneity of variance

14
Q

stdRes = rstandard(anova_model)

A

calculates the standardized residuals for the ANOVA model ("anova_model") and saves them as the object "stdRes"

15
Q

qqnorm(stdRes,
ylab = "Standardized Residuals",
xlab = "Theoretical Quantiles")
qqline(stdRes, col = "red", lwd = 2)

A

establishes a normal Q-Q plot of the standardized residuals, along with a reference line (drawn through the quartiles) for judging normality

16
Q

TukeyHSD(anova_model, ordered = TRUE)

A

conducts a Tukey’s Honestly Significant Difference test on the ANOVA model in order to determine which groups in the ANOVA vary significantly from one another
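A runnable sketch, again using the built-in PlantGrowth data:

```r
anova_model = aov(weight ~ group, data = PlantGrowth)
tukey_results = TukeyHSD(anova_model, ordered = TRUE)
tukey_results   # one row per pair: difference, CI bounds, adjusted p-value
```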

17
Q

plot(TukeyHSD(anova_model, ordered = TRUE))

A

establishes a plot of the Tukey test of the ANOVA model – pairs whose confidence intervals do not cross the dashed vertical line at zero are significantly different, while pairs whose intervals cross that line are not

18
Q

quant_var = DATASET_NAME$QUANTITATIVE_VARIABLE
fact_var1 = factor(DATASET_NAME$CATEGORY_VARIABLE_1)
fact_var2 = factor(DATASET_NAME$CATEGORY_VARIABLE_2)
par(mfrow = c(1, 2))
plot(fact_var1, quant_var,
ylim = c(LOW, HIGH),
ylab = "BLAH BLAH",
xlab = "YADA YADA",
col = "red")
plot(fact_var2, quant_var,
ylim = c(LOW, HIGH),
ylab = "BLAH BLAH",
xlab = "YADA YADA",
col = "blue")

A

code which factors two categorical variables and stores the quantitative variable as an object, then sets up two side-by-side box-plots: one relating the first factored variable to the quantitative variable, the other relating the second factored variable to it

19
Q

two_way_anova_model = aov(quant_var ~ fact_var1 + fact_var2)

A

code to produce a two-way ANOVA model, given we already have the quantitative variable and the two factored categorical variables (note: an R object name cannot begin with a digit, so a name like "2way_anova_model" would be invalid)

20
Q

summary(two_way_anova_model)
# Pr(>F): fact_var1 < 0.05, fact_var2 < 0.05
summary(one_way_anova_model)
# Pr(>F): factored_variable > 0.05

A

describes the possibility that, when running a block ANOVA, the P-values for the two factor variables may both be significant, while a simpler one-factor ANOVA model which ignores one of those variables may produce a non-significant P-value for the remaining factor
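This can be tried on R's built-in warpbreaks data (breaks is quantitative; wool and tension are factors); the exact p-values are not the point, only that adding or dropping a factor changes what the summaries report:

```r
two_way_model = aov(breaks ~ wool + tension, data = warpbreaks)
one_way_model = aov(breaks ~ wool, data = warpbreaks)
summary(two_way_model)   # Pr(>F) for wool with tension accounted for
summary(one_way_model)   # Pr(>F) for wool alone
```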

21
Q

library(HSAUR2)
data(“USairpollution”)

A

loads the "USairpollution" dataset from the "HSAUR2" package; "HSAUR2" is an ordinary CRAN package, so once it is installed (install.packages("HSAUR2")) you can load this data yourself without relying on the data sets provided by the professors

22
Q

plot(USairpollution$predays, USairpollution$SO2,
xlab = "days with precipitation",
ylab = "sulfur emissions")

A

creates a scatterplot which puts the days with precipitation on the x-axis and the sulfur emissions on the y-axis

23
Q

pollution_model = lm(USairpollution$SO2 ~ USairpollution$predays)
abline(pollution_model,
col = "red", lwd = 3)

A

graphs a red line of best fit (line width 3) for the relationship between sulfur emissions and days with precipitation from the "USairpollution" data set

24
Q

predicted = predict.lm(pollution_model)
residuals = USairpollution$SO2 - predicted
plot(predicted, residuals,
xlab = "Predicted Y",
ylab = "Residuals")
abline(a = 0, b = 0,
col = "red", lwd = 3, lty = "dashed")

A

commands to establish a plot of the residuals versus the predicted values for the sulfur emissions given the number of days with precipitation (variance appears heteroscedastic).

25
Q

stdRes = rstandard(pollution_model)
log_pollute_mod = lm(log(USairpollution$SO2) ~ USairpollution$predays)
plot(USairpollution$predays, log(USairpollution$SO2),
xlab = "days with precipitation",
ylab = "sulfur emissions (log-transformed)")
abline(log_pollute_mod,
col = "red", lwd = 3)

A

graphs a red line of best fit (line width 3) for the relationship between the log-transformed sulfur emissions and days with precipitation from the "USairpollution" data set

26
Q

log_predicted = predict.lm(log_pollute_mod)
log_residuals = log(USairpollution$SO2) - log_predicted
plot(log_predicted, log_residuals,
xlab = "Log-Trans Predicted Y", ylab = "Log-Trans Residuals")
abline(a = 0, b = 0, col = "red", lwd = 3, lty = "dashed")

A

commands to establish a plot of the log-transformed residuals versus the predicted values for the sulfur emissions given the number of days with precipitation (variance still appears heteroscedastic).

27
Q

rainydays_sulfur = lm(USairpollution$SO2 ~ USairpollution$predays)
stdRes = rstandard(rainydays_sulfur)
qqnorm(stdRes, xlab = "Theoretical Quantiles", ylab = "Standardized Residuals")
qqline(stdRes, col = "red", lwd = 2, lty = "dashed")

A

creates a QQ-plot for the relationship between SO2 emissions and days with precipitation

28
Q

log_stdRes = rstandard(log_pollute_mod)
qqnorm(log_stdRes, xlab = "Theoretical Quantiles (Log-Transformed)",
ylab = "Standardized Residuals (Log-Transformed)")
qqline(log_stdRes, col = "red", lwd = 2, lty = "dashed")

A

establishes the qq-plot for the log-transformed data from the USairpollution data set

29
Q

plot(x = DATASET_NAME$VARIABLE_1, y = DATASET_NAME$VARIABLE_2)
mean_var1 = mean(DATASET_NAME$VARIABLE_1)
mean_var2 = mean(DATASET_NAME$VARIABLE_2)
abline(v = mean_var1, col = "gray")
abline(h = mean_var2, col = "gray")

A

commands relevant to simple regression which are used to see if two variables which are quantitative (VARIABLE_1 and VARIABLE_2) have a positive, negative or no correlation

30
Q

cor(DATASET_NAME)

A

gives the pairwise correlations between the variables in a dataset (all columns must be numeric); each variable has a correlation of exactly 1 with itself and a correlation between -1 and 1 with every other variable

31
Q

devi_x = DATASET_NAME$VARIABLE_1 - mean_var1
devi_y = DATASET_NAME$VARIABLE_2 - mean_var2
Pearson_numerator = sum(devi_x*devi_y)

A

manually calculates the numerator of the Pearson coefficient for VARIABLE_1 and VARIABLE_2

32
Q

sosq_devi_x = sum(devi_x^2)
sosq_devi_y = sum(devi_y^2)
Pearson_denominator = sqrt(sosq_devi_x*sosq_devi_y)

A

manually calculates the denominator of the Pearson coefficient for VARIABLE_1 and VARIABLE_2

33
Q

Pearson_coefficient = Pearson_numerator/Pearson_denominator

A

manually derives the Pearson coefficient, which is above zero if there is a positive correlation and below zero if there is a negative correlation
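A toy check that the manual calculation matches R's built-in cor() (x and y are made-up vectors):

```r
x = c(1, 2, 3, 4, 5)
y = c(2, 4, 5, 4, 5)
devi_x = x - mean(x)
devi_y = y - mean(y)
# numerator / denominator exactly as in the preceding cards
Pearson_coefficient = sum(devi_x * devi_y) /
  sqrt(sum(devi_x^2) * sum(devi_y^2))
all.equal(Pearson_coefficient, cor(x, y))   # TRUE
```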

34
Q

query_neg_Pearson = devi_x*devi_y < 0
index_neg_Pearson = which(query_neg_Pearson)
length(index_neg_Pearson)/nrow(DATASET_NAME)

A

prints to the console the proportion of observations whose deviation products (devi_x*devi_y) are negative, i.e. the points that pull the Pearson coefficient downward

35
Q

query_pos_Pearson = devi_x*devi_y > 0
index_pos_Pearson = which(query_pos_Pearson)
length(index_pos_Pearson)/nrow(DATASET_NAME)

A

prints to the console the proportion of observations whose deviation products (devi_x*devi_y) are positive, i.e. the points that pull the Pearson coefficient upward

36
Q

x = seq(from = -10, to = 10, length.out = 100)
y = x^2
plot(x, y)
mean_x = mean(x)
mean_y = mean(y)
abline(v = mean_x, col = "grey")
abline(h = mean_y, col = "grey")

A

generates a graph of the quadratic relationship y = x^2 for x between -10 and 10 – although the relationship is clearly strong, the Pearson coefficient comes out to (essentially) zero because the positive deviation products cancel the negative ones
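The point can be confirmed numerically: cor() on these vectors is essentially zero despite the perfect quadratic relationship.

```r
x = seq(from = -10, to = 10, length.out = 100)
y = x^2
cor(x, y)   # ~0: the positive and negative deviation products cancel out
```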

37
Q

dataset_narrowed = dataset_general[, c("COLUMN_1", "COLUMN_2")]   # add further column names as needed

A

for datasets with a huge number of columns, subsetting with brackets and c("COLUMN_1", "COLUMN_2", etc.) keeps only the listed columns

38
Q

query_elim = !is.na(dataset_narrowed$BAD_COLUMN_1) &
!is.na(dataset_narrowed$BAD_COLUMN_2) &
dataset_narrowed$BAD_COLUMN_1 < HIGH &
dataset_narrowed$BAD_COLUMN_1 > LOW &
dataset_narrowed$BAD_COLUMN_2 < HIGH &
dataset_narrowed$BAD_COLUMN_2 > LOW
index_elim = which(query_elim)
dataset_retained = dataset_narrowed[index_elim, ]

A

if the narrowed dataset contains missing or bogus values, this set of commands keeps only the rows whose values are non-missing and fall within a reasonable range
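A minimal sketch with made-up values (row 1 is the only one that is both non-missing and in range):

```r
df = data.frame(a = c(1, NA, 3, 100),
                b = c(2, 3, NA, 4))
# keep rows where both columns are non-missing and a is between 0 and 50
query_elim = !is.na(df$a) & !is.na(df$b) & df$a < 50 & df$a > 0
index_elim = which(query_elim)
df_retained = df[index_elim, ]   # keeps only row 1
```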

39
Q

lowess_retained = lowess(x = dataset_retained$COLUMN_1,
y = dataset_retained$COLUMN_2, f = FRACTION)   # e.g. f = 1/3; smaller f = wigglier line
lines(x = lowess_retained$x, y = lowess_retained$y, col = "red", lwd = 3)

A

code to add a lowess (locally weighted) smoothing line to the existing scatterplot of the retained values

40
Q

plot(x = DATASET_NAME$VARIABLE_X, y = DATASET_NAME$VARIABLE_Y,
xlab = "VARIABLE X", ylab = "VARIABLE Y")
regression_results = lm(VARIABLE_Y ~ VARIABLE_X, data = DATASET_NAME)
lines(x = DATASET_NAME$VARIABLE_X,
y = regression_results$fitted.values)

A

plots the simple regression model between two variables, X and Y

41
Q

regression_summary = summary(regression_results)

A

establishes the summary of the simple regression for two variables

42
Q

slope = regression_summary$coefficients["VARIABLE_X", "Estimate"]

A

extracts the slope of the simple regression from the coefficient table

43
Q

slope_standard_error =
regression_summary$coefficients["VARIABLE_X", "Std. Error"]

A

extracts the standard error of the slope of the simple regression from the coefficient table

44
Q

slope_tvalue = slope/slope_standard_error

A

calculates the t-value of the slope of the simple regression line

45
Q

intercept = regression_summary$coefficients["(Intercept)", "Estimate"]
intercept_standard_error =
regression_summary$coefficients["(Intercept)", "Std. Error"]
intercept_tvalue = intercept/intercept_standard_error

A

code used to find the intercept t-value
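As a runnable check on R's built-in cars data (dist ~ speed), the manually computed t-value matches the one in the coefficient table:

```r
model = lm(dist ~ speed, data = cars)
reg_summary = summary(model)
slope = reg_summary$coefficients["speed", "Estimate"]
slope_se = reg_summary$coefficients["speed", "Std. Error"]
slope_tvalue = slope / slope_se
# same value that summary() reports in the "t value" column
all.equal(slope_tvalue, reg_summary$coefficients["speed", "t value"])   # TRUE
```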
