R-Code for Exam, Modules 1-8 Flashcards

1
Q

sd(DATASET_NAME$VARIABLE_NAME)

A

gives the standard deviation for the observations in the variable

2
Q

favstats(DATASET_NAME$VARIABLE_NAME)

A

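provides summary statistics (minimum, quartiles, median, maximum, mean, standard deviation, count, and number missing) for the observations in the variable; favstats() comes from the mosaic package, so load it with library(mosaic) first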

3
Q

hist(DATASET_NAME$VARIABLE_NAME)

A

produces a histogram of the variable from the dataset

4
Q

boxplot(DATASET_NAME$VARIABLE_NAME)

A

produces a boxplot of the variable from the dataset

5
Q

boxplot(Y~X, data = DATASET_NAME)

A

produces side-by-side boxplots of variable “Y”, one for each group of variable “X”, from the dataset

6
Q

summary(DATASET_NAME)

A

gives numerical summaries of all of the variables in the data set
(minimum, maximum, median, mean, 1st quartile, 3rd quartile)

7
Q

t.test(Y~X, data = DATASET_NAME, conf.level = 0.95)

A

runs a two-sample t-test comparing the mean of “Y” between the two groups of “X”; reports the t-statistic, degrees of freedom, p-value, and a 95% confidence interval for the difference in means
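
A minimal sketch of this call, assuming the built-in ToothGrowth data set stands in for DATASET_NAME (len as “Y”, supp as a two-group “X”); the object name result is only for illustration.

```r
# Two-sample t-test of tooth length (len) between the two supplement groups (supp)
result = t.test(len ~ supp, data = ToothGrowth, conf.level = 0.95)

result$statistic   # t-statistic
result$parameter   # degrees of freedom
result$p.value     # p-value
result$conf.int    # 95% confidence interval for the difference in means
```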

8
Q

runif(N_OBS, X, Y)

A

produces a vector of “N_OBS” random numbers drawn uniformly from the range (X, Y)
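
A short sketch with made-up values (5 draws between 0 and 10); the object name draws is hypothetical, and set.seed() is included only so the result can be repeated.

```r
set.seed(1)                 # fix the seed so the draw is reproducible
draws = runif(5, 0, 10)     # 5 random numbers between 0 and 10
draws
length(draws)               # number of values generated
range(draws)                # smallest and largest draw
```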

9
Q

setwd("C:/Users/Joseph Paoli/Downloads/Lessons in R for Stats")

A

sets the working directory to the “Lessons in R for Stats” folder inside the Downloads folder

10
Q

rm(list=ls())

A

removes every object from the current R workspace, giving a clean working environment in RStudio

11
Q

“Files” → Desired Folder (Click) → Blue Gear (Click) → “Set as Working Directory”

A

how to set the working directory from the RStudio “Files” pane if the setwd() code doesn’t work

12
Q

“Plots” → “Export” → “Save as Image…”

A

how to export a plotted graph in the viewing area to a .jpeg or .png file

13
Q

capture.output(summary(DATASET_NAME), file = "EXCEL_NAME.xls")

A

saves the printed summary statistics for a dataset to a file of a specified name; the file contains plain text, so despite the .xls extension it is not a true Excel workbook

14
Q

DATASET_NAME[DATASET_NAME > X]

A

returns only the values in the dataset greater than “X”; the original dataset is unchanged unless the result is assigned back to it
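
A minimal sketch of this kind of filtering, using a small made-up vector (values) in place of DATASET_NAME and 5 in place of “X”.

```r
values = c(3, 8, 1, 12)     # hypothetical data
big = values[values > 5]    # keep only the values greater than 5
big                         # 8 12
values                      # the original vector still holds all four numbers
```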

15
Q

log(DATASET_NAME)

A

takes the natural log (base e) of the dataset; use log10(DATASET_NAME) for the common log (base 10)
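
A quick check of which base each call uses, with 100 chosen as an arbitrary example value.

```r
log(100)              # natural log (base e), roughly 4.6
log10(100)            # common log (base 10): 2
log(100, base = 10)   # same value as log10(100), using the base argument
log(exp(1))           # natural log of e: 1
```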

16
Q

sqrt(DATASET_NAME)

A

takes the square root of the dataset

17
Q

var(DATASET_NAME)

A

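finds the sample variance of the set of values in the dataset (for a data frame with several numeric columns, it returns the variance-covariance matrix instead)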

18
Q

install.packages("PACKAGE")

A

installs a package called “PACKAGE” into the R library so that it is available in RStudio

19
Q

library("PACKAGE")

A

loads the package “PACKAGE” for our use in RStudio

20
Q

?rstudio.command

A

this code would provide information on the command with the name “rstudio.command”

21
Q

help.search("data.input")

A

command which searches the help system for the phrase “data.input”, useful for locating a function for inputting data into RStudio whose name is unknown to the user

22
Q

find("rstudio.command")

A

reports which package (among those currently loaded) contains “rstudio.command”, for when you know a command’s name and want to use it but don’t know the package it’s located under

23
Q

example(rstudio.command)

A

we want to run an example of the command “rstudio.command” to become better acquainted with it

24
Q

demo(graphics)

A

generates a series of plots and shows the code to make them in the “Console” window in the lower-left of RStudio

25
Q

colnames(DATASET_NAME)

A

provides the names of all of the columns in a data set

26
Q

dim(DATASET_NAME)

A

provides the number of rows and the number of columns in the data set, in that order
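
A quick illustration, assuming the built-in mtcars data set in place of DATASET_NAME; the object name size is hypothetical.

```r
size = dim(mtcars)
size        # 32 11: rows first, then columns
size[1]     # number of rows
size[2]     # number of columns
nrow(mtcars); ncol(mtcars)   # the same two numbers, one at a time
```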

27
Q

str(DATASET_NAME)

A

provides the internal structure of the data set

28
Q

range(DATASET_NAME$VARIABLE_NAME)

A

provides the minimum and maximum values of a certain variable from the dataset

29
Q

quantile(DATASET_NAME$VARIABLE_NAME, probs = X)

A

provides the X quantile of a certain variable from the dataset, which is to say that roughly a fraction X of the observations are below it and a fraction 1-X are above it (“X” is a proportion between 0 and 1, such as 0.25, not a percentage)

30
Q

quantile(DATASET_NAME$VARIABLE_NAME, probs = c(X, Y, Z))

A

provides the X, Y, and Z quantiles for a certain variable from the dataset, with each of X, Y, and Z given as a proportion between 0 and 1
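
A minimal sketch of both quantile cards, assuming the built-in mtcars data set and its mpg column stand in for DATASET_NAME$VARIABLE_NAME; the object name qs is hypothetical.

```r
# A single quantile: the proportion goes in the probs argument
quantile(mtcars$mpg, probs = 0.25)

# Several quantiles at once: combine the proportions with c()
qs = quantile(mtcars$mpg, probs = c(0.25, 0.50, 0.75))
qs
names(qs)   # the results are labelled "25%", "50%", "75%"
```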

31
Q

unique(DATASET_NAME$VARIABLE_NAME)

A

provides all of the distinct observed values or names for a variable from the data set

32
Q

table(DATASET_NAME$VARIABLE_NAME)

A

for each of the unique entries of a given variable, this command tabulates the number of times it appears and displays the counts in the “Console” area

33
Q

indexes = X:Y

A

would produce a vector of all integers from the lower integer “X” to the upper integer “Y”, inclusive, stored in the object “indexes”

34
Q

DATASET_FEW_COLUMNS = DATASET_NAME[, indexes]

A

code we can write to produce a new data set with fewer columns, which includes only the columns earmarked by a list of integers between “X” and “Y” in the object called “indexes”

35
Q

DATASET_FEW_ROWS = DATASET_NAME[indexes ,]

A

code we can write to produce a new data set with fewer rows, which includes only the rows earmarked by a list of integers between “X” and “Y” in the object called “indexes”

36
Q

DATASET_FEW_BOTH = DATASET_NAME[indexes , indexes]

A

code we can write to produce a new data set with fewer columns and rows, which includes only the rows and columns earmarked by a list of integers between “X” and “Y” in the object called “indexes”
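
The three indexing cards above can be sketched together, assuming the built-in mtcars data set in place of DATASET_NAME and 1:3 as an arbitrary choice for “X:Y”; the new object names are hypothetical.

```r
indexes = 1:3                         # the integers 1, 2, 3

few_columns = mtcars[, indexes]       # all rows, first three columns
few_rows    = mtcars[indexes, ]       # first three rows, all columns
few_both    = mtcars[indexes, indexes]

dim(few_columns)   # 32 3
dim(few_rows)      # 3 11
dim(few_both)      # 3 3
```

The position of the comma decides what is kept: indexes before the comma select rows, indexes after it select columns.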

37
Q

main = "HISTOGRAM_TITLE"

A

provides a title for the histogram when typed inside the “hist(DATASET_NAME)” command, after a comma following “DATASET_NAME”, e.g. hist(DATASET_NAME, main = "HISTOGRAM_TITLE")

38
Q

xlab = "X-AXIS_LABEL"

A

provides an x-axis label for the histogram when typed inside the “hist(DATASET_NAME)” command, after a comma following “DATASET_NAME”, e.g. hist(DATASET_NAME, xlab = "X-AXIS_LABEL")

39
Q

ylab = "Y-AXIS_LABEL"

A

provides a y-axis label for the histogram when typed inside the “hist(DATASET_NAME)” command, after a comma following “DATASET_NAME”, e.g. hist(DATASET_NAME, ylab = "Y-AXIS_LABEL")

40
Q

aggregate(VARIABLE_Y ~ VARIABLE_X, data = DATASET_NAME, FUN = sd)   # or FUN = mean

A

would provide the standard deviation (or mean) of “VARIABLE_Y” within each group of “VARIABLE_X” in the data set, with “VARIABLE_Y” being the y-variable and “VARIABLE_X” being the grouping x-variable
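
A minimal sketch, assuming the built-in ToothGrowth data set in place of DATASET_NAME (len as the y-variable, supp as the grouping x-variable); the result names are hypothetical.

```r
# One row per supplement group, with that group's mean tooth length
group_means = aggregate(len ~ supp, data = ToothGrowth, FUN = mean)
group_means

# Same layout, but with each group's standard deviation
group_sds = aggregate(len ~ supp, data = ToothGrowth, FUN = sd)
group_sds
```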

41
Q

query_VARIABLE = is.na(DATASET_NAME$VARIABLE_NAME)
index_VARIABLE = which(query_VARIABLE)
NEW_DATASET = DATASET_NAME[-index_VARIABLE, ]

A

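we have a data set which has rows with the value “NA” under a certain variable, and we want to exclude these rows from “NEW_DATASET” (this assumes at least one NA is present; if there are none, the negative index is empty and every row is dropped)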
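
A sketch of the three-step method, assuming the built-in airquality data set (whose Ozone column contains NA values) in place of DATASET_NAME. The second version, using !is.na(), is an alternative not taken from the cards; it also behaves correctly when no NA is present.

```r
query_ozone = is.na(airquality$Ozone)         # TRUE where Ozone is missing
index_ozone = which(query_ozone)              # row numbers of the missing values
new_airquality = airquality[-index_ozone, ]   # drop those rows

# Alternative: keep the rows where Ozone is not missing
new_airquality2 = airquality[!is.na(airquality$Ozone), ]

nrow(airquality)       # rows before
nrow(new_airquality)   # rows after removing the NA rows
```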

42
Q

plot(x = NEW_DATASET$EXPLANATORY, y = NEW_DATASET$RESPONSE)

A

would create a scatter-plot relating an explanatory variable to a response variable for a new dataset, “NEW_DATASET”.

43
Q

pch = actual integer number (1, 2, 17, etc.)

A

argument which, if typed inside of the parentheses in the “plot()” command, will change the plotting symbol for each point from the default open circle to something else

44
Q

xlim = c(LOW INTEGER, HIGH INTEGER)
ylim = c(LOW INTEGER, HIGH INTEGER)

A

two arguments which, if typed inside of the parentheses in the “plot()” command, set the range of the x- and y-axes for the viewing window

45
Q

text(x = X_COORDINATE, y = Y_COORDINATE, labels = "NAME")

A

line of code run as its own command after “plot()” (not inside its parentheses) to place a desired name at a particular set of coordinates on the existing plot
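
The plotting cards above (pch, xlim/ylim, and text) can be combined in one sketch, assuming the built-in mtcars data set, with wt as the explanatory variable and mpg as the response. The axis limits, label position, and axis labels are arbitrary choices. The pdf(NULL) line is only there so the code runs without a display; in RStudio it and dev.off() can be left out.

```r
pdf(NULL)   # null graphics device; omit in RStudio to see the plot

plot(x = mtcars$wt, y = mtcars$mpg,
     pch = 17,                           # filled triangles instead of open circles
     xlim = c(1, 6), ylim = c(10, 35),   # axis ranges for the viewing window
     xlab = "Weight", ylab = "Miles per gallon")

# text() is a separate command that draws on the plot made above
text(x = 5, y = 30, labels = "NAME")

dev.off()   # close the device
```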

46
Q

set.seed(1)

A

code which fixes the starting point of the random number generator, so that the same set of random numbers is generated each time the code is rerun and the results can be reproduced later on
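
A small demonstration of the effect; the object names first, second, and third are hypothetical.

```r
set.seed(1)
first = runif(3)

set.seed(1)                # reset to the same seed
second = runif(3)

identical(first, second)   # TRUE: same seed, same numbers

third = runif(3)           # seed not reset, so the stream simply continues
identical(second, third)   # FALSE
```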

47
Q

n = number of observations (100, 104, etc.)
mu = average of observation values
sigma = standard deviation (1, 2, etc.)
norm_dist = rnorm(n, mean = mu, sd = sigma)

A

we want to draw a random sample from a normal distribution (called “norm_dist”), which has an explicit number of observations (n), mean value of observations (mu), and an explicit standard deviation (sigma). What are the four lines of code necessary to establish “norm_dist”?

48
Q

brk_points = seq(from = LOW, to = HIGH, by = SIZE)

A

establishes a vector of numbers (called “brk_points”) which runs from LOW to HIGH in steps of ‘SIZE’

49
Q

hist(norm_dist, xlim=c(LOW, HIGH), breaks=brk_points)

A

command which would create a histogram for the normal sample “norm_dist”, whose x-axis is bound between (LOW, HIGH) and whose bin edges are defined by the number sequence “brk_points” from the previous question (the sequence must span every value in “norm_dist”)
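
The last three cards can be run end to end as follows. This is a sketch with made-up settings (n = 100, mu = 0, sigma = 1, bin width 0.5); LOW and HIGH are computed from the data here so that the breaks are guaranteed to cover every value, which is an assumption rather than something the cards specify.

```r
pdf(NULL)      # null graphics device; omit in RStudio to see the histogram
set.seed(1)    # reproducible sample

n = 100; mu = 0; sigma = 1   # hypothetical settings
norm_dist = rnorm(n, mean = mu, sd = sigma)

low  = floor(min(norm_dist))     # whole number at or below the smallest value
high = ceiling(max(norm_dist))   # whole number at or above the largest value
brk_points = seq(from = low, to = high, by = 0.5)

h = hist(norm_dist, xlim = c(low, high), breaks = brk_points)
h$counts        # number of observations in each bin
sum(h$counts)   # every observation falls in some bin
dev.off()
```

If the breaks did not span the data, hist() would stop with an error instead of silently dropping values.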

50
Q

mu1 = mu2 = … = mu(n-1) = mun = x; sigma1 = y1, sigma2 = y2, etc. (different shapes)

A

if all of the averages (mu) for various normal distributions are the same value (x) but the standard deviations differ, the curves will be centered at the same place with different shapes: those with smaller standard deviations will be thinner and taller (reduced variation), and those with larger standard deviations will be wider and flatter (greater variation)

51
Q

mu1 = x1, mu2 = x2, etc.; sigma1 = sigma2 = … = sigma(n-1) = sigman = y (different places)

A

if averages (mu) for various normal distributions are different, but the standard deviations are all the same value (y), the curves will have the same overall shape but they will be centered at different parts of the x-axis
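
One way to check both statements numerically is to compare the height of each curve at its own center with dnorm(); the specific means and standard deviations below are arbitrary examples.

```r
# Same mean, different standard deviations: different shapes
peak_narrow = dnorm(0, mean = 0, sd = 1)   # height at the center, small sigma
peak_wide   = dnorm(0, mean = 0, sd = 2)   # height at the center, large sigma
peak_narrow > peak_wide                    # TRUE: larger sigma gives a flatter curve

# Different mean, same standard deviation: same shape, different place
peak_shifted = dnorm(3, mean = 3, sd = 1)  # height at the center of the shifted curve
all.equal(peak_shifted, peak_narrow)       # TRUE: identical height, new location
```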

52
Q

b0 ; b1 ; sigma ; n

A

intercept ; slope ; standard deviation (spread of the observations around the fitted line) ; sample size

53
Q

query_VARIABLE = DATASET_NAME$VARIABLE_X == INCLUDED_GROUP
index_VARIABLE = which(query_VARIABLE)
RELEVANT_DATASET = DATASET_NAME$VARIABLE_Y[index_VARIABLE]

A

situation in which we have a dataset with an explanatory variable with multiple unique traits (producer 1, producer 2, etc.), and we want to establish a dataset which includes only the responses associated with one of those unique representatives (i.e., all of the y-outputs associated with producer 1, or all of the y-outputs associated with producer 2)

54
Q

qqnorm(RELEVANT_DATASET)
qqline(RELEVANT_DATASET, col = "red")   # or "blue", "green", etc.

A

two commands which draw a normal Q-Q plot, with a reference line, for the relevant data we want to check for normality

55
Q

log_RELEVANT = log(RELEVANT_DATASET)

A

command would perform a natural log (base e) transformation on the list of values in “RELEVANT_DATASET”, then place those values in the object “log_RELEVANT”

56
Q

hist(RELEVANT_DATASET) < hist(log_RELEVANT) [possible for normality]

A

expresses the possibility that a log transformation of the original sample values may adhere better to the normal distribution than the original values of the samples

57
Q

t.test(DATASET_NAME, mu = log(TRUE_MEAN), alternative = "less")   # or "greater", "two.sided"

A

code to run to perform a one-sample t-test on a dataset, with “mu” set to the hypothesized true mean (logged here, to match log-transformed data) and “alternative” specifying if the alternative hypothesis is that the population mean is “less” than, “greater” than, or (by default, “two.sided”) different from that value

58
Q

numerator_1sided = mean(DATASET_NAME) - TRUE_MEAN

A

makes an object called “numerator_1sided” which is the mean of all of the values in the sample, minus the hypothesized mean in the population (or in a claim made by a vendor)

59
Q

n = length(DATASET_NAME)

A

establishes an object “n” which holds the number of observations in the sample

60
Q

denominator_1sided = sd(DATASET_NAME)/sqrt(n)

A

makes an object called “denominator_1sided” which is the standard deviation of the sampled values, divided by the square root of the number of samples present

61
Q

T_statistic_1sided = numerator_1sided/denominator_1sided

A

creates the object “T_statistic_1sided” for the manual calculation of the T-statistic associated with a manually-performed T-test – “T_statistic_1sided” makes use of two previously-generated values, “numerator_1sided” and “denominator_1sided”

62
Q

df_1sided = length(DATASET_NAME) - 1
(df_1sided = n - 1)

A

creates the object “df_1sided”, which represents the degrees of freedom present in a manually-generated t-test, based on the length of the dataset (HINT; the object “df_1sided” and the object “n” differ in regard to only one thing)

63
Q

pt(T_statistic_1sided, df = df_1sided)

A

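allows one to calculate the P-value for a one-sided t-test with the alternative “less” (the lower-tail probability), based on the pre-established objects “T_statistic_1sided” and “df_1sided”; for the alternative “greater”, add lower.tail = FALSE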
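
The manual one-sample steps from the preceding cards can be checked against t.test(). This is a sketch with made-up data (sample_values) and a made-up hypothesized mean (true_mean).

```r
sample_values = c(5.1, 4.9, 5.6, 5.8, 6.0, 5.3)   # hypothetical sample
true_mean = 5.5                                    # hypothetical claimed mean

numerator_1sided   = mean(sample_values) - true_mean
n                  = length(sample_values)
denominator_1sided = sd(sample_values) / sqrt(n)
T_statistic_1sided = numerator_1sided / denominator_1sided
df_1sided          = n - 1

# Lower-tail probability: the P-value for the alternative "less"
p_manual  = pt(T_statistic_1sided, df = df_1sided)
p_builtin = t.test(sample_values, mu = true_mean, alternative = "less")$p.value

all.equal(p_manual, p_builtin)   # TRUE: the manual steps match t.test()
```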

64
Q

query_1/A = DATASET_NAME$CATEGORICAL_VARIABLE == 1/"A"
query_2/B = DATASET_NAME$CATEGORICAL_VARIABLE == 2/"B"
index_1/A = which(query_1/A)
index_2/B = which(query_2/B)
DATA_CAT_1/A = DATASET_NAME[index_1/A, "NAME OF QUANTITATIVE VALUES"]
DATA_CAT_2/B = DATASET_NAME[index_2/B, "NAME OF QUANTITATIVE VALUES"]

A

set of six commands we could input using the query~index~new dataset method to split a larger dataset with a category with two representatives (Producer 1 and Producer 2; Producer A and Producer B) into two new datasets with just their values

65
Q

t.test(x = DATA_CAT_1/A, y = DATA_CAT_2/B, var.equal = TRUE, alternative = "two.sided")

A

command to run a two-sided, pooled-variance (var.equal = TRUE) two-sample t-test which compares the mean of the values in one category (DATA_CAT_1/A) to the mean of the values in the other category (DATA_CAT_2/B)

66
Q

numerator_2sided = mean(DATA_CAT_1/A) - mean(DATA_CAT_2/B)

A

creates an object (“numerator_2sided”) which can be used to calculate the numerator in order to derive the t-statistic by hand for a two-sided test

67
Q

n_1/A = length(DATA_CAT_1/A)
n_2/B = length(DATA_CAT_2/B)

A

creates two objects which have the number of samples in the two unique categories used for our two-sided t-test (i.e., Farm 1 and Farm 2; Producer A and Producer B)

68
Q

df_2sided = length(DATA_CAT_1/A) + length(DATA_CAT_2/B) - 2
(df_2sided = n_1/A + n_2/B - 2)

A

creates the object “df_2sided”, which represents the degrees of freedom present in a manually-generated two-sample t-test, based on the lengths of the two groups (HINT; the object “df_2sided” is related to the objects “n_1/A” and “n_2/B”)

69
Q

samp_sig2 = ((n_1/A - 1)*var(DATA_CAT_1/A) + (n_2/B - 1)*var(DATA_CAT_2/B))/df_2sided

samp_sig = sqrt(samp_sig2)

A

list of commands used to find the pooled standard deviation of the samples (“samp_sig”), which is the square root of the calculated pooled variance for DATA_CAT_1/A and DATA_CAT_2/B (“samp_sig2”)

70
Q

denominator_2sided = samp_sig*sqrt((1/n_1/A) + (1/n_2/B))

A

the denominator in a manually calculated t-statistic for a two-sided t-test (“denominator_2sided”) equals our derived standard deviation of the samples (“samp_sig”) times the square root of the sum of the reciprocals of the number of samples in Category 1/A and Category 2/B

71
Q

T_statistic_2sided = numerator_2sided/denominator_2sided

A

the t-statistic in a two-sided t-test is equal to the 2-sided numerator divided by the 2-sided denominator, which were previously worked out

72
Q

2*(1-pt(abs(T_statistic_2sided), df=df_2sided))

A

code used to calculate the P-value associated with a two-sided t-test
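
The whole manual two-sample calculation can be sketched in one place and compared with t.test(). This assumes the built-in ToothGrowth data set, with its two supp groups ("OJ" and "VC") playing the role of categories 1/A and 2/B; the object names ending in _OJ and _VC are hypothetical.

```r
# Split the quantitative values by category (query, index, new dataset)
query_OJ = ToothGrowth$supp == "OJ"
query_VC = ToothGrowth$supp == "VC"
data_OJ  = ToothGrowth[which(query_OJ), "len"]
data_VC  = ToothGrowth[which(query_VC), "len"]

n_OJ = length(data_OJ)
n_VC = length(data_VC)
df_2sided = n_OJ + n_VC - 2

# Pooled variance and pooled standard deviation
samp_sig2 = ((n_OJ - 1) * var(data_OJ) + (n_VC - 1) * var(data_VC)) / df_2sided
samp_sig  = sqrt(samp_sig2)

numerator_2sided   = mean(data_OJ) - mean(data_VC)
denominator_2sided = samp_sig * sqrt(1 / n_OJ + 1 / n_VC)
T_statistic_2sided = numerator_2sided / denominator_2sided

# abs() makes the formula valid whichever group has the larger mean
p_manual = 2 * (1 - pt(abs(T_statistic_2sided), df = df_2sided))

builtin = t.test(x = data_OJ, y = data_VC, var.equal = TRUE, alternative = "two.sided")
all.equal(p_manual, builtin$p.value)   # TRUE
```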

73
Q

t.test(x = DATASET_NAME$TREAT_1, y = DATASET_NAME$TREAT_2, paired = TRUE)

A

command to run a paired two-sided t-test (as opposed to the default unpaired two-sided t-test) which compares the values of one variable (TREAT_1) to the matched values of another variable (TREAT_2), observation by observation
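
A short sketch of what pairing does, assuming the built-in sleep data set, where the same ten subjects were measured under two treatments (group 1 and group 2); the object names are hypothetical. A paired t-test gives the same result as a one-sample t-test on the within-subject differences.

```r
treat_1 = sleep$extra[sleep$group == 1]   # first treatment, subjects 1 to 10
treat_2 = sleep$extra[sleep$group == 2]   # second treatment, same subjects

paired_result = t.test(x = treat_1, y = treat_2, paired = TRUE)

# Equivalent one-sample test on the differences
diff_result = t.test(treat_1 - treat_2, mu = 0)

all.equal(paired_result$p.value, diff_result$p.value)   # TRUE
paired_result$parameter                                 # degrees of freedom: pairs minus 1
```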