Intro Flashcards

1
Q

Load libraries into R

A

library(caret)  # note: package names are case-sensitive, so library(Caret) fails

2
Q

Edit data in R

A

fix()

3
Q

View names in data frame

A

names()

4
Q

Attach a data frame to the search path so you don't have to type DataFrame$column for each column

A

attach(DataFrame)

5
Q

Basic linear model function in R

A

lm(y ~ x, data = DataFrame)

6
Q

display statistics on model

A

First assign the model output to a variable: lm.output = lm(y ~ x, data = DataFrame)
Then: summary(lm.output)

7
Q

Show information after fitting a model

A

names(model.output)

summary(model.output)

8
Q

Show model coefficients and confidence intervals

A

coef(model.output) #shows the coeff

confint(model.output) #shows the 95% conf interval for the coefficients

9
Q

Use model to predict new values

A

predict()

predict(model.output, dataframeofxs, interval = "confidence")

10
Q

Prediction vs. Confidence Interval

A

When predicting a new data point, you want a prediction interval; a confidence interval is about where the average of future values lies.
To get a PI: predict(model.output, dataframeofxs, interval = "prediction")

11
Q

Scatter Plot with Regression Line

A

plot(x, y)
abline(model.output, lwd = 3, col = "red")  # adds the regression line to the scatterplot

lwd sets the line width

12
Q

See diagnostic plots of linear regression

A

plot(model.output) #works automatically, because the model output contains the diagnostics
To see all 4 graphs at once, first create a 2x2 grid of tiles:
par(mfrow = c(2, 2))

13
Q

how does predict() work

A

predict(model.output) will return a vector of predicted Y values for the training data
predict(model.output, newdata) returns predictions for the rows of newdata

14
Q

inspect functions

A

Type the function name without parentheses (e.g. predict) to print its body

If it dispatches to a method, use methods(functionname) to list the available methods

15
Q

Get max of vector

A

which.max(vector) returns the index of the maximum; max(vector) returns the value itself

16
Q

Shorthand formula for regression in R

A

lm.fit = lm(Yvariable ~ ., data = DataFrame)

Instead of writing x1 + x2 + etc. you can just put a dot, which means "all other columns".

17
Q

Function to use when there is collinearity

A

Need to see the Variance Inflation Factor (VIF), part of the car package.
library(car)
vif(lm.fit) #use on model output
Remember: VIF > 5 indicates collinearity

18
Q

How to see a correlation matrix

A

cor() - all columns must be numeric; if a column isn't numeric, drop it with matrix notation, e.g. cor(data.frame[, -9])

19
Q

Logistic Regression R

A

logreg = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Smarket, family = binomial)

The key is family = binomial, using the glm() generalized linear model function.

20
Q

When doing logistic regression on a factor variable with two levels, how do we handle the dummy coding?

A

We don't have to do anything; the glm function dummy codes it for you automatically. However, you can retrieve the dummy coding values by using contrasts():
attach(Smarket)
contrasts(Direction)

21
Q

How to use logistic regression output to predict values in your dataset

A

predict(logreg, type = 'response')

type = 'response' tells R to output probabilities rather than the default log-odds scale.

22
Q

How to convert Logistic regression probabilities into actual predictions.

A
  1. Create a vector corresponding to the "zero"-probability class:
    logreg.predict = rep('Down', 1250)
  2. Reassign entries based on the predicted probabilities:
    logreg.predict[logreg.probs > .5] = 'Up'
23
Q

How to create a confusion matrix

A

Use the table function:
table(VectorOfPredictions, VectorOfTrueValues)
table(logreg.predict, Direction)

The two vectors have to use the same labels (like "Up"/"Down"), so make sure they are converted to the same values.
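Cards 21-23 can be combined into one runnable sketch (assuming the Smarket data from the ISLR package used in these cards; the 0.5 cutoff is the conventional choice):

```r
# Fit logistic regression on the Smarket data (ISLR package)
library(ISLR)
logreg <- glm(Direction ~ Lag1 + Lag2, data = Smarket, family = binomial)

# Predicted probabilities of the market going Up
logreg.probs <- predict(logreg, type = "response")

# Convert probabilities to class labels at a 0.5 cutoff
logreg.predict <- rep("Down", nrow(Smarket))
logreg.predict[logreg.probs > 0.5] <- "Up"

# Confusion matrix and overall accuracy
table(logreg.predict, Smarket$Direction)
mean(logreg.predict == Smarket$Direction)
```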

24
Q

sub select training data in time series

A
train = (Year < 2005)  #stores a logical vector
Smarket.2005 = Smarket[!train, ]  #data from 2005 (the rows NOT in the training set)
Direction.2005 = Direction[!train]
25
Q

How to train a logistic regression model on a subset of the data

A

Use the subset argument to glm

logreg.train = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Smarket, family = binomial, subset = train)

Then predict using the predict function

26
Q

making prediction on test set, logistic regression

A

predict(glm.fit, Smarket.2005, type = 'response')

27
Q

LDA with R

A

library(MASS) #library for LDA
lda.fit = lda(Direction~Lag1+Lag2, data=Smarket, subset=train)
lda.pred = predict(lda.fit, Smarket.2005)
names(lda.pred)
lda.class = lda.pred$class #uses 50% probability
lda.probs = lda.pred$posterior #these are the probabilities

28
Q

How to center and scale variables for distance based learning methods KNN, K Means, etc

A

scale()

gives each variable a mean of zero and stdev of 1
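A quick sketch of what scale() does, using made-up toy data:

```r
# Two columns on very different scales
x <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)

xs <- scale(x)    # center each column to mean 0, scale to sd 1

colMeans(xs)      # approximately 0 for each column
apply(xs, 2, sd)  # 1 for each column
```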

29
Q

create vector

A
a = c('Hamel', 'Bob', 'Tom')  OR
a = 1:10   OR
a = c(1:10)
30
Q

Create a matrix

A

matrix(data, nrow, ncol)

matrix(1:6, 2, 3)  # 2 rows, 3 columns, filled column-wise

31
Q

see attributes of an object

A

attributes()

32
Q

combine two vectors of equal length into data frame

A

rbind(x, y) OR
cbind(x, y)
* Trick question: they don't have to be equal length; if not, the shorter vector is recycled to match the longer one
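A tiny illustration of the recycling behavior described above (toy vectors):

```r
x <- 1:6
y <- 1:3
cbind(x, y)                 # y is recycled: its values repeat to fill 6 rows
as.data.frame(cbind(x, y))  # wrap the result if you need an actual data frame
```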

33
Q

Factor Vector

A

x = factor(c('a', 'b', 'b', 'a'))  # example values
The order of the levels can be set using the levels argument to factor(). This can be important in linear modelling because the first level is used as the baseline level.

34
Q

find missing values

A

is.na()

returns boolean vector

35
Q

differences b/w matrices and data frames in R

A

In matrices, the entire matrix has to be the same class, whereas a data frame can store different classes of objects in each column
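A small illustration of the difference (toy data):

```r
m <- cbind(1:3, c("a", "b", "c"))  # mixing types in a matrix...
class(m[, 1])                      # ...coerces everything to character

d <- data.frame(n = 1:3, s = c("a", "b", "c"))
sapply(d, class)                   # each column keeps its own class
```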

36
Q

convert data frame to matrix

A

data.matrix()

37
Q

count # of columns and rows

A

nrow(DataFrame)

ncol(DataFrame)

38
Q

how to change names of columns in R

A

names(x) = c('name1', 'name2', ...)

39
Q

find the class or data type of each column

A

sapply(dataframe, class)

40
Q

connections to files in R

A
  1. file() - Python-like interface to a file
  2. gzfile() - opens a connection to a gzipped file
  3. url() - opens a connection to a URL
    see ?file
41
Q

Readlines from a file

A

con <- file('somefile.txt', 'r')  # illustrative filename
x <- readLines(con, 10)  # read the first 10 lines

42
Q

readlines from webpage

A

con <- url('http://...')  # open a connection to the webpage
x <- readLines(con)

see ?file

43
Q

Find number of missing values in column

A

sum(is.na(Dataframe$column))

44
Q

what is model.matrix()

A
#Create a model matrix, which is normally used behind the scenes
#by regression models to predict things; it just converts your dataset
#to a model-friendly format: adds a column of 1's for the intercept and
#dummy codes all factor variables
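A minimal illustration (the column and data names here are made up):

```r
d <- data.frame(y = c(1, 2, 3, 4),
                x = c(1.5, 2.5, 3.5, 4.5),
                g = factor(c("a", "b", "a", "b")))

model.matrix(y ~ x + g, data = d)
# columns: (Intercept), x, gb -- g was dummy coded, with level 'a' as the baseline
```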
45
Q

What do you do when you perform 10-fold cross validation and are left with 10 different error estimates for a particular tuning parameter

A

You take the mean of the statistic in question across all 10 folds; no single fold is reliable on its own.
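As a sketch, with fold.errors standing in for the 10 per-fold test errors (hypothetical numbers):

```r
# One test error per fold for a given tuning-parameter value
fold.errors <- c(2.1, 1.9, 2.4, 2.0, 2.2, 1.8, 2.3, 2.1, 2.0, 1.9)

cv.error <- mean(fold.errors)  # the CV estimate for this tuning value
```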

46
Q

What happens after you find the best model using k-fold cross validation or LOOCV?

A

Once you find the best model, you should fit the model over the ENTIRE dataset as a last step and use that model. You only used CV so that you could find the optimal model that generalizes, but you can go ahead and fit the model over the entire dataset when you are done.

47
Q

what package has lasso and ridge regression

A

library(glmnet)
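A minimal sketch of typical glmnet usage (x, y, and DataFrame are illustrative names; alpha chooses the penalty):

```r
library(glmnet)

# glmnet wants a numeric predictor matrix, not a formula
x <- model.matrix(y ~ ., data = DataFrame)[, -1]  # drop the intercept column
y <- DataFrame$y

ridge.fit <- glmnet(x, y, alpha = 0)  # alpha = 0 gives ridge regression
lasso.fit <- glmnet(x, y, alpha = 1)  # alpha = 1 gives the lasso

cv.out <- cv.glmnet(x, y, alpha = 1)  # cross-validate to choose lambda
cv.out$lambda.min                     # lambda with the lowest CV error
```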