R Details Flashcards

(104 cards)

1
Q

How can you join vectors together?

A

Use c() with the names of the vectors you want to combine, e.g. for vectors called girls and boys:
> children = c(girls, boys)

2
Q

How do you check the length of a vector?

A

> length(vectorname)

3
Q

When combining vectors with c(), what is key to remember?

A

Don’t put + signs - only commas

4
Q

How to extract particular elements/numbers from a vector / data set?

A

> nameofvector[1]
The square brackets tell R which positions in the vector you want to be shown
A range of elements is written:
> nameofvector[1:7]

5
Q

How do you see a vector without certain elements

A

> nameofvector[-1]
The negative index drops the first element

6
Q

Maximum value of the vector

A

> max(vectorname)

7
Q

How do you find which elements of a vector match a given number?

A

> which(vectorname == 7)
Gives the positions of the elements that match

8
Q

Change the name of a vector

A

> newname = nameofvector
Assigns the existing vector to a new name

9
Q

How to calculate the sum of all elements

A

> sum(vector)

10
Q

Mean of elements

A

> mean(vector)

11
Q

Median of elements

A

> median(vector)

12
Q

Variance of elements

A

> var(vector)

13
Q

Standard deviation

A

> std = function(x) sqrt(var(x))
> std(vector)
This defines the standard deviation as the square root of the variance (base R also provides sd(vector))

14
Q

Normality test example

A

Shapiro-Wilk test

15
Q

When should you use the Shapiro-Wilk test?

A

To test the null hypothesis: the data are drawn from a normal population
The p-value is the probability of obtaining data at least this far from normality if the null hypothesis is true
A p-value below 0.05 (5%) allows us to reject the null hypothesis and conclude the data are not normal
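A minimal sketch of running the test in R, assuming a numeric vector called vector (a placeholder name):
> shapiro.test(vector)   # reports W and a p-value; p < 0.05 suggests the data are not normal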

16
Q

What do you do if your data is not Normal?

A

Calculate a non-parametric measure of data spread, e.g. the interquartile range:
> IQR(vector)
Or the median absolute deviation (MAD): this finds the median of the absolute differences from the median and then multiplies by a constant (1.4826), which makes it comparable with the standard deviation:
> mad(vector)

17
Q

What is the code for summary and what does it show you?

A

> summary(vector)
Reports:
Minimum
Maximum
Median
Mean
1st quartile
3rd quartile

18
Q

How do you graphically show that random data is approx normal?

A

A "normal probability plot" (Q-Q plot)
Any curving shows that the distribution has short or long tails
The line is drawn through the points formed by the 1st and 3rd quartiles:
> qqnorm(vector, main = "Normal (0,1)")
> qqline(vector)

19
Q

What does a data transformation do?

A

Attempts to make the data approximately normal before parametric stats can be applied
If the data cannot be transformed to normality, non-parametric stats have to be used

20
Q

Common data transformation process

A

Logarithm of the data: log(x + 1)
> qqnorm(log(vector + 1))
> qqline(log(vector + 1))
Test whether it worked with a normality test (e.g. Shapiro-Wilk)

21
Q

Bar charts in R

A

> barplot(vector)

22
Q

How to generate a more informative barplot

A

> table(vector)
> barplot(table(vector))

23
Q

How to change the scale on a barplot

A

> barplot(table(vector) / length(vector))
Dividing the counts by the number of observations converts the scale to relative frequencies

24
Q

How to add labels to a barplot

A

> labels = as.vector(c("one", "two", "three"))
> barplot(table(vector) / length(vector), names.arg = labels, xlab = "Number of children", ylab = "Relative frequency")
names.arg supplies the bar labels; xlab and ylab label the axes

25
Histogram code
> hist(vector)
26
How to upload a larger data set
> dataset = read.table("name of file", header = TRUE)
> attach(dataset)
> dataset   # shows the attached data set
> summary(dataset)
27
Binomial or chi-squared
Nominal or frequency data; 2 categories
28
Chi-squared
Nominal or frequency data; more than 2 categories
29
Pearson product moment / Spearman rank
Interval or ratio data and measures with a reasonably normal distribution; 2 conditions; testing hypotheses about correlation - the relationship between two dependent variables
30
Simple linear regression
Interval or ratio data and measures with a reasonably normal distribution; 2 conditions; testing hypotheses about regression - the effect of an independent variable upon a dependent variable
31
T test
Interval or ratio data and measures with a reasonably normal distribution; 2 conditions; testing hypotheses about means; independent measures design
32
T test
Interval or ratio data and measures with a reasonably normal distribution; 2 conditions; testing hypotheses about means; matched measures or repeated measures designs
33
Analysis of variance - ANOVA Parametric
Interval or ratio data and measures with a reasonably normal distribution; more than 2 conditions; testing hypotheses about means - differences between means. Null hypothesis: there is no significant difference between the means of the conditions
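A minimal one-way ANOVA sketch in R, assuming a data frame df with a numeric column value and a factor column group (hypothetical names):
> fit = aov(value ~ group, data = df)
> summary(fit)   # F statistic and p-value for differences between the group means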
34
Multiple linear regression
Interval or ratio data and measures with a reasonably normal distribution; more than 2 conditions; testing hypotheses about regression (the effect of 2 or more independent variables upon a dependent variable)
35
Spearman rank
Ordinal data or non-normal distribution of measure; 2 conditions; testing hypotheses about correlation - the relationship between two dependent variables
36
Mann- Whitney
Ordinal data or non-normal distribution of measure; 2 conditions; testing hypotheses about medians; independent measures
37
Wilcoxon
Ordinal data or non-normal distribution of measure; 2 conditions; testing hypotheses about medians; repeated measures
38
Kruskal-Wallis
Ordinal data or non-normal distribution of measure; more than 2 conditions; non-parametric analysis of variance; independent measures
39
Friedman
Ordinal data or non-normal distribution of measure; more than 2 conditions; non-parametric analysis of variance; repeated measures
40
Continuous variable
Takes on any value within a given range. There are an infinite number of possible values, limited only by our ability to measure them, e.g. distance
41
Discrete variable
Takes only certain distinct values within a given range. The scale is still meaningful, but intermediate values (e.g. halves) are not possible
42
Categorical variable
One in which the value taken by the variable is a non-numerical category or class
43
Ranked variable
A categorical variable in which the categories imply some order or relative position. Numerical values are usually assigned, but 4 is not necessarily twice as much as 2
44
How to set class intervals
1. Use intervals of equal length with midpoints at convenient round numbers
2. For small data sets, use a small number of intervals
3. For large data sets, use more intervals
45
Stem leaf plots
Allow a summary of the data while retaining the original values
1. The stem consists of a column of figures, omitting the last digit
2. Add the final digit of each value as a leaf against its stem
3. Put the "leaves" in order
46
Interquartile range IQR
Based on the median. Divides the data into four equal groups and looks to see how far apart the extreme groups are
1. Put the data in numerical order
2. Find the overall median. Divide the data set into two subsets of equal size. If n for the whole data set is odd, put the overall median in both subsets
3. Find the median of the lower group. This is the first quartile
4. Find the median of the upper group. This is the third quartile
The interquartile range is IQR = Q3 - Q1
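A minimal sketch of getting the quartiles and IQR directly in R, assuming a numeric vector called vector (a placeholder name):
> quantile(vector, c(0.25, 0.75))   # first and third quartiles
> IQR(vector)                       # Q3 - Q1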
47
What is a box whisker plot
A way to illustrate the IQR. A good way to demonstrate the differences between groups
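A minimal sketch, assuming a numeric vector values and a grouping factor groups (hypothetical names):
> boxplot(vector)            # box and whisker plot of a single vector
> boxplot(values ~ groups)   # one box per group, for comparing groups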
48
Standard deviation
A measure of spread around the mean. Roughly the average distance of the data points from the mean
49
Random variable?
Is the numerical outcome of a random experiment
50
Binomial distribution
Describes a variable with just two possible outcomes, e.g. a single toss of a coin - one outcome = success and the other = failure
51
What are the four attributes of the normal distribution?
Depending on its standard deviation it can be wide and flat or narrow and high
52
Chi- squared test
Suitable for frequency data: counts of things. Do the numbers of individuals in different categories fit a null hypothesis of some sort (the expectation)?
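A minimal sketch in R, assuming a vector of observed counts and expected proportions (hypothetical values and names):
> observed = c(30, 70)
> chisq.test(observed, p = c(0.5, 0.5))   # goodness of fit against the expected proportions
> chisq.test(table(factorA, factorB))     # test of association for two categorical variables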
53
Yates correction of 1df
Apply where there are only two categories of data (e.g. male and female)
Subtract 0.5 from each value of O - E, ignoring the sign: |O - E| - 0.5
Continue the rest of the calculation as normal
54
Mann-Whitney test- detailed
Non-parametric alternative to the unpaired t-test. Tests for a significant difference between the medians of two independent groups. Use this test when one or both groups have a non-normal distribution
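A minimal sketch, assuming two numeric vectors group1 and group2 (hypothetical names); R's wilcox.test performs the Mann-Whitney U test when unpaired:
> wilcox.test(group1, group2)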
55
Wilcoxon paired sample test- detailed
Non-parametric alternative to the paired t-test. Tests for a significant difference between the medians of paired observations. Use this test when one or both groups have a non-normal distribution (or cannot be transformed to normality)
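A minimal sketch, assuming paired numeric vectors before and after (hypothetical names):
> wilcox.test(before, after, paired = TRUE)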
56
Kruskal-Wallis
Non-parametric one-way analysis of variance. Non-parametric alternative to one-way ANOVA
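A minimal sketch, assuming a numeric vector values and a grouping factor groups (hypothetical names):
> kruskal.test(values ~ groups)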
57
Friedman's
Non-parametric alternative to two-way analysis of variance. Used to detect differences in medians between three or more treatments of the same subjects. Wide variation of the standard deviations for rows or columns of a data matrix suggests that we cannot use parametric ANOVA
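A minimal sketch, assuming a matrix scores with one row per subject and one column per treatment (hypothetical name):
> friedman.test(scores)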
58
Parametric stats
Based on assumptions about the distribution of the population from which the sample was taken
Evaluate hypotheses for a particular parameter, usually the population mean
Quantitative data
Require assumptions about the distributional characteristics of the population - normal data, equal variance
More powerful than non-parametric tests when the assumptions are met
59
Non parametric stats
Evaluate hypotheses for entire population distributions
Quantitative, ranked, or qualitative data
Require no assumptions (distribution free), so used with non-normal distributions and when the variances of the groups are not equal
Generally easy to compute
60
List of parametric tests
Paired t-test
Unpaired t-test
Pearson correlation
ANOVA
61
Non parametric test- examples
Wilcoxon rank sum test
Mann-Whitney U test
Spearman correlation
Kruskal-Wallis test
Friedman test
62
Hierarchical clustering - what is it
1. A way to find hierarchical patterns of similarity between sets of objects
2. Not a test. There is no null hypothesis. No assumption about the distribution of the data
63
When to use hierarchical clustering?
You have objects or things described by a large number of continuous or discrete variables. Some implementations also work with ordinal or categorical variables. Allows you to visualise the similarities graphically (dendrogram or tree)
64
Hierarchical clustering: three steps
1. Data transformation (e.g. Z-scores)
2. Matrix of similarities, differences or distances (e.g. Euclidean)
3. Clustering algorithm (e.g. UPGMA / average linkage)
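A minimal sketch of the three steps in R, assuming a numeric data frame df (a placeholder name):
> z = scale(df)                       # 1. transform to Z-scores
> d = dist(z, method = "euclidean")   # 2. distance matrix
> hc = hclust(d, method = "average")  # 3. UPGMA / average linkage clustering
> plot(hc)                            # dendrogram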
65
Principal component analysis (PCA)- what is it
1. A data reduction technique
2. Not a test. There is no null hypothesis. No assumption about the distribution of the data
66
When to use principal component analysis (PCA) - i.e. which variables, what do you explore, what do you see?
You have objects or things described by a large number of continuous or discrete variables (not ordinal or categorical). You want to explore the differences between the objects as measured by all the variables simultaneously. Allows you to visualise this graphically (space-filling model)
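A minimal sketch in R, assuming a numeric data frame df (a placeholder name):
> pca = prcomp(df, scale. = TRUE)   # scale. = TRUE works from the correlation matrix (equal weight per variable)
> summary(pca)                      # variance explained by each component
> biplot(pca)                       # objects and variables on the first two components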
67
Multiple regression - what is it?
1. An extension of linear regression to situations where there is more than one independent variable
2. A data reduction technique. Seeks to explain a reasonable fraction of the variance in the dependent variable using only some of the independent variables
68
When to use multiple regression?
You have objects or things described by a large number of continuous or discrete variables. These are distributed reasonably normally
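A minimal sketch, assuming a data frame df with a dependent variable y and predictors x1 and x2 (hypothetical names):
> fit = lm(y ~ x1 + x2, data = df)
> summary(fit)   # coefficients, R-squared and p-values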
69
Why do you test for normality before performing a variance ratio test for the equality of variances
The variance test is sensitive to departures from normality
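A minimal sketch of the variance ratio (F) test in R, assuming numeric vectors group1 and group2 (hypothetical names):
> var.test(group1, group2)   # null hypothesis: the two population variances are equal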
70
Independent variables = random
e.g. temperature = a random factor → ANOVA with random effects
71
Independent variables - fixed
e.g. barley = 2 varieties (a fixed factor)
72
Interval or ratio data and measures with a reasonably normal distribution
Categories are ranked and have equal spacing between adjacent values; only ratio-scaled data have a true zero - zero is treated as a point of origin
73
2 categories - which parametric tests
Binomial / chi-squared
2 conditions: Pearson product moment, simple linear regression, t-test (unpaired and paired)
74
Factorial analysis
Describes variability among observed, correlated variables
75
Describe the graphs of both Pearson's regression and Spearman's rank
Pearson's regression = straight line; Spearman's rank = only has to be correlated (monotonic)
76
Simple linear regression
A regression model that estimates the relationship between one independent variable and one dependent variable using a straight line
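A minimal sketch, assuming numeric vectors x and y (hypothetical names):
> fit = lm(y ~ x)
> summary(fit)              # intercept (a), slope (b), R-squared
> plot(x, y); abline(fit)   # scatter plot with the fitted line added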
77
Regression analysis
A reliable method of identifying which variables have an impact on a topic of interest. The process of performing a regression allows you to determine which factors matter most, which factors can be ignored, and how the factors influence each other
78
Tests that are used for more than 2 categories
Chi-squared
ANOVA
Kruskal-Wallis
Friedman
79
Multiple linear regression
Regression model that estimates the relationship between a quantitative dependent variable and 2 or more independent variables using a straight line
80
Principal component analysis (PCA) regression - when is it used?
For variables that are strongly correlated. The PCA technique is used in processing data where multicollinearity exists between the features / variables
81
How to test for significant differences between medians of 2 paired observations 3 steps
1. Calculate the difference between each pair of observations
2. Take the absolute differences
3. Rank the absolute differences
82
Ordinal data meaning
Ordinal data violate the assumption of a normal distribution - categories within a variable that have a natural rank order
83
Variance
Tells you the degree of spread in your data set
84
Variance test - what does it do
Tests whether the variances of the 2 populations from which the samples have been drawn are equal or not
85
SSyy = SSR + SSE
SSyy = total variation in Y; SSR = variation explained by the regression; SSE = unexplained (error) variation
86
How to calculate SSE
Sum of the squared errors: SSE = Σ(y − ŷ)², the sum of the squared differences between the observed and predicted values of Y
87
The regression/least squares line…
Is the line with the smallest SSE
88
What is regression analysis - equation
y = a + bx
x = predictor variable, y = response variable, a = intercept, b = slope
89
Regression analysis explained
Regression analysis is a set of statistical methods used for the estimation of relationships between a dependent variable and one or more independent variables It can be utilised to assess the strength of the relationship between variables and for modelling the future relationship between them
90
What is SSR
SSR is the additional amount of explained variability in Y due to the regression model compared to the baseline model
91
What is multicollinearity and why is it a problem
The phenomenon in which one predictor variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy. It is a problem because it undermines the statistical significance of an independent variable
92
What does it mean if the standard error of a regression coefficient is large
The coefficient will be less statistically significant
93
What is PCA
A tool for exploring the structure of multivariate data. A data reduction technique - allows us to reduce the number of variables to a manageable number of new variables or components
94
Limitations of PCA
Variables must be continuous or on an interval scale
95
Two types of PCA
Covariance matrix - applies more weight to some variables than others
Correlation matrix - expresses each variable with equal weight
96
1 sample t test
Tests whether one sample mean is significantly different from a set (specified) mean
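A minimal sketch, assuming a numeric vector x and a set mean of 5 (both hypothetical):
> t.test(x, mu = 5)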
97
2 sample t test
Tests whether two unknown population means are equal or not
98
Unpaired t test
2 different categories, e.g. the weights of different lemurs, or how many carrots boys and girls eat
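A minimal sketch, assuming numeric vectors group1 and group2 (hypothetical names):
> t.test(group1, group2)                     # Welch's t-test (unequal variances) by default
> t.test(group1, group2, var.equal = TRUE)   # classic unpaired t-test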
99
Paired t test
Speed of a human wearing one type of shoe compared to another. The measurements must be paired because humans differ in running speed regardless of shoe type
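A minimal sketch, assuming vectors shoe1 and shoe2 with one measurement per runner (hypothetical names):
> t.test(shoe1, shoe2, paired = TRUE)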
100
One tailed
Only one way the results can go - a directional hypothesis. The critical area of the distribution is (for example) only where the statistic is greater than the value specified in the null hypothesis
101
Two tailed
The critical area of the distribution is two-sided and tests whether a sample is greater than or less than a certain range of values, e.g. whether Group A scored higher or lower than Group B
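A minimal sketch of choosing the tail in R's t.test, assuming vectors group1 and group2 (hypothetical names):
> t.test(group1, group2, alternative = "two.sided")   # default two-tailed: higher or lower
> t.test(group1, group2, alternative = "greater")     # one-tailed: group1 greater than group2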
102
Types of hierarchical clustering
Pubs in two towns = pre-defined clusters (already close together)
Geographical midpoint of all Swindon pubs and the midpoint of all Bath pubs, measuring that distance = centroid clustering
Average distance between every pub in Swindon and every pub in Bath = average linkage clustering
Closest pair of pubs, one from each town = single linkage or nearest neighbour clustering
Take the most distant pair = complete linkage clustering
103
Distance matrix
Condenses univariate distances down to a single number
Add them up: Manhattan / city block distance
Or Euclidean distance = the square root of the sum of the squares of the univariate distances
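A minimal sketch in R, assuming a numeric data frame df (a placeholder name):
> dist(df, method = "manhattan")   # Manhattan / city block distance
> dist(df, method = "euclidean")   # Euclidean distance (the default)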
104
Requirements of Mann-Whitney
Rank the observations as if they were a single sample, e.g. from smallest to largest