Simple Linear Correlation and Regression Flashcards

1
Q

Regression Analysis

A
  • Regression analysis is a way of estimating the relationship between variables by examining how the system behaves
  • There are many techniques for modeling and analyzing the relationship between dependent and independent variables
  • You are basically trying to derive an equation from the graph of your data.
2
Q

Linear Regression Analysis

A

The easiest kind of regression is linear regression. Imagine that all of your data lined up in a neat row. You could draw a straight line connecting all points and would be able to create a simple equation Y = mx + b that we talked about earlier. That way you would have a model that would faithfully predict what your system would do given any input of x.

But what if your data only “kinda-sorta” looks like a line?

Multiple linear regression extends the methodology of simple linear regression to more than one independent variable.

3
Q

Linear Regression

A
  • Statistical technique to estimate the mathematical relationship between a dependent variable (usually denoted as Y) and an independent variable (usually denoted as X).
    • In other words, predict the change in the dependent variable according to the change in the independent variable.
    • Dependent Variable or Criterion Variable - the variable for which we wish to make a prediction
    • Independent Variable or Predictor Variable - the variable used to explain the dependent variable
4
Q

When to Use Linear Regression

A
  • In simple linear regression, there is only one independent variable used to predict a single dependent variable.
  • In multiple linear regression, more than one independent variable is used to predict a single dependent variable.
    • The basic difference between simple and multiple regression is the number of explanatory variables.
      • E.g. simple linear regression: compare the crop yield rate against the rainfall rate in a season
5
Q

Notes about Linear Regression

A
  • The first step of linear regression is to test the linearity assumption. Plot the values on a scatter plot to observe the relationship between the dependent and independent variables. If the points follow a clearly non-linear pattern (for example, exponential), a straight-line regression equation is not meaningful.
  • Draw the line that passes closest to the majority of the points
    • this line is called the best-fit line or line of best fit
  • The mathematical equation of the line is
    • y=a+bx+ε
      • Where:
      • b – slope of the line
      • a – y-intercept (the value of y when x = 0)
      • ε (epsilon) – random error: the difference between an observed value of y and the mean value of y for a given value of x
6
Q

Assumptions of Linear Regression

A
  • There is a linear relationship between the dependent and independent variables
  • All variables in the regression should be multivariate normal
  • There is little or no multicollinearity in the data
  • The response variable is continuous, and the residuals have roughly the same spread along the whole regression line (homoscedasticity)
7
Q

The Method of Least Squares

A
  • The method of least squares is a standard approach in regression analysis for determining the best-fitting line for a given set of data
    • It gives a visual summary of the relationship between the data points
  • In general, the dependent variable is plotted on the y-axis
  • The independent variable is plotted on the x-axis
  • The least squares method determines the position of a straight line, also called the trend line, and the equation of that line.
    • This straight line is also known as the best-fit line

The least squares method means that the overall solution minimizes the sum of the squares of the errors, that is, the vertical differences between each observed y and the line. The least squares equations are used to find the values of the coefficients a and b.

The normal rules of standard deviation apply here if the errors are roughly normally distributed: about 68% of the points should be within +/- 1 standard error of the line, and about 95.5% of the points within +/- 2 standard errors.
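As a minimal sketch of the least squares calculation described in this card, the coefficients can be computed from the deviations about the means. The function name `least_squares_fit` and the five data points are illustrative assumptions, not values from the deck.

```python
def least_squares_fit(xs, ys):
    """Return (a, b) for the line y = a + b*x that minimizes squared error."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept
    return a, b


# Illustrative data only (hypothetical, not from the flashcards)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares_fit(xs, ys)
print(f"y = {a:.2f} + {b:.2f}x")
```

The intercept line uses the same relation that appears later in the DMAIC example: the intercept is the mean of y minus the slope times the mean of x.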

8
Q

Least Squares

a and b computed

A
9
Q

Linear Regression example in DMAIC

A
  • Linear regression is used in the Analyze phase of DMAIC to estimate the mathematical relationship between a dependent variable and an independent variable.
  • Example: A passenger vehicle manufacturer reviews the training records of its 10 salespersons. The aim is to compare each salesperson's achieved target (in %) with the number of sales training modules completed.
    • â = ȳ − b̂x̄
      • where ȳ = mean % of sales target achieved = total / n = 822 / 10 = 82.2
10
Q

Estimate the Variability of Random Errors

A
11
Q

Estimate the Variability of Random Errors

Example

A

σ̂ε = √(σ̂ε²)

E.g. σ̂ε² = 28.95

σ̂ε = √28.95 ≈ 5.38
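The variance estimate itself is not shown on the card. A common estimator, assumed here, divides the sum of squared residuals by n − 2. The sketch below uses a hypothetical data set and fitted line (a = 2.2, b = 0.6) and then checks the card's own square root.

```python
import math


def residual_std_error(xs, ys, a, b):
    """Return (variance, standard error) of the residuals about y = a + b*x.

    Assumes the usual n - 2 divisor for simple linear regression.
    """
    n = len(xs)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    variance = sse / (n - 2)
    return variance, math.sqrt(variance)


# Illustrative data and coefficients (hypothetical, not from the flashcards)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
var_e, sigma_e = residual_std_error(xs, ys, a=2.2, b=0.6)
print(f"variance = {var_e:.2f}, standard error = {sigma_e:.3f}")

# The card's example: the square root of 28.95
card_sigma = round(math.sqrt(28.95), 2)
print(card_sigma)
```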

12
Q

Test of Slope Coefficient

A
  • The existence of a significant relationship between the dependent and independent variables can be tested by checking whether the slope b is equal to 0. If b is not equal to 0, there is a linear relationship.
  • The null and alternative hypotheses are:
    • The null hypothesis H0 : b=0
    • The alternative hypothesis H1: b≠0
  • Degrees of freedom = n-2
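A minimal sketch of this test, assuming the usual test statistic t = b / s_b, where s_b is the standard error of the slope. The data are hypothetical, and the critical value 3.182 is the standard two-tailed t-table entry for alpha = 0.05 with 3 degrees of freedom.

```python
import math


def slope_t_statistic(xs, ys):
    """Return (t, degrees of freedom) for testing H0: b = 0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    # Residual standard error with n - 2 degrees of freedom
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))
    s_b = s_e / math.sqrt(s_xx)  # standard error of the slope
    return b / s_b, n - 2


# Illustrative data only (hypothetical, not from the flashcards)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
t_stat, df = slope_t_statistic(xs, ys)

T_CRITICAL = 3.182  # two-tailed, alpha = 0.05, df = 3 (from a t-table)
reject_h0 = abs(t_stat) > T_CRITICAL
print(f"t = {t_stat:.3f}, df = {df}, reject H0: {reject_h0}")
```

With only five points, the computed t falls short of the critical value, so H0 is not rejected for this sample.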
13
Q

Test of Slope Coefficient

Example

A

t-table critical value chart

OR

Refer to Appendix Q

Values of the t-Distribution

Located in Handbook 2nd Edition Green Tab

14
Q

Confidence Interval Estimate for the Slope b

A
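The formula for this card did not survive. The standard textbook form, given here as an assumption about what the card showed, builds the interval from the estimated slope, the t critical value with n − 2 degrees of freedom, and the standard error of the slope:

```latex
\hat{b} \;\pm\; t_{\alpha/2,\; n-2}\, s_b,
\qquad
s_b = \frac{\hat{\sigma}_\varepsilon}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}
```

If the interval does not contain 0, the result agrees with rejecting H0: b = 0 at the same alpha level.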
15
Q

Correlation Coefficient

Notes and Formula

A
16
Q

Correlation Coefficient

Example

A
17
Q

How is correlation analysis used to compare bivariate data?

A
  • Measures of central tendency and spread (such as variance) summarize a single variable by providing important information about its distribution.
  • Often, more than one variable is collected in a study or experiment.
  • When two variables are measured on a single experimental unit, the resulting data are called bivariate data.
    • Ex. job satisfaction stratified by income
  • With bivariate data, the usual question is whether one variable influences the other.
    • The values of the two variables are often plotted on a scatter plot to explore the relationship between them
  • Depending on the type of data, bivariate data can be described with graphs and numerical measures.
    • If one or both variables are qualitative, use a pie chart or bar chart
      • For example, compare the relationship between opinion and gender.
    • If the two variables are quantitative, use a scatter plot.
    • The correlation coefficient is often used in comparing bivariate data
18
Q

Correlation Coefficient Example

A
  • The correlation coefficient varies between -1 and +1.
    • Values approaching -1 or +1 indicate strong correlation (negative or positive) and values close to 0 indicate little or no correlation between x and y
  • Correlation does NOT mean causation
  • A positive correlation can be either good or bad news
  • A negative correlation is not necessarily bad news.
    • It merely means that as the independent variable increases, the dependent variable decreases
  • r = 0 does not indicate the absence of a relationship; a curvilinear pattern may exist
  • r = -0.76 has the same predictive power as r = +0.76
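The points about r = 0 and the sign of r can be illustrated with a short sketch. The helper `pearson_r` and the data are hypothetical. A perfect parabola on symmetric x values gives r = 0 even though y is fully determined by x, and flipping the slope only flips the sign of r.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient from sums of deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


xs = [-2, -1, 0, 1, 2]
r_curved = pearson_r(xs, [x ** 2 for x in xs])    # curvilinear pattern
r_up = pearson_r(xs, [2 * x + 1 for x in xs])     # rising straight line
r_down = pearson_r(xs, [-2 * x + 1 for x in xs])  # falling straight line
print(r_curved, r_up, r_down)
```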
19
Q

Coefficient of determination (R²)

A
  • The coefficient of determination is the explained variation divided by the total variation, that is, the proportion of the total variation in y that is explained by the linear regression.
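Written out in the usual sums-of-squares notation (the symbols SSR, SSE and SST are assumptions, since the deck does not name them), the definition is:

```latex
R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST},
\qquad
SST = \sum_{i}(y_i - \bar{y})^2,\quad
SSR = \sum_{i}(\hat{y}_i - \bar{y})^2,\quad
SSE = \sum_{i}(y_i - \hat{y}_i)^2
```

Here ŷ_i is the value predicted by the fitted line y = a + bx, and SST = SSR + SSE.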
20
Q

Coefficient of determination (R²)

Example

A
21
Q

Coefficient of Correlation is r

A

Just take the square root of the coefficient of determination: r = Sqrt(R Squared). Give r the same sign as the slope b.

22
Q

Measuring the validity of the model

A
  • Use the F-statistic to find the p-value of the model
  • The degrees of freedom for the regression equal the number of Xs in the equation (in simple linear regression this is 1, because there is only one x in the equation y=mx+b)
  • The smaller the p-value, the better.
    • In practice, you judge this by choosing the acceptable level of alpha risk and checking whether it is greater than the p-value.
      • Example: If your alpha risk level is 5% and the p-value is 0.014, then p < alpha, so you reject the null hypothesis that the slope is 0 (that the model explains nothing). In this case the fitted line is a statistically significant model.
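A minimal sketch of the F-statistic and the decision rule, assuming the usual definition F = (SSR / 1) / (SSE / (n − 2)) for simple linear regression. The function names and data are hypothetical. The p-value would come from an F-table or statistics package, so only the card's example value is used here.

```python
def regression_f_statistic(xs, ys):
    """Return (F, regression df, error df) for a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained
    sst = sum((y - y_bar) ** 2 for y in ys)                    # total
    ssr = sst - sse                                            # explained
    df_regression = 1   # one x in the equation
    df_error = n - 2
    f_stat = (ssr / df_regression) / (sse / df_error)
    return f_stat, df_regression, df_error


def is_significant(p_value, alpha=0.05):
    """Reject H0 (model explains nothing) when p is below alpha."""
    return p_value < alpha


# Illustrative data only (hypothetical, not from the flashcards)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
f_stat, df_reg, df_err = regression_f_statistic(xs, ys)
print(f"F = {f_stat:.2f} with ({df_reg}, {df_err}) degrees of freedom")

# The card's example: alpha = 5%, p = 0.014
card_result = is_significant(0.014, alpha=0.05)
print(card_result)
```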
23
Q

Steps to Calculate the Correlation Coefficient

A
  1. Calculate the mean of all x values (x bar) and the mean of all y values (y bar)
  2. Calculate the sample standard deviation of all x values (Sx) and of all y values (Sy)
  3. Calculate (x - x bar) and (y - y bar) for each pair (x, y), and then multiply these two differences together
  4. Get the sum by adding all these products together
  5. Divide the sum by Sx × Sy
  6. Divide the result of step 5 by n - 1, where n is the number of (x, y) pairs
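The six steps above can be sketched directly in code. This version assumes Sx and Sy are sample standard deviations (n − 1 divisor), which is what makes the final division by n − 1 give r. The function name and data are hypothetical.

```python
import statistics


def correlation_by_steps(xs, ys):
    """Compute the correlation coefficient r following the six steps."""
    n = len(xs)
    # Step 1: means
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    # Step 2: sample standard deviations
    s_x = statistics.stdev(xs)
    s_y = statistics.stdev(ys)
    # Step 3: product of deviations for each pair
    products = [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]
    # Step 4: sum of the products
    total = sum(products)
    # Step 5: divide by Sx * Sy
    scaled = total / (s_x * s_y)
    # Step 6: divide by n - 1
    return scaled / (n - 1)


# Illustrative data only (hypothetical, not from the flashcards)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation_by_steps(xs, ys)
print(f"r = {r:.4f}")
```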