Linear Regression

Kshitij Srivastava 2019-04-11 5 minute read

Regression on Auto dataset

We will first fit a simple linear regression model using the Auto dataset. Next we will examine the fit using various diagnostic plots.

library(ISLR)
## Warning: package 'ISLR' was built under R version 3.5.3
head(Auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

Fitting a simple linear model

Let’s use mpg as the response and horsepower as the predictor.

modelFit <- lm(mpg~horsepower, data=Auto)
summary(modelFit)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

We get a good R-squared value. R-squared measures the proportion of variance in the response that is explained by the predictor(s). We also see that the predictor has a low standard error and a low p-value, which tells us it is statistically significant. The p-value here tests the null hypothesis that, holding all other predictors constant (there are none at the moment), horsepower has no effect on mpg. Given the low p-value, we can reject the null hypothesis. We also note that the relationship between mpg and horsepower is negative.
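As a quick sanity check (a sketch of my own, not part of the original post), we can reproduce R-squared from the residual and total sums of squares:

# R-squared = 1 - RSS/TSS: the share of variance in mpg explained by horsepower
rss <- sum(residuals(modelFit)^2)            # residual sum of squares
tss <- sum((Auto$mpg - mean(Auto$mpg))^2)    # total sum of squares
1 - rss / tss                                # should match Multiple R-squared (~0.6059)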

Prediction using the model

Let’s try to predict the value of mpg when horsepower is 98, and also obtain the confidence and prediction intervals around the predicted value. Prediction intervals are inherently wider than confidence intervals because they also account for the irreducible error in the response.

predict(modelFit, data.frame(horsepower = c(98)), type = "response")
##        1 
## 24.46708
predict(modelFit, data.frame(horsepower = c(98)), type = "response", interval = "confidence")
##        fit      lwr      upr
## 1 24.46708 23.97308 24.96108
predict(modelFit, data.frame(horsepower = c(98)), type = "response", interval = "prediction")
##        fit     lwr      upr
## 1 24.46708 14.8094 34.12476
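To see where the extra width of the prediction interval comes from, we can compare the two standard errors directly. This is a sketch of my own (the variable names pc, se_conf, and se_pred are mine): the prediction interval’s standard error adds the residual variance, sigma^2, to the uncertainty of the fitted mean.

# Standard error of the mean response vs. standard error for a new observation
pc <- predict(modelFit, data.frame(horsepower = 98), se.fit = TRUE)
sigma <- summary(modelFit)$sigma          # residual standard error (~4.906 above)
se_conf <- pc$se.fit                      # drives the confidence interval
se_pred <- sqrt(pc$se.fit^2 + sigma^2)    # drives the (wider) prediction interval
c(confidence = se_conf, prediction = se_pred)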

Regression diagnostics

Let’s now plot the response against the predictor.

plot(Auto$horsepower,Auto$mpg)
abline(modelFit)

We can see some hints of non-linearity here. Let’s plot some diagnostics.

par(mfrow=c(2,2))
plot(modelFit)

In the residuals vs. fitted plot we can see a clear non-linear trend.

Let’s use the car package to plot some more diagnostics, starting with a test for outliers based on studentized residuals. Studentized residuals are obtained by dividing each residual by an estimate of its standard deviation, which puts all residuals on a comparable scale.

library(car)
## Warning: package 'car' was built under R version 3.5.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 3.5.2
outlierTest(modelFit)
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##     rstudent unadjusted p-value Bonferonni p
## 323 3.508709         0.00050294      0.19715

It can be difficult to decide the threshold at which a point should be treated as an outlier. A common rule of thumb is to investigate, and possibly remove, observations whose studentized residuals exceed 3 in absolute value.
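A quick way to apply this rule of thumb, sketched with base R’s rstudent:

# Flag observations whose studentized residuals exceed 3 in absolute value
rs <- rstudent(modelFit)
which(abs(rs) > 3)   # observation 323 (rstudent = 3.51 above) exceeds the cutoff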

High-leverage points are observations whose unusual predictor values give them the potential to influence the fit more than other points. We can use the leveragePlots function in R to check for them.

leveragePlots(modelFit)
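Hat values give a numeric measure of leverage. A common convention, sketched below (the 2(p+1)/n cutoff is a rule of thumb, not part of the original analysis), flags points whose hat value is more than twice the average:

# Hat values measure leverage; flag points above 2*(p+1)/n
hv <- hatvalues(modelFit)
cutoff <- 2 * (1 + 1) / nrow(Auto)   # p = 1 predictor here
which(hv > cutoff)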

Detecting influential points

Added-variable plots show the relationship between the response and each predictor after adjusting for the remaining predictors; influential observations tend to stand out in them.

avPlots(modelFit)
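Cook’s distance gives a single numeric summary of influence, combining leverage and residual size. A small sketch (the 4/n cutoff is a common convention, not from the original post):

# Flag observations with Cook's distance above 4/n for a closer look
cd <- cooks.distance(modelFit)
which(cd > 4 / nrow(Auto))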

Non-normality of residuals

A QQ plot compares the quantiles of the studentized residuals against a theoretical reference distribution; systematic departures from the line indicate non-normal residuals.

qqPlot(modelFit, main="QQ Plot")

## 323 330 
## 321 328

Let’s look at the distribution of studentized residuals using the MASS package.

library(MASS)
sresid <- studres(modelFit) 
hist(sresid, freq=FALSE, 
   main="Distribution of Studentized Residuals")

Heteroscedasticity

Heteroscedasticity is the property of residuals having non-constant variance. It can be checked by plotting studentized residuals against fitted values.

spreadLevelPlot(modelFit)

## 
## Suggested power transformation:  0.7680649

We can see a slight positive slope, but not enough to be a cause for concern. (The suggested power transformation is the exponent spreadLevelPlot recommends raising the response to in order to stabilize the variance.) The non-linear pattern in the studentized residuals is a bigger concern.
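For a formal check, the car package also provides ncvTest, a score test of the null hypothesis of constant error variance against the alternative that the variance changes with the fitted values (shown as a sketch; its output is not reproduced here):

# Score test for non-constant error variance
ncvTest(modelFit)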

Trying the gvlma package for testing linear model assumptions

library(gvlma)
## Warning: package 'gvlma' was built under R version 3.5.2
gvmodel <- gvlma(modelFit) 
summary(gvmodel)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = modelFit) 
## 
##                      Value   p-value                   Decision
## Global Stat        110.921 0.000e+00 Assumptions NOT satisfied!
## Skewness            15.847 6.869e-05 Assumptions NOT satisfied!
## Kurtosis             1.458 2.273e-01    Assumptions acceptable.
## Link Function       81.186 0.000e+00 Assumptions NOT satisfied!
## Heteroscedasticity  12.430 4.224e-04 Assumptions NOT satisfied!

The Global Stat tells us whether the relationship between y and x is truly linear. Clearly, as we concluded earlier from the diagnostic plots, our model is not linear.