# Linear Regression

## Regression on Auto dataset

We will first fit a simple linear regression model using the **Auto** dataset. Next we will examine the fit using various diagnostic plots.

`library(ISLR)`

`## Warning: package 'ISLR' was built under R version 3.5.3`

`head(Auto)`

```
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
```

## Fitting a simple linear model

Let’s use mpg as response and horsepower as predictor

```
modelFit <- lm(mpg~horsepower, data=Auto)
summary(modelFit)
```

```
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
```

We have got a good R squared value. R squared value is the amount of variance in the response, which can be explained by the predictor/s. We also see that the predictor has a low standard error and low p value which tells us that it’s statistically significant. The p value here tests the null hypothesis that keeping all other predictors constant (none right now), *horsepower* has no effect on *mpg*. We can reject the null hypothesis because of the low p-value. We also note that the relationship between mpg and horsepower is negative.

## Prediction using the model

Let’s try and predict the value of mpg when horsepower is 98. Also let’s get the confidence and prediction intervals around the prediction value. Prediction intervals are inherently wider than confidence intervals because they include the irreducible error in response as well.

`predict(modelFit, data.frame(horsepower = c(98)), type = "response")`

```
## 1
## 24.46708
```

`predict(modelFit, data.frame(horsepower = c(98)), type = "response", interval = "confidence")`

```
## fit lwr upr
## 1 24.46708 23.97308 24.96108
```

`predict(modelFit, data.frame(horsepower = c(98)), type = "response", interval = "prediction")`

```
## fit lwr upr
## 1 24.46708 14.8094 34.12476
```

## Regression diagnostic

Let’s now plot the response and predictor

```
plot(Auto$horsepower,Auto$mpg)
abline(modelFit)
```

We can see some hints of non-linearity here. Lets’s plot some diagnostics

```
par(mfrow=c(2,2))
plot(modelFit)
```

In the residuals vs fitted graph we can see a clear non-linear trend.

Let’s use the *car* package to plot some more diagnostics.

We can use the *car* package to plot studentized residuals. Studentized residuals are

`library(car)`

`## Warning: package 'car' was built under R version 3.5.3`

`## Loading required package: carData`

`## Warning: package 'carData' was built under R version 3.5.2`

`outlierTest(modelFit)`

```
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## 323 3.508709 0.00050294 0.19715
```

It can be difficult to reside the threshold for a point to be an outlier. Studentized residuals are obtained by dividing each residual by its standard error. Studentized residuals greater than 3 should be removed.

High leverage points are those points which influence the fit more than others. We can use the *leveragePlots* function in R to check leverage points.

`leveragePlots(modelFit)`

## Detecting influential points

`avPlots(modelFit)`

## Non-normality of residuals

`qqPlot(modelFit, main="QQ Plot")`

```
## 323 330
## 321 328
```

Let’s try and get the distribution of studentized residulas using *MASS* package

```
library(MASS)
sresid <- studres(modelFit)
hist(sresid, freq=FALSE,
main="Distribution of Studentized Residuals")
```

## Heteroscedasticity

Heteroscedasticity is a property shown by residuals when they have non constant variance. This can be checked by plotting studentized residuals against fitted values

`spreadLevelPlot(modelFit)`

```
##
## Suggested power transformation: 0.7680649
```

We can see a slight positive slope but not enough to be a cause of concern. Non linear distribution of studentized residuals is a bigger concern.

## Trying the gvlma package for testing linear model assumptions

`library(gvlma)`

`## Warning: package 'gvlma' was built under R version 3.5.2`

```
gvmodel <- gvlma(modelFit)
summary(gvmodel)
```

```
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = modelFit)
##
## Value p-value Decision
## Global Stat 110.921 0.000e+00 Assumptions NOT satisfied!
## Skewness 15.847 6.869e-05 Assumptions NOT satisfied!
## Kurtosis 1.458 2.273e-01 Assumptions acceptable.
## Link Function 81.186 0.000e+00 Assumptions NOT satisfied!
## Heteroscedasticity 12.430 4.224e-04 Assumptions NOT satisfied!
```

Global stat tells us if the relationship between y and x are truly linear. Clearly, as concluded earlier in the diagnostic plots, our model is not linear