During this phase, you will check two features of your model that can hinder its usefulness: multicollinearity and residual non-normality.

Checking Multicollinearity

A useful statistic for assessing the strength of multicollinearity in a model is the variance inflation factor (VIF).  The VIF estimates how much the variance of a regression coefficient is inflated due to multicollinearity among the predictors in the model.  As a guideline, values above 2.5 are cause for concern.  If a predictor has a very large VIF, it is up to you to decide which predictor to remove from the model (the offending one or a different one) to reduce the VIFs.

The function vif() from the car package we installed earlier quickly computes the VIFs for a model.  The code below loads the car package (you only need to do this once per session) and then displays the VIFs for the model.

Variance Inflation Factors

library(car)
vif(model_1)
CODE
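
If you want R to do the comparison against the 2.5 guideline for you, the short sketch below is one option.  It assumes model_1 has at least two continuous predictors, so that vif() returns a simple named vector.

Screening the VIFs against the 2.5 guideline

v <- vif(model_1)   # named vector of VIFs (assumes at least two predictors)
v[v > 2.5]          # predictors whose VIF exceeds the 2.5 guideline
any(v > 2.5)        # TRUE if the model fails the screen
CODE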

There are more robust checks for multicollinearity built into other R packages, but for now we will simply screen out models with VIFs above 2.5.

For each of the models you create with an adjusted-R2 greater than 0.5, record the VIF for each predictor in Table 1 below.  If your model has any predictor with a VIF greater than 2.5, then you should consider modifying the model by changing predictors or using transformations.

Table 1.  Checking results for regression models.

Model | Adjusted-R2 | Predictor 1 VIF | Predictor 2 VIF | Predictor 3 VIF
1     |             |                 |                 |
2     |             |                 |                 |
3     |             |                 |                 |
4     |             |                 |                 |
5     |             |                 |                 |
6     |             |                 |                 |
7     |             |                 |                 |
8     |             |                 |                 |
9     |             |                 |                 |
10    |             |                 |                 |
Checking Residual Normality

R produces a number of useful diagnostic plots when you call the plot() function on an lm model.  One of the plots it produces is the residual normal Q-Q plot, which can help identify whether the residuals are inconsistent with a normal distribution.

Diagnostic Plots

plot(model_1)
CODE

When you run this function, a prompt will appear in the console telling you to "Hit <Return> to see next plot".  Click in the console so it has focus, then press <Return>.  You will be prompted four times, each time producing a new plot; the second one is the normal Q-Q plot we are interested in.  If the points fall very close to the reference line without any serious curvature in the middle or the tails of the data, the residuals are likely normally distributed.  To confirm this, check the residuals using a formal test.
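
As an aside, if you would rather skip the prompts, plot() on an lm model accepts a which argument, and which = 2 requests only the normal Q-Q plot.  A minimal example:

Normal Q-Q plot only

plot(model_1, which = 2)
CODE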

The Shapiro-Wilk test is a normality test based on order statistics that is built into R and can quickly produce a hypothesis test for whether data follow a normal distribution.  The resid() function lets us extract the residuals from a model, and the shapiro.test() function runs the Shapiro-Wilk test.

Testing the residuals using the Shapiro-Wilk test

shapiro.test(resid(model_1))
CODE

The null hypothesis of the Shapiro-Wilk test is that the sample was drawn from a normally distributed population.  The test produces two values: a W statistic (https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test) and a p-value.  The p-value tells us whether the observed deviation from normality is reasonable for a sample of this size, or whether it is very unlikely that a normally distributed population generated the sample.  Smaller p-values indicate stronger evidence against normality.  If we choose a significance level of 5% and our p-value is smaller than 0.05, we reject the null hypothesis.  Check the residuals of your model to see whether they could be normally distributed.  If not, consider transformations of your data or other predictor variables.
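
If it helps, you can store the test result and pull out the p-value directly for the comparison.  A minimal sketch:

Extracting the Shapiro-Wilk p-value

sw <- shapiro.test(resid(model_1))
sw$p.value          # the p-value reported by the test
sw$p.value < 0.05   # TRUE means we reject normality at the 5% level
CODE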

Final Model

Record the following information about your "best" model.  “Best” should reflect the power of the model, both in goodness of fit and in how well it meets the assumptions.  A model with all parameters significant at p < 0.05 and a high adjusted-R2, but with high VIFs or non-normal residuals, is probably not as good as a model that meets the multiple linear regression assumptions but has lower fit statistics.  Finally, the best model may not need all three predictors.
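
Before filling in Table 2, you may find it convenient to collect the key checks for each candidate model with a small helper function like the sketch below.  This is an optional convenience, not part of the assignment; it assumes the car package is loaded and that the model has at least two predictors so vif() works.

Collecting the model checks in one place

check_model <- function(m) {
  c(adj_r2    = summary(m)$adj.r.squared,        # goodness of fit
    max_vif   = max(vif(m)),                     # worst multicollinearity
    shapiro_p = shapiro.test(resid(m))$p.value)  # residual normality
}
check_model(model_1)
CODE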

Table 2.  Final model information.

Parameter                               | Value
Predictor 1 Name                        |
Predictor 1 Transform                   |
Predictor 1 Coefficient                 |
Predictor 1 P-Value                     |
Predictor 2 Name                        |
Predictor 2 Transform                   |
Predictor 2 Coefficient                 |
Predictor 2 P-Value                     |
Predictor 3 Name                        |
Predictor 3 Transform                   |
Predictor 3 Coefficient                 |
Predictor 3 P-Value                     |
Adjusted-R2                             |
ANOVA (F) P-Value                       |
Maximum Parameter VIF                   |
Residual Normality Shapiro-Wilk P-Value |
Out-of-Sample Prediction

Finally, you will validate your model by performing an out-of-sample prediction with your best model.  The data you used to fit your model come from a subset of a larger study area, and you will evaluate how well a model fit to one subset of the data predicts a different subset.

We loaded the second worksheet of the workbook into a data frame called validation_data.  It contains the same variables as the sites we used to fit the model, but for new sites that were not used in the fitting.  These sites have computed L-CV (t) values that we can compare our model predictions to.

R has a built-in function, predict(), that generates predictions from a model for new data.  However, if you made any transformations to your data, you will have to add those columns to the validation dataset before applying the function.  We will use the pipe operator (%>%) and the mutate() function to add any transformed variables to the validation_data data frame and pass it to predict().  The code block below shows the prediction being added as a column to the validation_data data frame, which will make it easy to plot.  The code below will not run as written; it has a placeholder for any data transformations that need to be performed.

Predicting response at new sites

validation_data <-
  validation_data %>%
  mutate(<YOUR TRANSFORMATIONS HERE>) %>%
  mutate(t_predicted = predict(model_1, newdata = .))
CODE
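
As a concrete but hypothetical example, if your model had been fit with a log-transformed elevation stored in a column called log_elev, the filled-in version might look like the block below.  Substitute whatever transformations your model actually uses.

Example with a hypothetical transformation

validation_data <-
  validation_data %>%
  mutate(log_elev = log(elev)) %>%                      # hypothetical transform
  mutate(t_predicted = predict(model_1, newdata = .))   # predictions as a new column
CODE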

To compare the predicted values for L-CV (t) from our regression model to the values measured at the sites in the validation data, we can make a quick scatter plot of the measured vs. predicted values and then add a 1:1 reference line.  A perfect fit would have all points falling along the 1:1 reference line.  The ggplot() function generates the plot.  The aes() function tells ggplot() how to map variables to aspects of the plot.  geom_point() creates the scatter plot.  geom_abline() adds the reference line.

Plotting measured vs. predicted values

ggplot(validation_data, aes(x = t, y = t_predicted)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0)   # 1:1 reference line
CODE
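
Beyond the visual check, it can be useful to summarize the out-of-sample error with a single number.  The sketch below computes the root mean squared error and mean absolute error, assuming t_predicted has been added to validation_data as above.

Summarizing out-of-sample error

validation_data %>%
  summarise(rmse = sqrt(mean((t_predicted - t)^2)),   # root mean squared error
            mae  = mean(abs(t_predicted - t)))        # mean absolute error
CODE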

When we predict the response of the model for new data that was not used in fitting it, we also want an idea of how uncertain each prediction is.  For this we use the prediction interval.  The code below shows how to use the predict() function to generate a prediction interval for each new observation, and then display the fraction of measured values that fall outside their prediction intervals.  For a 95% prediction interval, we would expect about 5% of measured L-CV (t) values to fall outside the interval.

Prediction interval

# Prediction intervals at the validation sites (fill in the placeholders first)
model_2.pi <- validation_data %>%
  mutate(<YOUR TRANSFORMATIONS HERE>) %>%
  select(<YOUR VARIABLES HERE>) %>%
  predict(model_2, newdata = ., interval = "prediction", level = 0.95)
# Attach the fit, lwr, and upr columns to the validation data;
# make sure the select() keeps t so the coverage check below works
validation_data_with_model_2.pi <-
  validation_data %>%
  mutate(<YOUR TRANSFORMATIONS HERE>) %>%
  select(<YOUR VARIABLES HERE>) %>%
  cbind(., model_2.pi)
# Fraction of measured t values falling outside their prediction intervals
mean(with(validation_data_with_model_2.pi, t < lwr | t > upr))
CODE


Question 3: How well does the model perform out of sample?  Is this performance better or worse than you expected?  Which observations are predicted well?  Which are predicted poorly?

To get the predicted values at the validation sites, you can use the predict() function with the model and the validation data frame again, this time including the argument type = "response".  The code below shows how to compare the error in the predicted values at your validation sites to another variable, to see whether the errors are systematic with respect to any of your other data.  This is one way to see how adding variables to the model could help your out-of-sample prediction.  Another way would be to re-fit the model and then re-check the validation.

Comparing prediction errors to another variable

# Predicted values at the validation sites (fill in the placeholders first)
model_2.response <- validation_data %>%
  mutate(<YOUR TRANSFORMATIONS HERE>) %>%
  select(<YOUR VARIABLES HERE>) %>%
  predict(model_2, newdata = ., type = "response")
# Compare the prediction errors to elevation (swap in any other variable)
resid_check_df <- tibble(pred_err = model_2.response - validation_data$t,
                         elev = validation_data$elev)
ggplot(resid_check_df, aes(x = elev, y = pred_err)) + geom_point()
CODE
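
If you want to screen several candidate variables at once rather than one at a time, one option is to reshape the data and facet the plot.  The sketch below assumes the tidyr package is available for pivot_longer() and uses only elev; add any other candidate columns to the select() call.

Screening prediction errors against several variables

library(tidyr)   # for pivot_longer()

resid_check_long <- validation_data %>%
  mutate(pred_err = model_2.response - t) %>%
  select(pred_err, elev) %>%   # add other candidate columns here
  pivot_longer(-pred_err, names_to = "variable", values_to = "value")

ggplot(resid_check_long, aes(x = value, y = pred_err)) +
  geom_point() +
  facet_wrap(~ variable, scales = "free_x")
CODE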


Question 4 (time permitting): Is there a predictor variable that, if added to the original model, would improve the prediction of those sites in the validation set that were predicted poorly?