Phase 2: Predictor Selection Using R
In this phase, decide on the following:
- Which three predictors you prefer most to use for your model
- Any transformations that make sense for those predictors and/or the response variable
Variable Transforms
When performing a regression analysis, there are multiple reasons to transform the data. The primary two are: (1) improving linearity between a predictor and response variable and (2) fixing problems with residuals (usually addressed in the “assumption checking” part of the modeling effort). If, in inspecting the pairs plot, non-linearity is apparent, a transformation may be helpful. The logarithm is generally the most useful transformation available. It is especially useful when the data span more than one order of magnitude and are positively skewed. Conversely, the reciprocal transform can be useful when a variable only varies slightly within an order of magnitude.
You can create transformed versions of variables with the mutate() function, as in the example below. This code takes the fitting_data data frame and temporarily creates a new column called log_dist_to_coast by taking the log of the values in the dist_to_coast column. Creating the column this way does not alter the original data frame containing the input data; the result is meant to be passed into further function calls, a process shown in more detail below.
Transforming Variables
# temporarily add a log-transformed copy of dist_to_coast
fitting_data %>%
  mutate(log_dist_to_coast = log(dist_to_coast))
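The reciprocal transform mentioned earlier follows the same pattern. The sketch below is illustrative only: the choice of column is arbitrary, and you should confirm the variable has no zero values before taking a reciprocal.
Reciprocal Transform (Illustrative)
# reciprocal transform; make sure winter_temp has no zeros first
fitting_data %>%
  mutate(recip_winter_temp = 1 / winter_temp)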
A quantitative way to assess whether a transformation improved linearity is to compare the Pearson correlation coefficient before and after transformation. If the magnitude of the Pearson coefficient increases, the relationship has become more linear, which is desirable for linear regression. To compute the correlation between two variables in a data frame, use the cor() function. The code below combines a data transformation with computing the correlation coefficient between the t and log_dist_to_coast variables. Note that a transformation is not always required: you can skip the mutate() call if you just want the correlation between two existing variables. You can also list more variables inside select() to compute the correlations among all of them.
Correlation Between Selected Variables
fitting_data %>%
  mutate(log_dist_to_coast = log(dist_to_coast)) %>%
  select(t, dist_to_coast, log_dist_to_coast) %>%
  cor(.)
Info
- The %>% operator is called the "pipe" and is used to pass results between functions.
- The mutate() function creates a new column in a data frame.
- The select() function filters down the columns in a data frame.
- The cor() function computes the correlation coefficient.
- The . tells a function to use the data frame it just received.
The workflows in the code blocks above do not alter the input data; they only create new variables temporarily. Chaining several functions together with pipes like this is common in functional programming, a paradigm that helps keep your input data clean and unaltered so the workflow remains repeatable.
Running that code block by selecting it and pressing Ctrl+Enter produces a full correlation matrix in the console:
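The output is a symmetric matrix with ones on the diagonal. As an illustration only (the values involving t are taken from Table 1; the entry between dist_to_coast and log_dist_to_coast depends on your data and is left as ...), the matrix has this general shape:
                      t  dist_to_coast  log_dist_to_coast
t                 1.000          0.524              0.254
dist_to_coast     0.524          1.000                ...
log_dist_to_coast 0.254            ...              1.000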
Try transforming several variables and see whether the correlation between t and each variable improves. Sometimes the correlation gets worse, as it does here, which indicates either that the relationship cannot be made more linear or that a different transformation is needed.
If the correlation coefficient between a variable and its transformed version is nearly ±1, the transformation had little power to change the shape of the variable. In that case you will likely want to either skip the transformation or try a different one.
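One way to check this directly is to correlate a variable with its own transformed version. A minimal sketch, reusing the log transform from above:
Checking a Transform Against the Original (Illustrative)
fitting_data %>%
  mutate(log_dist_to_coast = log(dist_to_coast)) %>%
  select(dist_to_coast, log_dist_to_coast) %>%
  cor(.)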
You may also find that linearity improves when you apply a transformation to the response variable t. This is effectively the same as applying the inverse of the transformation to each of the predictors, and it can be very powerful. Typically, this is used to deal with non-normality of residuals (discussed later).
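A sketch of this workflow, assuming a log transform of t is worth trying (the predictor choices here are illustrative, and log requires t > 0):
Transforming the Response (Illustrative)
fitting_data %>%
  mutate(log_t = log(t)) %>%
  select(log_t, dist_to_coast, winter_temp) %>%
  cor(.)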
Record your results in Table 1 (one example is shown).
Table 1. Results of testing transformations on linearity of data.
Variable | Transformation | Correlation with t (untransformed) | Correlation with t (transformed)
dist_to_coast | logarithm | 0.524 | 0.254
Multicollinearity
In a multivariate environment, especially in a regression model, additional concerns arise. The primary one is multicollinearity, where the predictors in a model are highly correlated with each other. Multicollinearity reduces your ability to select predictors and obscures the significance of model terms, among other effects. Evaluate which predictors you would avoid combining in a model because they are highly correlated. List some predictor pairs with very high correlation in Table 2 below (one example is shown). Do not include t as a variable; high correlation with the response variable is the goal!
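One way to scan for such pairs is to compute the correlation matrix among the candidate predictors only, dropping the response first. A minimal sketch, assuming all remaining columns in fitting_data are numeric:
Correlation Among Predictors (Illustrative)
fitting_data %>%
  select(-t) %>%
  cor(.)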
Table 2. Identifying potentially collinear predictors.
Predictor 1 | Predictor 2 | r
longitude | winter_temp | -0.784
After considering which predictors might be powerful for explaining variability in t, and which variables might be collinear, choose up to three predictors (with a transformation if necessary; "none" otherwise) to start your first multiple linear regression model. Also identify whether a transformation of t will be helpful.
Table 3. Linear regression model initial predictors.
Predictor | Transformation Type |
Go to the Next Step: Phase 3: Model Construction Using R