What tool would you use to identify variables that are not stationary? Which of the variables appear to be non-stationary (with respect to mean, or to variance)? Recall that these are observations taken in order.

The run sequence plot is the simplest tool. Variable 1 appears to be non-stationary wrt its variance; variable 2 appears to be non-stationary wrt its mean.

Which variable(s) are likely to be modeled best using a normal distribution? Which variable(s) might have tails too short to be modeled with the normal distribution? Tails too long? Any variable(s) that is/are too asymmetrical?

Of the variables, variable 4 looks the most normal. Variable 2 is too short-tailed, variable 1 is too long-tailed, and variable 3 too asymmetrical. Variable 4 shows some asymmetry as well.

Which datasets (if any) appear to have many outliers? How well would the normal distribution perform as a model for those datasets?

Using the box plot, it would appear that Variable 1 has many outliers, both high and low, and Variable 3 has one high outlier. This could make variable 1 too long-tailed and variable 3 too asymmetrical to use the normal distribution. This is supported by the histogram and normal Q-Q panels of the 4-plot.

Reminder that the box plot labels any point that's smaller than $\begin{array}{l}Q_1-1.5*IQR\end{array}$ or greater than $\begin{array}{l}Q_3+1.5*IQR\end{array}$ as an outlier. The whiskers of the box plot are drawn to the value of the last data point inside that interval.

Are there any strong linear relationships between the variables? Nonlinear relationships?

Correlation between all of the variables is quite low. The correlation between variables 2 and 3 is the highest at about 0.27 (which is low).

After looking at Variable 1 in a number of ways, what do you think the biggest challenges to modeling this variable would be, and why?

Variable 1 likely has non-stationary variance which means its samples are probably not taken from the same population (violation of the “identically distributed” part of IID). Additionally, the values have high autocorrelation (with a negative coefficient) so they probably aren’t independent. The heavy tails and high peak of the distribution likely mean our typical distribution models (like the normal distribution) won't work.

After looking at Variable 2 in a number of ways, what do you think the biggest challenges to modeling this variable would be, and why?

Variable 2 appears to have a strong trend (non-stationarity in the mean). This means that the samples are autocorrelated (dependent in time), an inference supported by the lag plot. Additionally, the samples appear short-tailed even if they are symmetrical, so a model with thinner-than-normal tails would be necessary. There is a chance that it is bi-modal from the histogram, which would be very problematic for modeling with a single distribution. It may also indicate that the samples are not identically distributed. However, there is not a great deal of evidence that it has multiple modes.

After looking at Variable 3 in a number of ways, what do you think the biggest challenges to modeling this variable would be, and why?

Variable 3 appears to be asymmetrical with a positive skewness, so the normal distribution is probably not a good model. It also weakly shows non-stationarity with respect to its variance. A critical look at the run sequence plot and lag plot might indicate slight autocorrelation, but nothing like Variable 1 or 2.

After looking at Variable 4 in a number of ways, what do you think the biggest challenges to modeling this variable would be, and why?

Variable 4 has only slight asymmetry with positive skewness but otherwise does not show signs of non-stationarity or autocorrelation. The normal Q-Q plot suggests that the normal distribution would be a good candidate to model the data. A critical eye may suggest some trend in the run-sequence plot in the mean or variance, but overall, this dataset is fairly well-behaved.

The four variables in this dataset are described below.

Variable 1 is the year-over-year change in corn yield (bushels per acre, or bu/ac) for Linn County, IA from 1928 through 2017.

Variable 2 is the soybean yield (bushels per acre, or bu/ac) for Linn County, IA from 1928 through 2017.

In a hydrology setting, agricultural indicators such as yield or acres harvested is sometimes used as a proxy for land use. A common use in this way is for the attribution of non-stationarity in river discharges to either land-use or climatic changes in agricultural watersheds.

Variable 3 is a spatially-averaged estimate of annual total precipitation (in) near Cedar Rapids, Linn County, IA from 1928 through 2017.

Variable 4 is a spatially-averaged estimate of annual average temperature (deg F) near Cedar Rapids, Linn County, IA from 1928 through 2017.

Solution R Code

Here is how your final script should look. Note that line 1 (setting the working directory) will look different on your machine depending on which directory you selected.

Solution R Code for Exploratory Data Analysis Workshop

setwd("<Your working directory here>")

library(tidyverse)
library(readxl)
library(moments)
library(GGally)

# Functions
getMode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

midhinge <- function(x) {
  (quantile(x, 0.25, names = F) + quantile(x, 0.75, names = F)) / 2
}

trimean <- function(x) {
  (midhinge(x) + median(x)) / 2
}

yule <- function(x) {
  n <- quantile(x, 0.75, names = F) + quantile(x, 0.25, names = F) - 2 * median(x)
  d <- IQR(x)
  n/d
}

create_4_plot <- function(x) {
  par(mfrow = c(2,2))
  plot(x, main = "Run Sequence Plot of X")
  plot(x[-1], x[-length(x)],type="b", xlab = "lag 1", ylab = "x", main = "Lag Plot of X")
  abline(a = 0, b = 1, lty = 2, col = "grey")
  hist(x, main = "Histogram of X")
  qqnorm(x, main = "Normal Q-Q Plot of X")
  qqline(x)
  par(mfrow = c(1,1))
}

# Load data
eda_data <- read_excel("eda_ws_data.xlsx")

# Summary Statistics
n <- nrow(eda_data)
min <- apply(eda_data, 2, min)
max <- apply(eda_data, 2, max)
range <- max - min
mean <- apply(eda_data, 2, mean)
median <- apply(eda_data, 2, median)
mode <- apply(eda_data, 2, getMode)
variance <- apply(eda_data, 2, var)
skewness <- apply(eda_data, 2, moments::skewness)
kurtosis <- apply(eda_data, 2, moments::kurtosis)
cv <- sqrt(variance)/mean
mh <- apply(eda_data, 2, midhinge)
trimean <-  0.5*(mh + median)
yule_coeff <- apply(eda_data, 2, yule)
iqr <- apply(eda_data, 2, IQR)

pr <- c(0.010, 0.025, 0.050, 0.100, 0.200, 0.800, 0.950, 0.975, 0.990)  

var1_pctile <- quantile(pull(eda_data, Var1), probs = pr)
var2_pctile <- quantile(pull(eda_data, Var2), probs = pr)
var3_pctile <- quantile(pull(eda_data, Var3), probs = pr)
var4_pctile <- quantile(pull(eda_data, Var4), probs = pr)

# Data Visualization
create_4_plot(pull(eda_data, Var1))
create_4_plot(pull(eda_data, Var2)) 
create_4_plot(pull(eda_data, Var3)) 
create_4_plot(pull(eda_data, Var4))

boxplot(eda_data, ylab = "Value")

ggpairs(eda_data)

CODE

Download PDF

Task 5. Workshop Questions

Solution R Code