Download file for R Shiny application here: exploratory_data_analysis_app.zip

Note: An R script-based version of this workshop is available here: Exploratory Data Analysis R Script

Objectives

In this workshop, you will use the tools of exploratory data analysis (EDA) to investigate the data contained within a multivariate dataset. You will examine 90 observations of 4 variables, which are in observation order (the second row was observed after the first row, and so on). 

Recall some of the questions you might ask of a dataset:

  • What does the data look like?
  • What is a typical value?
  • How much do data in a sample vary?
  • What is a good model for a set of data?
  • How different are two sets of data?
  • Is a dataset taken from a single population?
  • Were the samples taken independently?

Steps

  1. Launch RStudio and open the provided .R file.
  2. Along the top bar of the editor, press the green play button (Run App) to run the R Shiny app. If any required packages are missing, install them first.
  3. Examine the statistics and plots on each of the tabs: Data Preview, Summary Stats, Percentiles, 4-Plot, Boxplot, and Pairs Plot.
  4. Answer the following questions:

Question 1: What tool would you use to identify variables that are not stationary? Which of the variables appear to be non-stationary (with respect to mean, or to variance)? Recall that these are observations taken in order.

The run sequence plot is the simplest tool. Variable 1 appears to be non-stationary with respect to its variance; Variable 2 appears to be non-stationary with respect to its mean.
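As a rough numerical companion to the run sequence plot, the sketch below splits an ordered series into consecutive blocks and compares block means and variances. The data are simulated purely for illustration (a made-up linear trend plus noise), not the workshop dataset, and the choice of three blocks of 30 is an assumption.

```r
# Hypothetical data: 90 ordered observations with a linear trend in the mean.
set.seed(1)
x <- 0.3 * seq_len(90) + rnorm(90)

# Split the series into three consecutive blocks of 30 observations each.
block <- rep(1:3, each = 30)

# Block means and variances; large differences between blocks hint at
# non-stationarity in the mean or in the variance, respectively.
block_mean <- tapply(x, block, mean)
block_var  <- tapply(x, block, var)
print(round(block_mean, 2))
print(round(block_var, 2))

# Run sequence plot: value against observation order (drawn only in an
# interactive session such as RStudio).
if (interactive()) {
  plot(seq_along(x), x, type = "b",
       xlab = "Observation order", ylab = "Value",
       main = "Run sequence plot")
}
```

Here the block means climb steadily while the block variances stay similar, which is the pattern a trend in the mean would be expected to produce.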

Question 2: Which variable(s) are likely to be modeled best using a normal distribution? Which variable(s) might have tails too short to be modeled with the normal distribution? Tails too long? Any variable(s) that is/are too asymmetrical?

Of the variables, Variable 4 looks the most normal. Variable 2 is too short-tailed, Variable 1 is too long-tailed, and Variable 3 is too asymmetrical. Variable 4 shows some asymmetry as well.
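Tail weight and asymmetry can also be summarized numerically. The sketch below computes moment-based skewness and excess kurtosis in base R; `shape_stats` is a hypothetical helper written for this illustration, not part of the workshop app, and the two small samples are made up.

```r
# Moment-based shape statistics using base R only.
# `shape_stats` is a hypothetical helper defined for this sketch.
shape_stats <- function(x) {
  centered <- x - mean(x)
  z <- centered / sqrt(mean(centered^2))  # standardize with the population SD
  c(skewness = mean(z^3),                 # > 0 suggests a longer right tail
    excess_kurtosis = mean(z^4) - 3)      # < 0 short tails, > 0 long tails
}

# Made-up samples: one symmetric and short-tailed, one right-skewed.
symmetric_short <- c(-1, 0, 1)
right_skewed    <- c(1, 1, 1, 10)

print(shape_stats(symmetric_short))
print(shape_stats(right_skewed))
```

For reference, the normal distribution has skewness 0 and excess kurtosis 0, so values far from zero on either statistic argue against a normal model. With only 90 observations these estimates are noisy and are best read alongside the histogram and normal Q-Q plot.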

Question 3: Which datasets (if any) appear to have many outliers? How well would the normal distribution perform as a model for those datasets?

The box plot suggests that Variable 1 has many outliers, both high and low, and that Variable 3 has one high outlier. This could make Variable 1 too long-tailed and Variable 3 too asymmetrical for the normal distribution to be a good model. The histogram and normal Q-Q panels of the 4-plot support this.

As a reminder, the box plot labels as an outlier any point smaller than Q_1 - 1.5 × IQR or greater than Q_3 + 1.5 × IQR, where IQR = Q_3 - Q_1. Each whisker is drawn to the most extreme data point that still lies inside that interval.
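The outlier rule above can be written out directly. This is a minimal sketch on made-up numbers; `box_outliers` is a hypothetical helper, and it uses `quantile()` with R's default quartile definition, which may differ slightly from the hinges that `boxplot()` itself uses.

```r
# Apply the 1.5 * IQR rule by hand (hypothetical helper for illustration).
box_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)  # Q_1 and Q_3
  iqr <- q[2] - q[1]
  lower <- q[1] - 1.5 * iqr                       # lower fence
  upper <- q[2] + 1.5 * iqr                       # upper fence
  inside <- x[x >= lower & x <= upper]
  list(fences   = c(lower, upper),
       whiskers = range(inside),                  # most extreme points inside the fences
       outliers = x[x < lower | x > upper])
}

# Made-up sample with one unusually large value.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)
res <- box_outliers(x)
print(res)
```

In this example Q_1 = 3.25 and Q_3 = 7.75, so the fences sit at -3.5 and 14.5; the value 30 is flagged as an outlier and the whiskers end at 1 and 9.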

Question 4: Are there any strong linear relationships between the variables? Nonlinear relationships?

The pairwise correlations among the variables are all quite low. The highest, between Variables 2 and 3, is about 0.27, which is still weak.
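A low correlation coefficient only rules out a strong linear relationship, which is why the pairs plot is worth inspecting for curved patterns as well. The sketch below uses made-up data: a perfectly quadratic relationship whose Pearson correlation is zero, followed by a correlation matrix for a hypothetical four-column data frame (the column names `v1` to `v4` are assumptions).

```r
# A strong nonlinear relationship with zero linear correlation (made-up data).
x <- -5:5
y <- x^2
r_linear <- cor(x, y)
print(r_linear)

# Correlation matrix for a hypothetical data frame of four variables.
set.seed(42)
df <- data.frame(v1 = rnorm(90), v2 = rnorm(90),
                 v3 = rnorm(90), v4 = rnorm(90))
cm <- cor(df)
print(round(cm, 2))

# Scatterplot matrix (drawn only in an interactive session such as RStudio).
if (interactive()) {
  pairs(df)
}
```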

Question 5: After looking at Variable 1 in a number of ways, what do you think the biggest challenges to modeling this variable would be, and why?

Variable 1 likely has non-stationary variance, which means its samples are probably not taken from the same population (a violation of the “identically distributed” part of IID). Additionally, the values have high autocorrelation (with a negative coefficient), so they probably aren’t independent. The heavy tails and high peak of the distribution likely mean that typical distribution models, such as the normal distribution, won't work.
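The lag plot's visual impression of autocorrelation can be backed by a lag-1 autocorrelation estimate. The sketch below is an illustration on simulated data only: it builds a series by differencing white noise, a construction chosen here because it is known to produce negative lag-1 autocorrelation (theoretically -0.5), and then estimates the coefficient in two ways.

```r
# Hypothetical series: differences of white noise (90 values).
set.seed(7)
e <- rnorm(91)
x <- diff(e)

# Lag-1 autocorrelation from the sample autocorrelation function.
# Element 1 of $acf is lag 0, element 2 is lag 1.
r1 <- acf(x, lag.max = 1, plot = FALSE)$acf[2]

# Correlation of each value with its predecessor, which is what a lag plot shows.
r1_lagplot <- cor(x[-1], x[-length(x)])

print(c(acf = r1, lag_plot = r1_lagplot))

# Lag plot (drawn only in an interactive session such as RStudio).
if (interactive()) {
  plot(x[-length(x)], x[-1], xlab = "x[i-1]", ylab = "x[i]", main = "Lag plot")
}
```

A clearly negative coefficient means high values tend to be followed by low ones and vice versa, which appears as a downward-sloping cloud in the lag plot.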

Question 6: After looking at Variable 2 in a number of ways, what do you think the biggest challenges to modeling this variable would be, and why?

Variable 2 appears to have a strong trend (non-stationarity in the mean). This means that the samples are autocorrelated (dependent in time), an inference supported by the lag plot. Additionally, the samples appear short-tailed even though they are symmetrical, so a model with thinner-than-normal tails would be necessary. The histogram hints that the variable may be bimodal, which would be very problematic for modeling with a single distribution and may also indicate that the samples are not identically distributed. However, there is not a great deal of evidence for multiple modes.
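To see why a trend by itself makes neighboring observations look dependent, the sketch below simulates a made-up trending series, fits a straight line against observation order, and compares the lag-1 correlation before and after removing the fitted trend. The slope, intercept, and noise level are arbitrary assumptions, and `lag1` is a hypothetical helper.

```r
# Hypothetical trending series: linear trend plus independent noise.
set.seed(3)
idx <- seq_len(90)
y <- 10 + 0.4 * idx + rnorm(90, sd = 2)

# Fit a straight-line trend against observation order.
fit <- lm(y ~ idx)
slope <- unname(coef(fit)["idx"])

# Lag-1 correlation: each value against its predecessor (hypothetical helper).
lag1 <- function(v) cor(v[-1], v[-length(v)])

r_raw       <- lag1(y)           # dominated by the trend
r_detrended <- lag1(resid(fit))  # what remains once the trend is removed

print(c(slope = slope, r_raw = r_raw, r_detrended = r_detrended))
```

In this simulation the raw series shows a lag-1 correlation near 1, while the residuals show very little, so the apparent dependence comes almost entirely from the trend. Real data may retain autocorrelation after detrending.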

Question 7: After looking at Variable 3 in a number of ways, what do you think the biggest challenges to modeling this variable would be, and why?

Variable 3 appears to be asymmetrical with a positive skewness, so the normal distribution is probably not a good model.  It also weakly shows non-stationarity with respect to its variance. A critical look at the run sequence plot and lag plot might indicate slight autocorrelation, but nothing like Variable 1 or 2.

Question 8: After looking at Variable 4 in a number of ways, what do you think the biggest challenges to modeling this variable would be, and why?

Variable 4 has only slight asymmetry with positive skewness, and otherwise does not show signs of non-stationarity or autocorrelation. The normal Q-Q plot suggests that the normal distribution would be a good candidate to model the data. A critical eye may see some trend in the mean or variance in the run sequence plot, but overall this variable is fairly well-behaved.
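One way to put a number on how straight a normal Q-Q plot looks is the correlation between the sample quantiles and the theoretical normal quantiles. The sketch below uses simulated data with an arbitrary mean and standard deviation, not the workshop dataset.

```r
# Hypothetical sample of 90 values drawn from a normal distribution.
set.seed(11)
x <- rnorm(90, mean = 48, sd = 1.5)

# Theoretical (qq$x) and sample (qq$y) quantiles, without drawing the plot.
qq <- qqnorm(x, plot.it = FALSE)

# Correlation of the Q-Q points; values close to 1 indicate a nearly straight plot.
ppcc <- cor(qq$x, qq$y)
print(ppcc)

# Normal Q-Q plot with reference line (drawn only in an interactive session).
if (interactive()) {
  qqnorm(x)
  qqline(x)
}
```

This coefficient is a summary only; curvature at the ends of the Q-Q plot (tail behavior) and an S-shape or bow (asymmetry) are still easier to diagnose by eye.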

The four variables in this dataset are described below.

Variable 1 is the year-over-year change in corn yield (bushels per acre, or bu/ac) for Linn County, IA from 1928 through 2017.

Variable 2 is the soybean yield (bushels per acre, or bu/ac) for Linn County, IA from 1928 through 2017.

In a hydrology setting, agricultural indicators such as yield or acres harvested are sometimes used as a proxy for land use. A common application is attributing non-stationarity in river discharges to either land-use or climatic changes in agricultural watersheds.

Variable 3 is a spatially-averaged estimate of annual total precipitation (in) near Cedar Rapids, Linn County, IA from 1928 through 2017.

Variable 4 is a spatially-averaged estimate of annual average temperature (deg F) near Cedar Rapids, Linn County, IA from 1928 through 2017.