While summary statistics are helpful, multiple datasets can share the same summary statistics yet have very different distributions and appearances when graphed. Anscombe's quartet is a dataset developed by Francis Anscombe to illustrate the importance of data visualization. In this task, you will plot the data using multiple data visualization techniques.
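As a quick illustration of this point, the sketch below computes summary statistics for Anscombe's quartet. It uses the `anscombe` data frame that ships with base R (columns `x1` to `x4` and `y1` to `y4`), which is an assumption made for the example and is not necessarily the workshop's `eda_data`. The name `quartet_summary` is hypothetical.

```r
# Anscombe's quartet is available in base R's datasets package.
data(anscombe)

# For each of the four x/y pairs, compute the means, standard
# deviations, and Pearson correlation.
quartet_summary <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(
    set    = i,
    mean_x = mean(x),
    mean_y = mean(y),
    sd_x   = sd(x),
    sd_y   = sd(y),
    cor_xy = cor(x, y)
  )
}))

# The four rows are nearly identical when rounded to two decimals.
print(round(quartet_summary, 2))
```

The four sets agree closely on every statistic in the table, even though plotting each pair with `plot(x, y)` shows four very different patterns.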

NIST 4-Plot

The National Institute of Standards and Technology (NIST) 4-Plot comprises the following four graphical techniques:

  1. Run sequence plot to test for fixed location and variation
  2. Lag plot to test randomness
  3. Histogram to test (normal) distribution
  4. Normal probability plot to test normal distribution
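To make the four panels concrete, here is a minimal sketch of how such a plot could be assembled with base R graphics. The function name `sketch_4_plot` is hypothetical, and this is not the workshop's `create_4_plot` function, whose implementation is not shown here.

```r
# Hypothetical stand-in for a 4-plot function (not the workshop's create_4_plot).
sketch_4_plot <- function(x) {
  x <- x[!is.na(x)]                 # drop missing values
  old_par <- par(mfrow = c(2, 2))   # arrange panels in a 2 x 2 grid
  on.exit(par(old_par))             # restore the previous layout on exit

  # 1. Run sequence plot: values in observation order
  plot(seq_along(x), x, type = "l",
       xlab = "Index", ylab = "Value", main = "Run Sequence Plot")

  # 2. Lag plot: each value against the previous value
  plot(x[-length(x)], x[-1],
       xlab = "x[i-1]", ylab = "x[i]", main = "Lag Plot")

  # 3. Histogram
  hist(x, xlab = "Value", main = "Histogram")

  # 4. Normal probability plot with a reference line
  qqnorm(x, main = "Normal Probability Plot")
  qqline(x)

  invisible(x)                      # return the cleaned data invisibly
}

# Example call on simulated data (assumed for illustration only).
set.seed(1)
sketch_4_plot(rnorm(100))
```

For random, normally distributed data, the run sequence plot should show no trend, the lag plot should show no structure, the histogram should be roughly bell-shaped, and the normal probability plot should be close to a straight line.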

The user-defined create_4_plot function is called using the following code. Run this block of code to generate a 4-panel plot for each variable.

create_4_plot Function

# Data Visualization
create_4_plot(pull(eda_data, Var1))
create_4_plot(pull(eda_data, Var2))
create_4_plot(pull(eda_data, Var3))
create_4_plot(pull(eda_data, Var4))


Var1 NIST 4-Plot


Var2 NIST 4-Plot


Var3 NIST 4-Plot


Var4 NIST 4-Plot

Viewing Plots in RStudio

Graphs are shown in the Plots pane, in the lower-right corner of RStudio. You can use the forward and back arrows to step through the plots. If you run all four plot commands in one block, the plots are generated in order, and the one displayed when they finish is for the last variable (Var4). Use the left arrow above the plot to scroll back through the variables. There is also a "Zoom" button that opens a larger version of the currently selected plot in a separate window. This window can be resized to make the plot even larger.

Boxplot

Recall that the box starts at the first quartile Q_1 and ends at the third quartile Q_3. The thick line inside the box represents the median value of the dataset. The whiskers extend to the maximum and minimum values that are not outliers.

Values are considered outliers if they are greater than Q_3 + 1.5*IQR or less than Q_1 - 1.5*IQR, where IQR = Q_3 - Q_1 is the interquartile range.
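The outlier rule can be checked by hand, as in the sketch below. The vector `values` is made-up illustration data, not the workshop dataset, and the variable names are hypothetical. Note that `boxplot` computes its box edges from hinges, which can differ slightly from `quantile` for some sample sizes, so the two approaches may not always agree exactly.

```r
# Made-up example data with one deliberately extreme value.
values <- c(2, 4, 5, 6, 7, 8, 9, 10, 30)

# First and third quartiles, and the interquartile range.
q <- quantile(values, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Fences: anything outside these limits is flagged as an outlier.
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr
outliers <- values[values < lower_fence | values > upper_fence]

print(c(lower = lower_fence, upper = upper_fence))
print(outliers)

# Cross-check against the values that boxplot() itself would flag.
print(boxplot.stats(values)$out)
```

Here Q_1 = 5 and Q_3 = 9, so IQR = 4 and the fences are -1 and 15; only the value 30 falls outside them.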

Boxplot (base graphics)

boxplot(eda_data, ylab = "Value")


Scatterplot Matrix

The GGally library is an extension of ggplot2. The GGally function ggpairs generates a scatterplot matrix.

Scatterplot Matrix

ggpairs(eda_data)


Scatterplot matrix

The asterisk (*) next to the Var2 and Var3 correlation flags the statistical significance of that correlation. A single asterisk indicates that the p-value is less than 0.05.
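The p-value behind such a significance flag can be computed directly with `cor.test` from base R. The sketch below uses the built-in `anscombe` data as a stand-in for the workshop data, and the star thresholds (0.001, 0.01, 0.05) follow the common convention; treating them as the exact thresholds used by the scatterplot matrix is an assumption.

```r
# Built-in stand-in data (not the workshop's eda_data).
data(anscombe)

# Test whether the Pearson correlation differs from zero.
result <- cor.test(anscombe$x1, anscombe$y1, method = "pearson")
r_value <- unname(result$estimate)
p_value <- result$p.value

# Map the p-value to significance stars (conventional thresholds, assumed).
stars <- if (p_value < 0.001) {
  "***"
} else if (p_value < 0.01) {
  "**"
} else if (p_value < 0.05) {
  "*"
} else {
  ""
}

cat(sprintf("r = %.3f, p = %.4f %s\n", r_value, p_value, stars))
```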

Correlation

Correlation can be computed in R using the function cor. By default, cor and the scatterplot matrix function use the Pearson correlation coefficient, which measures linear correlation. Other measures, such as Kendall's tau and Spearman's rho, can be computed by specifying the method argument of the cor function.
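A short sketch of the three methods follows. It again uses the built-in `anscombe` data frame as an assumed stand-in for the workshop data; the variable names are hypothetical.

```r
# Built-in stand-in data (not the workshop's eda_data).
data(anscombe)
x <- anscombe$x1
y <- anscombe$y1

# Pearson is the default; the rank-based methods are selected with `method`.
pearson  <- cor(x, y)
spearman <- cor(x, y, method = "spearman")
kendall  <- cor(x, y, method = "kendall")

print(round(c(pearson = pearson, spearman = spearman, kendall = kendall), 3))

# cor also accepts a data frame and returns a full correlation matrix.
y_columns <- anscombe[, c("y1", "y2", "y3", "y4")]
spearman_matrix <- cor(y_columns, method = "spearman")
print(round(spearman_matrix, 2))
```

Spearman's rho is the Pearson correlation of the ranks, so it responds to any monotonic relationship rather than only a linear one.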

In addition, HEC-SSP includes a Correlation Analysis that allows users to compute linear correlation coefficients between two or more datasets.

Continue to Task 5. Workshop Questions