While summary statistics are helpful, multiple datasets can share the same summary statistics yet have very different distributions and appearances when graphed. Anscombe's quartet is a set of four datasets developed by Francis Anscombe to illustrate the importance of data visualization. In this task, you will plot the data using multiple data visualization techniques.
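
The point about matching summary statistics can be checked directly. The sketch below uses the copy of Anscombe's quartet that ships with R (the built-in anscombe data frame, with columns x1 to x4 and y1 to y4) as a stand-in; it is not the workshop's eda_data object, and the variable names are chosen only for illustration.

```r
# Built-in copy of Anscombe's quartet (columns x1..x4 and y1..y4)
data(anscombe)

# Mean and standard deviation of each y variable
y_cols  <- anscombe[, c("y1", "y2", "y3", "y4")]
y_means <- sapply(y_cols, mean)
y_sds   <- sapply(y_cols, sd)

# Pearson correlation for each (x, y) pair
xy_cor <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})

print(round(y_means, 2))
print(round(y_sds, 2))
print(round(xy_cor, 2))
```

The four pairs agree to about two decimal places on all three statistics, even though their scatterplots look nothing alike.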

NIST 4-Plot

The National Institute of Standards and Technology (NIST) 4-Plot consists of the following graphical techniques:

Run sequence plot to test randomness
Lag plot to test randomness
Histogram to test (normal) distribution
Normal probability plot to test normal distribution
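
For readers curious how the four panels fit together, here is a minimal base-graphics sketch. The name sketch_4_plot is hypothetical, and this is not the workshop's create_4_plot function, whose implementation may differ; it only shows one way the four techniques listed above could be combined.

```r
# Hypothetical stand-in for a 4-plot helper; the workshop's create_4_plot() may differ
sketch_4_plot <- function(x) {
  x <- x[!is.na(x)]                  # drop missing values
  n <- length(x)
  old_par <- par(mfrow = c(2, 2))    # 2 x 2 panel layout
  on.exit(par(old_par))              # restore the previous layout on exit

  # Run sequence plot: values in observation order
  plot(seq_len(n), x, type = "l",
       xlab = "Index", ylab = "Value", main = "Run Sequence Plot")

  # Lag plot: each value against the value that precedes it
  plot(x[-n], x[-1],
       xlab = "Y[i-1]", ylab = "Y[i]", main = "Lag Plot")

  # Histogram of the values
  hist(x, xlab = "Value", main = "Histogram")

  # Normal probability plot with a reference line
  qqnorm(x, main = "Normal Probability Plot")
  qqline(x)

  invisible(n)                       # number of non-missing values plotted
}
```

Calling sketch_4_plot on any numeric vector, for example a column of the built-in anscombe data frame, draws the four panels on the active graphics device.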

The user-defined create_4_plot function is called using the following code. Run this block of code to generate a 4-panel plot for each variable.

create_4_plot Function

# Data Visualization
# pull() (from dplyr) extracts one column of eda_data as a vector
create_4_plot(pull(eda_data, Var1))
create_4_plot(pull(eda_data, Var2))
create_4_plot(pull(eda_data, Var3))
create_4_plot(pull(eda_data, Var4))


Viewing Plots in RStudio

Graphs are shown in the Plots pane, in the lower right of RStudio. You can use the forward and back arrows to step through the plots. If you run all four plot commands in one block, the plots are generated in order, and the one displayed when they finish is for the last variable (Var4). Use the left arrow above the plot to scroll back through the variables. There is also a "Zoom" button that opens a larger version of the currently selected plot in a separate window, which can be resized to make the plot even larger.

Boxplot

Recall that the box starts at the first quartile $Q_1$ and ends at the third quartile $Q_3$. The thick line inside the box represents the median value of the dataset. The tails (whiskers) extend to the maximum and minimum values, excluding outliers.

Values are considered outliers if they are greater than $Q_3 + 1.5 \times IQR$ or less than $Q_1 - 1.5 \times IQR$, where $IQR = Q_3 - Q_1$ is the interquartile range.
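
As a worked example of the outlier rule, the sketch below computes the two fences for one variable. It uses the y3 column of R's built-in anscombe data frame as a stand-in for the workshop data, and the variable names are illustrative. Note that base R's boxplot relies on boxplot.stats, which computes hinges that can differ slightly from the quantile() defaults for some sample sizes.

```r
# Stand-in data: third y variable of the built-in Anscombe quartet
data(anscombe)
x <- anscombe$y3

# Quartiles and interquartile range
q1  <- quantile(x, 0.25, names = FALSE)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences from the 1.5 * IQR rule
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers <- x[x < lower_fence | x > upper_fence]
print(outliers)
```

For this variable the rule flags a single value, 12.74, which matches the point that boxplot.stats(x)$out reports.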

Boxplot (base graphics)

boxplot(eda_data, ylab = "Value")


Scatterplot Matrix

The GGally package is an extension of ggplot2. The GGally function ggpairs generates a scatterplot matrix.

Scatterplot Matrix

library(GGally)  # provides ggpairs(); skip if already loaded
ggpairs(eda_data)


The asterisk (*) next to the Var2 and Var3 correlation indicates the statistical significance of that correlation. A single asterisk indicates a p-value below 0.05.
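
To see the p-value behind such a marker, a correlation test can be run directly with the base R function cor.test. The sketch below assumes the marker reflects a standard test of zero correlation and uses the x1 and y1 columns of the built-in anscombe data frame as stand-ins for a pair of workshop variables.

```r
# Stand-in data: first (x, y) pair of the built-in Anscombe quartet
data(anscombe)

# Test whether the Pearson correlation differs from zero
result <- cor.test(anscombe$x1, anscombe$y1, method = "pearson")

r_value <- unname(result$estimate)  # sample correlation coefficient
p_value <- result$p.value           # p-value of the test

print(r_value)
print(p_value)
print(p_value < 0.05)               # TRUE means significant at the 0.05 level
```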

Correlation

Correlation can be computed in R using the function cor. By default, cor and the scatterplot matrix function use the Pearson correlation coefficient, which measures linear correlation. Other measures, such as Kendall's tau and Spearman's rho, can be computed by specifying the method argument of the cor function.
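
The method argument can be illustrated with a short sketch. It uses the x2 and y2 columns of the built-in anscombe data frame (a curved relationship) as stand-in data rather than the workshop's eda_data, and the variable names are illustrative.

```r
# Stand-in data: second (x, y) pair of the built-in Anscombe quartet
data(anscombe)

# Compute the correlation with each of the three supported methods
methods <- c("pearson", "kendall", "spearman")
cors <- sapply(methods, function(m) {
  cor(anscombe$x2, anscombe$y2, method = m)
})
print(round(cors, 3))

# cor() also accepts a whole data frame and returns a correlation matrix
cor_matrix <- cor(anscombe[, c("y1", "y2", "y3", "y4")], method = "spearman")
print(round(cor_matrix, 2))
```

Because the rank-based measures respond to monotonic rather than strictly linear association, they generally give different values from the Pearson coefficient when the relationship is curved.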

In addition, HEC-SSP includes a Correlation Analysis that allows users to compute linear correlation coefficients for two or more datasets.

Continue to Task 5. Workshop Questions

