While summary statistics are helpful, multiple datasets can share the same summary statistics yet have very different distributions and appearances when graphed. Anscombe's quartet is a dataset developed by Francis Anscombe to illustrate the importance of data visualization. In this task, you will plot the data using multiple data visualization techniques.
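As a rough illustration of that point, the sketch below computes a few summary statistics for each of the four x/y pairs in R's built-in anscombe data frame. This is an assumption made for the example: the built-in data uses columns x1..x4 and y1..y4 and is not necessarily laid out like the workshop's eda_data.

```r
# Built-in Anscombe data (columns x1..x4, y1..y4); assumed stand-in for eda_data
data(anscombe)

set_ids <- 1:4
get_x <- function(i) anscombe[[paste0("x", i)]]
get_y <- function(i) anscombe[[paste0("y", i)]]

# One row of summary statistics per x/y pair
summary_table <- data.frame(
  set    = set_ids,
  mean_x = sapply(set_ids, function(i) mean(get_x(i))),
  mean_y = sapply(set_ids, function(i) mean(get_y(i))),
  sd_x   = sapply(set_ids, function(i) sd(get_x(i))),
  sd_y   = sapply(set_ids, function(i) sd(get_y(i))),
  cor_xy = sapply(set_ids, function(i) cor(get_x(i), get_y(i)))
)

print(round(summary_table, 2))
```

The rows should agree to about two decimal places even though the four pairs look very different when plotted.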
Graphs are shown in the Plots pane, in the lower right of RStudio. You can use the forward and back arrows to step through the plots. If you run all four plot commands in one block, they will generate in order, and the plot displayed when they finish will be for the last variable (Var4). Use the back (left) arrow above the plot to scroll back through the variables. There is also a "Zoom" button that opens a larger version of the currently selected plot in a separate window, which can be resized to make the plot even larger.
Boxplot
Recall that the box starts at the first quartile Q_1 and ends at the third quartile Q_3. The thick line inside the box represents the median of the dataset. The whiskers extend to the maximum and minimum values that are not outliers.
Values are considered outliers if they are greater than Q_3 + 1.5*IQR or less than Q_1 - 1.5*IQR, where IQR = Q_3 - Q_1 is the interquartile range.
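The outlier rule above can be checked by hand with a minimal sketch like the following. The vector x is made up purely for illustration and is not part of eda_data. Note that boxplot() computes its box edges as hinges (via boxplot.stats), which can differ slightly from quantile() for some sample sizes.

```r
# Hypothetical sample values, chosen only to illustrate the rule
x <- c(2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 30)

# First and third quartiles (quantile() default method)
q1 <- unname(quantile(x, 0.25))
q3 <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Fences beyond which a value is flagged as an outlier
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- x[x < lower_fence | x > upper_fence]
print(c(lower = lower_fence, upper = upper_fence))
print(outliers)
```

For this vector the quartiles are 5.5 and 10.5, so the fences fall at -2 and 18, and only the value 30 is flagged.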
Boxplot (base graphics)
boxplot(eda_data, ylab = "Value")
Scatterplot Matrix
The GGally package is an extension of ggplot2. The GGally function ggpairs generates a scatterplot matrix.
Scatterplot Matrix
library(GGally)  # provides ggpairs; also loads ggplot2
ggpairs(eda_data)
The asterisk (*) next to the Var2 and Var3 correlation reflects the p-value of that correlation. A single asterisk indicates that the p-value is less than 0.05.
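To see where such a p-value comes from, the base R function cor.test can be run on a single pair of variables. The sketch below uses the x1 and y1 columns of R's built-in anscombe data as a stand-in for the workshop variables. The star thresholds in the code follow the common R convention and are an assumption here, not a statement of exactly what ggpairs does internally.

```r
# Built-in Anscombe data used as a stand-in for eda_data
data(anscombe)

# Pearson correlation test (the default method for cor.test)
test_result <- cor.test(anscombe$x1, anscombe$y1)
p_value <- test_result$p.value

# Assumed star convention: *** < 0.001, ** < 0.01, * < 0.05
stars <- if (p_value < 0.001) {
  "***"
} else if (p_value < 0.01) {
  "**"
} else if (p_value < 0.05) {
  "*"
} else {
  ""
}

cat("r =", round(unname(test_result$estimate), 3),
    " p =", signif(p_value, 3), stars, "\n")
```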
Correlation
Correlation can be computed in R using the function cor. By default, the cor and the scatterplot matrix functions use the Pearson correlation coefficient, which measures linear correlation. Other methods, such as Kendall's tau and Spearman's rho, can be computed by specifying the method argument in the cor function.
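A minimal sketch of the method argument is shown below. It again assumes R's built-in anscombe data frame as a stand-in, so the column names (x1, y1, and so on) differ from those in eda_data.

```r
# Built-in Anscombe data used as a stand-in for eda_data
data(anscombe)
x <- anscombe$x1
y <- anscombe$y1

# Pearson is the default; the other two are selected with `method`
pearson  <- cor(x, y)
spearman <- cor(x, y, method = "spearman")
kendall  <- cor(x, y, method = "kendall")
print(round(c(pearson = pearson, spearman = spearman, kendall = kendall), 3))

# Passing a data frame returns the full correlation matrix
cor_matrix <- cor(anscombe[, c("y1", "y2", "y3", "y4")], method = "spearman")
print(round(cor_matrix, 3))
```

Because Spearman's rho and Kendall's tau are rank-based, they can differ noticeably from the Pearson coefficient when the relationship is monotonic but not linear.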
In addition, HEC-SSP includes a Correlation Analysis which allows users to compute linear correlation coefficients using two or more data sets.