This is a short workshop (15 minutes) intended to show exploratory data analysis concepts on a real dataset.

The data included in the workshop are a timeseries of annual peak discharges for the Des Plaines River near Des Plaines, IL (pronounced: "dez PLANES" with all apologies to the French language.)  The watershed above the stream gage is an urban waterway with a drainage area of approximately 360 mi2.  It runs from southern Wisconsin towards Chicago, IL.  When it meets the Kankakee River southwest of Chicago, it forms the Illinois River.

Peak discharges back to 1934 were taken from USGS-NWIS and placed in an Excel spreadsheet, available here:

EDA Short Workshop.xlsx

Part 1: Data Summary Statistics

In the Excel workbook, the first worksheet is labeled "Data" and contains the timeseries of peak discharges, as well as a large number of univariate summary statistics.

What type of summary statistics are the first four statistics?  How much do they vary?  What does this suggest about the dataset?

The mean, median, 50% trimmed mean, and trimean, are measures of central tendency.  The value varies between 2,400 (50% trimmed mean) and 2,530 (mean) cfs.  The mean is the most sensitive to outliers, and because it is higher than the more robust measures of central tendency, might indicate the presence of one or more high outliers.  The overall variation in magnitude of these values is small, however.


What type of summary statistics are in the next group?  What are Q1 and Q3 and why are they included in this group?

The IQR, standard deviation, coefficient of variation, and quartile coefficient of dispersion, are measures of dispersion.  Q1 and Q3 are the first quartile (25th percentile) and the third quartile (75th percentile) respectively.  They are used to compute the IQR and in turn the quartile coefficient of dispersion.


What type of summary statistics are in the third group?  What do their values suggest about the dataset?  What kind of plot would you use to confirm this?

Skew and Yule's coefficient are measures of asymmetry.  Both values are positive, suggesting that the distribution of the data has a long right tail.  A histogram or a normal Q-Q plot would help investigate the direction and magnitude of asymmetry.

Part 2: Data Visualization

To the right of the data summary statistics are three plots: a time series plot, a histogram, and a normal Q-Q plot.

The histogram starts with 4 equally-spaced bins.  Increase this number to get a better visualization of the shape of the data (this is subjective - change it until you are satisfied with it.)

To change the number of bins, right-click on the x-axis of the plot and select Format Axis:

In the pane that opens, make sure you have the graph icon selected, and Axis Options is expanded.  Make sure "Number of bins" is selected, and change the number to whatever you like.  Do this by highlighting the number, entering a new one, and pressing Enter.


What does the shape of the histogram and the shape of the normal Q-Q plot tell you about this dataset?

The histogram seems to show a right (positive) skew because it has a longer tail off to the right.  Using 7 bins in the histogram reveals that the top bin has more observations than the second largest bin, which indicates there may be some high outliers in the dataset.  The tallest bin is in the region of 1,800-2,500 cfs indicating the mode of the data is somewhere in this range.

The normal Q-Q plot shows some non-linearity (curvature) of the points.   The non-linearity is concave up, which indicates that the dataset has more skewness than the normal distribution.  The normal distribution has zero skewness, so this indicates that the dataset have positive skewness.

Generating a Normal Q-Q Plot

A table that demonstrates computing the points for a normal Q-Q plot is in the rightmost cells of the Data worksheet beginning in column X.  It also shows the computation of the Q1-Q3 reference line that helps to check linearity of the dataset.

Part 3: Comparing Two Parts of the Record

Beginning in water year 1993, the USGS has flagged the flow observations with code "C" which indicates "All or part of the record affected by Urbanization, Mining, Agricultural changes, Channelization, or other".  This means that flows in this part of the record are likely to have different properties than flows before it.

Based on the time series plot on the "Data" worksheet, are there visible changes to the data throughout time?

The average discharge over time seems to be increasing.  The largest peaks seem to be at about the same magnitude, but the smallest peaks seem to be getting larger.

Switch to the "Split Record" worksheet, which has divided the record between the periods prior to the C code, and the period with a C code.

Looking at the summary statistics for each group of data, what kind of differences do you see?

The later (with "C") period has a higher central tendency and similar or slightly lower dispersion.  It is better to use non-dimensional measures to compare the dispersion between the two (like CV or the quartile coefficient of dispersion) because the central tendency between the datasets is different.

One interesting property of the "with C" group is that it has a positive skew but a negative Yule's coefficient - this is because the average of Q1 and Q3 is below the median.  Looking at the histogram for the "with C" group this could be caused by the slight amount of bi-modality in the dataset.

Finally, look at the box and whisker plots in the upper right.  These plots are useful for quickly comparing the properties of two or more groups of data.  Can you see how the differences in the summary statistics are reflected in the box and whisker plots?