Download Excel File:  Sampling Distributions exercise 2024.xlsm

Objectives

The objective of this workshop is to explore the relationship between the size of a random sample and the uncertainty in estimates of a probability distribution made from that sample.

In this second task, we'll look at the uncertainty in estimating flood flow frequency curves with the LogPearson type III (LP3) distribution.  We'll explore an existing spreadsheet that starts with a known LP3 distribution, generates random samples of various sizes N, re-estimates the distribution from each sample, and explores the resulting errors in the parameters and quantiles.

The spreadsheet is complex, but this exercise involves just exploring it.


Steps:

In this section, we’ll look at uncertainty in the sample estimates of the parameters and quantiles of a LogPearson type III peak flow frequency curve, captured by the sampling distributions. We’ll be drawing random samples of size 10 and 50 from the frequency curve.

Open the spreadsheet “Sampling Distributions exercise 2024.xlsm”, linked at the top of this page.  You should find yourself in the "actual" curve tab, as seen in the sampling distributions demo.  The "actual" (population) distribution parameters are found in the orange cells, with cell names mean, stdev, and skew.  The Cumulative Distribution Function (CDF) of the population curve is plotted in frequency curve form, and the Probability Density Functions (PDFs) of flow and log flow are found by scrolling down.  Familiarize yourself with the actual population frequency curve.  Note, throughout this spreadsheet, the Bulletin 17B (B17B) estimate of the frequency factor Kp,g for skew between 1 and -1 is used.  That estimate (called the Wilson-Hilferty transformation) is: Kp,g = 2/g*(((Zp - g/6) * g/6 + 1)^3 - 1), where g = skew, p = exceedance probability, and Zp = the standard Normal deviate with exceedance probability p.
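To make the formula above concrete, here is a minimal Python sketch of the Wilson-Hilferty frequency factor and the quantile it produces. The function name frequency_factor is ours, not the spreadsheet's. We assume base-10 logs (the Bulletin 17B convention) and use the population values quoted in this exercise (mean 3.6, stdev 0.5, skew 0.3).

```python
from statistics import NormalDist

def frequency_factor(p_exceed, skew):
    """Wilson-Hilferty approximation to the LP3 frequency factor Kp,g."""
    z = NormalDist().inv_cdf(1.0 - p_exceed)  # standard Normal deviate Zp
    if abs(skew) < 1e-9:
        return z  # zero skew: the curve is log-Normal, so K equals Zp
    a = skew / 6.0
    return (2.0 / skew) * (((z - a) * a + 1.0) ** 3 - 1.0)

# Population parameters of log10(flow) used in this exercise.
# Base-10 logs are an assumption, following Bulletin 17B practice.
MEAN, STDEV, SKEW = 3.6, 0.5, 0.3

k100 = frequency_factor(0.01, SKEW)   # 100-year (1% chance) frequency factor
q100 = 10 ** (MEAN + k100 * STDEV)    # quantile: log Q = mean + K * stdev
print(f"K = {k100:.3f}, Q100 = {q100:,.0f}")
```

With these inputs K comes out near 2.545, so the population 100-year flow is roughly 75,000 in the spreadsheet's flow units.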

Where the in-lecture demo showed sample means, the samples tab here also contains the standard deviation and skew estimated from each sample, and the fitted LP3 frequency curve for each sample, plotted on the CDF figure to the right.

Hit F9 several times to see several samples of size N=10 (in blue) and N=50 (in red).

Sampling Distribution of an estimate of the MEAN

Tab mean 10 contains 1000 estimates of the mean from a sample size of 10. In this spreadsheet, instead of using a macro to record the results of 1000 samples of size N, the random values generated for each sample are saved to allow inspection of how the samples are created from the population LP3 flow frequency curve.  Each of the 1000 rows of the spreadsheet is a sample of size N=10 (or 50), with the random values in the first N columns, the flow and log flow values generated in the next two sets of N columns, and then the sample estimate of the chosen parameter (in this case, mean) in the final column.
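The row layout described above can be sketched in Python as follows. This is only an illustration of the idea, not a copy of the spreadsheet's formulas: we assume each random value is a uniform non-exceedance probability that is converted to a log flow with the Wilson-Hilferty frequency factor, and the names make_row and log_flow_from_uniform are hypothetical.

```python
import random
from statistics import NormalDist, mean

MEAN, STDEV, SKEW = 3.6, 0.5, 0.3  # population parameters of log10(flow)
NORMAL = NormalDist()

def log_flow_from_uniform(u):
    """Convert a uniform random value (treated as a non-exceedance
    probability) to a log flow on the population LP3 curve."""
    z = NORMAL.inv_cdf(u)
    a = SKEW / 6.0
    k = (2.0 / SKEW) * (((z - a) * a + 1.0) ** 3 - 1.0)  # Wilson-Hilferty K
    return MEAN + k * STDEV

def make_row(n, rng):
    """Build one row: n random values, n flows, n log flows, one estimate."""
    # Clamp away from exactly 0 or 1, where the inverse Normal CDF is undefined.
    uniforms = [min(max(rng.random(), 1e-12), 1.0 - 1e-12) for _ in range(n)]
    log_flows = [log_flow_from_uniform(u) for u in uniforms]
    flows = [10 ** x for x in log_flows]
    return uniforms, flows, log_flows, mean(log_flows)

rng = random.Random(2024)  # fixed seed so the row can be reproduced
uniforms, flows, log_flows, estimate = make_row(10, rng)
print(f"sample mean of log flow for this row: {estimate:.3f}")
```

Repeating make_row 1000 times and collecting the final value of each row gives the column of estimates that the tab summarizes.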

The collection of 1000 sample means for sample size N=10 is in column AH. The statistics of those 1000 estimates are computed to the right, under the heading “statistics of estimates of mean,” and are shown in black. For the 1000 estimates of the mean based on a sample of size N=10, the mean, standard deviation, coefficient of variation (stdev/mean) and skew are computed.  (Note, these are the statistics of the sampling distribution of an estimate of the mean.)  Next to them, the sampling distribution’s anticipated mean and standard deviation are shown in green.

Note, the sampling distribution’s anticipated mean is the population mean μ = 3.6, because the estimator for the mean is unbiased.  The sampling distribution’s anticipated standard deviation is the population standard deviation divided by the square root of N, σ/sqrt(N) or 0.5/sqrt(10).  For both statistics, the difference between the value computed from the 1000 estimates and the anticipated sampling distribution parameter is noted.
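The comparison between the 1000 estimates and the anticipated values can be reproduced outside the spreadsheet. The sketch below is one way to do it, under our own assumptions: log flows are drawn with the Wilson-Hilferty frequency factor, and the helper names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

MEAN, STDEV, SKEW = 3.6, 0.5, 0.3  # population parameters of log10(flow)
NORMAL = NormalDist()

def random_log_flow(rng):
    """Draw one log flow from the population LP3 curve."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # avoid exactly 0 or 1
    z = NORMAL.inv_cdf(u)
    a = SKEW / 6.0
    k = (2.0 / SKEW) * (((z - a) * a + 1.0) ** 3 - 1.0)
    return MEAN + k * STDEV

def sample_means(n, reps, rng):
    """Return reps estimates of the mean, each from a sample of size n."""
    return [mean([random_log_flow(rng) for _ in range(n)]) for _ in range(reps)]

rng = random.Random(1)
results = {}
for n in (10, 50):
    estimates = sample_means(n, 1000, rng)
    anticipated_sd = STDEV / sqrt(n)  # sigma / sqrt(N)
    results[n] = (mean(estimates), stdev(estimates), anticipated_sd)
    print(f"N={n}: mean of means {results[n][0]:.3f} (anticipated {MEAN}), "
          f"stdev of means {results[n][1]:.3f} (anticipated {anticipated_sd:.3f})")
```

The anticipated standard deviations are 0.5/sqrt(10), about 0.16, and 0.5/sqrt(50), about 0.07. The simulated values should land close to them, though they change with the random seed.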

Finally, a histogram is generated from the 1000 estimates of the mean from sample size N=10, showing the sampling distribution itself.

Tab mean 50 has the same structure as tab mean 10, with estimates of the mean from sample size N=50. Computations are the same as for N=10 but with the larger sample, and the histograms of N=10 and N=50 are plotted together.  These plots match the image in the sampling distribution demo shown in the "uncertainty" lecture, except they are not plotted with the population PDF.

 

Question 1: How does the histogram for the mean of a sample of size N=10 compare to the one for size N=50?

The histogram for N=10 is wider than the one for N=50, showing that estimates of the mean from the smaller sample tend to be farther from the population value.


Question 2: Is this difference explained by the anticipated sampling distribution parameters?

Yes.  The anticipated standard deviation of estimates of the mean for N=50 is lower at 0.07, compared to 0.16 for N=10.  Statistics of both groups of 1000 estimates of the mean have a sampling distribution mean value very close to the correct value of 3.6, and standard deviation close to those expected sampling distribution values in green.  (NOTE, these values are the mean of the mean, and the standard deviation of the mean!)


Sampling Distribution of an estimate of the STANDARD DEVIATION

Tabs st.dev 10 and st.dev 50 contain similar computations showing the 1000 estimates of the standard deviation from samples of size N=10 and N=50. Note, now we're exploring the sampling distribution of an estimate of the standard deviation from sample size N, rather than an estimate of the mean.

The sampling distribution's expected mean is again the population value, in this case the population standard deviation σ = 0.5, and the sampling distribution's anticipated standard deviation is σ/sqrt(2N). (NOTE, this is the standard deviation of the standard deviation!)
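As a rough check on σ/sqrt(2N), the sketch below simulates 1000 estimates of the standard deviation for each sample size and prints them next to the anticipated values. It assumes the same Wilson-Hilferty sampling approach as the rest of this exercise. The helper names are ours.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

MEAN, STDEV, SKEW = 3.6, 0.5, 0.3  # population parameters of log10(flow)
NORMAL = NormalDist()

def random_log_flow(rng):
    """Draw one log flow from the population LP3 curve."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # avoid exactly 0 or 1
    z = NORMAL.inv_cdf(u)
    a = SKEW / 6.0
    k = (2.0 / SKEW) * (((z - a) * a + 1.0) ** 3 - 1.0)
    return MEAN + k * STDEV

def sample_stdevs(n, reps, rng):
    """Return reps estimates of the standard deviation from samples of size n."""
    return [stdev([random_log_flow(rng) for _ in range(n)]) for _ in range(reps)]

rng = random.Random(2)
results = {}
for n in (10, 50):
    estimates = sample_stdevs(n, 1000, rng)
    anticipated_sd = STDEV / sqrt(2 * n)  # sigma / sqrt(2N)
    results[n] = (mean(estimates), stdev(estimates), anticipated_sd)
    print(f"N={n}: mean of stdevs {results[n][0]:.3f}, "
          f"stdev of stdevs {results[n][1]:.3f} (anticipated {anticipated_sd:.3f})")
```

The anticipated values are 0.5/sqrt(20), about 0.11, for N=10 and 0.5/sqrt(100) = 0.05 for N=50.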


Question 3: How does the skew of the 1000 sample estimates of standard deviation from size N=10 compare to the skew of the 1000 sample estimates of standard deviation from size N=50?   What might explain the difference?

The skew of the 1000 estimates of standard deviation is smaller for N=50 than for N=10.  This is because the Central Limit Theorem holds more closely for the larger sample, so the sampling distribution for N=50 is closer to a Normal distribution, with skew closer to 0.


Sampling Distribution of an estimate of the SKEWNESS COEFFICIENT

Tabs skew 10 and skew 50 contain similar computations showing the 1000 estimates of the skew γ of samples of size N=10 and N=50. Note that the estimator for skew is BIASED. Skew estimates from limited random samples are closer to zero than the population skew (so, for a skew of 0.3, estimates are too low) and the sampling distribution of skew estimates itself has a positive skew for positive population skews.
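The bias described above can be explored with a short simulation. The sketch below assumes the skew estimator is the bias-adjusted formula used for station skew in Bulletin 17B, which is also what Excel's SKEW function computes. We have not confirmed which formula the spreadsheet cells use. The helper names are hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev

MEAN, STDEV, SKEW = 3.6, 0.5, 0.3  # population parameters of log10(flow)
NORMAL = NormalDist()

def sample_skew(values):
    """Skew estimate: n / ((n-1)(n-2)) * sum(((x - mean) / s)^3)."""
    n = len(values)
    m = mean(values)
    s = stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)

def random_log_flow(rng):
    """Draw one log flow from the population LP3 curve (Wilson-Hilferty K)."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # avoid exactly 0 or 1
    z = NORMAL.inv_cdf(u)
    a = SKEW / 6.0
    k = (2.0 / SKEW) * (((z - a) * a + 1.0) ** 3 - 1.0)
    return MEAN + k * STDEV

rng = random.Random(7)
mean_skew = {}
for n in (10, 50):
    estimates = [sample_skew([random_log_flow(rng) for _ in range(n)])
                 for _ in range(1000)]
    mean_skew[n] = mean(estimates)
    print(f"N={n}: mean of skew estimates {mean_skew[n]:.2f}, "
          f"skew of skew estimates {sample_skew(estimates):.2f}")
```

If the estimator is biased toward zero as described, the mean of the skew estimates should typically fall below the population value of 0.3, more so for N=10 than for N=50. Individual runs vary with the seed.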


Question 4: Do the skews of the 1000 estimates of sample skew agree with the above discussion for N=10 and N=50?

Yes.  The skew of the N=50 estimates is larger and closer to the correct value of 0.3 than that of the N=10 estimates (at 0.25 for N=50 and 0.21 for N=10).  Since each estimate is better for N=50, they more accurately capture the actual sampling distribution and its skew.


Sampling Distribution of an estimate of Q-100 (the 100-year or 1% chance flow)

Tabs Q100 10 and Q100 50 examine estimates of the 100-year flow computed from the fitted LogPearson type III (LP3) distribution for each sample. Note the positive skew of the 1000 estimates of Q100 for N=10 and N=50.


Question 5: Are the positive skews of the 1000 estimates of Q100 expected?

Yes.  We’ve seen that the higher quantiles of an LP3 distribution have uncertainty that is positively skewed.  On a plotted frequency curve, this appears as a longer tail upward.


Compare the confidence interval developed here from the 1000 estimates to the confidence interval computed in green, reflecting the non-central t distribution used in Bulletin 17B.
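One simple way to build a confidence interval from the 1000 estimates is to read off their percentiles. The sketch below does that for Q100, using the 5th and 95th percentiles as an assumed confidence level (the spreadsheet may use different limits). It does not reproduce the Bulletin 17B non-central t calculation, and the helper names are ours.

```python
import random
from statistics import NormalDist, mean, stdev, quantiles

MEAN, STDEV, SKEW = 3.6, 0.5, 0.3  # population parameters of log10(flow)
NORMAL = NormalDist()
Z100 = NORMAL.inv_cdf(0.99)        # Normal deviate for 1% exceedance

def frequency_factor(z, skew):
    """Wilson-Hilferty frequency factor for Normal deviate z."""
    if abs(skew) < 1e-9:
        return z
    a = skew / 6.0
    return (2.0 / skew) * (((z - a) * a + 1.0) ** 3 - 1.0)

def sample_skew(values):
    """Bias-adjusted skew estimate (same formula as Excel's SKEW)."""
    n, m, s = len(values), mean(values), stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)

def random_log_flow(rng):
    """Draw one log flow from the population LP3 curve."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # avoid exactly 0 or 1
    return MEAN + frequency_factor(NORMAL.inv_cdf(u), SKEW) * STDEV

def q100_estimates(n, reps, rng):
    """Fit LP3 to each sample by moments of the logs and return Q100."""
    estimates = []
    for _ in range(reps):
        logs = [random_log_flow(rng) for _ in range(n)]
        k = frequency_factor(Z100, sample_skew(logs))
        estimates.append(10 ** (mean(logs) + k * stdev(logs)))
    return estimates

TRUE_Q100 = 10 ** (MEAN + frequency_factor(Z100, SKEW) * STDEV)
rng = random.Random(3)
intervals = {}
for n in (10, 50):
    cuts = quantiles(q100_estimates(n, 1000, rng), n=20)  # cut points every 5%
    intervals[n] = (cuts[0], cuts[-1])                    # 5th and 95th percentiles
    print(f"N={n}: 5% to 95% of Q100 estimates = "
          f"{cuts[0]:,.0f} to {cuts[-1]:,.0f} (true Q100 {TRUE_Q100:,.0f})")
```

Because the skew is re-estimated from every sample, its uncertainty is carried into each Q100 estimate and therefore into the percentile limits.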


Question 6: How do the confidence intervals from the 1000 estimates compare to Bulletin 17B’s non-central t distribution?  Is this result expected?  Why or why not?

The computed confidence intervals are wider than the non-central t confidence intervals.  This is because the non-central t distribution neglects the uncertainty in skew, while that uncertainty is captured in this experiment.


Sampling Distribution of estimates of probability

The next few tabs show the fitted frequency curves’ estimates of the probability of the actual 100-year flow and 50-year flow, for sample sizes N=10 and N=50. Note, the true values are 0.01 for the 100-year flow and 0.02 for the 50-year flow.  Examine how the probabilities are estimated for each sample.  The sampling distributions have tremendous skew, as seen by both the computed skew (positive) and the shape of the histograms (appearing as negative skew, because the axis is reversed).  The first histogram reflects a linear probability axis. To the right, another histogram is created based on a Normal Probability axis (i.e., the standard Normal deviate), and its skew is opposite to that of the histogram based on the linear axis.
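To see how a probability estimate might be read from a fitted curve, the sketch below inverts the Wilson-Hilferty formula: given a flow and a sample's fitted mean, standard deviation, and skew, it solves for the standard Normal deviate and then the exceedance probability. This is our own algebraic inversion and may differ from how the spreadsheet cells do it. The function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev

MEAN, STDEV, SKEW = 3.6, 0.5, 0.3  # population parameters of log10(flow)
NORMAL = NormalDist()

def frequency_factor(z, skew):
    """Wilson-Hilferty frequency factor for Normal deviate z."""
    if abs(skew) < 1e-9:
        return z
    a = skew / 6.0
    return (2.0 / skew) * (((z - a) * a + 1.0) ** 3 - 1.0)

def exceedance_probability(log_q, m, s, g):
    """Return (p, z): exceedance probability and Normal deviate of log_q
    on an LP3 curve with log-space mean m, stdev s, and skew g."""
    k = (log_q - m) / s
    if abs(g) < 1e-9:
        z = k
    else:
        y_cubed = 1.0 + g * k / 2.0  # from K = (2/g) * (y^3 - 1)
        y = math.copysign(abs(y_cubed) ** (1.0 / 3.0), y_cubed)  # real cube root
        z = (y - 1.0) * 6.0 / g + g / 6.0  # from y = 1 + (g/6) * (z - g/6)
    return 1.0 - NORMAL.cdf(z), z

def sample_skew(values):
    """Bias-adjusted skew estimate (same formula as Excel's SKEW)."""
    n, m, s = len(values), mean(values), stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)

# Log of the true 100-year flow, from the population curve.
LOG_Q100 = MEAN + frequency_factor(NORMAL.inv_cdf(0.99), SKEW) * STDEV

# Check: on the population curve itself, the probability is 0.01.
p_pop, z_pop = exceedance_probability(LOG_Q100, MEAN, STDEV, SKEW)

# One random sample of size 10, fitted by moments of the logs.
rng = random.Random(5)
logs = []
for _ in range(10):
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # avoid exactly 0 or 1
    logs.append(MEAN + frequency_factor(NORMAL.inv_cdf(u), SKEW) * STDEV)
p_fit, z_fit = exceedance_probability(LOG_Q100, mean(logs), stdev(logs),
                                      sample_skew(logs))
print(f"population curve: p = {p_pop:.4f}; fitted curve: p = {p_fit:.4f}, z = {z_fit:.2f}")
```

The returned z is the value that would be plotted on a Normal Probability axis, while p is the value on the linear probability axis.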

Finally, tabs curves 10 and curves 50 show the first 250 of the 1000 estimates of the LogPearson type III distributions. Study the array of curves to determine if the histograms of statistics and quantiles seem to match the full fitted LP3 frequency curves.

Close the spreadsheet.