Workshop Files

Download the workshop files here:

MyFirstRScript.Rstreamflow_daily.RData

Software

CRAN R version 4.3.1 or newer

RStudio version 2023.09.01 or newer

This workshop assumes you have no prior experience in using either the R programming language, or the RStudio integrated development environment.  It will help you get them installed and configured, and then begin to understand the foundations of working with R.  Both R and RStudio are CIO/G-6 approved software for use on USACE hardware, and they are freeware, meaning they do not require any purchasing or licensing to work.

The following major tasks will serve as an outline for the workshop:

  1. Installing and Configuring R and RStudio
  2. Getting Started in RStudio
  3. Introduction to R Fundamentals
  4. Running Your First R Script

Read through the documentation linked in Tasks 1-3.

In Task 4, you will run the provided script. 

Task 1. Installing and Configuring R and RStudio

First, follow these instructions: Installing and Configuring R and RStudio

Task 2. Getting Started in RStudio

R is a programming language geared towards statistical analysis.  It simplifies a lot of common statistical tasks and makes it easier to generate statistical graphics.  Its capabilities are expanded by open source libraries called packages.  RStudio is an IDE, which is a tool used to facilitate development of code in a programming language.

First, start up RStudio and then take a quick tour of its user interface with this guide: Introduction to RStudio.

Task 3. Introduction to R Fundamentals

There are several features fundamental to working with R that might make it different than other programming languages.  Read the following page to become familiar with some of these fundamentals before beginning the workshop: Introduction to R.

Task 4. Running Your First R Script

Set Working Directory

First, set your working directory. In the lower right window, select the Files tab. Select the file brower (...) and navigate to the folder containing the starting data for this workshop. You must select a folder for the working directory.

On the Files tab, select the More menu to open a drop-down menu. Select Set As Working Directory.  

Save both files linked above (MyFirstRScript.R and streamflow_daily.RData) to your working directory. Select File | Open File... and open the MyFirstRScript.R script.

In the upper right window, select the History tab. You should see a command ("setwd") in the terminal.  This sets the working directory to the directory noted inside the function.

On the History tab, select that last command (setwd). The row containing the command should change colors when it is selected. Press the To Source button to add it to your script on line 2.  

A working directory allows you to set the root directory for relative paths, and lets you access files by name in the working directory without needing to specify a full file path.

Set Working Directory

# Set working directory
setwd("<Your file directory here>")
CODE

To specify a directory path in R, you have two options:

  • Use two backslashes: setwd("C:\\Projects\\ILoveR")
  • Use a forward slash: setwd("C:/Projects/ILoveR")

Using a single backslash will result in an error since a single backslash is used to denote an escape character (e.g. \n is the escape character for a new line).

Install Packages

Packages can be installed using the package installation tool as described in Installing and Configuring R and RStudio. There are three libraries attached to the script: tidyverse, dataRetrieval, and moments. As a result, you will be prompted to install the packages at the top of the script. Click Install (shown below in the red box) to install the packages. 

In the bottom pane, click on the Background Jobs tab. You will see text appearing as the packages are installed. Wait until package installation is complete and you see the following text:

✔ Packages successfully installed.
CODE

Attach Packages

Libraries should be attached at the top of the script using the library function.  

The dataRetrieval package is used to download U.S. Geological Survey hydrologic data. 

The tidyverse package is a set of packages that are among the most commonly used R packages. The ggplot2 package is contained within the tidyverse package. We will use ggplot2 for data visualization later in this tutorial.

The moments package is used to compute mathematical moments.

The lubridate package is used for 

Run lines 5-7 to attach packages.

Attach Packages

# Attach packages
library(tidyverse) 
library(dataRetrieval)
library(moments)
CODE

Define Custom Functions

Review the Functions section of the Introduction to R page for additional information. An R function is created using the keyword function. A function in R has the following general syntax:

Function Syntax

function_name <- function(<arg1>, <arg2>, ...) {
 # Function body
}
CODE

 A function is comprised of the following components:

  • Function name: The name of the function. The function is stored in R environment as an object with this name.
  • Arguments: Arguments are optional placeholders. When a function is called, you pass a value to the argument. 
  • Function body: The function body contains statements used to perform a specific task.
  • Return value: The return value of a function is the last expression in the function body.

Variables defined within the function body are local variables. These variables exist only within the function. A function can invoke another function.

In this code block, we will define 2 custom functions: weibull_pp and wy.

The weibull_pp function computes the Weibull plotting position using the formula P = \frac {i}{n + 1} where i is the rank of the data and n is the number of data values. In R, the rank function assigns a rank of 1 to the smallest value. Therefore, this version of the Weibull plotting position describes empirical non-exceedance probability.

The wy function computes the water year (WY) based on a date. In the United States, a WY spans October 1 through September 30 of the following year and is named by the year in which the measurements end (e.g., WY 2022 spans October 1, 2021 through September 30, 2022).

Run lines 9-25.

Define Custom Functions

# Custom functions

# Computes the Weibull plotting position (empirical non-exceedance probability)
weibull_pp <- function(x) {
  n <- length(x)
  i <- rank(x) # In R, rank = 1 corresponds to the smallest value
  return(i / (n+1))
}

# Computes water year
wy <- function(date) {
  ifelse(month(date) > 9, year(date)+1, year(date))
}
CODE

Get Gage Site Info Using the dataRetrieval Package

The dataRetrival package readNWISsite function imports site information from USGS based on a site number. Run lines 26-30.

Get Gage Site Info Using the dataRetrieval Package

# Get gage site info
siteNo <- "11266500" # Merced River at Pohono Bridge near Yosemite, CA
siteInfo <- readNWISsite(siteNo)
siteInfo
CODE

QUESTION 1. What information is provided by the readNWISsite function?

The function returns the following information:

NameTypeDescription
agency_cdcharacterThe NWIS code for the agency reporting the data
site_nocharacterThe USGS site number
station_nmcharacterSite name
site_tp_cdcharacterSite type
lat_vanumericDMS latitude
long_vanumericDMS longitude
dec_lat_vanumericDecimal latitude
dec_long_vanumericDecimal longitude
coord_meth_cdcharacterLatitude-longitude method
coord_acy_cdcharacterLatitude-longitude accuracy
coord_datum_cdcharacterLatitude-longitude datum
dec_coord_datum_cdcharacterDecimal Latitude-longitude datum
district_cdcharacterDistrict code
state_cdcharacterState code
county_cdcharacterCounty code
country_cdcharacterCountry code
land_net_dscharacterLand net location description
map_nmcharacterName of location map
map_scale_fccharacterScale of location map
alt_vanumericAltitude of Gage/land surface
alt_meth_cdcharacterMethod altitude determined
alt_acy_vanumericAltitude accuracy
alt_datum_cdcharacterAltitude datum
huc_cdcharacterHydrologic unit code
basin_cdcharacterDrainage basin code
topo_cdcharacterTopographic setting code
instruments_cdcharacterFlags for instruments at site
construction_dtcharacterDate of first construction
inventory_dtcharacterDate site established or inventoried
drain_area_vanumericDrainage area
contrib_drain_area_vanumericContributing drainage area
tz_cdcharacterTime Zone abbreviation
local_time_fgcharacterSite honors Daylight Savings Time
reliability_cdcharacterData reliability code
gw_file_cdcharacterData-other GW files
nat_aqfr_cdcharacterNational aquifer code
aqfr_cdcharacterLocal aquifer code
aqfr_type_cdcharacterLocal aquifer type code
well_depth_vanumericWell depth
hole_depth_vanumericHole depth
depth_src_cdcharacterSource of depth data
project_nocharacterProject number

However, not every field will return a value for every gage. 

Load an *.RData File of Mean Daily Flow

The load function is used to load R objects. The head function is used to display the first n rows of the provided data frame. By default, n = 6.

Run lines 31-34.

Load an *.RData File of Mean Daily Flow

# Load the *.RData file containing mean daily streamflow for WY 1917-2022
load("streamflow_daily.RData")
head(streamflow_daily)
CODE

You should see the following output in the Console:

  agency_cd  site_no       Date Flow Flow_cd
1      USGS 11266500 1916-10-01  285       A
2      USGS 11266500 1916-10-02  285       A
3      USGS 11266500 1916-10-03  285       A
4      USGS 11266500 1916-10-04  285       A
5      USGS 11266500 1916-10-05  285       A
6      USGS 11266500 1916-10-06  285       A
CODE

Use the dataRetrieval Package to Download Annual Peak Flows

The dataRetrieval package readNWISpeak function is used to download the annual peak flows at USGS 11266500 Merced River at Pohono Bridge near Yosemite, CA.

Run lines 35-38.

Use the dataRetrieval Package to Download Annual Peak Flows

# Read peak flows
peak_flow_table <- readNWISpeak(siteNumbers = siteNo) 
head(peak_flow_table)
CODE

You should see the following output in the Console:

  agency_cd  site_no    peak_dt peak_tm peak_va peak_cd gage_ht gage_ht_cd year_last_pk ag_dt ag_tm ag_gage_ht ag_gage_ht_cd peak_dateTime
1      USGS 11266500 1917-06-10    <NA>    5880    <NA>    8.75       <NA>           NA  <NA>  <NA>         NA          <NA>          <NA>
2      USGS 11266500 1918-06-12    <NA>    4000    <NA>    7.05       <NA>           NA  <NA>  <NA>         NA          <NA>          <NA>
3      USGS 11266500 1919-05-29    <NA>    6150    <NA>    9.80       <NA>           NA  <NA>  <NA>         NA          <NA>          <NA>
4      USGS 11266500 1920-05-20    <NA>    5050    <NA>    8.80       <NA>           NA  <NA>  <NA>         NA          <NA>          <NA>
5      USGS 11266500 1921-06-11    <NA>    4610    <NA>    8.40       <NA>           NA  <NA>  <NA>         NA          <NA>          <NA>
6      USGS 11266500 1922-06-05    <NA>    6370    <NA>   10.00       <NA>           NA  <NA>  <NA>         NA          <NA>          <NA>
CODE

Apply Custom Functions and tidyverse Functions to the Annual Peak Flows

We will assign the peak_flow_table dataframe to a new variable named ams. We will add two columns, named wy and weibull_pp, using the dplyr package mutate function. The values of these columns are computed by calling the custom functions wy and weibull_pp.  Finally, we select the peak_dt, peak_va, wy, and weibull_pp columns using the dplyr packag select function and assign those to the ams variable.

Run lines 39-49.

Operators

The %>% is the pipe operator. The pipe operator is a mechanism for chaining commands together. The operator will forward a value, or the result of an expression, into the next function call/expression.

The double colon operator, ::, is used to access a specific function from a package (package::function). The code dplyr::select accesses the select function from the dplyr package. 

This is needed because different packages have functions with the same name but different functionality. The double colon operator allows us to specify a particular function that we want to use. Without a specific call, the function from the most recently loaded package masks the other functions of the same name from different packages.

Apply Custom and tidyverse Functions to Annual Peak Flows

# Create new dataframe named ams from the existing peak_flow_table dataframe
# Add columns for WY and Weibull plotting position
# Select the peak flow, WY, and Weibull PP columns 
ams <- peak_flow_table %>%
  mutate(wy = wy(peak_dt),
         weibull_pp = weibull_pp(peak_va)) %>% # Add columns for WY and Weibull PP; call user-defined functions wy and weibull_pp
  dplyr::select(peak_dt, peak_va, wy, weibull_pp) 

# Notice that the ams dataframe contains only peak_dt, peak_va, wy, and weibull_pp columns
head(ams)
CODE

You should see the following output in the Console:

     peak_dt peak_va   wy weibull_pp
1 1917-06-10    5880 1917  0.6635514
2 1918-06-12    4000 1918  0.3738318
3 1919-05-29    6150 1919  0.6915888
4 1920-05-20    5050 1920  0.5887850
5 1921-06-11    4610 1921  0.4859813
6 1922-06-05    6370 1922  0.7149533
CODE

Compute Summary Statistics for All Season, January, and May Mean Daily Flows

We will compute a handful of summary statistics on the mean daily streamflow. We will compute the following summary statistics for the all season, January, and May streamflows: mean, maximum, minimum, range, interquartile range, coefficient of variation, and skewness.

Run lines 50-89.

Compute Summary Statistics for All Season, January, and May Mean Daily Flows

# Compute summary statistics for all seasons
summary_stats_all_season <- streamflow_daily %>%
  summarise(mean = mean(Flow), # Compute mean of daily average flows
            max = max(Flow), # Compute maximum of daily average flows
            min = min(Flow), # Compute minimum of daily average flows
            range = max - min, # Compute range of daily average flows
            iqr = IQR(Flow), # Compute interquartile range: IQR = Q3 - Q1 (3rd quartile - 1st quartile)
            cv = sd(Flow)/mean, # Compute coefficient of variation: cov = standard deviation/mean
            skewness = moments::skewness(Flow))

# Subset daily streamflow to January and May flows
jan_flows <- subset(streamflow_daily, month(Date) == 1) # Keep all flows that occurred during January (month = 1)
may_flows <- subset(streamflow_daily, month(Date) == 5) # Keep all flows that occurred during May (month = 5)

# Compute summary statistics for all January flows in the period of record
summary_stats_jan <- jan_flows %>% 
  summarise(mean = mean(Flow), 
            max = max(Flow), 
            min = min(Flow),
            range = max - min,
            iqr = IQR(Flow),
            cv = sd(Flow)/mean,
            skewness = moments::skewness(Flow)) 

# Compute summary statistics for all May flows in the period of record
summary_stats_may <- may_flows %>%
  summarise(mean = mean(Flow),
            max = max(Flow),
            min = min(Flow),
            range = max - min,
            iqr = IQR(Flow),
            cv = sd(Flow)/mean,
            skewness = moments::skewness(Flow))

# Bind rows into a single summary statistics table
summary_stats <- rbind(all = summary_stats_all_season,
                       jan = summary_stats_jan,
                       may = summary_stats_may)
summary_stats
CODE

You should see the following output in the Console:

         mean   max   min   range    iqr       cv  skewness
all  623.9755 21000   5.4 20994.6  642.0 1.6520261  3.149698
jan  210.2115 21000  14.0 20986.0  167.0 2.4991858 23.595191
may 2321.3287  9760 198.0  9562.0 1797.5 0.5774552  1.030760
CODE

QUESTION 2. Comment on the differences in flow magnitudes and summary statistics between the all season, January, and May datasets.

The maximum mean daily flow for the all season and January time frames are the same (21,000 cfs) indicating that the maximum flow during the period of record occurred in January.

The minimum flow in January is substantially lower than the minimum flow in May but the maximum flow in January is substantially higher than the maximum flow in May.

The range, interquartile range, and coefficient of variation are all measures of spread, or dispersion. The range of flows during January (20,986 cfs) is nearly the same as the all season range (20,995 cfs). Both these ranges are much larger than the May flow range (9,562 cfs).

The interquartile range is the difference between the third quartile (the 75th percentile) and first quartile (25th percentile). The range of January flows is really large but the interquartile range is much smaller than that of the May flows. This indicates that the middle 50% of January flows doesn't exhibit as much spread as the middle 50% of May flows but the highest and lowest flow values exhibit a much wider spread than the maximum and minimum May flows.

The coefficient of variation is the ratio of the standard deviation to the mean and is a standardized measure of dispersion. The January flows have the highest coefficient of dispersion (2.50), followed by the all season flows (1.65) and the May flows (0.58).

Skewness is a measure of distribution asymmetry. For example, the skewness of a normal distribution is zero. Streamflow is typically characterized by a positive skewness value, indicating a long right tail. All three skewness values are positive. The January flow skewness value is extremely positive (23.6), indicating a very heavy right-tailed distribution.


Negative and positive skew diagrams (English).svg

Plot the Mean Daily Hydrograph and Annual Peak Flows

This code block uses the ggplot2 package (one of the tidyverse packages) for data visualization. We can add additional commands to ggplot objects using the + operator. Think of ggplot objects as buildable. 

Run lines 90-100.

Plot the Mean Daily Hydrograph and Annual Peak Flows

# Plot mean daily hydrograph
ts_plot <-
  ggplot(streamflow_daily, aes(x = Date, y = Flow)) + 
  geom_line() + # Plot hydrograph as a line
  xlab("Date") + # Add x-axis label
  ylab("Mean Daily Flow (cfs)") # Add y-axis label

# Add annual peak flow values to hydrograph plot
ts_plot + geom_point(data = ams, aes(x = peak_dt, y = peak_va), colour = "red") + # Plot AMS as points
  labs(title = "Merced River at Pohono Bridge near Yosemite, CA (11266500)") # Add plot title
CODE

You should see the following plot in the lower right panel under the Plots tab:

Plot the All Season Flow Duration Curve

A flow duration curve shows the percentage of time that a specific discharge value is equaled or exceeded. Flow duration curves are typically computed using mean daily data. The exceedance probability is computed as  P = \frac {i}{n + 1} where i is the rank of the data (i = 1 is the largest value) and n is the number of data values. This equation should look familiar (Weibull plotting position) but now i = 1 corresponds to the largest value instead of the smallest value. Instead of using the custom function weibull_pp, use the following code to add a column to the streamflow_daily data frame. Notice that we could have computed the Weibull plotting position this way instead of defining a custom function. We apply a negative sign in front of the flow values so that i = 1 corresponds to the largest flow value.

Run lines 101-112.

Plot the All Season Flow Duration Curve

# Compute exceedance probability for flow duration curve
streamflow_daily <- streamflow_daily %>%
  mutate(exPr = rank(-Flow)/(nrow(streamflow_daily) + 1))

# Plot flow duration curve
ggplot(streamflow_daily, aes(x = exPr, y = Flow)) +
  geom_line() +
  scale_y_log10(labels = scales::comma) +
  xlab("Exceedance Probability") +
  ylab("Flow (cfs)") +
  theme_bw() # Change theme to black and white
CODE

QUESTION 3. What percentage of the time is a discharge of 100 cfs equaled or exceeded at this site?

A discharge of 100 cfs is equaled or exceeded approximately 63% of the time (based on the period analyzed).

Plot Empirical Frequency of the Annual Peak Flows

The following code plots annual peak flows against annual exceedance probability (computed from the Weibull plotting position formula). 

Run lines 113-128.

Plot Empirical Frequency of the Annual Peak Flows

# Define non-exceedance probability breaks for plot
pbreaks <- c(0.001, 0.01, 0.10, 0.25, 0.50, 0.75, 0.90, 0.99, 0.99)

ggplot(ams, aes(x = weibull_pp, y = peak_va)) + 
  geom_point() +
  scale_x_continuous("Annual Exceedance Probability",
                     trans = scales::probability_trans("norm"), # Applies normal transformation to x-axis
                     breaks = 1-pbreaks, # Defines x-axis breaks based on specific exceedance probabilities (1 - non-exceedance probabilities)
                     labels = prettyNum(pbreaks), # Changes formatting of x-axis labels
                     limits = c(0.1, 0.99)) + # Set x-axis limits
  scale_y_log10(labels = scales::comma,
                limits = c(1000, 30000)) + # Add comma to labels on y-axis, set y-axis limits
  ylab("Flow (cfs)") + 
  theme_bw() +
  annotation_logticks(sides = "l") + # Add log tick marks for readability
  labs(title = "Merced River Empirical Frequency")
CODE

QUESTION 4. What discharge corresponds to an annual exceedance probability (AEP) of 10%?

A discharge of 11,000 cfs has an AEP of approximately 10%, or a 1/10 chance of occurring in any given year.