R Coding Practices: Organization

In general it is good practice to organize your script so that others can easily follow and read it.

Using white space to break up groups of related commands makes the code easier to read.  I tend to place extra blank lines between groupings like library calls, setting the working directory, loading data, etc.

In RStudio, you can insert a code section with Ctrl+Shift+R to keep scripts organized.

Placing library calls at the beginning of the script ensures that the necessary packages are attached before any command that depends on them runs. It also tells readers which packages they may need to install.

Placing the setwd call at the beginning of the script lets someone else running the script know that they will need to change that directory to make it run on their machine.

Loading all of the necessary data near the beginning of the script ensures it is always available and ready to go after that.
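As a rough illustration of the layout described above, a script skeleton might look like the sketch below. The section names, the base packages, and the built-in mtcars data set are placeholders chosen so the sketch runs anywhere; they are not part of the workshop script. Comment lines ending in four or more dashes are what RStudio treats as code sections.

```r
# Libraries ----
library(stats)  # base packages stand in for the packages a real script needs
library(utils)

# Working directory ----
# setwd("<Your working directory here>")  # uncomment and edit for your machine

# Load data ----
cars_data <- datasets::mtcars  # built-in data set standing in for a real file

# Analysis ----
mean_mpg <- mean(cars_data$mpg)
print(mean_mpg)
```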

Lines of R code can be broken up so that they don't get too long to read. Some functions take a large number of parameters, so breaking the line after the commas can make a function call easier to read.
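A minimal sketch of this idea, using the base R matrix function purely as an example of a call with several arguments. R keeps reading until the closing parenthesis, so the call below is a single statement even though it spans several lines.

```r
# One argument per line, breaking after each comma.
m <- matrix(
  data  = 1:6,
  nrow  = 2,
  ncol  = 3,
  byrow = TRUE
)
print(m)
```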

Functions are typically defined at the beginning of a file since they must be defined before they can be called. An argument is a value provided to a function to obtain a result.

RStudio User Interface

The RStudio user interface comprises four windows:

  • Source window (top left)
  • Environment/History/Connections/Tutorial window (top right)
  • Console window (bottom left)
  • Files/Plots/Packages/Help/Viewer window (bottom right)

Set Up Project

  1. Start RStudio and create a new script by selecting File | New File | R Script. Alternatively, you can use the keyboard shortcut Ctrl+Shift+N.  Save the newly created script.  Name it "eda_workshop.R".
  2. Save the eda_ws_data.xlsx spreadsheet to the same directory as your R script.
  3. Next, set your working directory. In the lower right window, select the Files tab. Select the file browser (...) and navigate to the folder containing the starting data for this workshop. You must select a folder for the working directory.
  4. On the Files tab, select the More menu to open a drop-down menu. Select Set As Working Directory.  
  5. In the upper right window, select the History tab. You should see a setwd command in the history list. This command sets the working directory to the directory given inside the parentheses.
  6. On the History tab, select that last command ("setwd"). The row containing the command should change colors when it is selected. Press the To Source button to add it to your script.
  7. Save the script.
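To check that the working directory was set as intended, something like the following can be run in the Console. This is only a sketch; the variable names are illustrative, and the file name is the workbook mentioned in step 2.

```r
# Confirm where R is currently looking for files.
current_dir <- getwd()
print(current_dir)

# TRUE once the workbook has been saved to the working directory.
workbook_found <- file.exists("eda_ws_data.xlsx")
print(workbook_found)
```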

Installing and Attaching Packages

  1. In the lower right window, select the Packages tab, then click Install to open the Install Packages dialog.
  2. In the Packages field of the Install Packages dialog, enter the following: "tidyverse, readxl, moments, reshape2, GGally" without the quotes.  You can copy the text inside the quotes into the text box in RStudio.
  3. Click Install to install the packages and their dependencies.
    1. You may already have some of these packages installed, and that's OK. Installing a package again simply replaces the existing copy.
  4. You may be prompted with a dialog box asking if you want to restart R prior to install. If you are, select Yes.
  5. After the installation is complete, select the Update button on the Packages tab. This ensures all installed packages are up to date.
  6. Click Select All then Install Updates.
  7. When the process is complete, return to the Source window. Attach each package using the library function, as shown in the code block below. Place the library calls just after the setwd command from the previous section.

    Attaching Packages

    setwd("<Your working directory here>")
    
    library(tidyverse)
    library(readxl)
    library(moments)
    library(GGally)
  8. There are a number of ways to execute R code:
    1. Place your cursor anywhere on the line of code you want to run and press Ctrl+Enter.
    2. Click the Run button at the top right of the Source window.
    3. To execute multiple lines of code at once, highlight them and use either of the methods above.
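The packages can also be installed from the Console with install.packages. The sketch below only reports which of the workshop packages are missing; the install call is left commented out so that nothing is downloaded unless you choose to. The variable names are illustrative.

```r
# Packages listed in the installation step above.
pkgs <- c("tidyverse", "readxl", "moments", "reshape2", "GGally")

# requireNamespace() returns FALSE (without an error) for packages that are not installed.
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
} else {
  message("All workshop packages are installed.")
}
```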

Creating Functions

A function is a set of statements used to perform a specific task. R contains many built-in functions and you can create your own functions. An R function is created using the keyword function. A function in R has the following general syntax:

Function Syntax

function_name <- function(<arg1>, <arg2>, ...) {
 # Function body
}

Comment lines are used to explain code and are not executed. Comment lines begin with a hash sign (#), also called an octothorpe. A function consists of the following components:

  • Function name: The name of the function. The function is stored in the R environment as an object with this name.
  • Arguments: Arguments are placeholders for the values a function works on; a function may have none. When a function is called, you pass a value to each argument.
  • Function body: The function body contains statements used to perform a specific task.
  • Return value: The return value of a function is the value of the last expression evaluated in the function body, unless return() is called earlier.

Variables defined within the function body are local variables. These variables exist only within the function.
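The components and the scoping rule described above can be seen in a small sketch. The function add_and_double and the variable local_sum are hypothetical names used only for illustration.

```r
# Function name: add_and_double; arguments: a and b.
add_and_double <- function(a, b) {
  local_sum <- a + b  # local variable, visible only inside the function
  local_sum * 2       # last evaluated expression, so this is the return value
}

result <- add_and_double(2, 3)
print(result)               # 10
print(exists("local_sum"))  # FALSE: the local variable is gone after the call
```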

  1. Create 5 functions with the following names:
    • getMode
    • midhinge
    • trimean
    • yule
    • create_4_plot

Defining Functions

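# Functions

# Most frequent value in x; ties go to the value that appears first.
getMode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Midhinge: average of the first and third quartiles.
midhinge <- function(x) {
  (quantile(x, 0.25, names = FALSE) + quantile(x, 0.75, names = FALSE)) / 2
}

# Trimean: average of the midhinge and the median.
trimean <- function(x) {
  (midhinge(x) + median(x)) / 2
}

# Yule's skewness coefficient: (Q3 + Q1 - 2 * median) / IQR.
yule <- function(x) {
  n <- quantile(x, 0.75, names = FALSE) + quantile(x, 0.25, names = FALSE) - 2 * median(x)
  d <- IQR(x)
  n / d
}

# Four diagnostic plots in a 2x2 grid: run sequence, lag, histogram, normal Q-Q.
create_4_plot <- function(x) {
  par(mfrow = c(2, 2))
  plot(x, main = "Run Sequence Plot of X")
  # Lag plot: previous value on the horizontal axis, current value on the vertical axis.
  plot(x[-length(x)], x[-1], type = "b", xlab = "lag 1", ylab = "x", main = "Lag Plot of X")
  abline(a = 0, b = 1, lty = 2, col = "grey")
  hist(x, main = "Histogram of X")
  qqnorm(x, main = "Normal Q-Q Plot of X")
  qqline(x)
  par(mfrow = c(1, 1))
}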

Note that getMode computes the mode by finding the unique values in the data, counting how many times each occurs, and returning the one that occurs most often. If several values tie, the one that appears first in the data is returned. The mode may not have much meaning for continuous data but is useful for discrete data.
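A short check of this behavior on made-up data is sketched below. The function is repeated so the example runs on its own, and the input vectors are invented for illustration.

```r
getMode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Discrete numeric data: 5 occurs three times, more than any other value.
mode_num <- getMode(c(2, 3, 3, 5, 5, 5, 7))
print(mode_num)  # 5

# Tie between "a" and "b" (two each): the value seen first is returned.
mode_tie <- getMode(c("a", "b", "b", "a"))
print(mode_tie)  # "a"
```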

A function can invoke another function. For example, the trimean function invokes the user-defined midhinge function, and the yule function invokes the built-in IQR function.
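To see the chain of calls with concrete numbers, the sketch below applies the three quartile-based functions to a small invented vector. With R's default quantile method, this vector has first quartile 3, median 5, and third quartile 12. The functions are repeated so the example runs on its own.

```r
midhinge <- function(x) {
  (quantile(x, 0.25, names = FALSE) + quantile(x, 0.75, names = FALSE)) / 2
}

trimean <- function(x) {
  (midhinge(x) + median(x)) / 2  # calls the user-defined midhinge
}

yule <- function(x) {
  n <- quantile(x, 0.75, names = FALSE) + quantile(x, 0.25, names = FALSE) - 2 * median(x)
  n / IQR(x)                     # calls the built-in IQR
}

x <- c(1, 2, 3, 4, 5, 8, 12, 20, 30)  # made-up, right-skewed data

print(midhinge(x))  # (3 + 12) / 2 = 7.5
print(trimean(x))   # (7.5 + 5) / 2 = 6.25
print(yule(x))      # (12 + 3 - 2 * 5) / 9 = 5/9, about 0.556
```

A positive yule value, as here, indicates that the upper quartile lies farther from the median than the lower quartile does.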

Explanation of the create_4_plot Function

The par function, called with the mfrow argument, arranges subplots within a plotting space. Here it creates a 2x2 grid of plots that is filled row by row. The second call to par resets the layout so that subsequently created plots aren't placed in panels.
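A variation on this pattern, shown here only as a sketch and not as the workshop's method, is to save the settings that par returns and restore them afterwards, so that whatever layout was active before is put back. The null PDF device and the variable names are assumptions made so the example runs without opening a plot window.

```r
pdf(NULL)  # null graphics device: nothing is drawn to screen or disk

old_par <- par(mfrow = c(2, 2))  # par() returns the previous settings
layout_inside <- par("mfrow")    # 2 2
for (i in 1:4) plot(1:10, main = paste("Panel", i))

par(old_par)                     # restore the earlier layout
layout_after <- par("mfrow")     # 1 1 on a fresh device

invisible(dev.off())
```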

The plot function is a generic plotting function. By default, it produces a simple scatter plot; given a single numeric vector, it plots the values against their index.

The abline function adds a straight line to the current plot. With a = 0 (intercept) and b = 1 (slope), it draws a one-to-one line.

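The hist function computes and plots a histogram of the data. By default, the number of bins is computed using Sturges' formula: k = 1 + log2(n), rounded up, where k is the number of classes and n is the size of the data. R treats k as a suggestion and chooses nearby "pretty" break points, so the number of bins drawn can differ slightly.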
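As a quick worked example of Sturges' formula, the sketch below compares the hand computation with R's nclass.Sturges helper for a hypothetical sample size of 100, then counts the bins hist actually uses on simulated data.

```r
n <- 100                                 # hypothetical sample size
k_formula <- ceiling(1 + log2(n))        # 1 + 6.64... rounded up gives 8
k_builtin <- nclass.Sturges(seq_len(n))  # R's own implementation, also 8
print(c(k_formula, k_builtin))

# Bins actually used for simulated data; this may differ from k
# because hist() moves the breaks to round numbers.
set.seed(1)
h <- hist(rnorm(n), plot = FALSE)
print(length(h$counts))
```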

The qqnorm function produces a normal quantile-quantile (QQ) plot. Sample quantiles are plotted on the y-axis while theoretical quantiles (from the standard normal distribution) are plotted on the x-axis. The qqline function adds a reference line which, by default, passes through the first and third quartiles; points from normally distributed data fall close to this line.

Loading Data

  1. The readxl library can be used to read the data workbook directly.
  2. Using the code below, insert the file path to the eda_ws_data.xlsx workbook. If the workbook is in your working directory, the file name alone is enough.
  3. Save the script.

Loading Data

# Load data
eda_data <- read_excel("eda_ws_data.xlsx")
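If the read fails, the usual cause is that the workbook is not in the working directory. A slightly more defensive version of the load step might look like the sketch below; the data_file variable and the messages are illustrative additions, not part of the workshop script.

```r
data_file <- "eda_ws_data.xlsx"

if (!requireNamespace("readxl", quietly = TRUE)) {
  message("The readxl package is not installed.")
} else if (!file.exists(data_file)) {
  message("Could not find ", data_file, " in ", getwd())
} else {
  eda_data <- readxl::read_excel(data_file)
  str(eda_data)  # quick look at column names and types
}
```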

Continue to Task 3. Compute Summary Statistics