In the last post, S&P 500 for the Long Haul, we discussed the results of a 66 year analysis of the S&P 500. This post will discuss the R code used to generate tables and plots in that post. We will focus on

While reading, you may want to view the entire script: R Script as Github Gist. Readers can also download the R Script with Yahoo! Finance Data ZIP.

## Reading and Preparing Data

The goal of this script is to dynamically generate summary statistics and plots given daily time series data and a list of time periods. The main functions of this section of code:

- Read in data with
*read.csv()*, making sure not to convert date strings to factor variables. - Store the closing prices and dates in a
*zoo*object. We use*strptime()*within the*zoo()*call to convert the date strings to POSIXlt objects, which are required by the*index=*parameter in the*zoo()*call. The*zoo*class is preferred to others in this script because it automatically formats plots to include dates. - Call
*par()*to force three vertical plots. This can be adjusted depending on which plots and values of*yvec*are kept/added. - The variable
*yvec*is a parameter in the script that stores time periods over which we will measure**n-over-n**returns. It is worth pointing out that the numbers**0.0833**and**0.25**represent months and quarters, respectively. - Declare a variable,
*out_mat*in this case, to which we can append summary statistics.

# Point filepath to data sp <- read.csv("~/script_and_data/sp500.csv", stringsAsFactors = FALSE) # Convert data to zoo object, keep only date and closing price library(zoo) sp <- zoo(sp$Close, order.by = strptime(sp$Date, "%Y-%m-%d", tz = 'EST')) # Set plotting window to 3 vertical plots par(mfrow = c(3, 1)) # List time periods (in years) to analyze yvec <- c(0.0833, 0.25, 0.5, 1, 2, 3, 5, 8, 10, 12, 15, 20) # Initialize the matrix we will use to store summary stats out_mat <- NULL

## Begin Looping Through Time Periods

For every value in *yvec*, we translate years to an approximate number of trading days and store the value in *lag_dist*. The final line of code in this snippet that declares *yoy* is complex-looking way to compute the **n-over-n** return given a single-column *zoo* object. We remove the first *lag_dist* values of the denominator and the last *lag_dist* values of the numerator, then divide them, subtract one, and multiply by 100. For *lag_dist = 1* this is the formula for a simple return series. This computation is a generalization of the return series allowing for any amount of lag.

for(yrs in yvec){ # Convert years to discrete number of datapoints (assumes 5-day trading week) lag_dist <- round(252 * yrs) # Time series of n-over-n returns yoy <- (as.numeric(sp[-(1:lag_dist)]) / as.numeric(lag(sp, -lag_dist)) - 1) * 100 # Begin plotting...

## Generate Summary Statistics

This snippet computes the empirical CDF (cumulative distribution function) and a handful of summary statistics on the **n-over-n** returns. We compute them in the beginning of the loop because we will use them throughout the plots before outputting them to *out_mat* at the end. Note that the *ecdf()* returns a special object that we can use to check quantiles and plot the ECDF later.

The variable *bep* here is the break-even probability or **B.E. PROB** discussed in the original post. To compute the break-even probability, we sort the returns, then find the indices of the sorted list that are closest to zero from below and from above. We take the average of those two values and divide it by the length of the return series, giving us the percentage of returns that were below zero. We subtract this percentage from 100% to give us the percentage of returns above 0, thus we have the break-even probability. The remaining code adjusts for instances where all returns are above or below zero and R return positive or negative infinity. In this case, we want to register the break-even probability as 100% or 0%, respectively. Note that this computation is performed very inefficiently (multiple calls to *sort()*, calling *min()* and *max()* on ordered lists of integers, etc.) and may need to be optimized for larger time series.

################################################### # Compute emperical cdf and various summary stats # ################################################### cdf_yoy <- ecdf(yoy) cdf_out <- quantile(yoy, c(0.01, 0.05, .10, 0.25, 0.5, 0.75, 0.90, 0.95, 0.99)) mu <- mean(yoy) med <- median(yoy) sig <- sd(yoy) bep <- 100 * (1 - ((max(which(sort(yoy) <= 0)) + min(which(sort(yoy) > 0)))/2) / length(yoy)) if( is.infinite(bep) & sign(bep ) == 1 ){ bep <- 100 } else if( is.infinite(bep) & sign(bep) == -1){ bep <- 0 }

## Append Output to the Output Matrix

This snippet checks to see if *out_mat* is empty, and either assigns or appends a row of summary statistics to it.

################################################### # Organize final data matrix (table seen in blog) # ################################################### out_vec <- c(yrs = y, cdf_out, mean = mu, median = med, std_dev = sig, BEprob = bep) if(is.null(out_mat)){ out_mat <- matrix(ncol= length(out_vec), nrow = 1, out_vec) } else { out_mat <- rbind(out_mat, out_vec) }

## Generate Time Series Plot of Returns

This brief snippet creates a time series plot of the return series. Notice the dynamically formatted titles and labels. Also note how we use *ylim=* to force the y-axis to include zero, even in cases where all returns are above or below zero. We use the same logic as in the beginning of the loop to pass the correct date values to *x=*.

############################################### # Plot simple time series of n-over-n returns # ############################################### plot(y = yoy, x = index(sp)[-(1:lag_dist)], type = 'l', main = paste0(yrs, "-Year Returns on the S&P 500 (Measured Daily)"), ylab = paste0(yrs, "-Year Return (%)"), xlab = "", ylim = c(min(c(yoy, 0)), max(yoy, 0)) ) abline(h = 0, col = 1)

## Generating ECDF’s and Histograms

This section uses many of the same principles as the previous to generate ECDF’s and historgrams. We use *paste0()* here to generate a vector of strings we can use in the legend for each plot. We also attempt to plot more intuitive grid lines for the ECDF. This is better than the alternative, but a little much to explain for this example. If having trouble debugging this section for your own purposes, replace lines 41 through 46 with a simple call to *grid()*.

##################################### # Plot histogram (or empirical PDF) # ##################################### hist( yoy, freq = FALSE, breaks = 50, main = paste0(yrs, "-Year Returns on the S&P 500 (Measured Daily)"), xlab = paste0(yrs, "-Year Return (%)"), ylab = "Frequency") # Draw grid, zero-line, mean-line, and median-line grid() abline(v = 0, lwd = 2) abline(v = mu, col = 2, lwd = 2) abline(v = med, col = 4, lwd = 2) # Create vector of strings for legend table (used throughout) leg_str <- paste0( c("Mean = ", "Median = ", "Std Dev = ","Breakeven Prob. = "), round(c(mu, med, sig, bep), 2), c("%", "%", "", "%") ) # Plot legend legend('topright', legend = leg_str, col = c(2, 4, 0, 0), lwd = c(2, 2, 0, 0)) ###################### # Plot empirical CDF # ###################### plot(cdf_yoy, main = paste0(yrs, "-Year Returns on the S&P 500 (Measured Daily)"), xlab =paste0(yrs, "-Year Return (%)"), ylab = "Cumulative Frequency") # Place grid lines at 10% through 90% in incriments of 10 for(i in seq(.1, .9, .10)){ abline(h = i, col = 8, lwd = 0.7) } # Place solid lines at 0% and 100% abline(h = 0) abline(h = 1) # Dynamically compute and plot vertical grid lines (slight overkill for this script) interv <- round((max(yoy) - min(yoy)) / 25, -1) if(interv == 0) interv <- 10 for(i in seq(round(min(yoy), -1), round(max(yoy), -1), interv)){ abline(v = i, col = 8, lwd = 0.7) } # Plot zero, mean, and median lines just like in the histogram abline(v = 0, lwd = 2) abline(v = mu, col = 2, lwd = 2) abline(v = med, col = 4, lwd = 2) # Plot legend with strings generated earlier legend('bottomright', legend = leg_str, col = c(2, 4, 0, 0), lwd = c(2, 2, 0, 0)) } # end loop

## Wrapping up…

The remainder of the original script prints *out_mat* using *knitr::kable()*. Readers are encouraged to print *out_mat* in their own R sessions and use whichever package they prefer to render the results. Suggested packages include: *htmlTable*, *xtable*, *ztable*, and of course *knitr*.

Personally, I think R is best-suited for exactly these types of analyses. We made great use of native functions in the pure R language and relied on only a single CRAN package, *zoo*. Even then, *zoo* was used out of habit more than necessity. The script is brief, lightweight, and easy to write and debug. It is exactly the type of thing that can be produced on-demand within an hour or two. Readers are encouraged to comment with bugs and let me know if there are any other posts they would like code explanations for.

## Leave a Reply