In the last post, S&P 500 for the Long Haul, we discussed the results of a 66 year analysis of the S&P 500. This post will discuss the R code used to generate tables and plots in that post. We will focus on
While reading, you may want to view the entire script: R Script as Github Gist. Readers can also download the R Script with Yahoo! Finance Data ZIP.
Reading and Preparing Data
The goal of this script is to dynamically generate summary statistics and plots given daily time series data and a list of time periods. The main functions of this section of code:
- Read in data with read.csv(), making sure not to convert date strings to factor variables.
- Store the closing prices and dates in a zoo object. We use strptime() within the zoo() call to convert the date strings to POSIXlt objects, which are required by the index= parameter in the zoo() call. The zoo class is preferred to others in this script because it automatically formats plots to include dates.
- Call par() to force three vertical plots. This can be adjusted depending on which plots and values of yvec are kept/added.
- The variable yvec is a parameter in the script that stores time periods over which we will measure n-over-n returns. It is worth pointing out that the numbers 0.0833 and 0.25 represent months and quarters, respectively.
- Declare a variable, out_mat in this case, to which we can append summary statistics.
# Point filepath to data sp <- read.csv("~/script_and_data/sp500.csv", stringsAsFactors = FALSE) # Convert data to zoo object, keep only date and closing price library(zoo) sp <- zoo(sp$Close, order.by = strptime(sp$Date, "%Y-%m-%d", tz = 'EST')) # Set plotting window to 3 vertical plots par(mfrow = c(3, 1)) # List time periods (in years) to analyze yvec <- c(0.0833, 0.25, 0.5, 1, 2, 3, 5, 8, 10, 12, 15, 20) # Initialize the matrix we will use to store summary stats out_mat <- NULL
Begin Looping Through Time Periods
For every value in yvec, we translate years to an approximate number of trading days and store the value in lag_dist. The final line of code in this snippet that declares yoy is complex-looking way to compute the n-over-n return given a single-column zoo object. We remove the first lag_dist values of the denominator and the last lag_dist values of the numerator, then divide them, subtract one, and multiply by 100. For lag_dist = 1 this is the formula for a simple return series. This computation is a generalization of the return series allowing for any amount of lag.
for(yrs in yvec){ # Convert years to discrete number of datapoints (assumes 5-day trading week) lag_dist <- round(252 * yrs) # Time series of n-over-n returns yoy <- (as.numeric(sp[-(1:lag_dist)]) / as.numeric(lag(sp, -lag_dist)) - 1) * 100 # Begin plotting...
Generate Summary Statistics
This snippet computes the empirical CDF (cumulative distribution function) and a handful of summary statistics on the n-over-n returns. We compute them in the beginning of the loop because we will use them throughout the plots before outputting them to out_mat at the end. Note that the ecdf() returns a special object that we can use to check quantiles and plot the ECDF later.
The variable bep here is the break-even probability or B.E. PROB discussed in the original post. To compute the break-even probability, we sort the returns, then find the indices of the sorted list that are closest to zero from below and from above. We take the average of those two values and divide it by the length of the return series, giving us the percentage of returns that were below zero. We subtract this percentage from 100% to give us the percentage of returns above 0, thus we have the break-even probability. The remaining code adjusts for instances where all returns are above or below zero and R return positive or negative infinity. In this case, we want to register the break-even probability as 100% or 0%, respectively. Note that this computation is performed very inefficiently (multiple calls to sort(), calling min() and max() on ordered lists of integers, etc.) and may need to be optimized for larger time series.
################################################### # Compute emperical cdf and various summary stats # ################################################### cdf_yoy <- ecdf(yoy) cdf_out <- quantile(yoy, c(0.01, 0.05, .10, 0.25, 0.5, 0.75, 0.90, 0.95, 0.99)) mu <- mean(yoy) med <- median(yoy) sig <- sd(yoy) bep <- 100 * (1 - ((max(which(sort(yoy) <= 0)) + min(which(sort(yoy) > 0)))/2) / length(yoy)) if( is.infinite(bep) & sign(bep ) == 1 ){ bep <- 100 } else if( is.infinite(bep) & sign(bep) == -1){ bep <- 0 }
Append Output to the Output Matrix
This snippet checks to see if out_mat is empty, and either assigns or appends a row of summary statistics to it.
################################################### # Organize final data matrix (table seen in blog) # ################################################### out_vec <- c(yrs = y, cdf_out, mean = mu, median = med, std_dev = sig, BEprob = bep) if(is.null(out_mat)){ out_mat <- matrix(ncol= length(out_vec), nrow = 1, out_vec) } else { out_mat <- rbind(out_mat, out_vec) }
Generate Time Series Plot of Returns
This brief snippet creates a time series plot of the return series. Notice the dynamically formatted titles and labels. Also note how we use ylim= to force the y-axis to include zero, even in cases where all returns are above or below zero. We use the same logic as in the beginning of the loop to pass the correct date values to x=.
############################################### # Plot simple time series of n-over-n returns # ############################################### plot(y = yoy, x = index(sp)[-(1:lag_dist)], type = 'l', main = paste0(yrs, "-Year Returns on the S&P 500 (Measured Daily)"), ylab = paste0(yrs, "-Year Return (%)"), xlab = "", ylim = c(min(c(yoy, 0)), max(yoy, 0)) ) abline(h = 0, col = 1)
Generating ECDF’s and Histograms
This section uses many of the same principles as the previous to generate ECDF’s and historgrams. We use paste0() here to generate a vector of strings we can use in the legend for each plot. We also attempt to plot more intuitive grid lines for the ECDF. This is better than the alternative, but a little much to explain for this example. If having trouble debugging this section for your own purposes, replace lines 41 through 46 with a simple call to grid().
##################################### # Plot histogram (or empirical PDF) # ##################################### hist( yoy, freq = FALSE, breaks = 50, main = paste0(yrs, "-Year Returns on the S&P 500 (Measured Daily)"), xlab = paste0(yrs, "-Year Return (%)"), ylab = "Frequency") # Draw grid, zero-line, mean-line, and median-line grid() abline(v = 0, lwd = 2) abline(v = mu, col = 2, lwd = 2) abline(v = med, col = 4, lwd = 2) # Create vector of strings for legend table (used throughout) leg_str <- paste0( c("Mean = ", "Median = ", "Std Dev = ","Breakeven Prob. = "), round(c(mu, med, sig, bep), 2), c("%", "%", "", "%") ) # Plot legend legend('topright', legend = leg_str, col = c(2, 4, 0, 0), lwd = c(2, 2, 0, 0)) ###################### # Plot empirical CDF # ###################### plot(cdf_yoy, main = paste0(yrs, "-Year Returns on the S&P 500 (Measured Daily)"), xlab =paste0(yrs, "-Year Return (%)"), ylab = "Cumulative Frequency") # Place grid lines at 10% through 90% in incriments of 10 for(i in seq(.1, .9, .10)){ abline(h = i, col = 8, lwd = 0.7) } # Place solid lines at 0% and 100% abline(h = 0) abline(h = 1) # Dynamically compute and plot vertical grid lines (slight overkill for this script) interv <- round((max(yoy) - min(yoy)) / 25, -1) if(interv == 0) interv <- 10 for(i in seq(round(min(yoy), -1), round(max(yoy), -1), interv)){ abline(v = i, col = 8, lwd = 0.7) } # Plot zero, mean, and median lines just like in the histogram abline(v = 0, lwd = 2) abline(v = mu, col = 2, lwd = 2) abline(v = med, col = 4, lwd = 2) # Plot legend with strings generated earlier legend('bottomright', legend = leg_str, col = c(2, 4, 0, 0), lwd = c(2, 2, 0, 0)) } # end loop
Wrapping up…
The remainder of the original script prints out_mat using knitr::kable(). Readers are encouraged to print out_mat in their own R sessions and use whichever package they prefer to render the results. Suggested packages include: htmlTable, xtable, ztable, and of course knitr.
Personally, I think R is best-suited for exactly these types of analyses. We made great use of native functions in the pure R language and relied on only a single CRAN package, zoo. Even then, zoo was used out of habit more than necessity. The script is brief, lightweight, and easy to write and debug. It is exactly the type of thing that can be produced on-demand within an hour or two. Readers are encouraged to comment with bugs and let me know if there are any other posts they would like code explanations for.
Leave a Reply