S&P 500 for the Long Haul: Code Explanation Addendum

January 14, 2017 By Chris Conlan

In the last post, S&P 500 for the Long Haul, we discussed the results of a 66-year analysis of the S&P 500. This post walks through the R code used to generate the tables and plots in that post.

While reading, you may want to view the entire script: R Script as Github Gist. Readers can also download the R Script with Yahoo! Finance Data ZIP.

Reading and Preparing Data

The goal of this script is to dynamically generate summary statistics and plots given daily time series data and a list of time periods. This section of code does the following:

  • Read in data with read.csv(), making sure not to convert date strings to factor variables.
  • Store the closing prices and dates in a zoo object. We use strptime() within the zoo() call to convert the date strings to POSIXlt objects, which are required by the index= parameter in the zoo() call. The zoo class is preferred to others in this script because it automatically formats plots to include dates.
  • Call par() to force three vertical plots. This can be adjusted depending on which plots and values of yvec are kept/added.
  • The variable yvec is a parameter in the script that stores time periods over which we will measure n-over-n returns. It is worth pointing out that the numbers 0.0833 and 0.25 represent months and quarters, respectively.
  • Declare a variable, out_mat in this case, to which we can append summary statistics.
# Point filepath to data
sp <- read.csv("~/script_and_data/sp500.csv", stringsAsFactors = FALSE)

# Convert data to zoo object, keep only date and closing price
library(zoo)
sp <- zoo(sp$Close, order.by = strptime(sp$Date, "%Y-%m-%d", tz = 'EST'))

# Set plotting window to 3 vertical plots
par(mfrow = c(3, 1))

# List time periods (in years) to analyze
yvec <- c(0.0833, 0.25, 0.5, 1, 2, 3, 5, 8, 10, 12, 15, 20)

# Initialize the matrix we will use to store summary stats
out_mat <- NULL

Begin Looping Through Time Periods

For every value in yvec, we translate years into an approximate number of trading days and store the value in lag_dist. The final line of this snippet, which declares yoy, is a complex-looking way to compute the n-over-n return given a single-column zoo object. We remove the first lag_dist values of the denominator and the last lag_dist values of the numerator, then divide them, subtract one, and multiply by 100. For lag_dist = 1 this reduces to the formula for a simple return series; the computation generalizes the return series to any amount of lag.

for(yrs in yvec){

  # Convert years to discrete number of datapoints (assumes 5-day trading week)
  lag_dist <- round(252 * yrs)
  
  # Time series of n-over-n returns
  yoy <- (as.numeric(sp[-(1:lag_dist)]) / as.numeric(lag(sp, -lag_dist)) - 1) * 100
  
  # Begin plotting...
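
To make the indexing concrete, here is a toy check of the same construction on a plain numeric vector. This is illustrative only and not part of the original script:

```r
# Toy check of the n-over-n return construction on a plain vector
# (illustrative only; not part of the original script)
p <- c(100, 105, 110, 120, 126)  # a small price series
lag_dist <- 2

num <- p[-(1:lag_dist)]             # drop the first lag_dist values
den <- p[1:(length(p) - lag_dist)]  # drop the last lag_dist values
ret <- (num / den - 1) * 100

# ret[1] compares p[3] to p[1]: (110 / 100 - 1) * 100 = 10
print(ret)
```

Each element of ret compares a price to the price lag_dist observations earlier, which is exactly what the zoo-based one-liner in the script does.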

Generate Summary Statistics

This snippet computes the empirical CDF (cumulative distribution function) and a handful of summary statistics on the n-over-n returns. We compute them in the beginning of the loop because we will use them throughout the plots before outputting them to out_mat at the end. Note that the ecdf() returns a special object that we can use to check quantiles and plot the ECDF later.

The variable bep here is the break-even probability or B.E. PROB discussed in the original post. To compute the break-even probability, we sort the returns, then find the indices of the sorted list that are closest to zero from below and from above. We take the average of those two values and divide it by the length of the return series, giving us the percentage of returns that were below zero. We subtract this percentage from 100% to give us the percentage of returns above zero, thus we have the break-even probability. The remaining code adjusts for instances where all returns are above or below zero and R returns positive or negative infinity. In this case, we want to register the break-even probability as 100% or 0%, respectively. Note that this computation is performed very inefficiently (multiple calls to sort(), calling min() and max() on ordered lists of integers, etc.) and may need to be optimized for larger time series.

  ###################################################
  # Compute empirical cdf and various summary stats #
  ###################################################
  cdf_yoy <- ecdf(yoy)
  cdf_out <- quantile(yoy, c(0.01, 0.05, .10, 0.25, 0.5, 0.75, 0.90, 0.95, 0.99))
  
  mu <- mean(yoy)
  med <- median(yoy)     
  sig <- sd(yoy)
  bep <- 100 * (1 - ((max(which(sort(yoy) <= 0)) + min(which(sort(yoy) > 0)))/2) / length(yoy))
  
  if (is.infinite(bep) && sign(bep) == 1) {
    bep <- 100
  } else if (is.infinite(bep) && sign(bep) == -1) {
    bep <- 0
  }
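
As a side note, the same quantity can be computed far more efficiently with a single vectorized comparison. This is a suggested shortcut, not code from the original script; it agrees with the index-based version up to a half-observation correction that vanishes for long series:

```r
# Vectorized equivalent of the break-even probability
# (a suggested optimization, not part of the original script)
yoy <- c(-5, -1, 2, 4, 10)        # example return series in percent
bep_fast <- 100 * mean(yoy > 0)   # share of returns strictly above zero
# 3 of 5 returns are positive, giving 60%
print(bep_fast)
```

This version also handles the all-positive and all-negative cases naturally, with no infinity check needed.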

Append Output to the Output Matrix

This snippet checks to see if out_mat is empty, and either assigns or appends a row of summary statistics to it.

  ###################################################
  # Organize final data matrix (table seen in blog) #
  ###################################################
  out_vec <- c(yrs = yrs, cdf_out, mean = mu, median = med, std_dev = sig, BEprob = bep)
  
  if(is.null(out_mat)){
    out_mat <- matrix(out_vec, nrow = 1, ncol = length(out_vec))
  } else {
    out_mat <- rbind(out_mat, out_vec)
  }

Generate Time Series Plot of Returns

This brief snippet creates a time series plot of the return series. Notice the dynamically formatted titles and labels. Also note how we use ylim= to force the y-axis to include zero, even in cases where all returns are above or below zero. We use the same logic as in the beginning of the loop to pass the correct date values to x=.

  ###############################################
  # Plot simple time series of n-over-n returns #
  ###############################################
  plot(y = yoy, x = index(sp)[-(1:lag_dist)], type = 'l',
       main = paste0(yrs, "-Year Returns on the S&P 500 (Measured Daily)"),
       ylab = paste0(yrs, "-Year Return (%)"),
       xlab = "",
       ylim = c(min(c(yoy, 0)), max(c(yoy, 0))))
  
  abline(h = 0, col = 1)

Generating ECDFs and Histograms

This section uses many of the same principles as the previous one to generate ECDFs and histograms. We use paste0() here to generate a vector of strings for the legend of each plot. We also attempt to plot more intuitive grid lines for the ECDF. This is better than the alternative, but a little much to explain for this example. If you have trouble debugging this section for your own purposes, replace lines 41 through 46 with a simple call to grid().

  #####################################
  # Plot histogram (or empirical PDF) #
  #####################################
  hist( yoy, freq = FALSE, breaks = 50,
        main = paste0(yrs, "-Year Returns on the S&P 500 (Measured Daily)"),
        xlab = paste0(yrs, "-Year Return (%)"),
        ylab = "Density")
  
  # Draw grid, zero-line, mean-line, and median-line
  grid()
  abline(v = 0, lwd = 2)
  abline(v = mu, col = 2, lwd = 2)
  abline(v = med, col = 4, lwd = 2)
  
  # Create vector of strings for legend table (used throughout)
  leg_str <- paste0( c("Mean = ", "Median = ", "Std Dev = ","Breakeven Prob. = "),
                     round(c(mu, med, sig, bep), 2),
                     c("%", "%", "", "%") )
  
  # Plot legend
  legend('topright', legend = leg_str, col = c(2, 4, 0, 0), lwd = c(2, 2, 0, 0))
  
  
  ######################
  # Plot empirical CDF #
  ######################
  plot(cdf_yoy,
       main = paste0(yrs, "-Year Returns on the S&P 500 (Measured Daily)"),
       xlab = paste0(yrs, "-Year Return (%)"),
       ylab = "Cumulative Frequency")
  
  # Place grid lines at 10% through 90% in increments of 10%
  for(i in seq(.1, .9, .10)){
    abline(h = i, col = 8, lwd = 0.7)
  }
  
  # Place solid lines at 0% and 100%
  abline(h = 0)
  abline(h = 1)
  
  # Dynamically compute and plot vertical grid lines (slight overkill for this script)
  interv <- round((max(yoy) - min(yoy)) / 25, -1)
  if(interv == 0) interv <- 10
  for(i in seq(round(min(yoy), -1), round(max(yoy), -1), interv)){
    abline(v = i, col = 8, lwd = 0.7)
  }
  
  # Plot zero, mean, and median lines just like in the histogram
  abline(v = 0, lwd = 2)
  abline(v = mu, col = 2, lwd = 2)
  abline(v = med, col = 4, lwd = 2)
  
  # Plot legend with strings generated earlier
  legend('bottomright', legend = leg_str, col = c(2, 4, 0, 0), lwd = c(2, 2, 0, 0))
    
} # end loop

Wrapping up…

The remainder of the original script prints out_mat using knitr::kable(). Readers are encouraged to print out_mat in their own R sessions and use whichever package they prefer to render the results. Suggested packages include: htmlTable, xtable, ztable, and of course knitr.
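
A rendering step might look like the following sketch. The column names here are illustrative assumptions based on the statistics collected in out_vec, not names from the original script, and the matrix is a stand-in for the real out_mat:

```r
# Sketch of rendering the summary matrix with knitr::kable()
# (column names are illustrative assumptions, not from the original script)
library(knitr)

out_mat <- matrix(runif(2 * 14), nrow = 2)  # stand-in for the real out_mat
colnames(out_mat) <- c("yrs", "1%", "5%", "10%", "25%", "50%", "75%",
                       "90%", "95%", "99%", "mean", "median", "std_dev", "BEprob")
kable(round(out_mat, 2))
```

Any of the packages listed above can consume the same matrix; kable is simply the lightest option if knitr is already installed.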

Personally, I think R is best-suited for exactly these types of analyses. We made great use of native functions in the pure R language and relied on only a single CRAN package, zoo. Even then, zoo was used out of habit more than necessity. The script is brief, lightweight, and easy to write and debug. It is exactly the type of thing that can be produced on-demand within an hour or two. Readers are encouraged to comment with bugs and let me know if there are any other posts they would like code explanations for.
