In my latest book, Algorithmic Trading with Python (2020), readers work through the process of developing a trading strategy, simulator, and optimizer against a portfolio of 100 assets. Each asset has 10 years of end-of-day data, creating about 2,500 data points per asset, totaling 250,000 data points.
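For a sense of scale, a panel with those dimensions can be sketched with simulated prices. Everything below is a hypothetical stand-in (the asset names, date range, and price process are invented, not the book's actual data):

```python
import numpy as np
import pandas as pd

# Hypothetical panel: 100 assets with ~10 years of end-of-day closes.
# At roughly 252 trading days per year, that is ~2,520 rows per asset.
n_assets = 100
dates = pd.bdate_range("2010-01-01", periods=2520)  # business days only
rng = np.random.default_rng(0)

# Simulated closing prices (geometric random walk) as a stand-in
# for a real end-of-day data set of this shape.
closes = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0, 0.01, (len(dates), n_assets)), axis=0)),
    index=dates,
    columns=[f"ASSET_{i:03d}" for i in range(n_assets)],
)

print(closes.shape)  # (2520, 100)
print(closes.size)   # 252000 data points, roughly the 250,000 figure above
```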
A lot of similar work in this field focuses on hyper-analyzing very dense data on single assets. For example, tick-by-tick data on S&P 500 futures can contain 100,000+ data points per day at times. I chose to structure my book this way for a few reasons. These reasons not only shaped the book, but also offer real advantages to traders who might be interested in structuring their strategies in a similar way.
Effects on Trading Frequency
Retail traders often encounter trading frequency restrictions in the form of government regulation or broker-level rules. These rules typically prevent traders from making more than three trades per week on a specific asset. Thus, for a trader with a small account, it is not worthwhile to study high-frequency data on a single asset: any discoveries or trading strategies developed against that data will not translate into a workable investment plan. On the other hand, if your strategy includes 100, 500, or 1,000 assets, it is likely not making frequent trades against any single asset. Multi-asset strategies seem to operate in line with government and brokerage expectations of low-risk investing without limiting profit opportunities.
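A quick back-of-the-envelope sketch makes the point. The signal rate below is a made-up assumption for illustration, not a measured figure from any real strategy:

```python
# Back-of-the-envelope: per-asset trade frequency in a multi-asset strategy.
# A strategy that fires 50 signals per week sounds busy, but spread across
# a 100-asset universe it touches each asset only about once every two weeks.
signals_per_week = 50  # hypothetical total trades across the whole book
n_assets = 100         # size of the tradable universe

per_asset_per_week = signals_per_week / n_assets
print(per_asset_per_week)  # 0.5 trades per asset per week

# Comfortably under a three-trades-per-week-per-asset restriction.
assert per_asset_per_week < 3
```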
Effects on Sample Size
The multi-asset setup provides additional benefits during the research phase. Many financial machine learning strategies use event-based labels whose holding windows can overlap. A high incidence of overlapping labels has the potential to create significant data leakage problems in your modeling step. In my experience, this effect can be mitigated significantly by using multi-asset strategies that generate signals less frequently on a per-asset basis, but more frequently on a per-strategy basis. The machine learning data set built from a multi-asset strategy will therefore contain a higher proportion of unique samples than the alternative: a high-frequency, single-asset strategy.
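A toy illustration of the overlap problem: for each label's holding window, count how many other labels are concurrent with it, and weight accordingly. The intervals below are invented, and this is a simplified sketch of the "average uniqueness" idea rather than any specific library's implementation:

```python
def avg_uniqueness(windows):
    """Average 'uniqueness' of event labels. Each bar inside a label's
    (start, end) window is weighted by 1 / (number of windows covering
    that bar); a label's uniqueness is the mean weight over its own bars."""
    # Count how many windows cover each bar.
    concurrency = {}
    for start, end in windows:
        for t in range(start, end + 1):
            concurrency[t] = concurrency.get(t, 0) + 1
    # Average per-label uniqueness across all labels.
    per_label = []
    for start, end in windows:
        weights = [1 / concurrency[t] for t in range(start, end + 1)]
        per_label.append(sum(weights) / len(weights))
    return sum(per_label) / len(per_label)

# Dense single-asset signals: holding windows overlap heavily.
dense = [(0, 5), (1, 6), (2, 7), (3, 8)]
# Sparse per-asset signals, as in a multi-asset setup: no overlap at all.
sparse = [(0, 5), (10, 15), (20, 25), (30, 35)]

print(avg_uniqueness(dense))   # well below 1.0: labels share information
print(avg_uniqueness(sparse))  # 1.0: every label is fully unique
```

Overlapping labels drag the average uniqueness well below 1.0, which is exactly the leakage risk described above; spacing signals out per asset pushes it back toward 1.0.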
Collaboration and Reproducibility
I mentioned that the book's data set is about 250,000 rows. When compressed, the data is about 11MB in total. I chose this format because I considered it important that the data be freely hosted and shared in a GitHub repo. Since the data fits comfortably in a repository, I figured it was the perfect size to serve as a small-scale benchmark for reproducible finance research.
Most financial literature is plagued by reproducibility issues because of the size and proprietary nature of the data. Other fields of study have large-scale open-source data sets against which researchers test and compare their results. Finance has no equivalent, because the data under consideration is so commercially valuable. I hope that the data sets published with Algorithmic Trading with Python will start to get people thinking about what a good benchmark data set looks like in this field.
Effects on Diversification
One interesting and unintended discovery from transitioning to a multi-asset strategy is that stocks tend to move in lockstep during major macroeconomic events like the 2008 recession or the 2020 COVID-19 crash. At the same time, stocks tend to rise simultaneously during bull markets. When your multi-asset strategy uses a basket of highly correlated U.S. equities, you have to be sure to compare your performance to the market. Some periods in history, like 2010-2020, were exceptionally long bull markets. As a result, beating the base case is fairly difficult on simulated data during the 2010s.
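One simple way to check the lockstep effect in your own data is to look at the average pairwise correlation of the basket and how closely an equal-weight portfolio tracks the common market factor. The sketch below runs on simulated returns with an invented factor structure, so the specific numbers are illustrative, not market measurements:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated daily returns for 20 stocks, each driven partly by a common
# market factor: a stand-in for a basket of correlated U.S. equities.
n_days, n_stocks = 500, 20
market = rng.normal(0, 0.01, n_days)
idiosyncratic = rng.normal(0, 0.01, (n_days, n_stocks))
returns = market[:, None] + idiosyncratic  # stock = market + own noise

# Average pairwise correlation across the basket.
corr = np.corrcoef(returns.T)
off_diag = corr[~np.eye(n_stocks, dtype=bool)]
print(off_diag.mean())  # well above zero: the basket moves together

# An equal-weight long-only basket mostly just tracks the common factor,
# so always benchmark performance against buy-and-hold of the market.
basket = returns.mean(axis=1)
print(np.corrcoef(basket, market)[0, 1])  # close to 1.0
```

If your strategy's equity curve is nearly indistinguishable from the equal-weight basket, you are effectively holding the market and should judge performance against that base case.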
Getting a trading strategy off the ground with high-frequency data is a daunting undertaking. Retail traders should start their algo-trading efforts by looking at end-of-day data and multi-asset strategies. Otherwise, it will be pretty easy to get discouraged and give up.