Beating the S&P Using Machine Learning


EDIT: This post was removed from the other stock market subreddit due to “self-promotion”. I want to be clear: I am not selling anything, I do not have a website or subscription service, and I am just looking to share my results with people who know more than me and/or have worked on something similar.

Hello All,

I wanted to tell you all about a project I'm working on and hear the thoughts of people who likely know a lot more than me.

I saw some articles on using machine learning to beat the S&P's performance and wanted to try something similar.

I created a model using quarterly report data to predict whether a company's stock would beat the S&P 500 over the following year. Specifically, I looked at whether a stock's return, measured from 2 days after the quarterly report's release to one year + 2 days after the release, beat the S&P 500's return (represented by SPY) over that same period. Why start 2 days after the release of the financial reports? Because I wanted to be realistic about how I would use the model; I likely wouldn't update it every day and wanted some leeway between the release of a report and the “starting point” for the model.
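
To make the target concrete, here is a minimal Python sketch of how such a label could be computed. The post does not say whether the 2-day lag and the one-year horizon were counted in calendar or trading days, so this sketch assumes calendar days and uses the first available close on or after each date. `price_on_or_after`, `beats_spy`, and the price dictionaries are hypothetical names for illustration, not the code actually used.

```python
from datetime import date, timedelta


def price_on_or_after(prices, day):
    """Return the first available close on or after `day`.

    `prices` maps datetime.date -> closing price (hypothetical input format).
    """
    for d in sorted(prices):
        if d >= day:
            return prices[d]
    raise ValueError(f"no price on or after {day.isoformat()}")


def beats_spy(stock_prices, spy_prices, report_date, lag_days=2, horizon_days=365):
    """Label one quarterly report: did the stock out-return SPY over the window?

    The window runs from `lag_days` after the report to `horizon_days` later.
    Calendar days are an assumption; the post does not specify.
    """
    start = report_date + timedelta(days=lag_days)
    end = start + timedelta(days=horizon_days)
    stock_ret = price_on_or_after(stock_prices, end) / price_on_or_after(stock_prices, start) - 1
    spy_ret = price_on_or_after(spy_prices, end) / price_on_or_after(spy_prices, start) - 1
    return stock_ret > spy_ret


# Toy example with made-up prices
report = date(2020, 2, 14)
start = report + timedelta(days=2)
end = start + timedelta(days=365)
stock = {start: 100.0, end: 130.0}  # +30% over the window
spy = {start: 300.0, end: 330.0}    # +10% over the window
print(beats_spy(stock, spy, report))  # True
```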

I downloaded cash flow statements, balance sheets, and stock prices/stock price ratios from 2014 – 2023. I had also downloaded income statements and meant to include them, but forgot to incorporate them in the original model. Once I realized that, I included them but found that accuracy actually dropped quite significantly, so I ended up excluding those numbers.

After excluding columns where more than 50% of the data was missing, I ran a model with all the variables on the training set (quarterly reports published in the years prior to the test year; e.g. to predict 2020, the training set was data published from 2013 – 2019). I did this for each year from 2014 – 2023 (with returns being realized in the following year; e.g. if the model was picking stocks in 2020, returns came in 2021).
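
The column filter and the expanding training window described above could look roughly like this. It is only a sketch in plain Python: the row format (a list of dicts with `None` for missing values), the `report_year` key, and both function names are assumptions for illustration, and it does not reproduce the actual model.

```python
def drop_sparse_columns(rows, max_missing=0.5):
    """Keep only columns where at most `max_missing` of the values are missing (None)."""
    columns = []
    for row in rows:
        for col in row:
            if col not in columns:
                columns.append(col)
    keep = [
        col for col in columns
        if sum(row.get(col) is None for row in rows) / len(rows) <= max_missing
    ]
    return [{col: row.get(col) for col in keep} for row in rows]


def walk_forward_splits(rows, test_years, year_key="report_year"):
    """Yield (test_year, train_rows, test_rows), training only on earlier years."""
    for year in test_years:
        train = [r for r in rows if r[year_key] < year]
        test = [r for r in rows if r[year_key] == year]
        yield year, train, test


# Toy data (made up): "goodwill" is missing in 2 of 3 rows, so it is dropped
rows = [
    {"report_year": 2018, "cash": 1.0, "goodwill": None},
    {"report_year": 2019, "cash": 2.0, "goodwill": None},
    {"report_year": 2020, "cash": 3.0, "goodwill": 5.0},
]
clean = drop_sparse_columns(rows)
for year, train, test in walk_forward_splits(clean, test_years=[2020]):
    print(year, len(train), len(test))  # 2020 2 1
```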

While the accuracy of the model was only okay (~50%), I was most interested in the model's ability to correctly pick stocks that beat the S&P, and in seeing whether relying on these picks was profitable. In other words, I was interested in the precision of the model, i.e. making sure it wasn't giving me too many false positives, rather than in total accuracy itself.
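
The distinction between accuracy and precision can be shown with a small sketch. The function name and the toy labels below are made up for illustration; they are not the post's data.

```python
def accuracy_and_precision(actual, predicted):
    """`actual` and `predicted` are equal-length lists of booleans (True = beat the S&P)."""
    pairs = list(zip(actual, predicted))
    correct = sum(a == p for a, p in pairs)
    true_pos = sum(a and p for a, p in pairs)
    predicted_pos = sum(p for _, p in pairs)
    accuracy = correct / len(pairs)
    # Precision is undefined when the model makes no positive picks
    precision = true_pos / predicted_pos if predicted_pos else None
    return accuracy, precision


# Toy example: the model is right only half the time overall,
# but 3 of its 4 "buy" picks actually beat the market.
actual = [True, True, True, False, True, True, True, False]
predicted = [True, True, True, True, False, False, False, False]
print(accuracy_and_precision(actual, predicted))  # (0.5, 0.75)
```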

With respect to this, the model did quite well, as seen in the table below (years are for returns realized):

Year     Baseline  All (>50%)  60%+  65%+  70%+  80%+
2015     43%       50%         56%   62%   72%   95%
2016     47%       52%         52%   50%   55%   50%
2017     51%       56%         56%   55%   54%   50%
2018     42%       51%         61%   67%   80%   80%
2019     41%       46%         62%   73%   73%
2020     32%       43%         57%   65%   64%   50%
2021     57%       57%         53%   55%   57%   63%
2022     41%       40%         56%   55%   78%
2023     52%       64%         74%   79%   80%   67%
Average  45%       51%         59%   62%   68%   65%


Baseline is the percentage of stocks that actually beat the S&P 500's performance, and the following columns show the percentage of the model's picks that beat the market, grouped by the predicted probability that the stock would beat the market. We can see that stocks with a predicted probability of 70%+ tend to perform the best. Results for 80%+ stocks vary considerably, as only a handful of stocks actually reach that threshold (the total number of stocks recommended varied between 456 and 916 depending on the year).
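
A table like the one above can be built by filtering picks at each probability cutoff and counting how many beat the market. The sketch below uses made-up probabilities and outcomes; the function name is hypothetical, and treating a stock as a pick when its probability is greater than or equal to the cutoff is an assumption about how the boundaries were handled.

```python
def hit_rate_by_threshold(probabilities, beat_market, thresholds=(0.5, 0.6, 0.65, 0.7, 0.8)):
    """Return {threshold: (number_of_picks, share_of_picks_that_beat_the_market)}."""
    result = {}
    for t in thresholds:
        picks = [beat for p, beat in zip(probabilities, beat_market) if p >= t]
        share = sum(picks) / len(picks) if picks else None
        result[t] = (len(picks), share)
    return result


# Toy data (made up): predicted probabilities and whether each stock beat SPY
probs = [0.55, 0.62, 0.66, 0.71, 0.74, 0.83, 0.45, 0.52]
beat = [False, True, False, True, True, False, False, True]

baseline = sum(beat) / len(beat)  # share of all stocks that beat the market
table = hit_rate_by_threshold(probs, beat)
for threshold, (n_picks, share) in table.items():
    print(threshold, n_picks, share)
```

With only one pick at the 0.8 cutoff in this toy data, a single miss gives a 0% hit rate, which is the same small-sample effect that makes the 80%+ column jump around.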

If we look at the average change in stock price, we see that the stocks at 70%+ have the best average returns:

Year    SPY     All (>50%)  60%+ (SPY)  60%+    65%+ (SPY)  65%+    70%+ (SPY)  70%+    80%+ (SPY)  80%+
2015    10.3%   10.6%       10.5%       14.1%   10.9%       16.1%   11.4%       20.8%   12.7%       33.0%
2016    3.9%    2.4%        2.1%        2.9%    1.6%        2.6%    1.4%        4.0%    -0.9%       0.4%
2017    20.1%   30.7%       20.2%       31.9%   20.5%       32.9%   20.7%       35.5%   21.6%       25.3%
2018    14.2%   22.1%       14.9%       24.0%   15.3%       28.0%   15.6%       43.4%   12.4%       20.3%
2019    9.1%    6.3%        7.6%        20.6%   6.0%        32.8%   4.8%        65.1%   0.0%        0.0%
2020    12.2%   7.9%        13.2%       18.6%   13.4%       24.0%   12.3%       22.1%   20.8%       43.9%
2021    36.9%   65.7%       36.2%       57.8%   36.5%       67.3%   32.2%       53.4%   34.3%       49.1%
2022    -0.7%   -4.5%       7.2%        3.3%    6.3%        15.1%   5.4%        24.6%   0.0%        0.0%
2023    2.0%    8.8%        0.4%        12.4%   -0.4%       10.9%   -2.2%       9.1%    1.9%        4.1%
AVG     12.0%   16.7%       12.5%       20.6%   12.2%       25.5%   11.3%       30.9%   11.4%       19.6%
StDev   11.3%   21.1%       10.8%       16.8%   11.3%       18.7%   10.6%       20.2%   12.4%       19.5%
Sharpe  0.7     0.6         0.8         1.0     0.7         1.1     0.7         1.3     0.6         0.8


Of course, these returns only apply if you invest the same dollar amount in each stock (i.e. buying fractional shares).
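
The weighting matters because an equal-dollar portfolio averages the per-stock returns, while a portfolio holding one share of each pick is dominated by the highest-priced stocks. A minimal sketch of the difference, with made-up prices and hypothetical function names:

```python
def equal_dollar_return(positions):
    """Average of per-stock returns: same dollar amount in every pick (fractional shares)."""
    return sum(end / start - 1 for start, end in positions) / len(positions)


def one_share_each_return(positions):
    """Return of a portfolio holding exactly one share of every pick."""
    total_start = sum(start for start, _ in positions)
    total_end = sum(end for _, end in positions)
    return total_end / total_start - 1


# (start_price, end_price) pairs, made up for illustration:
# a $10 stock that doubles and a $1,000 stock that drops 10%
positions = [(10.0, 20.0), (1000.0, 900.0)]
print(round(equal_dollar_return(positions), 3))    # 0.45
print(round(one_share_each_return(positions), 3))  # -0.089
```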

If you instead buy 1 share of each recommended stock (or 10 shares, etc.), the returns are as follows:

Year    SPY     All (>50%)  60%+ (SPY)  60%+    65%+ (SPY)  65%+    70%+ (SPY)  70%+    80%+ (SPY)  80%+
2015    6.9%    10.1%       10.3%       14.0%   10.7%       17.4%   11.2%       22.1%   1.6%        3.5%
2016    2.4%    0.7%        2.0%        0.4%    1.6%        2.4%    1.4%        5.8%    -0.9%       4.7%
2017    14.3%   14.8%       11.8%       15.1%   10.1%       16.6%   11.3%       15.8%   2.2%        -0.8%
2018    14.0%   9.9%        14.7%       11.2%   15.1%       6.0%    15.5%       4.2%    11.9%       -22.6%
2019    9.0%    5.3%        7.7%        16.4%   5.9%        20.6%   4.7%        27.7%   0.0%        0.0%
2020    12.2%   7.9%        13.0%       18.5%   13.2%       25.0%   12.1%       20.9%   20.8%       46.4%
2021    36.3%   43.9%       35.4%       43.5%   36.6%       47.3%   33.7%       42.6%   33.9%       44.2%
2022    -1.6%   -20.4%      2.4%        -1.6%   5.4%        5.2%    5.0%        18.0%   0.0%        0.0%
2023    -27.0%  -19.4%      -16.1%      -6.0%   -12.3%      -6.3%   -13.0%      -1.1%   1.6%        3.5%
AVG     7.4%    5.9%        9.0%        12.4%   9.6%        14.9%   9.1%        17.3%   7.9%        8.8%
StDev   16.7%   19.1%       13.6%       14.6%   13.0%       15.7%   12.5%       13.4%   12.1%       22.3%
Sharpe  0.2     0.1         0.4         0.6     0.4         0.7     0.4         1.0     0.3         0.2


With the exception of 1 year (2018), the stocks picked at 70%+ beat the market every year between 2015 and 2023 under the one-share approach. Furthermore, the 70%+ group has the best Sharpe ratio of all the probability thresholds as well.
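
For reference, a Sharpe ratio over annual returns can be computed as sketched below. The post does not say which risk-free rate was used; the 4% here is an assumption, chosen because, combined with the sample standard deviation, it roughly reproduces the 0.7 reported for SPY in the first returns table (same dollar amount per stock). `sharpe_ratio` is a hypothetical helper name.

```python
from statistics import mean, stdev


def sharpe_ratio(annual_returns, risk_free=0.04):
    """(mean return - risk-free rate) / sample standard deviation of returns.

    The 4% default risk-free rate is an assumption, not a figure from the post.
    """
    return (mean(annual_returns) - risk_free) / stdev(annual_returns)


# SPY column of the first returns table, 2015-2023, as decimals
spy = [0.103, 0.039, 0.201, 0.142, 0.091, 0.122, 0.369, -0.007, 0.020]
print(round(sharpe_ratio(spy), 1))  # 0.7
```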

Anyways, that's what I've been working on, and I'm interested in hearing what people think, anything they think I'm missing with respect to calculating return, a reason this doesn't actually work, etc. I'm relatively new to investing, so any advice would be helpful.

