Training the Machine Learning Algorithms

With our features generated and scaled, we are now free to begin training different machine learning algorithms and select which to use for our final predictor. It is here that we make a fundamental, and bad, assumption: the trend in the closing stock price depends only on the features we have selected, and on no other external factors. We make this assumption for simplicity. We are trying to predict trends, not to nail down the exact stock price after the third quarter earnings are announced, just before a company's yearly trade show. The predictor is better thought of as a tool that could be used alongside a model of those external factors to make investing decisions.

There are two important consequences of this assumption for training the algorithm. First, we are able to train on all of our quote data over the full time window. Ordinarily with time series data, it is vital to train on the earliest data and test on the latest data. With our assumption, this requirement goes away, since past data is already encoded in the features we include, and we assume that information is all we need to determine future trends. Second, as we can train on the most recent data, the model sees the most recent market behavior, and we don't have to worry that it was trained only on markets from 2-10 years in the past, which may behave very differently.

Model Selection and Tuning

The data was shuffled and split into 80/20 training/test samples. The target variables were the percentage increase/decrease in the rolling mean of the stock price, for rolling means over different numbers of days. We used Python's scikit-learn package to build our machine learning models, and tuned the hyperparameters of the estimators using cross-validation on different folds of the training data.
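
As a rough illustration of the split and tuning step, the sketch below uses scikit-learn's train_test_split and GridSearchCV. The synthetic data, the 5 folds, and the parameter grid are assumptions made for the example; the values actually searched are not given here.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the scaled features and the percent-change target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))
y = 0.02 * X[:, 0] - 0.01 * X[:, 1] + rng.normal(scale=0.01, size=500)

# Shuffled 80/20 split (shuffle=True is the default; stated for clarity).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0
)

# Hypothetical grid: illustrative values only.
param_grid = {"n_estimators": [25, 50], "learning_rate": [0.1, 1.0]}
search = GridSearchCV(
    AdaBoostRegressor(random_state=0),
    param_grid,
    cv=5,  # assumed number of folds
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)  # cross-validation uses the training data only

# The held-out 20% is touched once, after tuning.
test_mse = mean_squared_error(y_test, search.best_estimator_.predict(X_test))
print(search.best_params_, test_mse)
```

Keeping the test sample out of the grid search means the reported test error is not flattered by the tuning itself.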

We decided to test several machine learning regressors for this project: a Support Vector Machine (SVM), a Random Forest, a Multi-Layer Perceptron, a Bagging Regressor, and finally an AdaBoost Regressor. Each was selected for its potential strengths with this kind of data. Support vector machines tend to perform well with large amounts of continuous data. A Random Forest may simulate the way an investor thinks, comparing multiple traditional estimators to determine a future trend. Similarly, neural nets have proved excellent at combining a large amount of chaotic data to make accurate predictions, which adds the Multi-Layer Perceptron to our list. Bagging Regressors are another good choice when dealing with noisy data, since they combine multiple other regressors. Finally, AdaBoost is an excellent choice for noisy data, adapting weights and repeatedly refitting to provide better results.
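
A comparison of this kind could be sketched as follows. The five estimator classes are real scikit-learn regressors, but the synthetic data and every hyperparameter shown are placeholder assumptions, not the tuned settings behind the results reported here.

```python
import numpy as np
from sklearn.ensemble import (
    AdaBoostRegressor,
    BaggingRegressor,
    RandomForestRegressor,
)
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

# Synthetic stand-in for the scaled features and percent-change target.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 10))
y = 0.02 * X[:, 0] + rng.normal(scale=0.01, size=400)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Placeholder hyperparameters, chosen only so the example runs quickly.
models = {
    "SVR": SVR(epsilon=0.001),
    "RFR": RandomForestRegressor(n_estimators=50, random_state=1),
    "MLP": MLPRegressor(hidden_layer_sizes=(32,), max_iter=500, random_state=1),
    "Bag": BaggingRegressor(n_estimators=20, random_state=1),
    "Ada": AdaBoostRegressor(n_estimators=50, random_state=1),
}

# Fit each regressor on the same split and record its test MSE.
mse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    mse[name] = mean_squared_error(y_test, model.predict(X_test))

for name, value in sorted(mse.items(), key=lambda item: item[1]):
    print(f"{name}: {value:.5f}")
```

Using one shared split for every model keeps the errors directly comparable.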

We also considered implementing a model that included principal component analysis (PCA). As many of our features differ only in the number of days the variable is calculated over, there is a large overlap in the information they contain. We thus fit models with and without PCA, reducing the dimensionality of the momentum variables from 5 to 2, the Bollinger Band variables from 5 to 2, and the RSI variables from 2 to 1 (using a 90% threshold for the information retained).
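
One way to apply such a threshold to a single feature group is sketched below. It assumes the 90% threshold refers to the cumulative explained variance, which is how scikit-learn's PCA interprets a fractional n_components; the five "momentum" columns are synthetic stand-ins.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Hypothetical group of five momentum features that differ only in window
# length: two shared underlying signals plus a little independent noise.
base = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
momentum = base @ mixing + rng.normal(scale=0.05, size=(300, 5))

# A fractional n_components keeps the fewest components whose cumulative
# explained variance ratio exceeds that fraction (requires the full solver).
pca = PCA(n_components=0.90, svd_solver="full")
momentum_reduced = pca.fit_transform(momentum)

print(momentum_reduced.shape, pca.explained_variance_ratio_.sum())
```

Each feature group (momentum, Bollinger Bands, RSI) would get its own PCA fit, so that components from one family of indicators are not mixed with another.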

We further tested whether it would be worth predicting the absolute change in the rolling mean (a dollar value) rather than the percent change.
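
The two target variants can be sketched with pandas as below. The exact target definition is not spelled out here, so the look-ahead (the rolling mean N days in the future compared with today's rolling mean) is an assumption, as are the synthetic prices and the variable names.

```python
import numpy as np
import pandas as pd

# Synthetic closing prices; a real run would use the quote data instead.
rng = np.random.default_rng(3)
close = pd.Series(100 + rng.normal(scale=1.0, size=120).cumsum())

n_days = 5  # rolling-mean window, also used as the look-ahead (assumption)
rolling = close.rolling(n_days).mean()
future = rolling.shift(-n_days)  # rolling mean n_days ahead of each row

abs_change = future - rolling              # absolute change, in dollars
pct_change = (future - rolling) / rolling  # fractional (percent) change

# Rows without a full window or without a future value are dropped.
targets = pd.DataFrame({"abs": abs_change, "pct": pct_change}).dropna()
print(targets.head())
```

The percent target is the absolute target divided by the current rolling mean, which puts cheap and expensive stocks on a common scale.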

For the initial model selection, we compared the predictions for the 5 day and 15 day rolling means. This subset was chosen as a simple test of short and long term predictions. The mean square error of the predictions, including the PCA and absolute-change variations, can be seen in Table 1.

Table 1 - Mean square errors for rolling mean predictions, for different machine learning models. The ordinary predictor uses a percent change, Abs indicates an absolute change (dollar value), and PCA indicates the model used PCA for dimensionality reduction.
Predictor    5 Day Rolling Mean    15 Day Rolling Mean
SVR          0.00145               0.00294
SVR, PCA     0.00133               0.00270
SVR, Abs     33.5                  81.8
RFR          0.00128               0.00273
RFR, PCA     0.00133               0.00265
RFR, Abs     27.4                  63.9
MLP          0.00135               0.00293
MLP, PCA     0.00141               0.00304
MLP, Abs     24.3                  64.2
Bag          0.00129               0.00275
Bag, PCA     0.00133               0.00270
Bag, Abs     25.5                  62.9
Ada          0.00125               0.00263
Ada, PCA     0.00130               0.00255
Ada, Abs     27.9                  64.3



Looking at the table, there is no clear benefit from including PCA in our pipeline. Without an added benefit, we chose to omit PCA, both to include as much information as possible and to decrease data reduction time. Less obvious is whether the absolute change in the rolling mean provides different results than the percent change. The MSE of a percent change is easy to interpret. Stock prices, however, vary wildly, which makes an MSE of, say, 25.5 (in squared dollars) harder to interpret. The percent-change errors look far smaller than the dollar errors, but the two are not on the same scale, so let's examine the results in more detail.

Generally, AdaBoost demonstrated the best performance, followed closely by the Bagging regressor and the Random Forest. Figure 1 shows the predicted 15 day percentage increase as a function of the true 15 day percentage increase, for our test dataset. The three regressors all show the same general behavior, overpredicting the downward trends and underpredicting the upward trends, but overall much of the variability is captured. Figure 2 shows the same comparison for the models predicting absolute changes in price. Clearly, these models fail, so we will use the percent change model. Specifically, we will adopt the AdaBoost regressor, which scored the best for both the 5 and 15 day tests.

Predictions of 15 day rolling mean percent changes
Figure 1 - Prediction of the percent change in the 15 day rolling mean for our three best regressors. There is a clear skew in the predicted values as a function of the true values.
Predictions of 15 day rolling mean absolute changes
Figure 2 - Prediction of the absolute change in the 15 day rolling mean for our three best regressors. Predicting the absolute change clearly is not an option.



One may be inclined to fit a linear regression to the results in Figure 1, and transform the predictions so they fall on a 1:1 line. This would, however, result in worse performance of our algorithm. To understand why, we plot in Figure 3 the difference between the predicted percent change and the true percent change, as a function of the predicted change for the 15 day mean. There is clear symmetry in the uncertainty, a symmetry that is necessary for unbiased results, since we take the predicted value as our estimate of the true value. If we attempted a further transformation, it would skew this error and lead to biased results.
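
The effect can be reproduced on synthetic data, as a rough sketch rather than a reanalysis of the results above. Here the "prediction" is an idealized unbiased predictor that captures only part of the true change; the noise levels and the helper names are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
signal = rng.normal(scale=0.03, size=20_000)  # part the features explain
noise = rng.normal(scale=0.04, size=20_000)   # part they cannot explain
y_true = signal + noise
y_pred = signal  # idealized unbiased predictor

# Predicted vs. true has a slope below 1 even for this unbiased predictor,
# which produces the same kind of skew described for Figure 1.
slope, intercept = np.polyfit(y_true, y_pred, 1)

# "Stretch" the predictions so the predicted-vs-true trend lies on 1:1.
y_stretched = (y_pred - intercept) / slope


def mse(pred):
    """Mean square error against the true changes."""
    return np.mean((pred - y_true) ** 2)


def mean_residual_for_high_predictions(pred):
    """Average (predicted - true) over the top 10% of predictions."""
    high = pred > np.quantile(pred, 0.9)
    return np.mean(pred[high] - y_true[high])


print(f"slope of predicted vs. true: {slope:.2f}")
print(f"MSE original: {mse(y_pred):.5f}, stretched: {mse(y_stretched):.5f}")
print(
    "mean residual, top 10% of predictions:",
    mean_residual_for_high_predictions(y_pred),
    mean_residual_for_high_predictions(y_stretched),
)
```

In this toy setup the original residuals are centered on zero at any given predicted value, while the stretched predictions overshoot for large predicted changes and have a higher MSE.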

Difference in predicted change vs true change
Figure 3 - Difference between the predicted percent change, and the true percent change.


Next up: Interpolating Rolling Means