I’ve replicated the following academic paper from my favourite journal:

• Title: Nonlinear support vector machines can systematically identify stocks with high and low future returns

• Authors: Ramon Huerta, Fernando Corbacho, and Charles Elkan

• Journal: Algorithmic Finance (2013), pp. 45-58, DOI 10.3233/AF-13016, IOS Press

http://algorithmicfinance.org/2-1/pp45-58/

**Summary**

The authors explore whether features in accounting data and in historical price information can help predict changes in companies’ stock prices.

The original source data came from the **CRSP** (Center for Research in Security Prices) / **Compustat** merged database (**CCM**): 7 technical features are calculated from CRSP and 44 fundamental features are obtained from Compustat.

All U.S. stocks between 1981 and 2011 are used.

A **Support Vector Machine** (SVM) is used as the classifier to predict the future direction of stock price returns. A distinct contribution by the authors is the selection of the SVM model’s hyper-parameters by a type of **reinforcement learning**.

Stocks that do not seem to have strong correlations with the technical and fundamental features are removed from the training data set. This leads to a significant reduction in computational time without hindering the predictive power of the model.

When forming the tail sets that constitute the positive and negative classes of the training data, an ordered list of stocks with volatility-adjusted returns is created. The estimate of the volatility is an exponential moving average, using a type of absolute deviation calculation.
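The tail-set construction described above can be sketched roughly as follows. The author's implementation was in R; this is an independent Python illustration, not the paper's exact formula. The smoothing factor `alpha`, the use of absolute returns as the deviation measure, and the function names are all assumptions made for the example.

```python
def ema_abs_volatility(returns, alpha=0.1):
    """Exponential moving average of absolute returns (an assumed volatility proxy)."""
    vol = abs(returns[0])
    for r in returns[1:]:
        vol = alpha * abs(r) + (1 - alpha) * vol
    return vol


def tail_sets(returns_by_stock, quantile=0.25, alpha=0.1):
    """Rank stocks by volatility-adjusted return and return (positive, negative) tails.

    returns_by_stock maps a ticker to a list of periodic returns, oldest first.
    The latest return is scaled by a volatility estimate built from earlier returns only.
    """
    scores = {}
    for ticker, rets in returns_by_stock.items():
        if len(rets) < 2:
            continue  # not enough history to estimate volatility
        vol = ema_abs_volatility(rets[:-1], alpha)
        if vol > 0:
            scores[ticker] = rets[-1] / vol
    ranked = sorted(scores, key=scores.get)  # lowest score first
    k = int(len(ranked) * quantile)          # number of stocks in each tail
    positive = ranked[len(ranked) - k:]      # highest volatility-adjusted returns
    negative = ranked[:k]                    # lowest volatility-adjusted returns
    return positive, negative
```

Stocks in the middle of the ranking fall into neither class, so they are left out of the training data.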

The fundamental features come from accounting data: Income Statements, Balance Sheets and Statements of Cash Flows (e.g. book value (BV), earnings per share (EPS)). Technical features were selected by the authors based on the following claims:

• Stocks with high (low) returns over periods of three to 12 months continue to have high (low) returns over subsequent three to 12 month periods.

• Volume is a way to characterize under-reactions and over-reactions in stock price changes.

• The number of n-day highs and n-day lows, as suggested in prominent literature.

• The maximum of daily returns is considered an indicator of the interest for traders with few open positions.

• A proxy for the resistance level is used (i.e. the percentage difference from the closest peak in the past) because, at least psychologically, it can be an important factor for traders.

A model is built for each sector, as defined by the Global Industry Classification Standard (GICS): Energy (10), Materials (15), Industrials (20), Consumer Discretionary (25), Consumer Staples (30), Health Care (35), Financials (40), Information Technology (45), Telecommunication Services (50), and Utilities (55). Stocks without a sector are omitted. The Telecommunication Services and Utilities sectors contain so few stocks that they are omitted too.

**My implementation**

I was able to download the CCM data through the Wharton Research Data Services (WRDS) interface. Substantial effort was required to load, subset, merge and clean the data.

It is worth noting that a Google Summer of Code project to develop a WRDS CRSP R package is in progress, and its use is something to consider in the future.

Typically, the number of features characterizing each stock varies from 7 to 51 depending on whether technical data, fundamental data, or both are used.

However, using all these features with 30 years of data for ALL stocks in the U.S. made the replication take far too long to simulate, not least because the SVM model was **re-trained every day**. I contacted the original authors of the paper and they told me that I might be able to reproduce the results in 3-4 months of computation if I had access to several multicore computers.

Therefore, to ensure I could at least work through the strategy replication process and test the main hypothesis, I had to reduce the dataset as follows:

• The fundamental features were removed. The paper demonstrated that these were not as good as the technical features in characterizing the stocks (i.e. the predictive power was not as good).

• The stocks are divided into 8 sectors. I have concentrated on one sector only (Energy).

• The time series was reduced to 2 years only.

• I did not apply the liquidity (LIQ) and dollar trading volume (DTV) stock filters. These filters eliminate stocks that do not have sufficient capacity to be traded by large mutual funds.

I developed the SVM model using the **e1071** R package. Each day, the model was trained on tail sets formed with a quantile of 25% (i.e. the stocks with the 25% highest and 25% lowest volatility-adjusted returns), using a history of 10 days. The model was then used to predict the future (1-day-ahead) direction of stock prices. After each prediction was made, the model was tested for accuracy.

I was unable to use the R package **PortfolioAnalytics** because the portfolio rebalancing was too complicated: the strategy forms portfolios of 10 equally weighted long and 10 equally weighted short positions. Each position is closed at the end of the last trading day in the following 91 days. Every 28 days an additional 20 positions are opened.

Hence, I wrote my own code to manage and track the sub-portfolios using a queue: the sub-portfolio at the front of the queue is always the next one to be sold.
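A minimal sketch of such a first-in, first-out queue of sub-portfolios is shown below. The original code was written in R and is not reproduced here; this Python version is only an illustration. It counts days with a simple integer index rather than a real trading calendar, and `run_queue` and `pick_positions` are hypothetical names introduced for the example.

```python
from collections import deque

HOLDING_DAYS = 91    # a sub-portfolio is closed once it is 91 days old
REBALANCE_DAYS = 28  # a new sub-portfolio is opened every 28 days


def run_queue(n_days, pick_positions):
    """Simulate opening and closing sub-portfolios over n_days.

    pick_positions(day) is a hypothetical callback returning (longs, shorts).
    Returns the event log and the queue of sub-portfolios still open at the end.
    """
    queue = deque()  # oldest sub-portfolio sits at the left end
    log = []
    for day in range(n_days):
        # Close expired sub-portfolios; the front of the queue is always the oldest.
        while queue and day - queue[0]["opened"] >= HOLDING_DAYS:
            closed = queue.popleft()
            log.append((day, "close", closed["opened"]))
        # Open a new sub-portfolio on each rebalance day.
        if day % REBALANCE_DAYS == 0:
            longs, shorts = pick_positions(day)
            queue.append({"opened": day, "longs": longs, "shorts": shorts})
            log.append((day, "open", day))
    return log, queue
```

With a 91-day holding period and a 28-day rebalance cycle, at most four sub-portfolios are open at once in this sketch, so the queue stays short.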

I implemented all 7 technical features from the original source paper:

• Momentum 3 months

• Momentum 1 year

• Volume change 3 months

• Volume change 1 month

• 12-month Highs/Lows

• maxR

• Resistance levels
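To give a feel for what these features look like in code, here is a rough Python sketch of four of them. These are simplified definitions assumed for illustration and may differ from the exact formulas in the paper or in my R code. In particular, the resistance proxy uses the highest price in the lookback window as a stand-in for "the closest peak in the past". Each function assumes the input list holds enough history for the requested lookback.

```python
def momentum(prices, lookback):
    """Simple return over the last `lookback` periods."""
    return prices[-1] / prices[-1 - lookback] - 1.0


def volume_change(volumes, lookback):
    """Change in average volume: latest window versus the window before it."""
    recent = sum(volumes[-lookback:]) / lookback
    prior = sum(volumes[-2 * lookback:-lookback]) / lookback
    return recent / prior - 1.0


def max_daily_return(prices, lookback):
    """Largest one-period return over the last `lookback` periods (a maxR-style feature)."""
    window = prices[-(lookback + 1):]
    return max(b / a - 1.0 for a, b in zip(window, window[1:]))


def resistance_gap(prices, lookback):
    """Percentage difference between the latest price and the highest earlier price
    in the window (an assumed proxy for the resistance level)."""
    peak = max(prices[-(lookback + 1):-1])
    return prices[-1] / peak - 1.0
```

For daily data, a 3-month momentum would use a lookback of roughly 63 trading days and a 1-year momentum roughly 252.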

A full set of summary statistics was written, utilizing R packages **PerformanceAnalytics** and **quantmod** when needed:

• Annualized Returns

• Cumulative Returns

• Annualized Sharpe ratio (SR)

• Non-normality adjusted Standard Error (se) for the Annualized SR

• Annualized STARR using a Non-Parametric Expected Tail Loss (ETL) estimate

• Maximum Drawdown

• Average Turnover

• Average Diversification

• Accuracy of the SVM model
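A few of these statistics are simple enough to sketch directly. The Python below is an illustration using textbook definitions, not the R code behind the results in this post. It assumes daily returns, 252 trading days per year, and a zero risk-free rate by default.

```python
import math

TRADING_DAYS = 252  # assumed number of trading days per year


def annualized_return(daily_returns):
    """Arithmetic mean daily return scaled to a year."""
    return sum(daily_returns) / len(daily_returns) * TRADING_DAYS


def annualized_volatility(daily_returns):
    """Sample standard deviation of daily returns scaled to a year."""
    n = len(daily_returns)
    mean = sum(daily_returns) / n
    variance = sum((r - mean) ** 2 for r in daily_returns) / (n - 1)
    return math.sqrt(variance * TRADING_DAYS)


def annualized_sharpe(daily_returns, risk_free=0.0):
    """Annualized excess return divided by annualized volatility."""
    excess = annualized_return(daily_returns) - risk_free
    return excess / annualized_volatility(daily_returns)


def max_drawdown(daily_returns):
    """Largest peak-to-trough loss of the compounded wealth curve, as a fraction."""
    wealth = peak = 1.0
    worst = 0.0
    for r in daily_returns:
        wealth *= 1.0 + r
        peak = max(peak, wealth)
        worst = max(worst, 1.0 - wealth / peak)
    return worst
```

The standard error of the Sharpe ratio and the STARR figures need skewness, kurtosis and tail-loss estimates, so they are left out of this sketch.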

Please note, when constructing and rebalancing the sub-portfolios, transaction costs were ignored and it was assumed that execution was instant (i.e. no price slippage).

An especially common error when backtesting and making a decision at time t is to use/include the data at time t+1. Thus, particular attention was given to any possibility of look-ahead bias. For example, when using the model to make predictions for time t+1, only data at time t was used; i.e. the 7 technical features from EOD yesterday were used to predict the price direction for today.
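The alignment that avoids look-ahead bias can be sketched as a walk-forward generator. This is a simplified Python illustration rather than my R code. The names `walk_forward_samples`, `features` and `directions` are hypothetical: `features[t]` stands for the feature vector known at the close of day t, and `directions[t]` for the realised price direction on day t.

```python
def walk_forward_samples(features, directions, history=10):
    """Yield (train_X, train_y, test_x, test_y) for each day t without look-ahead.

    Every label is paired with the features from the day before it, and the
    training window ends at day t-1, so nothing from day t leaks into training.
    """
    for t in range(history + 1, len(features)):
        train_days = range(t - history, t)                # label days t-history .. t-1
        train_X = [features[s - 1] for s in train_days]   # features from the previous close
        train_y = [directions[s] for s in train_days]
        test_x = features[t - 1]                          # yesterday's EOD features
        test_y = directions[t]                            # today's direction, used only for scoring
        yield train_X, train_y, test_x, test_y
```

A classifier would be fitted on `train_X` and `train_y`, asked to predict from `test_x`, and only then compared against `test_y`.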


**Results**

    ## [1] "===================================================================="
    ## [1] "ENERGY SECTOR (U.S. STOCKS) 2009-01-01::2010-12-31 SVM MODEL"
    ## [1] "Annualized return (arith): 9.9 %"
    ## [1] "Annualized Geo mean return: 8.7 %"
    ## [1] "Cumulative returns: 131.9 %"
    ## [1] "Annualized volatility: 15.2 %"
    ## [1] "Annualized sharpe ratio: 0.65 (0.326)"
    ## [1] "Ratio (SR/se(SR)): 0.502"
    ## [1] "STARR (5% ETL): -24.321"
    ## [1] "NSTARR (5% NETL): 2.151"
    ## [1] "Annualized STARR (5% ETL): -175.384"
    ## [1] "Annualized NSTARR (5% NETL): 15.511"
    ## [1] "Max Drawdown: 0.366"
    ## [1] "SVM Accuracy (Mean): 48.8 %"
    ## [1] "SVM Accuracy (Max): 58.1 %"
    ## [1] "SVM Accuracy (Min): 41.1 %"
    ## [1] "===================================================================="

With a mean accuracy of 48.8%, the SVM model performs no better than chance. It’s possible that the technical features do not characterize the stocks well enough, or that the hyper-parameters are not tuned.

The returns and SR are similar to those demonstrated in the original source paper, but the backtest needs to be run over a longer period for its results to be statistically significant.

If the model were not re-trained every day, the time needed to run the backtest would be significantly reduced. How often the classifier should be trained is an active area of research.

It was fun to replicate the paper and I learned a lot. The paper is definitely worth reading!