Classifier accuracy metrics and testing on market

I made a model and tested it on unseen data, using the Random Forest algorithm. The classifier takes the OHLC data of 3 candlesticks and predicts the direction of the 4th candlestick. The evaluation of the model is:

Accuracy : 89.8 %

Classification Report:
    precision    recall  f1-score   support

          0       0.89      0.91      0.90     15987
          1       0.91      0.89      0.90     16099

avg / total       0.90      0.90      0.90     32086



Roc_Auc : 89.8 %
Cohen Kappa Score: 79.60 %
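
These numbers come from scikit-learn calls along the following lines (a minimal sketch; clf, X_test and y_test here stand for the trained Random Forest and the unseen test data):

from sklearn.metrics import (accuracy_score, classification_report,
                             roc_auc_score, cohen_kappa_score)

y_pred = clf.predict(X_test)                  # predicted class labels (0/1)
y_prob = clf.predict_proba(X_test)[:, 1]      # probability of class 1, used for ROC AUC

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("Roc_Auc:", roc_auc_score(y_test, y_prob))
print("Cohen Kappa Score:", cohen_kappa_score(y_test, y_pred))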

Given the imbalance of the dataset, is this a good classifier? When I tested it on a demo account, it produced many false signals, and sometimes it gives many signals in the same direction in a row. I really doubt these metrics.
The train dataset is OHLC data from 2/6/2017 until 31/10/2017.
The test dataset is OHLC data from 1/5/2018 until 1/6/2018.
Class imbalance is a serious problem in general, but since you are using the candle color as the target variable, I don't think it will be an issue here: the bullish and bearish labels have almost the same population. Random Forest builds multiple trees and outputs the class that is the mode of their votes. If you have read my previous replies, I told you something very important that you are still missing: financial time series data is not stationary. Its distribution changes with each new data point, so data mining algorithms like random forest, naive bayes, logistic regression etc. don't work on it as it is.

You should also develop a backtesting program that backtests your trading model. For the confusion matrix above, did you use the unseen data? If you did, the results seem good, and I don't know why you are not getting the same results in demo trading. I suspect you haven't used the unseen data and have instead built the confusion matrix on the training data. Training data results are always biased and cannot be used to evaluate a trading model. You should read the book "Evidence-Based Technical Analysis" by David Aronson. In it, Aronson explains in depth how traditional technical analysis is flawed, and why we need to make the data stationary before doing any data mining.
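
A backtest can start as something very simple, for example (a rough sketch; the signal column and its +1/-1 encoding are my assumptions, and costs/slippage are ignored):

import numpy as np
import pandas as pd

# df has a Close column and a 'signal' column with the model's predicted
# direction for the NEXT candle (+1 = long, -1 = short)
ret = np.log(df['Close']).diff().shift(-1)     # next candle's log return
strategy_ret = df['signal'] * ret              # return earned by following the signal
equity = strategy_ret.cumsum()                 # cumulative log return of the strategy

print("Total log return:", strategy_ret.sum())
print("Hit rate:", (strategy_ret > 0).mean())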

You should make the data stationary. For example, log returns are stationary. In the same manner, (EMA/Price)-1 or (SMA/Price)-1 is also stationary. How do you know this? The mean of log returns is zero, and the mean of (EMA/Price)-1 as well as (SMA/Price)-1 is also zero. These are just two ways to make the data stationary; there are many others. How do you know the data is stationary? The mean should be zero and the variance should be constant. Using these tips, you should remodel your data and then do the training and testing again.
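
As a rough illustration of those two transformations (a sketch, assuming a pandas DataFrame df with a Close column; the period of 20 is an arbitrary choice):

import numpy as np

# log returns: roughly zero-mean, far closer to stationary than raw prices
df['log_ret'] = np.log(df['Close']).diff()

# (EMA/Price)-1 and (SMA/Price)-1: oscillate around zero as well
ema = df['Close'].ewm(span=20, adjust=False).mean()
sma = df['Close'].rolling(window=20).mean()
df['ema_ratio'] = ema / df['Close'] - 1
df['sma_ratio'] = sma / df['Close'] - 1

# quick sanity check: mean near zero, variance roughly stable
print(df[['log_ret', 'ema_ratio', 'sma_ratio']].describe())
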
I used unseen data, but I found out my problem was that the features were calculated on the next candlestick's data instead of the previous one. I used Random Forest because I read that this algorithm works best when there is class imbalance.

Before building the train and test datasets, I applied MinMaxScaler and StandardScaler to normalize and scale the data. Doesn't this make the data stationary? Should I use (SMA/Price)-1 as the target?

The features I used are:
import numpy as np

# relative body size as a percentage of the open
df['rb'] = abs(df.Close - df.Open) / df.Open * 100
condOC = (df['Close'] > df['Open'])   # bullish candles: close above open

# shadow ratios, measured in the direction of the candle
df['us'] = np.where(condOC, (df.Close - df.Open) / (df.High - df.Open) * 100,
                            (df.Open - df.Close) / (df.High - df.Close) * 100)
df['ls'] = np.where(condOC, (df.Close - df.Open) / (df.Close - df.Low) * 100,
                            (df.Open - df.Close) / (df.Open - df.Low) * 100)

# body boundaries and shadow/body lengths relative to the open (in basis points)
df['maxOC'] = df[['Open', 'Close']].max(axis=1)
df['minOC'] = df[['Open', 'Close']].min(axis=1)
df['Lupper'] = (df.High - df.maxOC) / df.Open * 10000
df['Lbody'] = (df.maxOC - df.minOC) / df.Open * 10000
df['Llower'] = (df.minOC - df.Low) / df.Open * 10000

There is also a positioning feature based on the previous candlestick's Open and Close.
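Roughly along these lines (an illustrative sketch only, not necessarily the exact definition used):

# illustrative positioning features relative to the previous candle's open/close
df['gap_open'] = (df['Open'] - df['Close'].shift(1)) / df['Close'].shift(1) * 100
df['prev_bull'] = (df['Close'].shift(1) > df['Open'].shift(1)).astype(int)
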
Feature engineering is one of the most difficult tasks in machine learning. Data scientists spend a lot of time on feature selection and engineering. If you have heard of the Kaggle competitions, most winners stress the importance of feature engineering. Building machine learning models can take weeks. A model may work in one situation and not in another, so we need to build a number of models.

This is what we do in algorithmic trading: we build a number of models and then combine them. Looking at the features you have selected, I don't think the RandomForest algorithm will be able to read the candlestick patterns. If you take a look at the RandomForest algorithm, it randomly selects a few features and builds a decision tree with them. Then it resamples with replacement, randomly selects new features, and builds another decision tree. In the end, it combines the decision trees it has built and takes the majority vote as the final result.

When combining models we can use majority voting. But I don't think that with the present features RandomForest will be able to make any sense of the data; it will just randomly select features, and it can happen that the features it selects contain only noise. When we use these machine learning algorithms, we need to understand them in depth. In the same manner, deep learning also needs a good understanding of statistical theory. We cannot use these algorithms blindly.
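
For illustration, majority voting over several models can be set up with scikit-learn's VotingClassifier (a sketch; X_train, y_train, X_test, y_test stand for features that have already been made stationary and the direction labels):

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# three different base models combined by hard (majority) voting
ensemble = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=200, random_state=42)),
        ('lr', LogisticRegression(max_iter=1000)),
        ('nb', GaussianNB()),
    ],
    voting='hard',
)
ensemble.fit(X_train, y_train)
print("Ensemble accuracy:", ensemble.score(X_test, y_test))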

I think you should use the return as the target variable in one model; we can classify the return as big or small. Then we can build another model that has direction as the target, which you have classified as color. In the return model you can use candlestick patterns like doji, engulfing pattern, inside bar, shooting star etc. These are rough ideas. First you make the return stationary, then you can use return lags as input features.
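
A rough sketch of that idea, with arbitrary choices of mine for the number of lags and the big/small threshold:

import numpy as np

# lagged log returns as input features
df['log_ret'] = np.log(df['Close']).diff()
for k in range(1, 4):
    df[f'ret_lag{k}'] = df['log_ret'].shift(k)

# target 1: is the NEXT candle's move "big" (above the median absolute move)?
next_ret = df['log_ret'].shift(-1)
df['big_move'] = (next_ret.abs() > df['log_ret'].abs().median()).astype(int)

# target 2: direction of the next candle (what you called color)
df['direction'] = (next_ret > 0).astype(int)

# drop the rows with NaN from the shifts before training
df = df.dropna()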

Most financial modeling is done using the closing price. When we use only the closing price, we lose the information provided by the open, high and low. Using candlestick patterns can add that information to the model; the challenge is how to add it. As said above, RandomForest may not be a good algorithm for modeling returns, but it can be a good algorithm for modeling direction. Don't get frustrated if your model is not working. It can take some time, and you will have to do some good brainstorming.