Introduction
Developed for research purposes at SIT Academy, DOPE is a project designed to simulate acquiring data and processing it in real time. What do you do with the results? How are they evaluated? Machine Learning is only a small aspect of Data Science; the largest part of the work is the infrastructure around the Machine Learning model.
DataOps is a process-oriented, automated methodology for reducing the cycle time of data analytics and improving its accuracy. It fosters collaboration between several parties to leverage data. SIT Academy strives to stay up to date with the newest tools and developments in the industry. As an example use case, our two Data Science students chose to build an ML-based cryptocurrency trader. The goal was to explore the emerging fields of DataOps and MLOps and to build an end-to-end Machine Learning project using a microservice architecture. The team created an adaptable, scalable architecture and developed an end-to-end platform that delivers predictions as its outcome. They combined it with Machine Learning models for time series prediction and a live dashboard for visualizing the performance. As a result, DOPE (the DataOps Prediction Engine) was born!
Tools and technologies used in this project
- Python: scikit-learn, Pandas, NumPy
- Models: Random Forest, LSTM (TensorFlow)
- Messaging: ZeroMQ
- Live dashboard: Bokeh
Project details
To use DOPE to predict the price of cryptocurrencies and to trade them, our students built a service for continuously aggregating live data from the Binance exchange for multiple currencies. The aggregated and pre-processed data is forwarded to multiple ML models via a publisher-subscriber messaging system, which makes it very easy to deploy multiple models in parallel. The structure of the microservices was inspired by the so-called "Rendezvous architecture", which is especially suitable for data streams and for monitoring multiple models running in parallel alongside a real production environment.
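To give a flavor of the publisher-subscriber setup, here is a minimal sketch using ZeroMQ. The port, topic name, and message fields are illustrative assumptions, not the project's actual code:

```python
import json
import zmq

# Publisher side: the aggregation service broadcasts pre-processed ticks.
# Port, topic, and message fields are illustrative assumptions.
def run_publisher():
    ctx = zmq.Context()
    pub = ctx.socket(zmq.PUB)
    pub.bind("tcp://*:5556")
    tick = {"symbol": "BTCUSDT", "price": 27123.5, "volume": 1.4}
    # The symbol doubles as the topic, so subscribers can filter per currency.
    pub.send_string(tick["symbol"] + " " + json.dumps(tick))

# Subscriber side: each model runs as its own service; adding another
# model is just another subscriber, with no change to the publisher.
def run_subscriber(symbol="BTCUSDT"):
    ctx = zmq.Context()
    sub = ctx.socket(zmq.SUB)
    sub.connect("tcp://localhost:5556")
    sub.setsockopt_string(zmq.SUBSCRIBE, symbol)
    topic, payload = sub.recv_string().split(" ", 1)
    return json.loads(payload)  # hand the tick to the model's pipeline
```

Because the publisher is unaware of its subscribers, a new model can be attached to the live data stream without touching any existing service, which is exactly what makes parallel deployment of models so convenient.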
The focus was set on two different models: a Random Forest-based model and a neural network (LSTM) built in TensorFlow. Both models were trained on historical data at 10-second granularity spanning roughly one year. The project was completed with a trader service responsible for deciding whether to buy or sell the traded currency (buy if the prediction is above an upper threshold, sell if it is below a lower threshold), as well as a dashboard for monitoring the live performance of the trader. Furthermore, the students wrote a backtesting service for optimizing the models and the trading thresholds and for systematically comparing the performance of different approaches.
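The decision rule itself can be expressed in a few lines. A minimal sketch, assuming the model predicts the next price; the threshold values here are hypothetical placeholders, since in the project they were tuned via the backtesting service:

```python
def trade_decision(predicted_price, current_price, upper=1.002, lower=0.998):
    """Map a price prediction to a trading action.

    The thresholds are hypothetical placeholders; the project tuned
    them systematically with its backtesting service.
    """
    ratio = predicted_price / current_price
    if ratio > upper:
        return "BUY"
    if ratio < lower:
        return "SELL"
    return "HOLD"
```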
After one month of intensive work, the trading system was able to display predictions in real time during the project presentation.
Conclusion
The main focus of this project was on building a system containing not only a Machine Learning algorithm but also all the infrastructure around it. This showed clearly that in any bigger project, providing Machine Learning models with data and evaluating their performance is more challenging and much more time-consuming than building the models themselves. Evaluating the performance of the models for financial time series prediction turned out to be difficult due to the limited amount of data, the high degree of randomness, and the tendency of backtesting to encourage overfitting. With the limited amount of data, more complex models such as the LSTM could not improve the performance. Another crucial part is creating suitable features from the raw data before feeding it into the models (a sketch of this follows below). To get truly performant models, the price and volume data needs to be enriched with alternative data, which requires a great deal of additional engineering to build reliable data pipelines. The project was a great testing ground for experiencing all the challenges of bringing an ML system into production. At the same time, it was very rewarding to build such a system from the ground up and get it working on real-time data streams.
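As a taste of what such feature engineering might look like, here is a minimal pandas sketch; the column names and window sizes are assumptions for illustration, not the features the team actually used:

```python
import pandas as pd

def make_features(ticks: pd.DataFrame) -> pd.DataFrame:
    """Derive simple rolling features from raw price/volume data.

    Expects 'price' and 'volume' columns at a fixed granularity;
    the 30-step windows are illustrative, not tuned values.
    """
    out = pd.DataFrame(index=ticks.index)
    out["return_1"] = ticks["price"].pct_change()
    out["ma_30"] = ticks["price"].rolling(30).mean()
    out["vol_30"] = out["return_1"].rolling(30).std()
    out["volume_ma_30"] = ticks["volume"].rolling(30).mean()
    return out.dropna()
```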