Data Science Capstone Projects #18

by Ekaterina Butyugina

In this blog post, we highlight the capstone projects that our part-time and full-time Data Science students completed at the end of the program. Take a look at the results they've achieved in such a short period of time.


Cortexia: Sustainable Clean City - Darkzones Analytics

Students: Dominik Bacher, Valeriia Rutskaia

Cortexia offers a world-leading solution for cleaning cities efficiently while saving resources, preserving the quality of drain water, and conserving the landscape. It uses a computer vision system mounted on sweepers and other vehicles to detect and count different types of litter left on the streets. Measurements are taken every day in different regions of the city. However, the coverage of the moving cameras in space and time is rather low, around 30–40% of the whole city. To predict the amount of litter in overlooked areas and make cleaning schedules more efficient, machine learning algorithms were applied. Since the amount of litter depends on common features, the idea is to sample a part of the city and predict the remaining uncovered streets, referred to as the "dark zones."

 Results after the predictions

The picture on the left shows the street segments in green where measurements were taken and in red - the streets which have to be predicted. The right picture shows the result after the calculations.
Students Dominik and Valeriia received the measurements from various sweepers, then engineered new features, such as weather conditions, the proximity of a bar or restaurant, etc., to determine the amount of litter on the streets. They chose and trained various machine learning models for litter prediction and built a ready-to-use data science pipeline covering data aggregation and model training. They evaluated the models by making predictions on measured data and computing common metrics such as D-squared and R-squared.
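The post doesn't name the exact features or model the students used, but the setup they describe can be sketched roughly as follows. The feature names and the random-forest choice below are illustrative assumptions, and the data is synthetic; only the overall shape (engineered features, regression model, R-squared on held-out data) follows the description above.

```python
# Minimal sketch of the litter-prediction setup: engineered features
# per street segment, a regression model, and R-squared evaluation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500  # street segments with measurements

# Illustrative engineered features: rainfall, nearby bars/restaurants, weekday
X = np.column_stack([
    rng.uniform(0, 20, n),    # rainfall in mm
    rng.integers(0, 10, n),   # number of nearby bars/restaurants
    rng.integers(0, 7, n),    # day of week
])
# Synthetic litter counts loosely tied to the features, plus noise
y = 5 + 0.8 * X[:, 1] - 0.2 * X[:, 0] + rng.normal(0, 2, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))
print(f"R-squared on held-out segments: {r2:.2f}")
```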
Since the predicted data is noisy, the model is more accurate when aggregated over a larger number of street segments: it therefore scores better on a whole sweeper route than on a single street segment.
The graph below shows that the more street segments we take into consideration, the smaller the margin of error.
Margin of error

The future vision is that this project will be the first step for efficient cleaning of cities in a more sustainable way.

Talmis: Macroeconomic forecasting using machine learning methods

Students: Hussam Al-Homsi, Patrizia Will

Talmis provides consulting and advisory services in treasury, ALM, capital markets, and risk & finance at both strategic and operational levels, and is responsible for advising banks on how they should react under stress conditions. Patrizia and Hussam were chosen to help Talmis tackle this challenge.

To test the resilience of banks to challenging economic situations, stress tests are used. These stress tests are performed using various hypothetical future scenarios, ranging from optimistic to pessimistic global economic outlooks.

The predictive power of the stress tests rests on high-quality data sets that include two types of data: a) a bank's internal financial statements; b) external economic data, i.e., macroeconomic variables (MEVs) such as GDP, CPI, the unemployment rate, and the property price index. While the internal bank documents are readily available, high-quality economic data, i.e., the projections of MEVs, can be hard to come by.

Thus, Hussam and Patrizia tested various approaches to enrich the forecast of MEVs for one country by exploiting correlations between different MEVs within a country and globally.    

For the scope of this work, they used the data set provided by the International Monetary Fund (IMF). This data set gives annual entries of the main MEVs of 196 countries ranging back to as early as 1980. 

In an attempt to find the best model for their task, they tested many algorithms and evaluated applicability based on the performance metrics. 

The following approach produced the best performance metrics and is used for this project:
  • First, they applied time-series clustering to group the 196 countries into clusters with similar historical trends/shapes of the respective MEV. 
  • Then, they performed statistical filtering using the Granger causality test to select countries with higher predictive power for the target country per respective MEV (they used p < 0.05). 
  • Finally, by applying a combination of Facebook's additive model "Prophet" and a multivariate vector autoregressive (VAR) model, they were able to predict the MEVs stepwise, year by year.
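The first step above, time-series clustering, can be sketched as follows. The post doesn't name the clustering method or library, so as an illustrative assumption each country's annual GDP series is z-normalized (so clusters capture trend shape, not absolute level) and grouped with plain k-means; a dedicated time-series method such as DTW-based clustering may have been used instead, and the data here is synthetic.

```python
# Sketch: group 196 countries by the shape of their annual GDP series.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
years = 40          # annual entries, e.g. 1980-2019
n_countries = 196

# Synthetic GDP series drawn from four distinct trend shapes, plus noise
t = np.arange(years)
pattern_bank = [
    t.astype(float),          # steady growth
    -t.astype(float),         # steady decline
    (t - years / 2) ** 2,     # dip-and-recovery
    10 * np.sin(t / 5),       # cyclical
]
idx = rng.integers(0, 4, n_countries)
gdp = np.stack([pattern_bank[i] for i in idx]) + rng.normal(0, 2, (n_countries, years))

# z-normalize each series so clustering captures shape rather than scale
z = (gdp - gdp.mean(axis=1, keepdims=True)) / gdp.std(axis=1, keepdims=True)

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(z)
print(np.bincount(labels))  # cluster sizes are typically uneven
```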

Based on GDP, they defined four clusters. They noticed an uneven distribution of countries across the clusters, caused by the inherent similarities of their GDP trends and additionally influenced by the clustering method itself. They therefore chose one cluster and applied the Granger criterion of p < 0.05 to select the more strongly correlated countries for further work. 

The GDPs of these countries, together with the given predicted GDP of the UK, were used as input to the machine learning pipeline (Prophet and VAR). When so-called "shocks" were applied to the GDP, the model's predicted values quickly "recovered" from them, similar to the economic trends seen during the Covid crisis.

Plots: target country GDP vs. user-imputed GDP (UK); target country GDP (Netherlands) vs. user-imputed GDP (Mexico)

In conclusion, their model was able to predict MEVs of countries from the same cluster based on the historical MEV and the publicly available data from the UK.

Based on their work, they suggested the following additional developments to the model.
  • Additional algorithms should be tested to expand and deepen the understanding of the resilience of the banks.
  • The global MEV data set should be enhanced and include quarterly data to allow for higher precision forecasting.
  • The approach does not yet weight the trading relations between countries. For instance, countries with stronger ties in global trade should receive more weight in the model than countries with lower mutual trade volume. This factor should be included as the next step in future models.

CancerDataNet: Time predictions for follow-up treatment in cancer patients

Students: Muchun Zhong, Jacques Stimolo, Ernest Mihelj

CancerDataNet's mission is to create, develop, and maintain a framework of digital tools to advance precision medicine and patient-focused drug development in oncology.
By leveraging its experience with numerous stakeholders from the healthcare ecosystem, CancerDataNet conceptualizes and designs solutions to maximize the transformational potential of real-world data to real-world evidence. 

The goal of this project was to improve the accuracy of predicting the duration between two cancer treatments for patients with multiple myeloma.
The main challenge of the project was the high proportion of missing values in the raw data, up to 85% for some patients. Data therefore had to be removed from the dataset with great care. 
They divided the workflow into three parts:
  • In the first part, they conducted research within the medical study documentation and the data to gain a better understanding of the data and hence to find anomalies in it. 
  • The second step was the cleaning of the data, where they removed the anomalous data and cleaned the data based on the missing rate. 
  • The final step was to take the final version of the dataset and create a synthetic replacement for the missing values (imputation). Muchun, Jacques and Ernest implemented different strategies to impute the data and compared the performance/accuracy of the prognostic models.
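The third step, comparing imputation strategies, can be sketched as below. The post doesn't specify which strategies the team used, so the two shown (column-mean vs. k-nearest-neighbors imputation from scikit-learn) are illustrative, and the heavily-missing data is synthetic; the idea of scoring each strategy against known ground truth is the same.

```python
# Sketch: knock out values at random, impute them back with two
# strategies, and compare reconstruction error on the missing entries.
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(3)
X_full = rng.normal(size=(200, 5))
X_full[:, 1] = X_full[:, 0] * 0.9 + rng.normal(0, 0.3, 200)  # correlated cols

# Knock out ~40% of entries at random to mimic heavy missingness
mask = rng.random(X_full.shape) < 0.4
X = X_full.copy()
X[mask] = np.nan

results = {}
for name, imputer in [("mean", SimpleImputer(strategy="mean")),
                      ("knn", KNNImputer(n_neighbors=5))]:
    X_imp = imputer.fit_transform(X)
    # RMSE only on the entries that were actually missing
    results[name] = np.sqrt(np.mean((X_imp[mask] - X_full[mask]) ** 2))

for name, rmse in results.items():
    print(f"{name}: RMSE on imputed entries = {rmse:.3f}")
```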

The Prognostic Models 

Overall, data imputation increased model performance. The individual imputation strategies improved the prognostic models to a similar degree, yet combining imputation techniques led to the strongest improvement.

Prognostic Model Performance

Future plans are to improve the accuracy of remission prediction by using different imputation strategies on different features, working with medical professionals to obtain a more complete dataset, and minimizing the anomalies. 

360° Stock Prediction: Predicting the highest return stocks globally via robust KPIs and perceived company confidence

 Students: Karim Khalil, Fernando Beato, Lukas Doboczky, Rafael Zack

Stock 360 Logo

As financial markets continually grow worldwide and more people seek access to them for investing, veterans and newcomers alike often rely on intuition when making decisions. With the availability and volume of financial data increasing, investment strategies are becoming more data-driven than ever, yet a wealth of data remains underutilized. That gap is what the team set out to address: a machine learning model that finds the most promising stocks with a global scope beyond the S&P 500.

After scraping over two decades of quarterly financial data covering more than 35,000 companies across 84 countries, the returns were calculated with notable outliers removed. To make the models more robust, different financial ratios (part of a total of ~200 key performance indicators) were calculated to represent each company's financial performance, including ratios comparing performance relative to the global market, the company's specific market, and its respective industry and sector.

Country Distribution

Figure 1. The companies represented were mostly from markets in the US and China, but the major sectors were fairly equally distributed throughout.

Another important feature for the model was a "sentiment score" calculated for quarterly earnings call transcripts using VADER (Valence-Aware Dictionary and Sentiment Reasoner). Sentiment analysis was performed on the transcript texts to infer each company's outlook based on the language used by company executives when speaking to investors.
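To illustrate the idea behind lexicon-based scoring, here is a deliberately tiny, made-up stand-in for VADER: known words carry valences, and a text's score is their scaled average. The real project used VADER's full valence dictionary (which also handles negation, intensifiers, and punctuation) on real earnings-call transcripts.

```python
# Toy lexicon-based sentiment scorer in the spirit of VADER.
# The lexicon and scaling below are illustrative, not VADER's.
LEXICON = {
    "strong": 1.5, "growth": 1.2, "record": 1.0, "confident": 1.4,
    "decline": -1.3, "weak": -1.2, "loss": -1.5, "uncertain": -0.8,
}

def sentiment_score(text):
    """Average valence of known words, scaled to roughly [-1, 1]."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    hits = [LEXICON[w] for w in words if w in LEXICON]
    if not hits:
        return 0.0
    return max(-1.0, min(1.0, sum(hits) / len(hits) / 1.5))

call = "We delivered record growth this quarter and remain confident."
print(sentiment_score(call))  # positive score for upbeat executive language
```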

For better predictions
Figure 2. Strongly positive language in terms of company performance results in a sentiment score like the one above, becoming another feature (used along with company financial data) to predict future performance more accurately.

Different regression models were compared, with an XGBoost model outperforming the rest. It was able to predict company performance (in the form of returns) for the next financial quarter with a relatively low mean absolute percentage error (~10%).
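The evaluation setup can be sketched as follows. Scikit-learn's GradientBoostingRegressor stands in here for the XGBoost model the team actually used, and the KPI features and returns are synthetic; the returns are offset away from zero so that MAPE stays well-defined.

```python
# Sketch: predict next-quarter returns from KPI-style features and
# score the model with mean absolute percentage error (MAPE).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n = 1000

X = rng.normal(size=(n, 6))  # e.g. financial ratios plus a sentiment score
# Synthetic returns driven by two features, kept away from zero for MAPE
y = 1.0 + 0.1 * X[:, 0] - 0.05 * X[:, 3] + rng.normal(0, 0.02, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
mape = mean_absolute_percentage_error(y_test, model.predict(X_test))
print(f"MAPE: {mape:.1%}")
```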

Future Predictions
Figure 3. Once the future returns have been predicted, users can identify the top promising stocks to be able to make a more informed decision.

The next steps for the project involve building a user-friendly app that lets people track company performance and easily build diversified, optimized portfolios with respect to their acceptable risk and markets of interest. Another unexploited facet of the earnings calls is the recorded audio itself, which could be used alongside the transcripts: sentiment analysis on speech could capture the emotions and confidence of the speaker, making it possible, for example, to spot contradictions between the language used and the way it is delivered.

Thank you all for a fantastic time and a great project phase! Constructor Academy wishes all our Data Science graduates the best for their future.

Interested in reading more about Constructor Academy and tech related topics? Then check out our other blog posts.

Read more