by Marcus Lindberg
During the 22-week part-time bootcamp, our first Data Science batch completed the following capstone projects:
Students: Martina Klose, Shih-Chi Yang
Zaamigo aims to provide “an artificial dentist to prevent, diagnose and eventually treat diseases anytime at home”. To achieve this goal, Zaamigo sells an easy-to-use, affordable, yet professional oral camera paired with a mobile app that analyzes the captured images using deep learning to identify teeth, stains, and inflamed gums.
Accurately detecting and identifying individual teeth is a prerequisite for analyzing dental health. Martina and Shih-Chi therefore systematically optimized each aspect of the model training process and implemented better reporting of metrics to evaluate model performance.
Figure 1: With optimized image augmentations and longer training, the model identified individual teeth more accurately while also producing fewer false positives on background images.
The changes that contributed most to improving the model were better augmentation of the training data, longer training, and the identification of classification thresholds offering the best compromise between false positive and false negative classifications.
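As an illustration, an augmentation pipeline for such a detection task could look like the sketch below, here using the albumentations library; the actual transforms, parameters, and framework used in the project are not published.

```python
import numpy as np
import albumentations as A

# Illustrative transforms and parameters, not the ones used in the project.
augment = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.5),
        A.Rotate(limit=15, p=0.5),
        A.GaussNoise(p=0.2),
    ],
    # Bounding boxes must be transformed together with the image.
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)

image = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a camera frame
boxes = [(100, 120, 180, 200)]                   # (x_min, y_min, x_max, y_max)
labels = ["tooth"]                               # per-box class label

augmented = augment(image=image, bboxes=boxes, labels=labels)
```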
Student: Ilario Giordanelli
Contovista’s expertise in data-driven, AI-assisted banking gives individuals the opportunity to analyze and track their spending habits and make informed decisions to understand and control their finances. Knowing the location of the merchants behind transactions would be a powerful tool for better categorization of transactions, for deciding where companies should open new branches, and for quickly identifying abnormal or fraudulent transactions. Unfortunately, current transactional metadata is quite limited: a location can be inferred for only 10% of transactions.
Ilario transformed individual-level data into merchant-level data, so that each merchant is described by its clients and their transactions. By calculating how often each client purchased from each merchant, merchants could be compared using cosine similarity. Applying spectral clustering to these similarities, Ilario was able to recapitulate the regionality of merchants at canton and area-code level using a semi-supervised approach.
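At its core, this approach can be sketched in a few lines of scikit-learn; the toy frequency matrix below stands in for the real data, which covers far more merchants and clients and yielded over 80 clusters.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import cosine_similarity

# Toy merchant x client purchase-frequency matrix (6 merchants, 100 clients).
rng = np.random.default_rng(0)
freq = pd.DataFrame(rng.poisson(1.0, size=(6, 100)),
                    index=[f"merchant_{i}" for i in range(6)])

# Merchants are compared by the cosine similarity of their client profiles.
similarity = cosine_similarity(freq)

# Spectral clustering accepts the precomputed affinity matrix directly.
clustering = SpectralClustering(n_clusters=3, affinity="precomputed", random_state=0)
cluster_labels = clustering.fit_predict(similarity)
```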
Figure 1. Merchants that were closer geographically (by area code) were also more likely to cluster together using the client transaction frequency data alone. One of the over 80 clusters is shown here (concentrated around Bern).
The identity of each cluster can be determined from the most frequent region among the merchants within it that contain location data. The remaining merchants in that same cluster can then be inferred to lie in that same region or area. This approach allows the enrichment of merchant information where it was previously absent or unavailable, especially for smaller or more local merchants. The clusters also provide insight into the traveling patterns of individuals, where different behaviors could potentially be identified (e.g. people who prefer to shop more locally).
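The region assignment itself amounts to a majority vote within each cluster; a minimal sketch with a hypothetical merchant table could look like this:

```python
import pandas as pd

# Hypothetical frame: one row per merchant with its cluster and, where the
# metadata allows it, a known canton (None otherwise).
merchants = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 1],
    "canton":  ["BE", "BE", None, "ZH", None, None],
})

# Label each cluster with the most frequent known canton among its members...
cluster_region = (
    merchants.dropna(subset=["canton"])
    .groupby("cluster")["canton"]
    .agg(lambda s: s.mode().iloc[0])
)

# ...and let merchants without location data inherit that label.
merchants["canton_inferred"] = merchants["canton"].fillna(
    merchants["cluster"].map(cluster_region)
)
```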
Student: Simon Zschunke
For logistics companies that fulfill many different functions across various sectors, service portals often receive a large volume of daily incident requests that then need to be allocated to the appropriate departments. While users can specify the issue they are submitting the request for, it is not uncommon for people to fill in such forms incorrectly or incompletely. To better triage and classify these requests, Simon set out to leverage the power of natural language processing and develop a model capable of doing so.
One major source of difficulty lies in the nature of the data itself: text extracted from emails that varies in length (from a few words to lengthy documents) and formality (abbreviations, emoticons), and that contains variable formatting due to the inclusion of elements like signatures. After a significant amount of preprocessing, similarities could be calculated and messages compared and grouped together.
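A minimal sketch of such cleanup is shown below; the signature markers and regular expressions are illustrative assumptions, and the project's actual preprocessing went well beyond this:

```python
import re

# Illustrative signature markers; the real data and cleanup rules differ.
SIGNATURE_MARKERS = ("mit freundlichen grüssen", "best regards", "kind regards")

def preprocess_email(text: str) -> str:
    """Minimal cleanup of an incident email."""
    text = text.lower()
    # Cut everything after a typical signature marker.
    for marker in SIGNATURE_MARKERS:
        idx = text.find(marker)
        if idx != -1:
            text = text[:idx]
    text = re.sub(r"\S+@\S+", " ", text)          # drop e-mail addresses
    text = re.sub(r"https?://\S+", " ", text)     # drop URLs
    text = re.sub(r"[^a-zäöüéàè\s]", " ", text)   # keep letters only
    return re.sub(r"\s+", " ", text).strip()
```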
Figure 1: Using LDA (Latent Dirichlet Allocation), at least 4 discrete topics were identified in the messages that could then be mapped to specific company areas; the topic representing real estate is pictured.
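Topic extraction of this kind can be reproduced with scikit-learn's LatentDirichletAllocation; the toy messages below merely stand in for the real, preprocessed incident texts:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the real, preprocessed incident messages.
docs = [
    "mietvertrag wohnung schluessel uebergabe besichtigung",
    "rechnung zahlung mahnung betrag offen",
    "serviceportal zugang passwort gesperrt login",
    "lieferung paket sendung verspaetet zustellung",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# Four latent topics, matching the number of discrete topics found.
lda = LatentDirichletAllocation(n_components=4, random_state=0)
doc_topics = lda.fit_transform(counts)

# The top words of each topic hint at the company area it maps to.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = terms[weights.argsort()[-4:][::-1]]
    print(f"topic {k}: {' '.join(top)}")
```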
Using the pre-trained BERT model as a foundation, Simon trained, fine-tuned, and optimized the model for this task, enabling it to predict the three categories that describe an incident (process kind, group, causer of incident) with accuracies of 81%, 98%, and 90%, respectively. Oversampling and undersampling were both used to improve the classification of less frequent categories.
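A fine-tuning setup along these lines might look like the following sketch using the Hugging Face transformers library; the checkpoint, hyperparameters, and tiny stand-in dataset are assumptions, as the post does not specify them:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# The checkpoint is an assumption; the post does not say which BERT
# variant was used. One such classifier is trained per target category.
model_name = "bert-base-german-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

# Tiny stand-in dataset; in practice the (over- or undersampled) incident
# messages and their labels for one target category go here.
train_ds = Dataset.from_dict({
    "text": ["Heizung im Lager defekt", "Login zum Serviceportal gesperrt"],
    "label": [0, 1],
}).map(tokenize, batched=True)

args = TrainingArguments(output_dir="incident-bert", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```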
The model can significantly reduce the need for human intervention in triaging incidents, saving the company time that can be spent elsewhere and allowing clients to be served better and faster.
Student: Dejan Micic
Losing staff and training their replacements is one of the most costly expenditures for many companies. From the perspective of an HR department, improving employee retention and identifying the factors that cause people to leave would be a way to reduce turnover. Dejan’s goal for this project was therefore to create a model that identifies employees who are most likely to leave in the near future, so that their concerns can be addressed proactively, and to pinpoint the factors that are the main drivers of people quitting.
The first part of the project was to explore the data set and identify and describe potential indicators of why people quit their jobs. This required some feature engineering and analysis of the data.
Figure 1: Most of the people who left in the past 3 years were in their mid-20s to late 30s, with a peak in the early-to-mid-30s group (left). The majority of those who left had not seen a salary increase over their previous year of employment (right).
After training and optimizing several models with different architectures, the one trained with the XGBoost algorithm showed the best overall performance (weighted F1-score of around 90%). The presence or absence of a recent salary increase turned out to be the best delineator for classifying potential leavers, while factors like the employee’s salary and length of service also proved crucial.
Figure 2: While employee salary, length of service, and position also contributed to predicting whether an employee would leave, the absence or presence of a recent salary increase showed the best separation between potential leavers and stayers.
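A minimal sketch of such an attrition classifier, with hypothetical features and toy values standing in for the real HR data, could look like this:

```python
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Hypothetical HR features; the real feature set came from the company's data.
X = pd.DataFrame({
    "salary": [52000, 64000, 48000, 75000, 58000, 61000],
    "years_of_service": [2, 7, 1, 10, 3, 5],
    "recent_salary_increase": [0, 1, 0, 1, 0, 1],  # best delineator per the project
})
y = [1, 0, 1, 0, 1, 0]  # 1 = employee left

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y
)

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)

# Weighted F1 accounts for the class imbalance between leavers and stayers.
print(f1_score(y_test, model.predict(X_test), average="weighted"))
print(dict(zip(X.columns, model.feature_importances_)))
```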
Using the records of employees who had already resigned but were not yet registered as such in the system, the model correctly flagged over half of them as potential leavers based on their HR data alone. This gives companies an opportunity to meaningfully reduce the number of employees resigning for reasons that are not spontaneous, and to act to improve employee satisfaction and morale.
Students: Gabriele Tocci, Raffaella Anna Marino
Bitcoin and other cryptocurrencies have the potential to change the concept of money and the world of finance as we know it. Understanding and modeling price changes in financial systems has fascinated mathematicians, scientists, economists and traders for decades. In recent years, data science and machine learning have entered the game. Inspired by a recent Kaggle challenge, Raffaella and Gabriele decided to analyze and forecast cryptocurrency price fluctuations on long and short time scales, ranging from minutes to years.
They showed that moving averages and momentum oscillators, two of the most widely used indicators in the technical analysis of financial time series, are relevant for describing price fluctuations and trends on long time scales. Seasonality over periods of several months was also observed, indicating a periodicity of the crypto market as a whole.
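These indicators are straightforward to compute with pandas; the window sizes below are common defaults rather than the values tuned in the project:

```python
import numpy as np
import pandas as pd

def add_indicators(prices: pd.Series, fast: int = 12, slow: int = 26,
                   rsi_window: int = 14) -> pd.DataFrame:
    """Moving averages plus a momentum oscillator (RSI)."""
    df = pd.DataFrame({"close": prices})
    df["sma_fast"] = prices.rolling(fast).mean()
    df["sma_slow"] = prices.rolling(slow).mean()

    # Relative Strength Index: average gains relative to average losses.
    delta = prices.diff()
    gain = delta.clip(lower=0).rolling(rsi_window).mean()
    loss = (-delta.clip(upper=0)).rolling(rsi_window).mean()
    df["rsi"] = 100 - 100 / (1 + gain / loss)
    return df

# Usage on a synthetic random-walk price series.
rng = np.random.default_rng(0)
indicators = add_indicators(pd.Series(100 + rng.normal(0, 1, 60).cumsum()))
```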
On a minute time scale, the time series data exhibits apparent correlation in time, which allowed the development of a machine learning model for predicting cryptocurrency returns, as required by the Kaggle challenge. Armed with the insights from the long-timeframe analysis, the pair built features based on moving averages and momentum oscillators and developed an XGBoost model to predict price changes over a 15-minute horizon. Walk-forward validation shows that their model has great potential for predicting the short-term behavior of cryptocurrency returns.
Figure 1: Walk-forward validation of the log return of the Bitcoin closing price as predicted by the XGBoost model (blue) and by a simple baseline model (yellow), compared with the actual values. Data is evaluated each minute for an hour.
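Walk-forward validation itself can be sketched as follows: the model is repeatedly trained on an expanding window of past minutes and evaluated on the block that follows, so it never sees the future. The random stand-in data below replaces the real indicator features and returns:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))         # stand-in indicator features per minute
y = rng.normal(scale=1e-3, size=600)  # stand-in log returns

# Train on an expanding window of past minutes, predict the block that
# follows, then move forward; the model never sees future data.
predictions = np.full_like(y, np.nan)
for train_idx, test_idx in TimeSeriesSplit(n_splits=10).split(X):
    model = XGBRegressor(n_estimators=100, max_depth=4)
    model.fit(X[train_idx], y[train_idx])
    predictions[test_idx] = model.predict(X[test_idx])
```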
Thanks to everyone for the great collaboration over the past few months! On behalf of SIT Academy, we wish all the best to our first part-time Data Science graduates.