Our Data Science students from Batch #16 present their capstone projects, which they completed during the final four weeks of their bootcamp.
Sidecar: Generating Meaningful Business Descriptions
Students: Marlies Monch, Dae-Jin Rhee
Sidecar aims to help Business Analysts, Data Scientists, and Chief Data Officers better understand their data assets with a tool for managing their metadata – allowing clients to easily navigate and analyze their data. One way Sidecar wants to improve its Data Asset Management system is by automating the entry of meaningful business descriptions, a task currently performed by the data steward.
The objective of the project was to leverage Natural Language Processing (NLP) to generate meaningful business labels and business descriptions from the labels generated by the database. NLP focuses on giving computers the ability to process and generate text and speech as used by humans.
Marlies and Dae-Jin first created a look-up dictionary to match the automatically generated label to the corresponding business label. To improve performance, they leveraged Deep Learning and trained a character-level sequence-to-sequence model – a special type of recurrent neural network most often used in language translation – on the existing technical labels to generate the business labels, achieving 99.6% accuracy (see Figure 1).
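The first step, the look-up dictionary, can be sketched in a few lines. All label names below are invented examples for illustration, not Sidecar's actual metadata:

```python
# Minimal sketch of the look-up approach: map technical labels generated
# by the database to human-readable business labels. The entries here are
# invented examples.
LOOKUP = {
    "cust_id": "Customer ID",
    "acct_bal_eur": "Account Balance (EUR)",
    "txn_ts": "Transaction Timestamp",
}

def business_label(technical_label):
    """Return the business label, falling back to a cleaned-up version
    of the technical label when no dictionary entry exists."""
    if technical_label in LOOKUP:
        return LOOKUP[technical_label]
    # Fallback: replace underscores and title-case the words.
    return technical_label.replace("_", " ").title()
```

The dictionary covers known labels exactly; the trained sequence-to-sequence model then takes over for labels the dictionary has never seen.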
For generating meaningful business descriptions, they used GPT-2-124M – a pre-trained text-generating transformer model – and further trained it on existing descriptions (see Figure 2). This model was able to produce plausible and grammatically correct sentences, but these sentences did not quite give an accurate description of the metadata.
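At generation time, transformer language models like GPT-2 produce text one token at a time by sampling from a predicted probability distribution over the vocabulary. A toy illustration of that sampling step, with an invented three-token vocabulary and invented logits:

```python
import math
import random

def sample_next_token(logits, temperature=1.0, rng=None):
    """Sample one token from a softmax over logits.
    Lower temperature makes the output more deterministic."""
    rng = rng or random.Random()
    scaled = {tok: v / temperature for tok, v in logits.items()}
    max_v = max(scaled.values())  # subtract max for numerical stability
    weights = {tok: math.exp(v - max_v) for tok, v in scaled.items()}
    total = sum(weights.values())
    r = rng.random() * total
    for tok, w in weights.items():
        r -= w
        if r <= 0:
            return tok
    return tok  # fallback for floating-point edge cases

# Invented next-token distribution after some prefix:
logits = {"account": 2.0, "record": 1.0, "banana": -3.0}
```

With a low temperature the softmax concentrates almost all probability on the highest-scoring token, which is why generated sentences read as fluent even when, as the team found, they are not factually tied to the underlying metadata.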
In conclusion, the models developed during this project can save enormous amounts of time and resources in data entry and provide a promising approach to leveraging NLP for metadata generation.
SIX: Using AI to Create Synthetic Data
Students: Nicolas Bidaux, Darya Bomberger, Kacper Krylowicz, Lucas Fernandez Vilanova
An ever-increasing number of companies rely on significant amounts of data to help guide their business decisions. The data involved, however, can contain highly sensitive information such as Personally Identifiable Information (PII), making its use particularly problematic.
Due to these privacy constraints, which are further bound by regulatory frameworks (e.g. the GDPR in the EU), SIX Banking Services is investigating new solutions to “anonymize” its data as an initial step in the analytical process. This is where Artificial Intelligence (AI) can be used to create synthetic data.
Thanks to various Machine Learning (ML) models, it is possible to create synthetic tabular data that: (1) keep sensitive information private and non-identifiable, and (2) maintain the statistical properties of the original data. One of the main deep learning models the project group retained was CT-GAN, a Generative Adversarial Network (GAN), whose basic schema is illustrated below:
After generating the synthetic data, the team evaluated the performance of the various models through the lens of three pillars: Resemblance, Utility, and Privacy. The first two pillars pertain to the synthetic data’s similarity and usefulness relative to the real data, while the third measures how well sensitive information is protected.
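A very rough illustration of the Resemblance idea, comparing per-column means of real versus synthetic data (the numbers are invented; real evaluations compare full distributions, e.g. via Kolmogorov–Smirnov tests or correlation matrices):

```python
from statistics import mean

def resemblance_gap(real_col, synth_col):
    """Relative difference between the real and synthetic column means.
    A deliberately simplistic proxy for the Resemblance pillar."""
    real_mu, synth_mu = mean(real_col), mean(synth_col)
    return abs(real_mu - synth_mu) / abs(real_mu)

# Invented example column (e.g. account balances):
real = [120.0, 80.0, 100.0]
synth = [118.0, 85.0, 97.0]
```

A gap near zero on every column suggests the generator has preserved first-order statistics; Utility is then checked by training downstream models on the synthetic data, and Privacy by testing whether individual records can be re-identified.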
Although they observed a general trade-off between Utility and Privacy (maximizing Utility translates to lower Privacy and vice versa), the use of GAN models, especially when combined with Differential Privacy (DP), can be a viable solution for generating synthetic data while protecting sensitive information.
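Differential Privacy is commonly implemented with the Laplace mechanism: calibrated noise is added to a statistic so that the presence or absence of any single record cannot be inferred from the output. A minimal sketch (the query, sensitivity, and epsilon are invented for illustration):

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Return true_value plus Laplace(scale = sensitivity / epsilon) noise.
    Smaller epsilon means more noise and stronger privacy."""
    rng = rng or random.Random()
    u = rng.random() - 0.5  # uniform on (-0.5, 0.5)
    scale = sensitivity / epsilon
    # Inverse-CDF sampling of the Laplace distribution
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_value + noise

# Invented example: a count query over 1000 records, sensitivity 1
noisy_count = laplace_mechanism(1000, sensitivity=1, epsilon=0.5,
                                rng=random.Random(42))
```

This directly exhibits the Utility–Privacy trade-off from above: shrinking epsilon widens the noise distribution, protecting individuals at the cost of less accurate statistics.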
VE COOK: Sustainability Optimized Recipes
Student: Solomon G. Araya
VE COOK, a food start-up based in Zurich, produces vegan cooking kits and wants to develop a tool that uses Data Science methods to make recipes more sustainable.
The project involved the following steps:
● The relevant components of the original ingredients are identified
● A sustainability score is calculated based on an external data source
● Different Machine Learning methods such as PCA, UMAP, and clustering are used to find alternative variants of the original recipe, aiming to improve the sustainability score while preserving the recipe’s main properties
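The substitution step above can be sketched as a similarity search in a feature space of ingredient properties, picking the most similar candidate with a lower carbon footprint. All feature vectors and footprint numbers below are invented for illustration, not VE COOK’s actual data:

```python
import math

# Invented feature vectors (e.g. protein, fat, water content)
# and invented kg CO2e per kg values.
INGREDIENTS = {
    "beef":    {"features": (26.0, 15.0, 60.0), "co2": 27.0},
    "lentils": {"features": (25.0, 1.0, 10.0),  "co2": 0.9},
    "tofu":    {"features": (8.0, 4.0, 85.0),   "co2": 2.0},
    "seitan":  {"features": (25.0, 2.0, 58.0),  "co2": 2.5},
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def greener_substitute(name, min_similarity=0.9):
    """Return the most similar ingredient with a lower carbon footprint,
    or the original ingredient if no acceptable substitute exists."""
    original = INGREDIENTS[name]
    best, best_sim = name, min_similarity
    for cand, data in INGREDIENTS.items():
        if cand == name or data["co2"] >= original["co2"]:
            continue
        sim = cosine_similarity(original["features"], data["features"])
        if sim > best_sim:
            best, best_sim = cand, sim
    return best
```

The similarity threshold plays the role of "preserving the main properties": an ingredient is only swapped out when a chemically close, lower-footprint alternative actually exists.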
During the project, Solomon was able to develop an algorithm that could reduce the overall carbon footprint of a recipe by around 30 percent by replacing ingredients with chemically similar alternatives. VE COOK aims to develop this into a general process for making food production more environmentally friendly.
POSTme! - A Tool for Social Media Messaging Optimization
Students: Sibel Yasemin Özgan, Amalia Temneanu, Marcela Helena Perez Ulloa
Social media platforms have become remarkable channels for companies, allowing them to communicate directly with users from all over the world. Social media engagement has therefore become a vital part of business success and a key element of marketing strategy. With this in mind, a leading pharmaceutical company based in Switzerland approached SIT Academy with a specific task: build a model that optimizes the engagement rate of Twitter and LinkedIn posts. The project team had to overcome two main challenges, both revolving around the question: how can a company improve its organic engagement rate?
POSTme! is the team’s answer to this question. It enables the brand team to input a potential social media post and receive an assessment of the estimated engagement rate, along with feedback on how the post can be improved.
Sibel, Amalia, and Marcela applied syntactic parsing, entity extraction techniques, topic modelling, and sentiment analysis with pre-trained transformers and word embeddings. They tested different features and models in order to identify the best-performing combinations. To explain the results, they evaluated the importance of the individual features and performed SHAP and LIME analyses.
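As a simplified stand-in for the feature-extraction side of such a pipeline (the word list and feature set below are invented; the project used much richer NLP features such as parse trees, topics, and embeddings):

```python
import re

# Invented list of engagement-positive words for illustration only.
POSITIVE_WORDS = {"great", "excited", "proud", "love", "breakthrough"}

def post_features(text):
    """Extract simple hand-crafted features from a social media post.
    A toy stand-in for the richer NLP features used in the project."""
    words = re.findall(r"[#@]?\w+", text.lower())
    return {
        "n_words": len(words),
        "n_hashtags": sum(w.startswith("#") for w in words),
        "n_mentions": sum(w.startswith("@") for w in words),
        "n_positive": sum(w.lstrip("#@") in POSITIVE_WORDS for w in words),
    }
```

Feature vectors like this one feed an engagement-rate model, and interpretability methods such as SHAP or LIME then attribute the prediction back to individual features – which is what turns a raw score into actionable feedback for the brand team.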
Figure 1: Demonstration of how the POSTme! platform is used for analysing a post.
Many thanks to our students and partners for the exciting project period!