Data Science capstone projects batch #25

by Ekaterina Butyugina

We want to take a moment to give a big shout-out to all the students who joined us in November and gave it their all to push through the course and capstone projects.

In just three short months, the incredible Data Science enthusiasts from Batch #25 in Zurich, along with the accomplished fifth cohort from Munich, admirably tackled a diverse array of challenging projects. Their outstanding skills and unwavering dedication were on full display. This time, HP played a significant role in the students' success by providing us with exclusive Z by HP workstations.

We encourage you to witness firsthand the transformative power of data science as they push boundaries, uncover insights, and drive meaningful impact.

AI-powered hotel ranking: streamlining your booking experience

Students: Asterios Raptis, Guillem Montoya, Kunal Sharma, Lorenz Schmid

Expedia Group, a prominent online travel agency, streamlines trip planning by providing a platform where users can compare prices, review amenities, and book accommodations through sophisticated recommendation and ranking systems. Consider a search for a "4-star hotel for three adults in Geneva in early May 2024." Without sorting options based on user-specified features like star rating or number of guests, a traveler might face hundreds of choices. This could extend search times and decrease bookings, countering the platform’s objective to simplify travel arrangements. This project aims to evaluate various machine learning models to efficiently prioritize the most relevant search results.

Our diverse team of students, consisting of a statistician, an engineer, an IT consultant, and a data consultant, analyzed the Expedia RecTour research dataset. This dataset included 1 million searches over two months in 2021, featuring data on booking details, hotel ratings, review counts, and amenities like WiFi and parking, as illustrated in Figure 1. We narrowed our focus to searches that led to clicks or bookings for the top 500 destinations during data cleaning, significantly reducing our training dataset size. Each query in our dataset represented about 70 properties.

Fig 1: Search results on booking platform showing booking details and hotel features

Fig 2: NDCG is a metric that measures the quality of a ranking by comparing it to the ideal ranking

In this project, the team tackled the challenge of ordering properties in recommendation systems based on user interactions like clicks and bookings. Ranking is a non-differentiable operation, so it cannot be optimized directly with gradient-based training. To solve this, they used an approximation of the ranking to optimize the models, which were evaluated by Normalized Discounted Cumulative Gain (NDCG), a metric that prioritizes user satisfaction by rewarding models that place the most relevant properties higher (Figure 2).
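To make the metric concrete, here is a minimal sketch of NDCG for a single search. The relevance labels (2 for a booking, 1 for a click, 0 for no interaction) and the exponential gain are assumptions chosen for illustration; the post does not state which grading or gain function the team used.

```python
import math

def dcg(relevances, k=None):
    """Discounted cumulative gain of relevance labels given in ranked order."""
    if k is not None:
        relevances = relevances[:k]
    # Position 0 is the top result; lower positions are discounted logarithmically.
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances))

def ndcg(relevances, k=None):
    """DCG of the given order divided by the DCG of the ideal order."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical search: the model placed the booked hotel second.
ranked_by_model = [0, 2, 1, 0]
print(round(ndcg(ranked_by_model), 3))
```

A ranking that already lists the booked and clicked hotels first scores exactly 1.0, and every relevant hotel pushed further down lowers the score.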

Initially, the team used decision tree-based models such as LightGBM and XGBRanker, which helped in selecting features for training deep neural networks. Then Asterios, Guillem, Kunal, and Lorenz explored allRank, an open-source, transformer-based model that enhances ranking by taking the context of the other properties into account, as shown in Figure 3. Their tests showed allRank to be the most effective model on the Expedia RecTour dataset (Figure 4). Although the score obtained by Expedia (the dashed blue line) is higher, owing to a different, private technique and dataset, the results obtained by the team will be helpful in improving Expedia’s approach.

Fig 3: AllRank utilizes contextual self-awareness of other hotels for reranking

Fig 4: Comparison of NDCG scores for four models with benchmark dataset

Future steps include training on a larger dataset, comparing ranking similarities between different models, and applying feature engineering to implement various relevance metrics aligned with business objectives. The team would like to thank their mentors from Constructor Academy - Ekaterina Butyugina, Rena Pan, and Dipanjan Sarkar - and the Expedia team - Jean Coupon, Stefania Ebli, and Irini Mens - for their guidance and support throughout this project.

Eonymizer: automating text anonymization for privacy compliance

Students: Janis Kropp, Thomas Lösekann, Georg Ammer

Are you concerned about privacy compliance in handling customer data? The numbers speak volumes. In 2023 alone, German companies faced penalties totaling 1.2 billion euros for privacy law violations, highlighting the pressing need for robust solutions.

To meet this need, the team developed Eonymizer, a framework dedicated to anonymizing personal information in unstructured text. Our collaboration with E.ON, one of Germany's largest energy providers, shed light on the challenges they face in handling vast amounts of customer emails daily. Manual anonymization is very labor-intensive and prone to errors, leaving automation as the only viable solution. So, how did the team tackle this problem?

They employed three approaches using the following models:

ChatGPT: A familiar choice known for its ease of use and solid performance, albeit with occasional quirks.
Sauerkraut Mixtral: A local large language model (LLM) tailored for German text, offering reproducibility and flexibility, run on a Z by HP Z8 G4 workstation.
Microsoft Presidio: An open-source framework leveraging pre-defined entities for fast anonymization, though less adaptable to variations in text.

Each of the individual models performed well, reaching performance scores between 92.3% and 96.2%. However, for personal information even such high performance scores might still not be good enough.

Performance scores (F1-score) for the three implemented models
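For readers unfamiliar with the metric, the sketch below shows one common way to compute an F1-score for anonymization. It assumes entity-level matching, where each detected item is a (start, end, label) span and only exact matches count; the post does not say whether the team scored at entity or token level, and the spans are made up.

```python
def f1_score(predicted, actual):
    """Entity-level F1 for two sets of (start, end, label) spans."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)  # share of detections that are correct
    recall = true_positives / len(actual)        # share of real entities that were found
    return 2 * precision * recall / (precision + recall)

# Hypothetical email: three labeled entities, the model finds two and adds one false hit.
actual = {(0, 5, "NAME"), (10, 20, "EMAIL"), (30, 35, "PHONE")}
predicted = {(0, 5, "NAME"), (10, 20, "EMAIL"), (40, 45, "NAME")}
print(f1_score(predicted, actual))
```

For anonymization, recall is the more critical half of the score, because a missed entity means personal data stays in the text.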

As a solution to this problem, the team combined the predictions of all three models. They then evaluated the combined performance using a manually labeled test dataset comprising 200 text files. In the instances where all three models produced identical output, which accounted for 44% of the test set, they reached their objective of 100% accuracy.
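The agreement rule can be sketched in a few lines. This is an illustration under assumptions, not the team's code: the model keys and the placeholder format are hypothetical, and outputs are compared as exact strings.

```python
def split_by_agreement(outputs_per_model):
    """Separate texts on which all models agree from those needing manual review.

    outputs_per_model maps a model name to its list of anonymized texts,
    with every list in the same document order.
    """
    accepted, needs_review = [], []
    for idx, outputs in enumerate(zip(*outputs_per_model.values())):
        if len(set(outputs)) == 1:      # every model produced the same text
            accepted.append((idx, outputs[0]))
        else:
            needs_review.append(idx)
    return accepted, needs_review

# Hypothetical outputs for two emails
outputs = {
    "chatgpt":  ["Hello <NAME>", "Call <PHONE>"],
    "mixtral":  ["Hello <NAME>", "Call <PHONE>"],
    "presidio": ["Hello <NAME>", "Call 0123"],
}
accepted, needs_review = split_by_agreement(outputs)
print(accepted, needs_review)
```

Texts in the accepted list can be used without further checks, while the rest go to a human reviewer or a stricter fallback.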

This approach paves the way for automatically generating large amounts of labeled training data for model fine-tuning. Looking ahead, we aim to explore synthetic data generation to further enhance model performance and expand its utility across various use cases.

Example of an e-mail that was anonymized with Eonymizer

In summary

Eonymizer provides automated text anonymization for privacy compliance.
Combining multiple models yields near-perfect anonymization rates for approximately half of the texts.
Future efforts will focus on improving model performance and scalability.
With Eonymizer, you can navigate the complexities of privacy compliance with confidence, knowing your data is protected.

Revolutionizing scientific publishing with AI: The DigiScientia case study

Students: Altynai Mambetova, Habtom Kahsay Gidey, Roel D’Haese

In the rapidly advancing world of scientific research, the publication process remains surprisingly archaic and inefficient. Traditional journal publishing is slow, expensive, and labour-intensive, marred by a lack of transparency in peer review allocation, conflicts of interest, and significant barriers for junior researchers. Furthermore, access to published research is often limited, with much of the information locked behind paywalls, hindering the free exchange of knowledge. DigiScientia*, an innovative, fully autonomous AI-powered journal, promises to disrupt this outdated system.

To turn the idea into a functioning prototype, the work was divided into three milestones.

DigiScientia* Bot leverages a sophisticated workflow to ensure a fair and efficient peer review process for scientific papers.
The process begins with the extraction of keywords from the user’s input. Utilizing the keywords, the bot performs an API search through PubMed, a comprehensive database of scientific publications. This search is designed to find potential peer reviewers whose previous work and expertise align with the topic of the paper.
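As an illustration of the search step, the sketch below builds a query URL for the ESearch endpoint of NCBI's public E-utilities, which is one way to search PubMed programmatically. The post does not say which endpoint or query syntax the bot uses, so the function name, the choice of the Title/Abstract field, and the keywords are assumptions.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_pubmed_query(keywords, max_results=50):
    """Build an ESearch URL that requires every keyword in the title or abstract."""
    term = " AND ".join(f"{kw}[Title/Abstract]" for kw in keywords)
    params = {
        "db": "pubmed",         # search the PubMed database
        "term": term,           # combined keyword query
        "retmax": max_results,  # maximum number of article IDs to return
        "retmode": "json",      # ask for a JSON response
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"

# Hypothetical keywords extracted from a submitted abstract
url = build_pubmed_query(["microplastics", "food packaging"], max_results=20)
print(url)
```

Fetching this URL returns a list of article IDs, which then have to be resolved to abstracts and author lists in a second request.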

The search results are then processed using PubMedBERT, an open-source large language model from Hugging Face, which creates embeddings for the matched abstracts and for the input abstract. Then, for each potential reviewer, a similarity score is calculated. This score quantifies the relevance of the reviewer's abstract to the submitted abstract, ensuring that the selected reviewers are well-equipped to provide a knowledgeable and insightful review.
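The scoring step can be sketched as follows, assuming cosine similarity between embedding vectors, a common choice that the post does not confirm. The three-dimensional vectors are toy stand-ins for real PubMedBERT embeddings, and keeping the best score per author is likewise an assumption.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def score_reviewers(submitted_embedding, candidate_abstracts):
    """Keep, per author, the best similarity over all of their matched abstracts."""
    scores = {}
    for author, embedding in candidate_abstracts:
        similarity = cosine_similarity(submitted_embedding, embedding)
        scores[author] = max(similarity, scores.get(author, -1.0))
    return scores

# Toy embeddings: author A has two matched abstracts, author B has one.
submitted = [1.0, 0.0, 0.0]
candidates = [("A", [1.0, 0.0, 0.0]), ("A", [0.0, 1.0, 0.0]), ("B", [1.0, 1.0, 0.0])]
scores = score_reviewers(submitted, candidates)
print(scores)
```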

To ensure fairness and prevent conflicts of interest, filters are applied. These include background checks to ensure that the reviewers have not published with the corresponding author in the last five years. Additionally, a balance in reviewer seniority is maintained, allowing only one senior researcher, defined as someone with more than seven publications in the last five years. For this, a database of more than four million papers was analyzed and processed.

Based on the similarity scores and the filters applied, the bot selects the top three candidates to serve as peer reviewers for the paper. This selection is designed to maximize the objectivity and quality of the peer review process.
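The filtering and selection rules described above can be sketched as a simple greedy pass over the candidates. The data structure, field names, and the greedy order are assumptions for illustration; only the rules themselves (no recent co-authorship, at most one senior researcher, top three by similarity) come from the description.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    similarity: float
    recent_publications: int      # publications in the last five years
    coauthored_with_author: bool  # published with the corresponding author in that period

SENIOR_THRESHOLD = 7  # "senior" means more than seven recent publications

def select_reviewers(candidates, n=3):
    """Pick the n most similar candidates that pass both filters."""
    selected, senior_taken = [], False
    for c in sorted(candidates, key=lambda c: c.similarity, reverse=True):
        if c.coauthored_with_author:
            continue                  # conflict of interest
        is_senior = c.recent_publications > SENIOR_THRESHOLD
        if is_senior and senior_taken:
            continue                  # only one senior reviewer allowed
        selected.append(c)
        senior_taken = senior_taken or is_senior
        if len(selected) == n:
            break
    return selected

# Hypothetical candidate pool
pool = [
    Candidate("A", 0.95, 12, False),
    Candidate("B", 0.93, 9, False),   # second senior, skipped
    Candidate("C", 0.90, 3, True),    # co-author, skipped
    Candidate("D", 0.88, 2, False),
    Candidate("E", 0.80, 5, False),
]
print([c.name for c in select_reviewers(pool)])
```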

The model's best-matched results

Finally, after the selection of peer reviewers, the email service, integrated within the DigiScientia* Bot, takes over. It retrieves the contact details of the chosen reviewers and autonomously sends them an invitation to review the paper. This step completes the end-to-end process within the app after the user’s input has been submitted.

The results

The implementation of DigiScientia*'s model has already shown promising results. For instance, in a demo review of a paper on the health hazards of plastics in food packaging, the AI successfully identified and engaged relevant experts whose research aligns closely with the subject matter, demonstrating the system's capability to enhance the quality and relevance of peer reviews.

To demonstrate the fully functioning product prototype, the team created a Streamlit app that accepts user input, showcases the best-matched papers with their corresponding authors, analyzes review submissions, and provides control over the DigiScientia* Bot.

Final thoughts

By integrating advanced AI technologies with a commitment to openness and integrity, DigiScientia* is setting a new standard for scientific publishing. This approach not only makes the publication process more efficient but also more equitable, democratizing access to knowledge and enabling a faster and more transparent dissemination of scientific discoveries.

Battery chiller prediction journey

Students: Laura Giulietti, Stephan Krushev, Federica Graziano

Fluence Energy AG is a global market leader in energy storage products and services, and cloud-based software for renewables and storage assets. Their department in Zurich specializes in providing data intelligence services for renewable energy and Battery Energy Storage Systems (BESS) worldwide. Their primary objective is to develop models that optimize preventative and reactive maintenance, thereby increasing the uptime of components and enabling clients to extract maximum value from their assets.

Maintenance strategies are shaped by reliability analyses, which assess the risk of component failures. These risks are determined based on vendor warranty indications or observed failure rates across operational component fleets. Aligning these risks with predictive model outcomes empowers service managers to schedule maintenance interventions based on real-time component conditions.
The project's focus is to ascertain the probability and periodicity of failure, expressed as Mean Time Between Failures (MTBF), and to devise models capable of predicting failures with advance notice (e.g., 2 weeks). Laura, Stephan, and Federica commenced by analyzing raw data from sensors on battery storage devices to diagnose failure causes. This approach unfolded in three stages:

Data visualization
Defining failure criteria for components
Statistical analysis to establish failure thresholds
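As a simple illustration of the MTBF figure mentioned above, the sketch below averages the gaps between consecutive failure timestamps of one component. It ignores repair durations and uses made-up dates, so it is a simplification rather than the reliability calculation Fluence applies.

```python
from datetime import datetime

def mean_time_between_failures(failure_times):
    """Average gap, in hours, between consecutive failure timestamps."""
    ordered = sorted(failure_times)
    if len(ordered) < 2:
        raise ValueError("need at least two failures to measure a gap")
    gaps = [(later - earlier).total_seconds() / 3600
            for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)

# Hypothetical failure log for one chiller
failures = [datetime(2024, 1, 1), datetime(2024, 1, 11), datetime(2024, 1, 31)]
mtbf_hours = mean_time_between_failures(failures)
print(mtbf_hours)
```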

Battery storage devices to diagnose failure causes

The subsequent predictive analysis hinged on both supervised and unsupervised methods for failure indication, employing two distinct approaches:

Estimating the probability of failure within a two-week timeframe
Predicting the future trajectory of monitored signals

Recurrent neural network models were employed in this project phase. Both approaches, estimating the probability of failure and predicting signal trends, anticipated system failures in advance, enabling timely interventions to prevent system downtime.
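The first approach needs a training target that says, for each sensor reading, whether a failure follows within the two-week horizon. A minimal sketch of such labeling is shown below; the function name, the exclusion of readings taken at the failure moment itself, and the dates are assumptions for illustration.

```python
from datetime import datetime, timedelta

def label_failure_within(timestamps, failure_times, horizon=timedelta(days=14)):
    """Return 1 for each timestamp that is followed by a failure within the horizon."""
    return [
        int(any(ts < failure <= ts + horizon for failure in failure_times))
        for ts in timestamps
    ]

# Hypothetical sensor readings and a single failure on 20 January
readings = [datetime(2024, 1, day) for day in (1, 6, 19, 20)]
failures = [datetime(2024, 1, 20)]
labels = label_failure_within(readings, failures)
print(labels)
```

A model trained on these labels outputs a failure probability for the coming two weeks, which is what allows a maintenance visit to be scheduled ahead of time.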

As we conclude this remarkable journey with Data Science Final Projects Group #25, we wish to extend our heartfelt gratitude to all the companies that have provided our students with invaluable projects. Your collaboration has not only enriched their learning experience but also paved the way for innovative solutions to real-world challenges. To the students who joined us in November and dedicated themselves wholeheartedly to completing the course and their final projects, we commend your outstanding efforts. Your dedication, skill, and passion for data science have truly shone through. We wish all the students the very best in their future endeavors. May you continue to push boundaries, innovate, and make a meaningful impact wherever your careers may lead you.

For those inspired by these stories and interested in embarking on their own data science journey, we're excited to announce our upcoming bootcamp. Learn more about our program and how you can join the next cohort of data science innovators at Constructor Academy.