Data Science capstone projects batch #26

by Ekaterina Butyugina

We want to take a moment to give a big shout-out to all the students who joined us in May and gave their best effort to complete the course and capstone projects.

In just three short months, the talented Data Science enthusiasts from Batch #26 in Zurich, alongside the accomplished fifth cohort from Munich, took on a diverse range of challenging projects. Their exceptional skills and unwavering commitment were evident throughout. This time, a key factor in the students' success was the support from HP, which provided us with exclusive Z by HP workstations.

We invite you to see the transformative power of data science in action as these students push boundaries, uncover insights, and drive meaningful impact.

The Neighborhood Vibe Score

Students: Philippe Matter, Seçkin Adalı, Dashrath Kurli

Comparis aims to help individuals and families make informed decisions by providing salient metrics to customers. For their real estate section, the company sought to develop a neighborhood score for listed properties and capture the neighborhood vibe, facilitating easier decision-making for potential home buyers. 

To address this, Philippe, Dashrath, and Seçkin developed a neighborhood scoring system that provides free, comparable, and transparently calculated metrics. This solution aims to increase user engagement on the Comparis platform by eliminating the need for users to leave the site for neighborhood information.

The team evaluated data sources, comparing OpenStreetMap (OSM) and the Google Places API (see Figure 1). As Google provided more consistent data across areas and included facility ratings, it was chosen as the primary data source. The team noted that a future, more advanced solution might combine data from both sources.

Figure 1. Comparison of the facilities returned by Google Places (red dots) and by OpenStreetMap (black dots), showing the strong discrepancy for (1) an urban area in Zürich, on the left, and (2) a rural area, on the right

To create a comprehensive scoring system, the team developed two key metrics: a Global Score and a Custom Score. The Global Score considers the number of facilities (eight types, including schools, groceries, and bars) within a 10-minute walking radius, travel times to the nearest amenities, and facility ratings. The Custom Score incorporates user preferences, allowing individuals to indicate the importance of different facility types and to include personal information such as a work address. To address the need for an automated scoring process visible on property pages, the team implemented k-means clustering, which segmented addresses based on prominent features and assigned scores using median facility counts within each cluster.
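A minimal sketch of how such a two-tier score might be combined is shown below. The facility types, weights, and normalization are illustrative assumptions, not the team's actual formula:

```python
# Illustrative sketch of a Global vs. Custom neighborhood score.
# Facility types, weights, and normalization are assumptions for this example.

FACILITY_TYPES = ["school", "grocery", "bar", "restaurant",
                  "park", "pharmacy", "gym", "transit_stop"]

def global_score(counts, ratings, max_count=10):
    """Blend average facility availability (capped per type) with average rating (0-5)."""
    availability = sum(min(counts.get(t, 0), max_count) / max_count
                       for t in FACILITY_TYPES) / len(FACILITY_TYPES)
    avg_rating = sum(ratings.values()) / len(ratings) / 5 if ratings else 0.0
    return round(100 * (0.7 * availability + 0.3 * avg_rating), 1)

def custom_score(counts, ratings, preferences):
    """Re-weight facility types by user preference (0 = ignore, 1 = very important)."""
    total_w = sum(preferences.get(t, 0) for t in FACILITY_TYPES) or 1
    availability = sum(preferences.get(t, 0) * min(counts.get(t, 0), 10) / 10
                       for t in FACILITY_TYPES) / total_w
    avg_rating = sum(ratings.values()) / len(ratings) / 5 if ratings else 0.0
    return round(100 * (0.7 * availability + 0.3 * avg_rating), 1)

# Example: a bar-heavy neighborhood scores differently for a family
# that only cares about schools and parks.
counts = {"school": 2, "grocery": 5, "bar": 8, "park": 1}
ratings = {"grocery": 4.2, "bar": 3.8}
overall = global_score(counts, ratings)
family = custom_score(counts, ratings, {"school": 1.0, "park": 1.0})
```

The same pattern extends naturally to travel-time terms: any quantity that can be normalized to 0-1 can be added to the weighted sum.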

So that users could visualize both the density of facilities in the neighborhood and the proximity of each facility type, every facility was displayed on a map, and OpenRouteService was used to construct isochrones representing 15-, 13-, 10-, 7-, 5-, and 3-minute walking distances around the property (see Figure 1).
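Requesting such walking-time isochrones from OpenRouteService can be sketched as follows. The API key and coordinates are placeholders, and the exact parameters the team used are an assumption:

```python
# Sketch: requesting walking-time isochrones from OpenRouteService.
# Requires `pip install openrouteservice` and a (placeholder) API key.

WALK_MINUTES = [3, 5, 7, 10, 13, 15]

def minutes_to_seconds(minutes):
    """OpenRouteService expects time ranges in seconds, sorted ascending."""
    return sorted(m * 60 for m in minutes)

def fetch_isochrones(api_key, lon, lat):
    import openrouteservice  # imported here so the helper above works standalone
    client = openrouteservice.Client(key=api_key)
    return client.isochrones(
        locations=[[lon, lat]],   # ORS uses [longitude, latitude] order
        profile="foot-walking",
        range_type="time",
        range=minutes_to_seconds(WALK_MINUTES),
    )

# geojson = fetch_isochrones("YOUR_API_KEY", 8.5417, 47.3769)  # Zürich
```

The returned GeoJSON polygons can be drawn directly onto the map layer alongside the facility markers.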

To capture the “neighborhood vibe”, the team fed the collected data, along with the neighborhood’s population and location, to an LLM (GPT-3.5 Turbo) to generate a summary text in different styles, for example one reminiscent of a “real estate brochure”.
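Assembling the prompt for such a summary might look like the sketch below. The wording and field names are invented for illustration; the team's actual prompt is not published:

```python
# Illustrative prompt assembly for the "neighborhood vibe" summary.
# The prompt wording is an assumption, not the team's actual prompt.

def build_vibe_prompt(neighborhood, population, facility_counts,
                      style="real estate brochure"):
    facilities = ", ".join(f"{n} {kind}(s)"
                           for kind, n in sorted(facility_counts.items()))
    return (
        f"Write a short description of the neighborhood around {neighborhood} "
        f"(population about {population}) in the style of a {style}. "
        f"Within a 10-minute walk there are: {facilities}."
    )

prompt = build_vibe_prompt("Wiedikon, Zürich", 23000, {"school": 2, "bar": 8})
# `prompt` would then be sent to the chat-completion endpoint of the chosen LLM.
```

Keeping the style as a parameter makes it cheap to offer several tones (brochure, factual, playful) from the same underlying data.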

In addition, taking advantage of OpenRouteService and the Swiss public transport API, travel times to the workplace, together with a user-defined maximum acceptable commute time and preferred means of travel, were factored into the Custom Score, returning a score tailored to the user’s preferences.

Figure 2. Working prototype with the comprehensive analysis of the neighborhood vibe

The working prototype can be tested as a Streamlit app (see also Figure 2 above).
 

Enhancing Quality Control in Electric Motors Using AI-Driven Vibration Testing

Students: Naveen Chand Dugar, Matthias Gumbert, Danijel Matesic

BMW is a German multinational manufacturer of luxury vehicles and motorcycles headquartered in Munich, Bavaria. For them, quality control in electric motor production is a critical aspect of ensuring reliability and performance. Traditional methods of testing and classifying motors based on vibration analysis can be labor-intensive and prone to human error. In response to these challenges, the project aimed to revolutionize the quality assurance process for prototyping motors through the integration of AI-driven automation.

Naveen, Matthias, and Danijel have leveraged their diverse expertise to tackle the complexities of motor quality control.

The primary aim of the project was to enhance the efficiency and accuracy of quality assurance processes for prototyping electric motors (see Figure 1). The team achieved this by automating the process of classifying vibration tests, employing AI for precise anomaly detection, and conducting root cause analysis to identify potential quality issues.
 
The BMW Electric motor
Figure 1. Electric motor 

Key objectives were to:
  • Use AI to automate the classification of vibration tests, distinguishing between OK and faulty motors.
  • Implement AI models to detect anomalies in vibration data, providing insights into potential issues.
  • Utilize data analytics to perform root cause analysis, improving the overall quality assurance process.

To ensure the accuracy and reliability of the AI models, the team followed a comprehensive workflow:
  • Ensuring the vibration data is free from noise and errors.
  • Structuring and formatting the data.
  • Using advanced data analytics and AI models to predict motor quality and identify potential issues.

The AI model, built using transfer learning with Convolutional Neural Networks (CNN), predicts motor quality by analyzing vibration data. The model is integrated into a user-friendly dashboard, allowing easy interaction with testing data. This dashboard facilitates real-time analysis and visualization of motor test results, anomaly detection, and root cause analysis.
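A common preprocessing step for feeding vibration data to a CNN is converting the raw time series into a spectrogram "image". The sketch below shows one way to do this with a plain short-time Fourier transform; the frame sizes and the synthetic signal are illustrative, not BMW's actual data:

```python
# Sketch: converting a raw vibration signal into a magnitude spectrogram,
# the kind of 2-D input a pretrained CNN backbone can consume.
import numpy as np

def spectrogram(signal, frame=256, hop=128):
    """Magnitude STFT: one row per time frame, one column per frequency bin."""
    frames = [signal[i:i + frame] * np.hanning(frame)
              for i in range(0, len(signal) - frame + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

# Synthetic "motor" signal: a 50 Hz fundamental plus noise, sampled at 1 kHz.
rng = np.random.default_rng(0)
t = np.arange(0, 4.0, 1e-3)
sig = np.sin(2 * np.pi * 50 * t) + 0.1 * rng.standard_normal(t.size)
spec = spectrogram(sig)
# `spec` can now be resized and stacked to three channels for transfer
# learning with an ImageNet-pretrained CNN.
```

Healthy and faulty motors leave different signatures in such spectrograms, which is what lets an image-classification CNN separate OK from not-OK tests.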

A visualization of motor test results, anomaly detection, and root cause analysis.
Figure 2. AI workflow

By leveraging AI for predicting motor quality and conducting data analytics for root cause analysis, the efficiency and accuracy of the quality assurance process were significantly improved. The dashboard provides a streamlined interface for interacting with testing data, making it easier for engineers to make informed decisions.

To further enhance the system, the team plans to:
  • Integrate the AI model into the test workflow for real-time analysis.
  • Continuously evaluate and optimize the model based on new data and feedback.
  • Leverage large language models (LLM) for enhanced Q&A interaction, providing better support and insights to users.

Naveen, Matthias, and Danijel extend their gratitude to our collaborators at BMW Group and Constructor Academy who have contributed to the success of this project.
 

Sustainability Reporting: Leverage AI to Enhance Efficiency

Students: Anja Wettstein, Fatima Yousif Gaffar, Stefanie Wedel, Alexandre da Silva

In today’s rapidly evolving business landscape, sustainability has moved from being a mere buzzword to a critical focus area for companies worldwide. Organizations are under increasing pressure to not only adopt sustainable practices but also to transparently report their progress. Understanding this, Engageability offers innovative solutions that address global sustainability challenges for both the public and private sectors.

By evaluating sustainability reports, Engageability provides valuable insights into how effectively companies are addressing sustainability issues, particularly in compliance with global standards like those set by the Task Force on Climate-Related Financial Disclosures (TCFD). This project marks a significant advancement in how Engageability achieves its goals.

The team has developed an AI-powered tool that improves the way sustainability reports are analyzed, reducing the time required for this crucial task from an entire day to just a few hours.

The TCFD has outlined specific requirements for reporting on climate-related matters, and based on these guidelines, Engageability has crafted 32 key questions to assess companies' reporting practices.

The AI model begins by ingesting the companies' reports, which can range from general annual reports to detailed sustainability documents, often spanning 50 to 120 pages. Through advanced techniques like similarity search and semantic matching, the AI scans these reports to find answers to the predefined questions. The model then processes these findings through a Large Language Model (LLM), which generates human-like responses. Each response includes a straightforward Yes/No answer and the reasoning behind the model's assessment, providing a clear and concise evaluation.
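The retrieval step described above can be sketched as ranking report chunks against a TCFD-style question by cosine similarity. Real systems use dense embeddings from a language model; simple bag-of-words vectors stand in here so the example is self-contained:

```python
# Minimal sketch of the retrieval step: rank report chunks against a question
# by cosine similarity. Bag-of-words vectors stand in for real embeddings.
import numpy as np

def bow_vector(text, vocab):
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

def top_chunks(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    vocab = sorted({w for t in [question, *chunks] for w in t.lower().split()})
    q = bow_vector(question, vocab)
    scores = []
    for c in chunks:
        v = bow_vector(c, vocab)
        denom = np.linalg.norm(q) * np.linalg.norm(v)
        scores.append(float(q @ v / denom) if denom else 0.0)
    ranked = sorted(zip(scores, chunks), reverse=True)
    return [c for _, c in ranked[:k]]
```

The top-ranked chunks, rather than the full 50-120 page report, are what gets passed to the LLM to produce the Yes/No answer and its reasoning.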

The application Anja, Fatima, Stefanie, and Alexandre developed is designed for ease of use. Users can select from different language models and upload the PDF of the report they wish to analyze. Users can also download the results, which include the reasoning behind each Yes/No answer and the specific pages where the information was found (see Figure 1).

LLM report analysis tool - main dashboard
Figure 1. LLM report analysis tool

The AI-powered tool is a significant step forward in helping Engageability efficiently evaluate sustainability reports. By extracting relevant passages and providing clear answers to TCFD-related questions, the tool dramatically reduces the time and effort required for analysis.

To further elevate this project, we recommend two key improvements. First, the questions used for analysis should be formulated to be as precise and clear as possible, eliminating any ambiguities that could lead to inaccurate responses. Second, consistency in the documents used for analysis is crucial. Ensuring uniformity will enhance the reproducibility and transparency of the results, laying a strong foundation for further model development and refinement.

In summary, this project not only advances Engageability's mission but also provides an example for how sustainability reporting can be streamlined and improved through the use of AI.
 

Stable Solutions: Streamlining Equestrian Product Data for Online Retail

Students: Sebastian Gottschalk, Kerstin Kirchgässner, Rusen Yasar

Riders Deal is an online retail shop and Germany’s biggest deal platform specializing in equestrian products. As they have a large product base coming from multiple suppliers, they need to transform differently structured product data into a single format that can be used by their webshop system. Automating this data transformation would mean a less labor-intensive process, faster integration into the website, and lower costs.

Sebastian, Kerstin, and Rusen designed an application that takes product data as provided by a supplier and, together with user-defined parameters, returns a data file in a standardized format. They integrated programmatic data processing techniques, which ensure data integrity for readily available information, with Natural Language Processing (NLP) machine- and deep-learning models trained on historical data (see Figure 1 below).

The workflow that converts the product data from five major suppliers
Figure 1. AI workflow
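The deterministic part of such a pipeline is essentially a mapping from each supplier's column names onto one target schema. The sketch below uses invented field names for illustration; fields the programmatic step cannot fill (such as a category) are the ones the trained models would predict:

```python
# Sketch of the programmatic step: mapping supplier-specific column names
# onto one standardized schema. All field names here are invented.

SUPPLIER_MAPPINGS = {
    "supplier_a": {"Artikelname": "title", "EAN": "ean", "VK-Preis": "price"},
    "supplier_b": {"product": "title", "barcode": "ean", "retail_price": "price"},
}

TARGET_FIELDS = ["title", "ean", "price", "category"]

def standardize(row, supplier):
    """Rename a supplier row into the target schema; unknown fields stay None."""
    mapping = SUPPLIER_MAPPINGS[supplier]
    out = {field: None for field in TARGET_FIELDS}
    for src, value in row.items():
        if src in mapping:
            out[mapping[src]] = value
    return out  # 'category' stays None here; a trained classifier would predict it

record = standardize(
    {"product": "Leather bridle", "barcode": "400123", "retail_price": "89.90"},
    "supplier_b",
)
```

Keeping the mappings as data rather than code is what makes adding a sixth or seventh supplier cheap.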

The current prototype is tailored to convert the product data from five major suppliers. The team also designed a graphical user interface through which the user can interact with the app, define parameters, and conveniently upload and download files. A snapshot of the app can be found in Figure 2 below. This app offers the opportunity to significantly reduce the time and costs associated with product data transformation.

Product data transformation
Figure 2. Working prototype to convert the product data

Future improvements would involve further optimization of the data transformation methods and of the machine- and deep-learning models, with iterative feedback from business expertise. A natural next step will be expansion to more suppliers, and eventually full coverage of all suppliers.

ProductTwins: Transforming Product Data Management With AI

Students: Gabriel D. Guerra & Nikita G. Meshin

The ProductTwins project transforms product data management by developing a digital database for balcony connectors. This initiative was undertaken in collaboration with Pro Engineers, an engineering company employing computer-aided design for construction, and Leviat, a prominent designer of engineered connection solutions. The project aims to provide engineers with rapid access to comprehensive product information.

Gabriel and Nikita were tasked to first extract raw product data from brochures and type approvals for Leviat and its competitor Schöck. The next step was to convert the data into a usable format and create a searchable database with robust product comparison features. The data extraction process initially faced challenges with inconsistencies in table detection from scanned PDF files. After focusing on text-based PDFs, smoother data extraction was achieved, leading to the creation of a database containing thousands of product iterations.

Snapshot of the product database
Figure 1. Product database

Following the establishment of the database, a Streamlit application was developed to facilitate easy product comparison by enabling users to input desired ranges and search for alternative products. Key functionalities include searching by model number or specifications, adjusting search range strictness, and displaying separate outputs for Leviat and Schöck products. Future enhancements aim to improve the user interface and add further data to broaden the search results.
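The range-based comparison search can be sketched as a tolerance filter over product specifications. The spec names, values, and tolerance below are invented for illustration, not Leviat's or Schöck's actual catalog data:

```python
# Sketch of the range-based comparison search: keep products whose every
# numeric spec lies within a tolerance window around the target.
# All models, brands, and spec values here are invented.

PRODUCTS = [
    {"model": "L-100", "brand": "Leviat", "load_kn": 100, "height_mm": 180},
    {"model": "L-120", "brand": "Leviat", "load_kn": 120, "height_mm": 200},
    {"model": "S-105", "brand": "Schöck", "load_kn": 105, "height_mm": 185},
]

def find_alternatives(target, tolerance=0.10, products=PRODUCTS):
    """Return models whose numeric specs are all within ±tolerance of the target."""
    matches = []
    for p in products:
        specs = {k: v for k, v in p.items() if isinstance(v, (int, float))}
        if all(abs(v - target[k]) <= tolerance * target[k] for k, v in specs.items()):
            matches.append(p["model"])
    return matches

alts = find_alternatives({"load_kn": 100, "height_mm": 180})
```

Exposing `tolerance` in the UI is what "adjusting search range strictness" amounts to: a wider tolerance surfaces more alternatives across both brands.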

A working prototype of the Product Finder App
Figure 2. Working prototype to find a product
 
In addition, the project explores using Python to program CAD models, allowing engineering companies to generate 3D models for digital environments. Pro Engineers plans to leverage existing CAD files and product naming conventions to train an AI model that generates CAD files based on product names. This approach holds the potential to automate processes and therefore enhance efficiency.

The ProductTwins project illustrates the transformative potential of integrating data science into engineering workflows, fostering collaboration between Leviat and Pro Engineers. Gabriel and Nikita are keen on further advancing product data management and look forward to future collaboration in this innovative endeavor.

Conclusion

As we conclude this remarkable journey with Data Science Final Projects Group #26, we want to express our deepest gratitude to all the companies that provided our students with invaluable projects. Your collaboration has not only enriched their learning experience but has also paved the way for innovative solutions to real-world challenges.

To the students who joined us in May and dedicated themselves wholeheartedly to completing the course and their final projects, we commend your exceptional efforts. Your dedication, skill, and passion for data science have truly shone through. We wish you all the best in your future endeavors. May you continue to push boundaries, innovate, and make a meaningful impact wherever your careers take you.

For those inspired by these stories and interested in starting their own data science journey, we're excited to announce our upcoming bootcamp. Learn more about our program and how you can join the next cohort of data science innovators at Constructor Academy.

Interested in reading more about Constructor Academy and tech related topics? Then check out our other blog posts.
