Data Science capstone projects batch #23

by Ekaterina Butyugina

Let's pause and extend a heartfelt shout-out to the students who joined us in August, pouring their dedication into conquering the course and capstone projects. In just three months, our Batch #23 Data Science students in Zurich have triumphed over challenging projects, showcasing exceptional skills. Experience the remarkable impact of data science as it pushes boundaries, reveals insights, and drives meaningful change.

Understanding energy consumption in real estate

Students: Jin Cao, Artem Rakcheev, Stephanie Sabel, Timothy Frei

Real estate properties consume significant amounts of electricity and clean water and thereby contribute substantially to global greenhouse gas emissions. To meet state-mandated climate goals, real estate investors are required to reduce the emissions from their portfolio properties to net zero by 2050. A first step towards this target is understanding which factors critically affect electricity and water consumption in real estate properties. Another issue is that consumption data is not always available at reporting time (generally the end of the year). As a result, the consumption for October, November, and December is often estimated from previous years, which can lead to inaccurate results. Machine learning models can help produce better forecasts of the consumption for these missing months.

Novalytica, a data science startup with real estate expertise, is looking into helping investors address these challenges with custom-tailored data and machine learning solutions. To do this, they gave our team access to several datasets concerning energy consumption and property attributes such as building type, certification, and distances to points of interest. Overall, the data covered 178 properties, with records from 2019 to 2022. The team then combined this data with weather data, on the assumption that energy consumption depends strongly on local weather conditions.

To identify the key drivers of energy consumption, the team fit a gradient-boosting regression model to the data and used the SHAP package in Python to compute the importance of individual features.
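To make the idea of feature contributions concrete, here is a minimal sketch of the Shapley values that SHAP is built on. It is not the team's code and does not use the SHAP package: it computes exact Shapley values by brute force for a made-up three-feature model. The function `toy_model`, its coefficients, and the feature values are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley attribution of predict(x) - predict(baseline).

    Features outside a coalition are set to their baseline value.
    The cost grows as 2**n, so this only works for a handful of features.
    """
    n = len(x)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight shared by all coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                without_i = [x[j] if j in coalition else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                values[i] += weight * (predict(with_i) - predict(without_i))
    return values


# Hypothetical toy model: consumption as a function of
# [floor area in m2, building age in years, heating degree days].
def toy_model(features):
    area, age, hdd = features
    return 0.05 * area + 0.8 * age + 0.02 * area * hdd / 100


property_x = [1200.0, 40.0, 300.0]      # one selected property (made up)
portfolio_mean = [800.0, 25.0, 250.0]   # portfolio average used as baseline (made up)
contributions = shapley_values(toy_model, property_x, portfolio_mean)
```

The contributions always add up to the difference between the prediction for the property and the prediction for the baseline, which is what makes them easy to read as "how much each feature pushed consumption up or down". For tree ensembles such as gradient boosting, the SHAP package computes these values far more efficiently than this enumeration.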

The team then tackled forecasting the last three months of a year from the previous nine months by training a long short-term memory (LSTM) deep neural network. Compared with a naive approach that took the average per month over the previous years, the LSTM reduced the mean absolute error by 27% for electricity consumption and 35% for water consumption.
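The naive baseline and the error metric are simple enough to sketch in a few lines. The example below is an illustration only: the readings, the `model_forecast` values, and all function names are made up, and it does not reproduce the team's LSTM or their reported numbers.

```python
from statistics import mean


def naive_forecast(history, target_months=(10, 11, 12)):
    """Average each target month over all previous years.

    history maps year -> {month: consumption}.
    """
    return {
        m: mean(year[m] for year in history.values() if m in year)
        for m in target_months
    }


def mean_absolute_error(actual, predicted):
    """Mean of |actual - predicted| over the months present in actual."""
    return mean(abs(actual[m] - predicted[m]) for m in actual)


# Made-up monthly electricity readings (kWh) for one property.
history = {
    2019: {10: 410.0, 11: 470.0, 12: 520.0},
    2020: {10: 390.0, 11: 450.0, 12: 500.0},
    2021: {10: 400.0, 11: 460.0, 12: 510.0},
}
actual_2022 = {10: 430.0, 11: 455.0, 12: 540.0}

baseline = naive_forecast(history)
baseline_mae = mean_absolute_error(actual_2022, baseline)

# Hypothetical model output, only to show how a relative improvement is computed.
model_forecast = {10: 420.0, 11: 460.0, 12: 525.0}
model_mae = mean_absolute_error(actual_2022, model_forecast)
reduction = 1 - model_mae / baseline_mae  # fraction by which the MAE drops
```

A figure such as "27% lower mean absolute error" corresponds to `reduction` being 0.27 when both errors are measured on the same held-out months.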

Graphs depicting feature contributions and predicted consumption of energy

Figure 1, app screenshot: The left panel displays the feature contributions for a selected property (blue) and the entire portfolio (green). In the right panel, one can change various property features and get an estimated consumption.

Historical consumption of energy and forecasted consumption of energy

Figure 2, app screenshot: The left panel displays the historical consumption data for a selected property (blue) and the entire portfolio (green). In the right panel, one can input consumption data for nine months and get a prediction for the following three months. An error estimate (shaded region) based on the mean absolute error is provided.

Finally, the resulting models were incorporated into a Streamlit web app (see the screenshots above) that gives investors easy access to key performance indicators and allows them to predict consumption based on changes in particular property attributes. Furthermore, it provides an interface to forecast consumption based on the most recent data.

Ultimately, enabling access to predictive analytics can guide real estate investors to make decisions that are good for the environment while simultaneously benefiting their financial returns.

Multiple myeloma: a survival story

Students: Antonio Mariano, PhD, Dr. Tatiana Keller, Gordon W Marshall

Over 160,000 people worldwide are currently living with multiple myeloma - a rare cancer that affects the production of plasma cells and causes a multitude of symptoms, including reduced kidney function, bone lesions, and anemia.

How can we ensure timely and effective treatment for these patients? What are the key indicators that signify the success of the treatment?

TriNetX is a global health research network that provides a platform for healthcare organizations, researchers, and life sciences companies to collaborate and access real-world clinical data (RWD) for clinical research and analysis. TriNetX aims to accelerate clinical trials, improve study design, and enhance patient recruitment by offering a comprehensive and standardized view of patient populations across various healthcare institutions.

The team was provided with an anonymized RWD database of multiple myeloma patient treatment histories and was tasked with developing a prognostic model for treatment outcomes and ranking the prognostic factors by their significance.

The data consisted of 390 attributes for 2,600 observations (patients), each linked with the Time To Next Treatment (TTNT), which is the time elapsed between a patient's first and second line of treatment.

This database contained a large amount of missing data. Therefore, as a first necessary step, the team developed a data pipeline to clean the data and impute the missing values.

Next, the team faced the problem of understanding which attributes, or features, influence the TTNT. Since this time marks an event (the next treatment phase), the problem falls into the category of survival analysis. Following the typical survival analysis approach, the team fitted a Cox regression model and evaluated its performance using the Concordance Index (C-index) on a train-test split. The score obtained was 0.62 (where 0.5 corresponds to random ranking and 1.0 to perfect ranking), which indicates good discrimination in line with similar problems in the literature.
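For readers unfamiliar with the C-index, the following sketch shows how it can be computed from scratch. It is a simplified version of Harrell's C-index (pairs with tied times are ignored), written for illustration with made-up data; it is not the team's evaluation code, and survival analysis libraries provide more complete implementations.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index: share of comparable pairs that are ranked correctly.

    A pair is comparable when the patient with the shorter time actually
    had the event (was not censored). A higher risk score should go with
    a shorter time; ties in risk score count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Made-up example: months to next treatment, event flag (1 = observed,
# 0 = censored), and a model's risk score for five patients.
times = [5, 8, 12, 20, 30]
events = [1, 1, 0, 1, 0]
risk_scores = [0.9, 0.4, 0.7, 0.3, 0.1]

c_index = concordance_index(times, events, risk_scores)
```

In this toy data, seven of the eight comparable pairs are ordered correctly, so `c_index` is 0.875. Only the ordering of the risk scores matters, not their scale.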

The team then used this model to rank the features by their Hazard Ratio (HR), the relative risk of the event, which here is the start of the next treatment line. Features with a high HR shorten the time until a patient needs a new treatment phase, while features with a low HR lengthen it. See the Hazard Ratio plot below for reference (Figure 1).

A graph that depicts the low and high risk factors

Figure 1. Hazard Ratio showing the low-risk factors (left) and high-risk factors (right).
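The post does not spell out the formulas, so as background, here is the standard form of the Cox proportional hazards model and the hazard ratio derived from it. The symbols ($h_0$, $\beta_j$, $x_j$, $p$) are the usual textbook notation, not names taken from the team's work.

```latex
% Hazard for a patient with features x_1, ..., x_p
h(t \mid x) = h_0(t)\,\exp\!\left(\beta_1 x_1 + \dots + \beta_p x_p\right)

% Hazard ratio for a one-unit increase in feature j, all else equal
\mathrm{HR}_j = \frac{h(t \mid x_j + 1)}{h(t \mid x_j)} = \exp(\beta_j)
```

An $\mathrm{HR}_j$ above 1 means the feature raises the hazard and is associated with a shorter TTNT; a value below 1 means the opposite.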

The team went a step further to get a more detailed view with the help of Kaplan-Meier survival curves. Figure 2 (left) shows how a patient's performance status (ECOG) relates to the Time To Next Treatment (TTNT): “Limited Self-Care Ability” or “Disability” shortens the TTNT by several months in comparison with “Ambulatory” or “Restricted” activity patients. The model can also predict the effectiveness of different drugs (Figure 2, right).

A graph that depicts Kaplan-meier survival curves

A graph depicting the drug effectiveness in comparison

Figure 2. Kaplan-Meier Survival Curves for different features: ECOG (left) and Drugs (right).
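The Kaplan-Meier estimator behind such curves is short enough to write out. The sketch below is a from-scratch illustration with made-up data, not the team's code; in practice, a survival analysis library would typically be used, and one curve would be computed per group (for example, per ECOG category or per drug).

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    events[i] is 1 if the event (here: start of the next treatment line)
    was observed at times[i], and 0 if the patient was censored.
    """
    event_counts = Counter(t for t, e in zip(times, events) if e)
    exit_counts = Counter(times)  # everyone leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exit_counts):
        d = event_counts.get(t, 0)
        if d:
            # Multiply by the share of at-risk patients who did not have the event
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exit_counts[t]
    return curve


# Made-up data for one patient group: months to next treatment and event flags.
times = [3, 5, 5, 8, 12, 12]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
```

Each step of the curve is the estimated probability that a patient in the group has not yet started the next treatment line by that time. Censored patients still count as at risk up to the time they drop out, which is what distinguishes this estimate from a simple fraction of remaining patients.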

For a more detailed analysis, the model needs patient stratification by age, cancer stage, or survival ranking by C-index. This is planned as a future model improvement.

Elevate your career with Constructor Academy's cutting-edge Data Science Bootcamp.

Are you ready to unlock a world of limitless possibilities in a highly in-demand, esteemed, and financially rewarding career? Look no further than the Data Science bootcamp offered by Constructor Academy.

Designed to equip you with the essential techniques and technologies for harnessing the power of real-world data, our bootcamp offers two flexible options: full-time (12 weeks) and part-time (22 weeks). Throughout this immersive experience, you will master transformative technologies including machine learning, natural language processing (NLP), Python, deep learning, and data visualization.

But wait, there's more! Embark on your data science journey with our complimentary introduction to the captivating realm of data science. Simply click here to access this valuable resource and start your exploration today.

Get ready to embrace a future brimming with endless opportunities. Constructor Academy is committed to empowering aspiring data scientists like you to unleash your true potential and pave the way for unparalleled success. Join us on this exhilarating adventure, and let's shape the future of data science together.
