Tuesday, December 17, 2024
Imagining the Future of ML Evaluation in Drug Discovery
Innovation in machine learning for drug discovery is moving at an incredible pace, yet these advancements often struggle to translate into real drug discovery programs. A major roadblock is the scattered and inaccessible nature of evaluative insights.
At Polaris, we are actively re-thinking how methods are assessed and compared in order to bridge this gap. We’re currently exploring an evaluation dashboard feature that aims to empower ML practitioners with the necessary insights to make informed decisions.
In our first proposed guidelines on method comparison, we provided recommendations on appropriate visualizations for the statistical tests discussed. What if evaluation systems could move beyond static leaderboards to deliver meaningful insights and actionable analysis?
For the rest of this post, let’s step into the shoes of an ML researcher and explore what this future might look like on Polaris!
The Future, Reimagined
Imagine that you’re an ML scientist at a biotech company, working on developing models for property prediction. In your current project, your model fails to generalize to the chemical space of interest. In an attempt to resolve this, you decide to search for the state of the art in ADMET prediction.
In Polaris’ datasets browser, you begin by selecting small molecules as the modality. To refine your search, you add the ADMET tag and apply additional filters for dataset size and popularity. Finally, you click into a certified dataset from the top results and navigate to its related benchmark.
1. Overview Dashboard
Within the Results tab of the benchmark artifact page, you are presented with the Overview Dashboard, which captures a summary of all the results for a common benchmark.
The results are displayed in a table similar to the traditional leaderboards you are used to seeing. However, there are several key differences:
- The table follows a group-based ranking system, where methods are grouped based on the statistical similarity of their performance distributions (a rough sketch of how such grouping could work follows this list).
- A wide range of metrics are available for selection, providing a more holistic overview of each method’s performance.
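To make the group-based ranking more concrete, here is a minimal sketch of one way such a grouping could work: methods are ordered by their median score, and a new rank group is started whenever a method’s bootstrapped score distribution differs significantly from the last method in the current group. This is purely illustrative (the method names and scores are invented) and is not Polaris’ actual implementation.

```python
# Illustrative sketch of group-based ranking (not the actual Polaris implementation).
# Methods whose score distributions are not statistically distinguishable share a rank group.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
scores = {  # hypothetical bootstrapped MAE samples per method (lower is better)
    "method_a": rng.normal(0.40, 0.02, size=100),
    "method_b": rng.normal(0.41, 0.02, size=100),
    "method_c": rng.normal(0.55, 0.03, size=100),
}

# Order methods by median score, then start a new group whenever a method differs
# significantly (p < 0.05) from the last method added to the current group.
ordered = sorted(scores, key=lambda m: np.median(scores[m]))
groups = [[ordered[0]]]
for method in ordered[1:]:
    _, p_value = mannwhitneyu(scores[groups[-1][-1]], scores[method])
    if p_value < 0.05:
        groups.append([method])    # significantly different: new rank group
    else:
        groups[-1].append(method)  # statistically similar: same rank group

for rank, group in enumerate(groups, start=1):
    print(f"Rank {rank}: {', '.join(group)}")
```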
You notice that in the top right corner of the table, there is a button with the text “Visualize”. When you click it, the table collapses to reveal the Visualizations panel on its right.
The panel presents a variety of plots designed to provide comprehensive insights. Similar to the table, you can adjust the customizable axes to focus on any metrics of your choosing. Beyond the default plots, you can add additional plots and customize the layout in any way that suits your needs. Each plot can also be expanded to a full-screen view for more detailed examination.
After some exploration, you have narrowed in on a few methods for closer comparison.
2. Comparison Dashboard
Having identified a few methods of interest, you navigate to the Comparison Dashboard, which provides detailed comparative insights for your selected methods.
Method comparisons are commonly presented through traditional leaderboards, but their absolute ranking system offers limited insights. For example, they often fail to account for trade-offs between different metrics or reveal whether performance differences between methods are statistically significant. Important operational implications like scalability and computational cost are also rarely included as metrics.
Within the dashboard, a range of visualizations and a rich selection of metrics are included to enhance your comparative analysis. Through interacting with visualizations like the critical difference diagram, the multiple comparison similarity plot, and the Pareto front plot, you start to gain comprehensive insights into how the methods compare against each other.
You quickly realize that the top-ranked method only marginally outperforms the second-ranked method on the metrics that you prioritize. However, according to the Pareto front plot, the latter method has a much lower computational cost. Since your use case prioritizes computational efficiency, you decide to explore the second-ranked method further.
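To illustrate what the Pareto front plot captures, here is a tiny sketch (with invented numbers) that identifies which methods are Pareto-optimal when trading off an error metric against computational cost:

```python
# Illustrative sketch of the idea behind a Pareto front plot (all numbers invented).
# With an error metric and a computational cost, both "lower is better", a method is
# Pareto-optimal if no other method is at least as good on both axes and strictly better on one.
methods = {
    "method_1": {"mae": 0.40, "gpu_hours": 120.0},
    "method_2": {"mae": 0.41, "gpu_hours": 8.0},
    "method_3": {"mae": 0.55, "gpu_hours": 2.0},
    "method_4": {"mae": 0.50, "gpu_hours": 60.0},
}

def dominates(a, b):
    """True if a is at least as good as b on both axes and strictly better on at least one."""
    return (
        a["mae"] <= b["mae"]
        and a["gpu_hours"] <= b["gpu_hours"]
        and (a["mae"] < b["mae"] or a["gpu_hours"] < b["gpu_hours"])
    )

pareto_front = [
    name
    for name, stats in methods.items()
    if not any(dominates(other, stats) for other_name, other in methods.items() if other_name != name)
]
print("Pareto-optimal methods:", pareto_front)  # method_4 is dominated by method_2 and drops out
```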
3. Result Dashboard
Clicking on the specific method takes you to its associated Result Dashboard.
At the end of the day, you need more than compelling performance metrics to make an informed decision. You also want to understand the model’s strengths, weaknesses, and capacity to generalize. In the past, you often had to sift through raw data and manually generate plots before you could even make educated guesses.
This time, informative visualizations are already embedded in the dashboard. A range of customizable settings, such as performance metrics, train-test splits, and plot types, is also available within an intuitive interface.
To understand how well the model generalizes, you turn to two plots: performance as a function of distance to the training set, and a scatter plot of predictions against ground truth. In the first, you enter the distance between your chemical space of interest and the training set and find that performance remains acceptable at that distance. In the second, you examine the metric within the range of predicted values that matters for your project.
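As a rough illustration of what powers the first of these plots, the sketch below assigns each test molecule its Tanimoto distance to the nearest training molecule and aggregates the error per distance bin. It is a simplified sketch, not the dashboard’s implementation; the SMILES, labels, and predictions are assumed to come from your own data.

```python
# Simplified sketch of "performance vs. distance to the training set"
# (not the dashboard's actual implementation).
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

def fingerprint(smiles):
    """Morgan (ECFP-like) fingerprint for a single SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

def distance_to_train(test_smiles, train_smiles):
    """1 - max Tanimoto similarity to any training molecule, for each test molecule."""
    train_fps = [fingerprint(s) for s in train_smiles]
    return np.array([
        1.0 - max(DataStructs.BulkTanimotoSimilarity(fingerprint(s), train_fps))
        for s in test_smiles
    ])

def mae_by_distance_bin(y_true, y_pred, distances, edges=(0.0, 0.2, 0.4, 0.6, 0.8, 1.01)):
    """Mean absolute error within each distance bin; the quantity behind the plot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    per_bin = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (distances >= lo) & (distances < hi)
        if mask.any():
            per_bin[f"[{lo:.1f}, {hi:.1f})"] = float(np.mean(np.abs(y_true[mask] - y_pred[mask])))
    return per_bin
```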
The more you explore the dashboard, the more confidence you gain in the model’s performance.
4. The investigation continues!
Now that you are confident in the results of this model, you are ready to commit more time to your investigation. Usually, you would have to dig around to locate the resources that contain further relevant information. If only there were an easier way to access everything that you need…
Links to the associated paper and codebase are included at the top of the Result Dashboard, so you are taken to the relevant resources within a few clicks. Eagerly, you begin reading the paper to dive deeper into the motivation and technical aspects behind the methodology.
5. What about your own method?
After a few months, you have made significant progress in your project. By now, you have developed a new method that you are excited about. While preparing to write a paper about the method, you recall the dashboards that you explored in earlier sections of this blog post and wonder: what if you could gather the same level of comparative and evaluative insights for your own work?
You quickly turn the thought into action by logging into your organization’s account on Polaris and uploading the method to the platform as a private artifact. While it uploads, the method runs through an automated evaluation pipeline in the background, so its Result Dashboard is ready to explore as soon as the upload completes.
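Part of this workflow already exists in the Polaris Python client today: you pull a benchmark, generate predictions on its held-out test set, and submit the evaluated results to the Hub. The snippet below is a rough sketch in the spirit of the client’s quickstart; exact method names and accessors may differ between releases, and the model, featurization, and organization name are placeholders.

```python
import polaris as po
from sklearn.ensemble import RandomForestRegressor

# Load a benchmark from the Polaris Hub and retrieve its predefined train/test split.
# (In practice the SMILES inputs would first be featurized, e.g. into fingerprints.)
benchmark = po.load_benchmark("polaris/hello-world-benchmark")
train, test = benchmark.get_train_test_split()

# Train any model of your choosing on the training split.
model = RandomForestRegressor()
model.fit(train.X, train.y)

# Evaluate predictions against the held-out labels and upload the result,
# e.g. privately under your organization's account on the Hub.
results = benchmark.evaluate(model.predict(test.X))
results.upload_to_hub(owner="my-org")
```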
Curious about how your method compares against others, you select a few methods of interest and proceed to the Comparison Dashboard to examine their differences. Essentially, you go through a similar process as you did months ago. But this time, it’s for your own method.
During this process, you identify several visualizations that you would like to include in your paper. Within a few simple clicks, you have the visualizations exported in your preferred format, ready to be inserted into the paper.
By now, you know that you can rely on Polaris as the one-stop platform to gather evaluative insights, whether you are learning more about your own work or exploring someone else’s method. The platform also makes it easy to share your work with the community once it’s ready for publication, inviting others to explore your work beyond the scope of your paper.
What Does it Take to Get There?
ML models have become deeply embedded in the drug discovery process. Careful consideration is required when choosing which models to deploy in production, since the decision informs costly experiments and downstream operations. In other words, it’s essential to establish confidence in a model’s ability to outperform the current baseline.
Historically, such confidence has been gained through a laborious process: manually running tests, gathering interpretations from disparate sources, and making assumptions about the trade-offs. The goal of this evaluation dashboard is to change that. Researchers should instead be empowered to explore results intuitively and arrive at informed decisions with ease.
At Polaris, our goal is to make it easier for the community to adopt the best practices we recommend in our papers. At the end of the day, this dashboard isn't just a step forward for ML practitioners; it's a leap toward practical and impactful drug discovery.
We're excited to collaborate with the community to refine this vision further. Specifically, we would love to hear your feedback on these mock-up designs! Please tell us what you think by reaching out to us through GitHub or Discord.