Wednesday, January 29, 2025
Reproducible Machine Learning in Drug Discovery: How Polaris Serves as a Single Source of Truth


One of the guiding design principles when we first built Polaris was that it should serve as a single source of truth for the machine learning community tackling drug discovery challenges. Early on, we asked ourselves a central question: how do we ensure everyone is actually working with the same data? That might sound trivial, but in practice it’s surprisingly difficult to confirm in drug discovery research.
People often publish results claiming to have used “Dataset X,” yet when you dig deeper, you realize they applied different filters, used mismatched file versions, or employed different splitting schemes. Those variations make it impossible to confidently compare any two models. Worse, they can lead to misleading scientific conclusions, because apparent differences in performance may stem from these discrepancies rather than from actual model improvements.
We created Polaris to tackle these challenges head-on, with two main pillars:
- Datasets – Immutable and ML-ready.
- Benchmarks – Standardized ways of evaluating models on those datasets, including consistent splits and metrics.
This blog post walks through the rationale behind these design choices and what we mean when we say ML-ready datasets and benchmarks on Polaris.
Subtly Different Datasets
Even small variations in a dataset—like removing certain compounds, trimming noisy measurements, or using a different version of the file—can drastically affect your benchmark results. Each of these steps can shift the data distribution in ways that are rarely documented. As a result, what one group calls “Dataset X” might not precisely match what another group claims is the “same” dataset. Moreover, the margins at the top of leaderboards are already slim, which amplifies the relative effect of these variations.
To avoid this ambiguity, Polaris locks in dataset versions. Once a dataset is uploaded, it becomes immutable—every time you or anyone else downloads it, it’s the identical dataset. This way, there’s no second-guessing whether you’re working with the same data as someone else.
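Polaris enforces this guarantee on the platform side, but the idea itself is easy to sketch by hand: pin a content hash the first time you fetch a dataset version, and refuse to proceed if the bytes ever change. The file name, helper function, and placeholder hash below are hypothetical and purely illustrative, not part of the Polaris API:

```python
import hashlib
from pathlib import Path

# Hypothetical pinned hash, recorded the first time this dataset version
# was downloaded; any later download must produce the exact same digest.
EXPECTED_SHA256 = "<pinned-hash-of-dataset-v1>"

def verify_download(path: Path, expected: str) -> None:
    """Fail loudly if the downloaded bytes differ from the pinned version."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    if digest != expected:
        raise ValueError(f"Dataset changed on disk: sha256 is now {digest}")

verify_download(Path("dataset-v1.zarr.zip"), EXPECTED_SHA256)
```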
ML-Ready Datasets
A Single Interface for Data Access
Drug discovery research often involves a maze of file formats—PDB for 3D structures, SDF for small molecules, XYZ for coordinate data, and so on. Each format requires specialized knowledge to parse and interpret properly. At Polaris, we wanted to remove that barrier.
Instead of juggling different parsers, we make it easy to convert whatever files you upload into a single standardized format: Zarr. If you're curious to learn more about Zarr and why we chose it, check out this blog. Through our API, you no longer have to worry about which file extension you’re dealing with. You simply request the data, and the API returns a clean, ML-ready representation of each data point.
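To make that concrete, here is a minimal sketch of the Zarr library itself (this illustrates Zarr generally, not Polaris internals; the array name is made up). One store can hold named arrays that originated as SDF, PDB, or XYZ files, and you read them all through the same slicing interface:

```python
import numpy as np
import zarr

# Write: create one array in a Zarr store. A real dataset would hold many
# named arrays (coordinates, labels, fingerprints, ...) in a single store.
z = zarr.open("coordinates.zarr", mode="w", shape=(10, 3), chunks=(5, 3), dtype="f4")
z[:] = np.random.rand(10, 3)

# Read: the same uniform array interface, regardless of which file
# format the data originally came from.
z = zarr.open("coordinates.zarr", mode="r")
print(z[:3])  # first three rows
```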
Our vision of an ML-ready dataset is that you download it from Polaris and you can immediately use it for training. There’s no need for additional pre-processing on your end. In fact, it’s not straightforward to pre-process a dataset directly through the Polaris API—you’d need to access the source data and re-upload it as a new version if you wanted to make changes.
This might differ from your typical workflow, but it’s by design, to ensure consistency. Everyone gets the same structure and fields, reducing discrepancies in how the data is processed from one user to the next. This design choice ties directly back to immutability.
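In code, fetching a dataset through the Polaris Python client looks roughly like this (a minimal sketch; the dataset slug and column name are placeholders):

```python
import polaris as po

# Load a dataset by its owner/name slug; the slug here is a placeholder.
dataset = po.load_dataset("my-org/my-dataset")

# Everyone who runs this gets the same rows, columns, and values back,
# so there is no client-side preprocessing step that can silently diverge.
datapoint = dataset.get_data(row=0, col="smiles")  # column name is illustrative
```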
Certified Datasets
We also offer certified datasets on Polaris. These have been reviewed by our team, in collaboration with domain experts, to confirm they pass basic curation checks: real-world relevance, consistent data sources, and no obvious duplicates, invalid entries, or ambiguous measurements. Learn more about the certification process here.
Benchmarks: A Single Source of Truth
While an immutable dataset is a necessary first step, it alone doesn’t guarantee reproducible model comparisons. How the dataset is split, which metrics are used, and other evaluation details matter just as much. On Polaris, we call a dataset plus these evaluation details (split indices, metrics, dataset version, etc.) a benchmark.
Consistent Splitting Strategies
When you train and evaluate a model, the splitting method (e.g., scaffold splits, random splits) can cause variation in outcomes. If the community aligns on the same dataset but uses different splits, you still can’t compare results fairly. That’s why Polaris stores the exact indices used in each split for a given benchmark. Our reasoning is simple: if we’re going to compare results, let’s really compare the same slices of data.
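Concretely, the split ships with the benchmark rather than being re-derived by each user. With the Polaris Python client, that looks roughly like this (a sketch using the public hello-world benchmark as an example):

```python
import polaris as po

# A benchmark pins the dataset version, the split indices, and the metrics.
benchmark = po.load_benchmark("polaris/hello-world-benchmark")

# The train/test split is retrieved, not recomputed, so everyone trains
# and evaluates on exactly the same indices.
train, test = benchmark.get_train_test_split()

for x, y in train:  # iterate over (input, target) pairs
    ...  # train your model here
```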
Standardized Metrics
Even something as widely used as Mean Absolute Error (MAE) can be implemented in subtly different ways—some people apply a log transform first, others might clip outliers, and sometimes an off-by-one error or other bug creeps in. Over time, these variations add up. We decided to codify each metric for a Polaris benchmark in a single, transparent implementation. Our priority here is eliminating “mystery differences” that have nothing to do with actual model performance.
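As a quick, self-contained illustration of how easily “the same metric” can drift (the numbers here are made up):

```python
import numpy as np

y_true = np.array([0.5, 1.2, 3.4, 8.9])
y_pred = np.array([0.7, 1.0, 3.0, 7.5])

# Plain MAE.
mae = np.mean(np.abs(y_true - y_pred))

# "MAE" computed after a log transform -- a different number entirely,
# even though a paper might report both simply as "MAE".
log_mae = np.mean(np.abs(np.log10(y_true) - np.log10(y_pred)))

print(f"MAE: {mae:.3f}, log-MAE: {log_mae:.3f}")  # 0.550 vs 0.089
```

In the Polaris client, scoring goes through the benchmark object itself (e.g. `benchmark.evaluate(predictions)`), so every submission is scored by the same metric implementation.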
Our goal with Polaris is to help researchers focus on science rather than file formats and dataset ambiguities. By standardizing how data is stored, ensuring immutability, and defining clear benchmarks, we aim to promote reproducibility in drug discovery.
Ultimately, Polaris is about making it easier for everyone to collaborate, compare results, and push the boundaries of what machine learning can achieve in drug discovery. It’s an evolving platform, shaped by the feedback and expertise of the community itself. We hope you’ll join us, explore our datasets, share your work, and help us continue refining this resource—so that together, we can drive the field forward.
Want to learn more about how Polaris can help streamline your ML research for drug discovery? Get in touch, join the Discord, or check out our docs to start exploring our growing suite of datasets and benchmarks today.