Thursday, October 31, 2024

Dataset v2.0 - Built to scale!

Julien St-Laurent

Drug discovery is not a single problem. Different communities are working on various rate-limiting steps that span the drug discovery pipeline. Over time, these communities have come to adopt different standards that are best suited to their needs, for example, in how they store their data. Now that ML has gained traction as an exciting tool to probe these data archives and unlock new breakthroughs, these differences in standards are making it hard for ML scientists to transfer their skills to new problem sets.

Lack of standardization in drug discovery has complicated data access for ML scientists.

With the Polaris Hub, we set out to design a universal data format for ML scientists in drug discovery. Whether you’re working with phenomics, small molecules, or protein structures, you shouldn’t have to spend time learning about domain-specific file formats, APIs, and software tools to be able to run some ML experiments. Our first attempt at this, which we presented at PyData London 2024, uses a combination of Pandas (saved as Parquet) and Zarr to implement a tabular data structure that is accessible through a universal API.

Beyond modalities, drug discovery datasets also come in different sizes. For example, Leash Bio’s recent BELKA competition consisted of ~300M rows and ~120GB of data (once decompressed). A quick look through the Kaggle discussion boards shows that many participants struggled with this volume of data. And BELKA is not even the largest dataset out there: RxRx3 from Recursion, for example, requires over 80TB. We realized that some of the assumptions we had made wouldn’t scale to that size.

That’s why we’re excited to release Dataset v2.0. Get started right away by playing around with the BELKA dataset or read the blog post from Leash!

ND-arrays are all you need

Zarr in a nutshell: A format for chunked, compressed, N-dimensional arrays, allowing for random access into extra large datasets.

The key realization that informed our v1.0 design was that high-dimensional arrays (or N-D arrays) are a fundamental and versatile data structure that would let us store the wide variety of data types we encounter in ML-based drug discovery. By building on top of Zarr, we also found a solution that could scale to TBs of data. Or so we thought…

The main issue was that we still relied on a Pandas DataFrame to represent the tabular structure of a dataset. Because the DataFrame has to be fully loaded into memory, it became a major bottleneck at scale. The solution is - yet again - more ND-arrays. Since Pandas itself is built on top of NumPy arrays, we realized we could store all of the data directly in the Zarr archive and overcome the bottleneck. Along the way, we overhauled our authentication system to use JWTs for faster storage access and built an event-driven workflow to ensure data integrity.

The result? Random access into larger-than-memory datasets with just a few lines of code!

Leash Bio’s BELKA

We’re excited to host BELKA as the first v2.0 dataset. Try your hand at the original Kaggle competition, or visit the dataset page and read Leash’s blog post to learn more about the raw dataset.

What's next?

With extra-large datasets come extra-large benchmarks, which come with their own unique set of challenges (e.g., how do you efficiently represent and store splits with billions of indices?). Currently, it’s not yet possible to define benchmarks based on v2.0 datasets, but we’re already hard at work to address these limitations in our Benchmark v2.0. Stay up-to-date with Polaris by signing up for our mailing list.
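Some back-of-the-envelope arithmetic makes the split-representation challenge concrete. A plain int64 index list for a billion training rows is already gigabytes of data before any training starts (the numbers below are just this arithmetic, not a proposed design):

```python
# Naive cost of representing a split over one billion rows.
n_rows = 1_000_000_000

int64_bytes = n_rows * 8   # explicit int64 indices: 8 GB
bool_mask_bytes = n_rows   # uncompressed boolean mask: 1 GB

print(int64_bytes / 1e9, "GB as int64 indices")
print(bool_mask_bytes / 1e9, "GB as a boolean mask")
```

Whatever representation ends up in Benchmark v2.0, it clearly cannot assume split indices fit comfortably in memory.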

If there are any other datasets you would like to see us host or if you have any other feedback, please get in touch.