The benchmarking platform for drug discovery.

Polaris makes it easy for the community applying machine learning to drug discovery to share and access datasets & benchmarks.

Increase your impact.

Our aim is to improve the state of benchmarking so ML can have a greater impact on real-world drug discovery scenarios. To start, we hope to provide a single source of truth that aggregates and provides simple access to datasets & benchmarks.

Download a dataset from the Hub
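A minimal sketch of what downloading a dataset might look like with the Polaris Python client, assuming the `polaris-lib` package is installed and you are logged in to the Hub. The helper name `load_hub_dataset` and the default identifier `polaris/hello-world` are illustrative placeholders, not something this page specifies; substitute the `owner/name` of the dataset you want.

```python
def load_hub_dataset(dataset_id: str = "polaris/hello-world"):
    """Fetch a dataset from the Polaris Hub by its "owner/name" identifier.

    `load_hub_dataset` is a hypothetical helper; the default identifier is
    an assumed placeholder and should be replaced with a real dataset name.
    """
    # Imported lazily so this module can be loaded even where the
    # Polaris client (assumed to be installed as `polaris-lib`) is absent.
    import polaris as po

    # Delegates to the client, which resolves the identifier on the Hub.
    return po.load_dataset(dataset_id)
```

Calling `load_hub_dataset("recursion/rxrx3-core")`, for example, would request that dataset, assuming the identifier matches the owner and name shown on its Hub page.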

Evaluate your method on a benchmark
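As a rough sketch, evaluating a method could follow the pattern below: load the benchmark, take its predefined train/test split, produce predictions for the test set, and let the benchmark score them. It assumes the `polaris-lib` client and its `load_benchmark`, `get_train_test_split`, and `evaluate` calls behave as in the client's quickstart, which may vary between versions. The helper `evaluate_on_benchmark` and the `predict_fn` callback are hypothetical names introduced here for illustration.

```python
from typing import Any, Callable


def evaluate_on_benchmark(
    benchmark_id: str,
    predict_fn: Callable[[Any, Any], Any],
) -> Any:
    """Score a method on a Polaris benchmark (hypothetical helper).

    `predict_fn(train, test)` is an assumed callback: it should fit a
    model on `train` and return predictions for the inputs in `test`.
    """
    # Lazy import: requires the Polaris client (assumed: `polaris-lib`).
    import polaris as po

    benchmark = po.load_benchmark(benchmark_id)

    # The benchmark defines the split, so every method sees the same data.
    train, test = benchmark.get_train_test_split()

    predictions = predict_fn(train, test)

    # The benchmark holds the test labels and computes its own metrics.
    return benchmark.evaluate(predictions)
```

Keeping the split and the scoring inside the benchmark object is what makes results comparable across methods: the caller only supplies predictions.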


Guidelines for dataset curation and method evaluation & comparison.

Starting with small molecules

Through a unique, cross-industry collaboration involving representatives from Recursion Pharmaceuticals, AstraZeneca, Relay Therapeutics, Pfizer, Merck, Nimbus Therapeutics, Blueprint Medicines, Johnson & Johnson, Novartis, Bayer, and Valence Labs, we'll be releasing recommended benchmarks and datasets plus guidelines for dataset curation, method evaluation, and comparison.


Featured datasets on Polaris.

December 11, 2024

RxRx3-core

222,601

To accompany OpenPhenom-S/16, Recursion is releasing the RxRx3-core dataset, a challenge dataset in phenomics optimized for the research community. RxRx3-core includes labeled images of 735 genetic knockouts and 1,674 small-molecule perturbations drawn from the RxRx3 dataset, image embeddings computed with OpenPhenom-S/16, and associations between the included small molecules and genes. The dataset contains 6-channel Cell Painting images and associated embeddings from 222,601 wells, yet is smaller than 18 GB, which keeps it readily accessible to the research community.

recursion

October 30, 2024

BELKA-v1

~99.3M

This is the dataset provided for the BELKA competition, which Leash Biosciences hosted on Kaggle in the summer of 2024. It comprises roughly 100M small molecules from DNA-encoded chemical libraries (DELs), screened against 3 protein targets (BRD4, EPHX2/sEH, and ALB/HSA), with binary binding labels. Leash is also providing the raw data from these experiments: raw sequencing counts and counts-per-billion for 3 replicates of 3 rounds of selection per protein, additional replicates of two negative controls, and additional raw data from a smaller orthogonal DEL used as a private test set in the Kaggle competition. The raw dataset contains some 4.25B physical measurements.

leash-bio

July 10, 2024

adme-fang-v1

384 KB

3,521

Assessing ADME properties helps characterize how a drug candidate interacts with the body in terms of absorption, distribution, metabolism, and excretion, which is essential for evaluating its efficacy, safety, and clinical potential. Fang et al. (2023) presented DMPK datasets gathered over 20 months, covering six in vitro ADME endpoints: human and rat liver microsomal stability, MDR1-MDCK efflux ratio, solubility, and human and rat plasma protein binding. With 885 to 3,087 measurements per endpoint, the dataset showcases chemical diversity in key properties such as microsomal stability, plasma protein binding, permeability, and solubility.

biogen