This dataset has not yet been certified by approved reviewers. It may contain issues related to data completeness and quality.

Dataset

optic/plumber

Binding affinity prediction benchmark for sequence-based models

Created on: November 15, 2024
Number of datapoints: ~1.8M
Public
V2

Tags

sequence
binding affinity
sbdd
small molecules

Details

README

PLUMBER

Overview

PLUMBER is a benchmark for developing sequence-based models for binding event prediction, built on the PLINDER benchmark. It is compiled as a dataset of protein-ligand pairs from several sources (ChEMBL, BindingDB, and BioLip2). Each source is aggressively filtered, and the molecules are then standardized, PAINS-filtered, and deduplicated. The val/test sets are additionally binarized for binding event classification at a threshold of Ki/Kd < 1 μM, giving a unified benchmark for comparing models. PLINDER is used to split the proteins into training and testing sets. To enhance flexibility, the training set includes continuous values and their corresponding signs (=, >, <).

Note: PLUMBER stands for Protein–Ligand Unseen Matching Benchmark for Evaluating Robustness.

PLINDER

To develop generalizable sequence- or structure-based models, we aim to test our models on unseen proteins. Standard techniques, such as time splits and random splits, often produce test sets containing many proteins very similar to those seen in training, which limits the ability to measure generalizability. The recent PLINDER benchmark proposed a compound metric that accounts for different types of similarity at the system level and splits datasets based on this metric. We decided to use their protein split assignment. While it is not perfect (as we lack ligand split information), it should yield more challenging splits than standard techniques.
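As a rough sanity check on any protein-level split, one can list the proteins that appear in both sets. The sketch below is only illustrative: it compares exact uniprot_id values, which is far weaker than PLINDER's similarity-based metric, and the source does not state whether the released splits have zero identifier overlap. The function name and the toy rows are hypothetical.

```python
def shared_values(train_rows, test_rows, key="uniprot_id"):
    """Return the set of `key` values that occur in both splits."""
    train_values = {row[key] for row in train_rows}
    test_values = {row[key] for row in test_rows}
    return train_values & test_values


# Toy rows with made-up identifiers; real rows would come from the dataset.
train = [{"uniprot_id": "PROT_A"}, {"uniprot_id": "PROT_B"}]
test = [{"uniprot_id": "PROT_B"}, {"uniprot_id": "PROT_C"}]

print(shared_values(train, test))  # {'PROT_B'}
```

An empty result only rules out identical identifiers; near-identical sequences under different identifiers would still pass this check.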

EDA

Data description

The val and test sets contain the following columns:

  • SMILES: standardized SMILES representation of a molecule
  • sequence: amino acid sequence of a monomer target protein
  • uniprot_id: UniProt ID of that protein
  • source: either "chembl", "bdb", or "biolip"
  • split: always set to "test"
  • is_active: a binary label indicating if the molecule has a Ki/Kd < 1 μM

The train set does not have an is_active column. Instead, it contains columns useful for training: ki, kd, ic50, and ec50 store activity values, while ki_sign, kd_sign, ic50_sign, and ec50_sign specify the corresponding relation (=, <, or >). You can choose to use only Ki/Kd equality data, but alternative strategies can incorporate the other activity types as well.
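One possible way to use the signed values is to derive a binary training label only when the sign makes the outcome unambiguous at the 1 μM threshold. This is a minimal sketch, not the authors' procedure. It assumes activity values are stored in nM (the unit is not stated above), that missing values are represented as None, and that Ki is preferred over Kd; the function names are hypothetical.

```python
from typing import Optional

THRESHOLD_NM = 1000.0  # 1 μM expressed in nM; the nM unit is an assumption


def censored_label(value: float, sign: str,
                   threshold: float = THRESHOLD_NM) -> Optional[int]:
    """Return 1 (active), 0 (inactive), or None when the sign leaves it undecided."""
    if sign == "=":
        return int(value < threshold)
    if sign == "<":
        # True value is below `value`: certainly active only if value <= threshold.
        return 1 if value <= threshold else None
    if sign == ">":
        # True value is above `value`: certainly inactive only if value >= threshold.
        return 0 if value >= threshold else None
    raise ValueError(f"unexpected sign: {sign!r}")


def label_row(row: dict) -> Optional[int]:
    """Label a train row from Ki first, then Kd; None if neither decides."""
    for kind in ("ki", "kd"):
        value, sign = row.get(kind), row.get(f"{kind}_sign")
        if value is None or sign is None:
            continue
        label = censored_label(value, sign)
        if label is not None:
            return label
    return None
```

For example, a measurement reported as "> 10000" is safely inactive, while "> 10" says nothing about which side of the threshold the true value falls on, so the row is left unlabeled.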

Preprocessing

To ensure high data quality, we applied extensive preprocessing:

  • Selected only monomer data
  • Standardized SMILES with the ChEMBL structure pipeline
  • Cleaned SMILES
  • Prioritized data from BindingDB
  • Filtered out molecules flagged by the PAINS filter
  • Binarized and deduplicated values with inconsistency check for val/test sets
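
The last step can be sketched as follows. This is an illustration under stated assumptions rather than the actual pipeline: it assumes values are in nM, that duplicates are identified by the (SMILES, sequence) pair, and that pairs whose measurements disagree after binarization are dropped. The source does not say how inconsistencies are resolved, and the function name is hypothetical.

```python
from collections import defaultdict

THRESHOLD_NM = 1000.0  # 1 μM expressed in nM; the nM unit is an assumption


def binarize_and_deduplicate(records):
    """Collapse (smiles, sequence, value_nm) records into one label per pair.

    Pairs whose measurements fall on both sides of the threshold are
    treated as inconsistent and dropped (an assumed policy).
    """
    labels_by_pair = defaultdict(set)
    for smiles, sequence, value_nm in records:
        labels_by_pair[(smiles, sequence)].add(value_nm < THRESHOLD_NM)
    return {
        pair: next(iter(labels))
        for pair, labels in labels_by_pair.items()
        if len(labels) == 1
    }


# Toy records with made-up sequences and values.
records = [
    ("CCO", "MKT", 50.0),
    ("CCO", "MKT", 80.0),    # duplicate, consistent: kept once as active
    ("CCN", "MKT", 50.0),
    ("CCN", "MKT", 5000.0),  # duplicate, inconsistent: pair dropped
    ("CCC", "MAA", 2000.0),
]
result = binarize_and_deduplicate(records)
```

Here result holds two entries: ("CCO", "MKT") labeled active and ("CCC", "MAA") labeled inactive.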

Acknowledgements

Many thanks to the PLINDER authors for their groundbreaking work on advanced molecular data splitting.