ASAP Discovery x OpenADMET CompetitionTake part in the first prospective benchmark on Polaris.

Dataset

optic/plumber

Binding affinity prediction benchmark for sequence-based models

Created on: November 15, 2024Number of datapoints: ~1.8M
Public
V2

Status

Uncertified

This artifact has not been certified by approved reviewers. It may contain issues related to data quality.

Learn more here.

Tags

sequence
binding affinity
sbdd
small molecules

Modalities

No modalities found.

Related benchmarks

No related benchmarks yet.

You're looking at a v2.0 dataset!

Our goal at Polaris is to build a universal format for ML-ready datasets in drug discovery. With our V2 implementation, we're drastically improving scalability, but there's still work to be done!

Details

README

PLUMBER

Overview

PLUMBER preview PLUMBER is a benchmark for developing sequence-based models for binding event prediction, based on the PLINDER benchmark. PLUMBER is compiled as protein-ligand pairs dataset from various sources (ChEMBL, BindingDB, and BioLip2) and employes aggressive filtering from each of the datasets followed by molecules standardization, PAINS filtering and deduplication. The val/test sets are additionally binarized for binding event classification at a threshold of < 1 μM on Ki/Kd to have unified benchmark to compare models on. PLINDER is employed to split the proteins into training and testing sets. To enhance flexibility, the training set includes continuous values and their corresponding signs (=, >, <).

Note: PLUMBER states for Protein–Ligand Unseen Matching Benchmark for Evaluating Robustness

PLINDER

To develop generalizable sequence/structure-based models, we aim to test our model on unseen proteins. Standard techniques, such as time-split and random split, often result in test sets containing many very similar proteins, which limits the ability to measure generalizability. The recent benchmark, PLINDER, proposed a compound metric that accounts for different types of similarity on system level and splits datasets based on this metric. We decided to use their protein split assignment. While it is not perfect (as we lack ligand split information), it should yield more challenging splits compared to standard techniques.

EDA

Data description

val and test sets contain the following columns:

  • SMILES: standardized SMILES representation of a molecule
  • sequence: amino acid sequence of a monomer target protein
  • uniprot_id: UniProt ID of that protein
  • source: either "chembl", "bdb", or "biolip"
  • split: always set to "test"
  • is_active: a binary label indicating if the molecule has a Ki/Kd < 1 μM

train set does not have an is_active column. Instead, it contains different columns useful for training. ki, kd, ic50, and ec50 store activity values, while ki_sign, kd_sign, ic50_sign, and ec50_sign specify the corresponding relationship, such as =, <, or >. You can choose to use only Ki/Kd equality data, but alternative strategies can incorporate the other activity types as well.

Preprocessing

To ensure high data quality, we performed extensive preprocessing steps:

  • Selected only monomer data
  • Standardized SMILES with ChEMBL structure pipeline
  • Cleaned SMILES
  • Prioritized data from BindingDB
  • Filtered out molecules with PAINS filter
  • Binarized and deduplicated values with inconsistency check for val/test sets

Acknowledgements

Many thanks to the PLINDER authors for the groundbreaking work on the advanced molecular data splitting.