PLUMBER

Overview

PLUMBER is a benchmark for developing sequence-based models for binding event prediction, based on the PLINDER benchmark. PLUMBER is compiled as a dataset of protein-ligand pairs from several sources (ChEMBL, BindingDB, and BioLip2). It applies aggressive filtering to each source, followed by molecule standardization, PAINS filtering, and deduplication. The val/test sets are additionally binarized for binding event classification at a threshold of Ki/Kd < 1 μM, giving a unified benchmark on which to compare models. PLINDER is used to split the proteins into training and testing sets. For flexibility, the training set keeps the continuous values and their corresponding signs (=, >, <).

Note: PLUMBER stands for Protein–Ligand Unseen Matching Benchmark for Evaluating Robustness.

PLINDER

To develop generalizable sequence/structure-based models, we aim to test our models on unseen proteins. Standard techniques, such as time-based and random splits, often produce test sets containing many proteins that are very similar to training proteins, which limits the ability to measure generalizability. The recent PLINDER benchmark proposed a compound metric that accounts for different types of similarity at the system level and splits datasets based on this metric. We decided to use its protein split assignment. While this is not perfect (we lack ligand split information), it should yield more challenging splits than standard techniques.
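As a basic sanity check on any protein-level split, one can verify that no protein identifier appears in both the training and test sets. The sketch below is an illustration only: it assumes the train set also carries a uniprot_id column (the card only lists it for val/test), and the function name and toy rows are hypothetical. Exact identifier overlap is a much weaker criterion than PLINDER's similarity metric, so an empty result does not by itself show that the proteins are dissimilar.

```python
def shared_proteins(train_rows, test_rows, key="uniprot_id"):
    """Return the sorted protein identifiers that occur in both splits."""
    train_ids = {row[key] for row in train_rows}
    test_ids = {row[key] for row in test_rows}
    return sorted(train_ids & test_ids)


# Toy rows for illustration only; these are not real PLUMBER records.
train_rows = [{"uniprot_id": "P00001"}, {"uniprot_id": "P00002"}]
test_rows = [{"uniprot_id": "P00003"}, {"uniprot_id": "P00002"}]

overlap = shared_proteins(train_rows, test_rows)
print(overlap)  # identifiers leaked across the toy splits
```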

EDA

Data description

The val and test sets contain the following columns:

SMILES: standardized SMILES representation of a molecule
sequence: amino acid sequence of a monomer target protein
uniprot_id: UniProt ID of that protein
source: either "chembl", "bdb", or "biolip"
split: always set to "test"
is_active: a binary label indicating if the molecule has a Ki/Kd < 1 μM

The train set does not have an is_active column. Instead, it contains columns useful for training: ki, kd, ic50, and ec50 store activity values, while ki_sign, kd_sign, ic50_sign, and ec50_sign specify the corresponding relation (=, <, or >). You can choose to use only Ki/Kd equality data, but alternative strategies can incorporate the other activity types as well.
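One possible way to derive training labels that match the val/test definition is sketched below. It uses equality measurements directly and keeps a censored measurement ("<" or ">") only when it is decisive with respect to the threshold. This is an illustrative strategy, not the authors' procedure. The units of the activity columns are not stated here, so the threshold of 1000.0 assumes values in nM; the function names are hypothetical.

```python
def label_from_measurement(value, sign, threshold):
    """Map a possibly censored value to 1 (active), 0 (inactive), or None.

    `value` and `threshold` must be in the same units.
    """
    # Treat None, NaN (value != value), and unknown signs as missing.
    if value is None or value != value or sign not in ("=", "<", ">"):
        return None
    if sign == "=":
        return 1 if value < threshold else 0
    if sign == "<":
        # True value lies below `value`: decisive only if value <= threshold.
        return 1 if value <= threshold else None
    # sign == ">": true value lies above `value`: decisive only if value >= threshold.
    return 0 if value >= threshold else None


def label_row(row, threshold):
    """Try Ki first, then Kd; return None if neither gives a decisive label."""
    for col in ("ki", "kd"):
        label = label_from_measurement(row.get(col), row.get(col + "_sign"), threshold)
        if label is not None:
            return label
    return None


THRESHOLD = 1000.0  # 1 uM expressed in nM (assumption about the column units)

# Toy row for illustration only.
row = {"ki": None, "ki_sign": None, "kd": 20.0, "kd_sign": "="}
print(label_row(row, THRESHOLD))
```

Rows that return None can be dropped for classification, or kept for a regression or censored-loss objective that uses the sign columns directly.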

Preprocessing

To ensure high data quality, we performed extensive preprocessing steps:

Selected only monomer data
Standardized SMILES with the ChEMBL structure pipeline
Cleaned SMILES
Prioritized data from BindingDB
Filtered out molecules flagged by the PAINS filter
Binarized and deduplicated values with inconsistency check for val/test sets
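
The last step above can be sketched as follows: repeated measurements of the same protein-ligand pair are binarized at the threshold, collapsed to a single label, and the pair is discarded when the measurements disagree. This is a minimal sketch under stated assumptions, not the authors' implementation. Whether conflicting pairs were dropped or resolved in some other way is not specified, values are assumed to be in nM, and the function name is hypothetical.

```python
from collections import defaultdict


def binarize_and_deduplicate(records, threshold):
    """Collapse repeated (SMILES, sequence) pairs into one binary label each.

    records: iterable of (smiles, sequence, value), with `value` in the same
    units as `threshold`. Pairs whose measurements give conflicting labels
    are returned separately instead of being kept.
    """
    labels = defaultdict(set)
    for smiles, sequence, value in records:
        labels[(smiles, sequence)].add(int(value < threshold))

    kept = {pair: next(iter(seen)) for pair, seen in labels.items() if len(seen) == 1}
    dropped = [pair for pair, seen in labels.items() if len(seen) > 1]
    return kept, dropped


# Toy measurements for illustration only (values assumed to be in nM).
records = [
    ("CCO", "MKT", 50.0),
    ("CCO", "MKT", 80.0),      # duplicate, consistent: both active
    ("CCN", "MKT", 200.0),
    ("CCN", "MKT", 5000.0),    # duplicate, inconsistent: pair is dropped
    ("CCC", "MKV", 20000.0),
]

kept, dropped = binarize_and_deduplicate(records, threshold=1000.0)
print(kept)
print(dropped)
```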

Acknowledgements

Many thanks to the PLINDER authors for their groundbreaking work on advanced molecular data splitting.
