
This dataset has not yet been certified by approved reviewers. It may contain issues related to data completeness and quality.

Dataset

graphium/pcba-1328-1564k-v1

A subset of PubChem BioAssay, containing 1328 bioassays measured over 1564k compounds used by previous work to benchmark machine learning methods.

Created on: July 19, 2024
Dataset size: 188 MB
Number of datapoints: ~1.6M
Public

Tags

LargeMix
BioAssay

Modalities

MOLECULE

Details

README

Background

This dataset is very similar to the OGBG-PCBA dataset, but instead of being limited to 128 assays and 437k molecules, it comprises 1,328 assays and 1.56M molecules. It is well suited to pre-training molecular models because it captures a molecule's behavior across many settings relevant to biochemists, and there is evidence that this improves binding predictions. By analogy with a gene expression profile, we obtain a "bioassay expression" profile for each molecule.

To gather the data, we looped over the PubChem index of bioassays [6] in 2022 and collected every bioassay containing more than 6,000 molecules annotated with either the "Active" or the "Inactive" flag reported by the author, with at least 10 molecules of each. All other flags were removed. We then converted all molecular IDs to canonical SMILES and used these to merge the bioassays into a single dataset.
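The filtering and merging steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the pipeline that produced the dataset: the names `filter_assay` and `merge_assays`, the in-memory record layout, and the rule that a later record overwrites an earlier one for the same molecule and assay are all hypothetical. Canonicalization is left as a pluggable function; one possible choice would be RDKit's `Chem.MolToSmiles(Chem.MolFromSmiles(smiles))`, but the sketch defaults to the identity so that it runs without extra dependencies.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical record layout: (molecule identifier as a SMILES string, activity flag).
Record = Tuple[str, str]

# Only these two author-reported flags are kept; every other flag is dropped.
KEPT_FLAGS = {"Active": 1, "Inactive": 0}


def filter_assay(
    records: List[Record],
    min_molecules: int = 6000,
    min_per_class: int = 10,
) -> Optional[List[Record]]:
    """Keep Active/Inactive records; return None if the assay fails the thresholds.

    Assumes each record corresponds to one molecule.
    """
    kept = [(smi, flag) for smi, flag in records if flag in KEPT_FLAGS]
    counts = Counter(flag for _, flag in kept)
    # "More than 6,000 molecules" annotated Active or Inactive.
    if len(kept) <= min_molecules:
        return None
    # "At least 10 of each" flag.
    if any(counts[flag] < min_per_class for flag in KEPT_FLAGS):
        return None
    return kept


def merge_assays(
    assays: Dict[str, List[Record]],
    canonicalize: Callable[[str], str] = lambda smiles: smiles,
    min_molecules: int = 6000,
    min_per_class: int = 10,
) -> Dict[str, Dict[str, int]]:
    """Merge qualifying assays into {canonical SMILES: {assay id: 0 or 1}}.

    A molecule has no entry for assays in which it was not measured, so the
    merged table is sparse. On duplicates, the last record wins (an assumption).
    """
    table: Dict[str, Dict[str, int]] = {}
    for assay_id, records in assays.items():
        kept = filter_assay(records, min_molecules, min_per_class)
        if kept is None:
            continue
        for smi, flag in kept:
            table.setdefault(canonicalize(smi), {})[assay_id] = KEPT_FLAGS[flag]
    return table


# Toy data with tiny thresholds so the example is easy to follow.
toy = {
    "assay_a": [("CCO", "Active"), ("OCC", "Inactive"),
                ("CCN", "Inconclusive"), ("CCC", "Inactive")],
    "assay_b": [("CCO", "Inactive"), ("CCC", "Active")],
    "assay_c": [("CCO", "Active")],  # too small and has no Inactive record
}
merged = merge_assays(toy, min_molecules=1, min_per_class=1)
for smiles, labels in merged.items():
    print(smiles, labels)
```

With the identity canonicalizer, `CCO` and `OCC` stay separate rows even though they denote the same molecule; a real canonicalizer would map both to one key, which is why canonical SMILES are used as the merge key.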

Data resource

Reference: PubChem index of bioassays

User Attributes

These are custom, user-defined attributes that are not required by the Polaris data model.

Attribute    Value
year         2024