Quality Datasets

Datasets 101

Below, we outline basic checks that we encourage all members of the community to follow when curating or working with drug discovery datasets. This is the first of many resources that we aim to release to help the community develop methods that matter.

The dataset is representative of applications in real-world drug discovery

Creators of the dataset must be able to explain the data generation process and describe the specific applications of this dataset in drug discovery.

Take, for example, the FreeSolv dataset in MoleculeNet, mentioned in Pat Walters' blog. Although the dataset was designed to evaluate molecular dynamics methods, it has turned into a generic property prediction task for the free energy of solvation. However, this quantity used in isolation is not particularly useful in drug discovery.

The dataset stems from a consistent, original source

Creators of the dataset must share references to the original source of the data. If data is aggregated from multiple sources or preprocessed in some way, this process needs to be transparent and the rationale should be well documented. Blindly combining datasets can introduce significant noise.

Datasets that violate this rule include tdcommons/solubility-aqsoldb and tdcommons/bbb-martins. In both cases, data has been collected from multiple sources, yet there are no references to the primary literature.
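One way to keep aggregation transparent is to carry a reference alongside every value, so that disagreements between sources can be inspected rather than silently averaged away. The following is a minimal sketch of that idea, not a description of how any existing dataset was built. The function names, the source labels, the numeric values, and the tolerance are all hypothetical and chosen purely for illustration.

```python
from collections import defaultdict


def merge_with_provenance(sources):
    """Combine records from several sources, keeping a citation for every value.

    `sources` maps a citation string to a list of (molecule_id, value) pairs.
    Returns a dict: molecule_id -> list of (value, citation) tuples.
    """
    merged = defaultdict(list)
    for citation, records in sources.items():
        for molecule_id, value in records:
            merged[molecule_id].append((value, citation))
    return dict(merged)


def find_disagreements(merged, tolerance):
    """Return the molecules whose reported values span more than `tolerance`."""
    flagged = {}
    for molecule_id, entries in merged.items():
        values = [value for value, _ in entries]
        if max(values) - min(values) > tolerance:
            flagged[molecule_id] = entries
    return flagged


# Made-up example data: two hypothetical sources reporting the same property.
sources = {
    "Hypothetical Source A (2019)": [("CCO", -0.10), ("c1ccccc1", -1.60)],
    "Hypothetical Source B (2021)": [("CCO", -0.15), ("c1ccccc1", -2.90)],
}

merged = merge_with_provenance(sources)
flagged = find_disagreements(merged, tolerance=0.5)

for molecule_id, entries in flagged.items():
    print(molecule_id, entries)
```

Because every value keeps its citation, a reviewer can trace a flagged molecule back to the primary literature and decide how to resolve the conflict.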

The dataset does not contain obvious errors or ambiguous data

Creators of the dataset should check for obvious duplicates, invalid data, or ambiguous data. They should also visualize the data distributions to highlight potential outliers.

For example, tdcommons/bbb-martins violates this rule as it contains many duplicate structures.
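The checks described above can be sketched in a few lines of standard-library Python. This is an illustrative sketch under simplifying assumptions, not a complete curation pipeline: the function names and the example rows are hypothetical, structures are compared as plain strings (a real workflow would first standardize structures with a cheminformatics toolkit so that the same molecule written in two ways is still recognized as a duplicate), and outliers are flagged with the common 1.5 × IQR rule, which is only one of several reasonable choices.

```python
import math
import statistics
from collections import defaultdict


def is_valid(structure, value):
    """A row is valid if it has a non-empty structure string and a finite-looking number."""
    if not isinstance(structure, str) or not structure.strip():
        return False
    if not isinstance(value, (int, float)) or math.isnan(value):
        return False
    return True


def check_dataset(rows):
    """Run basic sanity checks on a list of (structure, value) pairs."""
    invalid_rows = []
    valid = []
    for index, (structure, value) in enumerate(rows):
        if is_valid(structure, value):
            valid.append((structure.strip(), value))
        else:
            invalid_rows.append(index)

    # Group values by structure to find repeated entries.
    by_structure = defaultdict(list)
    for structure, value in valid:
        by_structure[structure].append(value)
    duplicates = {s: v for s, v in by_structure.items() if len(v) > 1}
    # Duplicates whose values disagree are ambiguous, not just redundant.
    ambiguous = {s: v for s, v in duplicates.items() if len(set(v)) > 1}

    # Flag values outside 1.5 * IQR of the quartiles (needs a few data points).
    outliers = []
    if len(valid) >= 4:
        q1, _, q3 = statistics.quantiles([v for _, v in valid], n=4)
        spread = 1.5 * (q3 - q1)
        outliers = [(s, v) for s, v in valid if v < q1 - spread or v > q3 + spread]

    return {
        "invalid_rows": invalid_rows,
        "duplicates": duplicates,
        "ambiguous": ambiguous,
        "outliers": outliers,
    }


# Made-up example rows for illustration only.
rows = [
    ("CCO", 1.0), ("CCO", 1.0),       # exact duplicate
    ("CCN", 0.5), ("CCN", 2.5),       # same structure, conflicting values
    ("", 1.2), ("CCC", float("nan")),  # missing structure, missing value
    ("CCCl", 1.1), ("CCBr", 0.9), ("CCF", 1.3),
    ("CCI", 25.0),                     # suspiciously large value
]

report = check_dataset(rows)
for name, findings in report.items():
    print(name, findings)
```

Flagged rows are candidates for manual review rather than automatic removal: an apparent outlier may be a genuine measurement, and conflicting duplicates may reflect different assay conditions.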

Introducing Certified Datasets

Certified datasets are checked against the criteria listed above for data completeness and quality. Certified datasets are more visible on Polaris. The review process happens transparently on GitHub.

Build with Confidence

Explore our first certified datasets. You can find notebooks outlining curation details in Polaris Recipes.

Guidelines

Data curation can be very nuanced depending on the specific modality you’re working with. That’s why we’re building steering committees composed of industry experts, starting with small molecules, to provide detailed guidelines on dataset curation, method evaluation, and method comparison.

Small Molecules

Extensive guidelines on dataset curation, method evaluation, and method comparison.

Explore Today

In the Backlog

Interested in helping foster the development of more impactful ML methods in your domain? Contact us if you'd like to join a steering committee for one of the other modalities in drug discovery.

Phenomics

Genomics

Transcriptomics

Proteomics