Guidelines for Method ComparisonRead the first pre-print from the Small Molecule Steering Committee

This dataset has been certified! Learn why this matters here.

Dataset

leash-bio/BELKA-v1

This is the dataset provided for the BELKA competition, which Leash Biosciences hosted on Kaggle in the summer of 2024. It is roughly 100M small molecules from DNA-encoded chemical libraries (DELs) screened against 3 protein targets (BRD4, EPHX2/sEH, and ALB/HSA) and includes binary binding labels. Leash is also providing the raw data from these experiments: raw sequencing counts and counts-per-billion of 3 replicates of 3 rounds of selection per protein, plus additional replicates of two negative controls, plus additional raw data from a smaller orthogonal DEL used as a private test set in the Kaggle competition. The raw dataset is some 4.25B physical measurements.

Created on: October 30, 2024Number of datapoints: ~99.3M
Public
V2

Status

Certified

This artifact has been reviewed in line with our Dataset 101 guidelines and was found to meet all criteria.

Learn more here.

Tags

Screen
DEL
DTI

Modalities

MOLECULE

Related benchmarks

No related benchmarks yet.

You're looking at a v2.0 dataset!

Our goal at Polaris is to build a universal format for ML-ready datasets in drug discovery. With our V2 implementation, we're drastically improving scalability, but there's still work to be done!

Details

README

banner

The BELKA dataset

This is the dataset provided for the BELKA competition, which Leash Biosciences hosted on Kaggle in the summer of 2024. It is roughly 100M small molecules (down from 133M after splitting) from DNA-encoded chemical libraries screened against 3 protein targets (BRD4, EPHX2/sEH, and ALB/HSA). More detailed descriptions of the targets, including links to target crystal structures for structure-based predictions, can be found here.

DELs

The vast majority of chemical matter in this dataset is a triazine-based 3-cycle library designed to superficially resemble DEL-A in (1) and this library is called AMA014. AMA014 was designed and manufactured by AlphaMa.

The orthogonal DEL in the library split is a non-triazine 3-cycle library designed by Leash to resemble kinase inhibitor chemistry and is called kinase0, or kin0 for short.

ML-Ready: To benchmark against the BELKA competition

For the contest, we provided binary bind-or-no-bind labels for each small molecule pair. We also provided data splits: 1) a random molecule split, 2) a split where a central core (a triazine) is preserved but there are no shared building blocks between train and test, and 3) a library split. The library split is a series of screens with the same targets against an entirely different DEL with different building blocks, different cores, and different attachment chemistries. The library split was included in the test set only, to test true out-of-distribution predictions.

The dataset that you're looking at now is identical to the dataset in the Kaggle competition in order to enable those who want to benchmark methods against it.

Raw Readouts: To try new methods of hit-calling and more

Maybe you want to know how replicates help, or if multiple rounds of selection is a good idea with AMA014, or what background conditions might be best. This dataset is for you!

To allow exploration of additional hit-calling methods and the contributions of various aspects of experimental design to enriching those hits, we have has also provided the raw/normalized sequencing counts and experimental details of each replicate. These can be found in a public cloud bucket at https://storage.polarishub.io/belka-v1-raw/.

Please note that the full raw dataset uses ~600GB in its compressed form!


Details about the raw readouts

schematic

Rounds of selection:

A common DEL screening practice is to combine the library with the protein target, isolate the DEL/target complexes, elute the bound DEL (here, we did it with heat), and repeat the selection with fresh target protein (Figure 1). The reasoning is that extra rounds select for only the tightest binders, making the most efficient use of sequencing capacity (1). To test this method, we performed 3 rounds of selection on the triazine library per target protein and sequenced a small portion of each round. Kinase0 is a far smaller library, and only a single round of selection was performed on it.

Number of replicates

Each series of selection rounds was performed in triplicate for AMA014. Kinase0 screens were performed in duplicate with a single round of selection.

Columns in the individual experiment dataset

  • molecule_id: unique identifier per molecule
  • smiles: SMILES representation of the full molecule
  • bb1_smiles, bb2_smiles, bb3_smiles: SMILES representation of the individual building blocks that, when combined using SMARTS reactions, produce the full molecule (also with a triazine core, in the case of AMA014)
  • bcj_id: binding count job, an id created internally at Leash to track computational counting jobs
  • raw_counts: read counts obtained from the sequencing data
  • counts_per_billion: read counts normalized by the size of the entire sequencing run and scaled to one billion reads

Conventions

The raw data can be found in folders following the sample label naming convention provided below, one folder per selection round/replicate.

Conventions for Target Screens:

Sample labels for AMA014:

[target]-[replicate]-[selection round]

For example:

  • HSA-A-R1/ is HSA, replicate 1, first selection round (see Rounds of selection below)
  • SEH-C-R2/ is sEH, replicate 3, second selection round
Sample labels for kinase0:

[target]-[kin0]-[replicate]

For example:

  • SEH-Kin0-S1/ is sEH, kinase0, replicate 1, first (and only) selection round

Conventions for negative controls:

Negative control conditions.

One approach to identifying hits is by comparing the relative enrichment of sequenced barcodes over a background distribution. The distribution of the DEL by itself is such a background; an alternate background would be to mix the DEL and the magnetic beads used in selection to identify a bead-only distribution. We are providing both.

Sample labels for DEL-only negative controls:

We sequenced the DELs alone in duplicate to understand the background distribution of the libraries. DEL-A, DEL-B are replicates of AMA014 alone. DEL-Kin0-S1, DEL-Kin0-S2 are replicates of kinase0 alone.

Sample labels for bead-only negative controls:

To compare bead-only signal against the AMA014 selections, we used His Dynabeads from Thermo in triplicate. The Dynabeads were used in the protein-target selections of AMA014. His-A, His-B, and His-C are replicates of Dynabead-only selections, one selection round only.

To compare bead-only signal against the kinase0 selections, we used EDTA-compatible Magbeads from Thermo in duplicate. The Magbeads were used in the protein-target selections of kinase0. Some building blocks in kinase0 were strong Dynabead binders, which is why the Magbeads were used in these screens. HisEDTA-Kin0-S1 and HisEDTA-Kin0-S2 are replicates of Magbead-only selections, one selection round only.

User Attributes

These are custom, user-defined attributes that are not required by the Polaris data model.

AttributeValue
LibrariesAMA014, Kinase0
Protein TargetsBRD4, EPHX2/sEH, and ALB/HSA
Kaggle Competitionhttps://www.kaggle.com/competitions/leash-BELKA/