Compound Gene Activity
This dataset started from 2,921 associations curated from public databases such as ChEMBL, BindingDB and US Patents. For associations with multiple IC50 or EC50 values, we took the median.
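As an illustration of the median step, here is a minimal sketch. It assumes the raw measurements are available as `(compound, gene, activity_nm)` tuples; that layout and the function name `aggregate_median` are hypothetical and not part of the dataset's tooling.

```python
from collections import defaultdict
from statistics import median


def aggregate_median(measurements):
    """Collapse repeated measurements of one compound-gene pair to their median.

    `measurements` is an iterable of (compound, gene, activity_nm) tuples.
    This input layout is an assumption made for the example.
    """
    grouped = defaultdict(list)
    for compound, gene, activity_nm in measurements:
        grouped[(compound, gene)].append(activity_nm)
    # statistics.median averages the two middle values for an even count;
    # the dataset description does not say how such ties were handled.
    return {pair: median(values) for pair, values in grouped.items()}
```

For example, three measurements of 100, 300 and 900 nM for the same pair would collapse to a single value of 300 nM.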
We turned this into a binary classification task as follows:
- A positive pair is any relationship that has an activity value < 1,000 nM in the original dataset.
- A negative pair is defined in one of two ways:
  - It is either an annotated relationship that has an activity value > 10,000 nM in the original dataset.
  - Or it pairs a compound with a randomly sampled gene for which the compound has no annotated association in the original dataset with an activity value < 10,000 nM. Since the genes in this dataset are well characterized, we consider it unlikely that many of these compounds act against other genes in the set.
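The labeling rule above can be sketched as a small function. This is an illustrative sketch, not the code used to build the dataset: the name `label_pair`, the constants, and the use of `None` to represent a sampled pair without an annotated activity are all assumptions. The text does not say how values of exactly 1,000 nM or 10,000 nM are treated; the sketch leaves them unlabeled.

```python
from typing import Optional

POSITIVE_BELOW_NM = 1_000.0   # activity below this value: positive
NEGATIVE_ABOVE_NM = 10_000.0  # activity above this value: negative


def label_pair(activity_nm: Optional[float]) -> Optional[int]:
    """Return 1 (positive), 0 (negative), or None (no label).

    `activity_nm` is the median activity of a compound-gene pair in nM.
    None stands for a randomly sampled pair with no annotated activity
    (an assumption of this sketch).
    """
    if activity_nm is None:
        # Sampled compound-gene pair without an annotation: negative.
        return 0
    if activity_nm < POSITIVE_BELOW_NM:
        return 1
    if activity_nm > NEGATIVE_ABOVE_NM:
        return 0
    # Between the two thresholds (boundaries included here by assumption).
    return None
```

With these thresholds, a pair measured at 500 nM is labeled positive, one at 50,000 nM is labeled negative, and one at 5,000 nM receives no label.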
In total, this leads to 17,140 classified relationships (out of 18,681 relationships total). The remaining 18,681 - 17,140 = 1,541 associations were "indecisive", meaning they were in the original 2,921 but had an activity > 1,000 nM and < 10,000 nM. We've left these in the dataset for the sake of completeness and reproducibility.