Overview
ProteinGym is an extensive set of Deep Mutational Scanning (DMS) assays and annotated human clinical variants, curated to enable thorough comparison of mutation effect predictors across different regimes. Both the DMS assays and the clinical variants are divided into 1) a substitution benchmark, which currently consists of the experimental characterisation of ~2.7M missense variants across 217 DMS assays and 2,525 clinical proteins, and 2) an indel benchmark that includes ~300k mutants across 74 DMS assays and 1,555 clinical proteins.
This dataset contains all four components of the benchmark in one file. To separate the dataset into its individual benchmarks, split it on the "benchmark" column. Each benchmark also contains multiple experiments/wild-type proteins, whose variants we score and compute statistics for independently. These are identified by the "experiment_id" column. For example, to score just one assay of the DMS substitutions benchmark, such as the A4_HUMAN assay by Seuma et al. (2022), take the rows where the "benchmark" column is "DMS_substitutions" and the "experiment_id" column is "A4_HUMAN_Seuma_2022".
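The selection described above can be sketched with pandas. This is a minimal illustration rather than the benchmark's own loading code: it builds a tiny stand-in table instead of reading the real file, and the "mutant" column, the "OTHER_ASSAY" identifier, and the "other_benchmark" value are hypothetical placeholders. Only the "benchmark" and "experiment_id" column names and the "DMS_substitutions" / "A4_HUMAN_Seuma_2022" values come from the description above.

```python
import pandas as pd


def split_benchmarks(df: pd.DataFrame) -> dict:
    """Return one DataFrame per distinct value of the "benchmark" column."""
    return {
        name: group.reset_index(drop=True)
        for name, group in df.groupby("benchmark")
    }


def select_assay(df: pd.DataFrame, benchmark: str, experiment_id: str) -> pd.DataFrame:
    """Keep only the rows of one experiment within one benchmark."""
    mask = (df["benchmark"] == benchmark) & (df["experiment_id"] == experiment_id)
    return df.loc[mask].reset_index(drop=True)


# Stand-in data: "mutant", "OTHER_ASSAY" and "other_benchmark" are
# hypothetical placeholders, not values taken from the real dataset.
toy = pd.DataFrame(
    {
        "benchmark": ["DMS_substitutions", "DMS_substitutions", "other_benchmark"],
        "experiment_id": ["A4_HUMAN_Seuma_2022", "OTHER_ASSAY", "A4_HUMAN_Seuma_2022"],
        "mutant": ["A1C", "D2E", "F3G"],
    }
)

by_benchmark = split_benchmarks(toy)
a4 = select_assay(toy, "DMS_substitutions", "A4_HUMAN_Seuma_2022")
print(a4)
```

Filtering on both columns matters because an experiment identifier is only meaningful within its benchmark; the sketch does not assume identifiers are unique across benchmarks.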
Note that the "random_cv_split_index" and "continuous_label" columns are populated only for deep mutational scans. They are empty for clinical variants, because those variants are labeled as binary benign/pathogenic and we did not train supervised models on them in the original benchmark.
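One way to handle those empty values is to drop rows without a continuous label before any supervised use. The sketch below rests on two assumptions that the description does not confirm: that empty cells load as missing values (NaN), and that "random_cv_split_index" holds a fold number identifying the held-out fold. The toy rows and the "assay_a" / "clinical_protein_x" identifiers are hypothetical.

```python
import numpy as np
import pandas as pd


def supervised_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows that have both a continuous label and a CV split index."""
    usable = df["continuous_label"].notna() & df["random_cv_split_index"].notna()
    return df.loc[usable].reset_index(drop=True)


def cv_folds(df: pd.DataFrame):
    """Yield (fold, train, test), holding out one split index at a time.

    Assumes each distinct split index marks one held-out fold.
    """
    for fold in sorted(df["random_cv_split_index"].unique()):
        is_test = df["random_cv_split_index"] == fold
        yield int(fold), df.loc[~is_test], df.loc[is_test]


# Hypothetical rows: four DMS-style variants and two clinical-style
# variants whose label and split index are missing.
toy = pd.DataFrame(
    {
        "experiment_id": ["assay_a"] * 4 + ["clinical_protein_x"] * 2,
        "continuous_label": [0.1, -1.2, 0.7, 0.3, np.nan, np.nan],
        "random_cv_split_index": [0, 1, 0, 1, np.nan, np.nan],
    }
)

dms = supervised_rows(toy)
folds = [(fold, len(train), len(test)) for fold, train, test in cv_folds(dms)]
print(folds)
```

In practice the folds would be built per experiment, after selecting a single "experiment_id", since each experiment is scored independently.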
Additional information about all the assays and clinical variants in the benchmark, along with details of how we compute statistics for them, is available in the publication here or at the GitHub repository here.