Dataset

roman-bushuiev/MassSpecGym

MassSpecGym: A benchmark for the discovery and identification of molecules

Created on: November 26, 2024
Number of datapoints: 231,104
Public
V2

Status

Certified

This artifact has been reviewed in line with our Dataset 101 guidelines and was found to meet all criteria.

Tags

Small molecule discovery
Generative modeling
Metabolomics
Mass spectrometry

Modalities

MOLECULE

Details

README

The discovery and identification of molecules in biological and environmental samples is crucial for advancing biomedical and chemical sciences. Tandem mass spectrometry (MS/MS) is the leading technique for high-throughput elucidation of molecular structures. However, decoding a molecular structure from its mass spectrum is exceptionally challenging, even when performed by human experts. As a result, the vast majority of acquired MS/MS spectra remain uninterpreted, thereby limiting our understanding of the underlying (bio)chemical processes. Despite decades of progress in machine learning applications for predicting molecular structures from MS/MS spectra, the development of new methods is severely hindered by the lack of standard datasets and evaluation protocols.

To address this problem, we propose MassSpecGym, the first comprehensive benchmark for the discovery and identification of molecules from MS/MS data. Our benchmark comprises the largest publicly available collection of high-quality labeled MS/MS spectra and defines three MS/MS annotation challenges: de novo molecular structure generation, molecule retrieval, and spectrum simulation. It includes new evaluation metrics and a generalization-demanding data split, thereby standardizing the MS/MS annotation tasks and making the problem accessible to the broad machine learning community.

🧪 MassSpecGym dataset

MassSpecGym comprises the largest publicly available collection of 231 thousand high-quality MS/MS spectra labeled with molecular structures. The dataset includes spectra exhaustively collected from well-established public repositories (i.e., MoNA, MassBank, and GNPS), as well as new spectra generated from our in-house mass spectrometry measurements.

Dataset | Spectra | High-quality spectra | Molecules | Split
--- | --- | --- | --- | ---
GNPS [Wang et al., 2016] | 322K | 104K | 16K | ✗
MoNA [Fiehn lab] | 98K | 62K | 10K | ✗
MassBank [Horai et al., 2010] | 62K | 58K | 4K | ✗
MIST CANOPUS [Goldman et al., 2023] | 11K | ≤11K | ≤9K | ✓
MassSpecGym (ours) | 231K | 231K | 29K | ✓

The curation of the dataset involves a pipeline of cleaning and standardization steps, ensuring high-quality data while preserving a broad coverage of molecular structures and mass spectrometry settings. The table below introduces all the variables present in MassSpecGym. The main ones are mzs and intensities, representing MS/MS spectra, and smiles, representing the corresponding molecular structures.

Variable | Description | Data type | Num. unique values | Example
--- | --- | --- | --- | ---
identifier | Unique entry identifier | string | 231,104 | MassSpecGymID0088683
mzs | Array of spectrum m/z values | n × float | 231,104 | [55.0542, 57.0699, ..., 238.0995]
intensities | Array of spectrum intensities | n × float | 231,104 | [0.0240, 1.0, ..., 0.5356]
smiles | SMILES string of molecule | string | 31,602 | CCCCOCN(C1=C(C=C...CCl
inchikey | 2D InChI key | string | 28,929 | HKPHPIREJKHECO
formula | Chemical formula of molecule | string | 17,634 | C17H26ClNO2
precursor_formula | Chemical formula of precursor ion | string | 21,653 | C17H27ClNO2
parent_mass | Mass of molecule | float | 32,228 | 311.1652
precursor_mz | M/z of precursor ion | float | 32,275 | 312.1725
adduct | Ionization adduct | string | 2 | [M+H]+
instrument_type | Type of MS instrument | string | 2 | Orbitrap
collision_energy | Energy of CID fragmentation | float | 9,737 | 30.0
fold | Split fold which entry belongs to | string | 3 | train
simulation_challenge | Entry is used for simulation challenge | boolean | 2 | True
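
To illustrate how these variables relate to one another, here is a minimal Python sketch. It stores one entry in a hypothetical `Entry` dataclass (an illustration only, not part of MassSpecGym or Polaris), checks that the two spectrum arrays have the same length, and verifies that for the [M+H]+ adduct the precursor m/z equals the mass of the molecule plus the mass of one proton. The sketch assumes that the example cells in the table all come from the same entry, it uses only the three peaks visible in the elided example arrays, and the proton mass (1.007276 Da) is an outside constant rather than something stated on this page.

```python
from dataclasses import dataclass
from typing import List

# Mass of a proton in Da. This is a physical constant supplied for the
# sketch; it is not given anywhere in the dataset description.
PROTON_MASS = 1.007276


@dataclass
class Entry:
    """Hypothetical container for a subset of the MassSpecGym variables."""

    identifier: str
    mzs: List[float]
    intensities: List[float]
    formula: str
    precursor_formula: str
    parent_mass: float
    precursor_mz: float
    adduct: str
    fold: str

    def __post_init__(self) -> None:
        # Every peak is an (m/z, intensity) pair, so both arrays have length n.
        if len(self.mzs) != len(self.intensities):
            raise ValueError("mzs and intensities must have the same length")


def expected_precursor_mz(parent_mass: float, adduct: str) -> float:
    """Expected precursor m/z for a singly protonated molecule ([M+H]+)."""
    if adduct != "[M+H]+":
        # The table shows only [M+H]+ as an example, so nothing else is covered.
        raise NotImplementedError(f"adduct {adduct!r} is not covered by this sketch")
    return parent_mass + PROTON_MASS


# Values copied from the "Example" column of the table above.
entry = Entry(
    identifier="MassSpecGymID0088683",
    mzs=[55.0542, 57.0699, 238.0995],      # only the peaks shown; the rest is elided
    intensities=[0.0240, 1.0, 0.5356],     # only the peaks shown; the rest is elided
    formula="C17H26ClNO2",
    precursor_formula="C17H27ClNO2",
    parent_mass=311.1652,
    precursor_mz=312.1725,
    adduct="[M+H]+",
    fold="train",
)

# Difference between the tabulated and the expected precursor m/z.
delta = abs(expected_precursor_mz(entry.parent_mass, entry.adduct) - entry.precursor_mz)
print(f"precursor m/z deviation: {delta:.6f}")
```

With the example values from the table, the tabulated and expected precursor m/z agree to well below 0.001. The same relationship shows up in the formulas: precursor_formula has one more hydrogen than formula.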

🏆 MassSpecGym benchmark

MassSpecGym provides three challenges for benchmarking the discovery and identification of new molecules from MS/MS spectra:

  • 💥 De novo molecule generation (MS/MS spectrum → molecular structure)
    • ✨ Bonus chemical formulae challenge (MS/MS spectrum + chemical formula → molecular structure)
  • 💥 Molecule retrieval (MS/MS spectrum → ranked list of candidate molecular structures)
    • ✨ Bonus chemical formulae challenge (MS/MS spectrum → ranked list of candidate molecular structures with ground-truth chemical formulae)
  • 💥 Spectrum simulation (molecular structure → MS/MS spectrum)
    • ✨ Bonus chemical formulae challenge (molecular structure → MS/MS spectrum; evaluated on the retrieval of molecular structures with ground-truth chemical formulae)
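
The retrieval challenge above produces a ranked list of candidates per spectrum. One common way to score such a list is a top-k hit rate: the fraction of spectra whose true molecule appears among the first k candidates. The Python sketch below is illustrative only and is not necessarily the metric that MassSpecGym defines. The function names are hypothetical, molecules are compared by 2D InChI key as an assumption, and every candidate key other than the one from the table example is a made-up placeholder.

```python
from typing import Sequence


def hit_at_k(ranked_candidates: Sequence[str], true_key: str, k: int) -> bool:
    """Return True if the true molecule is among the first k ranked candidates."""
    if k <= 0:
        raise ValueError("k must be positive")
    return true_key in ranked_candidates[:k]


def hit_rate_at_k(
    rankings: Sequence[Sequence[str]], true_keys: Sequence[str], k: int
) -> float:
    """Fraction of query spectra whose true molecule is ranked within the top k."""
    if len(rankings) != len(true_keys):
        raise ValueError("need exactly one true key per ranking")
    if not rankings:
        raise ValueError("at least one ranking is required")
    hits = sum(hit_at_k(r, t, k) for r, t in zip(rankings, true_keys))
    return hits / len(rankings)


# Two toy queries. "HKPHPIREJKHECO" is the example key from the variable table;
# the other keys are placeholders invented for this sketch.
rankings = [
    ["HKPHPIREJKHECO", "AAAAAAAAAAAAAA", "BBBBBBBBBBBBBB"],  # true molecule at rank 1
    ["AAAAAAAAAAAAAA", "BBBBBBBBBBBBBB", "HKPHPIREJKHECO"],  # true molecule at rank 3
]
true_keys = ["HKPHPIREJKHECO", "HKPHPIREJKHECO"]

print(hit_rate_at_k(rankings, true_keys, k=1))  # 0.5
print(hit_rate_at_k(rankings, true_keys, k=3))  # 1.0
```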

The provided challenges abstract the process of scientific discovery from biological and environmental samples into well-defined machine learning problems with pre-defined datasets, data splits, and evaluation metrics.
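
For the spectrum simulation challenge, a predicted spectrum has to be compared with a measured one. A widely used comparison is the cosine similarity between binned spectra. The Python sketch below shows that general technique and is not necessarily the evaluation metric MassSpecGym uses. The helper names and the 0.01 bin width are assumptions chosen for illustration, and the toy peaks are the three values visible in the table example plus one made-up peak.

```python
import math
from typing import Dict, Sequence


def bin_spectrum(
    mzs: Sequence[float], intensities: Sequence[float], bin_width: float = 0.01
) -> Dict[int, float]:
    """Sum peak intensities into fixed-width m/z bins (sparse representation)."""
    if len(mzs) != len(intensities):
        raise ValueError("mzs and intensities must have the same length")
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    binned: Dict[int, float] = {}
    for mz, intensity in zip(mzs, intensities):
        index = int(mz // bin_width)  # index of the bin containing this peak
        binned[index] = binned.get(index, 0.0) + intensity
    return binned


def cosine_similarity(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Cosine similarity between two sparse binned spectra."""
    dot = sum(value * b.get(index, 0.0) for index, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty spectrum is treated as matching nothing
    return dot / (norm_a * norm_b)


# "Measured" spectrum: the three peaks visible in the table example.
measured = bin_spectrum([55.0542, 57.0699, 238.0995], [0.0240, 1.0, 0.5356])
# "Predicted" spectrum: two matching peaks and one made-up peak at m/z 100.0.
predicted = bin_spectrum([55.0542, 57.0699, 100.0], [0.0240, 1.0, 0.5356])

print(f"self-similarity: {cosine_similarity(measured, measured):.3f}")
print(f"measured vs. predicted: {cosine_similarity(measured, predicted):.3f}")
```

Narrow bins suit high-resolution instruments but make the score sensitive to small m/z shifts. Wider bins tolerate such shifts at the cost of merging distinct fragments.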

📩 Contact

For any questions or suggestions, please contact the authors via email: roman.bushuiev@uochb.cas.cz.

🔗 References

📃 NeurIPS 2024 Spotlight paper: https://arxiv.org/abs/2410.23326.

💻 GitHub repository: https://github.com/pluskal-lab/MassSpecGym.

🤗 Hugging Face page: https://huggingface.co/datasets/roman-bushuiev/MassSpecGym.

If you use MassSpecGym in your work, please cite the following paper:

@article{bushuiev2024massspecgym,
  title={MassSpecGym: A benchmark for the discovery and identification of molecules},
  author={Roman Bushuiev and Anton Bushuiev and Niek F. de Jonge and Adamo Young and Fleming Kretschmer and Raman Samusevich and Janne Heirman and Fei Wang and Luke Zhang and Kai Dührkop and Marcus Ludwig and Nils A. Haupt and Apurva Kalia and Corinna Brungs and Robin Schmid and Russell Greiner and Bo Wang and David S. Wishart and Li-Ping Liu and Juho Rousu and Wout Bittremieux and Hannes Rost and Tytus D. Mak and Soha Hassoun and Florian Huber and Justin J. J. van der Hooft and Michael A. Stravs and Sebastian Böcker and Josef Sivic and Tomáš Pluskal},
  year={2024},
  eprint={2410.23326},
  url={https://arxiv.org/abs/2410.23326},
  doi={10.48550/arXiv.2410.23326}
}

User Attributes

These are custom, user-defined attributes that are not required by the Polaris data model.

Attribute | Value
--- | ---
year | 2024