Research

matchms

Python toolkit for mass spectrometry data processing. Supports spectral file import (mzML, MGF, MSP), metadata harmonization, peak filtering, and spectral similarity scoring (cosine, modified cosine) for metabolomics.

Introduction

Matchms is an open-source Python library for processing and comparing mass spectrometry data in metabolomics research. It provides building blocks for reproducible analysis pipelines that automate the cleaning, standardization, and comparison of mass spectra. This skill targets scientists and bioinformaticians working on metabolite identification, spectral library searching, and large-scale spectral clustering. Built-in metadata harmonization keeps fields consistent across input sources, so downstream statistical analyses operate on uniform data structures.
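
To make the idea of a filter pipeline concrete, here is a minimal pure-Python sketch of chaining cleaning steps over a peak list. It does not use matchms itself: the names `scale_to_max`, `drop_weak_peaks`, and `apply_filters`, as well as the 5% intensity threshold, are hypothetical and chosen only for illustration. The convention that a filter returns `None` to discard a spectrum is likewise an assumption of this sketch.

```python
from typing import Callable, List, Optional, Tuple

# A spectrum is reduced here to a list of (m/z, intensity) pairs.
Peaks = List[Tuple[float, float]]
Filter = Callable[[Peaks], Optional[Peaks]]


def scale_to_max(peaks: Peaks) -> Optional[Peaks]:
    """Scale intensities so the most intense peak becomes 1.0."""
    if not peaks:
        return None
    max_intensity = max(intensity for _, intensity in peaks)
    if max_intensity <= 0:
        return None
    return [(mz, intensity / max_intensity) for mz, intensity in peaks]


def drop_weak_peaks(peaks: Peaks, min_relative: float = 0.05) -> Optional[Peaks]:
    """Keep peaks at or above min_relative (expects normalized intensities)."""
    kept = [(mz, intensity) for mz, intensity in peaks if intensity >= min_relative]
    return kept or None


def apply_filters(peaks: Peaks, filters: List[Filter]) -> Optional[Peaks]:
    """Run the filters in order; stop early if one discards the spectrum."""
    current: Optional[Peaks] = peaks
    for step in filters:
        if current is None:
            return None
        current = step(current)
    return current


raw_peaks = [(100.0, 50.0), (150.0, 1.0), (200.0, 100.0)]
cleaned = apply_filters(raw_peaks, [scale_to_max, drop_weak_peaks])
print(cleaned)
```

Because every step takes a peak list and returns a peak list (or `None`), the order of operations is explicit and the same list of steps can be reapplied to any dataset, which is the property that makes such pipelines reproducible.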

  • Data import from common mass spectrometry formats including mzML, mzXML, MGF, MSP, JSON, and Pickle, with export to formats such as MGF, MSP, and JSON.

  • Spectrum filters covering normalization of peak intensities, relative intensity selection, precursor peak removal, and metadata validation.

  • Multiple spectral similarity metrics such as CosineGreedy, ModifiedCosine, NeutralLossesCosine, and FingerprintSimilarity for accurate compound matching.

  • Customizable processing pipelines that allow users to chain multiple filters and similarity calculations into sequential, reproducible workflows.

  • Support for deriving chemical annotations, including InChI, InChIKey, and Morgan fingerprints, from SMILES strings (these steps rely on RDKit being installed).

  • Ideal for metabolomics workflows, library searching, and spectral quality control tasks in research laboratories.

  • Input typically involves raw mass spectral data files or pre-processed peak lists; output includes similarity scores, filtered spectrum objects, and standardized spectral datasets.

  • Highly flexible integration: users can define custom filtering logic using standard Python functions within the SpectrumProcessor class.

  • Constraint: matchms targets metabolomics; users who need full LC-MS/MS proteomics pipelines are advised to use the pyopenms library instead.

  • Performance note: vectorized similarity scoring and memory-efficient spectrum objects support the handling of large datasets.
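
As an illustration of what a cosine-type spectral similarity score measures, the sketch below implements a simple greedy peak-matching cosine in plain Python. It is a simplified stand-in, not the matchms `CosineGreedy` implementation: the function name `greedy_cosine`, the default tolerance of 0.1, and the tie-breaking behaviour are assumptions made for this example, and options such as m/z or intensity weighting are left out.

```python
import math
from typing import List, Tuple

# A spectrum is reduced here to a list of (m/z, intensity) pairs.
Peaks = List[Tuple[float, float]]


def greedy_cosine(reference: Peaks, query: Peaks,
                  tolerance: float = 0.1) -> Tuple[float, int]:
    """Return (cosine score, number of matched peaks) for two peak lists."""
    # Collect every peak pair whose m/z values agree within the tolerance.
    candidates = []
    for i, (mz_ref, int_ref) in enumerate(reference):
        for j, (mz_qry, int_qry) in enumerate(query):
            if abs(mz_ref - mz_qry) <= tolerance:
                candidates.append((int_ref * int_qry, i, j))

    # Greedy assignment: take the largest intensity products first and
    # allow each peak to be used in at most one match.
    candidates.sort(reverse=True)
    used_ref, used_qry = set(), set()
    dot_product = 0.0
    for product, i, j in candidates:
        if i in used_ref or j in used_qry:
            continue
        used_ref.add(i)
        used_qry.add(j)
        dot_product += product

    # Normalize by the intensity vector lengths of both spectra.
    norm_ref = math.sqrt(sum(intensity ** 2 for _, intensity in reference))
    norm_qry = math.sqrt(sum(intensity ** 2 for _, intensity in query))
    if norm_ref == 0 or norm_qry == 0:
        return 0.0, 0
    return dot_product / (norm_ref * norm_qry), len(used_ref)


spectrum_a = [(100.0, 0.7), (150.0, 0.2), (200.0, 1.0)]
spectrum_b = [(100.05, 0.6), (150.0, 0.3), (300.0, 1.0)]
score, n_matches = greedy_cosine(spectrum_a, spectrum_b)
print(f"score={score:.3f}, matched peaks={n_matches}")
```

The greedy strategy is fast and usually close to the optimal assignment, but it is not guaranteed to find the pairing with the highest possible score when several peaks fall within the same tolerance window.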

Repository Stats

Stars: 19,802
Forks: 2,209
Open Issues: 41
Language: Python
Default Branch: main
Sync Status: Idle
Last Synced: Apr 30, 2026, 04:39 PM