1. part I - building a MS/MS analysis pipeline#

1.1. Requirements#

For this notebook to work, matchms needs to be installed. This can be done by running:

conda create --name matchms python=3.8
conda activate matchms
conda install --channel nlesc --channel bioconda --channel conda-forge matchms

1.2. Import MS/MS data#

  • matchms has several importing options, e.g.: load_from_json, load_from_mzml, load_from_mzxml etc. Here we are going to use load_from_mgf

  • Import spectrum from mgf file downloaded from GNPS (https://gnps-external.ucsd.edu/gnpslibrary, here using GNPS-NIH-NATURALPRODUCTSLIBRARY.mgf)

import os
import numpy as np
from matchms.importing import load_from_mgf

path_data = os.path.join(os.path.dirname(os.getcwd()), "data") #"..." enter your pathname to the downloaded file
file_mgf = os.path.join(path_data, "GNPS-NIH-NATURALPRODUCTSLIBRARY.mgf")
spectrums = list(load_from_mgf(file_mgf))
print(f"{len(spectrums)} spectrums found and imported")
1267 spectrums found and imported

1.3. Inspect spectra#

  • Inspect the metadata: inchikey, inchi, smiles

  • Inspect the number of peaks per spectrum

inchikeys = [s.get("inchikey") for s in spectrums]
found_inchikeys = np.sum([1 for x in inchikeys if x is not None])
print(f"Found {int(found_inchikeys)} inchikeys in metadata")
Found 0 inchikeys in metadata
inchi = [s.get("inchi") for s in spectrums]
inchi[:10]
[None, None, None, None, None, None, None, None, None, None]
smiles = [s.get("smiles") for s in spectrums]
smiles[:10]
['OC(=O)[C@H](NC(=O)CCN1C(=O)[C@@H]2Cc3ccccc3CN2C1=O)c4ccccc4',
 'O=C(N1[C@@H](C#N)C2OC2c3ccccc13)c4ccccc4',
 'COc1cc(O)c2c(=O)cc(oc2c1)c3ccccc3',
 'COc1c2OCOc2cc(CCN(C)C(=O)c3ccccc3)c1/C=C\\4/C(=O)NC(=O)N(CC=C)C4=O',
 'CC(C)[C@H](NC(=O)N1[C@@H](C(C)C)C(=O)Nc2ccccc12)C(=O)N[C@@H](Cc3ccccc3)C(=O)O',
 'OC(COC(=O)c1ccccc1)C(O)C(O)COC(=O)c2ccccc2',
 'Cc1c(Cc2ccccc2)c(=O)oc3c(C)c(OCC(=O)N[C@@H](Cc4ccccc4)C(=O)O)ccc13',
 'CC(Cc1c[nH]c2ccccc12)C(=O)O',
 'COc1ccc(CNC(=O)[C@@H](NC(=O)C2CCN(CC2)C(=O)[C@@H](N)CC(C)C)C(C)C)cc1',
 'Cc1c(Br)c(=O)oc2cc(OCC(=O)N3C[C@H]4C[C@@H](C3)c5cccc(=O)n5C4)ccc12']
numbers_of_peaks = [len(s.peaks.mz) for s in spectrums]

from matplotlib import pyplot as plt
plt.figure(figsize=(5,4), dpi=150)
plt.hist(numbers_of_peaks, 20, edgecolor="white")
plt.title("Peaks per spectrum (before filtering)")
plt.xlabel("Number of peaks in spectrum")
plt.ylabel("Number of spectra")
#plt.savefig("hist01.png")
Text(0, 0.5, 'Number of spectra')
../_images/73e5cf78c78b9ed60ca3fd30039a790b1e56ab6e76423be0facbdb22d18bb341.png

1.4. Build a data processing pipeline#

Depending on the data that you import, in most cases you might want to further process the spectra. matchms offers numerous ‘filters’ to process Spectrum objects. They are designed in a way that allows to simply stack them to any processing pipeline you want. In general, matchms.filtering contains filters that either will

  • harmonize, clean, extend, and/or check the spectrum metadata

  • process or filter spectrum peaks (e.g. remove low intensity peaks, or normalize peak intensities)

from matchms.filtering import default_filters
from matchms.filtering import repair_inchi_inchikey_smiles
from matchms.filtering import derive_inchikey_from_inchi
from matchms.filtering import derive_smiles_from_inchi
from matchms.filtering import derive_inchi_from_smiles
from matchms.filtering import harmonize_undefined_inchi
from matchms.filtering import harmonize_undefined_inchikey
from matchms.filtering import harmonize_undefined_smiles
from matchms.filtering import add_precursor_mz

def metadata_processing(spectrum):
    spectrum = default_filters(spectrum)
    spectrum = repair_inchi_inchikey_smiles(spectrum)
    spectrum = derive_inchi_from_smiles(spectrum)
    spectrum = derive_smiles_from_inchi(spectrum)
    spectrum = derive_inchikey_from_inchi(spectrum)
    spectrum = harmonize_undefined_smiles(spectrum)
    spectrum = harmonize_undefined_inchi(spectrum)
    spectrum = harmonize_undefined_inchikey(spectrum)
    spectrum = add_precursor_mz(spectrum)
    return spectrum

Here we defined a sequence of filters to be applied to a spectrum, which will apply default filters (a bunch of standard operations to make metadata entries more consistent). Then we run a number of filters that look for inchi, inchikey, and smiles that ended up in wrong fields (repair_inchi_inchikey_smiles), and ultimately translate inchi, inchikey, and smiles into another wherever possible.

To process the peaks we define another pipeline:

1.4.1. Peak processing pipeline#

from matchms.filtering import normalize_intensities
from matchms.filtering import select_by_intensity
from matchms.filtering import select_by_mz

def peak_processing(spectrum):
    spectrum = normalize_intensities(spectrum)
    spectrum = select_by_intensity(spectrum, intensity_from=0.01)
    spectrum = select_by_mz(spectrum, mz_from=10, mz_to=1000)
    return spectrum

Here we defined a sequence of filters which will normalize the intensities to values between 0 and 1, and finally only keep peaks if they have a normalized intensity > 0.01 and if they have a m/z position between 10 and 1000 Da.

Once defined in this way, we can easily apply this processing sequence to all imported spectrums by doing:

spectrums = [metadata_processing(s) for s in spectrums]
spectrums = [peak_processing(s) for s in spectrums]
inchikeys = [s.get("inchikey") for s in spectrums]
inchikeys[:10]
['XTJNPXYDTGJZSA-PKOBYXMFSA-N',
 'VOYWJNWCKFCMPN-FHERZECASA-N',
 'IRZVHDLBAYNPCT-UHFFFAOYSA-N',
 'OPWCHZIQXUKNMP-RGEXLXHISA-N',
 'GTBYYVAKXYVRHX-BVSLBCMMSA-N',
 'UVZWLAGDMMCHPD-UHFFFAOYSA-N',
 'XKWILXCQJFNUJH-DEOSSOPVSA-N',
 'JDZNIWUNOASRIK-UHFFFAOYSA-N',
 'RCAVVTTVAJETSK-VXKWHMMOSA-N',
 'KQAZJQXFNDOORW-CABCVRRESA-N']
numbers_of_peaks = [len(s.peaks.mz) for s in spectrums]

from matplotlib import pyplot as plt
plt.figure(figsize=(5,4), dpi=150)
plt.hist(numbers_of_peaks, 20, edgecolor="white")
plt.title("Peaks per spectrum (after filtering)")
plt.xlabel("Number of peaks in spectrum")
plt.ylabel("Number of spectra")
#plt.savefig("hist02.png")
Text(0, 0.5, 'Number of spectra')
../_images/99a0b8da92c12f159c9b0dfeb7af025ae38e4f69afeb71fdd13bd5a6a37bb5a6.png
  • the filtering did drastically reduce the (initially very high) number of peaks per spectrum

  • this is mostly due to the removal of peaks < 0.01 of the max intensity

1.5. Export processed data#

There are multiple options to export your data after processing. Using save_as_json or save_as_mgf the data can be written back to a .json or .mgf file.

from matchms.exporting import save_as_json

file_export_json = os.path.join(path_data, "GNPS-NIH-NATURALPRODUCTSLIBRARY_processed.json")
save_as_json(spectrums, file_export_json)

A alternative that is much faster for reading/writing is to use ‘pickle’. Should data handling become an issue (e.g. many thousands spectrums), than this can be a way out.

import pickle

file_export_pickle = os.path.join(path_data, "GNPS-NIH-NATURALPRODUCTSLIBRARY_processed.pickle")
pickle.dump(spectrums, 
            open(file_export_pickle, "wb"))

1.6. Compute spectra similarities : Cosine score#

In the following part the Cosine similarity scores between all possible spectra pairs will be calculated.

from matchms import calculate_scores
from matchms.similarity import CosineGreedy

similarity_measure = CosineGreedy(tolerance=0.005)
scores = calculate_scores(spectrums, spectrums, similarity_measure, is_symmetric=True)
scores.scores.shape
(1267, 1267, 2)
scores_array = scores.scores.to_array()
scores_array[:5, :5]["CosineGreedy_score"]
array([[1.00000000e+00, 6.41050958e-03, 1.70386100e-04, 8.59172648e-04,
        2.35757313e-03],
       [6.41050958e-03, 1.00000000e+00, 1.37105333e-02, 9.45693252e-01,
        9.17922823e-05],
       [1.70386100e-04, 1.37105333e-02, 1.00000000e+00, 1.53604286e-02,
        0.00000000e+00],
       [8.59172648e-04, 9.45693252e-01, 1.53604286e-02, 1.00000000e+00,
        0.00000000e+00],
       [2.35757313e-03, 9.17922823e-05, 0.00000000e+00, 0.00000000e+00,
        1.00000000e+00]])
scores_array = scores.scores.to_array()
scores_array[:5, :5]["CosineGreedy_matches"]
array([[40,  2,  1,  4,  1],
       [ 2, 75,  2,  5,  1],
       [ 1,  2, 37,  4,  0],
       [ 4,  5,  4, 66,  0],
       [ 1,  1,  0,  0, 43]])

1.7. Get highest scoring results for a spectrum of interest#

best_matches = scores.scores_by_query(spectrums[5], name="CosineGreedy_score", sort=True)[:10]
print([x[1] for x in best_matches])
[(1., 30), (0.99711049, 2), (0.99534901, 2), (0.99214557, 2), (0.98748381, 2), (0.98461111, 3), (0.98401833, 2), (0.97598497, 2), (0.9757458, 2), (0.97547771, 2)]
[x[0].get("smiles") for x in best_matches]
['OC(COC(=O)c1ccccc1)C(O)C(O)COC(=O)c2ccccc2',
 'Cc1cc(=O)oc2cc(OC(=O)c3ccccc3)ccc12',
 'O=C(Nc1ccccc1OC(=O)c2ccccc2)c3ccccc3',
 'COc1cc(CC=C)ccc1OC(=O)c2ccccc2',
 'O=C(OCC1OC(C(OC(=O)c2ccccc2)C1OC(=O)c3ccccc3)n4ncc(=O)[nH]c4=O)c5ccccc5',
 'O=C(Oc1cccc2ccccc12)c3ccccc3',
 'O=C(N1[C@@H](C#N)C2OC2c3ccccc13)c4ccccc4',
 'COC(=O)CNC(=O)c1ccccc1',
 'COc1c2OCOc2cc(CCN(C)C(=O)c3ccccc3)c1C=C4C(=O)NC(=O)NC4=O',
 'COc1c2OCOc2cc(CCN(C)C(=O)c3ccccc3)c1/C=C\\4/C(=O)NC(=O)N(C)C4=O']
from rdkit import Chem
from rdkit.Chem import Draw

for i, smiles in enumerate([x[0].get("smiles") for x in best_matches]):
    m = Chem.MolFromSmiles(smiles)
    Draw.MolToFile(m, f"compound_{i}.png")

1.8. Alternative: Get highest scoring results above min_match criteria#

min_match = 5
sorted_matches = scores.scores_by_query(spectrums[5], name="CosineGreedy_score", sort=True)
best_matches = [x for x in sorted_matches if x[1]["CosineGreedy_matches"] >= min_match][:10]
[x[1] for x in best_matches]
[(1., 30),
 (0.44857215, 6),
 (0.39605775, 5),
 (0.33880658, 5),
 (0.03942863, 6),
 (0.03429136, 5),
 (0.03028157, 5),
 (0.02845932, 5),
 (0.01924283, 7),
 (0.01890612, 5)]

1.9. Computing spectra similarities between all spectra - Modified Cosine score#

In the following part the Modified Cosine similarity scores (Watrous et al., 2012) between all possible spectra pairs will be calculated.

The modified cosine score aims at quantifying the similarity between two mass spectra. The score is calculated by finding best possible matches between peaks of two spectra. Two peaks are considered a potential match if their m/z ratios lie within the given ‘tolerance’, or if their m/z ratios lie within the tolerance once a mass-shift is applied. The mass shift is simply the difference in precursor-m/z between the two spectra. See Watrous et al. [PNAS, 2012, https://www.pnas.org/content/109/26/E1743] for further details.

from matchms.similarity import ModifiedCosine

similarity_measure = ModifiedCosine(tolerance=0.005)
scores = calculate_scores(spectrums, spectrums, similarity_measure, is_symmetric=True)
min_match = 5
sorted_matches = scores.scores_by_query(spectrums[5], name="ModifiedCosine_score", sort=True)
best_matches = [x for x in sorted_matches if x[1]["ModifiedCosine_matches"] >= min_match][:10]
[x[1] for x in best_matches]
[(1., 30),
 (0.44857215, 6),
 (0.39605775, 5),
 (0.33880658, 5),
 (0.14651964, 5),
 (0.03942863, 6),
 (0.03761521, 5),
 (0.03429136, 5),
 (0.03060675, 7),
 (0.03028157, 5)]
scores_array = scores.scores.to_array()

plt.figure(figsize=(6,6), dpi=150)
plt.imshow(scores_array[:50, :50]["ModifiedCosine_score"], cmap="viridis")
plt.colorbar(shrink=0.7)
plt.title("Modified Cosine spectra similarities")
plt.xlabel("Spectrum #ID")
plt.ylabel("Spectrum #ID")
#plt.savefig("modified_cosine_scores.png")
Text(0, 0.5, 'Spectrum #ID')
../_images/726caa0e81b54c6f71a6f0a50d6b138378d40fe51c74288b037106d2c6a0a71e.png
min_match = 5

plt.figure(figsize=(6,6), dpi=150)
plt.imshow(scores_array[:50, :50]["ModifiedCosine_score"] * (scores_array[:50, :50]["ModifiedCosine_matches"] >= min_match), cmap="viridis")
plt.colorbar(shrink=0.7)
plt.title("Modified Cosine spectra similarities (min_match=5)")
plt.xlabel("Spectrum #ID")
plt.ylabel("Spectrum #ID")
#plt.savefig("modified_cosine_scores_min_match5.png")
Text(0, 0.5, 'Spectrum #ID')
../_images/06a4e25f5a7b2bb8b8cbedddb7fe5121253155c5cd72ce738c13efe4f5c0d23a.png

This suggests that -for instance- spectrums[11] seems to have several related other ones among the first 50 spectrums. Let’s have a closer look at it.

min_match = 5
sorted_matches = scores.scores_by_query(spectrums[11], name="ModifiedCosine_score", sort=True)
best_matches = [x for x in sorted_matches if x[1]["ModifiedCosine_matches"] >= min_match][:10]
[x[1] for x in best_matches]
[(1., 151),
 (0.95295779, 15),
 (0.94542762, 13),
 (0.89735889, 17),
 (0.7886489, 12),
 (0.77433041, 9),
 (0.74935776, 8),
 (0.72854032, 8),
 (0.55896333, 7),
 (0.52331993, 9)]
[x[0].get("smiles") for x in best_matches]
['CC(C)[C@H](NS(=O)(=O)c1ccc(C)cc1)C(=O)N2CCC(CC2)C(=O)N[C@@H](CC(=O)O)C(=O)O',
 'CCCCCC(NC(=O)C1CCN(CC1)C(=O)[C@@H](NS(=O)(=O)c2ccc(C)cc2)C(C)C)C(=O)O',
 'CC(C)[C@H](NS(=O)(=O)c1ccc(C)cc1)C(=O)N2CCC(CC2)C(=O)N[C@H](C(=O)O)c3ccccc3',
 'CC(C)[C@H](NS(=O)(=O)c1ccc(C)cc1)C(=O)N2CCC(CC2)C(=O)N[C@@H](CCC(=O)N)C(=O)O',
 'COC(=O)C1CCN(CC1)C(=O)C2CCN(CC2)C(=O)[C@@H](NS(=O)(=O)c3ccc(C)cc3)C(C)C',
 'CCC(C)[C@H](NS(=O)(=O)c1ccc(C)cc1)C(=O)N2CCC(CC2)C(=O)N[C@H](C(=O)O)c3ccccc3',
 'CC(C)C[C@H](NS(=O)(=O)c1ccc(C)cc1)C(=O)N2CCC(CC2)C(=O)N[C@@H](C(C)C)C(=O)O',
 'CC(C)C[C@H](NS(=O)(=O)c1ccc(C)cc1)C(=O)N2CCC(CC2)C(=O)N[C@@H](Cc3ccccc3)C(=O)O',
 'CCC(C)[C@H](NS(=O)(=O)c1ccc(C)cc1)C(=O)N2CCC(CC2)C(=O)N3CCC(CC3)C(=O)N',
 'CCC(C)[C@H](NS(=O)(=O)c1ccc(C)cc1)C(=O)N2CCC(CC2)C(=O)N3CCC(CC3)C(=O)OC']

1.9.1. Plot the compounds with the highest modified Cosine score#

from rdkit import Chem
from rdkit.Chem import Draw

for i, smiles in enumerate([x[0].get("smiles") for x in best_matches]):
    m = Chem.MolFromSmiles(smiles)
    Draw.MolToFile(m, f"compound_{i}.png")

1.10. Inspect a case that doesn’t work that well…#

masses = [s.get("precursor_mz") for s in spectrums]
np.where(np.array(masses) > 600)
(array([  12,   84,   97,  104,  105,  156,  469,  470,  532,  533,  602,
         604,  708,  853, 1010, 1060, 1085, 1097, 1117, 1265]),)
min_match = 5
sorted_matches = scores.scores_by_query(spectrums[97], name="ModifiedCosine_score", sort=True)
best_matches = [x for x in sorted_matches if x[1]["ModifiedCosine_matches"] >= min_match][:10]
[x[1] for x in best_matches]
[(1., 733),
 (0.65230153, 8),
 (0.40915412, 14),
 (0.40722733, 81),
 (0.40374948, 54),
 (0.38470292, 59),
 (0.3423652, 30),
 (0.3292672, 6),
 (0.32810433, 12),
 (0.32173948, 5)]
from rdkit import Chem
from rdkit.Chem import Draw

for i, smiles in enumerate([x[0].get("smiles") for x in best_matches]):
    m = Chem.MolFromSmiles(smiles)
    Draw.MolToFile(m, f"compound_{i}.png")