3. part III - create a network based on spectral similarities#

This is part of a matchms and Spec2Vec tutorial.
Part I –> https://blog.esciencecenter.nl/build-your-own-mass-spectrometry-analysis-pipeline-in-python-using-matchms-part-i-d96c718c68ee

Part II –> https://blog.esciencecenter.nl/build-a-mass-spectrometry-analysis-pipeline-in-python-using-matchms-part-ii-spec2vec-8aa639571018

Part III –> https://blog.esciencecenter.nl/build-a-mass-spectrometry-analysis-pipeline-in-python-using-matchms-part-iii-molecular-91891248ee34

Here we will start with data that was processed in the part I and II. See notebook part I or notebook part II

3.1. Requirements#

For this notebook to work, you must have matchms, Spec2Vec, and matchmsextras installed. This can be done by running:

conda create --name spec2vec python=3.8
conda activate spec2vec
conda install --channel nlesc --channel bioconda --channel conda-forge spec2vec>=0.40
pip install matchmsextras

Here we used spec2vec 0.5.0, matchms 0.9.1, and matchmsextras 0.2.3.

3.2. Import data#

  • We will here use data that was imported and processed in the notebook on part I.

import os
import numpy as np
from matplotlib import pyplot as plt
from matchms.importing import load_from_json

path_data = os.path.join(os.path.dirname(os.getcwd()), "data") #"..." enter your pathname to the downloaded file
file_json = os.path.join(path_data, "GNPS-NIH-NATURALPRODUCTSLIBRARY_processed.json")
spectrums = list(load_from_json(file_json))
print(f"{len(spectrums)} spectrums found and imported")
1267 spectrums found and imported

3.3. Spec2Vec - Load a pretrained Spec2Vec model#

As described in part II, we will here load a pretrained model and apply it to compute all spectrum-spectrum similarities.

For many use-cases we would not advice to retrain a new model from scratch. Instead a more general model that has been trained on a large MS/MS dataset can simply be loaded and used to calculate similarities, even for spectra which were not part of the initial model training.

Here we can download a model trained on about 95,000 spectra from GNPS (positive ionmode) which we provide here: https://zenodo.org/record/4173596.

Let’s now load this model:

import gensim

path_model = os.path.join(os.path.dirname(os.getcwd()), "data", "trained_models") #"..." enter your pathname to the downloaded file
filename_model = "spec2vec_AllPositive_ratio05_filtered_201101_iter_15.model"
filename = os.path.join(path_model, filename_model)

model = gensim.models.Word2Vec.load(filename)

It is very important to make sure that the “documents” are created the same way as for the model training. This mostly comes down here to the number of decimals which needs to be the same here than for the pretrained model. To inspect the words the model has learned, we can look at model.wv which is the “vocabulary” the model has learned.

model.wv.index_to_key[0]  # shows the first word in vocabulary
'peak@105.07'

This means the model will expect words with 2 decimals!

print(f"Learned vocabulary contains {len(model.wv)} words.")
Learned vocabulary contains 115910 words.

3.4. Compute similarities#

from matchms import calculate_scores
from spec2vec import Spec2Vec

spec2vec_similarity = Spec2Vec(model=model,
                               intensity_weighting_power=0.5,
                               allowed_missing_percentage=5.0)

scores = calculate_scores(spectrums, spectrums, spec2vec_similarity,
                          is_symmetric=True)

The warning that 1 word is missing means that one peak in the data does not occur in the model vocabulary. Since it seems to be a small peak, the impact of this is small (0.19%) and will be ignored.

3.5. Compute network/graph from similarities#

Graphs based on similarities can be generated (and pruned/modified) in many different ways. Here we will start rather simple. The function create_network() will create a node for each spectrum. Then it will add edges (connections) for each node in the following way: For each spectrum, the max_links highest scoring connections will be added, as long as they have a similarity score of > cutoff. That means, per node, between 0 (if no high enough similarity exist) and max_links connections will be added.

spectrums[0].metadata
{'charge': 1,
 'ionmode': 'positive',
 'smiles': 'OC(=O)[C@H](NC(=O)CCN1C(=O)[C@@H]2Cc3ccccc3CN2C1=O)c4ccccc4',
 'scans': '1865',
 'ms_level': '2',
 'instrument_type': 'LC-ESI-qTof',
 'file_name': 'p1-A05_GA5_01_17878.mzXML',
 'peptide_sequence': '*..*',
 'organism_name': 'GNPS-NIH-NATURALPRODUCTSLIBRARY',
 'compound_name': '(2R)-2-[3-[(10aS)-1,3-dioxo-10,10a-dihydro-5H-imidazo[1,5-b]isoquinolin-2-yl]propanoylamino]-2-phenylacetic acid"',
 'principal_investigator': 'Dorrestein',
 'data_collector': 'VVP/LMS',
 'submit_user': 'vphelan',
 'confidence': '1',
 'spectrum_id': 'CCMSLIB00000079350',
 'precursor_mz': 408.156,
 'adduct': '[M+H]+',
 'inchi': 'InChI=1S/C22H21N3O5/c26-18(23-19(21(28)29)14-6-2-1-3-7-14)10-11-24-20(27)17-12-15-8-4-5-9-16(15)13-25(17)22(24)30/h1-9,17,19H,10-13H2,(H,23,26)(H,28,29)/t17-,19+/m0/s1',
 'inchikey': 'XTJNPXYDTGJZSA-PKOBYXMFSA-N'}
import networkx as nx
from matchms.networking import SimilarityNetwork
import matchmsextras.networking as net

ms_network = SimilarityNetwork(identifier_key="spectrum_id",
                               score_cutoff=0.7,
                               max_links=10)
ms_network.create_network(scores)

Actually, that’s it. Now a network was generated.

There is now multiple possible next steps to inspect the results.

3.6. 1. We can plot the network with Python (or better: with networkx)#

net.plot_cluster(ms_network.graph)
../_images/436b6352b83ee179dd169aa66e28f24cd1dd3d014014fbb6426083e932085072.png

Lacking any interactivity this kind of plot is not very helpful.

So we could move on to

3.7. 2. Plot individual clusters#

subnetworks = [ms_network.graph.subgraph(c).copy() for c in nx.connected_components(ms_network.graph)]
clusters = {}
for i, subnet in enumerate(subnetworks):
    if len(subnet.nodes) > 2: # exclude clusters with <=2 nodes
        clusters[i] = len(subnet.nodes)

clusters
{3: 59,
 6: 3,
 7: 18,
 8: 4,
 10: 17,
 12: 10,
 14: 10,
 17: 4,
 23: 7,
 25: 20,
 26: 10,
 28: 18,
 29: 10,
 31: 3,
 33: 5,
 37: 3,
 43: 7,
 45: 6,
 46: 3,
 55: 3,
 57: 5,
 60: 4,
 61: 32,
 62: 11,
 63: 11,
 64: 4,
 66: 3,
 69: 4,
 70: 11,
 74: 8,
 75: 10,
 77: 4,
 78: 4,
 84: 3,
 86: 7,
 89: 28,
 93: 10,
 100: 15,
 106: 5,
 120: 3,
 127: 4,
 132: 7,
 137: 10,
 140: 11,
 144: 8,
 166: 3,
 174: 8,
 175: 4,
 176: 8,
 187: 4,
 189: 7,
 190: 3,
 195: 4,
 204: 4,
 212: 10,
 218: 7,
 220: 3,
 230: 4,
 236: 6,
 260: 4,
 272: 3,
 275: 8,
 280: 3,
 310: 4,
 345: 4,
 347: 4,
 377: 3,
 382: 3,
 507: 3,
 541: 3,
 556: 3}

Now, let’s plot one of those clusters separately.

net.plot_cluster(subnetworks[44])
../_images/0b268b5e053354a7bcbc7fcd769347c77b389d783249919608838dcf0d061f9d.png

That is more insightful. Yet, to interactively explore the graph it is generally better to go to

3.8. 3. Export the graph and plot elsewhere (e.g. Cytoscape)#

This comprises two steps. First, we will export the graph/network as a graphml file which contains all edges and nodes. Second, we will also export the metadata which can later be used to color or label the graph.

# export the graph
ms_network.export_to_graphml("network_GNPS_cutoff_07.graphml")

# former way:
# nx.write_graphml(our_network, "network_GNPS_cutoff_07.graphml")

3.8.1. Add parent mass#

from matchms.filtering import default_filters
from matchms.filtering import add_precursor_mz
from matchms.filtering import add_parent_mass

def metadata_processing(spectrum):
    spectrum = default_filters(spectrum)

    spectrum = add_precursor_mz(spectrum)
    spectrum = add_parent_mass(spectrum)
    return spectrum
spectrums2 = [metadata_processing(s) for s in spectrums]
spectrums2[0].metadata
{'charge': 1,
 'ionmode': 'positive',
 'smiles': 'OC(=O)[C@H](NC(=O)CCN1C(=O)[C@@H]2Cc3ccccc3CN2C1=O)c4ccccc4',
 'scans': '1865',
 'ms_level': '2',
 'instrument_type': 'LC-ESI-qTof',
 'file_name': 'p1-A05_GA5_01_17878.mzXML',
 'peptide_sequence': '*..*',
 'organism_name': 'GNPS-NIH-NATURALPRODUCTSLIBRARY',
 'compound_name': '(2R)-2-[3-[(10aS)-1,3-dioxo-10,10a-dihydro-5H-imidazo[1,5-b]isoquinolin-2-yl]propanoylamino]-2-phenylacetic acid"',
 'principal_investigator': 'Dorrestein',
 'data_collector': 'VVP/LMS',
 'submit_user': 'vphelan',
 'confidence': '1',
 'spectrum_id': 'CCMSLIB00000079350',
 'precursor_mz': 408.156,
 'adduct': '[M+H]+',
 'inchi': 'InChI=1S/C22H21N3O5/c26-18(23-19(21(28)29)14-6-2-1-3-7-14)10-11-24-20(27)17-12-15-8-4-5-9-16(15)13-25(17)22(24)30/h1-9,17,19H,10-13H2,(H,23,26)(H,28,29)/t17-,19+/m0/s1',
 'inchikey': 'XTJNPXYDTGJZSA-PKOBYXMFSA-N',
 'parent_mass': 407.148724}
# collect the metadata
metadata = net.extract_networking_metadata(spectrums2)
metadata.head()
smiles compound_name parent_mass
CCMSLIB00000079350 OC(=O)[C@H](NC(=O)CCN1C(=O)[C@@H]2Cc3ccccc3CN2... (2R)-2-[3-[(10aS)-1,3-dioxo-10,10a-dihydro-5H-... 407.148724
CCMSLIB00000079351 O=C(N1[C@@H](C#N)C2OC2c3ccccc13)c4ccccc4 (2S)-3-benzoyl-2,7b-dihydro-1aH-oxireno[2,3-c]... 276.090724
CCMSLIB00000079352 COc1cc(O)c2c(=O)cc(oc2c1)c3ccccc3 5-hydroxy-7-methoxy-2-phenylchromen-4-one 268.073724
CCMSLIB00000079353 COc1c2OCOc2cc(CCN(C)C(=O)c3ccccc3)c1/C=C\4/C(=... N-[2-[7-methoxy-6-[(Z)-(2,4,6-trioxo-1-prop-2-... 491.169724
CCMSLIB00000079354 CC(C)[C@H](NC(=O)N1[C@@H](C(C)C)C(=O)Nc2ccccc1... (2S)-2-[[(2S)-3-methyl-2-[[(2S)-3-oxo-2-propan... 480.237724
# export the metadata
metadata.to_csv("network_GNPS_metadata.csv")