03_Clustering

This exercise shows how to perform hierarchical clustering based on molecular fingerprints and how to create an interactive view for picking interesting clusters. The chemical structures were extracted from this publication: https://doi.org/10.1021/acs.jmedchem.9b01658

I. Pre-processing
  1. Execute the Table Reader node and investigate the data.

  2. Inspect the Row Filter node, which filters out rows with missing values, then run it.

  3. Run the RDKit Canon SMILES node.

  4. Remove duplicate compounds with a Duplicate Row Filter node. Make sure to set it to include only the appropriate column.

  5. Compute physicochemical properties using the RDKit Descriptor Calculation node (e.g., SlogP, TPSA, AMW, NumRotatableBonds, NumHBD, NumHBA).

  6. Connect the result to the Renderer to Image node. Make sure that the canonical SMILES column is used as the input column and the RDKit 2D depiction as the renderer.

  7. Ctrl/Cmd + double-click the component to open it and follow the instructions inside.

  8. Open the view of the component and make a pre-selection of compounds by clicking and dragging along the axes of the parallel coordinates plot. Make sure to include enough compounds for clustering. Close and apply temporarily.
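
The logic of the pre-processing steps above can be sketched in plain Python. This is only an illustration of what the nodes do conceptually, not the KNIME implementation: the rows, the property values, and the helper names (`drop_missing`, `drop_duplicates`, `brush`) are made up for the example, and the SMILES strings are assumed to be canonical already. Brushing an axis of the parallel coordinates plot is modelled as keeping rows whose value lies inside a range.

```python
# Toy rows; names and numbers are illustrative assumptions, not the exercise data.
rows = [
    {"text": "cpd 1", "canonical_smiles": "CCO", "SlogP": -0.1, "TPSA": 20.2},
    {"text": "cpd 2", "canonical_smiles": None, "SlogP": 1.2, "TPSA": 40.0},    # missing structure
    {"text": "cpd 3", "canonical_smiles": "CCO", "SlogP": -0.1, "TPSA": 20.2},  # duplicate of cpd 1
    {"text": "cpd 4", "canonical_smiles": "c1ccccc1O", "SlogP": 1.4, "TPSA": 20.2},
    {"text": "cpd 5", "canonical_smiles": "CCCCCCCCCC", "SlogP": 4.1, "TPSA": 0.0},
]


def drop_missing(rows, column):
    """Keep only rows that have a value in the given column."""
    return [row for row in rows if row.get(column) is not None]


def drop_duplicates(rows, column):
    """Keep the first row for each distinct value of the given column."""
    seen = set()
    unique = []
    for row in rows:
        key = row[column]
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def brush(rows, ranges):
    """Keep rows whose values fall inside every (low, high) range, bounds included."""
    return [
        row for row in rows
        if all(low <= row[col] <= high for col, (low, high) in ranges.items())
    ]


cleaned = drop_duplicates(drop_missing(rows, "canonical_smiles"), "canonical_smiles")
selected = brush(cleaned, {"SlogP": (-1.0, 2.0), "TPSA": (10.0, 60.0)})
print([row["text"] for row in selected])  # ['cpd 1', 'cpd 4']
```

Deduplicating on the canonical SMILES column only, rather than on all columns, is what makes two differently annotated rows for the same structure collapse into one.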

II. Clustering
  1. Use the RDKit Fingerprint node to create Morgan fingerprints called mfp2 from the canonical SMILES column.

  2. The created fingerprint can now be used to calculate the Tanimoto distance in a Bit Vector Distances node.

  3. Connect the resulting distance and the previous fingerprint output to a Hierarchical Clustering (DistMatrix) node. Use average linkage as the linkage type.

  4. Use the cluster output and the table from the RDKit Fingerprint node as input for the Hierarchical Cluster Assigner node. Assign the clusters based on a fixed number of clusters.

  5. Starting from the RDKit Fingerprint node, add a Column Resorter so that Mol is shown first, followed by text and page. Display only those three columns in a Tile View (JavaScript). Configure the Tile View to display 3 tiles per row and to show only selected rows (an option in the Interactivity tab).

  6. Create a component containing the Column Resorter, the Tile View and the Hierarchical Cluster Assigner, and name it "Set cluster threshold". Execute it, inspect the interactive view, and select a cluster threshold. Close and apply temporarily.

  7. Connect the result of the "Set cluster threshold" component to the "Pick interesting cluster" component. Ctrl/Cmd + double-click the component to open it if you wish to investigate it.

  8. Open the views of the component and select the cluster you want to save later. Close and apply temporarily.

  9. Filter for the following columns: text, canonical_smiles, and Mol. Write the result to a file in the data folder of this exercise using a Table Writer or an Excel Writer.
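
To make the clustering steps above more concrete, here is a minimal pure-Python sketch of Tanimoto distance followed by average-linkage agglomerative clustering with a fixed number of clusters. It is not what the KNIME nodes run internally. The fingerprints are made-up sets of on-bit indices rather than real Morgan fingerprints, and the function names are hypothetical. Merging stops once the requested number of clusters is reached, which is assumed here to stand in for assigning clusters from the full dendrogram.

```python
from itertools import combinations


def tanimoto_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union


def average_linkage_clusters(fingerprints, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain; return one label per item."""
    n = len(fingerprints)
    dist = {
        (i, j): tanimoto_distance(fingerprints[i], fingerprints[j])
        for i, j in combinations(range(n), 2)
    }

    def pair_distance(i, j):
        return dist[(min(i, j), max(i, j))]

    def linkage(members_a, members_b):
        # Average linkage: mean distance over all cross-cluster pairs.
        total = sum(pair_distance(i, j) for i in members_a for j in members_b)
        return total / (len(members_a) * len(members_b))

    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b

    labels = [0] * n
    for label, members in enumerate(clusters):
        for member in members:
            labels[member] = label
    return labels


# Toy fingerprints (illustrative only): three similar items and two others.
fingerprints = [
    {1, 2, 3, 4},
    {1, 2, 3, 5},
    {10, 11, 12},
    {10, 11, 13},
    {1, 2, 3, 4, 5},
]
labels = average_linkage_clusters(fingerprints, n_clusters=2)
print(labels)  # [0, 0, 1, 1, 0]
```

Choosing a smaller number of clusters forces more dissimilar compounds into the same cluster, which is the trade-off explored interactively in the "Set cluster threshold" component.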
