Molecule Activity Classification with Machine Learning


URL: Teach Open CADD - Workflow 7 (Ligand-based screening: Machine learning) https://hub.knime.com/corey/spaces/Workflow%20Team/Andra/ML-workflow/TeachOpenCADD_Workflow7_Machine_learning/TeachOpenCADD_Workflow7_Machine_learning~-uFId4g0DnV6Xb83/current-state
URL: Teach Open CADD - Master Workflow https://hub.knime.com/knime/spaces/Life%20Sciences/Cheminformatics/Teaching/TeachOpenCADD/TeachOpenCADD~xYhrR1mfFcGNxz7I/current-state
URL: Compound Library Screening (ADME) https://hub.knime.com/knime/spaces/Industry%20&%20Department%20Use%20Cases/Life%20Sciences/Compound%20Library%20Screening%20(ADME)~a-fpfwW0JEzAvw4o/current-state
URL: Teach Open CADD - zenodo https://zenodo.org/records/6636125

The amount of data available to researchers has increased drastically in recent years, including large datasets on chemical compounds relevant to pharmacological research. Machine learning models can analyze these datasets and identify patterns that allow the pharmacokinetic properties and potential liabilities or hazards of novel compounds to be predicted. This has the potential to significantly accelerate industrial and academic pharmacological research and development, saving both time and money.

This workflow demonstrates the basic principle of how to train and evaluate different machine learning models for a binary classification of compounds into active and inactive categories against a specific target protein. The data used for training and testing is a list of compounds containing the SMILES notations of their molecular structures as well as their activity data (as pIC50) against the target of interest. The compounds are classified as either active or inactive using a threshold on their pIC50 values. The dataset is then reduced to the molecular fingerprint of each compound (a reduced numeric representation of the molecule) together with the category information and passed on to three branches, each using one of the following machine learning models for demonstration:

  • Random Forest: ensemble machine learning method that builds many individual decision trees on random subsets of the data during training; in classification problems the trees "vote" and the majority class becomes the prediction

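  • Resilient Backpropagation (RProp): supervised learning algorithm for training feedforward neural networks that adapts the step size of each weight using only the sign of the gradient; here it trains a multilayer perceptron (RProp MLP Learner)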

  • Support Vector Machine (SVM): supervised learning algorithm for classification or regression searching for the optimal separating boundaries between data points/classes
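
To make the RProp entry in the list above more concrete, the following is a minimal Python sketch of a resilient-backpropagation style update (the iRprop- variant) applied to a toy quadratic rather than to a neural network. The function names, the toy gradient, and the hyperparameter values (`eta_plus`, `eta_minus`, the step bounds) are illustrative assumptions; this is not the implementation used by the RProp MLP Learner node.

```python
def rprop_minimize(grad_fn, weights, n_iter=200, step_init=0.1,
                   eta_plus=1.2, eta_minus=0.5,
                   step_min=1e-6, step_max=50.0):
    """Minimize a function from its gradient using the iRprop- rule.

    Only the sign of each partial derivative is used; every weight
    keeps its own step size, which grows while the sign is stable and
    shrinks when the sign flips (an overshoot).
    """
    w = list(weights)
    steps = [step_init] * len(w)
    prev_grad = [0.0] * len(w)
    for _ in range(n_iter):
        grad = grad_fn(w)
        for i, g in enumerate(grad):
            if g * prev_grad[i] > 0:
                # Same sign as in the previous iteration: speed up.
                steps[i] = min(steps[i] * eta_plus, step_max)
            elif g * prev_grad[i] < 0:
                # Sign flipped: shrink the step and skip this update.
                steps[i] = max(steps[i] * eta_minus, step_min)
                g = 0.0
            if g > 0:
                w[i] -= steps[i]
            elif g < 0:
                w[i] += steps[i]
            prev_grad[i] = g
    return w


def toy_gradient(w):
    # Gradient of the assumed toy loss f(w) = (w0 - 3)^2 + (w1 + 1)^2.
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]


solution = rprop_minimize(toy_gradient, [0.0, 0.0])
```

On this toy loss, `solution` ends up close to the minimum at (3, -1). In a real network, `grad_fn` would return the gradient of the training loss with respect to all weights.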

The X-Partitioner and X-Aggregator nodes perform k-fold cross-validation, which provides a reliable first estimate of each model's performance on unseen data. Low variation in the error rates across folds points to a robust model for the intended use. Together with the visualization dashboard, which includes ROC curves, confusion matrices, and other performance metrics, this allows the best of the three models to be chosen for further optimization and subsequent deployment.

Note: This workflow is based on the TeachOpenCADD workflow, more specifically Workflow 7 (Ligand-based screening: Machine learning), available from the KNIME Community Hub and Zenodo. It uses a processed version of the example data provided there, which is a list of active substances against the Epidermal Growth Factor Receptor (EGFR) that has been filtered according to Lipinski's Rule of Five (see use case Compound Library Screening (ADME) for details).

Visualization and Evaluation of Models

A dashboard containing ROC curves, confusion matrices and other performance metrics for all three model types.
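
As a rough illustration of the metrics behind such a dashboard, the sketch below computes a confusion matrix, accuracy, and the area under the ROC curve in plain Python. The labels, scores, and the 0.5 decision threshold are made-up toy values, and the function names are hypothetical; the dashboard component itself may compute these metrics differently.

```python
def confusion_matrix(y_true, y_pred, positive="active"):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["TP" if truth == positive else "FP"] += 1
        else:
            counts["FN" if truth == positive else "TN"] += 1
    return counts


def roc_auc(y_true, scores, positive="active"):
    """AUC as the probability that a random positive outranks a random
    negative (ties count as 0.5); equivalent to the area under the ROC curve."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (assumed): true classes and predicted probability of "active".
y_true = ["active", "active", "inactive", "inactive", "active", "inactive"]
scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
y_pred = ["active" if s >= 0.5 else "inactive" for s in scores]

counts = confusion_matrix(y_true, y_pred)
accuracy = (counts["TP"] + counts["TN"]) / len(y_true)
auc = roc_auc(y_true, scores)
```

For this toy data the confusion matrix holds 2 true positives, 1 false positive, 2 true negatives, and 1 false negative, and the AUC is 7/9.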

Data Access
Data Cleaning and Preparation
Training the Machine Learning Models

Data is split into three branches to evaluate different model types (Random Forest, RProp MLP, and SVM).
X Partitioner and X Aggregator nodes perform k-fold cross-validation on each model by splitting the data into training (top sub-branch with the Learner node) and test sets (bottom sub-branch with the Predictor node) multiple times. All predictions and error rates of these different runs are collected.
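
The cross-validation loop described above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: a plain random split into k folds, a trivial majority-class learner standing in for the real Learner and Predictor nodes, and toy data. The function names are hypothetical and do not correspond to KNIME APIs.

```python
import random
from collections import Counter


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train, test) index lists; each row is tested exactly once."""
    if k < 2 or k > n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(rows, labels, fit, predict, k=5, seed=0):
    """Collect one prediction per row and one error rate per fold."""
    predictions = {}
    error_rates = []
    for train, test in k_fold_indices(len(rows), k, seed):
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        errors = 0
        for i in test:
            predictions[i] = predict(model, rows[i])
            errors += predictions[i] != labels[i]
        error_rates.append(errors / len(test))
    return predictions, error_rates


# Stand-in learner (assumption): always predicts the majority training class.
def fit_majority(train_rows, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, row):
    return model


rows = [[i % 2, (i // 2) % 2] for i in range(10)]   # toy feature rows
labels = ["inactive"] * 7 + ["active"] * 3          # toy class labels
predictions, error_rates = cross_validate(
    rows, labels, fit_majority, predict_majority, k=5)
mean_error = sum(error_rates) / len(error_rates)
```

The spread of `error_rates` across folds is the quantity to inspect when judging how stable a model is. Swapping `fit_majority` and `predict_majority` for a real learner and predictor leaves the loop unchanged.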

Data Enrichment
  • Define cutoff for activity to label each compound either active or inactive, e.g., pIC50 of 6.3

  • Use a pre-defined node to generate a so-called "fingerprint" of each molecule

  • reduce dataset to fingerprint and activity category

  • split fingerprint into single columns for model training
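
The labeling and column-expansion steps in the list above can be sketched in plain Python. The sketch assumes that compounds with a pIC50 at or above the cutoff count as active (the workflow's exact comparison is not stated here), and it uses short made-up bit strings in place of real fingerprints. All function and variable names are hypothetical.

```python
import math


def pic50_from_ic50_nm(ic50_nm):
    """pIC50 is the negative decimal logarithm of the IC50 in mol/L."""
    return -math.log10(ic50_nm * 1e-9)


def label_activity(pic50, cutoff=6.3):
    # Assumption: values at or above the cutoff are labeled active.
    return "active" if pic50 >= cutoff else "inactive"


def expand_bit_string(bits):
    """Split a fingerprint bit string into one integer column per bit."""
    return [int(ch) for ch in bits]


# Toy compounds (assumed): (fingerprint bit string, pIC50).
compounds = [("10110", 7.2), ("00101", 5.9), ("11100", 6.3)]

# One training row per compound: bit columns followed by the class label.
table = [expand_bit_string(bits) + [label_activity(pic50)]
         for bits, pic50 in compounds]
```

A cutoff of 6.3 corresponds to an IC50 of roughly 500 nM, since `pic50_from_ic50_nm(500)` is about 6.30.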

reads in file with list of molecules
CSV Reader
Split data into training/test set in k-fold validation
X-Partitioner
Train model on training set
RProp MLP Learner
split fingerprint to one bit per column
Expand Bit Vector
convert SMILES column to RDKit format
RDKit From Molecule
convert String type to SMILES type
Molecule Type Cast
Split data into training/test set in k-fold validation
X-Partitioner
Train model on training set
Random Forest Learner
Aggregate results from k-fold validation
X-Aggregator
Aggregate results from k-fold validation
X-Aggregator
remove unnecessary columns
Column Filter
Test model on test set
SVM Predictor
Test model on test set
Random Forest Predictor
Train model on training set
SVM Learner
label active or inactive
Expression
Generate fingerprint (default MACCS)
RDKit Fingerprint
reduce to activity + fingerprint
Column Filter
remove rows without activity data
Row Filter
Test model on test set
MultiLayerPerceptron Predictor
Split data into training/test set in k-fold validation
X-Partitioner
Dashboard with ROC curves and Scores
Visualization of Model Evaluation
Aggregate results from k-fold validation
X-Aggregator
Down-sample majority class
Equal Size Sampling
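
Down-sampling the majority class, as the Equal Size Sampling step does, can be illustrated with a short Python sketch. It assumes simple random sampling without replacement down to the size of the smallest class; the function name, the seed, and the toy rows are hypothetical.

```python
import random
from collections import defaultdict


def equal_size_sample(rows, label_of, seed=0):
    """Randomly down-sample every class to the size of the smallest one."""
    if not rows:
        return []
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    target = min(len(group) for group in by_class.values())
    rng = random.Random(seed)
    sampled = []
    for group in by_class.values():
        # Sampling without replacement keeps each kept row unique.
        sampled.extend(rng.sample(group, target))
    return sampled


# Toy rows (assumed): (compound id, class label), 3 active and 7 inactive.
toy_rows = ([("a%d" % i, "active") for i in range(3)]
            + [("i%d" % i, "inactive") for i in range(7)])
balanced = equal_size_sample(toy_rows, label_of=lambda row: row[1])
```

Here `balanced` contains three rows per class. Balancing the classes in this way keeps a classifier from scoring well simply by always predicting the majority class.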
