Molecule Activity Classification with Machine Learning
The amount of data available to researchers has drastically increased over the last couple of years, including large datasets on chemical compounds relevant to pharmacological research. Machine learning models can be used to analyze these large datasets and identify patterns that allow for the prediction of pharmacokinetic properties, issues, or dangers of novel compounds. This has the potential to significantly accelerate industrial and academic pharmacological research and development, saving both time and money.
This workflow demonstrates how to train and evaluate different machine learning models for the binary classification of compounds as active or inactive against a specific target protein. The data used for training and testing is a list of compounds with the SMILES notation of each molecular structure and the compound's activity against the target of interest, given as pIC50 (the negative base-10 logarithm of the molar IC50). Each compound is labeled as active or inactive by applying a threshold to its pIC50 value. The dataset is then reduced to the molecular fingerprint of each compound (a reduced numeric representation of the molecule) together with its activity class and passed on to three branches, each of which trains one of the following example models:
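The labeling step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cutoff of 6.3, the choice of "greater than or equal" for the comparison, the function names, and the toy records are hypothetical and not taken from the workflow, which performs this step with KNIME nodes.

```python
import math

# Assumed cutoff, for illustration only; the workflow's actual threshold may differ.
PIC50_THRESHOLD = 6.3


def pic50_from_ic50_nm(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 mol/L, so -log10(ic50_nm * 1e-9) = 9 - log10(ic50_nm)
    return 9.0 - math.log10(ic50_nm)


def label_activity(pic50: float, threshold: float = PIC50_THRESHOLD) -> int:
    """Return 1 (active) if pIC50 reaches the threshold, otherwise 0 (inactive)."""
    return 1 if pic50 >= threshold else 0


# Hypothetical toy records; the SMILES strings and pIC50 values are made up.
compounds = [
    {"smiles": "CCO", "pic50": 4.2},
    {"smiles": "c1ccccc1", "pic50": 6.3},
    {"smiles": "CC(=O)O", "pic50": 8.1},
]

# Attach the binary activity class to every record.
for compound in compounds:
    compound["active"] = label_activity(compound["pic50"])

labels = [compound["active"] for compound in compounds]
```

A higher pIC50 means a more potent compound, so raising the threshold shrinks the active class and changes the class balance the models are trained on.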
Random Forest: an ensemble machine learning method that builds many individual decision trees on random subsets of the data during training; in classification problems, the trees "vote" and the majority class becomes the overall result
Resilient Backpropagation (RProp): a supervised learning algorithm for feedforward neural networks that adapts a separate step size for each weight based only on the sign of the gradient
Support Vector Machine (SVM): a supervised learning algorithm for classification or regression that searches for the optimal separating boundary between data points/classes
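Of the three methods, RProp is probably the least familiar, so a minimal sketch of its weight update may help. The code below implements the iRprop- variant (on a sign change of the gradient, the step size shrinks and the update for that weight is skipped once) and applies it to a toy quadratic function instead of a neural network. The function names, the toy objective, and the choice of variant are assumptions made for illustration; the hyperparameter defaults are the commonly cited ones, and nothing here is claimed to match the implementation inside the KNIME learner node.

```python
def rprop_minimize(grad_fn, weights, n_iter=200, step_init=0.1,
                   eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """Minimize a function with the iRprop- update rule.

    grad_fn takes a list of weights and returns the list of partial derivatives.
    Only the sign of each derivative is used; its magnitude is ignored.
    """
    w = list(weights)
    steps = [step_init] * len(w)   # one adaptive step size per weight
    prev_grad = [0.0] * len(w)
    for _ in range(n_iter):
        grad = list(grad_fn(w))
        for i in range(len(w)):
            agreement = prev_grad[i] * grad[i]
            if agreement > 0:
                # Same sign as last time: the direction is stable, so accelerate.
                steps[i] = min(steps[i] * eta_plus, step_max)
            elif agreement < 0:
                # Sign flipped: the last step overshot a minimum, so slow down
                # and skip the update for this weight in this iteration.
                steps[i] = max(steps[i] * eta_minus, step_min)
                grad[i] = 0.0
            # Move against the gradient sign by the current step size.
            if grad[i] > 0:
                w[i] -= steps[i]
            elif grad[i] < 0:
                w[i] += steps[i]
            prev_grad[i] = grad[i]
    return w


def toy_gradient(w):
    """Gradient of the toy objective f(w) = (w0 - 3)^2 + 10 * (w1 + 1)^2."""
    return [2.0 * (w[0] - 3.0), 20.0 * (w[1] + 1.0)]


# The minimum of the toy objective is at w0 = 3, w1 = -1.
solution = rprop_minimize(toy_gradient, [0.0, 0.0])
```

Because only the gradient sign is used, the two weights converge at a similar pace even though the objective is ten times steeper along the second one. In a neural network, grad_fn would be replaced by the backpropagated gradient of the loss over the training set.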
The X Partitioner and X Aggregator nodes perform a cross-validation, which provides a first estimate of each model's performance on unseen data. A low variation in the resulting error rates across the folds points to a robust model for the intended use. Together with the visualization dashboard, which includes ROC curves, confusion matrices, and other performance metrics, this allows the best of the three machine learning models to be chosen for further optimization and subsequent deployment.
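The idea behind the cross-validation loop can be sketched in plain Python. This shows the general technique, not the internals of the KNIME nodes: the helper names, the nearest-neighbour placeholder learner, and the toy bit-vector "fingerprints" are all hypothetical and stand in for the real fingerprints and the three models.

```python
import random
import statistics


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k disjoint folds."""
    if k < 2 or k > n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(features, labels, train_fn, k=5, seed=0):
    """Return one error rate per fold; each fold is held out exactly once."""
    error_rates = []
    for test_idx in k_fold_indices(len(labels), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        # Train only on the samples outside the held-out fold.
        predict = train_fn([features[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        wrong = sum(predict(features[i]) != labels[i] for i in test_idx)
        error_rates.append(wrong / len(test_idx))
    return error_rates


def train_nearest_neighbour(train_features, train_labels):
    """Placeholder learner: predict the label of the closest training vector."""
    def predict(x):
        # Hamming distance between bit vectors of equal length.
        distances = [sum(a != b for a, b in zip(x, f)) for f in train_features]
        return train_labels[distances.index(min(distances))]
    return predict


# Hypothetical toy data: 3-bit "fingerprints" whose first bit is the class.
features = [(i % 2, (i // 2) % 2, (i // 4) % 2) for i in range(20)]
labels = [f[0] for f in features]

error_rates = cross_validate(features, labels, train_nearest_neighbour, k=5)
mean_error = statistics.mean(error_rates)
error_spread = statistics.pstdev(error_rates)
print(f"error per fold: {error_rates}")
print(f"mean error: {mean_error:.3f}, spread: {error_spread:.3f}")
```

The spread of the per-fold error rates is the quantity to watch: a small spread suggests the model behaves consistently regardless of which compounds were held out.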
Note: This workflow is based on the TeachOpenCADD workflow, more specifically Workflow 7 (Ligand-based screening: Machine learning), available from the KNIME Community Hub and Zenodo. It uses a processed version of the example data provided there, which is a list of active substances against the Epidermal Growth Factor Receptor (EGFR) that has been filtered according to Lipinski's Rule of Five (see the use case Compound Library Screening (ADME) for details).