Forge Build Machine Learning

Generates a Forge™ Machine Learning regression or classification model for activity from a set of aligned molecules.

The molecules must be pre-aligned for all model types except kNN models built with 2D fingerprints (see below). The falign program or the Forge Align node is ideal for this.

The following types of models can be generated.

k Nearest Neighbor (kNN) regression or classification

The kNN methodology is a well-known and robust machine learning approach where the activity for each compound is predicted as the weighted average activity of its k nearest neighbors (most similar compounds) in the training set.

The similarity between the molecules is calculated using either Cresset's field/shape similarity or 2D circular fingerprints (ECFP4, ECFP6, FCFP4, or FCFP6).
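
The weighted-average idea behind kNN can be illustrated with a minimal Python sketch. It assumes the similarities between a query molecule and each training compound are already available as plain numbers, and it uses the similarity itself as the weight. The function name knn_predict and the toy data are hypothetical, and this is not the Forge implementation.

```python
def knn_predict(similarities, activities, k):
    """Predict activity as the similarity-weighted mean of the k most similar compounds."""
    # Pair each training compound's similarity with its activity, most similar first.
    ranked = sorted(zip(similarities, activities), key=lambda pair: pair[0], reverse=True)
    neighbors = ranked[:k]
    total_weight = sum(sim for sim, _ in neighbors)
    if total_weight == 0:
        # No informative neighbors: fall back to the unweighted mean.
        return sum(act for _, act in neighbors) / len(neighbors)
    return sum(sim * act for sim, act in neighbors) / total_weight


# Toy data (invented): similarity of one query molecule to four training
# compounds, and the activities of those compounds.
sims = [0.9, 0.2, 0.6, 0.4]
acts = [7.0, 5.0, 6.0, 8.0]
prediction = knn_predict(sims, acts, k=2)
```

With k=2 only the two most similar compounds (similarities 0.9 and 0.6) contribute, so the prediction lies between their activities and closer to that of the more similar one.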

Random Forest (RF) regression

The aligned training set of molecules is used to derive a set of sampling points around the molecules based on their field points. These points can be used to probe any molecule for its electrostatic potential or for the volume it occupies. The data matrix derived from the sample values is then processed to generate a Random Forest model.

Relevance Vector Machine (RVM) regression or classification

The aligned training set of molecules is used to derive a set of sampling points around the molecules based on their field points. These points can be used to probe any molecule for its electrostatic potential or for the volume it occupies. The data matrix derived from the sample values is then processed to generate an RVM model.

Support Vector Machine (SVM) regression or classification

The aligned training set of molecules is used to derive a set of sampling points around the molecules based on their field points. These points can be used to probe any molecule for its electrostatic potential or for the volume it occupies. The data matrix derived from the sample values is then processed to generate an SVM model.

These models can be used within the 'Forge Score Machine Learning' node (wrapping the 'fscore' executable) to predict an activity value for newly designed molecules.

Please refer to the Forge manual for a detailed description of the science behind each of these model types in Forge and the corresponding model building options.

This node wraps the Forge Build executable 'fbuild', which must be installed with a valid license for this node to work. If this is installed in the default location on Windows, then it should be found automatically. Otherwise, you must either set the 'Cresset Home' preference or the CRESSET_HOME environment variable to the base Cresset software install directory. You may also set the 'fbuild Path' preference or the CRESSET_FORGEBUILD_EXE environment variable to point directly at the executable itself.
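
To check outside KNIME whether the executable can be located from the environment variables named above, a small Python sketch such as the following may help. It makes two assumptions that are not stated in this description: that CRESSET_FORGEBUILD_EXE is consulted before CRESSET_HOME, and that the executable can be found by searching the install directory for a file named fbuild or fbuild.exe. The function name find_fbuild is hypothetical.

```python
import os
from pathlib import Path


def find_fbuild(env=None):
    """Return a path to the fbuild executable, or None if it cannot be located."""
    env = os.environ if env is None else env

    # 1. ASSUMPTION: a direct pointer to the executable takes precedence.
    direct = env.get("CRESSET_FORGEBUILD_EXE")
    if direct and Path(direct).is_file():
        return Path(direct)

    # 2. Otherwise search below the base install directory.
    #    ASSUMPTION: the layout under CRESSET_HOME is not documented here,
    #    so the whole tree is searched for a file named fbuild or fbuild.exe.
    home = env.get("CRESSET_HOME")
    if home and Path(home).is_dir():
        for name in ("fbuild", "fbuild.exe"):
            for candidate in Path(home).rglob(name):
                if candidate.is_file():
                    return candidate
    return None
```

Calling find_fbuild() with no arguments reads the current process environment. A return value of None suggests that neither variable points at a usable installation.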

The Forge Build Machine Learning node can be configured to use additional resources to perform calculations. The time taken for the node to run can be drastically reduced by using Cresset's Engine Broker. To use this facility, set either the 'Cresset Engine Broker' preference or the CRESSET_BROKER environment variable to point to the location of your local Engine Broker. If you do not currently have the Cresset Engine Broker, contact Cresset (enquiries@cresset-group.com) for pricing on local and cloud-based brokers.

For more information visit www.cresset-group.com or contact us at support@cresset-group.com.

Options

Basic

Training Set Structure column
The column that contains the aligned molecules to be used as the training set.
Model type
Which type of model to generate. If set to Automatic, all the model types will be generated and the best model will be picked for the output. Note that RF is not available for building classification models for categorical data.
Activity column
The name of the column which specifies the activity data to use when building the model.
Units for the input activity values
Specify whether the input activity values require log-transforming and give their units, or whether the activity values are categorical. For categorical data, the activity column should contain only integer values.
Assign formal charges to input molecules
If set, the protonation states for the input molecules will be set using Cresset's charging rules. Acids will be deprotonated, primary amines protonated, etc.
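
The log-transform mentioned under 'Units for the input activity values' can be illustrated with a short Python sketch. It shows the standard conversion of a concentration (such as an IC50 or Ki) to its negative base-10 logarithm in molar units. The unit labels and the function name to_p_activity are hypothetical and do not necessarily match the choices offered in the node dialog.

```python
import math

# Conversion factors from common concentration units to molar.
# ASSUMPTION: these labels are illustrative, not the node's own option names.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}


def to_p_activity(value, unit):
    """Convert a concentration to its negative log10 in molar units (e.g. pIC50)."""
    if value <= 0:
        raise ValueError("concentration must be positive to take a logarithm")
    return -math.log10(value * UNIT_TO_MOLAR[unit])
```

For example, an IC50 of 10 nM corresponds to a pIC50 of approximately 8, and 1 uM corresponds to approximately 6. Values that are already log-transformed should not be transformed again.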

Model Settings

Fields
Specifies which fields to use for the RF, SVM or RVM models. At least one field must be selected. This option is not used by kNN models.
Maximum number of neighbors (k)
The maximum number of neighbors to consider (i.e. the largest value of k).
Similarity matrix method
The method used by kNN to calculate the similarity between the molecules.
  • field - Cresset's field/shape similarity, molecules must be pre-aligned
  • ECFP4 - 2D similarity based on Extended-Connectivity Fingerprints with a radius of 2
  • ECFP6 - 2D similarity based on Extended-Connectivity Fingerprints with a radius of 3
  • FCFP4 - 2D similarity based on Circular Pharmacophore Fingerprints with a radius of 2
  • FCFP6 - 2D similarity based on Circular Pharmacophore Fingerprints with a radius of 3
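
As an illustration of how a 2D similarity can be derived from circular fingerprints, the Python sketch below computes the Tanimoto coefficient between two fingerprints represented as sets of 'on' bit indices. The use of Tanimoto is an assumption: this description does not state which similarity metric is applied to the fingerprints. The bit indices are invented for the example.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Two empty fingerprints share nothing that can be compared.
        return 0.0
    return len(a & b) / union


# Invented bit indices; real fingerprints would come from a cheminformatics toolkit.
fp_query = {1, 5, 9, 12}
fp_train = {5, 9, 20}
similarity = tanimoto(fp_query, fp_train)
```

Here the fingerprints share two bits out of five distinct bits in total, giving a similarity of 0.4. Because no 3D information is involved, no pre-alignment is needed for this kind of comparison.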
Shape weight
The relative weight assigned to shape (as opposed to field) similarity. Values must be between 0.0 (all field) and 1.0 (all shape). The default is to use 50% shape / 50% fields.
Optimize pairwise alignments
If checked, the relative orientation of each pair of conformers is optimized by means of a simplex optimizer which rigidly rotates and translates one conformer with respect to the other to maximize the similarity score. Otherwise, the similarity value is computed from the fixed input orientations. Turning this option on reduces alignment noise, at the expense of increased computational cost and time.
Weighting method
Select the weighting method to use when averaging the activities of the closest neighbors. In 'Automatic' mode, all the weighting options are tried and the one that provides the best q2 value is chosen.
Number of trees
The number of trees in the Random Forest. Increasing this value will lead to a more robust model at the expense of longer training and prediction times.
Maximum no. optimizer iterations
This option controls how many iterations of global optimization are allowed in training the SVM and RVM models. Higher values have a possibility of finding better models, at the expense of a longer running time.
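
The idea of picking kNN settings by their q2 value can be sketched in Python as follows. The sketch assumes q2 is computed by leave-one-out cross-validation as 1 - PRESS / SS_total, and that every value of k from 1 up to the maximum is tried. Both are assumptions about the procedure rather than statements from this description, and the function names are hypothetical.

```python
def q2_leave_one_out(similarity, activities, k):
    """Leave-one-out q2 for similarity-weighted kNN on a square similarity matrix."""
    n = len(activities)
    mean_activity = sum(activities) / n
    ss_total = sum((a - mean_activity) ** 2 for a in activities)
    if ss_total == 0:
        raise ValueError("q2 is undefined when all activities are identical")

    press = 0.0  # predictive residual sum of squares
    for i in range(n):
        # Leave compound i out and rank the remaining compounds by similarity to it.
        others = [(similarity[i][j], activities[j]) for j in range(n) if j != i]
        others.sort(key=lambda pair: pair[0], reverse=True)
        neighbors = others[:k]
        weight = sum(s for s, _ in neighbors)
        if weight > 0:
            predicted = sum(s * a for s, a in neighbors) / weight
        else:
            predicted = sum(a for _, a in neighbors) / len(neighbors)
        press += (activities[i] - predicted) ** 2
    return 1.0 - press / ss_total


def best_k(similarity, activities, max_k):
    """Try k = 1..max_k and return the k with the highest q2, plus all scores."""
    scores = {k: q2_leave_one_out(similarity, activities, k) for k in range(1, max_k + 1)}
    return max(scores, key=scores.get), scores
```

The same loop could be extended to compare different weighting schemes: compute q2 for each candidate and keep the one with the highest value.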

Output

Add columns for QSAR descriptors
If checked, then additional columns are added to the output of RF, SVM and RVM models to include the QSAR descriptors capturing the field sample values for each molecule. Note that when this option is turned on, successor nodes cannot be configured until this node is executed.
Forge project format
Specifies the output format of the Forge project.
  • Model only - Creates a Forge project which only contains the model. This option creates a smaller project.
  • Molecules and model - Creates a complete Forge project which includes all the molecules and the model.

Input Ports

The molecules in the training set which will be used to build the model. All molecules must have activity data and must be pre-aligned unless you are generating a 2D kNN model.

Output Ports

The input molecules, with tags added for predicted activity. For kNN models the 'distance to model' and 'activity error' information is also added to the molecules.
The Forge project containing the generated model. The type of Forge project depends on the Forge project format option. The 'Forge Project Viewer' node may be used to view the model. The 'Forge Model Info' node may be used to extract data from the model.
The sample co-ordinates which were used to generate the model. This is only generated for RF, SVM and RVM models.

Views

This node has no views
