
02_Hyperparameter Optimization

Model Selection with Integrated Deployment

This workflow deploys an advanced parameter optimization protocol with four machine learning methods. In this implementation, the choice of features (fingerprints) and one hyperparameter per method are optimized. However, we encourage you to use this workflow as a template: if you have completely different data, customize it by including additional parameters in the optimization loop.

Parameter optimization is performed on 80% of the original dataset. The optimization loops are encapsulated in metanodes that carry the names of the machine learning methods. The model performances can be evaluated and the best model selected in the interactive view of the Pick best Model component. Finally, the selected model is scored using the 20% of the dataset that was not part of the optimization cycle, and the results are displayed with the Model Report component.
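To make the protocol concrete outside KNIME, here is a minimal Python sketch (scikit-learn, with synthetic data standing in for the fingerprint features) of the same idea: tune each method on the 80% training split only, pick the best-scoring model, and score it once on the held-out 20%. The two model families, the grids, and the AUC objective are illustrative assumptions, not the exact settings of the workflow.

```python
# Minimal sketch: optimize several model families on 80% of the data,
# pick the best one, and score it once on the held-out 20%.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the fingerprint features and activity labels.
X, y = make_classification(n_samples=800, n_features=64, weights=[0.8],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# One small grid search per method (illustrative grids, not the workflow's).
candidates = {
    "random_forest": GridSearchCV(
        RandomForestClassifier(random_state=42),
        {"min_samples_leaf": [3, 5, 7, 9, 11]}, scoring="roc_auc", cv=5),
    "logistic_regression": GridSearchCV(
        LogisticRegression(max_iter=1000),
        {"C": [0.01, 0.1, 1.0, 10.0]}, scoring="roc_auc", cv=5),
}
for name, search in candidates.items():
    search.fit(X_train, y_train)          # optimize on the 80% split only

# Pick the best cross-validated model, then score it on the untouched 20%.
best_name = max(candidates, key=lambda n: candidates[n].best_score_)
best_model = candidates[best_name].best_estimator_
test_auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])
print(best_name, "test AUC:", round(test_auc, 3))
```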

The dataset represents a subset of 844 compounds evaluated for activity against CDPK1. 181 compounds inhibited CDPK1 with an IC50 below 1 µM and are assigned the class "active".
More information is available at https://chembl.gitbook.io/chembl-ntd/#deposited-set-19-5th-march-2016-uw-kinase-screening-hits (see Set 19).
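If you rebuild the class column yourself from the deposited activity values, the thresholding is a single rule; the pandas sketch below is purely illustrative (the column names IC50_uM and class are assumptions, not taken from the ChEMBL-NTD file).

```python
import pandas as pd

# Hypothetical example: compounds with IC50 below 1 uM get the class "active".
df = pd.DataFrame({"compound_id": ["cpd_1", "cpd_2", "cpd_3"],
                   "IC50_uM": [0.3, 5.0, 0.9]})
df["class"] = (df["IC50_uM"] < 1.0).map({True: "active", False: "inactive"})
print(df)
```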

This workflow is a revised version of the original workflow: https://kni.me/w/-ATVMu9EmIURm8kr

Workflow annotations

02_Hyperparameter Optimization (exercise version)
This workflow deploys an advanced parameter optimization protocol with the Random Forest machine learning method. In this implementation, the choice of features (fingerprints) and one hyperparameter per method are optimized. This workflow serves to explain parameter optimization, but we encourage you to use the workflow that optimizes four different models and adjust it to your data. Parameter optimization is performed on 80% of the original dataset. The parameters leading to the highest enrichment factor on 5% of the data set are picked to build the best model. Finally, this model is scored using 20% of the dataset (which was not part of the optimization cycle).

The input data set comprises 2354 molecules binding the serotonin receptor. We have arbitrarily classified all compounds with a Ki value lower than 10 nM as active. This results in 751 actives and 1603 inactives. We computed five chemical fingerprints for the compounds using the RDKit nodes in KNIME Analytics Platform and provided them with the data set.

Canvas sections: "Data preprocessing and partition data" (select the column with class values and partition the data into a training set and a test set), "Finding the best parameters" (hyperparameters for optimization), and "Training the final model" (with the best parameters found in the previous step, using it to generate predictions on the test data).

Exercise steps:

1. Read in the data using a Table Reader node from the data folder (xxx.table, the result of the previous exercise). Use a workflow-relative path. Connect it to the component and the Partitioning node.

2. Use the Parameter Optimization Loop Start and define the following parameters (start, stop, and step size each; see the sketch after this list):
   - param_FPs: 0.0 / 4.0 / 1.0
   - min_leaf_size: 3.0 / 11.0 / 2.0
   - tree_depth: 15.0 / 15.0 / 2.0
   - forest_size: 200.0 / 500.0 / 100.0

3. Partition the data with the Partitioning node, and use a Random Forest Learner and a Predictor node to train random forest models with all the different parameters. Connect the flow variable with the parameters to the Learner node and set them in the Flow Variables tab.

4. Close the loop with the Parameter Optimization Loop End. Connect the upper data output port to the Table Row to Variable node in the next step.

5. Train the final Random Forest model with the data from the first Partitioning node. As usual, this follows the Learner/Predictor motif. Feed the best parameters into the Learner node using flow variables. Connect the output to the Compute Statistics component.

6. Write the model to the data folder using a workflow-relative path in the Model Writer node. It will be used in the next exercise to predict the activity of the compounds.

(Node and flow-variable labels from the canvas: Table Reader, Partitioning (80/20, random stratified), Rule Engine Variable, Math Formula (Variable), Table Row to Variable, Pick activity column, Compute Statistics, Visualize Statistics, Model Writer; action needed: pick the activity column and pick the objective function; write out the model to be used in the next exercise.)
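For readers who want to see the loop written out, the sketch below mirrors steps 2-6 in Python under stated assumptions: scikit-learn's RandomForestClassifier stands in for the KNIME Random Forest Learner (min_samples_leaf, max_depth, and n_estimators playing the roles of min_leaf_size, tree_depth, and forest_size), five synthetic feature matrices stand in for the precomputed fingerprints selected by param_FPs, and the objective is the enrichment factor in the top 5% of the ranking. None of the variable names come from the workflow itself.

```python
# Minimal sketch of steps 2-6 (assumptions noted in the text above).
import itertools
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: five "fingerprint" matrices and binary activity labels.
X_base, y = make_classification(n_samples=500, n_features=64, weights=[0.7],
                                random_state=0)
rng = np.random.default_rng(0)
fingerprints = [X_base + rng.normal(scale=0.1 * i, size=X_base.shape)
                for i in range(5)]

def enrichment_factor(y_true, scores, fraction=0.05):
    """Share of actives in the top-ranked fraction divided by the overall share."""
    n_top = max(1, int(round(fraction * len(y_true))))
    top = np.argsort(scores)[::-1][:n_top]
    return y_true[top].mean() / y_true.mean()

# Step 1 analogue: first partitioning, 80% for optimization, 20% held out.
idx = np.arange(len(y))
idx_opt, idx_test = train_test_split(idx, test_size=0.2, stratify=y,
                                     random_state=1)

# Step 2: the parameter grid (start/stop/step as in the loop start node).
grid = itertools.product(range(5),              # param_FPs      0..4
                         range(3, 12, 2),       # min_leaf_size  3..11
                         [15],                  # tree_depth     fixed at 15
                         range(200, 501, 100))  # forest_size    200..500

best_score, best_params = -np.inf, None
for fp, leaf, depth, trees in grid:
    # Step 3: inner partitioning plus Learner/Predictor with the current parameters.
    X = fingerprints[fp]
    tr, va = train_test_split(idx_opt, test_size=0.2, stratify=y[idx_opt],
                              random_state=42)
    model = RandomForestClassifier(n_estimators=trees, max_depth=depth,
                                   min_samples_leaf=leaf, random_state=42)
    model.fit(X[tr], y[tr])
    ef = enrichment_factor(y[va], model.predict_proba(X[va])[:, 1])
    # Step 4: the loop-end node keeps the best parameter set; here we do it by hand.
    if ef > best_score:
        best_score, best_params = ef, (fp, leaf, depth, trees)

# Step 5: retrain on the full 80% with the best parameters and check the 20%.
fp, leaf, depth, trees = best_params
final = RandomForestClassifier(n_estimators=trees, max_depth=depth,
                               min_samples_leaf=leaf, random_state=42)
final.fit(fingerprints[fp][idx_opt], y[idx_opt])
test_scores = final.predict_proba(fingerprints[fp][idx_test])[:, 1]
print("EF@5% on the held-out 20%:", enrichment_factor(y[idx_test], test_scores))

# Step 6: write the model out for the next exercise.
with open("random_forest_model.pkl", "wb") as fh:
    pickle.dump(final, fh)
```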
