
02_Hyperparameter Optimization

Model Selection with Integrated Deployment

This workflow deploys an advanced parameter optimization protocol with four machine learning methods. In this implementation, the choice of features (fingerprints) and one hyperparameter per method are optimized. However, we encourage you to use this workflow as a template if you have completely different data, and to customize it by including additional parameters in the optimization loop.
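A minimal Python sketch of this kind of optimization loop for one of the methods (Random Forest), with scikit-learn standing in for the KNIME nodes; the fingerprint names, grid values, and toy data are illustrative placeholders, not taken from the workflow:

```python
# Illustrative grid: every fingerprint is tried with every value of one
# Random Forest hyperparameter (number of trees); toy random data stands in
# for the real fingerprint matrices shipped with the workflow.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=200)                 # toy activity labels
fingerprints = {                                 # hypothetical fingerprint matrices
    "morgan": rng.integers(0, 2, size=(200, 1024)),
    "maccs": rng.integers(0, 2, size=(200, 167)),
}
n_trees_grid = [50, 100, 200]                    # one hyperparameter for Random Forest

best = None
for fp_name, X in fingerprints.items():
    for n_trees in n_trees_grid:
        model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
        score = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
        if best is None or score > best[0]:
            best = (score, fp_name, n_trees)

print("best (score, fingerprint, n_trees):", best)
```

The same loop structure repeats for each of the four methods, each with its own single hyperparameter grid.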

Parameter optimization is performed on 80% of the original dataset. The optimization loops are encapsulated in metanodes named after the machine learning methods. The model performances can be evaluated, and the best model selected, in the interactive view of the Pick best Model component. Finally, the selected model is scored on the 20% of the dataset that was not part of the optimization cycle, and the results are displayed with the Model Report component.
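A sketch of the surrounding protocol under the same assumptions (scikit-learn standing in for the KNIME Partitioning, Learner, and Predictor nodes; data and parameter values are toy placeholders):

```python
# 80% of the data feeds the optimization loop; the untouched 20% scores the
# selected model once at the end. The stratified split keeps the class ratio
# the same in both partitions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 1024))         # toy fingerprint matrix
y = rng.integers(0, 2, size=200)                 # toy activity labels

# 80/20 stratified partition: only the 80% block enters the optimization loop
X_opt, X_test, y_opt, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# ... run the optimization loop from the sketch above on X_opt / y_opt ...
best_model = RandomForestClassifier(n_estimators=200, random_state=0)
best_model.fit(X_opt, y_opt)                     # refit the winner on the 80% partition

# one-time evaluation on the held-out 20%
print(classification_report(y_test, best_model.predict(X_test)))
```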

The dataset is a subset of 844 compounds evaluated for activity against CDPK1. 181 of these compounds inhibited CDPK1 with an IC50 below 1 µM and are labeled "active".
More information is available at https://chembl.gitbook.io/chembl-ntd/#deposited-set-19-5th-march-2016-uw-kinase-screening-hits (see Set 19).
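For illustration, the labeling rule expressed in Python/pandas; the column name IC50_uM is a hypothetical placeholder, not the actual column name in the dataset:

```python
# Label compounds as "active" when IC50 is below 1 µM (toy values shown).
import pandas as pd

df = pd.DataFrame({"IC50_uM": [0.3, 2.5, 0.9, 12.0]})
df["class"] = ["active" if v < 1.0 else "inactive" for v in df["IC50_uM"]]
print(df)
```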

This workflow is a revised version of the original workflow: https://kni.me/w/-ATVMu9EmIURm8kr

Workflow canvas annotations:

1. Read data.

2. Data preprocessing and partitioning: select the column with class values and partition the data into a training set and a test set.

02_Hyperparameter Optimization: This workflow deploys an advanced parameter optimization protocol with the Random Forest machine learning method. In this implementation the choice of features (fingerprints) and one hyperparameter per method are optimized. This workflow serves to explain the parameter optimization, but we encourage you to use the workflow optimizing four different models and adjust it to your data. Parameter optimization is performed on 80% of the original dataset. The parameters leading to the highest enrichment factor on 5% of the data set are picked to build the best model. Finally, this model is scored using the 20% of the dataset that was not part of the optimization cycle. The input data set comprises 2354 molecules binding the serotonin receptor. We have arbitrarily classified all compounds with a Ki value lower than 10 nM as active, which results in 751 actives and 1603 inactives. We computed five chemical fingerprints for the compounds using the RDKit nodes in KNIME Analytics Platform and provided them with the data set.

3. Finding the best parameters. Possible additional hyperparameters: minimum number of members in each leaf (min_leaf_size): min 3.0, max 11.0, step 2.0; number of split levels of each individual tree (tree_depth): min 10.0, max 20.0, step 2.0.

4. Training the final model with the best parameters found in the previous step and using it to generate predictions on the test data.

Other canvas labels: Hyperparameter for Optimization; Brute Force; Fingerprint Choice; Number of Trees; min leaf size; FPs; 80/20 random stratified; stratified 80-20; action needed: pick activity column, pick objective function; Write out model to be used in the next exercise; Save the test set; Compute Statistics.

Nodes on the canvas: Table Reader, Partitioning, Rule Engine (Variable), Math Formula (Variable), Parameter Optimization Loop Start, Parameter Optimization Loop End, Random Forest Learner, Random Forest Predictor, Table Row to Variable, Pick activity column, Visualize Statistics, Model Writer, Table Writer.
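The canvas annotation picks the parameters by the enrichment factor on the top 5% of the ranked compounds. A minimal sketch of that metric (the function and the toy inputs are illustrative, not part of the workflow):

```python
# Enrichment factor: hit rate among the top-ranked fraction of compounds
# divided by the hit rate in the whole set.
import numpy as np

def enrichment_factor(y_true, scores, fraction=0.05):
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    n_top = max(1, int(round(fraction * len(y_true))))
    top_idx = np.argsort(scores)[::-1][:n_top]   # indices of highest-scored compounds
    return y_true[top_idx].mean() / y_true.mean()

# toy usage: 3 actives out of 8 compounds, best-scored compound is active -> EF ~ 2.67
print(enrichment_factor([1, 0, 1, 0, 0, 1, 0, 0],
                        [0.9, 0.2, 0.8, 0.1, 0.3, 0.7, 0.4, 0.05]))
```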
