03. Advanced Machine Learning Chemistry

"Advanced Machine Learning Chemistry" exercise for the advanced Life Science User Training. It covers:
- Training a Random Forest model to predict a nominal target column
- Evaluating the performance of a classification model
- Optimizing parameters of the Random Forest model
- Performing the classification multiple times in a cross validation loop



Machine_Learning

This workflow demonstrates model building for a bioactivity data set with a Random Forest learner and binary fingerprints. The data set represents a subset of 844 compounds evaluated for activity against CDPK1. 181 compounds inhibited CDPK1 with IC50 below 1 µM and have "active" as their class. More information is available at https://www.ebi.ac.uk/chemblntd/#tcams_dataset (see Set 19).

Activity I: Random Forest
- Partition the activity data after the RDKit Fingerprint node into training and test sets (stratified sampling).
- Train a random forest on the training set to predict activity.
- Use the trained model to predict the activity on the test set.
- Evaluate the quality of the model with the Scorer node.

Step 1. Read training data: TCAMS_CDPK1_subset_ML.table
Step 2. Compute Morgan fingerprints with radius 2 using the RDKit Fingerprint node. Hint: check the Advanced tab.
Step 3. Split data into training and test sets using the Partitioning node. Make sure to perform stratified sampling.
Step 4. Build the model using the Random Forest Learner node and check its performance on the test set with the Random Forest Predictor node.
Step 5. Collect statistics for training and test sets using Scorer nodes.

Activity II: Parameter Optimization
- Add a parameter optimization loop to your model training process.
- Use Hillclimbing to determine the optimum number of models (min=10, max=200, step=10, int=yes).
- Use maximum accuracy as the objective value.
- What is the best number of models? (Hint: don't forget to use the flow variable in the Random Forest Learner node.)

Step 1. Use the Parameter Optimization Loop Start node with Hillclimbing to determine the optimum number of models (min=10, max=200, step=10, int=yes). Name the parameter: nr_models
Step 2. Connect the upper output port of the Partitioning node with the Random Forest Learner node. Connect the flow variable to the Random Forest Learner node and assign your flow variable to nrModels in the Flow Variable tab. Use the Random Forest Predictor node to predict your test data.
Step 3. Use the Scorer node to evaluate your model. First column is "activity" and second column is "Prediction(activity)". Connect the flow variable port of the Scorer node to the Parameter Optimization Loop End node and choose the objective function value "accuracy".
Step 4. Connect the upper output port of the Parameter Optimization Loop End with the upper input port of the component Run Best Model + Scorer View. Connect the upper output port of the Partitioning node with the middle input port of the component Run Best Model + Scorer View. Connect the lower output port of the Partitioning node with the lower input port of the component Run Best Model + Scorer View.

Activity III: Cross Validation
- Create a 10-fold cross validation for your model.
- Take a look at the error rates produced by the different iterations. Does the model seem stable?

Step 1. Use the X-Partitioner node with 10 validation sets and stratified sampling based on the activity column.
Step 2. Use the Random Forest Learner node to train your model. Use the Random Forest Predictor node to predict your test data.
Step 3. Use the X-Aggregator node to close your loop. Use the Scorer node to evaluate your model. First column is "activity" and second column is "Prediction(activity)".

Nodes in the workflow: Table Reader (TCAMS_CDPK1_subset_ML.table), RDKit Fingerprint, Partitioning, Run Best Model + Scorer View, Scorer (JavaScript)
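
For readers who want to see the ideas behind the nodes outside of KNIME, the following is a minimal Python sketch, not the implementation of any KNIME node. It illustrates stratified partitioning, stratified 10-fold splitting, accuracy scoring, and a simple hill climb over the number of models using only the standard library. The random forest itself is deliberately left out. The function names, the 80/20 split fraction, the fixed start value, and the stand-in objective are all assumptions made for illustration; only the class counts (181 active out of 844) and the search range (min=10, max=200, step=10) come from the exercise text.

```python
import random
from collections import Counter, defaultdict


def _indices_by_class(labels, rng):
    """Group row indices by class label and shuffle each group."""
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)
    for indices in groups.values():
        rng.shuffle(indices)
    return groups


def stratified_partition(labels, train_fraction, seed=0):
    """Split row indices so each class keeps roughly its original proportion."""
    train, test = [], []
    for indices in _indices_by_class(labels, random.Random(seed)).values():
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


def stratified_folds(labels, k, seed=0):
    """Deal the rows of each class round-robin into k validation folds."""
    folds = [[] for _ in range(k)]
    for indices in _indices_by_class(labels, random.Random(seed)).values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def accuracy(actual, predicted):
    """Fraction of rows whose predicted class equals the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def hill_climb(objective, low, high, step, start):
    """Move to the better neighbouring grid value until no neighbour improves."""
    current, best = start, objective(start)
    while True:
        neighbours = [v for v in (current - step, current + step) if low <= v <= high]
        if not neighbours:
            return current, best
        top_score, top_value = max((objective(v), v) for v in neighbours)
        if top_score <= best:
            return current, best
        current, best = top_value, top_score


# Class counts from the exercise text: 181 active out of 844 compounds.
labels = ["active"] * 181 + ["inactive"] * 663

# Assumption: an 80/20 split; the exercise does not prescribe the fraction.
train_idx, test_idx = stratified_partition(labels, train_fraction=0.8)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)

# Ten stratified validation folds, as in Activity III.
folds = stratified_folds(labels, k=10)
fold_actives = [sum(labels[i] == "active" for i in fold) for fold in folds]

# Accuracy of always predicting "inactive", useful when reading Scorer output.
baseline = accuracy(labels, ["inactive"] * len(labels))


def stand_in_objective(nr_models):
    """Hypothetical placeholder score; a real run would train and score a model."""
    return -abs(nr_models - 120)


best_nr_models, best_score = hill_climb(stand_in_objective, 10, 200, 10, start=10)
```

With these assumed settings, the training partition holds 145 active and 530 inactive compounds and the test partition 36 and 133, and every fold contains 18 or 19 actives. In a real run, `stand_in_objective` would be replaced by a function that trains a forest with the given number of models and returns the test accuracy, which is the role the Scorer's flow variable plays in the loop.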