
03. Advanced Machine Learning Chemistry - solution

Advanced Machine Learning - Chemistry Solution

Solution to the "Advanced Machine Learning Chemistry" exercise for the advanced Life Science User Training
- Training a Random Forest model to predict a nominal target column
- Evaluating the performance of a classification model
- Optimizing parameters of the Random Forest model
- Performing the classification multiple times in a cross validation loop

Machine Learning
This workflow demonstrates model building for a bioactivity data set with a Random Forest learner and binary fingerprints. The data set is a subset of 844 compounds evaluated for activity against CDPK1; 181 compounds inhibited CDPK1 with an IC50 below 1 µM and carry the class "active". More information is available at https://www.ebi.ac.uk/chemblntd/#tcams_dataset (see Set 19).

Activity I: Random Forest
- Partition the activity data after the RDKit Fingerprint node into training and test sets (stratified sampling).
- Train a random forest on the training set to predict activity.
- Use the trained model to predict the activity on the test set.
- Evaluate the quality of the model with the Scorer node.

Step 1. Read the training data: TCAMS_CDPK1_subset_ML.table.
Step 2. Compute Morgan fingerprints with radius 2 using the RDKit Fingerprint node. Hint: check the Advanced tab.
Step 3. Split the data into training and test sets using the Partitioning node. Make sure to perform stratified sampling.
Step 4. Build the model using the Random Forest Learner node and check its performance on the test set with the Random Forest Predictor node.
Step 5. Collect statistics for the training and test sets using Scorer nodes.

Activity II: Parameter Optimization
- Add a parameter optimization loop to your model training process.
- Use Hillclimbing to determine the optimum number of models (min=10, max=200, step=10, int=yes).
- Use maximum accuracy as the objective value.
- What is the best number of models? (Hint: don't forget to use the flow variable in the Random Forest Learner node.)

Step 1. Use the Parameter Optimization Loop Start node with Hillclimbing to determine the optimum number of models (min=10, max=200, step=10, int=yes).
Step 2. Connect the upper output port of the Partitioning node with the Random Forest Learner node. Connect the flow variable to the Random Forest Learner node and assign it to nrModels in the Flow Variables tab. Use the Random Forest Predictor node to predict your test data.
Step 3. Use the Scorer node to evaluate your model: the first column is "activity" and the second column is "Prediction (activity)". Connect the flow variable port of the Scorer node to the Parameter Optimization Loop End node and choose "accuracy" as the objective function value.
Step 4. Connect the upper output port of the Parameter Optimization Loop End node with the upper input port of the component Run Best Model + Scorer View. Connect the upper output port of the Partitioning node with the middle input port of the component, and the lower output port of the Partitioning node with the lower input port of the component.

Activity III: Cross Validation
- Create a 10-fold cross-validation for your model.
- Take a look at the error rates produced by the different iterations. Does the model seem stable?

Step 1. Use the X-Partitioner node with 10 validation sets and stratified sampling based on the activity column.
Step 2. Use the Random Forest Learner node to train your model. Use the Random Forest Predictor node to predict your test data.
Step 3. Use the X-Aggregator node to close your loop. Use the Box Plot node to evaluate your model: connect the lower output port of the X-Aggregator node with the Box Plot node and select the "Error in %" column.
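For reference outside KNIME, the following is a minimal Python sketch of Activity I using RDKit and scikit-learn. It assumes the .table data has been exported to a CSV with a "SMILES" column and the "activity" class column; the file name, column names, 2048-bit fingerprint length, and 80/20 split ratio are assumptions, not settings taken from the workflow.

    # Minimal sketch of Activity I (not part of the KNIME workflow).
    import numpy as np
    import pandas as pd
    from rdkit import Chem
    from rdkit.Chem import AllChem
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, confusion_matrix

    df = pd.read_csv("TCAMS_CDPK1_subset_ML.csv")  # hypothetical CSV export of the .table file

    def morgan_fp(smiles, radius=2, n_bits=2048):
        # Morgan fingerprint with radius 2, matching the RDKit Fingerprint node setting
        mol = Chem.MolFromSmiles(smiles)
        return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits))

    X = np.vstack([morgan_fp(s) for s in df["SMILES"]])
    y = df["activity"].to_numpy()

    # Stratified split, mirroring the Partitioning node with stratified sampling
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    # Random Forest Learner / Predictor equivalent
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)

    # Scorer equivalent: accuracy and confusion matrix on the test set
    print("Test accuracy:", accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred))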
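Activity II's parameter search can be sketched in the same way. KNIME's Parameter Optimization loop uses Hillclimbing; the plain grid sweep below over the same range (10 to 200 models, step 10) is a simpler stand-in, not the same search strategy, and it reuses X_train, X_test, y_train, y_test from the previous sketch.

    # Sweep the number of models and keep the setting with maximum test accuracy.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    best_n, best_acc = None, -1.0
    for n_models in range(10, 201, 10):        # min=10, max=200, step=10, integers only
        clf = RandomForestClassifier(n_estimators=n_models, random_state=42)
        clf.fit(X_train, y_train)
        acc = accuracy_score(y_test, clf.predict(X_test))
        if acc > best_acc:                     # objective value: maximum accuracy
            best_n, best_acc = n_models, acc

    print(f"Best number of models: {best_n} (accuracy {best_acc:.3f})")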
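For Activity III, a 10-fold stratified cross-validation analogue, reusing X and y from the first sketch; the per-fold error in percent mirrors the "Error in %" column produced by the X-Aggregator node and plotted with the Box Plot node.

    # 10-fold stratified cross-validation with per-fold error rates.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import accuracy_score

    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    errors = []
    for train_idx, test_idx in skf.split(X, y):
        clf = RandomForestClassifier(n_estimators=100, random_state=42)
        clf.fit(X[train_idx], y[train_idx])
        acc = accuracy_score(y[test_idx], clf.predict(X[test_idx]))
        errors.append(100.0 * (1.0 - acc))     # per-fold "Error in %"

    print("Error in % per fold:", np.round(errors, 2))
    print("Mean: %.2f  Std: %.2f" % (np.mean(errors), np.std(errors)))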
