03. Advanced Machine Learning Chemistry

Advanced Machine Learning - Chemistry

"Advanced Machine Learning Chemistry" exercise for the advanced Life Science User Training, covering:
- Training a Random Forest model to predict a nominal target column
- Evaluating the performance of a classification model
- Optimizing parameters of the Random Forest model
- Performing the classification multiple times in a cross validation loop
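
To make the evaluation objective concrete, the following is a minimal sketch of the kind of statistics a scorer derives from an actual column and a predicted column: accuracy, error rate, and Cohen's kappa. It is plain Python, not the KNIME Scorer node itself; the function name `score` and the eight toy labels are assumptions made up for illustration.

```python
from collections import Counter


def score(actual, predicted):
    """Return accuracy, error rate and Cohen's kappa for two label lists."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equally long")
    n = len(actual)
    # Observed agreement: fraction of rows where prediction equals the label.
    accuracy = sum(a == p for a, p in zip(actual, predicted)) / n
    # Agreement expected by chance, from the marginal class frequencies.
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    expected = sum(actual_counts[c] * predicted_counts[c] for c in actual_counts) / (n * n)
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else 0.0
    return {"accuracy": accuracy, "error": 1 - accuracy, "kappa": kappa}


# Toy data (hypothetical): 2 active and 6 inactive compounds.
actual = ["active", "active"] + ["inactive"] * 6
predicted = ["active", "inactive"] + ["inactive"] * 5 + ["active"]
stats = score(actual, predicted)
print(stats)
```

Kappa is worth reading next to accuracy when classes are imbalanced, because a model that mostly predicts the larger class can reach a high accuracy while agreeing with the labels little more than chance would.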



Machine_Learning

This workflow demonstrates model building for a bioactivity data set with the Random Forest learner and binary fingerprints. The data set represents a subset of 844 compounds evaluated for activity against CDPK1. 181 compounds inhibited CDPK1 with IC50 below 1 uM and have "active" as their class. More information is available at https://www.ebi.ac.uk/chemblntd/#tcams_dataset (see Set 19).

Activity I: Random Forest
- Partition the activity data after the RDKit Fingerprint node into training and test sets (stratified sampling).
- Train a random forest on the training set to predict activity.
- Use the trained model to predict the activity on the test set.
- Evaluate the quality of the model with the Scorer node.

Step 1. Read the training data: TCAMS_CDPK1_subset_ML.table
Step 2. Compute Morgan fingerprints with radius 2 using the RDKit Fingerprint node. Hint: check the Advanced tab.
Step 3. Split the data into training and test sets using the Partitioning node. Make sure to perform stratified sampling.
Step 4. Build the model using the Random Forest Learner node and check its performance on the test set with the Random Forest Predictor node.
Step 5. Collect statistics for the training and test sets using Scorer nodes.

Activity II: Parameter Optimization
- Add a parameter optimization loop to your model training process.
- Use Hillclimbing to determine the optimum number of models (min=10, max=200, step=10, int=yes).
- Use maximum accuracy as the objective value.
- What is the best number of models? (Hint: don't forget to use the flow variable in the Random Forest Learner node.)

Step 1. Use the Parameter Optimization Loop Start node with Hillclimbing to determine the optimum number of models (min=10, max=200, step=10, int=yes). Name the parameter: nr_models
Step 2. Connect the upper output port of the Partitioning node with the Random Forest Learner node. Connect the flow variable to the Random Forest Learner node and assign your flow variable to nrModels in the Flow Variable tab. Use the Random Forest Predictor node to predict your test data.
Step 3. Use the Scorer node to evaluate your model. The first column is "activity" and the second column is "Prediction(activity)". Connect the flow variable port of the Scorer node to the Parameter Optimization Loop End node and choose the objective function value "accuracy".
Step 4. Connect the upper output port of the Parameter Optimization Loop End node with the upper input port of the component Run Best Model + Scorer View. Connect the upper output port of the Partitioning node with the middle input port of the component. Connect the lower output port of the Partitioning node with the lower input port of the component.

Activity III: Cross Validation
- Create a 10-fold cross validation for your model.
- Take a look at the error rates produced by the different iterations. Does the model seem stable?

Step 1. Use the X-Partitioner node with 10 validation sets and stratified sampling based on the activity column.
Step 2. Use the Random Forest Learner node to train your model. Use the Random Forest Predictor node to predict your test data.
Step 3. Use the X-Aggregator node to close your loop. Use the Scorer node to evaluate your model. The first column is "activity" and the second column is "Prediction(activity)".
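
The parameter search in Activity II can be pictured with a short sketch of generic hill climbing over an integer parameter with the bounds given there (min=10, max=200, step=10). This is an illustration in plain Python, not the exact strategy used inside the Parameter Optimization Loop nodes. The objective `toy_accuracy` is a made-up stand-in for "train a random forest with `nr_models` trees and return the test accuracy", and the start value of 50 is likewise an assumption.

```python
def hill_climb(objective, start, lower=10, upper=200, step=10):
    """Greedy hill climbing: move to the better neighbour until none improves."""
    current = start
    current_score = objective(current)
    while True:
        # Candidate moves: one step down and one step up, kept inside the bounds.
        neighbours = [n for n in (current - step, current + step) if lower <= n <= upper]
        best = max(neighbours, key=objective)
        best_score = objective(best)
        if best_score <= current_score:
            # No neighbour is strictly better: a (local) optimum is reached.
            return current, current_score
        current, current_score = best, best_score


def toy_accuracy(nr_models):
    """Hypothetical accuracy curve with a single peak at 120 models."""
    return 0.85 - ((nr_models - 120) / 1000) ** 2


best_nr_models, best_accuracy = hill_climb(toy_accuracy, start=50)
print(best_nr_models, best_accuracy)
```

Hill climbing only guarantees a local optimum. With a real, noisy accuracy curve the result can depend on the start value, so it is worth rerunning the search from more than one starting point before answering the "best number of models" question.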
Workflow nodes: Table Reader, RDKit Fingerprint, Partitioning, Random Forest Learner, Random Forest Predictor, Scorer, Scorer (JavaScript), Parameter Optimization Loop Start, Parameter Optimization Loop End, X-Partitioner, X-Aggregator, Box Plot, ROC Curve, Row Filter; component: Run Best Model + Scorer View.
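
The stratified 10-fold cross validation of Activity III can be sketched in plain Python as follows. This is not the X-Partitioner or X-Aggregator implementation: the helper names `stratified_folds` and `cross_validate` are hypothetical, and a majority-class predictor stands in for the random forest so that the example needs no external libraries. The class counts (181 active, 663 inactive) come from the data set description.

```python
import random
import statistics
from collections import Counter, defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign row indices to k folds so each class is spread evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    offset = 0  # continue dealing where the previous class stopped
    for indices in by_class.values():
        rng.shuffle(indices)
        for j, idx in enumerate(indices):
            folds[(offset + j) % k].append(idx)
        offset += len(indices)
    return folds


def cross_validate(labels, k=10, seed=0):
    """Return one error rate per fold for a majority-class stand-in model."""
    error_rates = []
    for test_idx in stratified_folds(labels, k, seed):
        held_out = set(test_idx)
        train_labels = [lab for i, lab in enumerate(labels) if i not in held_out]
        # Stand-in "model": always predict the most frequent training class.
        majority = Counter(train_labels).most_common(1)[0][0]
        errors = sum(labels[i] != majority for i in test_idx)
        error_rates.append(errors / len(test_idx))
    return error_rates


labels = ["active"] * 181 + ["inactive"] * 663
error_rates = cross_validate(labels)
print("mean error:", statistics.mean(error_rates))
print("spread (std dev):", statistics.pstdev(error_rates))
```

Because the folds are stratified, this baseline has nearly the same error rate in every fold, close to the share of active compounds. A real model looks stable when its per-fold error rates are similarly tight, and it is useful only if they sit clearly below this baseline.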
