
xgboost parameter tuning (maximise ROC) using Bayes Optimization

Workflow

xgboost parameter tuning and handling large datasets
Tags: xgboost, Handling large datasets, ROC, Parameter optimization, Cross-validation, Feature Engineering, auc, Bayesian optimization
Last amended: 9th Nov, 2019
Problem: Kaggle, Santander Customer Transaction Prediction
Ref: https://www.kaggle.com/c/santander-customer-transaction-prediction/overview

This workflow covers:
1. Handling large datasets in KNIME (set the Memory Policy for every node)
2. Feature engineering
3. ROC curve
4. XGBoost classification
5. Parameter tuning & cross-validation (maximise AUC)

About RAM usage: RAM usage depends on the data as well as on the number of processing nodes; the more nodes used for processing, the higher the RAM usage. To minimise RAM use, set the Memory Policy of nodes that handle large datasets to 'Write tables to disc'. Decide the sample size according to your time and RAM.

The ROC widget at Node 24 calculates AUC from the stacked probability data (prob(target=1)) for all three folds together, so the number of rows is three times that of the test data coming from the X-Partitioner. In a way, this AUC value can be seen as an average of the three AUCs that would result if each fold's probability data were scored separately. Fold-wise AUC is calculated manually in a separate widget, and only for the last optimization iteration of the loop.
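The distinction between the stacked AUC at Node 24 and the fold-wise AUCs can be sketched in plain Python. This is a minimal illustration, not part of the workflow: the probability/label pairs below are made up, and the rank-based AUC formula is a stand-in for what the ROC Curve node computes.

```python
from typing import List, Tuple

def auc(pairs: List[Tuple[float, int]]) -> float:
    """AUC via the rank (Mann-Whitney) formulation: the probability
    that a random positive scores above a random negative."""
    pos = [p for p, y in pairs if y == 1]
    neg = [p for p, y in pairs if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count half
    return wins / (len(pos) * len(neg))

# Hypothetical (prob(target=1), target) pairs for three folds.
folds = [
    [(0.9, 1), (0.8, 1), (0.3, 0), (0.2, 0)],
    [(0.7, 1), (0.6, 0), (0.4, 1), (0.1, 0)],
    [(0.95, 1), (0.5, 0), (0.45, 1), (0.05, 0)],
]

# Node 24 style: one AUC over all folds stacked together.
stacked_auc = auc([pair for fold in folds for pair in fold])

# Fold-wise AUCs, scored separately and then averaged.
per_fold = [auc(fold) for fold in folds]
mean_fold_auc = sum(per_fold) / len(per_fold)

print(stacked_auc, mean_fold_auc)
```

Note that the two numbers are close but not identical in general, which is why the workflow computes the fold-wise AUCs explicitly rather than relying on the stacked value alone.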
Get AUC value per fold. Test data prediction: this AUC will mostly be less than the best value predicted above.

Node annotations (from the workflow diagram): Memory Policy 'Write tables to disc'; filter ID_code; Feature Engineering (generate new aggregate features); min-max normalization; 80:20 partitioning; append individual class probabilities; Node 24 (AUC considering data for all folds); sample 40% of data; 'target' to string; set parameter ranges; maximise Area Under Curve; accept default configuration; k-folds: 3; add column with 'fold id'; filter for fold #0, 1 & 2; get ROC for each fold #.

Nodes used: CSV Reader, Column Filter, Column Aggregator, Normalizer, Partitioning, XGBoost Predictor, ROC Curve (local), Row Sampling, Number To String, Parameter Optimization Loop Start, Parameter Optimization Loop End, Table Row to Variable, Normalizer, XGBoost Tree Ensemble Learner, X-Partitioner, X-Aggregator, ROC Curve (local), XGBoost Tree Ensemble Learner, XGBoost Predictor, Timer Info, Row Filter, ROC Curve (local), Table Row to Variable.
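The overall loop structure (set parameter ranges → train with k-fold cross-validation → maximise mean AUC → keep the best parameters) can be sketched outside KNIME as well. The sketch below uses seeded random search as a simple stand-in for the Bayesian optimization strategy of the Parameter Optimization Loop Start node, and `cv_mean_auc` is a toy stub: in the real workflow it would train an XGBoost Tree Ensemble Learner on k-1 folds, predict the held-out fold, and average the resulting AUCs via X-Partitioner/X-Aggregator. The parameter names and ranges are illustrative assumptions, not the workflow's actual settings.

```python
import random

# Hypothetical search ranges, mirroring the "set parameter ranges" node.
PARAM_RANGES = {
    "eta":       (0.01, 0.3),
    "max_depth": (2, 10),
    "subsample": (0.5, 1.0),
}

def sample_params(rng: random.Random) -> dict:
    """Draw one candidate parameter set from the search ranges."""
    return {
        "eta":       rng.uniform(*PARAM_RANGES["eta"]),
        "max_depth": rng.randint(*PARAM_RANGES["max_depth"]),
        "subsample": rng.uniform(*PARAM_RANGES["subsample"]),
    }

def cv_mean_auc(params: dict) -> float:
    """Stub objective. The real workflow would run 3-fold CV with an
    XGBoost learner here and return the mean AUC across folds.
    This toy surrogate simply peaks near a made-up optimum."""
    return 1.0 - (abs(params["eta"] - 0.1)
                  + abs(params["max_depth"] - 6) / 10
                  + abs(params["subsample"] - 0.8))

rng = random.Random(42)
best_params, best_auc = None, float("-inf")
for _ in range(50):  # the parameter-optimization loop
    params = sample_params(rng)
    score = cv_mean_auc(params)
    if score > best_auc:
        best_params, best_auc = params, score

print(best_params, best_auc)
```

After the loop ends, the workflow retrains on the 80:20 split with the winning parameters and scores the held-out test data, which is why that final AUC is usually a little lower than the best cross-validated value found during optimization.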

Download

Get this workflow from the following link: Download

Nodes

xgboost parameter tuning (maximise ROC) using Bayes Optimization consists of the following 23 nodes:

Plugins

xgboost parameter tuning (maximise ROC) using Bayes Optimization contains nodes provided by the following 3 plugins: