

Workflow sections: Data Exploration, Data Preparation, Log Loss Computation (model testing), Parameter Optimization (model optimization), Testing (model testing), Feature Selection (data preparation).

This workflow predicts the probability that a given person has diabetes, based on some of their biological values. The pipeline is divided into 8 main parts that represent the main phases of the project. The sections run in succession, and each step is required for the computation of the ones that follow.

1. Data Import: the dataset is imported as an Excel file through the Excel Reader node.
2. Data Exploration: the nodes in this section extract and visualize relevant information about the data, such as outliers (Box Plot node) and statistical measures (Statistics node).
3. Data Preparation: in this step we applied transformations to the data so that the dataset becomes a better input for the model. In particular, from the early stages of the pipeline, we split the dataset into a Training set and a Test set and ran the operations on them separately; in this way we avoided data leakage. The Training set was later used to train the model and to select the best features and parameters. The Test set was used as a hold-out set to test the final model.
4. Feature Selection: we tested various models and concluded that the best model in terms of minimum log loss is the Gradient Boosted Trees model. The goal of this step is to find the feature set that returns the minimal log loss; the loss is computed inside the Log Loss metanode and passed as the score measure to the Feature Selection Loop End node. A cross-validation loop with k = 3 evaluates the model on unseen data, so that the selected features generalize better.
5. Parameter Optimization: this is the model optimization step.
Here we created a loop that iterates over different parameter values (search strategy: Hill Climbing), which are then provided to the training model as flow variables. In particular, we created two variables, Depth and N_Models, which range from 1 to 10 and from 1 to 500 respectively, with a step of 1. They substitute the values maxLevels and modelsNum needed by the model's implementation. The best parameters are the ones that yield the lowest log loss. Here too, we applied a cross-validation loop to stabilize the choice of the model's parameters.
6. Hold-out Data Filtering: here we apply the last transformation to the hold-out set, filtering its columns based on the result of the Feature Selection Filter.
7. Testing: this step is part of the model testing phase. Here we run the model one last time and test it on the hold-out data retained in the previous steps. The final confusion matrix and accuracy are also computed.
8. Evaluation (Log Loss): here we compute the final log loss and the ROC curve.

Additional sections: Hold-out Data Filtering, Data Import.

Node annotations (as shown on the workflow canvas):
- Data import.
- Data overview: useful to have a first look at the raw data.
- Some attributes (BMI, MentHlth, ...) contain outliers.
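The leakage-free discipline of step 3 (fit every preprocessing model on the training set only, then apply it unchanged to the test set) can be sketched in plain Python. The Min-Max normalization below mirrors the Normalizer / Normalizer (Apply) node pair, with toy data standing in for the diabetes dataset:

```python
import random

def minmax_fit(rows):
    """Learn per-column (min, max) from the training rows only."""
    cols = list(zip(*rows))
    return [(min(c), max(c)) for c in cols]

def minmax_apply(rows, params):
    """Scale rows to [0, 1] using the training-set parameters."""
    scaled = []
    for row in rows:
        scaled.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, (lo, hi) in zip(row, params)
        ])
    return scaled

# Toy data and an 80% / 20% split, as in the Partitioning node.
random.seed(0)
data = [[random.uniform(10, 60), random.uniform(0, 30)] for _ in range(10)]
cut = int(0.8 * len(data))
train, test = data[:cut], data[cut:]

params = minmax_fit(train)                # fit on the training set only
train_scaled = minmax_apply(train, params)
test_scaled = minmax_apply(test, params)  # test set never influences the parameters
```

The same fit-then-apply pattern holds for the Missing Value / Missing Value (Apply) and Numeric Outliers / Numeric Outliers (Apply) pairs in the workflow.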
- Outliers are removed only from the attribute BMI; removing them from other columns would risk considering only certain kinds of people.
- Data overview after the previous steps.
- Starts the loop to find the best features to train the model.
- Confusion matrix of the last model tested (max accuracy).
- Returns the model's log loss on the feature set tested.
- Output: the feature set that returned the best (minimum) log loss. This is given as input to the Partitioning node in the next step.
- Changed the type of the attribute Diabetes from Integer to String.
- Training model for the column Diabetes; trying different feature sets.
- Predicts the values of the attribute Diabetes, given the model trained on the feature set being evaluated.
- Predicts the values of the attribute Diabetes, given the model trained on the selected features and the parameters currently looping.
- Confusion matrix of the last model tested (max accuracy).
- The model is trained with different parameters.
- Loop on the flow variables Depth and Num_model to determine the best values for the model.
- Returns the best parameters (minimum log loss).
- Computes the log loss for each entry.
- Final model's log loss. y: a binary indicator (0 or 1) of whether class label c is the correct classification for observation o.
- Retrieves the best parameters from the loop.
- The model is trained on different sets.
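The per-entry log loss and the averaged final score that the Logloss metanode computes can be sketched in Python (a minimal illustration with the binary indicator y described above, not the metanode's actual implementation; the clipping constant eps is an assumption to avoid log(0)):

```python
import math

def entry_log_loss(y, p, eps=1e-15):
    """Log loss for one observation: y is the 0/1 indicator of the
    positive class, p the predicted probability of that class."""
    p = min(max(p, eps), 1.0 - eps)  # clip so log() is always defined
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def mean_log_loss(labels, probs):
    """Final score: average of the per-entry losses."""
    return sum(entry_log_loss(y, p) for y, p in zip(labels, probs)) / len(labels)

# Hypothetical predictions: confident correct answers give a low loss.
labels = [1, 0, 1, 1]
probs  = [0.9, 0.2, 0.6, 0.99]
score = mean_log_loss(labels, probs)  # roughly 0.21
```

Both the Feature Selection Loop End and the Parameter Optimization Loop End receive this averaged value as the score to minimize.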
- Diabetes values are predicted.
- Final workflow's accuracy and confusion matrix.
- Correlation matrix between all variables.
- The column Prediction (Diabetes) needs to be of type String for the ROC Curve.
- The selected features are evaluated on the log loss value (minimum log loss).
- Parameter optimization is also evaluated on log loss minimization.
- Merged the two columns Veggies and Fruits into one column called healthy_food.
- Removed the columns Fruits and Veggies.
- ROC Curve.
- Save the best model based on the previous steps.
- Replace missing values with their most frequent value whenever necessary.
- 80% Training set, 20% Test set.
- Apply the missing values removal model to the test set.
- Apply the outliers removal model to the test set.
- Same as the Column Merger above, used on the training set.
- Same as the Column Filter above, used on the training set.
- Filter the test sets, including only the relevant features.
- Start of a cross-validation loop with k = 3.
- Collect the results of the cross-validation loop.
- Start of a cross-validation loop with k = 3.
- Collect the results of the cross-validation loop.
- From collection to flow variable.
- Extract the column headers from the table returned by the Feature Selection Filter.
- Create the headers collection.
- Removes duplicate rows, if any.
- Data normalization (Min-Max normalization).
- Apply normalization based on the normalization parameters defined above.
- Data oversampling to deal with class imbalance.
- Write the missing data model that will be used in the Data App.
- Write the outliers model that will be used in the Data App.
- Write data for the Deployment tab in the Data App.
- Write the normalization model that will be used in the Data App.
- Write data for the Training tab in the Data App.
- Node 406, Node 407, Node 408.

Nodes used: Excel Reader, Statistics, Interactive Table (local), Box Plot, Numeric Outliers, Statistics, Feature Selection Loop Start (1:1), Scorer, Feature Selection Loop End, Feature Selection Filter, Number To String, Gradient Boosted Trees Learner, Gradient Boosted Trees Predictor, Gradient Boosted Trees Predictor, Scorer, Gradient Boosted Trees Learner, Parameter Optimization Loop Start, Parameter Optimization Loop End, Math Formula, GroupBy, Rule Engine, Table Row to Variable, Gradient Boosted Trees Learner, Gradient Boosted Trees Predictor, Scorer, Rank Correlation, String To Number, Logloss, Logloss, Column Merger, Column Filter, ROC Curve, Model Writer, Missing Value, Partitioning, Missing Value (Apply), Numeric Outliers (Apply), Column Merger, Column Filter, Column Filter, X-Partitioner, X-Aggregator, X-Partitioner, X-Aggregator, Table Row to Variable, Extract Column Header, Create Collection Column, Duplicate Row Filter, Normalizer, Normalizer (Apply), SMOTE, Model Writer, Model Writer, Excel Writer, Model Writer, Excel Writer, Extract Column Header, Table Writer, Table Writer.
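The Hill Climbing strategy of the Parameter Optimization loop (move to the best single-step neighbor of Depth and N_Models until no neighbor improves the score) can be illustrated with a small self-contained sketch. The objective function below is a hypothetical stand-in for the cross-validated log loss the workflow minimizes, not the real one:

```python
# Search space, as in the Parameter Optimization Loop Start node:
# Depth in [1, 10], N_Models in [1, 500], both with step 1.
# Hypothetical objective, minimized at Depth=4, N_Models=120.
def objective(depth, n_models):
    return (depth - 4) ** 2 * 0.05 + abs(n_models - 120) * 0.001

def neighbors(depth, n_models):
    """Single-step moves along each parameter, kept inside the ranges."""
    for d, n in ((depth - 1, n_models), (depth + 1, n_models),
                 (depth, n_models - 1), (depth, n_models + 1)):
        if 1 <= d <= 10 and 1 <= n <= 500:
            yield d, n

def hill_climb(start):
    current, score = start, objective(*start)
    while True:
        best = min(neighbors(*current), key=lambda p: objective(*p))
        if objective(*best) >= score:
            return current, score  # no neighbor improves: local minimum
        current, score = best, objective(*best)

(best_depth, best_n), best_loss = hill_climb((1, 1))
```

Hill climbing evaluates far fewer combinations than a full grid over 10 x 500 points, at the cost of possibly stopping in a local minimum; the workflow's extra cross-validation loop is what stabilizes the scores the search relies on.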
