
04 Data Science

Workflow 1 and Workflow 2

Documentation for Data Science:

In workflow 1, I first used Interactive Data Cleaning to handle the missing values. Next, I partitioned the data into a training set (70%) and a test set (30%) using random sampling. I also added a Parameter Optimization Loop. After that, I trained a Random Forest (Regression) model to predict the "life expectancy" column. I then applied the model to the test set and evaluated its performance with the Numeric Scorer node. In this model, the predicted values came out close to the actual values.

In workflow 2, I created a similar model. However, I added an additional input column, the carbon dioxide emission rate per capita, by using the concatenated datasets, to see the difference in prediction. I ran the model, and the life expectancy predictions deviate more from the actual values than in workflow 1. I attribute this to the different carbon dioxide emission rates in each of the regions.

Workflow annotations:
new column "Life-Expectancy (High/Low/Avg)"
Life Expectancy Data
Human Development Index
Filtered columns by Year 2010
Inner join by country
Filtered missing values in column: UNDP Developing Regions
Concatenated all three datasets
Filtered columns: Life Expectancy, Region, Carbon Dioxide Emission
Filtered missing values
Control nr of models
Minimize MAPE
Predict Life-expectancy
R2 and error metrics

Nodes used:
Interactive Data Cleaning
Rule Engine
CSV Reader
Column Filter
Joiner
Row Filter
Concatenate
Partitioning
Table Column to Variable
Parameter Optimization Loop Start
Random Forest Learner (Regression)
Random Forest Predictor (Regression)
Numeric Scorer
Parameter Optimization Loop End
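For readers who want to see the evaluation steps outside of KNIME, here is a minimal Python sketch of three pieces described above: the 70/30 random partition, the kind of R2 and error metrics a scorer reports, and a loop that picks the number of models with the lowest MAPE. It is not the workflow itself. The function names (partition, numeric_scores, pick_n_models) are hypothetical, the metric formulas are the standard textbook ones and may differ in detail from what the Numeric Scorer node outputs, and the numbers in the usage lines are made up for illustration.

```python
import math
import random


def partition(rows, train_fraction=0.7, seed=0):
    """Randomly split rows into a training part and a test part."""
    rng = random.Random(seed)   # fixed seed so the split is repeatable
    shuffled = list(rows)       # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def numeric_scores(actual, predicted):
    """Standard regression metrics; MAPE is returned as a fraction."""
    n = len(actual)
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "r2": 1 - ss_res / ss_tot,
        "mae": sum(abs(a - p) for a, p in zip(actual, predicted)) / n,
        "rmse": math.sqrt(ss_res / n),
        "mape": sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / n,
    }


def pick_n_models(candidates, mape_for):
    """Return the candidate number of models with the lowest MAPE.

    mape_for is any function that trains and scores a model for a given
    number of models and returns its MAPE (assumed, not shown here).
    """
    return min(candidates, key=mape_for)


# Illustrative values only, not results from the workflows.
train, test = partition(range(10))
scores = numeric_scores([70.0, 80.0, 90.0], [72.0, 78.0, 90.0])
```

In the actual workflows, the Parameter Optimization Loop plays the role of pick_n_models, with the Random Forest Learner and Predictor nodes sitting inside the loop in place of mape_for.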
