1) Read the yellow_tripdata_2015_04-05_timestamp.csv file.

FOR THE FIRST OPTIMIZATION (whose goal is hyperparameter optimization):

2) From the original table, include only the rows with total_amount > 0.

3) Identify with a semi-automatic procedure which attributes are most promising for learning (a Python sketch of this step follows the list):
a) evaluate the correlation of the attributes with the "total_amount" column (excluding non-numeric columns)
b) keep only the rows having "total_amount" as an element of the "Second column name" column
c) compute (in a new column) the absolute value of the correlation coefficients
d) sort the column of absolute correlation coefficients in descending order
e) keep only the first three rows of the table obtained in the previous step
f) keep only the "First column name" column
g) transpose the table, setting Chunk size equal to 1 (i.e. the number of columns in the table to be transposed)
h) add a new column (to be named LearningFeatures) containing as its unique value the list of values of all the other columns
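For reference only, the logic of steps 2-3 corresponds roughly to the following pandas sketch; the exercise itself must be solved with the KNIME nodes (Rule-based Row Filter, Linear Correlation, Column Expressions, Sorter, Row Filter, Table Transposer, Create Collection Column). The select_dtypes call is an assumption standing in for the Linear Correlation node's exclusion of non-numeric columns.

```python
import pandas as pd

df = pd.read_csv("yellow_tripdata_2015_04-05_timestamp.csv")

# Step 2: keep only rows with total_amount > 0.
df = df[df["total_amount"] > 0]

# Steps 3a-3d: correlate every numeric attribute with total_amount,
# take the absolute value, and sort in descending order.
corr = (
    df.select_dtypes("number")
      .corr()["total_amount"]
      .drop("total_amount")        # drop the trivial self-correlation
      .abs()
      .sort_values(ascending=False)
)

# Steps 3e-3h: the three attributes most correlated with the target,
# collected into a single list (the "LearningFeatures" value).
learning_features = list(corr.index[:3])
print(learning_features)
```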
4) By means of the "LearningFeatures" column, used as a flow variable, keep only the selected features from the original input data (including the "total_amount" column).

5) Split the data into training and test subsets with an 80:20 ratio, drawing randomly with seed equal to 0.

6) Optimize a Simple Regression Tree model by minimizing the MAPE (Mean Absolute Percentage Error) with the Brute Force search strategy, over the following parameter ranges (a sketch of the loop follows the list):
- Limit number of levels (tree depth): from 5 to 20, with step equal to 1 (integer values)
- Minimum node size: from 2 to 5, with step equal to 1 (integer values)
Remember that the Numeric Scorer node does not provide flow variables by default; you need to explicitly enable that option in the node configuration.

QUESTION 1: what is the minimum MAPE value achieved?

7) Disconnect the optimization nodes and execute the model again with the optimized parameters.

QUESTION 2: what is the RMSE value achieved?
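A hedged Python sketch of steps 5-7, continuing the previous one. scikit-learn's DecisionTreeRegressor stands in for KNIME's Simple Regression Tree, min_samples_leaf is only a rough analogue of "Minimum node size", and MAPE is computed as a fraction; in KNIME the loop is realized with the Parameter Optimization Loop Start/End nodes, not code.

```python
from itertools import product

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def mape(y_true, y_pred):
    # Mean Absolute Percentage Error: the objective minimized by the loop.
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))

# Step 5: 80:20 random split with seed 0 (KNIME: Partitioning node).
X_train, X_test, y_train, y_test = train_test_split(
    df[learning_features], df["total_amount"], test_size=0.2, random_state=0
)

# Step 6: Brute Force search = exhaustive loop over both integer ranges.
best_score, best_params = float("inf"), None
for depth, min_size in product(range(5, 21), range(2, 6)):
    model = DecisionTreeRegressor(
        max_depth=depth,             # "Limit number of levels (tree depth)"
        min_samples_leaf=min_size,   # rough analogue of "Minimum node size"
        random_state=0,
    ).fit(X_train, y_train)
    score = mape(y_test.to_numpy(), model.predict(X_test))
    if score < best_score:
        best_score, best_params = score, (depth, min_size)

print("minimum MAPE (QUESTION 1):", best_score)

# Step 7: re-run with the optimized parameters and report RMSE (QUESTION 2).
depth, min_size = best_params
final = DecisionTreeRegressor(
    max_depth=depth, min_samples_leaf=min_size, random_state=0
).fit(X_train, y_train)
resid = y_test.to_numpy() - final.predict(X_test)
print("RMSE (QUESTION 2):", float(np.sqrt(np.mean(resid ** 2))))
```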
FOR THE SECOND OPTIMIZATION (whose goal is optimizing the number of features used for learning):

8) From the sorting already implemented at step 3d:
a) keep only the first n rows of the table obtained in the previous step (read all the following points for further information about n)
b) keep only the "First column name" column
c) transpose the table, setting Chunk size equal to 1 (i.e. the number of columns in the table to be transposed)
d) add a new column (to be named LearningFeatures) containing as its unique value the list of values of all the other columns

9) By means of the "LearningFeatures" column, used as a flow variable, keep only the selected features from the original input data (including the "total_amount" column).

10) Split the data into training and test subsets with an 80:20 ratio, drawing randomly with seed equal to 0.

11) Perform learning and prediction with a Simple Regression Tree model, using the previously optimized hyperparameters.

12) Optimize the number of features (i.e. the parameter n of step 8a) by minimizing the MAPE (Mean Absolute Percentage Error) with the Brute Force search strategy (a sketch follows the list). Remember that the Numeric Scorer node does not provide flow variables by default; you need to explicitly enable that option in the node configuration.

QUESTION 3: what is the minimum MAPE value achieved?

13) Disconnect the optimization nodes and execute the model again with the optimized parameter. Warning for step 13: to correctly use the result (n) provided by the optimization procedure, you now have to set Last row number = n + 1 in the Row Filter.

QUESTION 4: what is the RMSE value achieved?
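A hedged sketch of steps 8-12, reusing df, corr, mape, and the optimized depth and min_size from the previous sketches: the same pipeline, but now the brute-force loop varies n, the number of top-ranked features, while the tree hyperparameters stay fixed.

```python
# Steps 8-12: try every candidate feature count n.
best_score, best_n = float("inf"), None
for n in range(1, len(corr) + 1):
    feats = list(corr.index[:n])   # step 8a: first n rows of the ranking
    X_train, X_test, y_train, y_test = train_test_split(
        df[feats], df["total_amount"], test_size=0.2, random_state=0
    )
    model = DecisionTreeRegressor(
        max_depth=depth, min_samples_leaf=min_size, random_state=0
    ).fit(X_train, y_train)
    score = mape(y_test.to_numpy(), model.predict(X_test))
    if score < best_score:
        best_score, best_n = score, n

print("minimum MAPE (QUESTION 3):", best_score, "with n =", best_n)
```

Note that the off-by-one warning in step 13 concerns only the KNIME Row Filter configuration; the slice corr.index[:n] above needs no such adjustment.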
[Workflow diagram: the canvas contains the pipeline in several variants (hyperparameter optimization, feature-count optimization, and the final runs with the optimization nodes disconnected), built from File Reader, Rule-based Row Filter, Linear Correlation, Column Expressions, Sorter, Row Filter, Table Transposer, Column Filter, Create Collection Column, Table Column to Variable, Table Row to Variable, Partitioning, Simple Regression Tree Learner and Predictor, Numeric Scorer, Column Appender, and Parameter Optimization Loop Start/End nodes.]
