04_Machine_Learning

Machine Learning - Exercise (Solution)

This workflow predicts the residual of a time series (energy consumption) with machine learning models that use lagged values as predictors. The residual is what is left of the time series after removing the trend and the first and second seasonality.

URL: All you need is ... the Lag Column Node! https://www.knime.com/blog/all-you-need-is-the-lag-column-node
URL: The Lag Column Node https://youtu.be/pR_7pIEqW-c
URL: Slides on the KNIME Website https://www.knime.com/form/material-download-registration

Time Series Analysis
04. Machine Learning

Summary:
In this exercise you will train and score a random forest model and a linear regression model.

Instructions:
1) Run the workflow up through the Missing Value node. You will start this exercise from there.

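2) Partition the data using the Table Partitioner node. Let’s use an 80/20 split. Make sure you check the box to take data from the top. This is important with time series data: a random split would let the models train on time points that come after the ones they are tested on.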


3) Let's create lag columns on the training data (top output, Table Partitioner). Use the Lag Column node with Lag Interval = 1 and Lags = 10. You will use these 10 past values as the inputs for your models.
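
Steps 2 and 3 can be sketched outside KNIME as follows. This is only a minimal illustration in plain Python, not what the Table Partitioner or Lag Column nodes do internally. The function names and the toy series are hypothetical; the sketch assumes that lag k is taken k * lag_interval rows back and that lags reaching before the start of the series are left missing.

```python
def partition_from_top(rows, train_fraction=0.8):
    """Split a chronologically ordered list without shuffling."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def add_lag_columns(values, lags=10, lag_interval=1):
    """Return one row per time point: [value, value(-1), ..., value(-lags)].

    Lags that reach before the start of the series are None (missing).
    """
    table = []
    for t, value in enumerate(values):
        row = [value]
        for k in range(1, lags + 1):
            source = t - k * lag_interval
            row.append(values[source] if source >= 0 else None)
        table.append(row)
    return table


# Hypothetical stand-in for the cluster_26 column (100 ordered time points).
series = list(range(100))

train, test = partition_from_top(series)          # first 80% / last 20%
lagged = add_lag_columns(train, lags=10)          # target plus 10 lag columns
complete = [row for row in lagged if None not in row]  # rows with all lags present
```

Because the split takes rows from the top, every test point lies after every training point, which is the property the exercise asks for.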

Workflow sections: Data Loading; Data Preparation; Partitioning & creating lag columns; Linear regression; Random forest

Random forest model

4) Connect the Random Forest Learner (Regression) node to the output port of the Lag Column node. Make sure your target is cluster_26 and your inputs are the lagged values: cluster_26(-n).

5) Create a separate branch from the output of the Lag Column node, with a Top k Row Filter node. Sort by row ID in descending order, so that the last time point is on the first row. Keep only the first row of the sorted table. Send the output to the Recursive Loop Start node.

6) The Recursive Loop Start node sends an updated table of the target and the predictors to the process component. The process component shifts the lag columns by 1 time point, and adds the target column (cluster_26) as the most recent lag (cluster_26(-1)).

7) Add a Random Forest Predictor (Regression) node. Add the trained model from the Random Forest Learner (Regression) node to the top input port, and the output from the process component to the bottom input port. Change the prediction column name to cluster_26.

8) From the prediction table generated at the output of the Random Forest Predictor (Regression) node, remove the column cluster_26 (Prediction Variance) with a Column Filter node.

9) Send the output of the Column Filter node from step 8) to both input ports of the Recursive Loop End node. Set the maximum number of iterations to 168, one iteration per forecast hour (1 week).

10) In the output of the Recursive Loop End node, rename the column cluster_26 to Forecasts with a Column Renamer node.

11) The model evaluation is similar to that of the SARIMA model exercise. Both the prediction table and the original data table are given new row IDs (RowID node) and then combined (Joiner node). Then various evaluation metrics are calculated (Numeric Scorer node). The forecasts and the original data are plotted (Line Plot node).
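
The recursive loop in steps 5 to 9 can be sketched in plain Python as below. This is an illustration under assumptions, not the KNIME implementation: recursive_forecast and mean_of_lags are hypothetical names, and mean_of_lags merely stands in for a trained model such as the random forest. Each iteration shifts the lags by one time point, feeds the previous prediction in as the newest lag, and predicts the next value.

```python
def recursive_forecast(last_row, predict_one, horizon=168):
    """Forecast `horizon` steps ahead by feeding predictions back as lags.

    last_row:    [target, lag1, ..., lagN] for the final training time point.
    predict_one: callable mapping [lag1, ..., lagN] to one predicted value.
    """
    row = list(last_row)
    forecasts = []
    for _ in range(horizon):
        # Shift by one time point: the target becomes lag 1,
        # every lag moves one position back, and the oldest lag is dropped.
        lags = [row[0]] + row[1:-1]
        prediction = predict_one(lags)
        forecasts.append(prediction)
        # The prediction is the target of the row passed to the next iteration.
        row = [prediction] + lags
    return forecasts


def mean_of_lags(lags):
    """Hypothetical stand-in for a trained model: average of the lags."""
    return sum(lags) / len(lags)


# Toy example with 2 lags and a 3-step horizon.
forecasts = recursive_forecast([3.0, 2.0, 1.0], mean_of_lags, horizon=3)
# A one-week forecast of hourly data uses the default horizon of 168.
week = recursive_forecast([3.0, 2.0, 1.0], mean_of_lags)
```

Passing the model in as a callable keeps the loop unchanged when the random forest is later swapped for a linear regression.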

Linear regression model

12) Modify the random forest model (steps 4-11) by replacing the Random Forest Learner (Regression) node with a Linear Regression Learner node, and the Random Forest Predictor (Regression) node with a Regression Predictor node.

13) Compare the model performance with that of the random forest model.
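
Comparing the two models comes down to computing the same error metrics on both sets of forecasts. The sketch below computes three common regression metrics (mean absolute error, root mean squared error and R²) in plain Python. It is only an illustration: the score function and the toy numbers are hypothetical, and it is not the Numeric Scorer node's own code.

```python
import math


def score(actual, predicted):
    """Return MAE, RMSE and R² for two equally long lists of numbers."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Toy values, not results from the workflow.
actual = [1.0, 2.0, 3.0, 4.0]
forecast_a = [1.0, 2.0, 3.0, 5.0]   # hypothetical forecasts of model A
forecast_b = [2.0, 3.0, 4.0, 5.0]   # hypothetical forecasts of model B

scores_a = score(actual, forecast_a)
scores_b = score(actual, forecast_b)
```

Lower MAE and RMSE and a higher R² indicate the better forecast; in this toy data that is model A.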

Optional

14) In the Lag Column node, change the number of lag columns to 24. Does this improve the predictions of the random forest model and the linear regression model?

Workflow canvas (node labels): CSV Reader, String to Date&Time, Date&Time Aligner (Labs), Missing Value, RowID, Recursive Loop Start, process, Column Filter, Recursive Loop End, Joiner, Numeric Scorer, Line Plot. Node annotations: "Energy usage data"; "convert date/time into Date&Time objects".
