This workflow uses a subset of the popular NYC taxi dataset and the Spark Random Forest node to train a simple time series model that predicts taxi demand in the next hour from the demand in the preceding hours.
The input data is the number of NYC taxi trips per hour for each day of 2017. Our goal is to predict taxi demand at a given hour, and to do that we need the taxi demand in the previous N hours. The Spark Lag Column metanode creates the N lagged columns. The Find lag metanode builds a correlation matrix between the lagged columns, which we can inspect visually, and it also automatically finds the value N with the highest correlation with the original column, the total number of trips (taxi demand) per hour. A Random Forest model is then trained on those N lagged columns plus two additional temporal features: hour of day and day of week.
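The lagging and lag-selection steps can be sketched outside of KNIME in plain Python. This is only an illustration under stated assumptions, not the metanodes' actual implementation: the function names (`make_lagged_rows`, `pearson`, `best_lag`) are hypothetical, the demand series is synthetic, and the Pearson coefficient is assumed as the correlation measure.

```python
import math

def make_lagged_rows(series, n_lags):
    """Build one feature row per time step from the n_lags previous values."""
    rows, targets = [], []
    for t in range(n_lags, len(series)):
        # Column order: lag 1 first, then lag 2, ..., lag n_lags.
        rows.append([series[t - k] for k in range(1, n_lags + 1)])
        targets.append(series[t])
    return rows, targets

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / (norm_x * norm_y)

def best_lag(series, max_lag):
    """Return the lag whose column correlates most with the target, plus all scores."""
    rows, targets = make_lagged_rows(series, max_lag)
    scores = {}
    for k in range(1, max_lag + 1):
        column = [row[k - 1] for row in rows]
        scores[k] = pearson(column, targets)
    return max(scores, key=scores.get), scores

# Synthetic stand-in for hourly taxi demand: two weeks with a daily cycle.
hourly_trips = [100 + 50 * math.sin(2 * math.pi * (h % 24) / 24)
                for h in range(24 * 14)]
lag, scores = best_lag(hourly_trips, max_lag=30)
print("selected lag:", lag)
```

Because the synthetic series repeats every 24 hours, the lag-24 column matches the target and is the one selected. With real data, the lagged rows, together with hour of day and day of week, would form the training table for the Random Forest.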
We experimented with first-order differencing and seasonality removal, both common practice in time series prediction, to see whether they would improve our simple model. The results suggest that, for a regular time series, a highly parametric algorithm like a Random Forest often produces good results even when trained on the full time series without seasonality removal.
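As a rough illustration of the two preprocessing steps, the sketch below implements first-order differencing and, as one common way to remove seasonality, differencing at the seasonal period. The description above does not say how the workflow removes seasonality, so the seasonal-lag approach is an assumption. The helper names (`difference`, `undo_difference`) and the toy data are hypothetical too.

```python
def difference(series, lag=1):
    """Subtract the value `lag` steps earlier; lag=1 is first-order differencing."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

def undo_difference(diffed, head, lag=1):
    """Rebuild the original series from the differences and its first `lag` values."""
    restored = list(head)
    for d in diffed:
        # restored[-lag] is the original value `lag` steps before the one being rebuilt.
        restored.append(d + restored[-lag])
    return restored

# Toy series: a pattern that repeats every 3 steps on top of a slow upward trend.
trips = [10, 12, 15, 11, 13, 16, 12, 14, 17]

first_order = difference(trips)          # removes the level, keeps the cycle
deseasonalized = difference(trips, 3)    # removes the period-3 cycle
roundtrip = undo_difference(deseasonalized, trips[:3], lag=3)

print(first_order)
print(deseasonalized)
print(roundtrip == trips)
```

For hourly taxi demand, the seasonal lag would be 24 for the daily cycle or 168 for the weekly one. A model trained on differenced values predicts differences, so its predictions have to be passed back through `undo_difference` before they can be compared with the raw trip counts.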
Seasonality_Removal consists of 310 nodes provided by 8 plugins.