Seasonality_Removal

Seasonality removal

This workflow uses a subset of the popular NYC taxi dataset and the Spark Random Forest node to train a simple time series model that predicts taxi demand in the next hour from the demand in the preceding hours.

The input data is the number of NYC taxi trips per hour for each day of 2017. The goal is to predict taxi demand at a given hour, which requires the taxi demand in the previous N hours. The N lagged columns are created in the Spark Lag Column metanode. The Find lag metanode builds a correlation matrix of the lagged columns, which can be inspected visually, and automatically finds the lag N with the highest correlation with the original column, the total number of trips (taxi demand) per hour. A Random Forest model is then trained on those N lagged columns plus two additional temporal features: hour of day and day of week.
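The lag selection and lagged-column steps described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the workflow's Spark implementation: the function names (`pearson`, `best_lag`, `lag_rows`) are hypothetical, and Pearson correlation is assumed as the correlation measure.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_lag(series, max_lag):
    """Return the lag in 1..max_lag whose lagged copy of the series
    correlates most strongly with the series itself, plus all scores."""
    scores = {}
    for lag in range(1, max_lag + 1):
        # series[:-lag] holds y(t - lag), series[lag:] holds y(t)
        scores[lag] = pearson(series[:-lag], series[lag:])
    return max(scores, key=scores.get), scores


def lag_rows(series, n_lags):
    """Build (features, target) rows where features are
    [y(t-1), ..., y(t-n_lags)] and the target is y(t)."""
    rows = []
    for t in range(n_lags, len(series)):
        features = [series[t - k] for k in range(1, n_lags + 1)]
        rows.append((features, series[t]))
    return rows


# Toy hourly counts with a repeating pattern of length 4 (made-up data).
toy_series = [1, 5, 2, 8] * 6
chosen_lag, lag_scores = best_lag(toy_series, max_lag=6)
training_rows = lag_rows(toy_series, chosen_lag)
```

On the real data, hour of day and day of week would be appended to each feature row before the model is trained.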

We experimented with first-order differencing and seasonality removal, both common practice in time series prediction, to see whether they would improve our simple model. The results suggest that for regular time series, a highly parameterized algorithm such as a Random Forest often produces good results even when trained on the full time series, without seasonality removal.
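As a rough illustration of these transformations, the sketch below treats all three as differencing at a fixed period: 1 hour for first-order differencing, 24 hours for daily seasonality, and 168 hours for weekly seasonality. That the workflow removes seasonality by subtracting the value one period earlier is an assumption, and the helper names `difference` and `restore` are hypothetical. The `restore` step corresponds to recomputing the predicted trip count from a prediction made on the transformed series.

```python
HOURS_PER_DAY = 24
HOURS_PER_WEEK = 24 * 7


def difference(series, period=1):
    """Subtract the value `period` steps earlier from each value.

    period=1 gives first-order differencing; period=24 or 168 on hourly
    data is one simple (assumed) way to remove daily or weekly seasonality.
    The result is `period` elements shorter than the input.
    """
    return [series[t] - series[t - period] for t in range(period, len(series))]


def restore(predicted_diff, series, period=1):
    """Turn predicted differences back into trip counts by adding
    the known value from `period` steps earlier."""
    return [d + series[i] for i, d in enumerate(predicted_diff)]


# Made-up hourly counts, only to show the round trip.
counts = [10, 12, 15, 11, 13, 16]
first_order = difference(counts)            # y(t) - y(t-1)
seasonal = difference(counts, period=3)     # y(t) - y(t-3)
recovered = restore(seasonal, counts, period=3)
```

A model trained on the differenced series predicts differences, so its output has to be passed through a step like `restore` before it can be compared with the observed trip counts.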

[Workflow diagram: Taxi Demand Prediction with Spark Random Forest. The data is loaded from Parquet into Spark (Create Local BigData Environment, Parquet to Spark) and inspected, and the Find lag metanode determines the lag. Four branches follow: Original data, First order difference, Daily seasonality removal, and Weekly seasonality removal. Each branch creates lagged columns (Spark Lag Column), partitions the data into training and test sets (Split by date and time), trains the model (Spark Random Forests Learner (MLlib)), predicts the test set (Spark Predictor (MLlib)), and scores the result (Spark Numeric Scorer). The three transformed branches also recompute the predicted trip count (Spark SQL Query). The Visualization section shows line plots of prediction vs. expected values. For more information see the workflow metadata: View -> Description.]
