Icon

KNIME_​project_​Blankenhorn_​Grupp_​final

4.1 DATA MODELING: Building Flatfile & Feature Engineering
1. DATA ACCESS from files
2. DATA EXPLORATION - Plausibility check

City and country do not match (it was assumed that the country is wrong data, but the rest of the data is correct -> no other chance to validate)

Product name and category do not match

Single table view: No missing values & no negative values (check via statistic view) & no wrong data types (data type from "date" is converted from string to date at a later point)

3. DATA CLEANING

Check if discount min and max value lies between 0 and 1 -> YES (via statistic view)

Background Thoughts on Feature Engineering: (Potential Features)

Time Features

  • Seasonality ("is_holiday"): Easter, Christmas, Valentine's Day, Halloween, Mother's Day/Father's Day → the lead-up period is relevant!

    • Easter and Mother's Day/Father's Day: not always on the same date!

    • Not all countries share the same dates!

      • 2023: 26.03. – 09.04. (2-week window prior to Easter)

      • 2024: 17.03. – 31.03. (2-week window prior to Easter)

    • Christmas: 01.12. – 24.12. (entire month due to Advent season)

    • Valentine's Day: 07.02. – 14.02. (1-week window prior)

    • Halloween: 24.10. – 31.10. (1-week window prior)

    • Mother's Day/Father's Day:

      • 2023: 07.05. – 18.05. (1-week window before Mother's Day → 14.05.)

      • 2024: 02.05. – 12.05. (1-week window before Father's Day → 09.05.)

    • → The same logic is applied accordingly for 2025 through 2027.

  • Day of week: Is chocolate purchased more on weekdays or weekends? Which weekday is preferred?

    • Day of the week (Monday–Friday) → "day_of_week"

    • Weekend (yes/no) → "is_weekend"

Store Type Features

  • Should school holiday periods be considered for Airport stores?

Product Category Features

  • Gift category: Praline, Truffle → more likely purchased around holidays (peaks expected)

  • Everyday category: Dark, White, Milk → purchased consistently throughout the year

Lag & Trend Features

  • Same weekday last week (lag_7)

  • Bi-weekly rhythm (lag_14)

  • Three-weekly rhythm (lag_21)

  • Monthly rhythm (lag_28)

Discount Features

  • Has discount (yes/no)

Customer Features

  • Is loyalty member (yes/no)

  • "is_weekend" = all rows have the same value, so I choose the first

  • "proximity_holiday" = different transactions on the same day could theoretically have different proximity values -> Setting the value to Maximum” ensures that the strongest holiday effect is retained

4.2 MODEL TRAINING: Partitioning data and setting up linear regression model

Background:

  1. To be able to create lag columns efficiently, a history of transactions per day is required (through grouping some days are "lost" as no transactions happened -> this is prevented by this part of the workflow)

  2. Getting transparency on transaction combinations that did not happen

5. EVALUATION: Testing linear regression model by forecasting customer demand (& recreating features)

Transformed data type because when it is a string, it does not interpret that 1 (monday) weighs less than 7 (sunday) which is important for the model when training with the data!

Lag features are not taken into account, since there must be no missing values in the final prediction.

Option 2 for feature creation ("is in_holiday timeframe")

Lag features can be accurately reconstructed until October 7th, where all lag values still reference real historical data. From October 8th onwards, lag_7 requires the demand value of October 1st, which is itself already a forecast. This means the model is predicting based on predictions, compounding uncertainty over time.

A recursive loop forecast would solve this by iteratively feeding each day's forecast as the lag input for the next prediction.

Option 2 does not define fixed holiday windows but instead uses only the exact holiday date as a reference point. A Rule Engine then calculates the number of days between each transaction date and the next upcoming holiday, assigning a proximity score: the closer the date is to the holiday, the higher the score. The underlying hypothesis is that demand increases gradually as a holiday approaches, with the strongest effect in the final days before the event.

However, this approach requires a Cross Join between the full transaction dataset and the holiday lookup table, generating an extremely large intermediate table of several million rows. This results in prohibitively long processing times, making the approach impractical within the given workflow. Consequently, Option 2 was not pursued further and is not considered in the subsequent analysis.

To ensure applicability to future data, all holiday reference dates were defined through 2027, allowing the workflow to be extended to new datasets beyond the original training period without requiring manual updates to the holiday lookup table.

VISUALIZATION: Plotting sales broken down by category and month

Validation of forecasting power of model by forecasting september data from 2024 and analyzing how big the delta is to actual data (output: see PowerBI visualization in PowerPoint presentation)

The mapping needs to be adjusted. When the record was created, it included the order ID, which has since been removed from the master record. Due to long loading times, it hasn't since been updated, but the logic of the workflow part should be clear.

Renaming "Category" to "category" to ensure similarity to other data
Column Renamer
Splitting up the data set into two portions (portion 1: 01.01.23 - 31.03.24; portion 2: 01.04.24 - 30.09.24)
Rule-based Row Splitter
Linear Regression Learner
Filtering winner features (year, month, day, calendar_week, city, store_type, category, brand, mean(discount), mean (loyalty_member)
Column Filter
Saving prediction output in excel file
Excel Writer
Creating correct reference table for product name-category
Table Creator
Column Resorter
Column Renamer
Reading in all possible combinations of store type, city, country, category and brand (specific products, e.g. Dark 80% were not included as they would have enlarged the data set in a way that results in unacceptable loading times)
CSV Reader
GroupBy
Reading in all dates (01.01.2023 - 30.09.2024)
CSV Reader
Date&Time Part Extractor
Checking data base
Transforming number to string to enable string manipulation
Number to String
Adding correct Category column
Value Lookup
Cross joining these tables to generate the full count of possible transaction combinations per day
Cross Joiner
Pivot
Combining year and month to one column
String Manipulation
Checking data base
Plotting sum(quantity) depending on category and month
Line Plot
Reading in exact holiday dates
CSV Reader
Joining the transaction combination table with the real transaction table and adding the columns wherever a match is found (date, category, store type, city, brand)
Joiner
Transforming date column from string to "date" data type
String to Date&Time
Resorting columns (product name next to correct category)
Column Resorter
Filtering columns year and month, so that only "year - month" is included for plot
Column Filter
Adding correct Country column
Value Lookup
Splitting up september data 2024
Rule-based Row Splitter
Excluding wrong category column
Column Filter
Missing values are fixed (sum_quantity & discount & loyalty member = 0; unit_price = mean as the price remains there even with no demand)
Missing Value
Renaming columns
Column Renamer
Transforming "order_date"-column to correct data type (date instead of string)
String to Date&Time
Checking predicition capability of model by forecasting september data and evaluation the difference between predicition and acutal data
Regression Predictor
Excluding unnecessary columns (exclusion of e.g. store ID because of unacceptable loading times -> inclusion here would require inclusion in cross joined dataset in upper lane)
Column Filter
Splitting up the data set into two portions (portion 1: 01.01.2023 - 31.03.2024, portion 2: 01.04.2024 - 31.08.2024)
Rule-based Row Splitter
Excluding wrong country column
Column Filter
Column Filter
Feature Selection and Testing
Linear Regression Learner
Resorting columns (city next to country)
Column Resorter
Grouping all columns, except for quantity, unit price, discount and loyalty members, however for these the sum oder mean was calculated
GroupBy
Missing Value
Transforming string to date
String to Date&Time
Reading in forecasting dates
CSV Reader
Excel Writer
Cross joining future dates with every possible combination of store_type, city, country, category and brand
Cross Joiner
Extracting year, month, day and calendar_week out of "date"
Date&Time Part Extractor
Excluding "lag dates"
Column Filter
Reading Calendar
CSV Reader
Calculating the mean discount and loyalty member part based on historic data on the basis of every combination across the entire timeframe
GroupBy
Reading Customers
CSV Reader
Reading Products
CSV Reader
Creating lag features
Reading Stores
CSV Reader
Google Drive Connector
Google Authenticator
Reading Sales
CSV Reader
Creating feature "is_weekend"
Rule Engine
Calculating the date difference between every holiday date and order date
Date&Time Difference
Extracting the date difference to the next holiday date (days to next holiday)
Row Filter
Check if product name matches to cocoa percentage
String to Date&Time
Check if (revenue - costs) matches profit
Joining every row from data table with every row from date table
Cross Joiner
Checking data base and deleting wrong product IDs
Extracting and adding the two columns/features "mean(discount)" and "mean(loyalty_member)" based on the hypothesis that the mean discount and loyalty member part remains roughly the same
Joiner
Specify the degree to which the date falls within the holiday window
Rule Engine
Joining tables
Extracting the minimum date difference to the upcoming holiday
GroupBy
Adding the two columns "Date&Time Difference", "date_holiday" and "holiday" to data table
Joiner
Creating feature "day of week" by changing data type from number to string
Number to String
Creating feature is in holiday timeframe
Creating correct reference table for city-country
Table Creator
R^2 = 0.19
Numeric Scorer
Regression Predictor
Excluding date and country columns as these features are not significant
Column Filter
Regression Predictor

Nodes

Extensions

Links