Icon

KNIME_​project_​Blankenhorn_​Grupp

4. Building Flatfile & Feature Engineering
1. Data Access from files
2. Data Exploration - Plausibility check

City and country do not match (it was assumed that the country is wrong data, but the rest of the data is correct -> no other chance to validate)

Product name and category do not match

Single table view: No missing values & no negative values (check via statistic view) & no wrong data types

3. Data Cleaning

Check if discount min and max value lies between 0 and 1 -> YES (via statistic view)

In Sales CSV gibt es eine ProduktID P0000 und P0201, allerdings startet es in der Product CSV bei P0001 und endet bei P0200 -> betrifft 4295 Zeilen (P0000) und 4268 Zeilen (P0201) und damit 0,98% der Daten!

Feature Engineering:

Zeit-Features

  • Saisonalität ("is_holiday"): Ostern, Weihnachten, Valentinstag, Halloween, Muttertag/Vatertag -> Zeitraum vorher relevant!

    • Ostern und Muttertag/Vatertag: Nicht immer am selben Tag!

    • Nicht alle Länder haben dieselben Daten!

      • 2023: 26.03. - 09.04. (Zeitraum von 2 Wochen vorher)

      • 2024: 17.03. - 31.03. (Zeitraum von 2 Wochen vorher)

    • Weihnachten: 01.12. - 24.12. (ganzer Monat wegen Advent)

    • Valentinstag: 07.02. - 14.02. (Zeitraum von 1 Woche vorher)

    • Halloween: 24.10. - 31.10 (Zeitraum von 1 Woche vorher)

    • Muttertag/Vatertag:

      • 2023: 07.05. - 18.05. (Zeitraum von 1 Woche vor Muttertag -> 14.05.)

      • 2024: 02.05. - 12.05. (Zeitraum von 1 Woche vor Vatertag -> 09.05.)

-> mit den Jahren 2025 bis 2027 wird entsprechend genauso verfahren.

  • Wochentag: Wird Schokolade bevorzugt unter der Woche oder am Wochenende gekauft? Welcher Tag unter der Woche wird bevorzugt?

    • Tag der Woche (Montag-Freitag) -> "day of week"

    • Wochenende (ja/nein) -> "is_weekend"

Store-Type Features

  • Ferienzeiten bei Airport Store berücksichtigen?!

Produktkategorie-Features

  • Gift Kategorie: Praline, Truffle -> eher zu Feiertagen gekauft (vsl. dort Peaks)

  • Everyday Kategorie: Dark, White, Milk -> immer gekauft

Lag & Trendfeatures

  • Gleicher Wochentag letzte Woche

  • Zweiwöchiger Rhythmus

  • Monatlicher Rhythmus

  • Monatstrend

Rabatt Features

  • Hat Discount (ja/nein)

Customer Features

  • Ist Loyality Kunde (ja/nein)

  • Alter (kaufen Ältere mehr als Jüngere?)

  • Geschlecht (kaufen Frauen mehr als Männer?)

OPTION 1: FASTER LOADING TIME (Holiday dates as timeframe in CSV)

  • "is_weekend" = all rows have the same value, so I choose the first

  • "proximity_holiday" = different transactions on the same day could theoretically have different proximity values -> Setting the value to Maximum” ensures that the strongest holiday effect is retained

"Die lineare Regression erreicht ein R² von 0.35, was zeigt dass die Nachfrage nicht durch lineare Zusammenhänge erklärt werden kann. Die nicht-linearen Muster wie Saisonalität und Feiertagseffekte erfordern komplexere Modelle.

5. Partitioning data and setting up linear regression model

Background:

  1. To be able to create lag columns efficiently, a history of transactions per day is required (through grouping some days are "lost" as no transactions happened -> this is prevented by this part of the workflow)

  2. Getting transparency on transaction combinations that did not happen

6. Testing linear regression model by forecasting customer demand
Splitting up the data set into two portions (portion 1: 01.01.23 - 31.03.24; portion 2: 01.04.24 - 30.09.24)
Rule-based Row Splitter
Linear Regression Learner
winner features
Column Filter
Creating correct reference table for product name-category
Table Creator
Reading in all possible combinations of store type, city, category and brand
CSV Reader
Reading in all dates (01.01.2023 - 30.09.2024)
CSV Reader
Checking data base
Adding correct Category column
Value Lookup
Cross joining these tables to generate the full count of possible transaction combinations per day
Cross Joiner
Checking data base
Reading in exact holiday dates
CSV Reader
Joining the transaction combination table with the real transaction table and adding the columns whereever a match is found (date, category, store type, city, brand)
Joiner
Transforming date column from string to "date" data type
String to Date&Time
Resorting columns (product name next to correct category)
Column Resorter
Adding correct Country column
Value Lookup
Date&Time Part Extractor
Excluding wrong category column
Column Filter
Missing values are fixed (sum_quantity & discount & loyalty member = 0; unit_price = mean as the price remains there even with no demand)
Missing Value
Transforming "order_date"-column to correct data type (date instead of string)
String to Date&Time
Excluding unnecessary columns
Column Filter
Excluding wrong country column
Column Filter
Column Filter
Feature Selection and Testing
Resorting columns (city next to country)
Column Resorter
GroupBy
Missing Value
Transforming string to date
String to Date&Time
Reading in forecasting dates
CSV Reader
Cross joining future dates with every possible combination of store_type, city, country, category and brand
Cross Joiner
Extracting year, month, day and calendar_week out of "date"
Date&Time Part Extractor
Excluding "lag dates"
Column Filter
Reading Calendar
CSV Reader
Calculating the mean discount and loyalty member part based on historic data on the basis of every combination across the entire timeframe
GroupBy
Reading Customers
CSV Reader
Reading Products
CSV Reader
Creating lag features
Reading Stores
CSV Reader
Google Drive Connector
Google Authenticator
Reading Sales
CSV Reader
Column Resorter
Creating feature "is_weekend"
Rule Engine
Column Renamer
Calculating the date difference between every holiday date and order date
Date&Time Difference
Extracting the date difference to the next holiday date (days to next holiday)
Row Filter
Check if product name matches to cocoa percentage
String to Date&Time
Check if (revenue - costs) matches profit
Joining every row from data table with every row from date table
Cross Joiner
Regression Predictor
Creating feature is in holiday timeframe
Checking data base and deleting wrong product IDs
Extracting and adding the two columns/features "mean(discount)" and "mean(loyalty_member)" based on the hypothesis that the mean discount and loyalty member part remains roughly the same
Joiner
Specify the degree to which the date falls within the holiday window
Rule Engine
Numeric Scorer
Creating lag features
Renaming columns to match historic learner data
Column Renamer
Regression Predictor
Excluding unnecessary columns
Column Filter
Joining tables
Grouping by day, store and product category
GroupBy
Extracting the minimum date difference to the upcoming holiday
GroupBy
Adding the two columns "Date&Time Difference", "date_holiday" and "holiday" to data table
Joiner
Column Filter
Table Partitioner
Lag Column
Creating feature "day of week" by changing data type from number to string
Number to String
Splitting rows into missing and non-missing lag rows
Rule-based Row Splitter
One to Many
Creating feature is in holiday timeframe
Recursive Loop End
Creating correct reference table for city-country
Table Creator
Recursive Loop Start
Linear Regression Learner
Date Shifter
Top k Row Filter
R^2 = 0.327
Numeric Scorer
Rule Engine
Regression Predictor
Joiner

Nodes

Extensions

Links