Icon

KNIME_​Group_​project_​v1

Data Preprocessing — Data Quality Treatment

  • Education = 0 replaced with N/A

  • Internet values capped at 100% (6 records affected)

Missing Value Imputation

- Marital_Status (489) & Gender (169): Most Frequent Value (Mode)

- Income (62 records: 35 NaN + 27 zeros):

Context-aware median imputation

- Income = 0 treated as missing (implausible for active customers)

- Primary lookup: median by Age_Bin × Monetary_Bin (12 cells)

- Fallback: Age_Bin median when group count < 100

- Rationale: Income-Age correlation = 0.88,

Income-Monetary correlation = 0.80

Feature Engineering & Anomaly Flagging

[1] Derived Variables (6 new columns)

• AvgTicket = Monetary / Frequency

(basket size; Frequency=0 → 0, 10 records)

• Perishables_Pct, Beverages_Pct, Frozen_Pct,

Canned_Pct, Others_Pct (= LoB / Monetary)

Purpose: distinguish "small basket frequent" vs

"large basket occasional" shoppers; capture

category preference independent of total spend.

[2] Coherence Flags (~360 distinct records, retained)

• Flag_FreqZero: 10 records

• Flag_Recency_GT365: 26 records

• Flag_Young_DivWidow: 2 records

• Flag_Negative_LoB: 322 records (returns)

Note on outliers (1.5 × IQR, upper bound)

- Monetary > 16,772: 126 customers (0.63%)

- Frequency > 54: 42 customers (0.21%)

- Recency > 157: 796 customers (3.98%) — lapsed

- AvgTicket > 506: 29 customers (0.14%)

- ≥1 dim outlier: 972 (4.86%); ≥2 dims: 21

Retained, not flagged. These represent valid customer behavior — VIPs, heavy buyers, and lapsed customers — and are intended to form distinct segments in clustering rather than be corrected. Box plots in EDA confirm visual inspection.

K Selection & Validation — Value Segmentation

Purpose: Determine optimal k for value-based customer segmentation via elbow + silhouette analysis.

Variables: Income, Frequency, Monetary, Recency, AvgTicket.

Methodology:

  • Log transformation ln(x+1) on Income, Monetary, AvgTicket → reduces right-skew (raw AvgTicket max +13.8 std, Income +7.7 std)

  • Z-score normalization on all 5 variables (equal weighting in Euclidean distance)

  • Counting Loop: k = 2 to 10 (Math Formula Variable: currentIteration + 2)

  • k-Means + Silhouette per iteration, GroupBy aggregates mean silhouette

  • Loop End collects results, Line Plot visualizes curve

  • Static seed = 0 for reproducibility

Silhouette Results:

  • k=2 → 0.417 | k=3 → 0.369 | k=4 → 0.387 ← local peak

  • k=5 → 0.307 | k=6 → 0.303 | k=7 → 0.256

  • k=8 → 0.246 | k=9 → 0.248 | k=10 → 0.236

Decision: k = 4

  • Local silhouette peak (>k=3, sharp drop at k=5)

  • Marginal loss vs. k=2 (-0.030) but richer business interpretation

  • k=2 collapses to trivial "high vs. low value" split — insufficient for campaign targeting

  • Clear elbow k=4→k=5 (0.387 → 0.307, ~20% drop)

Outlier Treatment — Retain & Interpret: No rows excluded. All 20,000 customers retained. Two anomaly groups from Phase 1 treated as real signals:

  • Recency > 365 (26 records): churned / lapsed → win-back campaign target

  • Frequency = 0 & Monetary > 0 (10 records): methodological artifact, splits into recent new-buyers (n=7, activation campaign) and old churned (n=3)

Why retain rather than remove:

  • Genuine behavioral patterns, not noise

  • Log transformation absorbs scale impact on k-Means centroids

  • Final cluster profiles cross-tabulated with flags → quantify anomaly composition per segment

  • Win-back + activation recommendations emerge from post-hoc analysis (brief's actionable insight requirement)

Value Segmentation — Final Clustering & Profiling

Purpose: Apply k=4 to produce value-based segments; profile with raw metrics + flag validation.

Pipeline: Final k-Means (k=4, seed=0) on log+Z-score data → Column Filter/Rename (Value_Cluster) → Joiner (CustID, raw data) → GroupBy ×2 (profiles + flag sums) → Rule Engine (business names).

Cluster Sizes & Profiles (raw means):

  • VIP_Loyalists (cluster_1): 8,196 (41%) | €64K | 32 visits | €8.8K | 51d | €272

  • Regular (cluster_3): 7,887 (39%) | €38K | 14 visits | €2.1K | 51d | €154

  • Casual_Buyers (cluster_0): 3,159 (16%) | €26K | 11 visits | €479 | 53d | €42

  • Churned (cluster_2): 758 (4%) | €28K | 8 visits | €774 | 310d | €111

Anomaly Validation — Did flags land in coherent segments?

All 26 flagged churned customers (Recency>365) landed in the Churned cluster — perfect alignment, confirming the segment is correctly defined. The 322 negative-LoB (return) customers concentrated 83.5% in Casual_Buyers, revealing that returns are a low-engagement trait, not a VIP one. The 10 FreqZero flagged customers split coherently: 7 recent (new) customers into Casual_Buyers (activation targets), 3 old customers into Churned (win-back targets). VIP + Regular segments together (80% of the base) carry only 0.05% flag density, forming a clean behavioral mass.

Key Insights:

  • Retain-and-interpret strategy worked: anomalies weren't lost, they self-organized into actionable segments

  • Returns ≠ high-value problem; it's a Casual segment issue → pre-purchase clarity & guarantees needed

  • Churned segment is well-defined and ready for standard win-back tactics

  • Separation is driven by spend intensity, not recency: The three active segments (VIP, Regular, Casual)
    share near-identical recency (~51 days) — all recently active — and differ mainly in spend depth (
    Monetary, AvgTicket) and frequency. Recency isolates only the Churned segment (310 days).
    The radar profile visualizes this: the three active polygons converge on the recency axis and diverge on
    the spend axes.

Campaign Mapping:

  • VIP_Loyalists → Loyalty rewards, premium tier, retention

  • Regular → Basket expansion, frequency optimization, wallet-share growth

  • Casual_Buyers → Return reduction (clarity, guarantees), activation

  • Churned→ Win-back campaign (personalized re-engagement)

K Selection & Validation — Preference Segmentation

Purpose: Determine optimal k for preference-based customer segmentation using elbow + silhouette analysis.

Variables: Perishables_Pct, Beverages_Pct, Frozen_Pct, Canned_Pct, Others_Pct (LoB share of total spend).

Methodology:

  • No log transformation (ratios already bounded 0-1, no skew issue)

  • Z-score normalization on all 5 ratios (equal weighting in Euclidean distance)

  • Counting Loop: k = 2 to 10 (Math Formula Variable: currentIteration + 2)

  • k-Means + Silhouette per iteration, GroupBy aggregates mean silhouette

  • Loop End collects results, Line Plot visualizes curve

  • Static seed = 0 for reproducibility

Silhouette Results:

  • k=2 → 0.298 | k=3 → 0.263 | k=4 → 0.275 ← local peak

  • k=5 → 0.262 | k=6 → 0.269 | k=7 → 0.244

  • k=8 → 0.237 | k=9 → 0.239 | k=10 → 0.238

Decision: k = 4

  • Local silhouette peak (>k=3, before drop at k=5)

  • Symmetric with Value segmentation (k=4) — methodological consistency, easier cross-tab profiling

  • k=2 trivially splits "generalist vs specialist" — too coarse for category-level campaign targeting

  • k=6 marginal alternative (0.269) but offers no clear interpretive gain over k=4

Weak Cluster Structure:

All silhouette values fall in 0.24–0.30 range, below the 0.40 threshold typically considered "reasonable structure". This is expected for three reasons:

(1) Ratio constraint: the 5 LoB shares sum to 1.0, creating multicollinearity. Variables cannot move independently, which dampens cluster separation.

(2) Behavioral reality: most customers shop across multiple categories rather than specializing. Pure category-specialist customer types are rare in grocery retail.

(3) Outliers retained: Phase 1 flags (Negative_LoB, n=322) were not removed. These distort some ratios but are kept consistent with the retain-and-interpret strategy.

Interpretation as Business Insight:

Weak preference clusters are themselves a finding: NEW SMI's customer base is generalist, not category-specialist. This shapes strategy — broad assortment matters more than category-targeted campaigns for most segments. Category specialists exist but form niche groups, addressable through targeted but limited campaigns.

Preference Segmentation — Final Clustering & Profiling

Purpose: Apply k=4 to produce preference-based segments; profile with LoB ratios and flag observations. Used as auxiliary segmentation alongside Value.

Pipeline: Final k-Means (k=4, seed=0) on Z-score normalized LoB ratios → Column Filter/Rename (Preference_Cluster) → Joiner (CustID, raw data) → GroupBy ×2 (profiles + flag sums) → Rule Engine (business names).

Cluster Sizes & Profiles (LoB ratio means):

  • Fresh_Food_Beverage_Consumers (cluster_0): 6,716 (34%) | Perish 0.42 | Bev 0.23 | Frozen 0.14 | Canned 0.10 | Others 0.12

  • Fresh_Food_Consumers (cluster_3): 6,365 (32%) | Perish 0.69 | Bev 0.12 | Frozen 0.08 | Canned 0.05 | Others 0.06

  • Fresh_Frozen_Food_Others_Consumers (cluster_2): 4,045 (20%) | Frozen 0.30 | Perish 0.24 | Others 0.23 | Bev 0.12 | Canned 0.10

  • Balanced_Basket_Consumers (cluster_1): 2,874 (14%) | Perish 0.26 | Canned 0.25 | Frozen 0.18 | Bev 0.16 | Others 0.16

Weak Cluster Structure — Rationale for Auxiliary Use: Silhouette stayed in 0.24–0.30 across k=2–10, below the 0.40 threshold. Three drivers:

  • Ratio constraint: The 5 LoB variables sum to 1, introducing multicollinearity that limits separability.

  • Generalist customer base: Customers consume across categories with gradient differences rather than categorical splits — no sharp preference types.

  • Retained outliers: Per brief, anomalies were flagged not removed, adding within-cluster variance.

Segments therefore capture consumption tendencies but inform strategy in combination with Value, not as a standalone driver.

Flag Observations:

  • Negative_LoB (322 total): Fresh_Frozen_Food_Others_Consumers 169 (4.18%) — 3–9× other segments. Fresh_Food_Consumers 93 (1.46%), Balanced 29 (1.01%), Fresh_Food_Beverage 31 (0.46%).

  • Recency>365 (26 total): Fresh_Frozen_Food_Others_Consumers 16 (0.40%), highest across segments. Others 0.03–0.09%. Absolute count small.

Data-Backed Findings:

  • Fresh-food dominant customer base: Fresh_Food_Consumers (32%) and Fresh_Food_Beverage_Consumers (34%) together cover 66% of customers, both perishables-primary. Consistent with NEW SMI's market positioning.

  • Balanced segment profile: 14% of customers consume across all categories with no LoB above 0.30; Perishables (0.26) and Canned (0.25) lead, the rest range 0.16–0.18. The only segment without a dominant category.

  • Returns concentrate in Fresh_Frozen_Food_Others_Consumers: segment-level Negative_LoB rate (4.18%) is 3–9× higher than other segments.

Hypotheses Requiring Validation:All three observations below require further analysis to confirm; data limitations are noted where relevant.

  • Return-prone categories (Frozen, Others) in Fresh_Frozen_Food_Others_Consumers: The Negative_LoB concentration here suggests Frozen and Others categories may be driving returns disproportionately. The flag captures customers with negative LoB totals, not the specific category returned, so category-level breakdown is needed.

  • Churn risk in the same segment: This segment also carries the highest Recency>365 rate (0.40%, 4–13× other segments). Both flags landing on the same segment means a deeper look into this product category group (Frozen, Others) can yield insight on both return reduction and churn prevention simultaneously — one analytical effort addressing two business problems.

  • Fresh_Food perishable returns in Fresh_Food_Consumers: This perishable-dominant segment shows the second-highest Negative_LoB rate, though only 1.46% (93 customers). Given perishables are NEW SMI's largest consumption category and this segment is the second-largest (32%), individual-level review of these 93 cases may surface specific issues worth understanding and inform satisfaction improvement in this core category.

Auxiliary Use — Combining with Value: Given the weak cluster structure, Preference segments are positioned as auxiliary rather than as a primary targeting driver. Their role is to provide consumption-tendency context that enables Value-driven campaigns to be more focused: where a campaign audience is defined by Value segment, the Preference dimension may narrow that audience by consumption preference, sharpening message and offer relevance.

Example (illustrative only): A loyalty/premium campaign for VIP_Loyalists could be narrowed by Preference — e.g. a fresh/organic emphasis for Fresh_Food_Consumers within VIP.

K Selection Attempt — Demographic Segmentation [❌ ABANDONED APPROACH — see Methodology]

Purpose: Initial attempt to determine optimal k for demographic segmentation using the same elbow + silhouette pipeline applied in Value

and Preference segmentations. This approach was abandoned after diagnosis; retained in the workflow to document the methodological

reasoning behind the switch to k-Prototypes.

Variables: Age (min-max normalized), Dependents (binary), Marital_Status (one-hot, 5 columns), Gender (one-hot, 3 columns) — 10 columns total, all in 0-1 range.

Methodology:

  • Min-max normalization (not Z-score): binary-dominant structure (8 of 10 columns already 0/1); Z-score would distort one-hot columns by inflating
    rare categories with low variance.

  • Counting Loop: k = 2 to 10, k-Means + Silhouette per iteration, static seed = 0.

  • Silhouette result: monotonically increasing, no peak — k=2 → 0.254, rising steadily to k=10 → 0.602 (see Line Plot).

  • Diagnosis: the monotonic curve is not a tuning problem but a method–data mismatch. With 9 of 10 columns being one-hot categorical, k-Means + Euclidean
    distance treats categorical dissimilarity as spatial distance, which has no meaningful interpretation ("distance" between Married and Single is undefined). As
    k increases, clusters simply converge toward purer categorical combinations (e.g. Married+Male, Single+Female), mechanically raising silhouette without
    producing interpretable segments. Silhouette therefore cannot serve as a selection criterion here — it rewards fragmentation, not structure.

Decision: Approach abandoned — switched to k-Prototypes

  • Silhouette could not function as a selection criterion: the monotonic curve rewards categorical fragmentation rather than identifying meaningful structure
    (see Diagnosis above).

  • Root cause is the method, not the parameter — k-Means + Euclidean distance is inappropriate for categorical-dominant data, regardless of k.

  • Pursuing this branch further (e.g. tuning, alternative distance weighting) would not address the underlying method–data mismatch; reallocating effort to
    k-Prototypes was the more cost-effective path to a valid segmentation.

  • Resolution: switched to k-Prototypes (Huang), which applies Euclidean distance to the numeric variable and matching dissimilarity to categorical variables
    — aligning the distance measure with the data types. See "K Selection & Validation — Demographic Segmentation (k-Prototypes — switched from k-Means)".

  • Branch retained (not deleted) for methodological transparency: it documents that the final method was selected through diagnosis, not by default.

K Selection & Validation — Demographic Segmentation (k-Prototypes — switched from k-Means)

Purpose: Determine optimal k for demographic segmentation using a distance measure appropriate for categorical-dominant data. Adopted after the standard k-Means + silhouette pipeline proved unsuitable (see "K Selection Attempt — Demographic Segmentation [ABANDONED]").

Variables: Age (min-max normalized, numeric), Dependents (binary, categorical), Marital_Status (nominal, 5 levels), Gender (nominal, 3 levels). Education excluded from segmentation (weak correlation with behavior; retaining its 222 N/A records would require methodologically unsound imputation) but kept as a profiling variable.

Methodology:

  • k-Prototypes (Huang) via Python (kmodes library), executed in a KNIME Python Script node — no native KNIME node exists for mixed-type clustering.

  • Categorical variables passed as raw values (not one-hot): the algorithm handles them internally via matching dissimilarity. One-hot encoding would revert the categorical part to Euclidean logic, defeating the method's purpose.

  • Distance measure: squared Euclidean on the numeric variable (Age) + matching dissimilarity on categorical variables, combined via gamma. Gamma left at the library default (auto-derived from numeric standard deviation).

  • k = 2 to 10 loop, cost (total within-cluster dissimilarity) recorded per k, init = "Huang", n_init = 5, random_state = 0 for reproducibility.

  • Cost-based elbow used instead of silhouette: cost is the method's native objective and is consistent with the mixed distance measure, whereas silhouette is unreliable for categorical-dominant data.

Cost Results:

  • k=2 → 3354 | k=3 → 2548 | k=4 → 2088 ← elbow

  • k=5 → 1963 | k=6 → 1776 | k=7 → 1643

  • k=8 → 1388 | k=9 → 1246 | k=10 → 1287

  • Marginal cost reduction drops sharply after k=4 (806 → 460 → 125), the clearest inflection in the curve (see Line Plot).

  • Cost rises at k=10 (1246 → 1287), indicating the algorithm fails to find a better solution at high k — further evidence against over-segmentation.

Decision: k = 4

  • Clear cost elbow at k=4: marginal gains collapse from −460 (k=3→4) to −125 (k=4→5).

  • Unlike the abandoned k-Means attempt, here the method's native criterion (cost) points to a definite k — selection is data-driven, not imposed.

  • Symmetric with Value and Preference segmentations (both k=4), enabling consistent cross-tab profiling for next phases.

  • High-k solutions rejected: cost plateaus then rises (k=10), confirming additional clusters fragment rather than reveal structure.

  • Resulting segments are balanced (5185 / 4954 / 5386 / 4475) — no degenerate or trivially small cluster.

Education — Excluded but Retained & Profiled: Not used as a segmentation variable, but kept in the dataset and analyzed at the profiling stage. The decision followed a correlation check, not assumption:

  • Spearman correlation (Education ordinal vs. Age, Income, Monetary, Frequency) returned weak coefficients (rho ≈ 0.07–0.15). A signal exists but is too narrow to drive segmentation.

  • Including Education would also require imputing its 222 N/A records — methodologically unsound, since imputing an ordinal education level fabricates information that cannot be inferred reliably.

Why retain rather than remove:

  • Excluding from segmentation ≠ deleting from data. Consistent with the project's retain-and-interpret principle applied throughout Phase 1.

  • Removing the variable would discard a valid profiling dimension; removing the 222 N/A rows would lose real customers. Neither is justified.

  • N/A retained as its own category (6 levels: 5 real + N/A), making missingness itself observable in the segment × Education cross-tab.

Post-hoc confirmation (segment × Education cross-tab):

  • Education is near-uniformly distributed across segments — with one exception. Within-segment shares are almost identical for three of four segments (BSc ~46–49%, others proportional).

  • The single deviation: Young Singles show ~21% Primary education vs. ~9% in the other three segments. This aligns with the Phase 1 finding (Primary concentrated in the under-30 group) and reinforces the demographic coherence of that segment.

  • N/A is evenly spread (~1% per segment), confirming missingness is not segment-dependent and did not bias the segmentation.

  • Conclusion: the cross-tab confirms the exclusion decision — Education carries almost no segment-discriminating power (the near-uniform distribution), so omitting it from the distance computation cost nothing, while profiling still surfaced its one meaningful pattern.

Demographic Segmentation — Final Clustering & Profiling

Purpose: Apply k=4 to produce demographic segments via k-Prototypes; profile with raw demographics, education cross-tab, and flag observations.

Pipeline: Final k-Prototypes (k=4, init=Huang, n_init=5, random_state=0) on raw categorical + min-max normalized Age → CustID + Demographic_Cluster output → Joiner (CustID, master segment table)
→ GroupBy ×2 (profiles + flag sums) + Pivot ×3 (Marital, Gender, Education) → Rule Engine (business names).

Cluster Sizes & Profiles (raw means + dominant categories):

  • Young Singles (cluster_0): 5,185 (26%) | Age 36 | Dependents 0.76 | Single-dominant | Male-leaning | NPS 3.0

  • Senior Independents (cluster_1): 4,954 (25%) | Age 70 | Dependents 0.04 | mixed marital (Widow/Divorced elevated) | Male-leaning | NPS 4.2

  • Family Men (cluster_2): 5,386 (27%) | Age 48 | Dependents 0.94 | Married-dominant | 100% Male | NPS 3.3

  • Family Women (cluster_3): 4,475 (22%) | Age 47 | Dependents 0.90 | Married-dominant | 100% Female | NPS 3.3

Anomaly Distribution — Validation & Behavioral Patterns: Unlike Value segmentation (where flags share variables with the segmentation and thus validate cluster definitions directly), demographic segments were
formed independently of these flags. The distribution therefore serves two distinct purposes:

Validation (definition-consistent):

  • Flag_Young_DivWidow (2 total): both landed in Young Singles. This flag derives from an age–marital contradiction (young yet divorced/widowed), so its concentration in the youngest segment confirms internal
    consistency between the flag logic and the segment boundary.

Behavioral patterns (independent flags):

  • Flag_Negative_LoB (322 total): concentrated in Young Singles (172, 53% of all flagged; 3.3% within-segment), absent in Senior Independents (0). Returns/negative-LoB behavior associates with the youngest
    segment, not the oldest.

  • Flag_Recency_GT365 (26 total): spread across Young Singles (9), Family Women (9), Family Men (8), but zero in Senior Independents. Lapsed-purchase behavior avoids the senior segment entirely.

  • Flag_FreqZero (10 total): scattered (6/1/0/3) with no meaningful pattern — consistent with its previous phases classification as noise.

Coherent narrative across segmentations: Senior Independents carry near-zero anomaly density (Recency 0, Negative_LoB 0) alongside the highest NPS (4.2) — a clean, satisfied, loyal demographic mass. Young Singles concentrate the opposite: highest anomaly load (most Negative_LoB, the only Young_DivWidow cases) and lowest NPS (3.0). The two demographic extremes (youngest vs. oldest) anchor opposite ends of the behavioral-quality spectrum, a pattern that invites cross-reading with the Value segments in the Unified Segment Analysis section.

Key Insights:

  • Strong, well-separated segmentation: Unlike the auxiliary Preference segmentation, demographic segments are cleanly separated and individually interpretable. The cost elbow was definite (k=4), segment sizes are balanced (22–27%), and each segment maps to a distinct, recognizable demographic profile. This is a primary segmentation, usable as a standalone targeting driver.

  • Age and gender are the dominant separators: The three pivot tables show segmentation is driven mainly by age (36 / 47 / 48 / 70) and gender. Family Men and Family Women are demographically near-identical (Age ~47–48, Dependents ~0.9, Married-dominant) and split almost entirely on gender (100% M vs. 100% F) — a clean basis for gender-differentiated messaging within the same family-stage audience.

  • Marital structure follows life stage: Young Singles are Single-dominant (75%); family segments are Married-dominant (~66% / ~47%); Senior Independents show no single dominant status, with Widowed and Divorced elevated — consistent with an older population. Marital distribution reinforces rather than contradicts the age-based separation.

  • Education is near-uniform, with one exception: Across three of four segments, education shares are almost identical (BSc ~46–50%). The sole deviation — Young Singles at ~21% Primary vs. ~9% elsewhere — aligns with the age–education correlation reported in the K Selection annotation. This confirms the decision to exclude Education from segmentation (negligible discriminating power) while validating its retention as a profiling dimension.

  • Standalone use and combination potential: Demographic segments can directly drive demographically-targeted campaigns (e.g. life-stage or gender-specific messaging). When layered with Value and Preference, they offer the potential for sharper targeting — narrowing a Value-defined audience by life stage or gender. This combined targeting is examined in the Unified Segment Analysis section with a unified analysis.

Unified Segment Analysis — Cross-Segmentation Validation

Purpose: Cross-analyze the three independent segmentations (Value, Preference, Demographic) to test whether they converge on a consistent customer structure,
and to operationalize the combined use of Preference Segmentation with Value Segmentation.

Method: Three cross-tabs on the master segment table (Pivot nodes): Value × Demographic, Demographic × Preference, Value × Preference. Each shown as
a row-percentage heatmap (Python) — rows sum to 100%, so segments of different sizes stay comparable.

Cross-Tab Findings:

  • Older segments are more valuable (age–value alignment). Moving from the youngest to the oldest demographic segment, the share of high-value (VIP)
    customers rises steadily: Young Singles 7%, Family Women 30%, Family Men 33%, Senior Independents 95% (Demographic Seg. x Value Seg.). Since age and
    value come from completely separate variable sets, this steady rise means the two segmentations point to the same customers.

  • Higher-value customers prefer fresh food. Fresh-oriented consumption is strongest in the older, higher-value segments and weakest in the youngest.
    VIP_Loyalists (Value Seg.) are about three-quarters fresh-focused; the same holds for Senior Independents (Demographic Seg.), which — per the finding above —
    are almost entirely VIP. Their fresh focus differs slightly in composition: Seniors lean on fresh food plus beverages, where family segments lean on fresh food alone.
    The Fresh/Frozen/Others (Preference Seg.) group — the return-prone profile — runs the opposite way: highest in the youngest, lowest-value segments and lowest
    in the oldest, highest-value ones

  • Family Men and Family Women behave identically. Family Men and Family Women (Demographic Seg.) differ only by gender. In both value and consumption
    their numbers are nearly the same. For example, their high-value/VIP share (33% vs 30%) and fresh-food consumption share (43% vs 42%) almost match. Gender
    separates them demographically but does not change how they shop — so it is useful for gender focused messaging, not for predicting value.

  • Internet usage aligns with value (channel–value alignment). Average Internet share rises monotonically across the value tiers: Casual_Buyers 43.0%, Churned 43.2%,
    Regular 49.3%, VIP_Loyalists 72.1% (Value Seg. × Mean Internet). Online orientation is therefore not evenly spread — it concentrates in the highest-value segment,
    where the VIP tier sits roughly 23 points above the next group. Since Value Segmentation is built only from spend behaviour (recency, frequency, monetary) and never
    sees the Internet variable, this alignment is independent evidence that online channel use and customer value describe the same underlying high-value core.

Flag Synthesis — A Single Profile Across Three Segmentations: Each segmentation flagged returns (Negative_LoB) in a different segment: Young Singles (Demographic),
Casual_Buyers (Value), and Fresh/Frozen/Others consumers (Preference). The cross-tabs show these are largely the same people — most Casual_Buyers fall in the
Fresh/Frozen/Others consumption group, and Young Singles lead in both. So the three separate flag findings describe a single return-prone customer profile from three angles.

Combined Use — Confirmed: Preference Segmentation was positioned as auxiliary, to be used together with Value Segmentation rather than alone. The Value × Preference
cross-tab makes this concrete: 3,057 customers are both VIP_Loyalists and Fresh_Food_Consumers — a real, sized audience for a fresh/premium campaign. The combination
proposed in the Preference Segmentation annotation now has actual numbers behind it.

Methodological Significance: The three segmentations are built from completely separate inputs — demographics, spend behavior, and consumption mix. Despite this,
they converge on the same two customer groups: an older, high-value, fresh-food-loyal core and a younger, low-value, return-prone group. When three independent methods
agree, the structure is unlikely to be an artifact of any one method — it reflects a real pattern in the customer base. Beyond statistical convergence, the three segmentations
tell one coherent narrative: the same customer structure emerges whether viewed through demographics, spend, or consumption.

Predictive Modeling — Model Development & Selection

Purpose: This stage builds and compares predictive models for NEW SMI's delivery subscription and selects the one best suited for campaign targeting.
The aim is not maximum accuracy but identifying which customers to contact.

Data & Method: The labelled training set (2,000 customers) is an exact subset of the 20K main customer data, with the addition of the real Subs label.
Therefore, all of its columns except Subs are discarded, and each customer's predictors are instead taken from the cleaned and improved 20K dataset
through a CustID join, preserving the engineering already done in previous phases (capped Internet, imputed Income, AvgTicket, LoB ratios) instead of
recomputing it on the smaller subset. Thus the 2,000-customer labelled set is aligned with the main dataset. The target is cast to nominal (Number To
String), and the data is split 70/30 with stratified partitioning (Table Partitioner) to preserve the 10.4% positive rate in both sets. Three models are
trained along two branches: Decision Tree and Random Forest run directly on the full feature set, while Logistic Regression runs through a Correlation
Filter, a Normalizer (Z-score), and L2 regularization. Each model is evaluated with a ROC curve, a Lift chart, and a Scorer; the three probability outputs
are joined for a combined ROC comparison.

Two-Branch Rationale: The two branches exist because the models react differently to correlated features. Decision Tree and Random Forest are
robust to multicollinearity and split on features individually, so they run on the full feature set. Logistic Regression is sensitive to correlated inputs, which
destabilise its coefficients; it therefore runs through a Correlation Filter (threshold 0.8, which drops the redundant Age/Income/Frequency/AvgTicket
cluster and keeps Monetary as their representative), a Normalizer for scale, and a Gauss prior (L2) to absorb the sum-to-one collinearity of the LoB ratios.

Evaluation:

  • ROC (combined): Logistic Regression leads with AUC 0.909, just ahead of Random Forest (0.896) and Decision Tree (0.807). The three models are close,
    with Random Forest separating best in the low-FPR region — exactly where campaign targeting operates.

  • Lift: All three concentrate value in the top deciles, with Random Forest holding a marginal edge in the first decile (~5.8) and cumulative lift staying
    above baseline to roughly the top 30% of customers before falling below it. This points to a shallow, high-precision targeting depth.

  • Scorer (0.5 cut-off): On accuracy the three are almost identical (RF 91.8%, DT 92.0%, LogReg 91.3%), and precision is effectively tied (RF 0.686,
    DT 0.652, LogReg 0.692). They separate on recall: Decision Tree recovers most positives (48.4%), Random Forest is intermediate (38.7%), and Logistic
    Regression is weakest (29.0%). At this imbalance the 0.5 cut-off leaves all three with low recall, which is why model quality is judged on ranking behaviour
    rather than this threshold.

  • Threshold selection (Youden's J): Since the 0.5 cut-off is inappropriate at this imbalance, the operating threshold is derived from the ROC curve using
    Youden's J statistic (max TPR − FPR), which selects the point of best separation without assuming any cost ratio. On the test set this gives a threshold of
    0.192, recovering 82% of true subscribers (TPR) at a 9% false-positive rate — a far more usable operating point than 0.5, and the threshold later applied to
    the full base for targeting.

Model Selection — Random Forest: Random Forest is selected as the primary model. The choice is not driven by overall fit: with subscribers at only
10.4%, the practical task is ranking customers by propensity so the campaign can target a narrow, high-likelihood group rather than classifying the whole
base. The segmentation stage already showed that online ordering concentrates in the high-value core and thins out across lower tiers; since a delivery
subscription is a related online behaviour, propensity is expected to cluster similarly, making low-FPR separation — how cleanly a model isolates the densest
prospects at the top of the ranking — the criterion that matters most. On this criterion Random Forest is the strongest choice. Logistic Regression has a
marginal edge in overall AUC (0.909 vs 0.896), but Random Forest separates best in the low-FPR region and holds a slight edge in first-decile lift, placing
more true subscribers at the top of the list. The recall comparison confirms the trade-off: at the 0.5 cut-off the three models tie on accuracy and precision,
yet Random Forest recovers noticeably more subscribers than Logistic Regression (38.7% vs 29.0%) while giving up almost no specificity (0.98 vs 0.985)
— Logistic Regression is simply too conservative, missing genuine prospects to avoid a handful of extra contacts. Decision Tree recovers the most positives
but is the weakest ranker (AUC 0.807) and, as a single tree, the least stable; Random Forest delivers a comparable recall advantage in a more robust form.

Methodological Note: The selected model feeds the next stage, where ranked subscription probabilities are combined with per-customer profitability to
decide whom to contact (profit/cost analysis of contacted). The model is trained here on the labelled set, then applied to the full 20K base for that step,
using the Youden-derived threshold (0.192) as the contact cut-off, since the campaign decision is made over all customers, not only the labelled ones.

Predictive Modeling — Profit/Cost Analysis of Contacted

Purpose: With Random Forest selected, this phase turns its subscription probabilities into a contact decision: which customers to contact for the new delivery subscription, and the expected cost & profit of doing so.

Revenue & Profit Model: Expected monthly profit is built across three chained Math Formula nodes. Internet is used as the revenue base: because the subscription is a delivery service, a customer's current
online spending is the best available indicator of the revenue they would generate as a subscriber. The first formula node computes monthly online revenue as (Internet / 100) × Monetary / 12 — the Internet
share converts annual spend into its online portion, divided by twelve for a monthly figure. The second formula node applies a 25% gross margin, a typical grocery-retail figure drawn from sector benchmarks.
The third formula node weights this gross profit by P(Subs=1), so each customer counts only to the extent that they are both likely to subscribe and profitable to serve. A Sorter then ranks customers by
expected profit, from highest to lowest. This online revenue is treated as a baseline: a subscription could increase a customer's online spending, but without data on subscriber behaviour to quantify that growth,
profitability is computed on current spending alone.

Ranking & Contact Decision: Ranking customers by expected profit answers who is most valuable to contact first, but not where to stop. A fixed probability cut-off such as 0.5 does not work at a 10.4% positive
rate, where it labels almost everyone a non-subscriber and recovers only a fraction of true ones. Instead, the threshold is derived from the model's ROC curve using Youden's J statistic (max TPR − FPR), computed
in a Python node. This selects the point of strongest separation without assuming any cost ratio, giving a threshold of 0.192 on the test set. A Math Formula numbers the sorted customers with a
Contact_Rank ($$ROWINDEX$$ + 1), and a Moving Aggregator accumulates expected profit down this ranking. A Rule Engine then applies the threshold to the full 20K base — customers with P(Subs=1) ≥ 0.192
are flagged "Contact", the rest "Skip" — yielding 3,354 target customers (16.8% of the base).

Targeting Depth & Cost: The Moving Aggregator's cumulative expected profit, plotted against contact depth in a Python node, shows how return builds as more customers are contacted. The curve rises steeply
and then flattens: the first customers carry most of the profit, while later ones add almost nothing. Against this, acquisition cost grows linearly — each contact costs the same. The two lines cross at around 33%
of the base; beyond that point, each additional contact costs more than it returns. The Youden cutoff at 16.8% sits well inside the profitable region, where the gap between profit and cost is still wide: contacting
this 16.8% captures roughly 90% of the total expected profit available. Cost itself is not fixed to an invented figure but bounded relative to profit. Among the 3,354 contacted customers, the one ranked last by
expected profit — the weakest case still flagged "Contact" — has a monthly gross profit of 134.47. Because every other contacted customer ranks higher, anchoring the cost ceiling to this customer guarantees it
for the whole list. Grocery-retail benchmarks place a sustainable acquisition cost at 30–50% of gross profit; taking the midpoint (40%) sets the ceiling at 53.79 per customer. The campaign stays profitable down
to the last contacted customer as long as per-customer acquisition cost remains below this figure.

Segment Alignment: A Pivot node cross-tabulates the contact decision against the value segments, visualised as a column-percentage heatmap (Python node). The result concentrates almost entirely in one
segment: of the 3,354 customers flagged "Contact", 98.8% are VIP_Loyalists, with only a handful of Regular customers and virtually none from Casual_Buyers or Churned. This closes the loop with the segmentation
phase, where three independently built segmentations converged on the same high-value core; the subscription targeting, derived from a separate model, lands on that same core. The selection is not indiscriminate,
however: only about 40% of VIP_Loyalists are contacted, so the model still discriminates within the segment rather than contacting every customer in it.

Methodological Note: Two simplifications bound the profit/cost results:

  • Expected profit uses current behaviour. It is based on a customer's current online spending and does not model how subscribing might change it; a customer with high propensity but low current online volume
    can rank lower than their subscription potential might justify. This is a deliberate, defensible floor, not a prediction of post-subscription spending.

  • Youden weights both errors equally. The threshold treats false positives and false negatives as equally costly, whereas missing a genuine subscriber and contacting a non-subscriber may not carry the same
    cost in practice. With no data to quantify that asymmetry, the neutral equal-weight assumption was kept.

GroupBy
Silhouette Coefficient
Column Renamer
RowID
Joiner
RowID
Loop End
RowID
k-Means
Joiner
RowID
Counting Loop Start
Lift Chart (JavaScript) (legacy)
Math Formula (Variable)
ROC Curve
Lift Chart (JavaScript) (legacy)
Line Plot
Math Formula
Scorer
Math Formula
Lift Chart (JavaScript) (legacy)
k-Means
Scorer
Column Filter
Scorer
Math Formula
Math Formula
Python Script
GroupBy
Math Formula
Random Forest Predictor
Joiner
Column Renamer
Column Renamer
Math Formula
Excel Reader
String Replacer
Math Formula
Table View
Missing Value
Table View
Age Distribution- Range: 19–83 years- Highest concentration: 30–40 age group- Secondary peak around 65 years- Bimodal distribution suggests two distinct customer age profiles
Histogram
Math Formula
Joiner
Excel Reader
GroupBy
Sorter
Column Renamer
Column Filter
Column Renamer
Joiner
Joiner
GroupBy
Logistic Regression Learner
Expression
Number to String
Column Renamer
Joiner
Rule Engine
Column Filter
Joiner
Column Filter
Expression
Math Formula (Variable)
Expression
k-Means
Column Filter
Loop End
Counting Loop Start
Moving Aggregator
Column Filter
Math Formula
AvgTicket — Box Plot- Median ~180, IQR ~100–270- Upper whisker ~500- High-value outliers: cluster 600–850, isolated cases above 1,100- Outliers retained — represent VIP/heavy buyers expected to form distinct segment
Box Plot
Rule Engine
LoB Ratios — Box Plot- Perishables_Pct dominant (median ~0.43), widest spread (IQR 0.30–0.60)- Beverages & Canned: narrow, low medians- Frozen & Others: low median, outliers up to 0.90 — niche category specialists
Box Plot
Python Script
Column Filter
Pivot
AvgTicket Distribution- Range: 0–1,635 (basket size in €)- Right-skewed; modal range 50-300- Median ~180, mean ~183- 10 records at 0 (Frequency=0 by definition)- Few extreme baskets above 500
Histogram
Python Script
Income Distribution- Range: 36–191,402 (post-imputation; Income=0 treated as missing)- Near-normal, slightly right-skewed- Majority between 30,000–70,000- Long right tail; few high-income outliers
Histogram
Column Filter
k-Means
Perishables Distribution- Range: -2–20,692- Strongly right-skewed- Most customers spend low amounts, few high-value spenders- Negative values present — likely returns or corrections
Histogram
Normalizer
Normalizer
Python Script
AvgTicket vs Frequency- Bimodal behavior: • Low freq + high basket (occasional) • High freq + moderate basket (regulars)- Upper-right region near-empty: high freq × high basket is rare- Strong signal for segmentation
Scatter Plot
Silhouette Coefficient
Column Filter
Math Formula
Line Plot
GroupBy
Column Filter
One to Many
Column Filter
Column Filter
Marital Status Distribution- Married largest (~7,713, includes 489 mode-imputed missing values)- Single (~5,081) and Together (~4,645) follow- Divorced (~1,557) and Widow (~1,004) are minorities
Bar Chart
Gender Distribution- Male majority (~12,737, includes 169 mode-imputed missing values)- Female ~7,068 (~35%)- "Other" negligible (~195)- Note: mode imputation assigns all NaN to majority class (M)
Bar Chart
NPS Distribution- Score 4 is the most frequent (~7,500)- Scores 2 and 3 are similar (~4,700 each)- Score 5 represents ~15% of customers- Overall satisfaction is moderately positive
Bar Chart
Normalizer
Monetary vs Frequency- Strong positive relationship: higher frequency leads to higher spending- High-frequency customers (>40) show wide spending variance
Scatter Plot
Line Plot
Monetary vs Recency- Active customers (Recency <100) show full spending range- Inactive customers (Recency >200) are exclusively low spenders
Scatter Plot
Python Script
Dependents Distribution- Majority of customers have dependents (1) — approximately 13,200- Customers without dependents represent approximately 34% of the base
Bar Chart
Joiner
Column Filter
Python Script
Recency vs Internet- No clear relationship between online channel usage and recency- Most active customers (Recency <100) are spread across all internet usage levels- Inactive customers (Recency >200) are very few and show no distinct pattern
Scatter Plot
GroupBy
Recency — Box Plot- Median ~53 days, IQR: 27–79 days- Most customers are recently active- Upper whisker ~160 days, outliers extend to 373 days — potential churners
Box Plot
GroupBy
Income vs Monetary- Moderate positive relationship: higher income tends to higher spending- Wide variance at all income levels — income alone does not determine spending
Scatter Plot
Pivot
NPS vs Monetary- Higher NPS scores (4–5) tend to concentrate more high-spending customers- Low NPS scores (1–2) are predominantly associated with lower spending- Suggests a moderate positive relationship between satisfaction and spending
Scatter Plot
Pivot
Pivot
Monetary — Box Plot- Median ~2,900, IQR: 1,159–7,405- Significant right skew with outliers above 20,000- Wide spending variance across the customer base
Box Plot
Linear Correlation
Pivot
Frequency — Box Plot- Median ~18 purchases, IQR: 12–29- A few high-frequency outliers above 55- Right-skewed distribution confirmed
Box Plot
Rule Engine
Binner
Pivot
Expression
Pivot
Linear Correlation Matrix- Frequency & Monetary: 0.922 (strongest RFM)- Age & Income: 0.884- AvgTicket & Monetary: 0.820- Dependents & Monetary: -0.519- Frozen_Pct & Age: -0.298 (younger customers prefer frozen)
Table View
Radar chart of Value segments across 5 normalized RFM-style metrics. Each metric scaled to its column max; Recency inverted so outer = more recent (better). Outputs matplotlib PNG. Profiling visual only.
Python Script
Binner
Radar/spider chart of Preference segments across the 5 LoB consumption ratios. Reads cluster-level ratio means (GroupBy), maps cluster IDs to business names, outputs a matplotlib PNG via the image port. Profiling visual only
Python Script
GroupBy
Python Script
Python Script
Row Filter
GroupBy
GroupBy
Python Script
Math Formula (Variable)
Column Filter
GroupBy
Rule Engine
Math Formula
Column Filter
GroupBy
Table Partitioner
Line Plot
Decision Tree Learner
Random Forest Learner
Decision Tree Predictor
Random Forest Predictor
Loop End
Counting Loop Start
Silhouette Coefficient
Correlation Filter
Frequency Distribution- Range: 0–64- Right-skewed, majority of customers shop 10–20 times per year- Small segment of very high-frequency shoppers (>40)
Histogram
Column Renamer
Linear Correlation
Normalizer
Reference Column Filter
k-Means
Normalizer
Canned Distribution- Range: -2–19,656- Strongly right-skewed- Most customers spend low amounts on canned products
Histogram
Recency Distribution- Range: 0–373- Majority of customers made a purchase within the last 100 days- Small group of inactive customers with recency >300 days — potential churners
Histogram
Normalizer (Apply)
Monetary Distribution- Range: 27–31,939- Strongly right-skewed distribution- Majority of customers are low spenders, small group of high-value customers
Histogram
GroupBy
Logistic Regression Predictor
Frozen Distribution- Range: -2–9,299- Strongly right-skewed- Most customers spend low amounts on frozen products
Histogram
Internet Distribution- Range: 10–100- Broad distribution, customers spread across all online usage levels- Two peaks visible around 40–45% and 80–85% — suggests two distinct channel preference profiles
Histogram
Column Renamer
Education Distribution- BSc dominant (~9,665 customers)- High School and MSc follow- PhD and Primary are minorities- 222 records (Education='0') treated as N/A; not imputed (not used for segmentation)
Bar Chart
Column Renamer
Beverages Distribution- Range: -2–13,025- Strongly right-skewed- Most customers spend low amounts on beverages
Histogram
Others Distribution- Range: -2–13,074- Strongly right-skewed- Most customers spend low amounts on other product categories- Negative values present — likely returns or corrections
Histogram

Nodes

Extensions

Links