
KNIME_Group_project

Data Preprocessing — Data Quality Treatment

  • Education = 0 replaced with N/A

  • Internet values capped at 100% (6 records affected)
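The two treatments above can be sketched in plain Python. This is a minimal illustration, not the workflow's actual node configuration: the sample rows are invented, and `None` is assumed as the stand-in for N/A.

```python
# Hypothetical sample rows; only the column names come from the annotations.
rows = [
    {"Education": "BSc", "Internet": 45.0},
    {"Education": "0", "Internet": 112.0},
    {"Education": "PhD", "Internet": 100.0},
]

def treat_quality(row):
    """Return a cleaned copy of one record."""
    cleaned = dict(row)
    # Education coded as 0 carries no information, so mark it as missing.
    # Assumption: None represents N/A.
    if str(cleaned["Education"]).strip() == "0":
        cleaned["Education"] = None
    # Internet is a percentage, so values above 100 are capped at 100.
    if cleaned["Internet"] is not None:
        cleaned["Internet"] = min(cleaned["Internet"], 100.0)
    return cleaned

cleaned_rows = [treat_quality(r) for r in rows]
```

Capping rather than dropping keeps the affected records available for later steps.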

Missing Value Imputation

- Marital_Status (489) & Gender (169): Most Frequent Value (Mode)

- Income (62 records: 35 NaN + 27 zeros): context-aware median imputation

  - Income = 0 treated as missing (implausible for active customers)

  - Primary lookup: median by Age_Bin × Monetary_Bin (12 cells)

  - Fallback: Age_Bin median when group count < 100

  - Rationale: Income-Age correlation = 0.88, Income-Monetary correlation = 0.80
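The income lookup with its fallback might look roughly like the sketch below. It makes two assumptions the annotation does not settle: "group count" is read as the number of observed (non-missing, non-zero) incomes in a cell, and every Age_Bin has at least one observed income. The bin labels and sample values are hypothetical.

```python
from collections import defaultdict
from statistics import median

MIN_GROUP_SIZE = 100  # fallback threshold stated in the annotation

def is_missing(value):
    """Treat None, NaN and 0 as missing income."""
    return value is None or value != value or value == 0

def impute_income(rows, min_group_size=MIN_GROUP_SIZE):
    """Fill missing Income with the cell median, else the Age_Bin median."""
    by_cell = defaultdict(list)  # (Age_Bin, Monetary_Bin) -> observed incomes
    by_age = defaultdict(list)   # Age_Bin -> observed incomes
    for r in rows:
        if not is_missing(r["Income"]):
            by_cell[(r["Age_Bin"], r["Monetary_Bin"])].append(r["Income"])
            by_age[r["Age_Bin"]].append(r["Income"])

    result = []
    for r in rows:
        filled = dict(r)
        if is_missing(r["Income"]):
            cell = by_cell.get((r["Age_Bin"], r["Monetary_Bin"]), [])
            if len(cell) >= min_group_size:
                filled["Income"] = median(cell)
            else:
                # Sparse cell: fall back to the coarser Age_Bin median.
                filled["Income"] = median(by_age[r["Age_Bin"]])
        result.append(filled)
    return result

# Toy data with a lowered threshold so both branches are exercised.
sample = [
    {"Age_Bin": "young", "Monetary_Bin": "low", "Income": 20000},
    {"Age_Bin": "young", "Monetary_Bin": "low", "Income": 30000},
    {"Age_Bin": "young", "Monetary_Bin": "high", "Income": 50000},
    {"Age_Bin": "young", "Monetary_Bin": "low", "Income": 0},
    {"Age_Bin": "young", "Monetary_Bin": "high", "Income": None},
]
imputed = impute_income(sample, min_group_size=2)
# imputed[3]["Income"] is 25000 (cell median of 20000 and 30000)
# imputed[4]["Income"] is 30000 (Age_Bin median, since the cell has one value)
```

Medians are computed from observed incomes only, so the zeros being replaced do not pull the fill values down.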

Feature Engineering & Anomaly Flagging

[1] Derived Variables (6 new columns)

• AvgTicket = Monetary / Frequency (basket size; Frequency=0 → 0, 10 records)

• Perishables_Pct, Beverages_Pct, Frozen_Pct, Canned_Pct, Others_Pct (= LoB / Monetary)

Purpose: distinguish "small basket frequent" vs "large basket occasional" shoppers; capture category preference independent of total spend.
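As a rough sketch, the six derived columns could be computed per record as follows. The line-of-business column names are inferred from the `_Pct` names, the example values are invented, and the guard for Monetary = 0 is an added assumption (the reported Monetary range starts at 27, so it should never trigger).

```python
LOB_COLUMNS = ["Perishables", "Beverages", "Frozen", "Canned", "Others"]

def derive_features(row):
    """Return a copy of the record with AvgTicket and the five LoB shares."""
    out = dict(row)
    frequency = row["Frequency"]
    monetary = row["Monetary"]
    # Frequency = 0 would divide by zero, so AvgTicket is set to 0 instead.
    out["AvgTicket"] = monetary / frequency if frequency else 0.0
    for col in LOB_COLUMNS:
        # Share of total spend per line of business.
        # Assumption: a zero Monetary yields a share of 0.
        out[f"{col}_Pct"] = row[col] / monetary if monetary else 0.0
    return out

# Hypothetical customer: 4 purchases totalling 1,000.
example = derive_features({
    "Frequency": 4, "Monetary": 1000,
    "Perishables": 500, "Beverages": 200,
    "Frozen": 100, "Canned": 100, "Others": 100,
})
# example["AvgTicket"] is 250.0 and example["Perishables_Pct"] is 0.5
```

Because the shares are ratios, a negative LoB amount (a return) produces a negative share; the coherence flags cover that case.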

[2] Coherence Flags (~360 distinct records, retained)

• Flag_FreqZero: 10 records

• Flag_Recency_GT365: 26 records

• Flag_Young_DivWidow: 2 records

• Flag_Negative_LoB: 322 records (returns)
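The four flags could be expressed as simple boolean rules, sketched below. The age cutoff for "young" is not given in the source, so `YOUNG_AGE_LIMIT` is a hypothetical value. The marital status labels and LoB column names are taken from the charts and derived-variable names, and the sample records are invented.

```python
LOB_COLUMNS = ["Perishables", "Beverages", "Frozen", "Canned", "Others"]
YOUNG_AGE_LIMIT = 25  # hypothetical cutoff; the source does not state it

def coherence_flags(row):
    """Return the four coherence flags for one record (record is retained)."""
    return {
        "Flag_FreqZero": row["Frequency"] == 0,
        "Flag_Recency_GT365": row["Recency"] > 365,
        "Flag_Young_DivWidow": (
            row["Age"] < YOUNG_AGE_LIMIT
            and row["Marital_Status"] in {"Divorced", "Widow"}
        ),
        # Any negative line-of-business amount is read as a return.
        "Flag_Negative_LoB": any(row[col] < 0 for col in LOB_COLUMNS),
    }

suspicious = coherence_flags({
    "Frequency": 0, "Recency": 400, "Age": 22, "Marital_Status": "Widow",
    "Perishables": -2, "Beverages": 0, "Frozen": 0, "Canned": 0, "Others": 0,
})
regular = coherence_flags({
    "Frequency": 18, "Recency": 53, "Age": 40, "Marital_Status": "Married",
    "Perishables": 900, "Beverages": 300, "Frozen": 200, "Canned": 100, "Others": 50,
})
```

The flags only annotate records; nothing is dropped, matching the "retained" note above.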

Note on outliers

Statistical outliers in RFM and AvgTicket are not flagged separately. These represent valid high-value customer behavior (e.g. VIPs, heavy buyers) and are intended to form distinct segments in clustering rather than to be corrected. Box plots in the EDA section provide visual outlier inspection.

Marital Status Distribution
- Married largest (~7,713, includes 489 mode-imputed missing values)
- Single (~5,081) and Together (~4,645) follow
- Divorced (~1,557) and Widow (~1,004) are minorities
Bar Chart
Gender Distribution
- Male majority (~12,737, includes 169 mode-imputed missing values)
- Female ~7,068 (~35%)
- "Other" negligible (~195)
- Note: mode imputation assigns all NaN to majority class (M)
Bar Chart
NPS Distribution
- Score 4 is the most frequent (~7,500)
- Scores 2 and 3 are similar (~4,700 each)
- Score 5 represents ~15% of customers
- Overall satisfaction is moderately positive
Bar Chart
Monetary vs Frequency
- Strong positive relationship: higher frequency is associated with higher spending
- High-frequency customers (>40) show wide spending variance
Scatter Plot
Monetary vs Recency
- Active customers (Recency <100) show full spending range
- Inactive customers (Recency >200) are exclusively low spenders
Scatter Plot
Dependents Distribution
- Majority of customers have dependents (1): approximately 13,200
- Customers without dependents represent approximately 34% of the base
Bar Chart
Column Filter
Recency vs Internet
- No clear relationship between online channel usage and recency
- Most active customers (Recency <100) are spread across all internet usage levels
- Inactive customers (Recency >200) are very few and show no distinct pattern
Scatter Plot
Recency (Box Plot)
- Median ~53 days, IQR: 27–79 days
- Most customers are recently active
- Upper whisker ~160 days, outliers extend to 373 days (potential churners)
Box Plot
Income vs Monetary
- Moderate positive relationship: higher income tends to go with higher spending
- Wide variance at all income levels; income alone does not determine spending
Scatter Plot
NPS vs Monetary
- Higher NPS scores (4–5) tend to concentrate more high-spending customers
- Low NPS scores (1–2) are predominantly associated with lower spending
- Suggests a moderate positive relationship between satisfaction and spending
Scatter Plot
Monetary (Box Plot)
- Median ~2,900, IQR: 1,159–7,405
- Significant right skew with outliers above 20,000
- Wide spending variance across the customer base
Box Plot
Linear Correlation
Frequency (Box Plot)
- Median ~18 purchases, IQR: 12–29
- A few high-frequency outliers above 55
- Right-skewed distribution confirmed
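The outlier threshold can be cross-checked against the quartiles, assuming the box plot uses the common 1.5 × IQR whisker rule (the node's actual setting is not shown here). The helper name `tukey_fences` is illustrative.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k times the interquartile range."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Frequency quartiles read from the annotation: Q1 = 12, Q3 = 29.
freq_lower, freq_upper = tukey_fences(12, 29)
# freq_upper is 54.5, in line with "outliers above 55" for integer counts
```

Under the same assumption, the Recency quartiles (27 and 79) give an upper fence of 157 days, close to the reported whisker of roughly 160.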
Box Plot
Binner
Expression
Linear Correlation Matrix
- Frequency & Monetary: 0.922 (strongest RFM)
- Age & Income: 0.884
- AvgTicket & Monetary: 0.820
- Dependents & Monetary: -0.519
- Frozen_Pct & Age: -0.298 (younger customers prefer frozen)
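For reference, the coefficients above are assumed to be Pearson correlations. A minimal pure-Python version, using toy numbers rather than the project data, is:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

# Toy values only: a perfectly linear pair and a perfectly inverse pair.
r_up = pearson([1, 2, 3, 4], [10, 20, 30, 40])
r_down = pearson([1, 2, 3, 4], [8, 6, 4, 2])
```

Here `r_up` is 1 and `r_down` is -1 up to floating-point rounding; real columns such as Frequency and Monetary fall between those extremes.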
Table View
Binner
GroupBy
Row Filter
GroupBy
Excel Reader
String Replacer
Math Formula
Table View
Missing Value
Table View
Age Distribution
- Range: 19–83 years
- Highest concentration: 30–40 age group
- Secondary peak around 65 years
- Bimodal distribution suggests two distinct customer age profiles
Histogram
Joiner
GroupBy
Column Renamer
Column Renamer
Joiner
Expression
Column Renamer
Joiner
Expression
Expression
Column Filter
AvgTicket (Box Plot)
- Median ~180, IQR ~100–270
- Upper whisker ~500
- High-value outliers: cluster 600–850, isolated cases above 1,100
- Outliers retained: they represent VIP/heavy buyers expected to form a distinct segment
Box Plot
Frequency Distribution
- Range: 0–64
- Right-skewed, majority of customers shop 10–20 times per year
- Small segment of very high-frequency shoppers (>40)
Histogram
LoB Ratios (Box Plot)
- Perishables_Pct dominant (median ~0.43), widest spread (IQR 0.30–0.60)
- Beverages & Canned: narrow, low medians
- Frozen & Others: low median, outliers up to 0.90 (niche category specialists)
Box Plot
Column Filter
AvgTicket Distribution
- Range: 0–1,635 (basket size in €)
- Right-skewed; modal range 50–300
- Median ~180, mean ~183
- 10 records at 0 (Frequency=0 by definition)
- Few extreme baskets above 500
Histogram
Income Distribution
- Range: 36–191,402 (post-imputation; Income=0 treated as missing)
- Near-normal, slightly right-skewed
- Majority between 30,000–70,000
- Long right tail; few high-income outliers
Histogram
Canned Distribution
- Range: -2 to 19,656
- Strongly right-skewed
- Most customers spend low amounts on canned products
Histogram
Perishables Distribution
- Range: -2 to 20,692
- Strongly right-skewed
- Most customers spend low amounts, few high-value spenders
- Negative values present (likely returns or corrections)
Histogram
Recency Distribution
- Range: 0–373
- Majority of customers made a purchase within the last 100 days
- Small group of inactive customers with recency >300 days (potential churners)
Histogram
Monetary Distribution
- Range: 27–31,939
- Strongly right-skewed distribution
- Majority of customers are low spenders, small group of high-value customers
Histogram
AvgTicket vs Frequency
- Bimodal behavior:
  • Low freq + high basket (occasional)
  • High freq + moderate basket (regulars)
- Upper-right region near-empty: high freq × high basket is rare
- Strong signal for segmentation
Scatter Plot
Frozen Distribution
- Range: -2 to 9,299
- Strongly right-skewed
- Most customers spend low amounts on frozen products
Histogram
Internet Distribution
- Range: 10–100
- Broad distribution, customers spread across all online usage levels
- Two peaks visible around 40–45% and 80–85%, suggesting two distinct channel preference profiles
Histogram
Education Distribution
- BSc dominant (~9,665 customers)
- High School and MSc follow
- PhD and Primary are minorities
- 222 records (Education='0') treated as N/A; not imputed (not used for segmentation)
Bar Chart
Beverages Distribution
- Range: -2 to 13,025
- Strongly right-skewed
- Most customers spend low amounts on beverages
Histogram
Others Distribution
- Range: -2 to 13,074
- Strongly right-skewed
- Most customers spend low amounts on other product categories
- Negative values present (likely returns or corrections)
Histogram