Feature Engineering & Anomaly Flagging
[1] Derived Variables (6 new columns)
• AvgTicket = Monetary / Frequency
(basket size; Frequency=0 → 0, 10 records)
• Perishables_Pct, Beverages_Pct, Frozen_Pct,
Canned_Pct, Others_Pct (= LoB / Monetary)
Purpose: distinguish "small basket frequent" vs
"large basket occasional" shoppers; capture
category preference independent of total spend.
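A minimal pandas sketch of the six derived columns, assuming the customer table has `Monetary` and `Frequency` columns plus one spend column per LoB. The LoB column names and the function name are hypothetical. The slide does not say how `Monetary = 0` is handled for the percentage columns; this sketch leaves those values as NaN.

```python
import pandas as pd

# Assumed names of the per-LoB spend columns (hypothetical).
LOB_COLS = ["Perishables", "Beverages", "Frozen", "Canned", "Others"]


def add_derived_variables(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with AvgTicket and the five *_Pct columns added."""
    out = df.copy()

    # AvgTicket = Monetary / Frequency, set to 0 where Frequency is 0.
    has_purchases = out["Frequency"] > 0
    ratio = out["Monetary"] / out["Frequency"].where(has_purchases)
    out["AvgTicket"] = ratio.where(has_purchases, 0.0)

    # Category share = LoB spend / Monetary.
    # Assumption: rows with Monetary == 0 are left as NaN.
    monetary = out["Monetary"].where(out["Monetary"] != 0)
    for col in LOB_COLS:
        out[f"{col}_Pct"] = out[col] / monetary

    return out
```

Masking the denominator before dividing avoids infinities from division by zero, so the `Frequency = 0` rule is applied explicitly rather than patched afterwards.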
[2] Coherence Flags (~360 distinct records, retained)
• Flag_FreqZero: 10 records
• Flag_Recency_GT365: 26 records
• Flag_Young_DivWidow: 2 records
• Flag_Negative_LoB: 322 records (returns)
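The four flags could be built as boolean columns along these lines. This is a sketch under assumptions: the `Age` and `MaritalStatus` column names, the marital-status labels, and the LoB column names are hypothetical, and the slide does not state the age cut-off for "young", so it is a required parameter here.

```python
import pandas as pd

# Assumed names of the per-LoB spend columns (hypothetical).
LOB_COLS = ["Perishables", "Beverages", "Frozen", "Canned", "Others"]


def add_coherence_flags(
    df: pd.DataFrame,
    young_max_age: int,                 # cut-off not given in the source
    age_col: str = "Age",               # hypothetical column name
    marital_col: str = "MaritalStatus", # hypothetical column name
    div_widow_values=("Divorced", "Widowed"),  # hypothetical labels
) -> pd.DataFrame:
    """Return a copy of df with the four Flag_* columns; no rows are dropped."""
    out = df.copy()
    out["Flag_FreqZero"] = out["Frequency"] == 0
    out["Flag_Recency_GT365"] = out["Recency"] > 365
    out["Flag_Young_DivWidow"] = (
        (out[age_col] <= young_max_age) & out[marital_col].isin(div_widow_values)
    )
    # A negative value in any LoB column (returns) raises the flag.
    out["Flag_Negative_LoB"] = (out[LOB_COLS] < 0).any(axis=1)
    return out


def count_flagged(df: pd.DataFrame) -> int:
    """Number of distinct records with at least one Flag_* column set."""
    flag_cols = [c for c in df.columns if c.startswith("Flag_")]
    return int(df[flag_cols].any(axis=1).sum())
```

`count_flagged` counts each record once even when it raises several flags, which is the sense in which a distinct-record total can be lower than the sum of the per-flag counts.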
Note on outliers (upper bound = Q3 + 1.5 × IQR)
- Monetary > 16,772: 126 customers (0.63%)
- Frequency > 54: 42 customers (0.21%)
- Recency > 157: 796 customers (3.98%) — lapsed
- AvgTicket > 506: 29 customers (0.14%)
- ≥1 dim outlier: 972 (4.86%); ≥2 dims: 21
Retained, not flagged. These are valid customer behaviors (VIPs, heavy buyers, and lapsed customers) that are expected to form distinct segments in clustering, not errors to be corrected. Visual inspection of the EDA box plots confirms this.
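The outlier counts above can be reproduced with a short pandas sketch, assuming the conventional upper fence Q3 + 1.5 × IQR and pandas' default (linear) quantile interpolation. The function names are hypothetical, and the exact thresholds depend on the quantile method used.

```python
import pandas as pd


def upper_fence(s: pd.Series, k: float = 1.5) -> float:
    """Upper outlier bound Q3 + k * IQR for one column."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return float(q3 + k * (q3 - q1))


def outlier_summary(df: pd.DataFrame, cols, k: float = 1.5) -> dict:
    """Count upper-bound outliers per column and per record; nothing is removed."""
    fences = {c: upper_fence(df[c], k) for c in cols}
    is_out = pd.DataFrame({c: df[c] > fences[c] for c in cols})
    dims = is_out.sum(axis=1)  # number of dimensions each record exceeds
    return {
        "fences": fences,
        "per_dim": {c: int(is_out[c].sum()) for c in cols},
        "any_dim": int((dims >= 1).sum()),
        "two_plus_dims": int((dims >= 2).sum()),
    }
```

Calling `outlier_summary(df, ["Monetary", "Frequency", "Recency", "AvgTicket"])` would return the per-dimension counts alongside the "≥1 dim" and "≥2 dims" totals. As a consistency check on the reported figures, the per-dimension counts sum to 126 + 42 + 796 + 29 = 993 = 972 + 21, which is what one would expect if each of the 21 multi-dimension outliers exceeds exactly two bounds.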