KNIME_Group_project

Data Preprocessing — Data Quality Treatment

  • Education = 0 replaced with N/A

  • Internet values capped at 100% (6 records affected)
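The two data quality rules above are applied in the workflow with KNIME nodes. As a rough illustration only, the same logic can be sketched in plain Python. The `clean_record` helper and the sample rows are hypothetical; only the column names `Education` and `Internet` and the two rules come from the annotation.

```python
def clean_record(record: dict) -> dict:
    """Return a copy of one row with the two quality rules applied."""
    cleaned = dict(record)

    # Rule 1: Education coded as 0 is not a valid level, so treat it as missing (N/A).
    if cleaned.get("Education") in (0, "0"):
        cleaned["Education"] = None

    # Rule 2: Internet is a percentage, so any value above 100 is capped at 100.
    internet = cleaned.get("Internet")
    if internet is not None and internet > 100:
        cleaned["Internet"] = 100

    return cleaned


# Hypothetical sample rows, not taken from the project data.
rows = [
    {"Education": 0, "Internet": 112},
    {"Education": "BSc", "Internet": 45},
]
cleaned_rows = [clean_record(r) for r in rows]
```

Rows that already satisfy both rules pass through unchanged, which makes it easy to count how many records a rule affected by comparing input and output.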

Missing Value Imputation

  • Marital_Status & Gender: Most Frequent Value (Mode)

  • Income: Median (35 records affected)
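The imputation strategy above (mode for the categorical columns, median for Income) is configured in the Missing Value node. A minimal Python sketch of the same idea follows; the `impute` function and the sample rows are assumptions made for illustration, and missing values are represented as `None`.

```python
from statistics import median, mode


def impute(rows, mode_cols=("Marital_Status", "Gender"), median_cols=("Income",)):
    """Fill missing values (None) with the column mode or median."""
    fills = {}
    for col in mode_cols:
        present = [r[col] for r in rows if r[col] is not None]
        fills[col] = mode(present)      # most frequent value
    for col in median_cols:
        present = [r[col] for r in rows if r[col] is not None]
        fills[col] = median(present)    # robust to high-income outliers

    # Replace None only in the columns for which a fill value was computed.
    return [
        {k: (fills[k] if v is None and k in fills else v) for k, v in r.items()}
        for r in rows
    ]


# Hypothetical sample rows, not taken from the project data.
rows = [
    {"Marital_Status": "Married", "Gender": "Male", "Income": 30000},
    {"Marital_Status": None, "Gender": "Female", "Income": None},
    {"Marital_Status": "Married", "Gender": "Male", "Income": 50000},
    {"Marital_Status": "Single", "Gender": None, "Income": 70000},
]
imputed = impute(rows)
```

The median is a reasonable choice for Income because the distribution has a long right tail, which would pull a mean-based fill value upward.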

Excel Reader
String Replacer
Marital Status Distribution
  • Married customers are the largest group (~7,500)
  • Single and Together categories are similar in size (~5,000 each)
  • Widow and Divorced are minorities
Bar Chart
Math Formula
Table View
Missing Value
Table View
Age Distribution
  • Range: 19–83 years
  • Highest concentration: 30–40 age group
  • Secondary peak around 65 years
  • Bimodal distribution suggests two distinct customer age profiles
Histogram
Gender Distribution
  • Male customers are the majority (~12,700)
  • Female customers represent approximately 35% of the base
  • "Other" category is negligible
Bar Chart
NPS Distribution
  • Score 4 is the most frequent (~7,500)
  • Scores 2 and 3 are similar (~4,700 each)
  • Score 5 represents ~15% of customers
  • Overall satisfaction is moderately positive
Bar Chart
Monetary vs Frequency
  • Strong positive relationship: higher purchase frequency is associated with higher spending
  • High-frequency customers (>40) show wide spending variance
Scatter Plot
Monetary vs Recency
  • Active customers (Recency <100) span the full spending range
  • Inactive customers (Recency >200) are exclusively low spenders
Scatter Plot
Dependents Distribution
  • Majority of customers have dependents (1): approximately 13,200
  • Customers without dependents represent approximately 34% of the base
Bar Chart
Column Filter
Recency vs Internet
  • No clear relationship between online channel usage and recency
  • Most active customers (Recency <100) are spread across all internet usage levels
  • Inactive customers (Recency >200) are very few and show no distinct pattern
Scatter Plot
Recency Box Plot
  • Median ~53 days, IQR: 27–79 days
  • Most customers are recently active
  • Upper whisker ~160 days; outliers extend to 373 days (potential churners)
Box Plot
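The whisker position reported for Recency can be sanity-checked against the quartiles. The sketch below assumes the box plot uses the common Tukey convention of 1.5 × IQR for its fences, which the annotation does not state; `tukey_fences` is a hypothetical helper name.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles taken from the Recency annotation (IQR: 27–79 days).
lower, upper = tukey_fences(27, 79)
print(lower, upper)  # -51.0 157.0
```

Under that assumption the upper fence is 157 days, in line with the reported upper whisker of roughly 160 days. The lower fence is negative, so no low-end outliers are possible for a non-negative variable such as Recency.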
Income vs Monetary
  • Moderate positive relationship: higher income tends to go with higher spending
  • Wide variance at all income levels, so income alone does not determine spending
Scatter Plot
NPS vs Monetary
  • High-spending customers tend to concentrate at higher NPS scores (4–5)
  • Low NPS scores (1–2) are predominantly associated with lower spending
  • Suggests a moderate positive relationship between satisfaction and spending
Scatter Plot
Monetary Box Plot
  • Median ~2,900, IQR: 1,159–7,405
  • Significant right skew with outliers above 20,000
  • Wide spending variance across the customer base
Box Plot
Frequency Distribution
  • Range: 0–64
  • Right-skewed; the majority of customers shop 10–20 times per year
  • Small segment of very high-frequency shoppers (>40)
Histogram
Linear Correlation
Frequency Box Plot
  • Median ~18 purchases, IQR: 12–29
  • A few high-frequency outliers above 55
  • Right-skewed distribution confirmed
Box Plot
Income Distribution
  • Range: 0–191,402
  • Near-normal distribution, slightly right-skewed
  • Majority of customers concentrated between 30,000 and 70,000
  • Small number of high-income outliers (>120,000) visible on the right tail
Histogram
Canned Distribution
  • Range: -2 to 19,656
  • Strongly right-skewed
  • Most customers spend low amounts on canned products
Histogram
Perishables Distribution
  • Range: -2 to 20,692
  • Strongly right-skewed
  • Most customers spend low amounts, with few high-value spenders
  • Negative values present, likely returns or corrections
Histogram
Recency Distribution
  • Range: 0–373
  • Majority of customers made a purchase within the last 100 days
  • Small group of inactive customers with recency >300 days (potential churners)
Histogram
Monetary Distribution
  • Range: 27–31,939
  • Strongly right-skewed distribution
  • Majority of customers are low spenders, with a small group of high-value customers
Histogram
Linear Correlation Matrix
  • Frequency & Monetary show the strongest correlation (0.922)
  • Age & Income are strongly correlated (0.879)
  • Recency shows negative correlations with all variables: inactive customers spend less
  • LoB variables (Perishables, B
Table View
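The coefficients in the correlation matrix come from KNIME's Linear Correlation node. For readers who want to see what such a coefficient measures, here is a minimal sketch of the Pearson correlation in plain Python. The `pearson` function and the sample values are illustrative assumptions, not the project data, so the result will not reproduce the 0.922 reported above.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-deviations (numerator) and root sums of squares (denominator).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical Frequency and Monetary values for five customers.
frequency = [5, 12, 18, 29, 40]
monetary = [300, 1100, 2900, 7400, 15000]
r = pearson(frequency, monetary)
```

A value close to 1 indicates that the two variables rise together almost linearly, and a negative value, as seen for Recency, indicates that one tends to fall as the other rises.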
Frozen Distribution
  • Range: -2 to 9,299
  • Strongly right-skewed
  • Most customers spend low amounts on frozen products
Histogram
Internet Distribution
  • Range: 10–100
  • Broad distribution, with customers spread across all online usage levels
  • Two peaks visible around 40–45% and 80–85%, suggesting two distinct channel preference profiles
Histogram
Education Distribution
  • BSc is the dominant education level (~9,700 customers)
  • MSc and High School follow as second and third largest groups
  • PhD and Primary education are minorities
  • Small number of N/A records (Education = 0 replaced)
Bar Chart
Beverages Distribution
  • Range: -2 to 13,025
  • Strongly right-skewed
  • Most customers spend low amounts on beverages
Histogram
Others Distribution
  • Range: -2 to 13,074
  • Strongly right-skewed
  • Most customers spend low amounts on other product categories
  • Negative values present, likely returns or corrections
Histogram