
Clustering_Retrieval

Exercise 1: Retrieve missing packaging information using k-nearest neighbour

The first section of the workflow deals with loading the data into KNIME and carrying out some basic pre-processing. In the second part of the exercise we will assign a Packaging label to SKUs with missing packaging information. K Nearest Neighbour (kNN) will be used for this purpose.

From the KNIME Explorer, drag and drop SHP_data.csv.

Preprocessing phase. Our objective is to run kNN using Variant as target column. We will use SKU Size and Price per SU (PPSU) for distance evaluation. To get there, we need to:
1) Calculate Price per SU (Value/Volume) using 'Math Formula'.
2) Normalize PPSU and Size with the 'Normalizer' node.
We can also use 'Color Manager' and 'Scatter Plot' nodes to visualize our dataset.

We'll split the table and run kNN with Euclidean distance and k=3.
1) Use 'Row Splitter' to extract SKUs without packaging information.
2) For both outputs, use 'Column Filter' to remove unnecessary columns, keeping only Size, Packaging and PPSU.
3) Connect a 'Numeric Distances' node to the normalized data and set it to compute Euclidean distance on both Size and PPSU.
4) Use the 'K Nearest Neighbor' node to assign Packaging labels based on the chosen distance metric.

In the last part, we visually compare final results to the initial dataset.
1) Drag a 'Concatenate' node to stitch together the kNN result and the labeled Packaging table.
2) Use 'Column Merger' to merge Packaging to Clustering Result.
3) Use 'Column Filter' to keep only Size, PPSU and the merged Packaging column.
4) Visualize results using 'Color Appender' and 'Scatter Plot' nodes.

Addendum: Evaluation of optimal number of clusters for Shampoo SKUs Size/Price data.

The first section of the workflow deals with loading the data into KNIME and carrying out some basic pre-processing. From the KNIME Explorer, drag and drop SHP_data.csv.

Preprocessing phase. Our objective is to cluster SKUs by Size and Unit Price. To get there, we need to:
1) Remove "ml" from the Size column and convert it to an integer. We can use the 'String Manipulation' node followed by the 'String to Number' node to do this.
2) Calculate Price per Unit (PPU). We use the 'Math Formula' node to get Price per SU. We then use the Size information and the Shampoo SU factor (2571 ml) to get to Price per Unit.
3) Normalize PPU and Size with the 'Normalizer' node. NOTE: use z-score to avoid 1-element clusters in ANOVA.
We can also use 'Color Manager' and 'Scatter Plot' nodes to visualize our dataset.

In Section 2, we use the "Elbow Method" to determine the optimal number of clusters to use for our dataset. This method plots a line chart with the number of clusters on the x axis and the within-cluster variance on the y axis. The idea is to use visual inspection to see where the line starts flattening, i.e. where additional clusters stop bringing incremental knowledge.

We will perform k-Means clustering with a number of clusters ranging from 2 to 10. At each step, we will compute the variance metrics of each resulting cluster. In order:
1) Drag a generic loop start node.
2) Create a 'Java Edit Variable' node to generate the cluster number variable. This can be done by adding 2 to the loop counter.
3) Connect a 'k-Means' node to the loop start and variable nodes. Configure it to cluster on Size and PPU.
4) Use a 'One-way ANOVA' node to compute variance metrics.
5) Close the loop with 'Variable Condition Loop End', with the termination condition set to cluster number = 10.

We will now plot the elbow curves for Size and PPU.
1) Use the 'Row Filter' node to keep only the "within group variance" rows.
2) Use 'Column Filter' to keep Sum of Squares.
3) Use 'Row Splitter' to divide PPU and Size rows.
4) Connect 2 'Line Plot' nodes to visualize results.
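The elbow evaluation in the Addendum can be sketched outside KNIME in plain Python. This is only an illustration under stated assumptions: the within-cluster sum of squares is computed directly rather than read from the 'One-way ANOVA' node output, the k-means is a bare Lloyd's loop and not KNIME's 'k-Means' implementation, and the function names and the `SAMPLE` rows (already z-scored Size and PPU) are hypothetical.

```python
import random


def kmeans_labels(points, k, seed=0, iterations=50):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct rows
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def within_feature_ss(points, labels):
    """Within-cluster sum of squares, reported separately per feature."""
    n_features = len(points[0])
    totals = [0.0] * n_features
    for c in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == c]
        for j in range(n_features):
            mean_j = sum(p[j] for p in members) / len(members)
            totals[j] += sum((p[j] - mean_j) ** 2 for p in members)
    return tuple(totals)


def elbow_curve(points, k_min=2, k_max=10, seed=0):
    """Map each cluster count to its (Size, PPU) within-cluster sums of squares."""
    return {
        k: within_feature_ss(points, kmeans_labels(points, k, seed))
        for k in range(k_min, k_max + 1)
    }


# Hypothetical, already normalized (Size, PPU) rows.
SAMPLE = [
    (-1.2, -0.9), (-1.1, -1.0), (-1.0, -0.8),
    (-0.2, 1.1), (-0.1, 1.3), (0.0, 1.2),
    (0.9, -0.3), (1.0, -0.2), (1.1, -0.4),
    (1.8, 1.5), (1.9, 1.4), (2.0, 1.6),
]

if __name__ == "__main__":
    for k, (ss_size, ss_ppu) in elbow_curve(SAMPLE).items():
        print(k, round(ss_size, 3), round(ss_ppu, 3))
```

Plotting the two printed columns against the cluster count gives one elbow curve for Size and one for PPU. Because Lloyd's algorithm depends on its starting centroids, the curve is not guaranteed to decrease at every step, which is one reason the choice of cluster count is left to visual inspection.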
Exercise 2: Clustering of SKU by Size/Price and evaluation of new SKU launches

The first section of the workflow deals with loading the data into KNIME and carrying out some basic pre-processing. With the optimal number of clusters now determined, we can proceed to test a couple of different clustering techniques and see how results compare. We can then select one of the approaches and analyse how P&G is performing in each cluster in terms of share. Finally, we will import a list of soon-to-be-launched SKUs and evaluate the cluster they will belong to. Coupling this with the P&G share information will allow us to understand how P&G SKUs are covering the Size/Price clusters.

From the KNIME Explorer, drag and drop SHP_data.csv.

Preprocessing phase. Our objective is to cluster SKUs by Size and Unit Price. To get there, we need to:
1) Calculate Price per Unit (PPU). We use the 'Math Formula' node to get Price per SU. We then use the Size information and the Shampoo SU factor (2571 ml) to get to Price per Unit.
2) Normalize PPU and Size with the 'Normalizer' node.

We will perform k-Means and Fuzzy c-Means on normalized data, setting the number of clusters to 6.
1) Select the 'k-Means' node and cluster on Size and PPU. Use 'Color Manager' and 'Scatter Plot' nodes to visualize results.
2) Select the 'Fuzzy c-Means' node and cluster on Size and PPU with seed=1. Use 'Color Manager' and 'Scatter Plot' nodes to visualize results.

We now calculate P&G Value Share in each cluster.
1) Use a 'Column Filter' node connected to 'c-Means' to keep Code and Winning Cluster. Join the resulting table to the original dataset on EAN code.
2) Use 'GroupBy' to calculate the sum of Value Sales for each cluster and join the results to the dataset on Winning Cluster.
3) Calculate SKU Share via 'Math Formula'.
4) Use the 'Pivoting' node to compute Company share for each cluster.
5) Finally, use 'Column Filter' to keep P&G shares only.

We will now load the new launches dataset and evaluate which clusters the new SKUs will belong to.
1) From the KNIME Explorer, drag and drop SHP_data_launches.csv.
2) We need to preprocess the data as done before. Simply copy the nodes from the preprocessing step and connect them to the File Reader node.
3) To assign clusters to new data, use the 'Cluster Assigner' node, connected both to the input data and to the 'k-Means' node model port.
4) Use 'Column Filter' to keep Code and Cluster.
5) Join (Full Outer) the new dataset to the Cluster Share information table.
6) Finally, use the 'Sorter' node to sort by cluster name.

Lastly, we use 'Rule Engine' to qualify whether a certain SKU launch is good or bad. We define a launch in a cluster where P&G has more than 50% share as bad, and the opposite as good.
1) Use the 'Rule Engine' node to codify success/failure rules (see side note for code).
2) Use 'Set Colors' and 'Pie Chart' nodes to visualize results.
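The P&G value-share calculation from Exercise 2 ('GroupBy', then 'Math Formula', then 'Pivoting', then 'Column Filter') can be mirrored in a few lines of Python. This is a rough sketch rather than the workflow's actual logic: the field names `code`, `company`, `cluster` and `value_sales`, the sample rows, and the choice to return `None` for a cluster with no P&G SKU are all assumptions made for illustration.

```python
from collections import defaultdict


def company_share_by_cluster(rows, company="P&G"):
    """Return {cluster: value share in percent} for one company."""
    # GroupBy step: total Value Sales per cluster.
    cluster_total = defaultdict(float)
    for row in rows:
        cluster_total[row["cluster"]] += row["value_sales"]

    # Math Formula + Pivoting steps: SKU share, summed per (cluster, company).
    share = defaultdict(float)
    for row in rows:
        sku_share = 100.0 * row["value_sales"] / cluster_total[row["cluster"]]
        share[(row["cluster"], row["company"])] += sku_share

    # Column Filter step: keep one company; None marks a cluster without it.
    return {cluster: share.get((cluster, company)) for cluster in cluster_total}


# Hypothetical SKU rows after joining the cluster labels to the dataset.
SKUS = [
    {"code": "A", "company": "P&G", "cluster": "cluster_0", "value_sales": 30.0},
    {"code": "B", "company": "P&G", "cluster": "cluster_0", "value_sales": 30.0},
    {"code": "C", "company": "Other", "cluster": "cluster_0", "value_sales": 40.0},
    {"code": "D", "company": "Other", "cluster": "cluster_1", "value_sales": 50.0},
]

if __name__ == "__main__":
    print(company_share_by_cluster(SKUS))
```

The sketch assumes every cluster has a positive Value Sales total; a cluster whose total is zero would need explicit handling before the division.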
Side note: Rule Engine code used to qualify SKU launches.
$P&G+Value Share$ >= 50 AND NOT (MISSING $Code$) => "bad choice"
$P&G+Value Share$ < 50 AND MISSING $Code$ => "bad choice"
MISSING $P&G+Value Share$ AND MISSING $Code$ => "bad choice"
TRUE => "good choice"

Nodes used in the workflow: File Reader, Math Formula, Normalizer, Normalizer (Apply), Color Manager, Color Appender (deprecated), Scatter Plot (legacy), Line Plot (legacy), Pie chart (legacy), Generic Loop Start (deprecated), Java Edit Variable, Variable Condition Loop End (deprecated), One-way ANOVA, Row Filter, Row Splitter, Column Filter, Column Merger, Concatenate, Joiner (deprecated), GroupBy, Pivot, k-Means, Fuzzy c-Means, Cluster Assigner, K Nearest Neighbor (Distance Function) (deprecated), Numeric Distances, Rule Engine, Preprocessing.
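To make the evaluation order of the Rule Engine rules easier to follow, here is a hedged Python equivalent. It assumes that rules are tried top to bottom with the first match winning, that a comparison against a missing value does not match, and that a missing cell is represented by `None`; the function name `qualify_launch` is hypothetical.

```python
from typing import Optional


def qualify_launch(pg_value_share: Optional[float], code: Optional[str]) -> str:
    """Return the label of the first matching rule, evaluated top to bottom."""
    share_missing = pg_value_share is None
    code_missing = code is None

    # $P&G+Value Share$ >= 50 AND NOT (MISSING $Code$)
    if not share_missing and pg_value_share >= 50 and not code_missing:
        return "bad choice"
    # $P&G+Value Share$ < 50 AND MISSING $Code$
    if not share_missing and pg_value_share < 50 and code_missing:
        return "bad choice"
    # MISSING $P&G+Value Share$ AND MISSING $Code$
    if share_missing and code_missing:
        return "bad choice"
    # TRUE (catch-all)
    return "good choice"


if __name__ == "__main__":
    for share, code in [(60.0, "A1"), (30.0, "A1"), (30.0, None), (None, None)]:
        print(share, code, qualify_launch(share, code))
```

Written this way, a launch (a row with a Code) counts as a good choice only when its cluster has a P&G share below 50 or no P&G share at all.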
