Optimized K-Means (Silhouette Coefficient)

This component determines the best number of clusters (k) for k-Means according to the mean silhouette coefficient.
The component uses the Parameter Optimization Loop, which retrains k-Means with a different k at each iteration.
In the dialog, select the columns used for k-Means and set the range of tested k values by choosing a start value, the maximum number of iterations, and the step size taken at each iteration.
The data is shuffled with the configured seed before it is passed to k-Means, which prevents bad initialization of the cluster centers when the data is ordered.
The clustering algorithm uses the Euclidean distance on the selected attributes. The data is not normalized by the node (if required, consider using the "Normalizer" node as a preprocessing step).
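
As an illustration of the score being optimized, the following is a minimal sketch of the mean silhouette coefficient with Euclidean distance, written in plain Python. It is not the component's implementation. The function name `mean_silhouette`, the sample points, and the convention of scoring singleton clusters as 0 are assumptions made for this example.

```python
import math
from collections import defaultdict


def mean_silhouette(points, labels):
    """Mean silhouette coefficient using Euclidean distance (illustrative sketch)."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")

    def mean_dist(i, members):
        # Average distance from point i to the given members, excluding i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention: singleton clusters score 0
            continue
        a = mean_dist(i, own)  # cohesion: mean distance within the own cluster
        b = min(  # separation: mean distance to the nearest other cluster
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical data: two tight groups that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = mean_silhouette(points, [0, 0, 1, 1])  # labels follow the groups
bad = mean_silhouette(points, [0, 1, 0, 1])   # labels cut across the groups
```

A labeling that follows the two groups scores close to 1, while one that cuts across them scores below 0. This is why a higher mean silhouette coefficient is treated as the better clustering.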

Options

Select columns:
Only numerical columns are valid, because k-Means doesn't work for non-numeric features.
Start value for k:
The smallest number of clusters to be tested. With the 'BruteForce' optimization strategy, this value is the first k used for training. With the 'Hillclimbing' optimization strategy, this value is the smallest k allowed during the iterations.
Max number of iterations:
At each iteration, k-Means is trained with a different k. Select the maximum number of iterations allowed here. With the 'BruteForce' optimization strategy, this value is the exact number of iterations performed. With the 'Hillclimbing' optimization strategy, fewer iterations may be needed to reach a local optimum and end the loop. In both cases, the maximum value of k allowed is "start value + (max number of iterations - 1) * step size".
Step size for k:
Define the increment in the number of clusters for each tested clustering.
Seed for shuffling data and Hillclimbing starting k:
Shuffling the data prevents bad initialization of the cluster centers when the data is ordered. The starting k of the Hillclimbing strategy strongly influences which local optimum is found. Change this value to check that your results do not depend on the shuffled row order or on the starting k of the Hillclimbing strategy.
Optimization strategy:
Choose 'BruteForce' or 'Hillclimbing' as optimization strategy.
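
The options above can be sketched in plain Python to show how the start value, maximum number of iterations, and step size define the tested k values, and how the two strategies differ. This is a generic illustration, not the Parameter Optimization Loop's actual implementation. The function names and the `toy_scores` values are made up for the example and are not real results.

```python
import random


def candidate_ks(start, max_iterations, step):
    """k grid: start, start + step, ..., start + (max_iterations - 1) * step."""
    return [start + i * step for i in range(max_iterations)]


def brute_force(ks, score):
    """Evaluate every candidate k and return the best one."""
    return max(ks, key=score)


def hill_climb(ks, score, seed):
    """Start at a seeded random k and move to the better neighbour until stuck."""
    rng = random.Random(seed)
    pos = rng.randrange(len(ks))
    best = score(ks[pos])
    while True:
        neighbours = [p for p in (pos - 1, pos + 1) if 0 <= p < len(ks)]
        scored = [(score(ks[p]), p) for p in neighbours]
        if not scored:
            return ks[pos]  # only one candidate, nothing to climb
        top_score, top_pos = max(scored)
        if top_score <= best:
            return ks[pos]  # local optimum: no neighbour is better
        best, pos = top_score, top_pos


# Hypothetical mean silhouette per k, with a local optimum at k=4
# and the global optimum at k=8 (illustrative values only).
toy_scores = {2: 0.40, 4: 0.55, 6: 0.30, 8: 0.62, 10: 0.50}

ks = candidate_ks(start=2, max_iterations=5, step=2)  # [2, 4, 6, 8, 10]
best_brute = brute_force(ks, toy_scores.get)          # always the global optimum
best_climb = hill_climb(ks, toy_scores.get, seed=42)  # 4 or 8, depending on the start
```

In this sketch the brute-force search always finds k=8, while hill climbing can stop at k=4 when it starts at or below that value. This is why trying different seeds is worthwhile with the Hillclimbing strategy.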

Input Ports

Table containing the data to be clustered.

Output Ports

Table containing the tested k values and their mean silhouette coefficients, sorted by silhouette coefficient with the best value first.
The input data, with each row labeled with the cluster it belongs to. Output from k-Means for the best k.
The clusters created by k-Means with the best k.
PMML cluster model from k-Means with the best k.
