Spark k-Means

This node applies the Apache Spark K-means clustering algorithm. It outputs the cluster centers for a predefined number of clusters; the number of clusters is not determined dynamically. K-means performs a crisp clustering that assigns each data vector to exactly one cluster. The node does not normalize the data; if normalization is required, consider using the "Spark Normalizer" node as a preprocessing step.
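
Because the node does not normalize its input, a feature with a large numeric range can dominate the distance computation. The following is a minimal sketch in plain Python, assuming min-max scaling as the normalization and using made-up example values; the function name and data are hypothetical, and this is not the implementation of the "Spark Normalizer" node.

```python
import math


def min_max_normalize(rows):
    """Rescale every column of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]


# Hypothetical data: (age in years, income in dollars).
rows = [[25.0, 50000.0], [60.0, 50500.0], [26.0, 58000.0]]

# Row 1 differs from row 0 mostly in age, row 2 mostly in income.
raw_age_gap = math.dist(rows[0], rows[1])
raw_income_gap = math.dist(rows[0], rows[2])

scaled = min_max_normalize(rows)
scaled_age_gap = math.dist(scaled[0], scaled[1])
scaled_income_gap = math.dist(scaled[0], scaled[2])

print(raw_age_gap, raw_income_gap)        # income dominates on raw data
print(scaled_age_gap, scaled_income_gap)  # comparable after scaling
```

On the raw values the income difference outweighs the age difference by more than an order of magnitude; after scaling, both features contribute on a similar scale.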

Use the Spark Cluster Assigner node to apply the learned model to unseen data.
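
Conceptually, applying a k-means model to unseen data means assigning each new vector to its nearest cluster center. The sketch below illustrates that idea in plain Python with Euclidean distance; the function name, centers, and data points are hypothetical, and it is not the code of the Spark Cluster Assigner node.

```python
import math


def assign_cluster(point, centers):
    """Return the index of the center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


# Hypothetical learned cluster centers and unseen data points.
centers = [[0.0, 0.0], [10.0, 10.0]]
unseen = [[1.0, 2.0], [9.0, 8.5], [4.0, 4.0]]

labels = [assign_cluster(p, centers) for p in unseen]
print(labels)
```

Each point receives exactly one cluster index, which matches the crisp clustering described above.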

Options

Number of clusters: The number of clusters (cluster centers) to be created.
Number of iterations: The maximum number of iterations after which the algorithm terminates, regardless of whether the cluster centers are still improving.
Initialization seed: Random seed for cluster initialization (requires Apache Spark 1.3 or later).
Feature Columns: The feature columns to learn the model from. Supports only numeric columns.
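
To show how the options above interact, here is a minimal sketch of the standard k-means iteration (Lloyd's algorithm) in plain Python. The parameters `k`, `max_iterations`, and `seed` mirror the number of clusters, number of iterations, and initialization seed. The function name and data are hypothetical, and the initialization used here (randomly sampling k input rows) is a simplification that may differ from what Apache Spark does internally.

```python
import math
import random


def k_means(rows, k, max_iterations, seed):
    """Cluster numeric `rows` into `k` groups; returns (centers, labels)."""

    def nearest(row, centers):
        return min(range(k), key=lambda i: math.dist(row, centers[i]))

    # Seeded initialization: the same seed gives the same starting centers.
    rng = random.Random(seed)
    centers = [list(r) for r in rng.sample(rows, k)]

    for _ in range(max_iterations):
        labels = [nearest(r, centers) for r in rows]
        new_centers = []
        for i in range(k):
            members = [r for r, label in zip(rows, labels) if label == i]
            if members:
                # New center is the column-wise mean of the cluster members.
                new_centers.append([sum(c) / len(members) for c in zip(*members)])
            else:
                # Keep the old center if no row was assigned to it.
                new_centers.append(centers[i])
        if new_centers == centers:
            break  # converged before reaching the iteration limit
        centers = new_centers

    # Final labels always refer to the returned centers.
    labels = [nearest(r, centers) for r in rows]
    return centers, labels


# Hypothetical numeric feature columns forming two well-separated groups.
rows = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.5], [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
centers, labels = k_means(rows, k=2, max_iterations=10, seed=42)
print(centers, labels)
```

The loop stops either when the centers no longer change or when `max_iterations` is reached, whichever comes first. Only numeric values can be averaged and compared by distance, which is why the feature columns must be numeric.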

Input Ports

Input data (JavaRDD)

Output Ports

The input data, with each row labeled with the cluster it is assigned to.
MLlib Cluster Model

Views

This node has no views

Installation

To use this node in KNIME, install the extension KNIME Extension for Apache Spark (legacy) from the update site below, following the NodePit Product and Node Installation Guide:

v5.5

Plugin provider: KNIME AG, Zurich, Switzerland

Plugin version: 5.5.0.v202506051107

On NodePit since: 2025-07-02

Last update: 2025-08-14

KNIME versions: Since v3.6
