Spark Entropy Scorer

KNIME Extension for Apache Spark core infrastructure version 4.2.0.v202007072005 by KNIME AG, Zurich, Switzerland

Scores a clustering result against a reference clustering. Connect a Spark DataFrame/RDD that contains one column with the reference cluster IDs and one column with the clustering results to the input port. Select the respective columns in the dialog. After successful execution, the view shows entropy values (the smaller the better) and a quality value in [0,1], with 1 being the best possible value, as used in Fuzzy Clustering in Parallel Universes, section 6: "Experimental results".


Reference column
Column containing the reference clustering.
Clustering column
Column containing the cluster IDs to evaluate.
Output scores as flow variables
The scores can be exported as flow variables.
Prefix of flow variables
Prefix added to the names of the exported flow variables so that name conflicts with existing variables are avoided.

Input Ports

Arbitrary input Spark DataFrame/RDD with at least two columns, where one column contains the reference clustering and one the clustering that shall be scored.

Output Ports

Table containing entropy values for each cluster. The last row contains statistics on the entire clustering. It corresponds to the table shown in the Statistics View.


Statistics View
Simple statistics on the clustering, such as the number of clusters found, the number of objects in clusters, the number of reference clusters, and the total number of objects. Further statistics include:
  • Entropy: The accumulated entropy of all identified clusters, weighted by the relative cluster size. The entropy is not normalized and may be greater than 1.
  • Quality: The quality value according to the formula referenced above. It is the sum of the weighted qualities of the individual clusters, whereby the quality of a single cluster is calculated as (1 - normalized_entropy). The domain of the quality value is [0,1].
The table at the bottom of the view provides statistics on cluster size, cluster entropy, normalized cluster entropy and quality. The entropy of a cluster is based on the reference clustering (given by the selected reference column), and the normalized entropy is this value scaled to the interval [0, 1]. More precisely, it is the entropy divided by log2(number of different clusters in the reference set). The quality value is only available in the last row (showing the overall statistics).
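The statistics described above can be sketched in a few lines of plain Python. This is only an illustration derived from the description on this page, not the node's actual implementation: the function name `entropy_scores`, the dictionary keys, the use of base-2 logarithms for the per-cluster entropy, and the handling of a reference set with a single cluster are all assumptions.

```python
from collections import Counter, defaultdict
from math import log2


def entropy_scores(reference, clustering):
    """Return (per_cluster, overall) statistics for two parallel label lists.

    Hypothetical helper: mirrors the textual description, not the node's code.
    """
    if len(reference) != len(clustering):
        raise ValueError("label lists must have the same length")
    total = len(reference)
    n_ref = len(set(reference))
    # Normalizer: log2(number of different reference clusters).
    # Assumption: with fewer than two reference clusters we divide by 1.
    norm = log2(n_ref) if n_ref > 1 else 1.0

    # Count how often each reference label occurs inside each found cluster.
    members = defaultdict(Counter)
    for ref, clu in zip(reference, clustering):
        members[clu][ref] += 1

    per_cluster = {}
    overall_entropy = 0.0
    overall_quality = 0.0
    for clu, counts in members.items():
        size = sum(counts.values())
        # Entropy of the reference labels within this cluster (not normalized).
        ent = -sum((c / size) * log2(c / size) for c in counts.values())
        norm_ent = ent / norm
        per_cluster[clu] = {
            "size": size,
            "entropy": ent,
            "normalized_entropy": norm_ent,
        }
        # Overall values are weighted by the relative cluster size.
        weight = size / total
        overall_entropy += weight * ent
        overall_quality += weight * (1.0 - norm_ent)
    return per_cluster, {"entropy": overall_entropy, "quality": overall_quality}


if __name__ == "__main__":
    ref = ["a", "a", "b", "b", "c", "c"]
    found = [0, 0, 0, 1, 1, 1]
    clusters, overall = entropy_scores(ref, found)
    for cluster_id, stats in clusters.items():
        print(cluster_id, stats)
    print(overall)
```

With three or more reference clusters, the entropy of a badly mixed cluster can exceed 1, which is why the sketch reports the unnormalized entropy separately from the normalized value that feeds the quality score.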


To use this node in KNIME, install the KNIME Extension for Apache Spark.

