Spark Association Rule Learner

This rule learner* uses Spark MLlib to compute frequent item sets and then extract association rules from the given input data. Association rules describe relations between items in a set of transactions. For example, if customers frequently buy onions, potatoes and meat in the same transaction, a new customer who buys onions and potatoes is likely to also buy meat. This can be written as an association rule with onions and potatoes as the antecedent and meat as the consequent.

Transactions/item sets are represented as collection columns. The Spark GroupBy or Spark SQL nodes are recommended to create collection columns in Spark.
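
To illustrate what such a collection column holds, the following plain-Python sketch groups a flat table of (transaction id, item) rows into one collection per transaction. It does not use Spark; the data and the name group_items are hypothetical. In Spark SQL, an aggregation such as collect_set grouped by the transaction id plays a comparable role. A set is used here because Spark's FP-growth expects the items within one transaction to be unique.

```python
from collections import defaultdict

# Hypothetical flat input: one (transaction_id, item) pair per row.
rows = [
    (1, "onions"), (1, "potatoes"), (1, "meat"),
    (2, "onions"), (2, "potatoes"), (2, "onions"),  # duplicate item on purpose
    (3, "milk"),
]

def group_items(rows):
    """Collect the distinct items of each transaction into one collection."""
    grouped = defaultdict(set)
    for transaction_id, item in rows:
        grouped[transaction_id].add(item)
    # Sort the items so that the result is deterministic.
    return {tid: sorted(items) for tid, items in grouped.items()}

transactions = group_items(rows)
```

Each value of transactions corresponds to one cell of the collection column that this node reads.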

Frequent item sets are computed using the FP-growth implementation provided by Spark MLlib. The input data must contain a collection column in which each cell holds the items of a transaction. Rows with missing values in the selected item column are ignored. FP-growth encodes the transactions in a prefix tree (FP-tree) and then extracts the frequent item sets from this FP-tree. This approach avoids the usually expensive generation of explicit candidate sets used in Apriori-like algorithms designed for the same purpose. More information about the FP-growth algorithm can be found in Han et al., Mining frequent patterns without candidate generation. Spark implements Parallel FP-growth (PFP), described in Li et al., PFP: Parallel FP-Growth for Query Recommendation.
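
The encoding step can be sketched in a few lines of plain Python. This is a minimal single-machine illustration, not Spark's parallel implementation: it only builds the FP-tree and omits the mining step that extracts frequent item sets from it. The names Node and build_fp_tree, the sample data, and the alphabetical tie-breaking are assumptions made for this example.

```python
from collections import Counter

class Node:
    """One FP-tree node: an item, a count, and child nodes keyed by item."""
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Pass 1: count in how many transactions each item occurs.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item for item, c in counts.items() if c >= min_count}

    root = Node()
    # Pass 2: insert each transaction, keeping only frequent items,
    # ordered by descending count (ties broken alphabetically).
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root

transactions = [
    ["onions", "potatoes", "meat"],
    ["onions", "potatoes"],
    ["onions", "milk"],
    ["potatoes", "meat"],
]
tree = build_fp_tree(transactions, min_count=2)
```

Transactions that share a prefix of frequent items share a path in the tree, so the three transactions containing onions all pass through the same onions node. The infrequent item milk is never inserted.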

Association rules are computed using Spark MLlib, using the previously computed frequent item sets. Each association rule maps an item set (antecedent) to a single item (consequent). The Spark Association Rule Apply node can be used to apply the rules produced by this node.
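
As a rough illustration of this step, the sketch below derives single-consequent rules from a table of frequent item sets and their transaction counts. It is a plain-Python approximation of the idea and not the MLlib code; generate_rules and the sample counts are hypothetical.

```python
def generate_rules(freq_itemsets, min_confidence):
    """Derive rules with a single-item consequent from frequent item sets.

    freq_itemsets maps frozenset(items) -> number of transactions
    that contain all of those items.
    """
    rules = []
    for itemset, count in freq_itemsets.items():
        if len(itemset) < 2:
            continue  # a rule needs an antecedent and a consequent
        for consequent in itemset:
            antecedent = itemset - {consequent}
            # Subsets of a frequent item set are frequent as well,
            # so the antecedent is normally present in the table.
            antecedent_count = freq_itemsets.get(antecedent)
            if antecedent_count is None:
                continue
            confidence = count / antecedent_count
            if confidence >= min_confidence:
                rules.append((sorted(antecedent), consequent, confidence))
    return sorted(rules)

# Hypothetical frequent item sets with their transaction counts.
freq_itemsets = {
    frozenset(["onions"]): 3,
    frozenset(["potatoes"]): 3,
    frozenset(["meat"]): 2,
    frozenset(["onions", "potatoes"]): 2,
    frozenset(["meat", "potatoes"]): 2,
}

strict_rules = generate_rules(freq_itemsets, min_confidence=0.8)
loose_rules = generate_rules(freq_itemsets, min_confidence=0.6)
```

With a minimum confidence of 0.8, only the rule meat => potatoes remains, because both transactions that contain meat also contain potatoes. Lowering the threshold to 0.6 admits the three rules with confidence 2/3 as well.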

See Association rule learning (Wikipedia) for general information.

This node requires at least Apache Spark 2.0.

(*) RULE LEARNER is a registered trademark of Minitab, LLC and is used with Minitab’s permission.

Options

Item Column: Collection column, where each cell holds the items of a transaction.
Minimum Support: The minimum support for an item set to be identified as frequent. For example, if an item set appears in 3 out of 5 transactions, it has a support of 3/5=0.6 (default: 0.3).
Number of partitions: Optional. The number of partitions used by the Parallel FP-growth algorithm to distribute the work (default: same as input data).
Minimum Confidence: The minimum confidence an association rule must reach to be included in the output (default: 0.8). Confidence indicates how often a rule has been found to be true. For example, if the item set A appears in 10 transactions and A and B co-occur in one of them, the confidence of the rule A => B is 1/10 = 0.1.
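
The support and confidence examples from the options above can be reproduced with a short plain-Python sketch. The helper names and the sample transactions are hypothetical and chosen only to match the numbers in the text.

```python
def count_containing(transactions, items):
    """Number of transactions that contain every item in `items`."""
    wanted = set(items)
    return sum(1 for t in transactions if wanted <= set(t))

def support(transactions, items):
    """Fraction of all transactions that contain the item set."""
    return count_containing(transactions, items) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Share of antecedent transactions that also contain the consequent."""
    n_antecedent = count_containing(transactions, antecedent)
    if n_antecedent == 0:
        raise ValueError("antecedent does not occur in any transaction")
    n_both = count_containing(transactions, set(antecedent) | {consequent})
    return n_both / n_antecedent

# Support: the item set {x, y} appears in 3 of 5 transactions.
support_data = [["x", "y"], ["x", "y", "z"], ["x", "y"], ["x"], ["z"]]
s = support(support_data, ["x", "y"])

# Confidence: A appears in 10 transactions, A and B together in one.
confidence_data = [["A", "B"]] + [["A"]] * 9
c = confidence(confidence_data, ["A"], "B")
```

Here s is 3/5 = 0.6, which passes the default minimum support of 0.3, while c is 1/10 = 0.1, so the rule A => B would be filtered out by the default minimum confidence of 0.8.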

Input Ports

Spark DataFrame with a collection column, where each cell holds the items of a transaction

Output Ports

Spark DataFrame with association rules
Spark DataFrame with frequent item sets

Views

This node has no views

Installation

To use this node in KNIME, install the extension KNIME Extension for Apache Spark (legacy) from the update site below, following our NodePit Product and Node Installation Guide:

v5.6

A zipped version of the software site can be downloaded here.

Plugin provider: KNIME AG, Zurich, Switzerland

Plugin version: 5.6.0.v202507151409

On NodePit since: 2025-08-15

Last update: 2025-08-16

KNIME versions: Since v3.6