Item Set Finder (Borgelt)

The node provides different algorithms to searches for frequent items in a list of item sets. The integrated algorithms are:
  1. Apriori (Agrawal et al. 1993)
  2. FPgrowth (frequent pattern growth, Han et al 2000)
  3. RElim (recursive elimination)
  4. SaM (Split and Merge)
  5. JIM (Jaccard Item Set Mining)
The algorithms have been implemented by Christian Borgelt. The following descriptions have been taken from his homepage.

Apriori: The apriori algorithm (Agrawal et al. 1993) carries out a breadth first search on the subset lattice and determines the support of item sets by subset tests. This is a pretty fast implementation that uses a prefix tree to organize the counters for the item sets. The census data set may be used to test the program.

FPgrowth: The FPgrowth algorithm (frequent pattern growth, Han et al 2000) represents the transaction database as a prefix tree which is enhanced with pointers that organize the nodes into lists referring to the same item. The search is carried out by projecting the prefix tree, working recursively on the result, and pruning the original tree. Since version 1.2 this implementation also contains the alpha-pruning of the FP-Bonsai techniques.

RElim: The RElim algorithm (recursive elimination) is inspired by the FP-growth algorithm, but does its work without prefix trees or any other complicated data structures. The main strength of this algorithm is not its speed (although it is not slow, but even outperforms apriori and eclat on some data sets), but the simplicity of its structure. Basically all the work is done in one recursive function of fairly few lines of code.

SaM: The split and merge algorithm (Split and Merge) combines a depth-first traversal of the subset lattice with a horizontal transaction representation. The main strength of this algorithm is not its speed (although it is not slow, but even outperforms apriori and Eclat on some data sets), but the simplicity of its structure. Basically all the work is done in one recursive function of about fairly few lines of code. In addition, it only uses a simple array as the only data structure.

JIM: Finds Jaccard item sets with an extension of the Eclat algorithm. In analogy to frequent item set mining, where one tries to find item sets the support of which exceeds a user-specified threshold (minimum support) in a database of transactions, a Jaccard item set is an item set for which the (generalized) Jaccard index of its item covers exceeds a user-specified threshold. This measure yields a much better assessment of the association strength of the items than simple support. Since the (generalized) Jaccard index is, like the support, also anti-monotone, the same basic approach can be used for the search, provided it is extended to compute the denominator of the Jaccard index.

Dice: Finds Dice item sets with an extension of the Eclat algorithm. In analogy to frequent item set mining, where one tries to find item sets the support of which exceeds a user-specified threshold (minimum support) in a database of transactions, a Dice item set is an item set for which the Dice index of its item covers exceeds a user-specified threshold.

Tanimoto: Finds Tanimoto item sets with an extension of the Eclat algorithm. In analogy to frequent item set mining, where one tries to find item sets the support of which exceeds a user-specified threshold (minimum support) in a database of transactions, a Tanimoto item set is an item set for which the Tanimoto index of its item covers exceeds a user-specified threshold.

Options

Item column
The collection column that contains the item set to mine.
Algorithm
The algorithm to use. For details about the different algorithms see the description above.
Target type
The type of item set to produce:
  1. Frequent: All frequent sets.
  2. Closed: A set that is frequent but none of the superset has the same support.
  3. Maximal: A set that is frequent with no frequent supersets.
Minimum set size
The minimum size of a set.
Limit set size
Whether to limit the maximum size of a set. It is recommended to limit the maximum set size in order to keep memory consumption to a minimum.
Maximum set size
The maximum size of a set.
Minimum support
The minimum support. Note that the smaller this value, the more itemsets are considered by the algorithms. For some datasets this can cause a very high memory consumption. If you find yourself in such a scenario, it is advised to increase the minimum support or at least limit the set size.
Threshold
This is an optional parameter that is only enabled for certain algorithm. The algorithm that support a threshold it including its description are:
  1. Apriori: Minimum confidence of a rule (default 80%).
  2. RElim: Minimum weight of an item set (default: 10%).
  3. SaM: Minimum weight of an item set (default: 10%).
  4. JIM: Minimum Jaccard index of an item set (default: 10%).
Sort item set
The items in the frequent item set are sorted in ascending order if this option is selected.

Advanced Settings

Additional parameter
Additional parameter that should be passed to the algorithm separated by a space. For details about the available parameters see the web page of the corresponding algorithm on Christian Borgelts web page.

Input Ports

Icon
Transaction list

Output Ports

Icon
Item Sets

Popular Predecessors

Views

This node has no views

Workflows

Links

Developers

You want to see the source code for this node? Click the following button and we’ll use our super-powers to find it for you.