0 ×

Spark Pivot

KNIME Extension for Apache Spark core infrastructure version 2.4.0.v201811301556 by KNIME AG, Zurich, Switzerland

Performs a pivoting on the given Spark DataFrame/RDD using a selected number of columns for grouping and one column for pivoting. Each combination of values in the grouping columns will result into an output row. Each combination of pivot values and aggregations becomes a new output column.

The aggregations to perform can be specified (a) by selecting the columns directly in the "Manual Aggregation" tab, and (b) by a column name search pattern or regular expression in the "Pattern Based Aggregation" tab, and (c) by column type in the "Type Based Aggregation" tab. Each input column is only considered once, i.e. columns that are added directly on the "Manual Aggregation" tab are ignored even if their name matches a search pattern on the "Pattern Based Aggregation" tab or their type matches a type on the "Type Based Aggregation" tab. The same holds for columns that are added based on a search pattern. They are ignored even if they match a criterion that has been defined in the "Type Based Aggregation" tab.

A detailed description of the available aggregation methods can be found on the 'Description' tab in the node dialog. Further information can be also found on the Spark documentation and the Spark API documentation.

This node requires at least Apache Spark 2.0.

Options

Groups

Group settings
Select one or more column(s) according to which the group(s) is/are created.

Advanced settings

Column naming
The names of the resulting pivot/aggregation columns depend on the selected naming schema. The name of each such column consists of a value from the pivot column, a separator ("+") and then the name of an aggregation, which is chosen according to one of the following schemata:
  • Keep original name(s): Keeps the original names of the input columns. With this option, you can use each input column only for one aggregation to prevent duplicate column names.
  • Aggregation method (column name): Uses the aggregation method first and appends the name of the input column in brackets.
  • Column name (aggregation method): Uses the the name of the input column first and appends the aggregation method in brackets.
Add COUNT(*)
Tick this option to perform the COUNT(*) aggregation in addition to the selected aggregations.
column name
The display of the COUNT(*) aggregation. Only enabled if the "Add COUNT(*)" option is selected.

Pivot

The "Pivot" tab allows to transpose the values of one input column into individual output columns. To pivot over multiple columns, you can use the Spark SQL with the concat() function before pivoting.

Pivot column
The column containing the values to transpose.
Pivot values
  • Use all values: If set, Spark will automatically determine all distinct values in the chosen pivot column and sort them. Selecting this option implies that the ingoing Spark DataFrame/RDD will be materialized, which may take some time.
  • Use values from data table: This option is set, if a data table is connected to the node's optional input port. In this case, the pivot values are taken from a column in the data table. This is useful if (a) the pivot values of interest need to be dynamically specified, or (c) you want to avoid materialization of the ingoing DataFrame/RDD to speed up pivoting performance.
  • Manually specify values: If set, pivot values can be manually specified. This is useful if (a) you are only interested in a small subset of the pivot values, or (b) the resulting columns should be ordered in a specific way, or (c) you want to avoid materialization of the ingoing DataFrame/RDD to speed up pivoting performance.
Limit number of values
This option can be set to only take the top-k pivot values. This is useful to avoid excessive amounts of columns in the resulting data. This options is available when Use all values or Use values from data table is selected.
Ignore missing values
Ignore rows containing missing values in pivot column. This option is available when Use all values or Use values from data table is selected.
Fail, if pivot column in DataFrame/RDD contains different values
Fail execution, if the specified pivot values do not contain all values present the Spark DataFrame/RDD. The failure does not occur when the Spark Pivot node executes, but when the resulting Spark DataFrame/RDD is materialized. This happens usually during one of the subsequent nodes. This option is available when Use values from data table or Manually specify values is selected.

Manual Aggregation

In the "Manual Aggregation" tab you can select one or more columns for aggregation.

Aggregation settings
Select one or more column(s) for aggregation from the available columns list. Change the aggregation method in the Aggregation column of the table. You can add the same column multiple times. In order to change the aggregation method of more than one column select all columns to change, open the context menu with a right mouse click and select the aggregation method to use.
Parameter
The parameter column shows an "Edit" button for all aggregation operators that require additional information. Clicking on the "Edit" button opens the parameter dialog which allows changing the operator specific settings.

Pattern Based Aggregation

In the "Pattern Based Aggregation" tab you can assign aggregation methods to columns based on a search pattern. The pattern can be either a string with wildcards or a regular expression. Columns where the name matches the pattern but where the data type is not compatible with the selected aggregation method are ignored. Only columns that have not been selected as group column or that have not been selected as aggregation column on the "Manual Aggregation" tab are considered.

Aggregation settings
Use the "Add" button to add a new row with a search pattern to the aggregation settings. The search pattern can either be a string with wildcards or a regular expression. Supported wildcards are * (matches any number of characters) and ? (matches one character) e.g. KNI* would match all strings that start with KNI such as KNIME whereas KNI? would match only strings that start with KNI followed by a fourth character. Double click the "Search pattern" cell to edit the pattern. The cell is colored in red if the pattern is invalid.
RegEx
Tick this option if the search pattern is a regular expression otherwise it is treated as string with wildcards ('*' and '?').
Parameter
The parameter column shows an "Edit" button for all aggregation operators that require additional information. Clicking on the "Edit" button opens the parameter dialog which allows changing the operator specific settings.

Type Based Aggregation

The "Type Based Aggregation" tab allows to select an aggregation method for all columns of a certain data type e.g. to compute the mean for all numerical columns (DoubleCell). Only columns that have not been handled by the other tabs e.g. group, column based and pattern based are considered. The data type list to choose from contains basic types e.g String, Double, etc. and all data types the current input table contains.

Aggregation Settings
Select one or more data type from the available type list. Change the aggregation method in the Aggregation column of the table. You can add the same data type multiple times. The list contains standard types e.g. Double, String etc. and all types of the input table.
Parameter
The parameter column shows an "Edit" button for all aggregation operators that require additional information. Clicking on the "Edit" button opens the parameter dialog which allows changing the operator specific settings.

Input Ports

Spark DataFrame/RDD to apply pivoting on.
KNIME table with pivot values (optional).

Output Ports

DataFrame/RDD with pivoted data.

Update Site

To use this node in KNIME, install KNIME Extension for Apache Spark core infrastructure from the following update site:

Wait a sec! You want to explore and install nodes even faster? We highly recommend our NodePit for KNIME extension for your KNIME Analytics Platform.