Create Local Big Data Environment

Creates a fully functional local big data environment including Apache Hive, Apache Spark and HDFS.

The Spark WebUI of the created local Spark context is available via the Spark context outport view. Simply click the Click here to open link and the Spark WebUI opens in the internal web browser.

Note: Executing this node creates a new Spark context only if no local Spark context with the same Context name currently exists. Resetting the node does not destroy the context. Whether closing the KNIME workflow destroys the context depends on the configured Action to perform on dispose. Spark contexts created by this node can be shared between KNIME workflows.

Options

Settings

Context name
The unique name of the context. Only one Spark context will be created when you execute several Create Local Big Data Environment nodes with the same context name.
Number of threads
The number of threads the local Spark runtime can use.
On dispose
Decides what happens with the Spark context when the workflow or KNIME is closed.
  • Destroy Spark context: Will destroy the Spark context and free up all allocated resources.
  • Delete Spark DataFrames: Will delete all Spark DataFrames but keep the Spark context with all the allocated resources open.
  • Do nothing: Leaves the Spark context and all created Spark DataFrames as is.
SQL Support
  • Spark SQL only: The Spark SQL node will only support Spark SQL syntax. The Hive connection port will be disabled.
  • HiveQL: The Spark SQL node will support HiveQL syntax. The Hive connection port will be disabled.
  • HiveQL and provide JDBC connection: The Spark SQL node will support HiveQL syntax. The Hive connection port will be enabled, which allows you to also work with a local Hive instance using the KNIME database nodes.
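The difference between plain Spark SQL and HiveQL support is roughly analogous to creating a Spark session with or without Hive support. The following PySpark sketch is illustrative only (the table name is made up) and does not reflect how the node itself is implemented:

```python
from pyspark.sql import SparkSession

# With Hive support, SQL statements go through the Hive metastore,
# so HiveQL constructs and persistent tables become available.
spark = (
    SparkSession.builder
    .master("local[*]")
    .enableHiveSupport()   # omit this line for a plain Spark SQL session
    .getOrCreate()
)

# A table registered in the Hive metastore is also what a JDBC client,
# e.g. the KNIME database nodes on the Hive outport, can see.
spark.sql("CREATE TABLE IF NOT EXISTS sales (item STRING, price DOUBLE)")
spark.sql("SHOW TABLES").show()
```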
Working directory
Specify the working directory of the resulting file system connection. The working directory is a local directory. Based on the working directory, downstream nodes can access files/folders using relative paths and do not have to specify full (absolute) paths (see the example after the following options).
  • Manual: Sets the working directory to the given (absolute) path.
  • Home directory: Sets the working directory to the home directory of the user executing this node, e.g. C:\Users\myuser.
  • Current workflow data area: Sets the working directory to the data area of the current workflow. The data area is a directory called data inside the directory of the current workflow.
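As a purely illustrative sketch (all paths are made up), this is how a relative path used by a downstream node resolves against the configured working directory:

```python
from pathlib import Path

# Hypothetical working directory, e.g. set via the "Manual" option.
working_dir = Path("/home/myuser/bigdata-workspace")

# A downstream node can then address files relative to that directory ...
relative = Path("input/sales.csv")

# ... which resolves to a full path on the local file system.
print(working_dir / relative)   # /home/myuser/bigdata-workspace/input/sales.csv
```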

Time

This tab allows you to set a time zone that is applied in two cases:

  • To set the Spark SQL session time zone, which is relevant in Spark SQL when parsing Strings such as '2020-03-27 08:00:00' into Timestamps (to_timestamp) and vice versa, as well as for datetime manipulations (see the sketch after this list).
  • In KNIME, when mapping between Spark Timestamps and the KNIME (legacy) Date and Time column type. Here, the specified time zone will be used to make the Date and Time value in KNIME equal to what Spark would display for the same Timestamp.
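The following PySpark sketch only illustrates the effect of the session time zone on to_timestamp; the time zone and values are examples, and the node sets this configuration for you:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp

# Hypothetical local session; in KNIME the node creates and configures it.
spark = SparkSession.builder.master("local[*]").getOrCreate()

# The session time zone governs how Strings are parsed into Timestamps.
spark.conf.set("spark.sql.session.timeZone", "Europe/Berlin")

df = spark.createDataFrame([("2020-03-27 08:00:00",)], ["ts_string"])

# The String is interpreted as 08:00 in Europe/Berlin (07:00 UTC internally);
# with a different session time zone, the resulting Timestamp would differ.
df.select(to_timestamp(col("ts_string")).alias("ts")).show(truncate=False)
```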

Do not set
Leaves the Spark SQL session time zone unchanged and does not align the display of KNIME (legacy) Date and Time columns.
Use default time zone from Spark cluster
Leaves the Spark SQL session time zone unchanged but aligns the display of KNIME (legacy) Date and Time columns based on the Spark SQL session time zone. This is the default.
Use fixed time zone
Allows you to specify a time zone that will be set as the Spark SQL session time zone. The same time zone will also be used to align the display of KNIME (legacy) Date and Time columns.
Fail on different cluster default time zone
Allows you to specify whether this node should fail when the cluster-side default time zone differs from the fixed time zone that was specified.

Advanced

Use custom Spark settings
Select this option to specify additional Spark settings. For more details, see the Custom Spark settings description.
Custom Spark settings
Allows you to pass arbitrary settings on to the Spark context. This is especially useful if you want to add additional JARs, e.g. to test your own UDFs (see the sketch below).
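As a rough illustration of the kind of properties that can be passed (the property names are standard Spark settings, while the values and the JAR path are made up), expressed here as an equivalent PySpark configuration:

```python
from pyspark import SparkConf

# Hypothetical custom settings; in the node dialog these are entered as
# property/value pairs rather than Python code.
conf = (
    SparkConf()
    # Ship an extra JAR, e.g. one containing your own UDFs.
    .set("spark.jars", "/path/to/my-udfs.jar")
    # Reduce shuffle partitions for a small local context.
    .set("spark.sql.shuffle.partitions", "8")
)
```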
Hide warning about an existing local Spark context
Enable this option to suppress a warning message shown when the Spark context to be created by this node already exists. For further details see the Context name option.
Use custom Hive data folder (Metastore DB & Warehouse)
If selected, the Hive table definitions and data files are stored in the specified location and will also be available after a KNIME restart. If not selected, all Hive-related information is stored in a temporary location that will be deleted when the local Spark context is destroyed.
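The node manages these locations internally; the sketch below only illustrates, with made-up paths, the kind of Spark/Hive settings that a persistent metastore DB and warehouse folder correspond to:

```python
from pyspark.sql import SparkSession

# Hypothetical custom Hive data folder.
hive_data = "/home/myuser/local-hive"

spark = (
    SparkSession.builder
    .master("local[*]")
    # Where Hive-managed table data (the warehouse) is stored.
    .config("spark.sql.warehouse.dir", f"{hive_data}/warehouse")
    # Where the embedded Derby metastore DB (table definitions) lives.
    .config(
        "spark.hadoop.javax.jdo.option.ConnectionURL",
        f"jdbc:derby:;databaseName={hive_data}/metastore_db;create=true",
    )
    .enableHiveSupport()
    .getOrCreate()
)
```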

Input Ports

This node has no input ports

Output Ports

  • JDBC connection to a local Hive instance. This port can be connected to the KNIME database nodes.
  • File System connection that points to the local file system. This port can be connected, for example, to the Spark nodes that read/write files.
  • Local Spark context that can be connected to all Spark nodes.

Views

This node has no views
