Create Spark Context (Livy)

Creates a new Spark context via Apache Livy.

This node requires access to a remote file system such as HDFS/webHDFS/httpFS or Amazon S3/Azure Blob Store/Google Cloud Storage in order to exchange temporary files between KNIME and the Spark context (running on the cluster).

Note: Executing this node always creates a new Spark context. Resetting the node or closing the KNIME workflow will destroy the Spark context. Spark contexts created by this node cannot be shared between KNIME workflows.

Options

General

Spark version
The Spark version used by Livy. If this is set incorrectly, creating the Spark context will fail.
Livy URL
The URL of Livy, including protocol and port, e.g. http://localhost:8998.
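For illustration only, the following is a minimal sketch of how a Spark context can be created against such a Livy endpoint using Livy's REST API. It is not the node's internal client; the URL and the absence of authentication are assumptions.

  import time
  import requests  # any HTTP client works; requests is assumed to be installed

  LIVY_URL = "http://localhost:8998"  # protocol and port, as configured above

  # Ask Livy to start a new Spark session (this is what "creating a Spark context" maps to).
  resp = requests.post(
      f"{LIVY_URL}/sessions",
      json={"kind": "spark"},
      headers={"X-Requested-By": "example"},  # required if Livy's CSRF protection is enabled
  )
  resp.raise_for_status()
  session_id = resp.json()["id"]

  # Poll until the session leaves the "starting" state.
  state = "starting"
  while state == "starting":
      time.sleep(2)
      state = requests.get(f"{LIVY_URL}/sessions/{session_id}").json()["state"]
  print(f"Livy session {session_id} is in state: {state}")
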
Authentication
Select
  • None, if Livy does not require any credentials.
  • Credentials, if Livy requires HTTP "Basic" authentication and a credentials variable should be used to determine username and password.
  • Username, if Livy does not require authentication, but the Spark context should run as a particular user.
  • Username & password, if Livy requires HTTP "Basic" authentication.
  • Kerberos, if Livy requires Kerberos authentication.
Spark executor resources
Select the "Override default Spark executor resources" option to manually set the resources for the Spark executors. If enabled you can specify the amount of memory and the number of cores for each executor.
In addition you can specify the Spark executor allocation strategy:
  • Default allocation uses the cluster default allocation strategy.
  • Fixed allocation allows you to specify a fixed number of Spark executors.
  • Dynamic allocation allows you to specify the minimum and maximum number of executors that Spark can use. Executors are allocated up to the maximum number as needed and released when no longer needed, down to the minimum number.
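As a rough illustration, these options correspond to executor settings in a Livy session request and to Spark's dynamic allocation properties. The values below are illustrative, and the exact keys the node sets internally may differ.

  # Fixed allocation: a fixed number of executors with the given memory and cores.
  fixed_allocation = {
      "executorMemory": "4g",  # memory per executor
      "executorCores": 2,      # cores per executor
      "numExecutors": 5,       # fixed number of executors
  }

  # Dynamic allocation: Spark scales the number of executors between a minimum and a maximum.
  dynamic_allocation = {
      "executorMemory": "4g",
      "executorCores": 2,
      "conf": {
          "spark.dynamicAllocation.enabled": "true",
          "spark.dynamicAllocation.minExecutors": "1",
          "spark.dynamicAllocation.maxExecutors": "10",
          # Dynamic allocation also needs an external shuffle service on the cluster,
          # or shuffle tracking on Spark 3+.
          "spark.dynamicAllocation.shuffleTracking.enabled": "true",
      },
  }
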
Estimated resources
An estimate of the resources that the Spark context will allocate in your cluster. The calculation uses default settings for memory overheads etc. and is therefore only an estimate; the exact resources may differ depending on your specific cluster settings.
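As an illustration only, a rough estimate can be derived from the per-executor memory plus Spark's default memory overhead (10% of executor memory, at least 384 MiB); the node's own calculation may differ.

  # Rough estimate based on Spark's documented default overhead; illustrative only.
  def estimated_executor_memory_mb(executor_memory_mb: int, num_executors: int) -> int:
      overhead_mb = max(384, int(0.10 * executor_memory_mb))
      return num_executors * (executor_memory_mb + overhead_mb)

  # Example: 5 executors with 4 GiB (4096 MiB) each -> 5 * (4096 + 409) = 22525 MiB.
  print(estimated_executor_memory_mb(4096, 5))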

Advanced

Override default Spark driver resources
If enabled, you can specify the amount of memory and the number of cores that the Spark driver process will allocate.
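For illustration, these values correspond to the driver fields of a Livy session request; whether the node uses exactly these fields is an assumption.

  # Sketch: driver overrides as they appear in a Livy session request body.
  driver_overrides = {
      "driverMemory": "2g",  # memory for the Spark driver process
      "driverCores": 1,      # cores for the Spark driver process
  }
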
Set staging area for Spark jobs
If enabled, you can specify a directory in the connected remote file system that will be used to exchange temporary files between KNIME and the Spark context. If no directory is set, a default directory is chosen, e.g. the HDFS user home directory. However, if the remote file system is Amazon S3, Azure Blob Store or Google Cloud Storage, a staging directory must be provided.
Set custom Spark settings
If enabled, you can specify additional Spark settings. A tooltip is provided for keys where available. For further information about the Spark settings, refer to the Spark documentation. Invalid keys or values are highlighted with a red background. Custom keys are highlighted with a yellow background and should be prefixed with "spark." or "spark.hadoop.".
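For example, custom settings could look as follows. The keys are real Spark/Hadoop properties, but the values are purely illustrative and not recommendations.

  custom_spark_settings = {
      "spark.sql.shuffle.partitions": "200",            # plain "spark." key
      "spark.hadoop.fs.s3a.connection.maximum": "100",  # "spark.hadoop." keys are passed on to the Hadoop configuration
  }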

Time

This tab allows you to set a time zone that is applied in two cases:

  • To set the Spark SQL session time zone, which is relevant in Spark SQL when parsing Strings such as '2020-03-27 08:00:00' into Timestamps (to_timestamp) and vice versa, as well as in other datetime manipulations (see the sketch after this list).
  • In KNIME, when mapping between Spark Timestamps and the KNIME (legacy) Date and Time column type. Here, the specified time zone is used so that the Date and Time value in KNIME matches what Spark would display for the same Timestamp.
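
The following PySpark sketch illustrates the first case; the time zone and data values are arbitrary examples.

  from pyspark.sql import SparkSession, functions as F

  spark = SparkSession.builder.master("local[1]").getOrCreate()
  # The Spark SQL session time zone controls how strings are parsed into timestamps.
  spark.conf.set("spark.sql.session.timeZone", "Europe/Berlin")

  df = spark.createDataFrame([("2020-03-27 08:00:00",)], ["ts_string"])
  # '2020-03-27 08:00:00' is interpreted as 08:00 in Europe/Berlin; displaying the
  # resulting timestamp under a different session time zone shows a different wall-clock time.
  df.select(F.to_timestamp("ts_string").alias("ts")).show(truncate=False)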

Do not set
Leaves the Spark SQL session time zone unchanged and does not align the display of KNIME (legacy) Date and Time columns.
Use default time zone from Spark cluster
Leaves the Spark SQL session time zone unchanged but aligns the display of KNIME (legacy) Date and Time columns with the Spark SQL session time zone. This is the default.
Use fixed time zone
Allows you to specify a time zone that will be set as the Spark SQL session time zone. The same time zone will also be used to align the display of KNIME (legacy) Date and Time columns.
Fail on different cluster default time zone
Allows you to specify whether this node should fail when the cluster-side default time zone differs from the fixed time zone specified above.

Input Ports

A connection to a remote file system to exchange temporary files between KNIME and the Spark context (running on the cluster). Supported file systems are:
  • HDFS, webHDFS and httpFS. Note that KNIME must access the remote file system as the same user as Spark, otherwise Spark context creation fails. When authenticating with Kerberos against both HDFS/webHDFS/httpFS and Livy, the same user is usually used; otherwise, this must be ensured manually.
  • Amazon S3, Azure Blob Store and Google Cloud Storage (recommended when using Spark on Amazon EMR/Azure HDInsight/Google Cloud Dataproc). Note that for these file systems a staging area must be specified (see above).

Output Ports

Spark context.

Views

Spark log
Displays the log messages returned by the spark-submit process on the Livy server machine. This view does not provide the YARN container logs.
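For troubleshooting, a similar window of log lines can also be fetched directly from Livy's REST API, as sketched below; the URL and session ID are examples, and this is not necessarily identical to what the view shows.

  import requests

  LIVY_URL = "http://localhost:8998"  # example endpoint
  session_id = 0                      # ID of an existing Livy session

  # GET /sessions/{id}/log returns a window of the session's log lines.
  log = requests.get(f"{LIVY_URL}/sessions/{session_id}/log",
                     params={"from": 0, "size": 100}).json()
  print("\n".join(log["log"]))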
