
Create Databricks Environment

KNIME Extension for Apache Spark core infrastructure version 4.2.0.v202007072005 by KNIME AG, Zurich, Switzerland

Creates a Databricks Environment connected to an existing Databricks cluster. See the AWS or Azure Databricks documentation for more information.

Note: To avoid an accidental cluster startup, this node creates dummy DB and Spark ports if loaded in an executed state from a stored workflow. Reset and execute the node to start the cluster and create a Spark execution context.

Cluster access control: KNIME uploads additional libraries to the cluster. If your cluster is secured with access control, this requires cluster-level manage permissions. See the Databricks documentation on how to set up these permissions.

Options

General

Spark version
The Spark version used by Databricks. If this is set incorrectly, creating the Spark context will fail.
Databricks URL
Full URL of the Databricks deployment, e.g. https://<account>.cloud.databricks.com on AWS or https://<region>.azuredatabricks.net on Azure.
Cluster ID
Unique identifier of a cluster in the Databricks workspace. See the AWS or Azure Databricks documentation for more information.
Workspace ID
Workspace ID for Databricks on Azure; leave blank on AWS. See the Azure Databricks documentation for more information.
Authentication
Workflow credentials, username and password, or tokens can be used for authentication. Databricks strongly recommends tokens. See the authentication section of the Databricks AWS or Azure documentation for more information about personal access tokens.
To use a token in workflow credentials, enter token as the username and the token itself as the password.
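Outside of KNIME, the same personal access token authenticates requests against the Databricks REST API as a Bearer header. A minimal sketch (the deployment URL, token, and cluster ID below are placeholders, and the request is only constructed, not sent):

```python
from urllib.request import Request

# Hypothetical placeholders -- substitute your own deployment values.
DATABRICKS_URL = "https://<account>.cloud.databricks.com"
TOKEN = "dapi-your-personal-access-token"
CLUSTER_ID = "0123-456789-abcde"

# The Databricks REST API expects the token in an Authorization: Bearer header.
req = Request(
    f"{DATABRICKS_URL}/api/2.0/clusters/get?cluster_id={CLUSTER_ID}",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
print(req.get_header("Authorization"))
```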

Advanced

Create Spark context
If enabled, an execution context is started on Databricks to run KNIME Spark jobs. If disabled, the Spark context port is disabled. Disabling the context can save resources in the driver process and is required if the cluster runs with Table Access Control.
Set staging area for Spark jobs
If enabled, you can specify a directory in the connected Databricks file system that will be used to transfer temporary files between KNIME and the Spark context. If no directory is set, a default directory in /tmp is chosen.
Terminate cluster on context destroy
If selected, the cluster is terminated when the node is reset, when a Destroy Spark Context node is executed on the context, or when the workflow or KNIME is closed. This releases resources, but all data cached inside the cluster is lost unless it has been saved to persistent storage such as DBFS.
Databricks connection and receive timeout
Timeouts for the REST client in seconds.
Job status polling interval
The frequency with which KNIME polls the status of a job in seconds.
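The polling interval can be pictured as a simple loop that checks the job status at a fixed interval until a terminal state is reached. A hypothetical sketch of that logic (get_job_status, the state names, and the helper are illustrative stand-ins, not KNIME APIs):

```python
import time

POLL_INTERVAL = 1.0  # seconds, mirrors the "Job status polling interval" option


def wait_for_job(get_job_status, interval=POLL_INTERVAL, timeout=60.0):
    """Poll get_job_status() every `interval` seconds until it returns a
    terminal state, or raise TimeoutError after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_job_status()
        if status in ("SUCCEEDED", "FAILED", "CANCELED"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish within the timeout")
        time.sleep(interval)


# Example with a fake status source that succeeds on the third poll.
statuses = iter(["RUNNING", "RUNNING", "SUCCEEDED"])
print(wait_for_job(lambda: next(statuses), interval=0.01))  # SUCCEEDED
```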

DB Port: Connection settings

Database Dialect
Choose the registered database dialect here.
Driver Name
Choose the registered database driver here. The node includes the Apache Hive driver. Proprietary drivers are also supported, but need to be downloaded and registered in the KNIME preferences under "KNIME -> Databases" with database type Databricks.
The node uses the proprietary driver by default if one is registered, and the Apache Hive driver otherwise.
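For orientation, JDBC URLs for Databricks with the Hive driver commonly follow the pattern sketched below. This is an illustrative assumption, not a definitive recipe: the exact httpPath and parameters depend on your deployment, and KNIME assembles the URL for you from the connection settings.

```python
# Hypothetical values -- substitute your own deployment details.
host = "<account>.cloud.databricks.com"
workspace_id = "0"          # "0" on AWS; the numeric workspace ID on Azure
cluster_id = "0123-456789-abcde"

# Typical Hive-driver URL shape for Databricks (illustrative only):
jdbc_url = (
    f"jdbc:hive2://{host}:443/default"
    f";transportMode=http;ssl=true"
    f";httpPath=sql/protocolv1/o/{workspace_id}/{cluster_id}"
)
print(jdbc_url)
```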

DB Port: JDBC Parameters

This tab allows you to define JDBC driver connection parameters. The value of a parameter can be a constant, a variable, a credential user, a credential password, or a KNIME URL.

DB Port: Advanced

This tab allows you to define KNIME framework properties such as connection handling, advanced SQL dialect settings or logging options. The available properties depend on the selected database type and driver.

DB Port: Input Type Mapping

This tab allows you to define rules to map from database types to KNIME types.

Mapping by Name
Columns that match the given name (or regular expression) and database type will be mapped to the specified KNIME type.
Mapping by Type
Columns that match the given database type will be mapped to the specified KNIME type.
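The name-based rules behave like regular-expression matches against column names, with the type-based rules as a fallback. A toy sketch of that matching logic (the rule table and type names below are made up for illustration and are not KNIME's actual types):

```python
import re

# Hypothetical name-based rules: (name pattern, database type, KNIME type).
rules = [
    (r"price_.*", "DECIMAL", "Number (double)"),
    (r".*_id",    "BIGINT",  "Number (long)"),
]


def map_column(name, db_type):
    """Return the KNIME type of the first rule whose name pattern and
    database type both match, else None (fall through to type rules)."""
    for pattern, rule_db_type, knime_type in rules:
        if rule_db_type == db_type and re.fullmatch(pattern, name):
            return knime_type
    return None


print(map_column("price_usd", "DECIMAL"))   # Number (double)
print(map_column("customer_id", "BIGINT"))  # Number (long)
```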

DB Port: Output Type Mapping

This tab allows you to define rules to map from KNIME types to database types.

Mapping by Name
Columns that match the given name (or regular expression) and KNIME type will be mapped to the specified database type.
Mapping by Type
Columns that match the given KNIME type will be mapped to the specified database type.

Output Ports

JDBC connection that can be connected to the KNIME database nodes.
DBFS connection that can be connected to the Spark nodes to read and write files.
Spark context that can be connected to all Spark nodes.

Installation

To use this node in KNIME, install KNIME Extension for Apache Spark from the following update site:

KNIME 4.2

A zipped version of the software site is available for download. Read our FAQs for instructions on how to install nodes from a zipped update site.

