Tika Parser

Apache Tika is a library used mainly to detect document types and to extract text and metadata from a wide range of file formats. Internally, Tika delegates parsing and type detection to existing document parsers and type-detection libraries, and exposes them through a single generic API that acts as a universal type detector and content extractor. For more information about Tika, see the Tika website.

This node parses any document type supported by Tika. The file types to parse can be selected in the configuration dialog, either by file extension or by MIME-type. The information to extract from each file (metadata and content) can also be selected in the dialog. Where possible, files embedded in the input files, such as e-mail attachments, can be extracted and stored in a specified directory. An authentication setting is provided for parsing encrypted files.

Options

Document directory: Specify the directory where the files are located.
Ignore hidden files: If checked, hidden files will not be considered for parsing.
Recursive: If checked, sub-directories inside the directory will be checked as well.
Choose which type to parse: Specify how to choose the file types that are to be parsed. There are two options, either through the file extensions or the MIME-types.
File Extension: If selected, the list of all file extensions that are supported by Tika will be shown.
MIME-Type: If selected, the list of all MIME-types that are supported by Tika will be shown. For more information about MIME-types and their extensions, see the Apache Tika documentation.
Metadata: The list of all available information that can be parsed from the files by Tika. For more details about the metadata, see the Apache Tika documentation.
Create error column: If checked, an additional error column will be created. This string column will contain any error messages that appear while parsing the files.
New error output column: The name of the new error column.
Extract attachments and embedded files: If checked, any files embedded in the input files will be extracted and stored in the output directory.
Output directory: Specify the directory where the extracted files should be stored. If the specified directory doesn't exist, the node will try to create it.
Parse encrypted files: If checked, the node will try to open any detected encrypted files using the given password. If the password is invalid, a warning will be given.
Enter password: Specify the password for any encrypted files in the directory. Note: this password will be used to open all encrypted files in the directory.
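
The file-selection options above (Document directory, Ignore hidden files, Recursive, and the choice between File Extension and MIME-Type) can be illustrated with a short standalone sketch. This is not the node's implementation: select_files and its parameters are hypothetical names, hidden files are assumed to be those whose name starts with a dot, and the MIME-type is guessed from the file name with Python's standard mimetypes module rather than by Tika.

```python
import mimetypes
import os


def select_files(directory, recursive=True, ignore_hidden=True,
                 extensions=None, mime_types=None):
    """Return sorted paths under `directory` that pass the given filters.

    Hypothetical helper: pass either `extensions` (e.g. {"pdf", "docx"})
    or `mime_types` (e.g. {"application/pdf"}), mirroring the two modes
    of "Choose which type to parse".
    """
    selected = []
    for root, dirs, files in os.walk(directory):
        if ignore_hidden:
            # Prune hidden sub-directories in place so os.walk skips them.
            dirs[:] = [d for d in dirs if not d.startswith(".")]
        if not recursive:
            # Emptying dirs stops os.walk from descending any further.
            dirs[:] = []
        for name in files:
            # Assumption: a leading dot marks a hidden file.
            if ignore_hidden and name.startswith("."):
                continue
            if extensions is not None:
                ext = os.path.splitext(name)[1].lstrip(".").lower()
                if ext not in extensions:
                    continue
            if mime_types is not None:
                # Guessed from the file name only; file contents are
                # not inspected in this sketch.
                guessed, _ = mimetypes.guess_type(name)
                if guessed not in mime_types:
                    continue
            selected.append(os.path.join(root, name))
    return sorted(selected)
```

For example, select_files("docs", extensions={"pdf"}) would list every non-hidden PDF below a hypothetical docs directory, while passing recursive=False restricts the search to the top level.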

Input Ports

This node has no input ports

Output Ports

An output table containing the parsed document data. The columns correspond to the entries selected in the Metadata list in the configuration dialog.
An output table containing the names of the input files that contain embedded files, together with the paths to the extracted files in the output directory.
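
As a rough illustration of how the first output table relates to the Metadata selection and the "Create error column" option, the sketch below builds one row per file, with one column per selected metadata entry plus an error column. It is a minimal sketch under stated assumptions, not the node's code: build_rows, the parse callable, and the default column name "Error" are all hypothetical, and it assumes that a file which fails to parse still produces a row.

```python
def build_rows(paths, parse, selected_metadata, error_column="Error"):
    """Build one row (a dict) per file, including an error column.

    `parse` is a hypothetical callable standing in for the real parser:
    it returns a dict of metadata for a path or raises an exception.
    """
    rows = []
    for path in paths:
        # Start with empty cells for every selected column.
        row = {name: None for name in selected_metadata}
        row[error_column] = None
        try:
            parsed = parse(path)
            for name in selected_metadata:
                row[name] = parsed.get(name)
        except Exception as exc:
            # Record the message in the error column instead of aborting.
            row[error_column] = str(exc)
        rows.append(row)
    return rows
```

Only the selected entries become columns, so metadata returned by the parser but not selected is dropped from the row.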

Views

This node has no views

Installation

To use this node in KNIME, install the KNIME Textprocessing extension from the update site below, following the NodePit Product and Node Installation Guide:

v5.5

A zipped version of the software site can be downloaded here.

Plugin provider: KNIME AG, Zurich, Switzerland

Plugin version: 5.5.0.v202412191419

On NodePit since: 2025-07-02

Last update: 2025-07-25

Tags: Streamable

KNIME versions: Since v3.6
