Challenge 37 - Deduplicate Text - Solution

You are asked to read Swedish text from a PDF using the Tika Parser. You then notice that much of the text is duplicated, which could be caused by an encoding issue in the PDF itself. Consequently, you decide to deduplicate the text. In this challenge, do your best to remove the excess duplicated text using as few nodes as possible. In cases like this, the goal is usually not perfect removal, but a cost-effective approach that eliminates a large share of the duplication. Hint: Our solution consists of 5 nodes, but the 5th node may be unnecessary depending on your workflow.
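Outside KNIME, the same "good enough" idea can be sketched in a few lines of Python. This is only an illustration and not the workflow's actual node sequence: it assumes the duplication shows up as repeated lines in the extracted text, and the function name `deduplicate_lines` and the sample string are hypothetical. It keeps the first occurrence of each line and ignores differences in case and whitespace.

```python
def deduplicate_lines(text: str) -> str:
    """Keep the first occurrence of each non-empty line, preserving order.

    Assumption: duplicates appear as whole repeated lines. Lines are compared
    after collapsing whitespace and case-folding, so near-identical copies
    are treated as the same line.
    """
    seen = set()
    kept = []
    for line in text.splitlines():
        # Normalize: collapse runs of whitespace and ignore case.
        key = " ".join(line.split()).casefold()
        if not key:
            continue  # skip blank lines
        if key in seen:
            continue  # already kept an equivalent line
        seen.add(key)
        kept.append(line.strip())
    return "\n".join(kept)


# Hypothetical sample standing in for text extracted from the PDF.
sample = (
    "Hej världen\n"
    "Hej världen\n"
    "Detta är ett test\n"
    "hej  världen\n"
    "Detta är ett test"
)
print(deduplicate_lines(sample))
```

This trades precision for cost: a single pass with a set removes exact and near-exact repeats, but it will not catch duplicated fragments that differ in wording or are split across lines.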

Created by: victorpalacios

Created at: 2022-08-23

On NodePit since: 2024-03-05

Last update: 2025-08-22

Created with KNIME version: v5.2.1

Tags: text processing, tika parser, ocr, justknimeit, justknimeit-37, easy, data wrangling, data aggregation
