
jKi-37

Challenge 37: Text Deduplication

Level: Easy

Description: You are asked to read Swedish textual data from a PDF using the Tika Parser. You then notice that much of the text is duplicated, which could be an encoding issue with the PDF itself. Consequently, you decide to deduplicate the text. In this challenge, do your best to remove excessive amounts of duplicated text using as few nodes as possible. In most cases like this, you are not aiming for perfect removal of text, but for a cost-effective approach that eliminates a large chunk of the duplication.

Hint: Our solution consists of 5 nodes, but the 5th node may be unnecessary depending on your workflow.

Workflow annotations: Node 1, Node 2, Node 3, Node 4, Node 5

Nodes used: Tika Parser, Cell Splitter, Transpose, Column Filter, Duplicate Row Filter
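The idea behind the challenge (split the parsed text into pieces, then keep only the first occurrence of each piece) can be sketched outside KNIME as follows. This is a rough Python analogue, not the KNIME workflow itself: the function name `deduplicate_text`, the newline delimiter, and the sample string are all assumptions made for illustration, and the real PDF may need a different delimiter.

```python
def deduplicate_text(text: str, delimiter: str = "\n") -> str:
    """Drop repeated pieces of text, keeping the first occurrence of each.

    Hypothetical helper: roughly mirrors splitting a cell on a delimiter
    and then filtering duplicate rows. The delimiter is an assumption.
    """
    seen = set()   # pieces already emitted
    kept = []      # unique pieces, in original order
    for piece in text.split(delimiter):
        key = piece.strip()
        if not key:
            continue  # skip empty pieces left over from the split
        if key in seen:
            continue  # duplicate: already kept once
        seen.add(key)
        kept.append(key)
    return delimiter.join(kept)


# Made-up sample text with repeated lines (not taken from the challenge PDF).
sample = (
    "Hej världen\n"
    "Hej världen\n"
    "Detta är ett test\n"
    "Hej världen\n"
    "Detta är ett test"
)
result = deduplicate_text(sample)
print(result)
```

Like the challenge itself, this aims for a cheap reduction rather than perfect removal: pieces that differ only slightly (for example in punctuation or casing) are treated as distinct and survive.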
