Challenge 37 - Text Deduplication

Description: You are asked to read Swedish textual data from a PDF using the Tika Parser. You then notice that much of the text is duplicated, which could be caused by an encoding issue in the PDF itself. Consequently, you decide to deduplicate the text. In this challenge, do your best to remove excessive amounts of duplicated text using as few nodes as possible. In cases like this, you are usually not aiming for perfect removal of text, but for a cost-effective approach that eliminates a large chunk of the duplication.

Hint: Our solution consists of 5 nodes, but the 5th node may be unnecessary depending on your workflow.

Workflow: Challenge 37: Text Deduplication

Read data
Tika Parser: Read PDF

Deduplicate data
Cell Splitter: Split each row by "\n"
Transpose: Columns to row
Rule-based Row Filter: Apply various filters (missing, spaces, descriptions)
Row Filter: Apply another spaces filter
GroupBy: Eliminate duplications
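For readers who want to see the same idea outside KNIME, the deduplication steps named in the workflow above (split by "\n", filter out missing and whitespace-only rows, then eliminate duplicates) can be sketched in plain Python. This is only an illustration under assumptions, not the workflow's actual implementation: the function name `deduplicate_lines` is hypothetical, the text is assumed to be already extracted from the PDF, the "descriptions" filter is omitted because its rule is not stated here, and keeping the first occurrence of each line in its original order is a choice made for this sketch.

```python
def deduplicate_lines(text: str) -> list[str]:
    """Split text into lines, drop blank lines, and keep each line once.

    Hypothetical helper for illustration; it mirrors the split / filter /
    group steps of the workflow but is not the KNIME implementation.
    """
    seen = set()      # lines already emitted
    unique = []       # result, in first-occurrence order
    for line in text.split("\n"):      # analogous to splitting each row by "\n"
        stripped = line.strip()        # ignore leading/trailing spaces
        if not stripped:               # analogous to the missing/spaces filters
            continue
        if stripped in seen:           # analogous to grouping identical rows
            continue
        seen.add(stripped)
        unique.append(stripped)
    return unique


if __name__ == "__main__":
    # Made-up sample text, not taken from the challenge PDF.
    sample = "Hej världen\n\n   \nHej världen\nGod morgon\nHej världen \nGod morgon"
    print(deduplicate_lines(sample))
```

Like the workflow, this only removes exact duplicates after trimming whitespace; near-duplicates such as lines differing by one character would survive, which matches the stated goal of a cheap approach rather than perfect removal.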
