justknimeit-37

Challenge 37: Text Deduplication
Level: Easy
Description: You are asked to read Swedish textual data from a PDF using the Tika Parser. You then notice that much of the text is duplicated, which could be an encoding issue with the PDF itself. Consequently, you decide to deduplicate the text. In this challenge, do your best to remove excessive amounts of duplicated text using as few nodes as possible. In most cases like this, you are not aiming for perfect removal of text, but for a cost-effective approach that eliminates a large chunk of the duplication.
Hint: Our solution consists of 5 nodes, but the 5th node may be unnecessary depending on your workflow.
Workflow nodes: Tika Parser (read PDF), Cell Splitter (split using \n), Unpivoting (transpose), Duplicate Row Filter (remove duplicates).
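For readers who want to see the idea outside KNIME, the steps named above (split the extracted text on \n, turn the pieces into rows, drop duplicate rows) can be approximated in a few lines of Python. This is only a sketch of the same technique, not the workflow's actual implementation: the function name deduplicate_lines and the sample string are hypothetical, and the PDF-reading step is assumed to have already produced a plain string.

```python
def deduplicate_lines(text: str) -> str:
    """Keep only the first occurrence of each non-empty line, preserving order.

    Rough analogue of Cell Splitter (split on newline), Unpivoting (one piece
    per row), and Duplicate Row Filter (drop repeated rows).
    """
    seen = set()   # lines already emitted
    kept = []      # first occurrences, in original order
    for line in text.split("\n"):
        key = line.strip()
        if not key:
            continue  # skip blank lines left over from extraction
        if key in seen:
            continue  # duplicate of an earlier line: drop it
        seen.add(key)
        kept.append(key)
    return "\n".join(kept)


# Hypothetical sample standing in for text extracted from the PDF.
sample = "Hej världen\nHej världen\nDetta är en text\nHej världen\nDetta är en text"
result = deduplicate_lines(sample)
print(result)
```

Like the challenge itself, this aims for cheap rather than perfect removal: it only catches lines that are identical after trimming whitespace, so near-duplicates with small differences would survive.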