
justknimeit-37 - Text Deduplication

Challenge 37: Text Deduplication
Level: Easy
Description: You are asked to read Swedish textual data from a PDF using the Tika Parser. You then notice that much of the text is duplicated, which could be an encoding issue with the PDF itself. Consequently, you decide to deduplicate the text. In this challenge, do your best to remove excessive amounts of duplicated text using as few nodes as possible. In most cases like this, you are not aiming for perfect removal of text, but for a cost-effective approach that eliminates a large chunk of the duplication.
Hint: Our solution consists of 5 nodes, but the 5th node may be unnecessary depending on your workflow.

Tika Parser: PDF reader
Cell Splitter: Splits new line into columns
Transpose: Transposes
Duplicate Row Filter: Removes row duplicates
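For readers who want to see the idea outside KNIME, here is a minimal Python sketch of the same approach: split the parsed text on newlines, then keep only the first occurrence of each line. It is an illustration under assumptions, not the workflow itself. The function name `deduplicate_lines` and the sample text are hypothetical, and the PDF parsing step (the Tika Parser in the workflow) is left out, so the input is assumed to be text that has already been extracted.

```python
def deduplicate_lines(text: str) -> str:
    """Keep the first occurrence of each non-empty line, preserving order."""
    seen = set()   # lines already emitted
    kept = []      # unique lines in their original order
    for line in text.splitlines():
        key = line.strip()  # ignore leading/trailing whitespace when comparing
        if not key or key in seen:
            continue        # skip blank lines and repeats
        seen.add(key)
        kept.append(key)
    return "\n".join(kept)


# Hypothetical sample standing in for text extracted from the PDF.
sample = (
    "Hej världen\n"
    "Hej världen\n"
    "Detta är en text\n"
    "Hej världen\n"
    "Detta är en text\n"
)
result = deduplicate_lines(sample)
print(result)
```

Like the Duplicate Row Filter step, this removes only exact line repeats, which fits the challenge's stated goal of a cheap approach that eliminates most of the duplication rather than all of it.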
