justKnimeit-37

Challenge 37: Text Deduplication

Description: You are asked to read Swedish textual data from a PDF using the Tika Parser. You then notice that much of the text is duplicated, which could be an encoding issue with the PDF itself. Consequently, you decide to deduplicate the text. In this challenge, do your best to remove excessive amounts of duplicated text using as few nodes as possible. In most cases like this, you are not aiming for perfect removal of text, but for a cost-effective approach that eliminates a large chunk of the duplication.

Workflow labels: read pdf, split new line, clear text

Workflow nodes: Tika Parser, Cell Splitter, Transpose, RowID, String Manipulation, Duplicate Row Filter
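
The workflow's three labelled steps (read the PDF, split on new lines, clear the text) followed by a duplicate filter can be approximated outside KNIME in a few lines. The following is a minimal Python sketch of that idea, not the workflow itself: it assumes the PDF text has already been extracted into a string, that the "clear text" step amounts to whitespace normalization, and that duplicates are exact repeated lines. The function name `deduplicate_lines` and the sample text are hypothetical.

```python
def deduplicate_lines(text: str) -> str:
    """Drop repeated lines from extracted text, keeping first occurrences in order."""
    seen = set()
    kept = []
    for raw_line in text.splitlines():          # "split new line"
        # "clear text" (assumed): trim and collapse runs of whitespace
        line = " ".join(raw_line.split())
        if not line:
            continue                            # skip empty lines
        if line in seen:
            continue                            # duplicate filter: keep first copy only
        seen.add(line)
        kept.append(line)
    return "\n".join(kept)


# Hypothetical sample standing in for text extracted from the PDF
sample = (
    "Hej världen\n"
    "Hej  världen \n"
    "\n"
    "Det här är en text\n"
    "Hej världen\n"
    "Det här är en text"
)
print(deduplicate_lines(sample))
# prints:
# Hej världen
# Det här är en text
```

As in the challenge, this aims for a cheap reduction rather than perfect removal: near-duplicates that differ by a character or a line break in a different position will survive.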
