
Challenge 37

justknimeit-37
Challenge 37: Text Deduplication
Level: Easy
Description: You are asked to read Swedish textual data from a PDF using the Tika Parser. You then notice that much of the text is duplicated, which could be an encoding issue with the PDF itself. Consequently, you decide to deduplicate the text. In this challenge, do your best to remove excessive amounts of duplicated text using as few nodes as possible. In most cases like this, you are not aiming for perfect removal of text, but for a cost-effective approach that eliminates a large chunk of the duplication.
Hint: Our solution consists of 5 nodes, but the 5th node may be unnecessary depending on your workflow.
Author: Victor Palacios
Data: PDF in Swedish in the KNIME Hub

Our solution will appear here next Tuesday. In the meantime, feel free to discuss your work on the KNIME Forum or on social media using the hashtag #justknimeit.

Remember to upload your solution with tag justknimeit-37 to your public space on the KNIME Hub. To increase the visibility of your solution, also post it to this challenge thread on the KNIME Forum.

Easy solution by Erik Pinter: Find and eliminate duplicates by regex. Could easily be adjusted / fine-tuned for varying PDFs ;-)

Tika Parser: utbildningsplan-VASIN-2.pdf
String Replacer: remove duplicates with regex
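
The regex idea behind this solution can be illustrated outside KNIME with a small Python sketch. This is not the expression used in the workflow (the page does not show it); the pattern, the function name `dedupe_repeats`, and the `min_len` threshold are all assumptions made for illustration. It collapses any chunk of text that is immediately repeated, optionally separated by whitespace, into a single copy.

```python
import re


def dedupe_repeats(text: str, min_len: int = 10) -> str:
    """Collapse immediately repeated chunks of at least `min_len` characters.

    Hypothetical helper: the pattern below is an assumption, not the regex
    from the KNIME workflow.
    """
    # Group 1 lazily captures a chunk of at least `min_len` characters;
    # (?:\s*\1)+ then requires one or more back-to-back copies of that chunk,
    # each optionally preceded by whitespace.
    pattern = re.compile(r"(.{%d,}?)(?:\s*\1)+" % min_len, flags=re.DOTALL)

    # Repeat until the text stops changing, so that repeats exposed by an
    # earlier pass are also collapsed.
    previous = None
    while previous != text:
        previous = text
        text = pattern.sub(r"\1", text)
    return text


if __name__ == "__main__":
    sample = (
        "Utbildningsplan för masterprogram "
        "Utbildningsplan för masterprogram i arkitektur"
    )
    print(dedupe_repeats(sample))
```

The `min_len` threshold keeps short, legitimate repetitions (such as a doubled word) untouched, in line with the challenge's goal of removing most of the duplication rather than all of it. Backreference patterns like this can be slow on very long texts, so the threshold and the pattern would likely need tuning per PDF.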
