
Parallel writing to Files

Showcase

Intention
Display the issue; intended to fail.

Expected Result
Should create a uniform table with three columns.

Problem
The file (table type) gets corrupted, or the data (CSV) does, since chunks interfere with each other. Writing to a table in parallel can corrupt the table file. Reading a corrupt table can result in an error like:
Execute failed: invalid entry size (expected 114508917 but got 76869 bytes)

ISSUE with CSV Writer
Despite being executed (state green),
despite lock files which prevent interference,
despite an implicit wait of two seconds ...
the CSV Writer does not seem to have finished!
Result: Corrupt Table Structure

!!! IMPORTANT !!!
There are a few issues with the CSV Writer as well as the List Files/Folders node.
Using the Wait node to wait for file deletion is buggy. See:
https://forum.knime.com/t/wait-node-wait-for-file-deletion-stuck-in-loop/83324
Using List Files/Folders is buggy too. See:
https://forum.knime.com/t/list-files-folders-should-return-no-result-but-fails-instead/83377

Python Works
Using a Python node to write to the CSV file creates uniform data.

Preamble - "With great power comes great responsibility"
KNIME offers a vast array of options. Its "low to no code" approach makes problem-solving accessible and democratizes its use. However, with that "power" also comes responsibility. The ease of use in KNIME can create a false sense of security because "it just works."

Intention
KNIME users might simply go with the flow of "it just works," without verifying results when advancing to more sophisticated solutions. Alternatively, while attempting to fully leverage KNIME's parallel processing capabilities, users might encounter data consistency issues, as demonstrated here.

Purpose - Scenario
Consider scraping a complete website. Some URLs might not be present in the XML sitemap, or no XML sitemap exists at all.
If scraping each page takes five seconds and the website has 10k pages, scraping it with one process would take:
50,000 seconds = 833 minutes = circa 14 hours
With ten parallel processes, each handling about 1,000 pages, that time would be reduced to roughly 83 minutes, a little more than one hour!
In order to scrape all URLs in the most efficient (parallel) way, it must be ensured that each thread (chunk) identifies new URLs but no parallel process scrapes already identified URLs. That requires one file that all parallel processes can access to write newly identified URLs into, "claiming scrape-authority".

Workflow canvas
Branches: Use Table Writer w/o any Lock File or other Logic; Use CSV Writer w/o any Lock File or other Logic; Use CSV Writer to write the default way; Use CSV Writer and GroupBy to write only one cell with all data; Use Python to write to CSV result file.
Files: test.csv, test.table, test-parallel-python.csv, test-parallel-column-to-one-cell.csv, test-parallel-default.csv.
Node labels: Row Index; First 15 rows; Add 10k rows; +1 and enforce Int Type; Delete test.csv; Delete test.table; Does NOT fail if schema has changed; Define Temp Prefix.
Node types: Rule Engine, Row Filter, Empty Table Creator, Add Empty Rows, Math Formula, CSV Reader, Table Reader, NoOp (Table), List Files/Folders, Delete Files/Folders, Delete Files/Folders (Table), Java Edit Variable.
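The idea of one shared file in which parallel chunks "claim scrape-authority" can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the code inside the workflow's Python node: the function names file_lock and claim_urls, the ".lock" suffix, and the three columns (url, worker id, timestamp) are hypothetical and chosen only for illustration. The lock is taken by atomically creating a lock file, so only one writer at a time reads the existing claims and appends new rows.

```python
import csv
import os
import time
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


@contextmanager
def file_lock(lock_path, timeout=10.0, poll=0.01):
    """Hold an exclusive lock by atomically creating lock_path (hypothetical helper)."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if the lock file already exists.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


def claim_urls(csv_path, worker_id, urls):
    """Append not-yet-claimed URLs to csv_path; return the URLs this worker claimed."""
    with file_lock(csv_path + ".lock"):
        claimed = set()
        if os.path.exists(csv_path):
            with open(csv_path, newline="", encoding="utf-8") as f:
                claimed = {row[0] for row in csv.reader(f) if row}
        # Drop duplicates within the batch, then drop URLs already claimed.
        new = [u for u in dict.fromkeys(urls) if u not in claimed]
        with open(csv_path, "a", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            for u in new:
                # Assumed three-column layout: url, worker id, claim time.
                writer.writerow([u, worker_id, time.time()])
            f.flush()
            os.fsync(f.fileno())  # push the rows to disk before releasing the lock
        return new


if __name__ == "__main__":
    # Ten workers with overlapping URL batches (example.com is a placeholder).
    batches = [[f"https://example.com/page{w + i}" for i in range(20)] for w in range(10)]
    with ThreadPoolExecutor(max_workers=10) as pool:
        claims = list(pool.map(lambda a: claim_urls("claims.csv", *a), enumerate(batches)))
    print([len(c) for c in claims])
```

Each URL ends up in the file exactly once, and every row has the same three columns, because reading the existing claims and appending new ones happen under the same lock. The lock file is not removed if a process is killed while holding it, so a real setup would need some stale-lock handling.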
