I had problems with the KNIME-only approaches mentioned, so I tried a combination of KNIME and R. It has these steps:
* configure and run R's "tabulizer" package
* the settings 'stream' and GUESS seem to work best in your case
* it extracts one table from each page, tries to find the headers and brings them into a table
* not all information ends up in the same columns (we will come to that later)
* the tables are saved as individual CSV files (with their varying structure)
* the CSV files are then imported into KNIME, forcing all columns to be strings, and combined into a single table
* the text fields whose information is spread over three columns are merged
* the summary lines with the Credit balance are separated
* a single ID for each transaction block is created and distributed
* the "our reference" field is extracted and stored in a separate column (you might do the same with other information)
* the remaining "communication" is brought into one cell
* all the information is put together and can be stored
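The workflow itself does these steps with KNIME nodes. As a rough illustration of the logic only, here is a minimal Python sketch of the import, ID and extraction steps. Everything specific in it is an assumption for the sake of the example and not taken from the workflow: the column names `date`, `text`, `details` and `source_page`, the rule that a non-empty `date` starts a new transaction block, and the wording matched by `REF_PATTERN`.

```python
import csv
import io
import re

# Hypothetical pattern for the "our reference" line; adjust to the real wording.
REF_PATTERN = re.compile(r"our reference\s*:?\s*(\S+)", re.IGNORECASE)


def read_all_as_strings(csv_texts):
    """Stack CSV tables with differing headers; every value stays a string."""
    columns, rows = [], []
    for page_no, text in enumerate(csv_texts, start=1):
        reader = csv.DictReader(io.StringIO(text))
        for name in reader.fieldnames or []:
            if name not in columns:
                columns.append(name)
        for row in reader:
            row["source_page"] = str(page_no)
            rows.append(row)
    columns.append("source_page")
    # Fill columns that a page did not have with empty strings.
    return [{c: (r.get(c) or "") for c in columns} for r in rows]


def assign_block_ids(rows, start_column="date"):
    """Give each row the ID of the transaction block it belongs to.

    Assumption: a non-empty start_column marks the first row of a block.
    """
    block_id = 0
    for row in rows:
        if row[start_column].strip():
            block_id += 1
        row["block_id"] = block_id
    return rows


def collapse_blocks(rows, text_columns=("text", "details")):
    """Return one record per block, with the reference split from the rest."""
    blocks = {}
    for row in rows:
        block = blocks.setdefault(
            row["block_id"],
            {"block_id": row["block_id"], "our_reference": "", "communication": []},
        )
        # Merge the text that ended up in several columns.
        text = " ".join(
            row[c].strip() for c in text_columns if row.get(c, "").strip()
        )
        match = REF_PATTERN.search(text)
        if match:
            block["our_reference"] = match.group(1)
        elif text:
            block["communication"].append(text)
    for block in blocks.values():
        block["communication"] = " ".join(block["communication"])
    return list(blocks.values())
```

You would call it as `collapse_blocks(assign_block_ids(read_all_as_strings(pages)))`, where `pages` is a list with the text of each CSV file. Keeping everything as strings until the end avoids type conflicts when the pages do not agree on their columns.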
Of course, you might do further manipulations, like converting the sums into numbers or introducing checks against the separate balances. If the columns change a lot, you might have to alter the workflow and change the definitions in R.
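A minimal Python sketch of such a conversion and balance check could look like this. It assumes amounts written like `1.234,56` (dot for thousands, comma for decimals), which may not match your statements; `parse_amount` and `balance_matches` are hypothetical helper names, not part of the workflow.

```python
from decimal import Decimal


def parse_amount(text, decimal_sep=",", thousands_sep="."):
    """Turn an amount string into a Decimal (separators are assumptions)."""
    cleaned = text.strip().replace(thousands_sep, "").replace(decimal_sep, ".")
    return Decimal(cleaned)


def balance_matches(opening, amounts, closing):
    """Check that opening balance plus all transactions equals the closing balance."""
    return opening + sum(amounts, Decimal("0")) == closing
```

Using `Decimal` instead of floats keeps the comparison exact, so a mismatch points to a real extraction problem and not to rounding.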