Icon

kn_​forum_​26384_​pdf_​table_​extract_​r

Extract Table from PDF with the help of R "tabulizer" and KNIME

I had problems with the mentioned KNIME only approaches so I tried something with KNIME and R. It has these steps:

* run and configure R's "tabulizer"
* it seems the settings 'stream' and GUESS are working best in your case
* it would extract one table from each page and try to find headers and bring them to a table
* not all information would be in the same columns (we come to that later)
* the tables are saved as single CSVs (with their varying structure)
* then they would be imported into KNIME forcing the columns to be all strings and be brought into a single table
* the text fields which contain information in three columns would be integrated
* the summary lines with the Credit balance would be separated
* a single ID for each transaction block is created and distributed
* the "our reference" field is extracted separately and be stored in a separate column (you might do that to other information as well)
* the remaining "communication" is brought into one cell
* all the information is being put together and could be stored

Of course, you might do further manipulations like converting the sums into numbers. Introducing checks with the separate balances and so on. If you have columns that would change very much you might have to alter the workflows and change the definitions in R.

https://blog.az.sg/posts/reading-pdfs-in-r/https://community.rstudio.com/t/need-to-extract-tables-from-a-pdf-using-r/17144/3https://cran.r-project.org/web/packages/tabulizer/vignettes/tabulizer.html library("tabulizer")workpath_r <- knime.flow.in[["var_path_data"]]v_my_pdf <- paste0(workpath_r , "352266__2019_part1__removed.pdf")# stream seems to get some good resultsdf_stream <- extract_tables( v_my_pdf , guess = TRUE, method = "stream", # output = c("matrix") output = c("data.frame"))options("encoding" = "UTF-8")for (i in seq_along(df_stream)) { filename = paste0(workpath_r, i, ".csv") write.table(df_stream[[i]], file = filename, sep = "|", col.names = TRUE, row.names = FALSE) # write.csv(df_stream[[i]], filename)} how to extract "Our reference" - as an example collect all lines of the "communication" into one cell Extract Table from PDF with the help of R "tabulizer" and KNIMEhttps://forum.knime.com/t/automate-pdf-reader-and-convert-data-to-excel-table-with-correct-column-mappings/26384/10?u=mlauber71 library(readr)file_location <- knime.flow.in[["Location"]]knime.out <- as.data.frame(read_delim(file=file_location, delim="|", col_names = TRUE, col_types = cols(.default = "c"))) listCSV filesSTARTv_path_csv_fileread CSV fileand force allcells to stringENDreplacemissings with blankscombine textfieldsread CSV fileand force allcells to stringMay not work on MacOSnew_idso you could later reference it back$new column$ LIKE "Credit balance on*" => TRUETRUE => FALSE- separate balance statementsunique_idMISSING $Date$ => FALSE$Date$ = "" => FALSE$Date$ = "" => FALSEjoin theID of the groupsfor opticsnew_id_uniqueso you could later reference it backfill previous rowsto mark blocks$new column$ LIKE "Our reference*" => TRUE- separate balance statementsour_reference(Our\ +)(reference\ +)(.*)our_referenceMISSING $Date$ => FALSE$Date$ = "" => FALSE$Date$ = "" => FALSEcollect allthe communication(might split that up further)add referenceadd communicationfor opticscommunication/data/result.xlsx/data/result.xlsxoverview_credit_balanceextract Date(Credit\ +)(balance\ +)(on\ +)(.*)Date List Files Table Row ToVariable Loop Start collect meta data Java EditVariable (simple) R Source (Table) Loop End Missing Value String Manipulation Column Filter Column Resorter R Source (Table) Java Snippet(simple) Rule-basedRow Splitter Column Combiner Rule-basedRow Splitter Joiner Sorter Java Snippet(simple) Missing Value Rule-basedRow Splitter Regex Split Column Rename Rule-basedRow Splitter GroupBy Joiner Joiner Column Filter Sorter Column Rename Excel SheetAppender (XLS) Excel SheetAppender (XLS) Regex Split Column Rename Column Filter https://blog.az.sg/posts/reading-pdfs-in-r/https://community.rstudio.com/t/need-to-extract-tables-from-a-pdf-using-r/17144/3https://cran.r-project.org/web/packages/tabulizer/vignettes/tabulizer.html library("tabulizer")workpath_r <- knime.flow.in[["var_path_data"]]v_my_pdf <- paste0(workpath_r , "352266__2019_part1__removed.pdf")# stream seems to get some good resultsdf_stream <- extract_tables( v_my_pdf , guess = TRUE, method = "stream", # output = c("matrix") output = c("data.frame"))options("encoding" = "UTF-8")for (i in seq_along(df_stream)) { filename = paste0(workpath_r, i, ".csv") write.table(df_stream[[i]], file = filename, sep = "|", col.names = TRUE, row.names = FALSE) # write.csv(df_stream[[i]], filename)} how to extract "Our reference" - as an example collect all lines of the "communication" into one cell Extract Table from PDF with the help of R "tabulizer" and KNIMEhttps://forum.knime.com/t/automate-pdf-reader-and-convert-data-to-excel-table-with-correct-column-mappings/26384/10?u=mlauber71 library(readr)file_location <- knime.flow.in[["Location"]]knime.out <- as.data.frame(read_delim(file=file_location, delim="|", col_names = TRUE, col_types = cols(.default = "c"))) listCSV filesSTARTv_path_csv_fileread CSV fileand force allcells to stringENDreplacemissings with blankscombine textfieldsread CSV fileand force allcells to stringMay not work on MacOSnew_idso you could later reference it back$new column$ LIKE "Credit balance on*" => TRUETRUE => FALSE- separate balance statementsunique_idMISSING $Date$ => FALSE$Date$ = "" => FALSE$Date$ = "" => FALSEjoin theID of the groupsfor opticsnew_id_uniqueso you could later reference it backfill previous rowsto mark blocks$new column$ LIKE "Our reference*" => TRUE- separate balance statementsour_reference(Our\ +)(reference\ +)(.*)our_referenceMISSING $Date$ => FALSE$Date$ = "" => FALSE$Date$ = "" => FALSEcollect allthe communication(might split that up further)add referenceadd communicationfor opticscommunication/data/result.xlsx/data/result.xlsxoverview_credit_balanceextract Date(Credit\ +)(balance\ +)(on\ +)(.*)DateList Files Table Row ToVariable Loop Start collect meta data Java EditVariable (simple) R Source (Table) Loop End Missing Value String Manipulation Column Filter Column Resorter R Source (Table) Java Snippet(simple) Rule-basedRow Splitter Column Combiner Rule-basedRow Splitter Joiner Sorter Java Snippet(simple) Missing Value Rule-basedRow Splitter Regex Split Column Rename Rule-basedRow Splitter GroupBy Joiner Joiner Column Filter Sorter Column Rename Excel SheetAppender (XLS) Excel SheetAppender (XLS) Regex Split Column Rename Column Filter

Nodes

Extensions

Links