
kn_example_r_docx_import_table3

Read tables from Word documents (.docx) and import them into KNIME

Uses the R package docxtractr to import tables from a Word document.

In this example the table structure changes from table to table, so each table is captured as its own KNIME table and as an Excel file per Word document.



Forum thread: https://forum.knime.com/t/extract-data-from-pdf-invoice/38492/2?u=mlauber71

R script: import all tables into a tibble and store them as .RDS file

    workpath_r <- knime.flow.in[["var_path_data"]]
    workspace_name <- paste0(workpath_r, "workspace_r.RData")

    # construct name for .RDS file to hold your list of data.frames
    dataobject_name <- paste0(workpath_r, "datalist_r_", knime.flow.in[["File name"]], ".rds")

    setwd(workpath_r) # set work directory

    library(docxtractr)

    v_document_to_read <- c(knime.flow.in[["Location"]])

    # import the Word document
    real_world <- read_docx(v_document_to_read)
    docx_tbl_count(real_world)

    # extract all the tables into a list of tables
    tbls <- docx_extract_all_tbls(real_world)

    # determine the number of tables that are there
    no_tbls <- length(tbls)

    # describe what the tables look like (you could skip that in a real production environment)
    docx_describe_tbls(real_world)

    # command to extract all tables to see what they look like
    # docx_extract_all_tbls(real_world, guess_header = TRUE, preserve = FALSE, trim = TRUE)

    # https://stackoverflow.com/questions/29402528/append-data-frames-together-in-a-for-loop/29419402
    # create an empty list to collect the tables
    datalist <- list()

    # seq_len() skips the loop when the document contains no tables
    for (i in seq_len(no_tbls)) {
      # extract the i-th table from the docx
      df <- docx_extract_tbl(real_world, i, header = TRUE)
      df$document <- knime.flow.in[["File name"]] # add the document name
      df$table_no <- i # keep track of which iteration/table produced it
      datalist[[i]] <- df # add it to your list
    }

    # save and load the working environment
    # save.image(workspace_name)
    # load(workspace_name)
    saveRDS(datalist, dataobject_name)

    # export the number of tables that have been found
    knime.out <- data.frame("no_tables" = no_tbls)

R script: fetch table from tibble object

    workpath_r <- knime.flow.in[["var_path_data"]]

    # construct name for .RDS file
    dataobject_name <- paste0(workpath_r, "datalist_r_", knime.flow.in[["File name"]], ".rds")

    datalist <- readRDS(dataobject_name)

    v_no_to_fetch <- knime.in$"v_no_to_fetch"

    knime.out <- as.data.frame(datalist[[v_no_to_fetch]])

Workflow annotations:

    determine path and name of workflow
    list *.docx files from directory _data
    v_path_docx_files
    v_path_doc*
    OUTER Loop Start
    import all tables into a tibble and store them as .RDS file
    number of tables found
    START collect structure
    iterate the document no to fetch
    v_no_to_fetch
    fetch table from tibble object
    var_export_table
    var_export_table_xlsx
    END collect structure
    OUTER Loop End

Nodes used: R Source (Table), Extract Context Properties, R Snippet, Math Formula, Java Edit Variable (simple), URL to File Path, String to Path (Variable), List Files/Folders, Path to String, Table Row To Variable Loop Start, Table Row to Variable, Counting Loop Start, Loop End, Table Writer, Excel Writer, Variable Loop End, Merge Variables
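
The store-then-fetch pattern used by the R scripts above can be tried outside KNIME with base R alone. The following is a minimal sketch, not the workflow's own code: the two small data frames, the name example.docx and the temporary directory are made-up stand-ins for the tables that docxtractr would extract and for the KNIME flow variables. It shows why a list saved as .RDS suits tables whose structure changes: each element keeps its own columns, and one element can be fetched later by its number.

```r
# Two tables with different columns, standing in for tables extracted
# from one Word document (made-up data, not taken from the workflow)
tbl_a <- data.frame(item = c("A", "B"), price = c("1.50", "2.00"),
                    stringsAsFactors = FALSE)
tbl_b <- data.frame(name = "x", qty = "3", unit = "kg",
                    stringsAsFactors = FALSE)
tables <- list(tbl_a, tbl_b)

file_name <- "example.docx" # hypothetical document name

# collect the tables in a list, tagging each with its origin
datalist <- list()
for (i in seq_along(tables)) {
  df <- tables[[i]]
  df$document <- file_name # add the document name
  df$table_no <- i         # position of the table in the document
  datalist[[i]] <- df
}

# store the whole list in a single .RDS file (temporary directory here)
rds_path <- file.path(tempdir(), paste0("datalist_r_", file_name, ".rds"))
saveRDS(datalist, rds_path)

# later, possibly in another R session: reload and fetch one table by number
restored <- readRDS(rds_path)
no_tables <- length(restored)
second <- as.data.frame(restored[[2]])

print(no_tables)
print(names(second))
```

In the workflow, the count of tables is what the inner loop iterates over, and the fetched element becomes the output table of that iteration.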
