kn_example_r_pdf_read_text

Use R and KNIME to extract text from a PDF file and search for the page where a given text appears.

Workflow annotations

Extract the text of a PDF file (R Source (Table) node, label "extract text"):

    library("pdftools")
    v_pdf_file <- knime.flow.in[["File path"]]
    pdf.text <- pdftools::pdf_text(v_pdf_file)
    knime.out <- as.data.frame(pdf.text)

Related workflow: Extract Table from PDF with the help of R "tabulizer" and KNIME
https://hub.knime.com/mlauber71/spaces/Public/latest/forum/kn_forum_26384_pdf_table_extract_r~3YMQ5EiC7ojFR7AZ

Search for a word in the PDF file (R Source (Table) node, label "search for word in PDF file"):

    library("pdftools")
    library("stringr")
    # https://www.r-bloggers.com/2021/06/extract-text-from-pdf-in-r-and-word-detection/
    v_pdf_file <- knime.flow.in[["File path"]]
    pdf.text <- pdftools::pdf_text(v_pdf_file)
    pdf.text <- unlist(pdf.text)
    pdf.text <- tolower(pdf.text)
    v_search_word <- tolower(knime.flow.in[["v_search_word"]])
    res <- data.frame(str_detect(pdf.text, v_search_word))
    colnames(res) <- "Result"
    res <- subset(res, res$Result == TRUE)
    knime.out <- as.data.frame(res)

Search for the line and page of the word you were searching for (in this case "Procter"):
https://forum.knime.com/t/text-mining-from-pdf-documents-and-results-places/48186/3?u=mlauber71

Variable v_search_word: the word to search for in the PDF documents, e.g. "Procter".

CSV files are written with ASCII 164 (¤), the "flattened turtle":
https://forum.knime.com/t/csv-writer-adds-random-rows-while-creating-the-file/32456/12?u=mlauber71

Node labels, in the order they appear on the page:

scan for PDF in sub folder /data/; START; extract PDF file details; extract text; file_name; END; file_path; pdf_extract.table; pdf_extract.csv

search for word in PDF file; START; file_name; END; file_path; Page_Number; Search_Word; pdf_search_word.table; pdf_search_word.csv

Page_Number; pdf_extract.csv; Counter as RowID; DESCENDING; fill page and file name; ASCENDING; keep only matches; Search_Word; Maximum of Lines (Counter); keep only matches; $Page_Number$ + 1; line_on_page; result_search_word.xlsx

Column Expressions rule used to flag matching rows:

    if (contains(column("pdf.text"), variable("v_search_word"))) {true} else {false}

Nodes used, in the order they appear on the page:

List Files/Folders, Table Row To Variable Loop Start, Path to URI, URL to File Path, R Source (Table), Constant Value Column, Loop End, Constant Value Column, Table Writer, CSV Writer, Column Filter, String Configuration, R Source (Table), Table Row To Variable Loop Start, Constant Value Column, Loop End, Constant Value Column, RowID, Constant Value Column, Table Writer, CSV Writer, Column Filter, RowID, CSV Reader, Counter Generation, RowID, Sorter, Missing Value, Sorter, Column Expressions, Row Filter, Constant Value Column, GroupBy, Row Filter, Math Formula, Joiner, Missing Value, Math Formula, Column Filter, Excel Writer

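The workflow locates the page and the line on that page for the search word with KNIME nodes (Counter Generation, Column Expressions, Row Filter, GroupBy and so on). As a rough illustration of the same idea in R alone, the sketch below is not the workflow's implementation: `find_word_lines` is a hypothetical helper, and the `pages` vector is made-up sample text standing in for the result of `pdftools::pdf_text()`, which returns one string per page. It uses base R only.

```r
# Hypothetical helper (not part of the workflow): report the page and the
# line on that page for every line containing the search word.
# `pages` is a character vector with one element per page, the same shape
# that pdftools::pdf_text() returns.
find_word_lines <- function(pages, search_word) {
  needle <- tolower(search_word)
  hits <- lapply(seq_along(pages), function(page_no) {
    # split the page text into lines at the newline characters
    page_lines <- strsplit(pages[[page_no]], "\n", fixed = TRUE)[[1]]
    # case-insensitive, literal (non-regex) match on each line
    line_no <- which(grepl(needle, tolower(page_lines), fixed = TRUE))
    data.frame(
      Page_Number  = rep(page_no, length(line_no)),
      line_on_page = line_no,
      text         = trimws(page_lines[line_no]),
      stringsAsFactors = FALSE
    )
  })
  # stack the per-page results into one table
  do.call(rbind, hits)
}

# Made-up page texts (assumption) in place of pdftools::pdf_text(v_pdf_file)
pages <- c(
  "Annual report\nNo match on this page",
  "Intro\nThe Procter & Gamble Company\nMore text"
)
result <- find_word_lines(pages, "Procter")
print(result)
```

With the sample pages above, the only hit is on page 2, line 2. Inside an R Source (Table) node, `pages` could presumably be replaced by `pdftools::pdf_text(knime.flow.in[["File path"]])` and `result` assigned to `knime.out`, mirroring the snippets in the workflow.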