
kn_forum_63429_pdf_extract_png_description

Extract Image and Text from a PDF file

https://forum.knime.com/t/associate-text-to-images-from-pdf/63429/4?u=mlauber71

The /data/ subfolder contains a Jupyter notebook "kn_forum_63429_pdf_extract_png_description.ipynb".

    import os
    import re

    import pytesseract
    from openpyxl import Workbook
    from pdf2image import convert_from_path


    def extract_images_and_descriptions(pdf_path):
        # Set up Tesseract configuration
        pytesseract.pytesseract.tesseract_cmd = 'tesseract'  # Or your Tesseract executable path

        # Create a new Excel workbook and add a worksheet
        wb = Workbook()
        ws = wb.active
        ws.append(['PDF Name', 'Page', 'Image Name', 'Description'])

        # Extract the PDF file name without extension
        pdf_name = os.path.splitext(os.path.basename(pdf_path))[0]

        # Convert the PDF to images (one per page)
        images = convert_from_path(pdf_path)

        # Iterate over the images and extract the descriptions
        for i, img in enumerate(images, start=1):
            # Save the image as PNG
            img_name = f'{pdf_name}_image_{i}.png'
            img.save(img_name, 'PNG')

            # Extract text from the image
            text = pytesseract.image_to_string(img)

            # Extract the description (assuming the text below the image is the description)
            description = re.sub(r'\n+', '\n', text.strip()).split('\n')[-1]

            # Write the extracted data to the Excel sheet
            ws.append([pdf_name, i, img_name, description])

        # Save the Excel file with the PDF name
        wb.save(f'{pdf_name}_image_descriptions.xlsx')

Extract the results with the Excel Reader node.
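The description step in the function (take the last line of the OCR text for each page) can be tried on its own without Tesseract installed. The following is a minimal sketch under the same assumption the notebook makes, namely that the caption is the last line of text on the page. The helper name last_nonempty_line and the sample string are hypothetical and not part of the notebook. Unlike the original one-liner, it also skips lines that contain only whitespace, which OCR output may include.

```python
import re


def last_nonempty_line(text: str) -> str:
    """Return the last non-blank line of OCR text, or '' if there is none."""
    # Collapse runs of newlines, as the notebook's regex does, then split.
    lines = re.sub(r'\n+', '\n', text.strip()).split('\n')
    # Drop lines that contain only whitespace.
    lines = [line.strip() for line in lines if line.strip()]
    return lines[-1] if lines else ''


# Hypothetical OCR output for one page: body text followed by a caption.
sample = "Chart title\n\n\nsome body text\n  \nFigure 1: Sales by region\n\n"
print(last_nonempty_line(sample))
```

If the caption spans several lines or sits above the image, this heuristic will pick the wrong text, so the result should be checked against the saved PNG files.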
