
OCR_Python_Portable_manual_env

OCR Foreign Language PDFs with Python and KNIME (with Tesseract and PDFium)

This workflow shows how to OCR text in any language from PDFs, whether they are text-based or image-based, using Python and KNIME. If the desired language does not show up in the drop-down when configuring the OCR Component, additional languages can be set by tweaking the script.
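As a rough illustration of the OCR step, the sketch below renders each PDF page with pypdfium2 and passes the image to Tesseract through pytesseract. It is not the script inside the OCR Component; the names `dpi_to_scale` and `ocr_pdf`, the 300 DPI default, and the optional `tessdata_dir` argument are assumptions made for this example. The language is selected with a Tesseract language code such as `eng` or `deu`.

```python
from typing import List, Optional

PDF_POINTS_PER_INCH = 72  # PDF page sizes are measured in points (1/72 inch)


def dpi_to_scale(dpi: float) -> float:
    """Convert a target DPI into the scale factor used when rendering a page."""
    return dpi / PDF_POINTS_PER_INCH


def ocr_pdf(pdf_path: str, lang: str = "eng", dpi: int = 300,
            tessdata_dir: Optional[str] = None) -> List[str]:
    """Return one OCR'd text string per page (hypothetical helper)."""
    # Imported lazily so dpi_to_scale() works without the OCR stack installed.
    import pypdfium2 as pdfium
    import pytesseract

    # Optionally point Tesseract at a custom folder of .traineddata files.
    config = f'--tessdata-dir "{tessdata_dir}"' if tessdata_dir else ""

    pdf = pdfium.PdfDocument(pdf_path)
    try:
        texts = []
        for index in range(len(pdf)):
            page = pdf[index]
            # Render the page to a PIL image at the requested resolution.
            image = page.render(scale=dpi_to_scale(dpi)).to_pil()
            texts.append(pytesseract.image_to_string(image, lang=lang, config=config))
        return texts
    finally:
        pdf.close()
```

A call such as `ocr_pdf("scan.pdf", lang="deu")` would need the German language data to be available to Tesseract; the file name here is only a placeholder.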

The workflow needs the conda environment created by one of the commands below. After creating it, select it in KNIME under Preferences - KNIME - Python.


Linux:

conda create -n knime_ocr_tess_pdfium -c knime -c conda-forge --strict-channel-priority python=3.11 knime-python-scripting=5.8 pypdfium2 opencv pytesseract tesseract pillow numpy pandas

Windows:

conda create -n knime_ocr_tess_pdfium -c knime -c conda-forge -c pypdfium2-team -c bblanchon --strict-channel-priority python=3.11 knime-python-scripting=5.8 pypdfium2-team::pypdfium2_helpers opencv pytesseract tesseract pillow numpy pandas
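After creating the environment, a quick check like the following can confirm that the packages are importable. This is a minimal sketch to run with the new environment's Python interpreter; the helper names `missing_modules` and `tesseract_on_path` are made up for this example. Note that the import names for opencv and pillow (`cv2` and `PIL`) differ from their conda package names.

```python
import importlib.util
import shutil
from typing import Iterable, List

# Import names of the packages installed by the conda commands above.
REQUIRED_MODULES = ["pypdfium2", "cv2", "pytesseract", "PIL", "numpy", "pandas"]


def missing_modules(names: Iterable[str] = REQUIRED_MODULES) -> List[str]:
    """Return the top-level modules that cannot be found in this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def tesseract_on_path() -> bool:
    """Check whether the tesseract executable is reachable via PATH."""
    return shutil.which("tesseract") is not None


if __name__ == "__main__":
    print("Missing modules:", missing_modules() or "none")
    print("tesseract binary found:", tesseract_on_path())
```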

Note: If any language other than English is selected, the workflow downloads the matching Tesseract language files and stores them in the workflow folder under /data/tessdata/
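The download step described in the note could look roughly like the sketch below. The source does not say where the workflow fetches the files from; the URL pointing at the tesseract-ocr/tessdata repository on GitHub is an assumption, as are the function name `ensure_language` and its `fetch` parameter (which exists so the download can be swapped out, for example in tests).

```python
import urllib.request
from pathlib import Path
from typing import Callable, Optional

# Assumption: language files come from the tesseract-ocr/tessdata GitHub repository.
TESSDATA_URL = "https://github.com/tesseract-ocr/tessdata/raw/main/{lang}.traineddata"


def ensure_language(workflow_dir: str, lang: str,
                    fetch: Callable = urllib.request.urlretrieve) -> Optional[Path]:
    """Download <lang>.traineddata into <workflow_dir>/data/tessdata if it is missing."""
    if lang == "eng":
        # Per the note above, English does not trigger a download.
        return None
    tessdata = Path(workflow_dir) / "data" / "tessdata"
    tessdata.mkdir(parents=True, exist_ok=True)
    target = tessdata / f"{lang}.traineddata"
    if not target.exists():
        fetch(TESSDATA_URL.format(lang=lang), str(target))
    return target
```

Because the file is only fetched when it does not exist yet, repeated runs with the same language reuse the copy stored in the workflow folder.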

This workflow is based on this one: https://hub.knime.com/s/hDBtIjjK900pPNaK

Portable OCR with Tesseract and PDFium, Manual Conda Environment Install

Directory of PDFs: 1 image-based PDF, 2 text-based PDFs
List Files/Folders
For options, double-click the component
OCR
Use the Tika parser on all files to see if text can be extracted
Extract text if possible
open the view of the component to check the results
Check Results
Top: if fewer than X words are extracted, the PDF is treated as image-based. You can set X by double-clicking the component.
Set threshold to determine if image-based
put data together
Concatenate
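The threshold rule used to separate image-based from text-based PDFs can be sketched in a few lines of Python. This is an illustration of the idea only, not the component's implementation; the function names and the default of 10 words for X are assumptions.

```python
from typing import Optional


def count_words(text: Optional[str]) -> int:
    """Count whitespace-separated words; missing text counts as zero words."""
    return len(text.split()) if text else 0


def is_image_based(extracted_text: Optional[str], min_words: int = 10) -> bool:
    """Treat a PDF as image-based if the parser extracted fewer than min_words words."""
    return count_words(extracted_text) < min_words
```

PDFs flagged as image-based would then go to the OCR branch, while the others keep the text already extracted by the parser.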
