
OCR_Python_Portable_CEP

<p>OCR Foreign Language PDFs with Python and KNIME (with Tesseract and PDFium)<br><br>This workflow shows how to OCR PDFs in a foreign language using Python and KNIME. If the desired language does not appear in the drop-down when configuring the OCR component, additional languages can be added by editing the script.</p><p><strong>Conda is required on the machine</strong> and must be set up according to the "Prerequisites" section of this documentation (under Preferences - KNIME - Conda).<br><br>For portability, the Conda Environment Propagation (CEP) node sets up the environment, so it should <strong>not be necessary</strong> to create the following environment manually. The commands are given for the sake of completeness, in case a workaround without the CEP node is needed:</p><ul><li><p>Linux: conda create -n knime_ocr_tess_pdfium -c knime -c conda-forge --strict-channel-priority python=3.11 knime-python-scripting=5.8 pypdfium2 opencv pytesseract tesseract pillow numpy pandas</p></li><li><p>Windows: conda create -n knime_ocr_tess_pdfium -c knime -c conda-forge -c pypdfium2-team -c bblanchon --strict-channel-priority python=3.11 knime-python-scripting=5.8 pypdfium2-team::pypdfium2_helpers opencv pytesseract tesseract pillow numpy pandas</p></li></ul><p>Note: If any language other than English is selected, the <strong>workflow will download the appropriate Tesseract language files</strong> and store them within the workflow folder under /data/tessdata/</p><p>This workflow is based on this one: https://hub.knime.com/s/hDBtIjjK900pPNaK</p>

URL: Conda Documentation, Prerequisites https://docs.knime.com/ap/latest/python_installation_guide/#prerequisites


Portable OCR with Tesseract and PDFium, Automatic Environment Install

Sets up the necessary Conda environment.

Will take a long time on the first run.

On a local machine, the environment will persist; on the Business Hub, it is deleted at each restart of the executor.
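To check whether the environment is available inside a Python Script node, a minimal sketch like the one below can help. The helper `missing_modules` is hypothetical, and the import names are assumed to correspond to the packages listed in the conda commands (for example `cv2` for opencv and `PIL` for pillow).

```python
from importlib.util import find_spec

# Assumed import names for the packages named in the conda commands.
REQUIRED_MODULES = ["pypdfium2", "cv2", "pytesseract", "PIL", "numpy", "pandas"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in the active environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

An empty list suggests the Python packages are in place. It does not confirm that the `tesseract` executable itself is installed.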

Use the Tika parser on all files to see if text can be extracted
Extract text if possible
Put the data together
Concatenate
Open the view of the component to check the results
Check Results
For options, double-click the component
OCR
Directory of PDFs: 1 image-based PDF, 2 text-based PDFs
List Files/Folders
Top: if fewer than X words are extracted, the PDF is treated as image-based. You can set X by double-clicking the component.
Set threshold to determine if image-based
Conda Environment Propagation
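The routing described by these nodes (parse first, fall back to OCR when too few words come back) might look roughly like this in plain Python. It is a sketch under stated assumptions, not the component's actual script: the function names, the default threshold of 10 words, and the render scale are all hypothetical. The OCR step relies on `pypdfium2` and `pytesseract` from the environment above.

```python
def is_image_based(extracted_text, min_words):
    """Treat a PDF as image-based when parsing yielded fewer than min_words words."""
    return len(extracted_text.split()) < min_words


def ocr_pdf(pdf_path, lang="eng", scale=2.0, tessdata_dir=None):
    """Render each page with PDFium and run Tesseract on the rendered image."""
    # Imported lazily so the threshold helper works without the OCR stack.
    import pypdfium2 as pdfium
    import pytesseract

    # Optionally point Tesseract at a custom tessdata folder (assumption).
    config = f'--tessdata-dir "{tessdata_dir}"' if tessdata_dir else ""
    pdf = pdfium.PdfDocument(str(pdf_path))
    try:
        pages = []
        for index in range(len(pdf)):
            image = pdf[index].render(scale=scale).to_pil()
            pages.append(pytesseract.image_to_string(image, lang=lang, config=config))
    finally:
        pdf.close()
    return "\n".join(pages)


def text_for(pdf_path, parsed_text, min_words=10, ocr=ocr_pdf):
    """Keep the parsed text when there is enough of it, otherwise fall back to OCR."""
    if is_image_based(parsed_text, min_words):
        return ocr(pdf_path)
    return parsed_text
```

Here `min_words` plays the role of X. Raising it sends more borderline PDFs through OCR, which is slower but catches scans that contain only a few embedded words.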
