
OCR_Python

OCR Foreign Language PDFs with Python and KNIME

This workflow shows you how to OCR foreign-language PDFs, whether text-based or image-based, using Python and KNIME. The example language is Japanese, but this can be changed in the Python script.

This workflow requires several installations via the terminal, and the paths to those installations must be entered into the component before the workflow will run.
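If you are unsure where the tools were installed, a small script can look them up on your PATH. This is only a sketch: the helper name find_tool_dir is hypothetical, and it assumes that tesseract and poppler's pdftoppm utility are reachable from the terminal you run it in.

```python
import shutil
from pathlib import Path


def find_tool_dir(executable):
    """Return the directory that contains `executable` on PATH, or None if it is not found."""
    location = shutil.which(executable)
    return str(Path(location).parent) if location else None


if __name__ == "__main__":
    # Full path to the tesseract executable (tesseract.exe on Windows).
    print("tesseract executable:", shutil.which("tesseract"))
    # pdftoppm ships with poppler, so its directory is the poppler bin path.
    print("poppler bin directory:", find_tool_dir("pdftoppm"))
```

If either line prints None, the tool is not on your PATH and you will need to locate it manually.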

If you have any questions, please post to the KNIME Forum and tag me using @victor_palacios.

This was primarily created for Mac users who want to OCR, but Windows users will find instructions in the Python node.

For this to work you need:
1. Python configured with KNIME: https://www.knime.com/blog/how-to-setup-the-python-extension
2. Install the following via the terminal: poppler, pdf2image, cv2, PIL, pytesseract (cv2 and PIL are the import names; their pip packages are opencv-python and Pillow)
3. Download your language pack if not English (Japanese is here: https://github.com/tesseract-ocr/tessdata/blob/3.04.00/jpn.traineddata)
4. Move the language pack into the tesseract folder (guide here: https://pyimagesearch.com/2020/08/03/tesseract-ocr-for-non-english-languages/)
5. Find your tesseract.exe path and enter it into the OCR component (if a Mac user, see my path in the OCR component as an example)
6. Find your poppler bin path and enter it into the OCR component (if a Mac user, see my path in the OCR component as an example)

Workflow nodes and their annotations:
List Files/Folders: directory of PDFs (1 image-based PDF, 2 text-based PDFs)
Extract text if possible: use the Tika parser on all files to see if text can be extracted
Set threshold to determine if image-based: top port: if fewer than X words are extracted, the file is treated as image-based. You can set X by double-clicking the component. A histogram can be seen with the interactive view output.
OCR: double-click the component to set paths and language
Table View: preview output with path
Concatenate: put data together
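The threshold and OCR steps described above can be sketched in Python roughly as follows. This is not the code inside the OCR component; the function names, the default of 50 words for X, and the whitespace-based word count are assumptions made for illustration. The cv2 preprocessing mentioned in the install list is left out.

```python
def is_image_based(extracted_text, min_words=50):
    """Treat a PDF as image-based when fewer than `min_words` words were extracted.

    `min_words` plays the role of X; 50 is an arbitrary placeholder.
    Words are counted by splitting on whitespace, which is a rough
    measure for languages such as Japanese that do not use spaces.
    """
    text = extracted_text or ""
    return len(text.split()) < min_words


def ocr_pdf(pdf_path, tesseract_cmd, poppler_bin, lang="jpn"):
    """Render each PDF page to an image and OCR it with Tesseract."""
    # Imported here so is_image_based works without the OCR stack installed.
    import pytesseract
    from pdf2image import convert_from_path

    # Point pytesseract at the tesseract executable entered in the component.
    pytesseract.pytesseract.tesseract_cmd = tesseract_cmd
    # pdf2image uses poppler's command-line tools from `poppler_bin`.
    pages = convert_from_path(pdf_path, poppler_path=poppler_bin)
    # `lang` must match an installed language pack, e.g. jpn.traineddata.
    return "\n".join(pytesseract.image_to_string(page, lang=lang) for page in pages)
```

A caller would run is_image_based on the text returned by the parser and pass only the files that come back True to ocr_pdf, supplying the tesseract and poppler paths from steps 5 and 6.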
