kn_forum_70050_pdf_table_extract_indy_car

Extract Table from PDF with the help of Python package "camelot"

KNIME and Python — Setting up and managing Conda environments
https://medium.com/p/2ac217792539



Forum thread: https://forum.knime.com/t/how-to-read-indycar-pdf-file/70050/2?u=mlauber71

Python Script node:

    import knime.scripting.io as knio
    import camelot
    import pandas as pd
    import os

    def extract_tables_to_parquet_camelot(pdf_path, output_folder):
        # Get the base name of the PDF for later use
        pdf_name = os.path.basename(pdf_path).replace('.pdf', '')

        # Use Camelot to read tables from the PDF
        tables = camelot.read_pdf(pdf_path, pages='all')

        # Make sure output_folder exists
        os.makedirs(output_folder, exist_ok=True)

        for i, table in enumerate(tables):
            # Convert the table to a pandas DataFrame
            df = table.df

            # Convert column names to string
            df.columns = df.columns.astype(str)

            # Add a column with the PDF name
            df['pdf_name'] = pdf_name

            # Add the running table number (1-based); it is stored in the
            # 'page' column and equals the PDF page only if each page holds one table
            df['page'] = i + 1

            # Write the table to a parquet file
            df.to_parquet(f'{output_folder}/{pdf_name}_table_{i + 1}.parquet')

    # Example usage:
    var_pdf_path = knio.flow_variables['v_path_pdf_file']
    var_output_folder = knio.flow_variables['v_path_result_camelot']
    extract_tables_to_parquet_camelot(var_pdf_path, var_output_folder)

Conda commands for the environment (kept as comments in the workflow):

    # conda env create -f="/Users/m_lauber/Dropbox/knime-workspace/forum/2023/kn_forum_70050_pdf_table_extract_indy_car/data/py3_knime.yml"
    # conda env create -f="C:\\Users\\x123456\\knime-workspace\\forum\\2023\\kn_forum_70050_pdf_table_extract_indy_car\\data\\py3_knime.yml"
    # conda remove -n py3_knime --all
    # conda activate py3_knime
    # conda update -n py3_knime --all
    # conda env update --name py3_knime --file "/Users/m_lauber/Dropbox/knime-workspace/forum/2023/kn_forum_70050_pdf_table_extract_indy_car/data/py3_knime.yml" --prune
    # conda env update --name py3_knime --file "C:\\Users\\x123456\\knime-workspace\\forum\\2023\\kn_forum_70050_pdf_table_extract_indy_car\\data\\py3_knime.yml" --prune
    # conda env update --name py3_knime --file "/Users/m_lauber/Dropbox/knime-workspace/forum/2023/kn_forum_70050_pdf_table_extract_indy_car/data/py3_knime.yml"
    # conda env update --name py3_knime --file "C:\\Users\\x123456\\knime-workspace\\forum\\2023\\kn_forum_70050_pdf_table_extract_indy_car\\data\\py3_knime.yml"
    # conda update -n base conda

    # KNIME official Python integration guide
    # https://docs.knime.com/latest/python_installation_guide/index.html#_introduction
    # KNIME and Python - Setting up and managing Conda environments
    # https://medium.com/p/2ac217792539
    # Hyperparameter optimization for LightGBM - wrapped in KNIME nodes
    # https://medium.com/p/ddb7ae1d7e2

    # conda activate py3_knime
    # conda install -n py3_knime -c conda-forge camelot-py
    # conda install -n py3_knime -c conda-forge tabula-py
    # conda install -n py3_knime -c conda-forge pdfplumber

Environment file:

    # file: py3_knime.yml with some modifications
    # THX Carsten Haubold (https://hub.knime.com/carstenhaubold) for hints
    name: py3_knime            # Name of the created environment
    channels:                  # Repositories to search for packages
    # - defaults               # edit: removed to just use conda-forge
    # - anaconda               # edit: removed to just use conda-forge
    - conda-forge
    # https://anaconda.org/knime
    - knime                    # conda search knime-python-base -c knime --info # to see what is in the package
    dependencies:              # List of packages that should be installed
    - python=3.9               # Python
    - knime-python-base        # dependencies of KNIME - Python integration
    # - knime-python-scripting # everything you need to also build Python packages for KNIME
    - cairo                    # SVG support
    - pillow                   # Image inputs/outputs
    - matplotlib               # Plotting
    - IPython                  # Notebook support
    - nbformat                 # Notebook support
    - scipy                    # Notebook support
    - jpype1                   # A Python to Java bridge
    # Jupyter Notebook support
    - jupyter                  # Jupyter Notebook
    - pandas-profiling         # create overview of your data
    - sweetviz                 # In-depth EDA (target analysis, comparison, feature analysis, correlation) in two lines of code!
    - plotly
    - pdfplumber
    - camelot-py
    - pip                      # Python installer
    - pip:
      # - kaleido              # kaleido

Node annotations in the workflow:

    locate and create /data/ folder with absolute paths
    read camelot parquet files exported from the Python script
    split the first two columns
    Row0
    Row0
    replace(column("0"),"", "|") => replace blanks with |
    split header by |
    header_*
    $Row1$ = "" => FALSE
    replace(column("header"),"/", "|")
    insert correct column heads
    correct the Lap column - fill blanks
    $T$ = "" => FALSE
    remove empty columns
    the head with the ?
    driver
    driver
    v_table_path
    v_excel_path
    v_*
    Node 151
    Node 152
    list /result_camelot/
    start the loop
    catch fail
    parquet_file
    Path
    extract file name
    v_path_result_camelot
    v_path_pdf_file
    edit: the PDF file !!!
    py3_knime
    v_*
    v_path_result_camelot_2
    Node 173
    the first file
    parquet_file_path

Nodes used in the workflow: Collect Local Metadata, Camelot for Extraction of tables from pdf, Parquet Reader, Column Splitter, Row Filter, Row Filter, Column Expressions, Cell Splitter, Column Filter, Transpose, Rule-based Row Filter, Counter Generation, Column Expressions, Extract Column Header, Transpose, Counter Generation, RowID, Joiner, Insert Column Header, Missing Value, String To Number (PMML), Rule-based Row Filter, Constant Value Column Filter, Table Row to Variable, Constant Value Column, Java Edit Variable (simple), Java Edit Variable (simple), String to Path (Variable), Excel Writer, Table Writer, List Files/Folders, Table Row To Variable Loop Start, Merge Variables, Variable Loop End, Try (Variable Ports), Catch Errors (Var Ports), Merge Variables, Constant Value Column, Path to String (Variable), URL to File Path (Variable), Python Script, Java Edit Variable (simple), Java Edit Variable (simple), Conda Environment Propagation, String to Path (Variable), Java Edit Variable (simple), Table Row to Variable, Parquet Reader, String to Path (Variable).
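The Python script in this workflow names each exported file {pdf_name}_table_{i + 1}.parquet and derives pdf_name by replacing '.pdf' in the file name. That replacement misses an upper-case '.PDF' extension and also removes '.pdf' from the middle of a name. A minimal sketch of a more tolerant variant is shown below, assuming only the naming scheme above; the helper name parquet_target and the sample file name indycar_results.PDF are hypothetical and not part of the workflow.

```python
from pathlib import Path


def parquet_target(pdf_path, output_folder, table_number):
    """Build the output path for one extracted table (hypothetical helper)."""
    # Path.stem drops only the last suffix, whatever its case, and leaves
    # any '.pdf' that occurs earlier in the file name untouched.
    pdf_name = Path(pdf_path).stem
    return Path(output_folder) / f"{pdf_name}_table_{table_number}.parquet"


if __name__ == "__main__":
    # Sample file and folder names are made up for illustration.
    print(parquet_target("data/indycar_results.PDF", "result_camelot", 1))
```

Inside the script, the returned path could be passed to df.to_parquet in place of the f-string; whether that fits depends on how the downstream nodes parse the file names.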