
Ollama - Extract Information from PDF Bank Statements into JSON

Extract data from bank statements (PDF) into JSON files with the help of Ollama and the Llama3 LLM:
- list PDFs or other documents (csv, txt, log) from your drive that have a roughly similar layout and from which you expect an LLM to be able to extract data
- formulate a concise prompt (and instruction) and try to force the LLM to always return a JSON file with the same structure (Mistral seems to be very good at that)
- convert each single document to a vector store, either Chroma or Meta's FAISS, with the help of Ollama and a suitable embedding model (mxbai-embed-large)
- use the Ollama wrapper (via Python and a KNIME node) to put the document and query before the LLM
- collect the data back from Python into KNIME
- extract the data from the JSON files, either with the help of regex or by converting the JSON with KNIME nodes
- make sure all results have the same structure
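The last extraction step can be sketched in plain Python. This is a minimal sketch of what the workflow does with its String Manipulation and Regex Extractor nodes: the ``` code fences in the model's answer are replaced with a ~# marker, the text between two markers is taken as the JSON payload, and it is then parsed.

```python
import json
import re

def extract_json_from_answer(response: str) -> dict:
    """Pull the first JSON object out of an LLM answer.

    Mirrors the workflow's String Manipulation + Regex Extractor steps:
    code fences (```) are replaced with a marker (~#) and the text
    between two markers is parsed as JSON.
    """
    marked = response.replace("```", "~#")        # regexReplace($Response$, "```", "~#")
    match = re.search(r"~#([\s\S]*?)~#", marked)  # Match 0: Group 1
    payload = match.group(1) if match else marked
    # models often prefix the fence with a language tag such as "json"
    payload = re.sub(r"^\s*json", "", payload, flags=re.IGNORECASE)
    return json.loads(payload)

answer = 'Here is the result:\n```json\n{"Firma": "ACME AG", "WKN": "123456"}\n```'
data = extract_json_from_answer(answer)
print(data["Firma"])  # ACME AG
```

If the model answers without code fences, the whole response is parsed as JSON, so the function degrades gracefully.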

=> You need to have a Python environment and Ollama installed, the models pulled locally, and Ollama running!
If you experience problems with the model download: check your proxy settings, then kill all running Ollama processes in your task manager and try again.
------
Run the following in a terminal window to start Ollama. You can also try other models (https://ollama.com), or just pull the model without running it:

ollama pull llama3:instruct
ollama run llama3:instruct

To get the embedding model, run this command in the terminal window:

ollama pull mxbai-embed-large
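The embedding model turns each text chunk into a vector, and the vector store (Chroma or FAISS) retrieves chunks by similarity to the query vector, typically cosine similarity. A toy sketch with hand-made 3-dimensional vectors (mxbai-embed-large actually returns 1024-dimensional vectors, and the file names here are made up):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# toy stand-ins for real embeddings
query_vec = [0.9, 0.1, 0.0]
doc_vecs = {
    "bank_statement.pdf": [0.8, 0.2, 0.1],
    "recipe.txt": [0.0, 0.1, 0.9],
}
best = max(doc_vecs, key=lambda name: cosine_similarity(query_vec, doc_vecs[name]))
print(best)  # bank_statement.pdf
```

This is why a suitable embedding model matters: the LLM only sees the chunks the vector store ranks as closest to the query.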

Ollama and Llama3 - A Streamlit App to convert your files into local Vector Stores and chat with them using the latest LLMs
https://medium.com/p/c5340fcd6ad0

Medium - Chat with local Llama3 Model via Ollama in KNIME Analytics Platform - Also extract Logs into structured JSON Files
https://medium.com/p/aca61e4a690a



Forum thread - Extract data from invoices to XML or CSV:
https://forum.knime.com/t/extract-data-from-invoices-to-xml-or-csv/80875/12?u=mlauber71

You may want to download the whole LLM workflow group in order to get the folder structure:
https://hub.knime.com/mlauber71/spaces/LLM_Space/~17k4zAECNryrZw1X/

------
The instruction used in the workflow:

Extract exactly these information from this document into a JSON file. Do not add any information that is not there. Do not change the structure of the JSON file. Do not alter the names of the fields or the order! Check that the JSON file is syntactically correct. If information is missing just leave it empty!

The JSON fields:
Firma (Company)
Depot (Depot)
Bank (Bank)
Transaktionsart (Transaction Type - Purchase / Sale)
ISIN / Kenn-Nr. der Aktie (ISIN / Stock ID Number)
WKN (WKN)
Wertstellung (Valuta) (Value Date)
Belegdatum (Document Date)
WSL (Währung) (Currency)
FW-Kurs (Devisenkurs) (Foreign Exchange Rate)
Menge (Quantity)
Kurs (Price)
WSL Kurs (Währung) (Currency of Price)
Kurswert (Price Value)
WSL Kurswert (Währung) (Currency of Price Value)

------
Workflow notes:
- The bank statement PDFs are scanned from ../documents/bank/bank_01 and embedded into a (temporary) vector store under ../data/ ("vectorstore_bank_chroma" or "vectorstore_bank_faiss"); the store folder is emptied before each run.
- Two branches run in parallel: a Chroma vector store queried with llama3:instruct and a FAISS vector store queried with mistral:instruct (the model name is set via the flow variable v_llm_model).
- The Python Script node activates the Conda environment (conda_knime_llama) based on the operating system (Windows or macOS) and sends the escaped prompt and instruction (escapedPromptData, escapedInstructionData) to the LLM.
- The JSON file is extracted from the model's answer by replacing the ``` code fences with a marker - regexReplace($Response$, "```", "~#") - and matching ~#([\s\S]*?)~# (Match 0: Group 1).
- A loop collects the model's answers; by default only files 1 - 3 are processed (remove this restriction if you want more files).
- Keep the first row, or use a sample file that ensures the structure of the columns, to make sure all rows have the same structure.
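To make sure every answer really has the same structure, the parsed JSON can be normalized against the fixed field list from the instruction. This is a sketch, not the workflow's implementation: the exact JSON key names are an assumption taken from the prompt, so adjust them to whatever your prompt enforces.

```python
import json

# Field names from the extraction instruction (assumed to match the JSON keys);
# the order matters so that downstream tables always have the same columns.
EXPECTED_FIELDS = [
    "Firma", "Depot", "Bank", "Transaktionsart",
    "ISIN / Kenn-Nr. der Aktie", "WKN", "Wertstellung (Valuta)",
    "Belegdatum", "WSL (Währung)", "FW-Kurs (Devisenkurs)",
    "Menge", "Kurs", "WSL Kurs (Währung)", "Kurswert",
    "WSL Kurswert (Währung)",
]

def validate_answer(raw_json: str) -> dict:
    """Parse a model answer and enforce the fixed field set.

    Missing fields are filled with "" (the instruction asks the model to
    leave them empty), unexpected fields are dropped, and the agreed
    order is restored so every row has the same structure.
    """
    data = json.loads(raw_json)
    return {field: data.get(field, "") for field in EXPECTED_FIELDS}

row = validate_answer('{"Firma": "ACME AG", "WKN": "123456", "extra": "ignored"}')
print(list(row)[:3])  # ['Firma', 'Depot', 'Bank']
```

Running such a check inside the loop catches answers where the model invented or renamed fields before they reach the collected table.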
