Parse the PDFs and read the textual data in them using the Tika Parser node.
To specify the directory, you need to use the KNIME Protocol (hint: try knime://knime.workflow/../../data/Store_ID PDFs as the directory)
Select PDF and any other file formats of interest.
In the output table, the column Content contains the text from the PDF documents.
Extract the license status with the Extract License Status metanode. It outputs a table with Store_ID and Renewal Status (approved or rejected).
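The internals of the Extract License Status metanode are not shown here; conceptually it pulls the decision out of the parsed text. A minimal Python sketch of that idea, assuming (hypothetically) that each PDF contains a line like "Renewal Status: Approved":

```python
import re

def extract_status(content):
    """Return 'approved' or 'rejected' if a renewal decision is found in the text."""
    match = re.search(r"renewal\s+status\s*:\s*(approved|rejected)",
                      content, re.IGNORECASE)
    return match.group(1).lower() if match else None

# Hypothetical document text, standing in for the Content column
text = "Store_ID: 1021\nRenewal Status: Approved\n"
print(extract_status(text))  # approved
```

The actual metanode may use different node logic (e.g. String Manipulation or Regex Split); this only illustrates the extraction pattern.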
Connect to the MongoDB database with the MongoDB Connector node. Details of the connection settings are given in the first exercise of Session 4.
Access the MongoDB collection using the MongoDB Reader node. Connect the output of the MongoDB Connector node to the input of the MongoDB Reader node. Select Database: L2-DE_Session4 and Collection: License_Renewal. The output is in JSON format.
Convert the JSON format to KNIME table format with the JSON to Table node.
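What JSON to Table does is flatten each JSON document into one table row, with nested fields becoming dotted column names. A small stdlib sketch of the same flattening, using a hypothetical document shape for the License_Renewal collection:

```python
import json

# Hypothetical document shape; the real collection's fields may differ
doc = json.loads('{"Store_ID": "1021", "License": {"Validity": "2024-12-31"}}')

def flatten(d, parent=""):
    """Flatten nested JSON into dotted column names, one row per document."""
    row = {}
    for key, value in d.items():
        name = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            row.update(flatten(value, name))
        else:
            row[name] = value
    return row

print(flatten(doc))  # {'Store_ID': '1021', 'License.Validity': '2024-12-31'}
```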
Extract the license validity date with the Extract Validity metanode. The output table contains Store_ID and License Validity.
Join the tables with the license status and the license validity with the Joiner node, with Store_ID as the join column.
Convert the string column License Validity to Date&Time data with the String to Date&Time node.
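The Joiner and String to Date&Time steps together match rows on Store_ID and then parse the date string. A minimal Python sketch of that combination, with made-up sample rows and an assumed yyyy-MM-dd date format:

```python
from datetime import datetime

# Sample rows standing in for the two input tables (values are hypothetical)
status_rows = [{"Store_ID": "1021", "Renewal Status": "approved"}]
validity_rows = [{"Store_ID": "1021", "License Validity": "2024-12-31"}]

# Inner join on Store_ID, like the Joiner node
validity_by_id = {r["Store_ID"]: r for r in validity_rows}
joined = []
for row in status_rows:
    match = validity_by_id.get(row["Store_ID"])
    if match:
        merged = {**row, **match}
        # String to Date&Time: parse the validity string into a date value
        merged["License Validity"] = datetime.strptime(
            merged["License Validity"], "%Y-%m-%d").date()
        joined.append(merged)
```

In KNIME the date format string is configured in the String to Date&Time dialog; pick the pattern that matches how the dates actually appear in the data.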
Connect to the SQLite Database with the SQLite Connector node. The database file my_central_database.sqlite, located under the data folder, should be specified under Path using the KNIME Protocol*.
Update the Renewal Status and License Validity in the Inspection_Overall table in the SQLite database with the DB Updater node. Select Store_ID as the identification column and Renewal Status & License Validity as the columns to write.
* You can specify the database file by knime://knime.workflow/../../data/my_central_database.sqlite
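Under the hood, the DB Updater issues an UPDATE statement per row, keyed on the identification column. A self-contained sketch with Python's sqlite3 module, using an in-memory database and assumed column names in place of my_central_database.sqlite:

```python
import sqlite3

# In-memory stand-in for my_central_database.sqlite (schema is assumed)
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE Inspection_Overall (
    Store_ID TEXT PRIMARY KEY,
    "Renewal Status" TEXT,
    "License Validity" TEXT)""")
con.execute("INSERT INTO Inspection_Overall (Store_ID) VALUES ('1021')")

# One parameterized UPDATE per row, keyed on the identification column
con.execute(
    'UPDATE Inspection_Overall SET "Renewal Status" = ?, "License Validity" = ? '
    "WHERE Store_ID = ?",
    ("approved", "2024-12-31", "1021"),
)
con.commit()
row = con.execute("SELECT * FROM Inspection_Overall").fetchone()
print(row)  # ('1021', 'approved', '2024-12-31')
```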
Use the Extract Context Properties node to read workflow-context-related properties, including the workflow name and the logged-in user. This information will be used for logging.
Connect the output of the Extract Context Properties node to the input of the Try (Variable Ports) node.
Select the table Orders_table from the SQLite database with the DB Table Selector node.
Join the Orders_table data with the Inspection_Overall table using the DB Joiner node. Use the column Store_ID as the joining column and "Left Outer Join" as the joining mode.
Calculate the number of orders for each store across different age groups with the DB Pivot node. In addition to Store_ID, include Renewal Status and License Validity under the Group tab. Use AgeGroup as the pivoting column, and aggregate OrderNumber by the method SUM.
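In SQL terms, the DB Pivot node turns the distinct AgeGroup values into columns, typically via one SUM(CASE WHEN ...) expression per pivot value. A runnable sketch against an in-memory SQLite table with made-up data and age-group labels:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Orders (Store_ID TEXT, AgeGroup TEXT, OrderNumber INTEGER)")
con.executemany("INSERT INTO Orders VALUES (?, ?, ?)", [
    ("1021", "18-30", 5),   # sample rows; real AgeGroup values may differ
    ("1021", "31-50", 3),
    ("1021", "18-30", 2),
])

# One aggregated column per AgeGroup value, grouped by Store_ID
rows = con.execute("""
    SELECT Store_ID,
           SUM(CASE WHEN AgeGroup = '18-30' THEN OrderNumber ELSE 0 END) AS "18-30",
           SUM(CASE WHEN AgeGroup = '31-50' THEN OrderNumber ELSE 0 END) AS "31-50"
    FROM Orders
    GROUP BY Store_ID
""").fetchall()
print(rows)  # [('1021', 7, 3)]
```

The DB Pivot node generates the equivalent query for you (the group columns Renewal Status and License Validity would simply be added to the SELECT and GROUP BY lists).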
Read the processed data into KNIME Analytics Platform with the DB Reader node.
Treat missing values in the data table with the Replace Missing Values metanode.
Write the treated table as a CSV file Store_Report.csv under the current workflow data area.
When the execution is a success: Create a success status message with the Variable Creator node. A string flow variable Status carries the status Success.
When the execution fails:
The Active Branch Inverter node activates the alternate branch of the workflow on failure.
The alternate branch includes the Variable Creator node, producing a string flow variable Status carrying the status Failed.
Close the Try-Catch construct: the flow variable connections from both the success and failure branches terminate at the Catch Errors (Var Ports) node.
Log events with the Logging Component component. The context properties, the execution status, and the execution time stamp are written to a log file Logger.xlsx under the current workflow data area.
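The Try/Catch-plus-logging pattern maps directly onto ordinary exception handling: whichever branch runs sets the status, and a timestamped row is appended to the log. A minimal Python sketch (writing CSV in memory rather than Logger.xlsx, and with a placeholder pipeline function):

```python
import csv
import io
from datetime import datetime

def run_pipeline():
    """Placeholder for the main workflow branch."""
    return True

# Try / Catch Errors pattern: status reflects whichever branch executed
try:
    run_pipeline()
    status = "Success"
except Exception:
    status = "Failed"

# Append a log row: context properties, status, timestamp
log = io.StringIO()
writer = csv.writer(log)
writer.writerow(["workflow_name", "user", status, datetime.now().isoformat()])
```

In the actual workflow, the Logging Component appends the same fields (context properties, Status, timestamp) to Logger.xlsx in the workflow data area.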
Learning objectives:
Extract structured, unstructured, and semi-structured data from different sources
Use different ways to load data into a database
Perform in-database processing of data
Log events to a log file
Workflow description: A series of steps in this workflow outlines some basic concepts of data engineering:
Data is extracted from a collection of PDF documents (Step 1)
Semi-structured data is loaded from a MongoDB database (Step 2)
A database is updated with the information from the other sources (Step 3)
Data is processed in the database (Step 5)
Events are logged during the execution of the workflow (Steps 6 & 7)
To use this workflow in KNIME, download it from the below URL and open it in KNIME: