
Session 4

Step 1. Extract Textual Data from PDF Files

  1. Parse the PDFs and read the textual data in them using the Tika Parser node.

    • To specify the directory, you need to use the KNIME Protocol (hint: try knime://knime.workflow/../../data/Store_ID PDFs as the directory)

    • Select PDF and any other file formats of interest.

    • In the output table, the column Content contains the text from the PDF documents.

  2. Extract the license status with the Extract License Status metanode. It outputs a table with the columns Store_ID and Renewal Status (approved or rejected).
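
For reference, the same extraction can be scripted outside KNIME. Below is a minimal Python sketch using the tika package (a wrapper around the same Apache Tika parser the node uses) together with a regular expression for the renewal status; the directory layout, the Store_ID-as-file-name convention, and the exact status wording are assumptions, not part of the exercise.

    import re
    from pathlib import Path

    from tika import parser  # pip install tika; requires a local Java runtime

    PDF_DIR = Path("data/Store_ID PDFs")  # assumed location of the PDF collection

    rows = []
    for pdf in PDF_DIR.glob("*.pdf"):
        parsed = parser.from_file(str(pdf))     # returns metadata and extracted text
        content = parsed.get("content") or ""   # corresponds to the Content column
        # Assumed conventions: Store_ID is the file stem and the text contains
        # either "approved" or "rejected" somewhere in the renewal section.
        match = re.search(r"\b(approved|rejected)\b", content, re.IGNORECASE)
        rows.append({
            "Store_ID": pdf.stem,
            "Renewal Status": match.group(1).lower() if match else None,
        })

    print(rows[:3])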

Step 2. Extract Data from a NoSQL Database

  1. Connect to the MongoDB database with the MongoDB Connector node. The details of the connection settings are given in the first exercise of Session 4.

  2. Access the MongoDB collection using the MongoDB Reader node. Connect the output of the MongoDB Connector node to the input of the MongoDB Reader node. Select Database: L2-DE_Session4 and Collection: License_Renewal. The output is in JSON format.

  3. Convert the JSON output to KNIME table format with the JSON to Table node.

  4. Extract the license validity date with the Extract Validity metanode. The output table contains Store_ID and License Validity.
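
The equivalent read outside KNIME, sketched with pymongo and pandas: the connection URI is an assumption (use the settings from the Session 4 exercise instead), while the database and collection names come from the step above. pandas.json_normalize plays the role of the JSON to Table node by flattening each document into tabular columns.

    import pandas as pd
    from pymongo import MongoClient

    # Assumed local connection; replace with the URI from the exercise.
    client = MongoClient("mongodb://localhost:27017")
    collection = client["L2-DE_Session4"]["License_Renewal"]

    # Fetch all documents, dropping MongoDB's internal _id field.
    docs = list(collection.find({}, {"_id": 0}))

    # Flatten the (possibly nested) JSON documents into a table.
    table = pd.json_normalize(docs)
    print(table.head())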

Step 3. Update the Database Table

  1. Join the license status table with the license validity table using the Joiner node, with Store_ID as the join column.

  2. Convert the string column License Validity to Date&Time data with the String to Date&Time node.

  3. Connect to the SQLite Database with the SQLite Connector node. The database file my_central_database.sqlite, located under the data folder, should be specified under Path using the KNIME Protocol*.

  4. Update the Renewal Status and License Validity in the Inspection_Overall table in the SQLite database with the DB Updater node. Select Store_ID as the identification column and Renewal Status & License Validity as the columns to write.


* You can specify the database file by knime://knime.workflow/../../data/my_central_database.sqlite
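
The DB Updater step corresponds to a keyed SQL UPDATE. A minimal sketch with Python's built-in sqlite3 module, assuming the joined table from Step 3.1 is available as a list of row dictionaries; the table and column names follow the steps above, while the sample row is invented for illustration.

    import sqlite3

    # Assumed shape of the joined table: one row of new values per store.
    rows = [
        {"store": "S001", "status": "approved", "validity": "2025-12-31"},
    ]

    con = sqlite3.connect("data/my_central_database.sqlite")
    with con:  # commits on success, rolls back on error
        con.executemany(
            """UPDATE Inspection_Overall
               SET "Renewal Status" = :status, "License Validity" = :validity
               WHERE Store_ID = :store""",
            rows,
        )
    con.close()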


Step 4. Extract the Context Properties for Logging

  1. Use the Extract Context Properties node to read workflow-context-related properties, including the workflow name and the logged-in user. This information will be used for logging.

  2. Connect the output of the Extract Context Properties node to the input of the Try (Variable Ports) node.
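
There is no direct library equivalent of the Extract Context Properties node; a rough stand-in that collects the same information for logging could look like this. The property names mirror KNIME's, and the workflow name is a hypothetical value you would supply yourself.

    import getpass
    from datetime import datetime

    # Assumed stand-ins for the KNIME context properties used in Step 7.
    context = {
        "context.workflow.name": "L2-DE_Session4_Demo",  # hypothetical name
        "context.workflow.user": getpass.getuser(),      # logged-in user
        "context.workflow.executed": datetime.now().isoformat(timespec="seconds"),
    }
    print(context)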

Step 5. Perform In-Database Processing

  1. Select the table Orders_table from the SQLite database with the DB Table Selector node.

  2. Join the Orders_table data with the Inspection_Overall table using the DB Joiner node. Use the column Store_ID as the joining column and "Left Outer Join" as the joining mode.

  3. Calculate the number of orders for each store across different age groups with the DB Pivot node. In addition to Store_ID, include Renewal Status and License Validity under the Group tab. Use AgeGroup as the pivoting column, and aggregate OrderNumber by the method SUM.

  4. Read the processed data into KNIME Analytics Platform with the DB Reader node.
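
In plain SQL, the DB Joiner and DB Pivot steps reduce to a LEFT OUTER JOIN followed by conditional aggregation, which is essentially what the nodes generate behind the scenes. A sketch against the same SQLite file; the AgeGroup labels are assumptions chosen for illustration.

    import sqlite3

    con = sqlite3.connect("data/my_central_database.sqlite")

    # LEFT OUTER JOIN on Store_ID, then pivot AgeGroup into columns by
    # summing OrderNumber per group. Adapt the labels to the actual data.
    query = """
        SELECT o.Store_ID,
               i."Renewal Status",
               i."License Validity",
               SUM(CASE WHEN o.AgeGroup = '18-30' THEN o.OrderNumber END) AS "18-30",
               SUM(CASE WHEN o.AgeGroup = '31-50' THEN o.OrderNumber END) AS "31-50",
               SUM(CASE WHEN o.AgeGroup = '51+'   THEN o.OrderNumber END) AS "51+"
        FROM Orders_table AS o
        LEFT OUTER JOIN Inspection_Overall AS i ON o.Store_ID = i.Store_ID
        GROUP BY o.Store_ID, i."Renewal Status", i."License Validity"
    """
    for row in con.execute(query):
        print(row)  # the DB Reader node pulls this result into KNIME
    con.close()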

Step 6. Export the Results

  1. Treat missing values in the data table with the Replace Missing Values metanode.

  2. Write the treated table to a CSV file Store_Report.csv under the current workflow data area.
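
In pandas, these last two steps are a fillna followed by to_csv; the replacement value (0, on the assumption that a missing pivot cell means no orders) and the sample frame are illustrative.

    import pandas as pd

    # Stand-in for the pivoted result of Step 5, with one missing cell.
    table = pd.DataFrame({"Store_ID": ["S001", "S002"], "18-30": [12, None]})

    # Assumption: missing pivot cells mean zero orders.
    table = table.fillna(0)
    table.to_csv("Store_Report.csv", index=False)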

Step 7. Log the Events

  1. When the execution succeeds: create a success status message with the Variable Creator node. A string flow variable Status carries the value Success.

  2. When the execution fails:

    • The Active Branch Inverter node activates the alternate branch of the workflow on failure.

    • The alternate branch includes a Variable Creator node, producing a string flow variable Status carrying the value Failed.

  3. Close the Try-Catch construct. The flow variable connections from both the success and failure branches terminate at the Catch Errors (Var Ports) node.

  4. Log events with the Logging component. The context properties, the execution status, and the execution time stamp are written to a log file Logger.xlsx under the current workflow data area.
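
The Try-Catch construct maps onto an ordinary try/except, with one status row appended to the log file afterwards. A sketch using openpyxl; the column layout of Logger.xlsx and the run_workflow placeholder are assumptions.

    import getpass
    from datetime import datetime
    from pathlib import Path

    from openpyxl import Workbook, load_workbook


    def run_workflow():
        """Placeholder for Steps 1-6; raises on any failure."""


    log_path = Path("Logger.xlsx")
    try:
        run_workflow()
        status = "Success"   # success branch (Variable Creator)
    except Exception:
        status = "Failed"    # failure branch (Active Branch Inverter)

    # Append one log row: workflow name, user, status, timestamp (assumed layout).
    wb = load_workbook(log_path) if log_path.exists() else Workbook()
    ws = wb.active
    ws.append(["L2-DE_Session4_Demo", getpass.getuser(),
               status, datetime.now().isoformat(timespec="seconds")])
    wb.save(log_path)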

L2-DE Demo Workflow for Session 4

Accessing Data from Various Sources, In-Database Processing, and Logging

Learning objectives:

  • Extract structured, unstructured, and semi-structured data from different sources

  • Use different ways to load data into a database

  • Perform in-database processing of data

  • Log events to a log file


Workflow description: A series of steps in this workflow outlines some basic concepts of data engineering:

  • Data is extracted from a collection of PDF documents (Step 1)

  • Semi-structured data is loaded from a MongoDB database (Step 2)

  • A database is updated with the information from the other sources (Step 3)

  • In-database processing is performed (Step 5)

  • Results are exported and events are logged during the execution of the workflow (Steps 6 & 7)

[Workflow diagram: annotated nodes for Steps 1-7, including Tika Parser, Extract License Status, MongoDB Connector, MongoDB Reader, JSON to Table, Extract Validity, Joiner, String to Date&Time, SQLite Connector, DB Updater, DB Table Selector, DB Joiner, DB Pivot, DB Reader, Replace Missing Values, CSV Writer, Extract Context Properties, Try (Variable Ports), Variable Creator, Active Branch Inverter, Catch Errors (Var Ports), and the Logging component.]
