
04 Your First Data Pipeline - Solution


Introduction to Data Engineering

04 - Your First Data Pipeline

Learning objective: In this exercise you will practise extracting and transforming data from unstructured and semi-structured formats, updating records in a database, and logging the events.


Workflow description: This workflow showcases a simple data pipeline that:

  • extracts and transforms data from PDF and JSON formats

  • updates records in a database

  • handles errors and logs the events.


You will find the instructions for the exercises in the yellow annotations.

Activity I. Extract Textual Data from PDF Files

  1. Parse the PDFs and read the textual data in them using the Tika Parser node.

    • To specify the directory, use the KNIME protocol (hint: try knime://knime.workflow/../../data/Store_ID PDFs as the directory).

    • Select PDF file format.

    • Select only Content in the Metadata field.

    • In the output table, the column Content contains the text from the PDF documents.

  2. Connect the first output port to the input of the Extract License Status metanode, which cleans and transforms the relevant fields into a table structure (a minimal Python sketch of the extraction step follows below).
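
Outside KNIME, the same extraction could be sketched in Python. This uses the pypdf library as a stand-in for the Tika Parser node; the folder path mirrors the hint above, and the column names are assumptions:

    from pathlib import Path
    from pypdf import PdfReader

    rows = []
    for pdf_path in Path("../../data/Store_ID PDFs").glob("*.pdf"):
        reader = PdfReader(pdf_path)
        # Concatenate the text of all pages, like the Content column.
        content = "\n".join(page.extract_text() or "" for page in reader.pages)
        rows.append({"File": pdf_path.name, "Content": content})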


Activity II. Extract NoSQL Database Data

  1. Read in the documents extracted from the MongoDB database* and saved as JSON files.

    • With the JSON Reader node, use the workflow-relative path and import all the files in the ../../data/MongoDB documents folder. Use the "Files in folder" mode.

  2. Convert the JSON format to the KNIME table format with the JSON to Table node.

  3. Connect the output of the JSON to Table node to the Extract Validity metanode (a rough Python equivalent of steps 1 and 2 follows below).
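
A rough Python equivalent of the reading and flattening steps, assuming each file in the folder holds one JSON document (pandas' json_normalize plays the role of the JSON to Table node):

    import json
    from pathlib import Path

    import pandas as pd

    docs = []
    for json_path in Path("../../data/MongoDB documents").glob("*.json"):
        with open(json_path, encoding="utf-8") as f:
            docs.append(json.load(f))

    # Flatten nested JSON into one row per document, analogous to JSON to Table.
    table = pd.json_normalize(docs)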


*Note. To extract the same data from MongoDB directly, you would need the MongoDB Connector and MongoDB Reader nodes. You would provide the hostname and port of the instance, authenticate to connect, and then specify the database and collection to import. We do not provide access to a MongoDB database in this exercise.
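
Purely for illustration, such a direct connection from Python with pymongo would look roughly like this; the hostname, port, credentials, database, and collection names are placeholders, since no live database is provided:

    from pymongo import MongoClient

    # Placeholder connection details -- no live MongoDB instance is provided.
    client = MongoClient("mongodb://localhost:27017",
                         username="user", password="secret")
    collection = client["my_database"]["inspections"]

    # Fetch the same documents that the exercise ships as JSON files.
    docs = list(collection.find({}))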


Activity III. Update a Database Table

  1. Connect to a SQLite database with the SQLite Connector node.

    • Under Location, select "Path" and browse to "my_central_database.sqlite" in the data folder, or provide the relative path: knime://knime.workflow/../../data/my_central_database.sqlite

  2. Update the Renewal Status and License Validity of the stores in the "Inspection_Overall" table in the SQLite database with the DB Updater node.

    • Connect the output of the String to Date&Time node to the first input of the DB Updater node.

    • In the configuration dialog, apply the following settings:

      • Table: Inspection_Overall

      • Column for row identification: Store_ID

      • Columns to write: Renewal Status, License Validity

    • Execute the node, and the Inspection_Overall table will be updated based on the data at the input of the DB Updater node (a minimal sqlite3 sketch of this update follows below).
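
The DB Updater node effectively issues UPDATE statements keyed on the identification column. A minimal sketch of the same operation with Python's standard sqlite3 module; the rows list stands in for the node's input table, and its values are made up:

    import sqlite3

    # Example rows standing in for the DB Updater node's input table.
    rows = [
        {"store": "S001", "renewal": "Renewed", "validity": "2025-12-31"},
    ]

    conn = sqlite3.connect("../../data/my_central_database.sqlite")
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            'UPDATE "Inspection_Overall" '
            'SET "Renewal Status" = :renewal, "License Validity" = :validity '
            'WHERE "Store_ID" = :store',
            rows,
        )
    conn.close()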


Activity IV. Extract the Context Properties for Logging

  1. Use the Extract Context Properties node to create variables for the workflow name, path, and user. Include context.workflow.user, context.workflow.name, and context.workflow.path.

  2. Connect the Extract Context Properties node to the Try (Variable Ports) node.

  3. Connect the Try (Variable Ports) node to the SQLite Connector node (a plain-Python sketch of the gathered context follows below).
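
In plain Python, an analogous logging context could be gathered like this (a sketch only, not what the node does internally; the current working directory stands in for the workflow name and path):

    import getpass
    from pathlib import Path

    context = {
        "context.workflow.user": getpass.getuser(),
        "context.workflow.name": Path.cwd().name,  # stand-in for the workflow name
        "context.workflow.path": str(Path.cwd()),  # stand-in for the workflow path
    }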


Activity V. Logging the Events

  1. Create success and failure status messages for logging as string variables:

    1. In the top (success) branch of the Try-Catch construction:

      • Connect the DB Updater output variable port to the Variable Creator node.

      • Create a variable Status = Success.

      • Connect it to the upper port of the Catch Errors (Var Ports) node.

    2. In the bottom (failure) branch of the Try-Catch construction:

      • Connect the DB Updater output variable port to the Active Branch Inverter node and to a new Variable Creator node.

      • Create a variable Status = Failure.

      • Connect it to the bottom port of the Catch Errors (Var Ports) node.

  2. In the Catch Errors (Var Ports) node, enable "Always populate error variables".

  3. Open the Logging component to finalize the logging part (a Python try/except sketch of the whole construction follows below).
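
The Try-Catch construction is KNIME's counterpart to a try/except block. A hedged Python sketch of the same success/failure logging, where update_database and write_log are hypothetical helpers:

    def run_with_logging(update_database, write_log, context):
        """Run the update step and log Success or Failure, mirroring the
        Try (Variable Ports) / Catch Errors (Var Ports) pair."""
        try:
            update_database()        # the top (success) branch
            status, error = "Success", ""
        except Exception as exc:     # the bottom (failure) branch
            status, error = "Failure", str(exc)
        # "Always populate error variables": an (empty) error is logged either way.
        write_log({**context, "Status": status, "Error": error})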


[Workflow canvas: node labels only. The pipeline uses the Tika Parser, JSON Reader, JSON to Table, Joiner, String to Date&Time, SQLite Connector, DB Updater, Extract Context Properties, Try (Variable Ports), Variable Creator (success and failure messages), Active Branch Inverter, and Catch Errors (Var Ports) nodes, the Extract License Status and Extract Validity metanodes, and the Logging component. Annotation labels: Parsing PDF files, Transform MongoDB documents to a table, MongoDB documents, Data warehouse, Date conversion, Logging.]
