Icon

01_​Access_​data_​from_​NoSQL_​DB_​and_​PDFs - Solution

Activity I. Extract Textual Data from PDF Files1. Locate and list the PDF files with the List Files/Folders node- Use a workflow relative path- In the Folder section browse to the "data" folder and inside it select the"Store_ID PDFs" folder that contains all the PDFs- Execute the node and the Path column will be available at the output2. Convert the column Path to URI, so that the PDF can be accessed with aURI in the downstream nodes. Use a Path to URI node.- In the Path to URI node, specify the Input Column as Path- Execute the node and the URI column will be available at output3. Parse the PDFs and read the textual data using the Tika Parser URLInput node- In the Configuration dialog, select the "URI" as File path column- The Tika Parser URL Input node lets you decide which file extension tobe parsed. Include the "PDF" file type in this section- Execute the node. The first output port will contain all the fields extractedfrom these PDFs- Connect the first output port to the input of the metanode Extract LicenseStatus which cleans and transforms the relevant fields into table structure Session 4 - Introduction to Data EngineeringExercise 01 Access Data from NoSQL Database and PDF Files Learning objective: In this exercise you will practice the extraction of data from NoSQL database and unstructured data sources like PDF files. Theextracted data will be used to update the database tableWorkflow description: This workflow showcases how to access data from a NoSQL database and PDF files. The relevant fields are extracted fromthese files to transform the data in a tabular format, which is then used to update the database table. Find below the three major tasks performed in thisworkflow- The PDF files consisting of the renewal status of the business license are read from their respective locations. The textual data from these PDF files isextracted, then transformed to a standard table structure.- The validity of the renewed license is available in a NoSQL database - MongoDB. Connect to a MongoDB and read the semi-structured data from theMongoDB collection. We transform the extracted data to a standard table structure. The tables created from both data sources are joined to form onefinal table.- The "Inspection Overall" table in the database maintains the records for the stores and their license validity. This table is updated with the records fromthe final table created in the earlier stepYou will find the instructions to the exercises in the yellow annotations. Activity III. Update a Database Table1. Connect to a SQLite database with the SQLite Connector node- In the Connection Settings, select the Driver Name as "SQLite:[ID:SQLite]"- In the Location, select "Path" and Browse to the"my_central_database.sqlite" in the data folder*2. Update the Renewal Status and License Validity of the stores in the"Inspection_Overall" table in the SQLite database with the DB Updatenode.- Connect the Output of the String to Date&Time node to the first input ofthe DB Update node- Connect the Output from the SQLite Connector node to the second inputof the DB Update node- In the Configuration dialog, select the following settingsTable : Inspection_OverallSelect Columns to Update: Renewal Status, License ValiditySelect Identification Column: Store_ID- Execute the node and the Inspection Overall table will be updated basedon the data in the input of the DB Update node* If you cannot locate the "my_central_database.sqlite" in your data folder,simply paste this path "knime://knime.workflow/../../../data/my_central_database.sqlite" Activity II. Extract Data from NoSQL Database1. Connect to a MongoDB database with the MongoDB Connector node.In the configuration dialog- Select the Connection Type as "SRV"- In the Instances, Hostname as "cluster0.p0ag5ht.mongodb.net" and leavethe Port as blank- Enter the Authentication database name as "L2-DE_Session4"- For the Authentication, use the credentials - username and passwordgiven in the course slides2. Access the MongoDB collection using the MongoDB Reader node.- Connect the output of the MongoDB Connector node to the Input of theMongoDB Reader node.- In the configuration window, enter the following detailsDatabase : L2-DE_Session4Collection : License_Renewal- Execute the node and the output contains the data in JSON Format3. Convert the JSON format to the KNIME table format with the JSON toTable node4. Connect the output of the JSON to Table node to the Extract Validitymetanode Read Licenserenewal PDFsNoSQL DatabaseRead MongoDB CollectionTransform MongoDB Documentsto TableDate Conversiondata Warehouse List Files/Folders Path to URI Tika ParserURL Input Extract LicenseStatus MongoDB Connector MongoDB Reader JSON to Table Joiner String to Date&Time SQLite Connector DB Update Extract Validity Activity I. Extract Textual Data from PDF Files1. Locate and list the PDF files with the List Files/Folders node- Use a workflow relative path- In the Folder section browse to the "data" folder and inside it select the"Store_ID PDFs" folder that contains all the PDFs- Execute the node and the Path column will be available at the output2. Convert the column Path to URI, so that the PDF can be accessed with aURI in the downstream nodes. Use a Path to URI node.- In the Path to URI node, specify the Input Column as Path- Execute the node and the URI column will be available at output3. Parse the PDFs and read the textual data using the Tika Parser URLInput node- In the Configuration dialog, select the "URI" as File path column- The Tika Parser URL Input node lets you decide which file extension tobe parsed. Include the "PDF" file type in this section- Execute the node. The first output port will contain all the fields extractedfrom these PDFs- Connect the first output port to the input of the metanode Extract LicenseStatus which cleans and transforms the relevant fields into table structure Session 4 - Introduction to Data EngineeringExercise 01 Access Data from NoSQL Database and PDF Files Learning objective: In this exercise you will practice the extraction of data from NoSQL database and unstructured data sources like PDF files. Theextracted data will be used to update the database tableWorkflow description: This workflow showcases how to access data from a NoSQL database and PDF files. The relevant fields are extracted fromthese files to transform the data in a tabular format, which is then used to update the database table. Find below the three major tasks performed in thisworkflow- The PDF files consisting of the renewal status of the business license are read from their respective locations. The textual data from these PDF files isextracted, then transformed to a standard table structure.- The validity of the renewed license is available in a NoSQL database - MongoDB. Connect to a MongoDB and read the semi-structured data from theMongoDB collection. We transform the extracted data to a standard table structure. The tables created from both data sources are joined to form onefinal table.- The "Inspection Overall" table in the database maintains the records for the stores and their license validity. This table is updated with the records fromthe final table created in the earlier stepYou will find the instructions to the exercises in the yellow annotations. Activity III. Update a Database Table1. Connect to a SQLite database with the SQLite Connector node- In the Connection Settings, select the Driver Name as "SQLite:[ID:SQLite]"- In the Location, select "Path" and Browse to the"my_central_database.sqlite" in the data folder*2. Update the Renewal Status and License Validity of the stores in the"Inspection_Overall" table in the SQLite database with the DB Updatenode.- Connect the Output of the String to Date&Time node to the first input ofthe DB Update node- Connect the Output from the SQLite Connector node to the second inputof the DB Update node- In the Configuration dialog, select the following settingsTable : Inspection_OverallSelect Columns to Update: Renewal Status, License ValiditySelect Identification Column: Store_ID- Execute the node and the Inspection Overall table will be updated basedon the data in the input of the DB Update node* If you cannot locate the "my_central_database.sqlite" in your data folder,simply paste this path "knime://knime.workflow/../../../data/my_central_database.sqlite" Activity II. Extract Data from NoSQL Database1. Connect to a MongoDB database with the MongoDB Connector node.In the configuration dialog- Select the Connection Type as "SRV"- In the Instances, Hostname as "cluster0.p0ag5ht.mongodb.net" and leavethe Port as blank- Enter the Authentication database name as "L2-DE_Session4"- For the Authentication, use the credentials - username and passwordgiven in the course slides2. Access the MongoDB collection using the MongoDB Reader node.- Connect the output of the MongoDB Connector node to the Input of theMongoDB Reader node.- In the configuration window, enter the following detailsDatabase : L2-DE_Session4Collection : License_Renewal- Execute the node and the output contains the data in JSON Format3. Convert the JSON format to the KNIME table format with the JSON toTable node4. Connect the output of the JSON to Table node to the Extract Validitymetanode Read Licenserenewal PDFsNoSQL DatabaseRead MongoDB CollectionTransform MongoDB Documentsto TableDate Conversiondata Warehouse List Files/Folders Path to URI Tika ParserURL Input Extract LicenseStatus MongoDB Connector MongoDB Reader JSON to Table Joiner String to Date&Time SQLite Connector DB Update Extract Validity

Nodes

Extensions

Links