Icon

Merge Files

<p><strong>Merge Files - Unstructured Data</strong></p><p>This workflow merges different unstructured data sources to create a new table.</p><p>Data often comes from different sources, and datasets usually have different table structures and are not always structured identically.</p><p>In this example, we access data from three different Excel files. The goal is to combine the information contained in the three datasets, however, each file has a different, unstructured format that might be an obstacle when analyzing the data:</p><ul><li><p><strong>Olympic_Athlete_Bio.xlsx:</strong> Contains an American athlete whose nationality is missing.</p></li><li><p><strong>Olympic_Athlete_Event_Results.xlsx:</strong> Contains a header on top that needs to be excluded from the table.</p></li><li><p><strong>Olympic_Games.xlsx:</strong> Contains empty columns and rows.</p></li></ul><p>💡 <em>To view each node's configuration, select the node and see the configuration pane on the right side of the workflow editor.</em></p><p>Let's walk through the different nodes involved in this operation:</p><p><strong>Excel Reader nodes:</strong></p><ul><li><p>Since the folder with the data is already included when you download the workflow, in the "File and Sheet" tab, we choose to "Read from" the "Current workflow data area" and select the dataset.</p></li></ul><p><strong>Variable Creator + Cell Updater nodes:</strong></p><p>To fix the missing value in the <em>Olympic_Athlete_Bio.xlsx</em> file:</p><ul><li><p><strong>Variable Creator node:</strong> Create a flow variable with the value to replace the missing value with.</p></li><li><p><strong>Cell Updater node: </strong>Specify the cell which should be replaced by the flow variable (Row 10 in column "country_noc").</p></li></ul><p><strong>Table Cropper nodes:</strong></p><ul><li><p>To exclude the empty header rows of the <em>Olympic_Athlete_Event_Results.xlsx</em> file; and</p></li><li><p>to exclude the empty columns and rows of the <em>Olympic_Games.xlsx</em> file.</p></li></ul><p><strong>Value Lookup nodes:</strong></p><ul><li><p>To add the gender and date of birth from the athletes' bio information sheet to the event results data; and</p></li><li><p>to add the start and end date from the Summer Olympic Games information sheet to the event results data.</p></li></ul><p><strong>Missing Value node:</strong></p><ul><li><p>To exclude rows that have missing values in the columns "sex", "born", "start_date", or "end_date".</p></li><li><p>In fact, after a Value Lookup operation it is very likely that some missing rows might have originated when a correspondent key in the lookup table was not found.</p></li></ul><p><strong>Excel Reader node:</strong></p><ul><li><p>We create a new file called "Merged_table" and save it in the workflow data area.</p></li><li><p>After executing the node, the file will open automatically.</p></li></ul>

Merge Files - Unstructured Data


This workflow demonstrates how to merge data from different unstructured files and create a new, cleaned table. Data often comes from different sources and datasets usually have different table structures and are not always structured identically.

In this example, we access data from three different Excel files. The goal is to combine the information contained in the three datasets, however, each file has a different, unstructured format that might be an obstacle when analyzing the data:

  • Olympic_Athlete_Bio.xlsx: Contains personal information about Olympics athletes, however, for one athlete the nationality is missing (USA).

  • Olympic_Athlete_Event_Results.xlsx: Contains the athlete event results of the 1920-2020 Summer Olympic Games. The header on top needs to be excluded from the table.

  • Olympic_Games.xlsx: Contains additional information about the Summer Olympic Games, however, the empty columns and rows need to be cleaned.

We start by accessing each Excel file. Check the node monitor to observe how the data was imported into KNIME Analytics Platform. We proceed to fix the issues and bring the three files into a standardized format, so that we can then merge the tables. Lastly, we deal with missing values and then write the cleaned and transformed data table to a new Excel file.

For a detailed overview of each node in this workflow, refer to the workflow description in the Info panel.


💡 To view each node's configuration, select the node and see the configuration pane on the right side of the workflow editor.

Data access

  1. Athletes' bio information

  2. Athlete event results data of Summer Olympic Games

  3. Summer Olympic Games information

Fix missing nationality

Missing value ("counry_noc") in Row 10

Remove header

Remove empty columns and rows

Data merging

  • Add gender and date of birth to athlete event results data

  • Add start and end date of each game to athlete event results data

Data cleaning

Remove rows where either gender, date of birth, start date, or end date is missing

Data export

Save resulting table as new file

Read bio informationof athletes
Excel Reader
Variable to fixmissing value
Variable Creator
Read athlete event results
Excel Reader
Add gender anddate of birth
Value Lookup
Unstructured Excel filetransformed into tabular format
Table Cropper
Save resulting table as new file: Merged_table.xlsx
Excel Writer
Add start and end date of game
Value Lookup
Exclude unmatched rows
Missing Value
Read Summer OlympicGames information
Excel Reader
Replace missing nationalitywith value of variable
Cell Updater
Unstructured Excel filetransformed into tabular format
Table Cropper

Nodes

Extensions

Links