Merge Files - Unstructured Data
This workflow demonstrates how to merge data from different unstructured files and create a new, cleaned table. Data often comes from different sources and datasets usually have different table structures and are not always structured identically.
In this example, we access data from three different Excel files. The goal is to combine the information contained in the three datasets, however, each file has a different, unstructured format that might be an obstacle when analyzing the data:
Olympic_Athlete_Bio.xlsx: Contains personal information about Olympics athletes, however, for one athlete the nationality is missing (USA).
Olympic_Athlete_Event_Results.xlsx: Contains the athlete event results of the 1920-2020 Summer Olympic Games. The header on top needs to be excluded from the table.
Olympic_Games.xlsx: Contains additional information about the Summer Olympic Games, however, the empty columns and rows need to be cleaned.
We start by accessing each Excel file. Check the node monitor to observe how the data was imported into KNIME Analytics Platform. We proceed to fix the issues and bring the three files into a standardized format, so that we can then merge the tables. Lastly, we deal with missing values and then write the cleaned and transformed data table to a new Excel file.
For a detailed overview of each node in this workflow, refer to the workflow description in the Info panel.
💡 To view each node's configuration, select the node and see the configuration pane on the right side of the workflow editor.