Icon

Merge Files

This workflow merges different unstructured data sources to come up with a new table.

Since data come from different data files, we need to blend together information from three sources.

All data have been stored in the workflow data area, which means they are saved within the workflow, the consequence is that the reader nodes don't fail because they are configured to read these files in the data area.

These excel files have different unstructured formats that could be an obstacle when analyzing the data.

The Olympic_Athlete_Bio has an American athlete whose nationality is missing. The cell updater nodes allow substituting this missing value with the one specified in a flow variable. We need to set the "coordinates" of the cell that must be updated, hence the column name/number and the table row.

The Olympic_Athlete_Event_Results has a header on top that needs to be excluded from the table. First, we use the option to specify which row should be used to specify the column names then we use the table cropper to exclude useless columns and the header (that has been placed like the first row by the excel reader node).

Olympic_Games file presents missing columns and empty rows when read. The table cropper node allows reshaping this table and to get rid of the. The user must specify the interval of columns and rows that he wants to keep.

1) First information to retrieve is the date of birth of the athlete. We used a Value Lookup node to merge this information to the first data source. The key used to merge these files is the "athlete_id".

2)The second Value Lookup adds the starting date of the Olympics edition, which is required to compute the athlete's age. The key used to merge these files is the "edition_id".

3) The missing value node excludes the unmatched rows, in fact, after a Value Lookup operation it is very likely that some missing rows might be originated when a correspondent key in the look-up table is not found.

4)Finally, the table is saved in the workflow data area in the excel format.

Nodes

Extensions

Links