Non-standard format Spreadsheets

This workflow shows how to work with spreadsheets that come in a non-standard format using KNIME Analytics Platform.

The aim is to clean up and rearrange a spreadsheet in a non-standard format and then apply a row aggregation. We demonstrate this by finding the number of male and female athletes born in January 2020 using the Olympic Athlete Bio file.

Let's walk through the different nodes involved in this operation:

To view the configuration for each node, either Double-click or right-click > "configuration" to open the configuration dialogue.

Excel Reader node:

1 - In the "File and Sheet" tab, the file name and the sheet to read from are already set up (when you download the workflow, it's also included the folder with the data)

2 - Within the "Data Area" tab, we have selected read the "whole area" and unflagged skip "empty rows". This configuration allows us to have the sheet as it is. The intent is to respect its original structure.

Row to Column Header node:

We choose the second row of the input table as our new column header, and we also remove every row before the new column header by setting the appropriate checkbox.

Table Splitter node:

The input table has a separator row where ALL column values are equal to "-" (RowID = Row413).

We can't use that row in the aggregation at the end. That's why we split the table at this column by selecting an arbitrary column as a lookup column choosing the matching criteria as "Equals", and setting the search pattern to "-".
We also choose not to include the matching row in the top or the bottom output table.

Cell Extractor and Cell Updater node:

These nodes can be used to update specific cell values. We use them to update the cell value of a misplaced cell in the Top output table at "Row5" in the column "born".

Row Filter node:

We remove the unwanted row with RowID "Row5". By choosing "Exclude rows by row ID" and setting the Row ID pattern to "Row5".

Concatenate:

We concatenate the modified tables again.

Table Cropper node:

To get the athletes born in January 2020, we use the Table Cropper to get the desired row range and select the desired column range by name.

Column Filter node:

We remove unnecessary columns by manually excluding the column "Column3".

Row Aggregator | Column Renamer nodes:

We aggregate the genders by choosing the "Occurrence Count" as the aggregation method and setting the "sex" column as the category column. We then rename the resulting columns and give them appropriate names.

Column Appender node:

Before writing the results in the spreadsheet, we append the modified and cleaned-up table and the gender occurrence count table together by choosing the "Generate new row keys" option.

Excel Writer node:

In the last step, we append the modified input table to the Sheet "Sheet_1_modified".

As you can see from the output, we now have the modified and cleaned-up data table together with the gender occurrence count table next to it.

Non-standard format Spreadsheets

Nodes

Extensions

Links

Download