Icon

Non-standard format Spreadsheets

<p><strong>Non-standard Spreadsheet Formats</strong></p><p>This workflow shows how to work with spreadsheets that come in a non-standard format using KNIME Analytics Platform.</p><p>The aim is to clean up and re-arrange a spreadsheet in a non-standard format and then apply a row aggregation. In this example, we access a dataset containing personal information about Olympia athletes and we want to find the total number of male and female athletes born in January 2000.</p><p>💡 <em>To view each node's configuration, select the node and see the configuration pane on the right side of the workflow editor.</em></p><p>Let's walk through the different nodes involved in this operation:</p><p><strong>Excel Reader node:</strong></p><ul><li><p>Since the folder with the data is already included when you download the workflow, in the "File and Sheet" tab, we choose to "Read from" the "Current workflow data area" and select the dataset.</p></li><li><p>In the "Data Area" tab, we select to read the "Whole sheet" and unflag to skip "empty rows". This configuration allows us to read the sheet as it is. The intent is to respect its original structure.</p></li></ul><p><strong>Row to Column Header node:</strong></p><p>The column headers of the input table are not placed in the second row.</p><ul><li><p>We select the second row of the input table as our new column header.</p></li><li><p>We also remove every row before the new column header by setting the appropriate checkbox "Discard rows before header row".</p></li></ul><p><strong>Table Splitter node:</strong></p><p>The input table has a separator row where all column values are equal to "-" (RowID = Row413). We can't use that row in the aggregation later, which is why we need to remove that row.</p><ul><li><p>We split the table at this column. We select an arbitrary column as the lookup column and choose the matching criteria as "Equals" and setting the search pattern to "-".</p></li><li><p>We also choose to <strong>not</strong> include the matching row in the top or the bottom output table. That way, we filter out the separator row.</p></li></ul><p><strong>Cell Extractor + Cell Updater nodes:</strong></p><p>The table contains a misplaced date value which is placed in Row5 instead of Row4.</p><ul><li><p><strong>Cell Extractor node:</strong> We extract the date value from this cell as a flow variable ("extracted_cell").</p></li><li><p><strong>Cell Updater node:</strong> We use the previously created flow variable to overwrite the missing date value in row 2 (RowID = Row 4).</p></li></ul><p><strong>Row Filter node:</strong></p><p>The table contains two empty rows which we want to remove (i.e., rows with all values missing): Row3 and Row5.</p><ul><li><p>We define two filter criteria. As column to filter we set "RowID", as operator we choose "Equals", and as value we set "Row3" and "Row5" respectively.</p></li><li><p><strong>Note: It's important to choose "Match row if matched by any criterion".</strong></p></li><li><p>We select to "output non-matching rows".</p></li></ul><p><strong>Concatenate node:</strong></p><ul><li><p>After cleaning the upper table, we concatenate the two tables back together.</p></li></ul><p><strong>Column Filter node:</strong></p><p>The table contains an empty column (i.e., a column with all values missing) called "Column3".</p><ul><li><p>In the Include/Exclude panel we move "Column3" to the "Exclude" side to remove it from the table.</p></li></ul><p><strong>Table Cropper node:</strong></p><p>We need to crop the table so that only athletes born in January 2000 remain, so that we can aggregate the data accordingly.</p><ul><li><p>Select the column range ("athlete_id" to "country_noc") and row range (row nr. 1 to row nr. 69).</p></li></ul><p><strong>Row Aggregator + Column Renamer nodes:</strong></p><p>We cleaned and re-arranged the spreadsheet. Now we can aggregate the data.</p><ul><li><p><strong>Row Aggregator node:</strong> We choose "Sex" as the category column and "Occurrence count" as the aggregation method.</p></li><li><p><strong>Column Renamer node:</strong> We rename the columns to give them more distinctive names.</p></li></ul><p><strong>Column Appender node:</strong></p><ul><li><p>We append the gender occurrence count table (output of the Row Aggregator + Column Renamer node) to the modified athletes' bio information sheet.</p></li><li><p>Since the two tables have different table specs, it's important to select "Generate new RowIDs".</p></li></ul><p><strong>Excel Writer node:</strong></p><ul><li><p>We append the dataset to a new sheet called "Sheet_1_modified" in the existing Excel file located in the workflow data area.</p></li><li><p>After executing the node, the file will open automatically.</p></li></ul><p>As you can see from the output, we now have the modified and cleaned data table together with the gender occurrence count table next to it.</p>

Clean up and re-arrange the non-standard spreadsheet

Non-standard Spreadsheet Formats


This workflow demonstrates how to work with spreadsheets that come in a non-standard format. The aim is to clean and re-arrange a spreadsheet and then apply a row aggregation.

In this example, we access a dataset containing personal information about Olympia athletes. We want to find the total number of male and female athletes born in January 2000. However, in order to be able to apply a row aggregation, we first perform the following cleaning and manipulation steps:

  • Fix missing column header

  • Fix misplaced date value

  • Remove empty rows

  • Remove empty column

  • Crop table to desired data range (athletes born in 2000)

After the dataset is in a standard table format and the desired data range is selected, we count the number of male and female athletes. We rename the columns of the resulting aggregation table and add it back to the cleaned dataset using the Column Appender node. Lastly, we write the resulting data table to a new sheet in the existing Excel file.

For a detailed overview of each node in this workflow, refer to the workflow description in the Info panel.


💡 To view each node's configuration, select the node and see the configuration pane on the right side of the workflow editor.

Data access

Read file located in workflow data area

Fix misplaced date value

Remove empty rows

Crop table to athletes born in January 2000

Remove empty columns

Data aggregation

Count the number of male and female athletes

Data export

Save resulting table as new file

Fix column header

Read bio informationof athletes(non-standard format)
Excel Reader
Split table by the separator row
Table Splitter
Concatenate the transformed tables
Concatenate
Extract date from misplaced cell (Row5)
Cell Extractor
Crop table to desired data range
Table Cropper
Remove empty rows(Row3 and Row5)
Row Filter
Set second rowas column header
Row to Column Names
Remove unnecessary columns
Column Filter
Count number of males and females
Row Aggregator
Place extracted cell content in correct cell (Row4)
Cell Updater
Append new sheetto dataset:"Sheet_1_modified"
Excel Writer
Column Appender
Column Renamer

Nodes

Extensions

Links