
Exercise 3


Since KNIME is designed around data tables, loops are rarely needed. Before using a loop, make sure that a dedicated node for what you have in mind does not already exist!

Workflow: Chapter 7/Exercise 3


In this exercise, we implement a Table Row to Variable loop to remove duplicates and missing values from a dataset.

We have access to a dataset (wrong_sales_file.txt) containing sales records. The dataset is flawed: for some sales it contains two entries, an older record with a few missing values and a more recent record with all values correctly filled in. The column "load_date" indicates the date of record creation. The goal of this workflow is to clean the dataset by removing the duplicates and missing values. Hence, for each order number, we want to remove the older record and keep only the most recent one.

Note. There are easier ways to remove duplicates, for example, using a Duplicate Row Filter or GroupBy node. However, in this workflow we implement a Table Row to Variable loop to demonstrate how it works.
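The loop logic described above (iterate over the contract numbers, filter to the current one, sort by "load_date" descending, keep the most recent row) can be sketched in pandas. This is a minimal illustration, not the KNIME workflow itself; the column names `contract_nr` and `amount` and the toy data are assumptions for the example.

```python
import pandas as pd
from io import StringIO

# Toy stand-in for wrong_sales_file.txt (column names assumed)
csv = StringIO("""contract_nr,amount,load_date
1001,,2021-01-05
1001,250,2021-03-10
1002,300,2021-02-01
""")
sales = pd.read_csv(csv, parse_dates=["load_date"])

# Loop over the list of contract numbers
# (analogous to Table Row to Variable Loop Start)
cleaned_chunks = []
for nr in sales["contract_nr"].unique():
    # Filter to the current contract number (Row Filter)
    chunk = sales[sales["contract_nr"] == nr]
    # Sort by "load_date" in descending order (Sorter)
    chunk = chunk.sort_values("load_date", ascending=False)
    # Keep only the first row, i.e. the most recent load date (Row Filter)
    cleaned_chunks.append(chunk.head(1))

# Collect the per-contract results (Loop End)
cleaned = pd.concat(cleaned_chunks, ignore_index=True)
print(cleaned)
```

After the loop, `cleaned` contains exactly one row per contract number: the record with the most recent "load_date", so the older, incomplete entries are gone.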

Reading data

Sales file with duplicated records and missing values.

Data preprocessing

Loop body

Duplicate removal with a single node

Count occurrence of each contract number in the dataset

Count occurrence of each contract number in the dataset

Filter to current contract number
Row Filter
String to Date&Time
Sort by "load_date" in descending order
Sorter
Keep only "contract nr"
Column Filter
Loop End
Check on uniqueness of contract numbers
Value Counter
Loop over list of contract numbers
Table Row to Variable Loop Start
Sort by "load_date" in descending order
Sorter
Keep only first row: most recent load date
Row Filter
Check on uniqueness of contract numbers
Value Counter
wrong_sales_file.txt
CSV Reader
Remove duplicate rows; keep first
Duplicate Row Filter
Remove duplicate contract numbers
Duplicate Row Filter
Group by contract nr and use "First" for aggregation
GroupBy
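The single-node alternatives named above (Duplicate Row Filter, GroupBy with "First" aggregation) have direct pandas analogues, sketched here on assumed column names (`contract_nr`, `amount`) and toy data:

```python
import pandas as pd

# Toy data: contract 1001 has an older, incomplete record (column names assumed)
sales = pd.DataFrame({
    "contract_nr": [1001, 1001, 1002],
    "amount": [None, 250.0, 300.0],
    "load_date": pd.to_datetime(["2021-01-05", "2021-03-10", "2021-02-01"]),
})

# Duplicate Row Filter analogue: sort newest first,
# then keep the first row per contract number
dedup = (sales.sort_values("load_date", ascending=False)
              .drop_duplicates(subset="contract_nr", keep="first"))

# GroupBy analogue: group by contract number and take the "First" aggregate
# (pandas' groupby().first() returns the first non-null value per column)
grouped = (sales.sort_values("load_date", ascending=False)
                .groupby("contract_nr", as_index=False)
                .first())
```

Either one-liner replaces the whole loop body; the loop version in this exercise exists only to demonstrate how a Table Row to Variable loop works.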
