Challenge 28 - Data Quality Matters - Can You Trust Your Office Equipment Data

Challenge 28: Can You Trust Your Office Equipment Data?

Level: Hard

Description: Your company is planning to upgrade its office setup with new chairs, desks, monitors, and other essentials. To help with the decision, someone scraped product details and reviews from amazon.com, but before trusting this data, you’ve been asked to assess its quality. Your task is to evaluate two datasets (product details and product reviews) and create a data quality profile for each, focusing on three key factors: completeness, uniqueness, and conformity. Identify missing, duplicate, or inconsistent information to determine how reliable the data truly is. You can optionally explore further insights, but the main goal is to uncover whether this scraped data is good enough to guide the company’s purchasing decisions.

Need help with data quality assessment? Check out Lesson 3 of our free self-paced [L4-DA] Data Analytics and Visualization: Specialization for guidance on measuring and visualizing data quality.

Beginner-friendly objective(s):
1. Load and preprocess the product data from Excel files.
2. Perform basic data exploration to understand the structure and key attributes of the dataset.

Intermediate-friendly objective(s):
1. Calculate key metrics like completeness, uniqueness, and conformity to assess data quality.

Advanced objective(s):
1. Create a comprehensive quality profile for the datasets.

Solution Summary: The solution to this challenge involves a comprehensive workflow that integrates multiple data science techniques to analyze product data. The workflow begins with data loading and preprocessing, followed by data transformation using unpivoting and grouping techniques. Key quality metrics such as completeness, uniqueness, and conformity are calculated to assess the data's integrity. Advanced expressions are used to derive additional insights, and the results are visualized using bar charts and heatmaps to provide a clear quality profile of the products. This solution demonstrates the power of KNIME in handling complex data analysis tasks through a combination of advanced nodes and configurations.

Solution Details: The workflow begins with the use of Excel Reader nodes to import product data from Excel files, specifically focusing on product details and reviews. The data is then preprocessed using nodes like Column List Loop Start and Row Filter to ensure it is clean and structured. Key metrics are calculated using GroupBy nodes, which aggregate data based on specific criteria, such as product category and review ratings. Advanced string manipulation is performed using nodes like String Manipulation and Cell Splitter to extract and transform text data. The workflow employs Unpivot nodes to reshape the data, making it suitable for further analysis. Expressions are used to calculate advanced metrics like completeness and uniqueness, which are crucial for assessing data quality. These metrics are visualized using Bar Chart and Heatmap nodes, providing a clear and interactive view of the data's quality profile. Throughout the workflow, nodes like Column Renamer and Column Resorter are used to ensure the data is organized and easy to interpret. The final output is a comprehensive quality profile of the products, visualized using ECharts and Table View nodes, which allow for interactive exploration and analysis of the data. This detailed workflow showcases the versatility and power of KNIME in tackling complex data science challenges.
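For readers who want to see the three quality metrics outside of KNIME, here is a minimal Python sketch of one way they could be computed per column. It is not the workflow's actual implementation. The metric definitions are assumptions (completeness as the share of non-missing values, uniqueness as the share of distinct values among the non-missing ones, conformity as the share of non-missing values matching an expected pattern), and the column names, sample rows, and regular expressions are hypothetical.

```python
import re

def is_missing(value):
    """Treat None and blank strings as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, str) and value.strip() == "")

def profile_column(rows, column, pattern=None):
    """Return completeness, uniqueness, and conformity for one column.

    rows    -- list of dicts, one per record
    column  -- name of the column to profile
    pattern -- optional regex that a conforming value must fully match
    """
    values = [row.get(column) for row in rows]
    present = [v for v in values if not is_missing(v)]

    # Completeness: share of records with a non-missing value.
    completeness = len(present) / len(values) if values else 0.0
    # Uniqueness: share of distinct values among the non-missing ones.
    uniqueness = len(set(present)) / len(present) if present else 0.0
    # Conformity: share of non-missing values that match the expected format.
    if pattern is None:
        conformity = None
    elif present:
        matches = sum(1 for v in present if re.fullmatch(pattern, str(v)))
        conformity = matches / len(present)
    else:
        conformity = 0.0

    return {
        "column": column,
        "completeness": completeness,
        "uniqueness": uniqueness,
        "conformity": conformity,
    }

# Hypothetical sample standing in for the scraped product details.
rows = [
    {"product_id": "B001", "price": "$129.99", "rating": "4.5"},
    {"product_id": "B002", "price": "", "rating": "4.0"},
    {"product_id": "B002", "price": "89", "rating": "five"},
    {"product_id": "B003", "price": "$45.00", "rating": None},
]

# Hypothetical format rules: prices like "$12.34", ratings like "4" or "4.5".
rules = {
    "product_id": None,
    "price": r"\$\d+\.\d{2}",
    "rating": r"[0-5](\.\d)?",
}

profile = [profile_column(rows, col, pat) for col, pat in rules.items()]
for entry in profile:
    print(entry)
```

The result is one record per column, which is roughly the long, per-attribute shape that unpivoting and grouping produce and that a bar chart or heatmap can consume directly.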
