Icon

JKISeason 4-29 - Preparing Your Office Equipment Data

<p><strong>Challenge 2</strong>9: <strong>Preparing Your Office Equipment Data</strong></p><p><strong>Level: </strong>Hard<br><strong><br>Description:</strong> After assessing the quality of some scraped product and review data, your company decides to move forward with the office upgrade project. However, before any purchase decisions can be made, the data needs some serious cleaning and wrangling. The team wants to compare prices, identify the best products by category, and understand where the product catalog might be lacking — yet the current data is messy, inconsistent, and full of oddities. Your task is to transform this raw data into an analysis-ready format: you will clean and enrich both product and review datasets, engineer meaningful features, detect and remove outliers, and even use LLMs to help normalize product categories from titles against a curated list. You should end up with a polished dataset that a merchandising or analytics team could use to explore pricing trends, visualize category gaps, and confidently prepare for the big office upgrade.<br><br><em>Beginner-friendly objective(s):</em> 1. Load the two Excel sheets (product details and product reviews) and perform an initial cleaning: extract product IDs (ASINs) from URLs, convert price and ratings to numeric format, and normalize category labels. 2. Parse review metadata: split country and date from a combined field, convert dates to a proper <em>Date</em> type, derive year/month information and compute per-review text length.<br><br><em>Intermediate-friendly objective(s):</em> 1. Remove extreme text-length outliers. 2. Drop rows with missing price, fill missing brand values, and prepare a tidy table ready for analysis and visualization. 3. Visualize price by category and remove within-category price outliers using IQR-based rules; compute domain statistics and rank products by rating, rating count, and price. 4. Identify under-represented categories and assemble a categories list (as a flow variable) to guide reclassification of ambiguous items.<br><br><em>Advanced objective(s):</em> 1. Build LLM prompts from product titles and your categories list; generate a single best-fit category response per item. 2. Replace uncertain or targeted categories with the LLM’s output and recombine with the rest of the catalog to deliver a coherent, reclassified dataset for downstream reporting.<br><br><strong>Author: Armin Ghassemi Rudd</strong><br><br><strong>Dataset:</strong> <strong>Office product data on KNIME Community Hub</strong><br><br>Remember to upload your solution with tag&nbsp;<strong>JKISeason4-29</strong>&nbsp;to your public space on KNIME Community Hub. To increase the visibility of your solution, also post it to&nbsp;<strong>this challenge thread</strong>&nbsp;on KNIME Forum.t it to&nbsp;<strong>this challenge thread</strong>&nbsp;on KNIME Forum.</p>

URL: Just KNIME It! https://www.knime.com/just-knime-it

Nodes

Extensions

Links