Icon

Preparing Your Office Equipment Data_​PV Solution

<p><strong>Challenge 29: Preparing Your Office Equipment Data</strong></p><p><br><strong>Level:</strong> Hard<br><br><strong>Description:</strong> After assessing the quality of some scraped product and review data, your company decides to move forward with the office upgrade project. However, before any purchase decisions can be made, the data needs some serious cleaning and wrangling. The team wants to compare prices, identify the best products by category, and understand where the product catalog might be lacking — yet the current data is messy, inconsistent, and full of oddities. Your task is to transform this raw data into an analysis-ready format: you will clean and enrich both product and review datasets, engineer meaningful features, detect and remove outliers, and even use LLMs to help normalize product categories from titles against a curated list. You should end up with a polished dataset that a merchandising or analytics team could use to explore pricing trends, visualize category gaps, and confidently prepare for the big office upgrade.<br><br><em>Beginner-friendly objective(s):</em> 1. Load the two Excel sheets (product details and product reviews) and perform an initial cleaning: extract product IDs (ASINs) from URLs, convert price and ratings to numeric format, and normalize category labels. 2. Parse review metadata: split country and date from a combined field, convert dates to a proper <em>Date</em> type, derive year/month information and compute per-review text length.<br><br><em>Intermediate-friendly objective(s):</em> 1. Remove extreme text-length outliers. 2. Drop rows with missing price, fill missing brand values, and prepare a tidy table ready for analysis and visualization. 3. Visualize price by category and remove within-category price outliers using IQR-based rules; compute domain statistics and rank products by rating, rating count, and price. 4. Identify under-represented categories and assemble a categories list (as a flow variable) to guide reclassification of ambiguous items.<br><br><em>Advanced objective(s):</em> 1. Build LLM prompts from product titles and your categories list; generate a single best-fit category response per item. 2. Replace uncertain or targeted categories with the LLM’s output and recombine with the rest of the catalog to deliver a coherent, reclassified dataset for downstream reporting.</p>
Handle missing values, rename and sort

The products with missing price are removed. If the brand is missing, it is replaced with the value N/A. These nodes also rename the columns and sort the rows by rating and price.

Extract values from the Product table

The nodes in this section process the product table to to extract some values from the strings, such as the product number, price, rating, etc..

Review Date and Country

The Expression node parses the Review Date and Loc column to extract the date and location of each review and store them in single columns. The Review Date is then processed to obtain a string with the yyyy-mm format.

Review Score and Review Length

The Expression node contains the expression to extract the Review Score from the string. Additionally, it creates a new column containing the review length.

Outliers handling

These nodes remove the rows with outlier values in the Price and Review Length columns respectively,

Data Quality

Calculate data quality measures and visualize heatmaps.

Challenge 29: Preparing Your Office Equipment Data

Level: Hard

Description: After assessing the quality of some scraped product and review data, your company decides to move forward with the office upgrade project. However, before any purchase decisions can be made, the data needs some serious cleaning and wrangling. The team wants to compare prices, identify the best products by category, and understand where the product catalog might be lacking — yet the current data is messy, inconsistent, and full of oddities. Your task is to transform this raw data into an analysis-ready format: you will clean and enrich both product and review datasets, engineer meaningful features, detect and remove outliers, and even use LLMs to help normalize product categories from titles against a curated list. You should end up with a polished dataset that a merchandising or analytics team could use to explore pricing trends, visualize category gaps, and confidently prepare for the big office upgrade.

Beginner-friendly objective(s): 1. Load the two Excel sheets (product details and product reviews) and perform an initial cleaning: extract product IDs (ASINs) from URLs, convert price and ratings to numeric format, and normalize category labels. 2. Parse review metadata: split country and date from a combined field, convert dates to a proper Date type, derive year/month information and compute per-review text length.

Intermediate-friendly objective(s): 1. Remove extreme text-length outliers. 2. Drop rows with missing price, fill missing brand values, and prepare a tidy table ready for analysis and visualization. 3. Visualize price by category and remove within-category price outliers using IQR-based rules; compute domain statistics and rank products by rating, rating count, and price. 4. Identify under-represented categories and assemble a categories list (as a flow variable) to guide reclassification of ambiguous items.

Advanced objective(s): 1. Build LLM prompts from product titles and your categories list; generate a single best-fit category response per item. 2. Replace uncertain or targeted categories with the LLM’s output and recombine with the rest of the catalog to deliver a coherent, reclassified dataset for downstream reporting.

Challenge 29: Preparing Your Office Equipment Data

product details
Excel Reader
Sort by ratingand price
Sorter
Extract ASINfrom Link
String Replacer
Column Renamer
Parse Price
Expression
Extractyear and month
Date&Time Part Extractor
Regex extract:Review CountryReview Date
Expression
Price by category
Box Plot
Regex replace: Category,Rating, Global_Ratings
Expression
Remove outliersin Review Length
Numeric Outliers
Concatenateyear and month
Column Aggregator
Needs OpenAI API key
Use GenAI for invalid categories
Data Quality Profile
Remove rows with missing priceFill missing Brand with N/A
Missing Value
product reviews
Excel Reader
Domain Calculator
Data Quality + Heatmaps
Review ScoreReview Length
Expression
Remove outliersin Price
Numeric Outliers
String to Date&Time

Nodes

Extensions

Links