Challenge 29 - Preparing Your Office Equipment Data Is the Key

Challenge 29: Preparing Your Office Equipment DataLevel: HardDescription: After assessing the quality of some scraped product and review data, your company decides to move forward with the office upgrade project. However, before any purchase decisions can be made, the data needs some serious cleaning and wrangling. The team wants to compare prices, identify the best products by category, and understand where the product catalog might be lacking — yet the current data is messy, inconsistent, and full of oddities. Your task is to transform this raw data into an analysis-ready format: you will clean and enrich both product and review datasets, engineer meaningful features, detect and remove outliers, and even use LLMs to help normalize product categories from titles against a curated list. You should end up with a polished dataset that a merchandising or analytics team could use to explore pricing trends, visualize category gaps, and confidently prepare for the big office upgrade. Beginner-friendly objective(s): 1. Load the two Excel sheets (product details and product reviews) and perform an initial cleaning: extract product IDs (ASINs) from URLs, convert price and ratings to numeric format, and normalize category labels. 2. Parse review metadata: split country and date from a combined field, convert dates to a proper Date type, derive year/month information and compute per-review text length. Intermediate-friendly objective(s): 1. Remove extreme text-length outliers. 2. Drop rows with missing price, fill missing brand values, and prepare a tidy table ready for analysis and visualization. 3. Visualize price by category and remove within-category price outliers using IQR-based rules; compute domain statistics and rank products by rating, rating count, and price. 4. Identify under-represented categories and assemble a categories list (as a flow variable) to guide reclassification of ambiguous items. Advanced objective(s): 1. Build LLM prompts from product titles and your categories list; generate a single best-fit category response per item. 2. Replace uncertain or targeted categories with the LLM’s output and recombine with the rest of the catalog to deliver a coherent, reclassified dataset for downstream reporting. Solution Summary: The workflow reads product and review data, cleans and normalizes key fields (price, ratings, categories), engineers date and text features from reviews, and removes outliers to stabilize metrics. It then identifies low-frequency categories, composes an instruction-constrained prompt from product titles and a curated list of categories, and calls an OpenAI model to output a single best-fit category. The LLM output replaces the original category where applicable, and the final dataset is ranked and visualized to support pricing and assortment insights. Solution Details: - Data ingestion and ID extraction: - Two Excel Readers import Products.xlsx: one for “product details,” one for “Product reviews,” using the first row as headers and skipping empty rows. - From product details, a String Replacer uses the regex https://www.amazon.com/dp/(.*) to extract the ASIN into a new ASIN column (later renamed Product ID). - Product attribute cleaning and normalization: - A Column Expressions step parses Price by removing “$” and converting to a float. - Another Column Expressions node normalizes Category labels: strips the “Gaming ” prefix, collapses “… Chairs/Desks” to “Chairs/Desks,” and maps “Monitor Arms” to “Monitor Stands.” It also parses Global_Ratings to integer (e.g., “1,234 ratings” → 1234) and Rating to double (e.g., “4.5 out of 5” → 4.5). - Review feature engineering: - In reviews, a Column Expressions step splits “Review Date and Loc” into Review Country and Review Date via regex patterns. - String to Date&Time converts Review Date from “MMMM d, yyyy” (en-GB) to a Date type; Date&Time Part Extractor derives Year and Month (Number). - Column Aggregator concatenates Year and Month into “Review year-month” and drops the original parts. - Another Column Expressions step converts “Review Score” to numeric and adds “Review Length” as the character count of Review Content. - Numeric Outliers (Tukey fences, IQR×3) removes rows with extreme Review Length values. - Data quality and exploration for products: - Domain Calculator computes category domain values and min/max for Price, Rating, and Global_Ratings. - Missing Value removes rows with missing Price and fills missing Brand with “N/A.” - Column Renamer standardizes final labels: Rating → Average Rating, Global_Ratings → Ratings count, ASIN → Product ID, Title → Summary, Price → Price $. - Sorter ranks by Average Rating (desc), Ratings count (desc), then Price $ (asc) to surface best-value items. - A Box Plot view visualizes Price $ by Category to support exploratory analysis. - Numeric Outliers removes within-category price outliers (IQR×1.5), ensuring robust pricing insights by Category. - Identifying sparse categories and preparing LLM prompts: - GroupBy calculates product counts per Category from the products table. - Row Splitter isolates categories with Count(Link) < 5 (low-frequency classes). - A second GroupBy concatenates these categories into a single comma-separated string; Table Row to Variable converts that cell into a flow variable (the categories list). - LLM-assisted reclassification: - A separate Row Splitter targets a subset of products (e.g., Category == “Accessories”) for reclassification. - Inside an LLM component: Credentials Configuration requests the OpenAI API key and passes it to an OpenAI Authenticator (base URL https://api.openai.com/v1). - An LLM Selector chooses the “gpt-4o” model with conservative generation settings (low temperature/top_p) to encourage deterministic, single-label outputs. - A String Manipulation step constructs a prompt per row: it instructs the model to “only mention the right category from the list,” then injects the product Title as “product summary” and the flow-variable categories list as “list of categories.” - LLM Prompter generates a one-word Category response and appends it as Response. - Consolidation: - Column Merger overwrites Category with Response when available (otherwise keeps the original Category). - Concatenate stacks the reclassified subset back together with the non-targeted rows, yielding a unified, cleaned, and category-corrected product dataset ready for downstream analytics and reporting.

Handle missing values, rename and sort

Extract values from the Product table

Review Date and Country

Review Score and Review Length

Outliers handling

Data Quality

Challenge 29 - Preparing Your Office Equipment Data Is the Key

Handle missing values, rename and sort

Extract values from the Product table

Review Date and Country

Review Score and Review Length

Outliers handling

Data Quality

Nodes

Extensions

Links

Download