
Main Workflow

Use Case


Unit PADU, Kementerian Ekonomi Malaysia needs to identify vulnerable B40 households using live open government data. This workflow pulls two datasets from data.gov.my, cleans and transforms them, and trains a Random Forest model to classify households as B40 / M40 / T20. It then evaluates model accuracy and uses Google Gemini to generate a Bahasa Malaysia policy brief that flags high-risk income groups, with the prediction results exported to CSV.

Data: api.data.gov.my (HIES household income + LFS labour force)
ML Model: Random Forest Classifier
Agentic AI: Google Gemini 2.5 Flash (native KNIME nodes, no Python)

Phase 1: Data Collection


Pulls two live datasets from the data.gov.my Open API:

  • Node 1 — GET Request (Income): Calls api.data.gov.my/data-catalogue?id=hh_income. Returns mean and median monthly household income 1970–2022.
  • Node 2 — GET Request (Labour): Calls api.data.gov.my/data-catalogue?id=lfs_year. Returns annual unemployment and labour participation rates.
  • Nodes 3 & 4 — JSON Path (×2): Extracts the $.data[*] array from each JSON response into a KNIME table.
  • Node 5 — Joiner: Left outer join on the date column, merging income and labour data into one wide table.
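
For readers who want to check the logic of Nodes 3 to 5 outside KNIME, the extraction and join can be sketched in plain Python. This is an illustration only and not part of the workflow: the two JSON payloads are made-up sample values, the `{"data": [...]}` wrapper simply mirrors the `$.data[*]` path described above, and `extract_rows` and `left_join_on_date` are hypothetical helper names.

```python
import json

# Illustrative payloads only; real responses come from api.data.gov.my.
income_json = (
    '{"data": [{"date": "2019-01-01", "income_mean": 5000.0},'
    ' {"date": "2022-01-01", "income_mean": 8000.0}]}'
)
labour_json = '{"data": [{"date": "2022-01-01", "u_rate": 3.9}]}'


def extract_rows(payload: str) -> list[dict]:
    """Rough equivalent of JSON Path $.data[*]: return the records under "data"."""
    return json.loads(payload)["data"]


def left_join_on_date(left: list[dict], right: list[dict]) -> list[dict]:
    """Left outer join: keep every left row, add right columns where dates match."""
    right_by_date = {row["date"]: row for row in right}
    right_cols = sorted({key for row in right for key in row if key != "date"})
    joined = []
    for row in left:
        match = right_by_date.get(row["date"], {})
        merged_row = dict(row)
        for col in right_cols:
            merged_row[col] = match.get(col)  # None when the date has no match
        joined.append(merged_row)
    return joined


merged = left_join_on_date(extract_rows(income_json), extract_rows(labour_json))
```

Rows from the income table that have no labour record keep a missing `u_rate`, which is the kind of gap the Missing Value node in Phase 2 is meant to fill.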

Phase 2: ETL & Data Transformation


Cleans and prepares the merged data for ML:

  • Node 6 — Missing Value: Fills empty cells using column mean. Handles gaps in historical income and unemployment records.
  • Node 7 — Math Formula: Creates new feature income_gap = income_mean - 6338 (difference from national median RM 6,338 per data.gov.my 2024).
  • Node 8 — Rule Engine: Creates classification label: income_mean ≤ 3000 → B40, ≤ 7000 → M40, else → T20.
  • Node 9 — Normalizer: Min-Max normalisation (0–1) on all numeric feature columns before ML training.
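
The four transformations above can be approximated in a few lines of Python, which may help when checking the KNIME output by hand. This is a sketch under stated assumptions: the sample income values are invented, the function names are hypothetical, and only the constant 6338 and the 3000 / 7000 thresholds come from the node descriptions.

```python
NATIONAL_MEDIAN_RM = 6338  # constant used by the Math Formula node (Node 7)


def fill_with_mean(values):
    """Replace missing entries (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def income_group(income_mean):
    """Rule Engine logic: <= 3000 is B40, <= 7000 is M40, otherwise T20."""
    if income_mean <= 3000:
        return "B40"
    if income_mean <= 7000:
        return "M40"
    return "T20"


def min_max(values):
    """Scale values linearly to the range 0..1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # avoid division by zero on constant columns
    return [(v - lo) / (hi - lo) for v in values]


income_mean = [2000.0, None, 8000.0]  # illustrative values, not real data
filled = fill_with_mean(income_mean)
income_gap = [v - NATIONAL_MEDIAN_RM for v in filled]
labels = [income_group(v) for v in filled]  # labelled before normalisation
scaled = min_max(filled)
```

As in the workflow, the label is assigned from the raw income values before the Normalizer rescales them; applying the RM thresholds to 0–1 values would put every row in B40.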

Phase 3: Model Training


Full machine learning pipeline:

  • Node 10 — Partitioning: 80% training / 20% test split. Stratified by income_group to balance B40/M40/T20. Seed = 42 for reproducibility.
  • Node 11 — Random Forest Learner: Trains ensemble classifier with 100 trees. Features: income_mean, u_rate, income_gap. Target: income_group.
  • Node 12 — Random Forest Predictor: Applies trained model to test partition. Adds Prediction column with B40/M40/T20 label.
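
To make the partitioning step concrete, here is a minimal Python sketch of a stratified 80/20 split with a fixed seed. It is not how KNIME implements the node internally; `stratified_split` is a hypothetical helper, the rows are synthetic, and Python's random generator will not reproduce the exact rows KNIME selects with seed 42.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key, train_fraction=0.8, seed=42):
    """Split rows so each label keeps roughly the same train/test proportion."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, test = [], []
    for label in sorted(by_label):  # sorted for a deterministic group order
        group = by_label[label][:]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Synthetic rows: 10 B40, 10 M40 and 5 T20 records.
groups = ["B40"] * 10 + ["M40"] * 10 + ["T20"] * 5
rows = [{"id": i, "income_group": g} for i, g in enumerate(groups)]
train, test = stratified_split(rows, "income_group")
```

Splitting within each class keeps the smaller T20 group represented in both partitions, which a purely random 80/20 draw does not guarantee.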

Phase 4: Model Evaluation


Validates model quality before deployment:

  • Node 13 — Scorer: Calculates Accuracy, Precision, Recall and F1 score by comparing Prediction against the actual income_group. Right-click → Open View to see the confusion matrix. Target: accuracy above 75%.
  • Node 14 — Statistics: Descriptive statistics of the prediction results — distribution of B40/M40/T20 predictions across the test set.
  • Node 15 — PMML Writer: Saves trained Random Forest model as PADU_RF_Model.pmml for reuse without retraining.
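
The metrics reported by the Scorer can be recomputed by hand from the actual and predicted labels. The Python sketch below shows one way to do so; the two label lists are invented for illustration, the function names are hypothetical, and precision, recall and F1 are computed per class.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Share of rows where the prediction equals the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def confusion_counts(actual, predicted):
    """Confusion matrix as a Counter keyed by (actual, predicted)."""
    return Counter(zip(actual, predicted))


def per_class_scores(actual, predicted, label):
    """Return (precision, recall, f1) for one class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == label and p == label for a, p in pairs)
    fp = sum(a != label and p == label for a, p in pairs)
    fn = sum(a == label and p != label for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative labels only, not results from the workflow.
actual = ["B40", "B40", "M40", "M40", "T20"]
predicted = ["B40", "M40", "M40", "M40", "T20"]

acc = accuracy(actual, predicted)
matrix = confusion_counts(actual, predicted)
m40_precision, m40_recall, m40_f1 = per_class_scores(actual, predicted, "M40")
```

When reading the accuracy, keep in mind that income_group is derived directly from income_mean in Node 8 and income_mean is also a model feature, so a high score mostly shows that the model has recovered the labelling thresholds.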

Phase 5: Agentic AI — Google Gemini


No Python. 100% native KNIME AI nodes:

  • Node 16 — Credentials Configuration: The Gemini API key is pre-filled, so the node runs without further configuration.
  • Node 17 — Google AI Studio Authenticator: Validates key with Google API.
  • Node 18 — Gemini LLM Selector: Uses gemini-2.5-flash model.
  • Node 19 — LLM Prompter: Sends model accuracy + prediction data to Gemini. Prompt instructs Gemini to write a 3-sentence Bahasa Malaysia policy brief for Unit PADU and flag which income group needs urgent intervention.
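
The exact prompt text used in Node 19 is not given here, so the Python sketch below only illustrates how model accuracy and prediction counts might be combined into a single prompt string. The wording, the `build_prompt` name and the sample numbers are all assumptions; no call to Gemini is made.

```python
def build_prompt(accuracy: float, prediction_counts: dict[str, int]) -> str:
    """Assemble a hypothetical prompt from the evaluation results."""
    counts_text = ", ".join(
        f"{group}: {count}" for group, count in sorted(prediction_counts.items())
    )
    return (
        "You are a policy analyst for Unit PADU, Kementerian Ekonomi Malaysia.\n"
        f"Model accuracy: {accuracy:.1%}\n"
        f"Predicted rows per income group: {counts_text}\n"
        "Write a 3-sentence policy brief in Bahasa Malaysia and state which "
        "income group needs urgent intervention."
    )


# Sample inputs for illustration only.
prompt = build_prompt(0.8, {"M40": 3, "B40": 1, "T20": 1})
```

Keeping the numbers in the prompt, rather than asking the model to infer them, makes the generated brief easier to check against the Scorer output.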

Phase 6: Output & Reporting


Final outputs for PADU analysts:

  • Node 20 — CSV Writer: Exports full prediction results (date, income_mean, u_rate, income_gap, income_group, Prediction) to PADU_Predictions.csv. Ready for Power BI or Excel.
  • Node 21 — Table View: Live preview of Gemini AI policy brief results inside KNIME. Right-click → Open View.
  • Node 22 — Bar Chart: Visual chart showing B40/M40/T20 prediction distribution across the dataset.
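
The layout of the exported file can be illustrated with Python's standard `csv` module. The column order follows the list given for Node 20; the single data row is invented, `to_csv` is a hypothetical helper, and the sketch writes to an in-memory buffer instead of PADU_Predictions.csv.

```python
import csv
import io

# Column order as listed for the CSV Writer node.
COLUMNS = ["date", "income_mean", "u_rate", "income_gap", "income_group", "Prediction"]


def to_csv(rows: list[dict]) -> str:
    """Serialise prediction rows to CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative row; numeric columns are shown after Min-Max normalisation.
rows = [
    {
        "date": "2022-01-01",
        "income_mean": 0.9,
        "u_rate": 0.2,
        "income_gap": 0.9,
        "income_group": "T20",
        "Prediction": "T20",
    }
]
csv_text = to_csv(rows)
```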

Workflow canvas (annotation: node type):

  • Household Income, data.gov.my API: GET Request
  • Labour Force, data.gov.my API: GET Request
  • Merge Income & Labour data: Joiner
  • Label B40 / M40 / T20 target: Rule Engine
  • Min-Max Normalisation: Normalizer
  • 80/20 Train/Test Split: Table Partitioner
  • Train Random Forest (100 trees): Random Forest Learner
  • Predict B40/M40/T20 on test set: Random Forest Predictor
  • Accuracy / F1 / Confusion Matrix: Scorer (deprecated)
  • Descriptive Statistics: Statistics
  • Save Model as PMML file: PMML Writer (deprecated)
  • Google Gemini API Key: Credentials Configuration
  • Authenticate with Google: Google AI Studio Authenticator
  • Gemini 2.5 Flash LLM Model: Gemini LLM Selector
  • Export predictions to CSV: CSV Writer (deprecated)
  • View Gemini AI Policy Brief: Table View (JavaScript) (legacy)
  • Nodes without annotations: String to JSON (×2), JSON to Table (×2), JSON Path (×2), Ungroup (×2), Missing Value, Math Formula, Expression, LLM Prompter, Bar Chart
