KNIME_ML_Project


Data Import and Cleaning

Objective: Initialize the analysis pipeline by ingesting raw data, standardizing data types, and resolving missing values to prepare a clean dataset for modeling.
Internal Nodes and Parameters:
- String to Number: Converted all columns to numeric type. The raw file marks missing entries with the placeholder ?, which cannot be parsed as a number, so this conversion is what turns those entries into proper missing values that downstream nodes can detect.
- Statistics & Column Filter: A data audit showed that "STDs: Time since first diagnosis" and "STDs: Time since last diagnosis" were missing 787 of 858 values (about 91.7%). Both columns were removed, since imputing them would leave features made up almost entirely of filled-in values and could bias the model.
- Missing Value: Applied median imputation to all remaining features. The median was chosen because it is less sensitive to outliers than the mean on continuous variables, and because on binary variables it generally returns an existing value (0 or 1) rather than a fraction.
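
The conversion and imputation steps above can be sketched outside KNIME as follows. This is a minimal illustration in plain Python, not the KNIME nodes themselves: the helper names (to_number, impute_median) and the sample columns are hypothetical, chosen only to show how an unparseable "?" becomes a missing value and how the median fills it.

```python
from statistics import median


def to_number(cell):
    """Return the cell as a float, or None when it cannot be parsed (e.g. "?")."""
    try:
        return float(cell)
    except (TypeError, ValueError):
        return None


def impute_median(column):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column]


# Hypothetical raw columns: one continuous, one binary (0/1).
age_raw = ["18", "?", "34", "52", "29"]
smokes_raw = ["0", "1", "?", "0", "0"]

age = impute_median([to_number(c) for c in age_raw])
smokes = impute_median([to_number(c) for c in smokes_raw])
```

In this sample the missing age is filled with 31.5 (the median of 18, 29, 34, 52) and the missing binary value with 0.0. One caveat of the sketch: with an even number of observed values split exactly between 0 and 1, statistics.median would return 0.5, so a binary column is only guaranteed to stay binary when one value is in the majority.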
Assumptions and Data Quality:
- The input CSV uses ? as null indicators, requiring explicit pre-processing.
- Variables with more than 90% missing values are assumed to carry too little observed signal to impute reliably, so discarding them is treated as safe.
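
The missingness audit can be expressed as a small check. The sketch below assumes a table stored as a dict of column lists with None for missing values; the function names and the "Age" column are hypothetical, and the 0.90 threshold is taken from the assumption stated above.

```python
def missing_fraction(column):
    """Share of entries in the column that are missing (None)."""
    return sum(v is None for v in column) / len(column)


def drop_sparse_columns(table, threshold=0.90):
    """Keep only columns whose missing fraction does not exceed the threshold."""
    return {
        name: col
        for name, col in table.items()
        if missing_fraction(col) <= threshold
    }


# Hypothetical 858-row table reproducing the counts quoted in the text:
# 787 of 858 values missing in the diagnosis-time column.
table = {
    "Age": [30.0] * 858,
    "STDs: Time since first diagnosis": [None] * 787 + [1.0] * 71,
}

fractions = {name: missing_fraction(col) for name, col in table.items()}
kept = drop_sparse_columns(table)
```

Here the diagnosis-time column has a missing fraction of about 0.917, which is above the threshold, so only "Age" survives the filter.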

Data Strategy and Balancing

Objective: Optimize the dataset for machine learning modeling through feature normalization, class balancing, and data partitioning.
Internal Nodes and Parameters:
- Normalizer: Applied Min-Max normalization (range 0 to 1) to all numerical features so that no feature dominates distance-based steps or model training purely because of its scale.
- SMOTE: Oversampled the minority class of the target column "Biopsy" using k = 5 nearest neighbors until both classes were the same size, giving a 50/50 balance with 803 records per class.
- Table Partitioner: Executed a 70% Training / 30% Test split using stratified sampling to maintain class proportions across both datasets.
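
To make the normalization and oversampling steps concrete, here is a rough sketch of Min-Max scaling followed by SMOTE-style interpolation. It is a simplified illustration of the general technique, not the KNIME node's implementation: the function names, the seed, and the toy data are assumptions. If the 858 input rows and the 803 records per class quoted above both hold, the minority class would have started with 55 rows and needed 748 synthetic ones; the toy example uses much smaller numbers.

```python
import math
import random


def min_max(column):
    """Scale a numeric column linearly to the range 0..1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column: no spread to scale
    return [(v - lo) / (hi - lo) for v in column]


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority class with two features and 7 rows.
raw_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
raw_b = [10.0, 30.0, 20.0, 50.0, 40.0, 70.0, 60.0]
minority = list(zip(min_max(raw_a), min_max(raw_b)))

majority_size = 20  # hypothetical size of the majority class
new_points = smote(minority, n_new=majority_size - len(minority))
balanced_minority = minority + new_points
```

Scaling before oversampling matters in this sketch because the neighbour search uses Euclidean distance: without it, the feature with the larger raw range would decide which points count as neighbours.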
Assumptions and Missingness:
- The input data is assumed to be free of missing values, as imputation was finalized during the previous phase.
- The target variable (Biopsy) was converted to String format to meet the input requirements of the SMOTE node, which treats the class column as nominal.
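
The string-typed target and the stratified 70/30 split can be illustrated together. The sketch below is an assumption-laden stand-in for the partitioning node: stratified_split, the seed, and the rounding rule are hypothetical, and rows are represented by their indices only.

```python
import random
from collections import defaultdict


def stratified_split(rows, labels, train_fraction=0.70, seed=0):
    """Split rows into train/test so each class keeps the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    train, test = [], []
    for label, members in by_class.items():
        members = members[:]  # copy so the caller's data is not shuffled
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train += [(row, label) for row in members[:cut]]
        test += [(row, label) for row in members[cut:]]
    return train, test


# Hypothetical balanced data: 803 rows per class, as stated in the text.
numeric_target = [0] * 803 + [1] * 803
labels = [str(v) for v in numeric_target]  # nominal (string) class labels
rows = list(range(len(labels)))

train, test = stratified_split(rows, labels)
train_positive = sum(1 for _, label in train if label == "1")
test_positive = sum(1 for _, label in test if label == "1")
```

With this sketch's rounding rule, each class contributes 562 training rows and 241 test rows, so both partitions stay at 50/50. The exact counts produced by the KNIME node may differ by a row depending on how it rounds.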
