
KNIME_ML_Project-2


Data Import and Cleaning

  • Objective: Initialize the analysis pipeline by ingesting raw data, standardizing data types, and resolving missing values to prepare a clean dataset for modeling.

  • Internal Nodes and Parameters:

    • String to Number: Converted all columns to numeric format. This forced conversion was crucial so that the non-standard placeholder "?" is interpreted as a valid missing value rather than as text.

    • Statistics & Column Filter: A data audit identified "STDs: Time since first diagnosis" and "STDs: Time since last diagnosis" as critical noise (787/858 missing values, >91%). These columns were removed to prevent model bias.

    • Missing Value: Applied median imputation to all remaining features. This strategy minimizes the impact of outliers on continuous variables and preserves the binary integrity of categorical variables (keeping them 0 or 1). A minimal code sketch of these cleaning steps follows this section.

  • Assumptions and Data Quality:

    • The input CSV uses "?" as its null indicator, requiring explicit pre-processing.

    • Variables with >90% missingness provide insufficient signal for imputation and were safe to discard.
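
  For illustration only, the cleaning logic described above could be reproduced outside KNIME with a few lines of pandas. This is a minimal sketch, not the workflow itself; the file name is an assumption based on the public cervical-cancer risk-factors dataset.

    import pandas as pd

    # Read the raw CSV, treating the non-standard placeholder "?" as a missing value.
    # (File name is assumed; adjust to the actual input file.)
    df = pd.read_csv("risk_factors_cervical_cancer.csv", na_values=["?"])

    # Drop the two columns with >91% missing values (787/858), which carry too little
    # signal to be imputed reliably.
    df = df.drop(columns=[
        "STDs: Time since first diagnosis",
        "STDs: Time since last diagnosis",
    ])

    # Median imputation for all remaining features: robust to outliers on continuous
    # variables and keeps binary 0/1 columns within their original values.
    df = df.fillna(df.median(numeric_only=True))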

Data Strategy and Balancing

  • Objective: Optimize the dataset for machine learning modeling through feature normalization, class balancing, and data partitioning.

  • Internal Nodes and Parameters:

    • Normalizer: Applied the Min-Max (0-1) method to all numerical features to ensure equal weighting during model training.

    • SMOTE: Oversampled the minority (positive) class of the target Biopsy using k = 5 nearest neighbors, achieving a 50/50 class balance with 803 records per class.

    • Table Partitioner: Executed a 70% Training / 30% Test split using stratified sampling to maintain class proportions across both datasets. A minimal code sketch of these preparation steps follows this section.

  • Assumptions and Missingness:

    • The input data is assumed to be free of missing values, as imputation was finalized during the previous phase.

    • The target variable (Biopsy) was converted to String format to satisfy the algorithmic requirements of the SMOTE node.
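
  As a rough stand-in for these nodes, the same preparation could be sketched with scikit-learn and imbalanced-learn, reusing the cleaned df from the previous sketch. The random seeds are assumptions added for reproducibility; note that imbalanced-learn accepts the numeric 0/1 target directly, whereas KNIME's SMOTE node requires the string conversion mentioned above.

    from sklearn.preprocessing import MinMaxScaler
    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE

    # Separate features and target (only the target is excluded here; the exact
    # feature set mirrors the KNIME workflow).
    X = df.drop(columns=["Biopsy"])
    y = df["Biopsy"]

    # Min-Max (0-1) normalization of all numerical features.
    X_scaled = MinMaxScaler().fit_transform(X)

    # SMOTE oversampling of the minority class with k = 5 nearest neighbors,
    # producing a 50/50 balance (803 records per class in the described run).
    X_bal, y_bal = SMOTE(k_neighbors=5, random_state=42).fit_resample(X_scaled, y)

    # Stratified 70% / 30% train/test split to preserve class proportions.
    X_train, X_test, y_train, y_test = train_test_split(
        X_bal, y_bal, test_size=0.30, stratify=y_bal, random_state=42
    )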

Baseline Model (Decision Tree)

  • Objective: Establish a performance benchmark by training an interpretable, single-tree classifier. This phase serves as a reference point to quantify the technical improvements achieved by advanced ensemble methods (Random Forest) in later stages.

  • Internal Nodes and Parameters:

    • Decision Tree Learner: Trained the model on the balanced Training Set (70%) using the Gini Index split criterion. MDL Pruning was kept active to prevent excessive overfitting and to maintain the "white-box" nature of the model, ensuring the decision rules remain interpretable. Target Class: Biopsy.

    • Decision Tree Predictor: Applied the trained rules to the unseen Test Set (30%). The node was configured to append prediction probabilities, which are essential for generating the ROC curve and for evaluating the model's confidence alongside the raw class predictions. A minimal code sketch of this phase follows this section.

  • Model Role and Assumptions:

    • As a baseline, this model prioritizes speed and interpretability over maximum accuracy.

    • Any improvement in sensitivity/accuracy shown by the subsequent Random Forest model will be measured against the metrics produced by this Decision Tree.
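
  Under the same assumptions as the earlier sketches (the X_train/X_test split from the previous section), a scikit-learn equivalent of this baseline might look as follows. scikit-learn has no MDL pruning, so cost-complexity pruning (ccp_alpha, an assumed value) is used here as a loose substitute.

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score, roc_auc_score

    # Single decision tree with the Gini index split criterion; ccp_alpha stands in
    # (approximately) for KNIME's MDL pruning to keep the tree small and interpretable.
    tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.001, random_state=42)
    tree.fit(X_train, y_train)

    # Class predictions plus positive-class probabilities, as needed for the ROC curve.
    pred = tree.predict(X_test)
    proba = tree.predict_proba(X_test)[:, 1]

    print("Accuracy:", accuracy_score(y_test, pred))
    print("ROC AUC :", roc_auc_score(y_test, proba))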

Advanced Model (Random Forest)

  • Objective: Improve predictive performance and sensitivity in detecting positive cervical cancer cases by leveraging an ensemble-based learning approach. The Random Forest model is designed to reduce variance, capture non-linear interactions among features, and outperform the baseline Decision Tree while maintaining robustness to noise.

  • Internal Nodes and Parameters:

    • Random Forest Learner: Trained the model on the balanced Training Set (70%) using an ensemble of 100 decision trees. The Information Gain Ratio was selected as the split criterion to ensure consistent and comparable feature selection across trees. Tree depth was left unrestricted to allow the model to fully capture complex interactions between demographic and lifestyle variables. A fixed random seed (42) was used to guarantee reproducibility and experimental stability across runs. Target Class: Biopsy.

    • Random Forest Predictor: Applied the trained ensemble model to the unseen Test Set (30%). The predictor was configured to append class predictions and probability estimates, enabling detailed evaluation through performance metrics such as sensitivity, specificity, and ROC analysis. A minimal code sketch of this phase follows this section.

  • Model Role and Assumptions:

    • The Random Forest model prioritizes predictive accuracy and sensitivity over interpretability, complementing the baseline Decision Tree.

    • By aggregating many trees, each trained on a bootstrap sample with a randomized subset of features considered at each split, the model mitigates overfitting and improves generalization on unseen data.

    • Performance gains achieved by this model are directly compared against the baseline to quantify the impact of ensemble learning and data balancing techniques.
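
  A scikit-learn sketch of this phase, under the same assumptions as before, is shown below. scikit-learn does not offer the Information Gain Ratio criterion, so entropy (information gain) is used here as the closest available substitute; the metrics mirror the sensitivity/specificity/ROC evaluation described above.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import confusion_matrix, roc_auc_score

    rf = RandomForestClassifier(
        n_estimators=100,      # ensemble of 100 trees
        criterion="entropy",   # information gain, standing in for KNIME's Information Gain Ratio
        max_depth=None,        # unrestricted tree depth
        random_state=42,       # fixed seed for reproducibility
    )
    rf.fit(X_train, y_train)

    pred = rf.predict(X_test)
    proba = rf.predict_proba(X_test)[:, 1]

    # Sensitivity (recall on the positive class) and specificity from the confusion matrix,
    # plus ROC AUC from the appended probabilities: the metrics compared against the baseline.
    tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
    print("Sensitivity:", tp / (tp + fn))
    print("Specificity:", tn / (tn + fp))
    print("ROC AUC    :", roc_auc_score(y_test, proba))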

Workflow components: File Reader, Data Cleaning, Data Strategy and Balancing, Baseline Model (Decision Tree), Advanced Model (Random Forest)
