KNIME_ML_Project-2

Data Import and Cleaning

  • Objective: Initialize the analysis pipeline by ingesting raw data, standardizing data types, and resolving missing values to prepare a clean dataset for modeling.

  • Internal Nodes and Parameters:

    • String to Number: Forced conversion of all columns to the Number type. This step was crucial for interpreting the non-standard placeholder "?" as a valid missing value.

    • Statistics & Column Filter: Conducted a data audit that identified "STDs: Time since first diagnosis" and "STDs: Time since last diagnosis" as uninformative (787 of 858 values missing, >91%). These columns were removed to prevent model bias.

    • Missing Value: Applied Median Imputation to all remaining features. This strategy was selected to minimize outlier impact on continuous variables and to maintain the binary integrity of categorical variables (keeping them 0 or 1).

  • Assumptions and Data Quality:

    • The input CSV uses "?" as its null indicator, requiring explicit pre-processing.

    • Variables with >90% missingness provide insufficient signal for imputation and were safe to discard. A minimal code sketch of this cleaning phase follows below.
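
The workflow itself contains no code, but the cleaning phase can be sketched in pandas. This is a minimal illustration, not the workflow's implementation; the file name is an assumption, and only the two dropped column names come from the audit above.

  import pandas as pd

  # Read the raw CSV, mapping the non-standard "?" placeholder to NaN
  # (file name is illustrative; substitute the actual dataset path).
  df = pd.read_csv("risk_factors_cervical_cancer.csv", na_values=["?"])

  # Mirror the String to Number node: coerce every column to numeric,
  # turning any remaining unparseable entries into missing values.
  df = df.apply(pd.to_numeric, errors="coerce")

  # Mirror the Column Filter: drop the two columns flagged by the audit
  # (787 of 858 values missing, >91%).
  df = df.drop(columns=["STDs: Time since first diagnosis",
                        "STDs: Time since last diagnosis"])

  # Mirror the Missing Value node: median imputation for all remaining
  # features; the median is robust to outliers and, for 0/1 columns,
  # normally stays 0 or 1.
  df = df.fillna(df.median())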

Data Strategy and Balancing

  • Objective: Optimize the dataset for machine learning modeling through feature normalization, class balancing, and data partitioning.

  • Internal Nodes and Parameters:

    • Normalizer: Applied the Min-Max (0-1) method to all numerical features to ensure equal weighting during model training.

    • SMOTE: Oversampled the minority class of the Biopsy target (K=5 nearest neighbours) to achieve an exact 50/50 balance, resulting in 803 records per class.

    • Table Partitioner: Executed a 70% Training / 30% Test split using stratified sampling to maintain class proportions across both datasets.

  • Assumptions and Missingness:

    • The input data is assumed to be free of missing values, as imputation was finalized during the previous phase.

    • The target variable (Biopsy) was converted to String format to satisfy the algorithmic requirements of the SMOTE node (see the sketch after this list).
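
Continuing from the cleaned df above, the balancing phase can be sketched with scikit-learn and imbalanced-learn. k_neighbors=5 matches the node's K=5; the random_state values are illustrative, and the SMOTE-before-partitioning order follows the workflow described here.

  import pandas as pd
  from sklearn.preprocessing import MinMaxScaler
  from sklearn.model_selection import train_test_split
  from imblearn.over_sampling import SMOTE

  # Separate features from the Biopsy target; the target is cast to
  # string to mirror the nominal type the KNIME SMOTE node requires.
  X = df.drop(columns=["Biopsy"])
  y = df["Biopsy"].astype(str)

  # Mirror the Normalizer node: Min-Max scale every feature to [0, 1].
  X = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)

  # Mirror the SMOTE node: oversample the minority class with K=5
  # neighbours until both classes are balanced 50/50.
  X_bal, y_bal = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)

  # Mirror the Table Partitioner: stratified 70/30 train/test split so
  # both partitions keep the 50/50 class ratio.
  X_train, X_test, y_train, y_test = train_test_split(
      X_bal, y_bal, test_size=0.30, stratify=y_bal, random_state=42)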

Baseline Model (Decision Tree)

  • Objective: Establish a performance benchmark by training an interpretable, single-tree classifier. This phase serves as a reference point to quantify the technical improvements achieved by advanced ensemble methods (Random Forest) in later stages.

  • Internal Nodes and Parameters:

    • Decision Tree Learner: Trained the model on the balanced Training Set (70%) using the Gini Index split criterion. MDL Pruning was kept active to prevent excessive overfitting and to maintain the "white-box" nature of the model, ensuring the decision rules remain interpretable. Target Class: Biopsy.

    • Decision Tree Predictor: Applied the trained rules to the unseen Test Set (30%). The node was configured to append prediction probabilities, which are essential for generating the ROC Curve and evaluating the model's confidence levels alongside the raw class predictions.

  • Model Role and Assumptions:

    • As a baseline, this model prioritizes speed and interpretability over maximum accuracy.

    • Any improvement in sensitivity or accuracy shown by the subsequent Random Forest model will be measured against the metrics produced by this Decision Tree (a code sketch of this baseline follows below).
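
A comparable baseline in scikit-learn, using the partitions from the previous sketch. scikit-learn has no MDL pruning, so cost-complexity pruning (ccp_alpha, value illustrative) is assumed here as a rough analogue for keeping the tree small and interpretable.

  from sklearn.tree import DecisionTreeClassifier
  from sklearn.metrics import accuracy_score, roc_auc_score

  # Gini-criterion tree; ccp_alpha approximates the pruning role that
  # MDL pruning plays in the KNIME Decision Tree Learner (assumption).
  tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.001,
                                random_state=42)
  tree.fit(X_train, y_train)

  # Hard class predictions plus positive-class probabilities; the
  # probabilities are what the ROC curve and AUC are computed from.
  pred = tree.predict(X_test)
  proba = tree.predict_proba(X_test)[:, 1]

  print(f"Baseline accuracy: {accuracy_score(y_test, pred):.3f}")
  print(f"Baseline ROC AUC:  {roc_auc_score(y_test, proba):.3f}")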

