
KNIME_ML_Project


Data Import and Cleaning

  • Objective: Initialize the analysis pipeline by ingesting raw data, standardizing data types, and resolving missing values to prepare a clean dataset for modeling.

  • Internal Nodes and Parameters:

    • String to Number: Converted all columns to a numeric type. Because the non-standard placeholder "?" cannot be parsed as a number, this step is what turns it into a proper missing value.

    • Statistics & Column Filter: A data audit showed that "STDs: Time since first diagnosis" and "STDs: Time since last diagnosis" were missing 787 of 858 values (about 91.7%). Both columns were removed so that mostly imputed values would not bias the model.

    • Missing Value: Applied median imputation to all remaining features. The median was chosen because it is robust to outliers in continuous variables and because, for binary (0/1) variables, it yields 0 or 1 rather than a fractional value (except in the case of an exact tie).

  • Assumptions and Data Quality:

    • The input CSV uses "?" as its null indicator, which requires explicit pre-processing.

    • Variables with >90% missingness provide insufficient signal for imputation and were safe to discard.
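
The cleaning steps above can be sketched outside KNIME in plain Python. This is only an illustration of the logic, not the workflow's actual implementation: the sample data, the function names (to_number, clean) and the max_missing_ratio parameter are all hypothetical, and the toy call uses a lower threshold than 90% because the sample has only five rows.

```python
import csv
import io
from statistics import median

# Hypothetical miniature of the raw file; "?" marks a missing value.
RAW_CSV = """Age,Smokes,STDs: Time since first diagnosis
18,0,?
25,1,?
?,0,?
40,?,3
31,0,?
"""

def to_number(text):
    """Mimic String to Number: unparseable cells such as "?" become None."""
    try:
        return float(text)
    except ValueError:
        return None

def clean(raw_csv, max_missing_ratio=0.9):
    rows = list(csv.DictReader(io.StringIO(raw_csv)))
    columns = {name: [to_number(row[name]) for row in rows] for name in rows[0]}

    cleaned = {}
    for name, values in columns.items():
        missing = sum(v is None for v in values)
        # Column Filter: drop columns whose missing share exceeds the threshold.
        if missing / len(values) > max_missing_ratio:
            continue
        # Missing Value: fill the gaps with the median of the observed values.
        fill = median(v for v in values if v is not None)
        cleaned[name] = [fill if v is None else v for v in values]
    return cleaned

# The toy sample is tiny, so a 70% threshold stands in for the 90% rule.
result = clean(RAW_CSV, max_missing_ratio=0.7)
for name, values in result.items():
    print(name, values)
```

In this sketch the mostly empty diagnosis column is dropped, the gap in Age is filled with the median of the observed ages, and the gap in the binary Smokes column is filled with 0, so the column stays binary.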

Data Strategy and Balancing

  • Objective: Optimize the dataset for machine learning modeling through feature normalization, class balancing, and data partitioning.

  • Internal Nodes and Parameters:

    • Normalizer: Applied Min-Max (0-1) scaling to all numerical features so that features with large value ranges do not dominate model training.

    • SMOTE: Oversampled the minority class of the target column "Biopsy" using 5 nearest neighbors (K=5) until the classes reached a 50/50 balance, resulting in 803 records per class.

    • Table Partitioner: Executed a 70% Training / 30% Test split using stratified sampling to maintain class proportions across both datasets.

  • Assumptions and Missingness:

    • The input data is assumed to be free of missing values, as imputation was finalized during the previous phase.

    • The target variable (Biopsy) was converted to String format to satisfy the input requirements of the SMOTE node.
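
As a rough illustration of what the three nodes in this phase do, the following plain-Python sketch rescales features to 0-1, creates synthetic minority rows by interpolating between nearest neighbors, and splits each class separately at 70/30. It is a simplified stand-in under stated assumptions, not KNIME's implementation: the toy data, the function names (min_max, smote, stratified_split) and the fixed random seeds are all hypothetical.

```python
import random

def min_max(column):
    """Rescale a numeric column to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in column]

def smote(minority, n_new, k=5, rng=None):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (squared Euclidean distance)."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

def stratified_split(rows, labels, train_ratio=0.7, rng=None):
    """Split each class separately so both parts keep the class proportions."""
    rng = rng or random.Random(0)
    train, test = [], []
    for cls in sorted(set(labels)):
        members = [(r, l) for r, l in zip(rows, labels) if l == cls]
        rng.shuffle(members)
        cut = round(len(members) * train_ratio)
        train += members[:cut]
        test += members[cut:]
    return train, test

# Hypothetical toy data: 20 majority rows and 6 minority rows, two features.
rng = random.Random(42)
majority = [[rng.uniform(0, 50), rng.uniform(0, 10)] for _ in range(20)]
minority = [[rng.uniform(30, 80), rng.uniform(5, 20)] for _ in range(6)]

# Normalizer: rescale each feature column across all rows.
all_rows = majority + minority
scaled_columns = [min_max(list(col)) for col in zip(*all_rows)]
scaled = [list(row) for row in zip(*scaled_columns)]
maj_scaled, min_scaled = scaled[:20], scaled[20:]

# SMOTE: add synthetic minority rows until both classes are the same size.
new_rows = smote(min_scaled, n_new=len(maj_scaled) - len(min_scaled), k=5, rng=rng)
rows = maj_scaled + min_scaled + new_rows
labels = ["0"] * len(maj_scaled) + ["1"] * (len(min_scaled) + len(new_rows))

# Table Partitioner: 70/30 split that keeps the 50/50 class ratio in both parts.
train, test = stratified_split(rows, labels, train_ratio=0.7, rng=rng)
print(len(train), len(test))
```

Because every synthetic row lies on a segment between two existing minority rows, the new values stay inside the 0-1 range produced by the normalization step. The class labels are kept as strings here to mirror the String target column described above.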
