
Projekt_DMMLBreastCancer

Scales numerical features to a normalized range (0–1) to ensure comparable feature magnitudes and improve model stability.
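Since KNIME's Normalizer is a visual node, here is an illustrative Python equivalent of min–max scaling to [0, 1] with scikit-learn (a sketch on toy data, assuming the node is configured for min–max normalization):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data with very different feature magnitudes
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Rescale each column independently to the range [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)
```

After scaling, each column's minimum maps to 0 and its maximum to 1, so no feature dominates purely because of its unit of measurement.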

Removes non-informative attributes such as the sample identifier (ID) and an empty column to prevent them from influencing the machine learning model.
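The Column Filter step corresponds to dropping columns in pandas. The sketch below uses hypothetical column names (`id`, `Unnamed: 32`) to stand in for the identifier and the empty column:

```python
import pandas as pd

# Toy frame mimicking the raw CSV: an ID column, a feature, and an empty column
df = pd.DataFrame({
    "id": [101, 102],
    "radius_mean": [14.2, 20.1],
    "Unnamed: 32": [float("nan"), float("nan")],  # hypothetical empty-column name
})

# Drop the non-informative columns before modelling
df = df.drop(columns=["id", "Unnamed: 32"])
```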

Random forest model: builds an ensemble of decision trees for:

  • Classification of tumours (Random Forest Learner)

  • Prediction based on test data (Random Forest Predictor)
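The Learner/Predictor pair above can be sketched in scikit-learn, where `fit` plays the role of the Random Forest Learner and `predict` that of the Random Forest Predictor (an illustrative equivalent, not the workflow's exact configuration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)        # Random Forest Learner
pred = clf.predict(X_test)       # Random Forest Predictor
acc = clf.score(X_test, y_test)  # held-out accuracy
```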

Loads the Breast Cancer Wisconsin (Diagnostic) dataset from the UCI Machine Learning Repository containing 569 tumor samples and 30 numerical features describing cell nuclei characteristics.
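The same dataset ships with scikit-learn, which makes the stated dimensions easy to verify (the workflow itself reads it from a CSV file instead):

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
# 569 tumor samples, 30 numerical features describing cell nuclei
n_samples, n_features = data.data.shape
```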

Computes descriptive statistics for all numerical variables including mean, standard deviation, minimum and maximum values to better understand feature distributions.
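A pandas sketch of the Statistics node's output, reduced to the four summary rows mentioned above:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Mean, standard deviation, minimum and maximum per feature
stats = df.describe().loc[["mean", "std", "min", "max"]]
```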

Color Manager converts the diagnosis variable into a consistent colour code (blue = benign, red = malignant) to facilitate visual interpretation of graphs and classification errors.

Scatter Plot projects the data onto two key dimensions to visually show that benign and malignant tumours form distinct clusters, confirming the feasibility of classification prior to model training.

Detects and handles missing values in the dataset to ensure data completeness before analysis and model training.

Counts the number of benign and malignant samples in order to analyze class distribution and detect potential class imbalance.
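The GroupBy count can be reproduced with a `value_counts` over the target labels; in the scikit-learn copy of the dataset, the classes are moderately imbalanced (357 benign vs. 212 malignant):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
# Map numeric targets (0/1) to their class names, then count per class
labels = pd.Series(data.target).map(dict(enumerate(data.target_names)))
counts = labels.value_counts()
```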

Shows the distribution of individual numerical features.

Visualizes feature distributions and identifies possible outliers.

Displays relationships between features and possible class separation.

Calculates Pearson correlation coefficients between numerical features to identify potential redundancy and relationships between variables.
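A pandas sketch of the Linear Correlation step; the near-perfect correlation between mean radius and mean perimeter is a well-known example of redundancy in this dataset:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Full Pearson correlation matrix between all numerical features
corr = df.corr(method="pearson")

# Radius and perimeter measure nearly the same geometric property
r = corr.loc["mean radius", "mean perimeter"]
```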

Applies K-Means clustering to explore whether natural groupings exist within the dataset and to evaluate whether benign and malignant tumors form separable clusters.
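An illustrative k-means run with two clusters on min–max-scaled features (the cluster count and scaling choice mirror the workflow's question of benign/malignant separability, but are assumptions here):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X = MinMaxScaler().fit_transform(X)

# Two clusters, hoping they align with the benign/malignant split
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
```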

Trains a Random Forest classification model to predict tumor diagnosis based on the extracted cell nucleus features.

Performs hyperparameter optimization for the Random Forest model by testing multiple combinations of parameters such as the number of trees and maximum tree depth.
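The parameter-optimization loop corresponds to a grid search; the grid values below are illustrative, not the ones configured in the workflow:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Example grid over number of trees and maximum depth
param_grid = {"n_estimators": [50, 100], "max_depth": [5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
best = search.best_params_
```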

Splits the dataset into multiple folds to perform k-fold cross-validation, ensuring a robust and unbiased evaluation of model performance.

Aggregates the evaluation results across all folds and computes the average performance metrics.
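The X-Partitioner/X-Aggregator pair maps onto scikit-learn's `cross_val_score`, which runs the folds and returns one score per fold for averaging (a sketch assuming 5 folds and accuracy as the metric):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# One accuracy score per fold, then the aggregated mean
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
mean_accuracy = scores.mean()
```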

Scorer: computes the confusion matrix and the final accuracy of the model.
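The Scorer's two outputs can be sketched with scikit-learn's metrics on a held-out split (split ratio and seed are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

pred = RandomForestClassifier(random_state=42).fit(X_tr, y_tr).predict(X_te)
cm = confusion_matrix(y_te, pred)   # rows: true class, columns: predicted class
acc = accuracy_score(y_te, pred)
```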

Evaluates the classification performance using the Receiver Operating Characteristic (ROC) curve and calculates the Area Under the Curve (AUC) to measure the model's discriminative ability.
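The ROC/AUC evaluation requires class probabilities rather than hard labels; a sketch under the same assumed split:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Probability of the positive class drives the ROC curve
clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]
fpr, tpr, _ = roc_curve(y_te, proba)
auc = roc_auc_score(y_te, proba)
```

An AUC near 1.0 indicates that the model ranks almost all malignant cases above benign ones.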

CSV Reader
Normalizer
Random Forest Learner
Scatter Plot
GroupBy
Column Filter
Missing Value
Scatter Plot
Parameter Optimization Loop Start
Linear Correlation
k-Means
X-Partitioner
Statistics
X-Aggregator
Parameter Optimization Loop End
Random Forest Predictor
Scorer
Scorer
Color Manager
Color Manager
Random Forest Learner
Scatter Plot
Random Forest Predictor
Histogram
Box Plot
ROC Curve
