Chemical grouping workflow 1.1.14 - course

K-MEANS Clustering - Hyperparameters Search If projection cluster If projection cluster - YES If visualization mode UMAP Feature importance for each cluster Data processing Download/Report Clustering options Unlabeled/Unsupervised Labeled/Supervised Input files SDF CSV, XLS, XLSX

# **Flow variables**

# **Upload**

new/past_analysis - Start New Analysis OR View Past Results OR Start New Analysis with Prior Configuration
new/past_analysis (index) - 0 = Start New Analysis, 1 = View Past Results, 2 = Start New Analysis with Prior Configuration
file-upload - Path to the uploaded file
file-upload_variables - Path to the uploaded variables file (.variables extension)
filePath - path to the local input file
struct_label_file_name - name of the uploaded chemical structures/labels file
file_extension - extension of the uploaded file
extension - extension of the uploaded file: ("sdf") => 1, ("smi") => 2, ("csv") => 3, ("xls","lsx") => 4
input-data-type - Labeled or Unlabeled
input-data-type (index) - 0 = labeled, 1 = unlabeled
is_prior - New analysis OR Visualize a past analysis OR Start New Analysis with Prior Configuration
is_prior (index) - 0 = New analysis, > 0 = Visualize a past analysis OR Start New Analysis with Prior Configuration
image - MoVIZ logo
past_input-data-type - 0 = labeled, 1 = unlabeled
past_supervised/unsupervised-selection - 0 = Supervised | 1 = Unsupervised
new_input-data-type - 0 = labeled, 1 = unlabeled
past_SHAP-selection - 0 = Yes | 1 = No

## **SDF/SMI/CSV/XLS/XLSX**

activity_column - column containing the labels
activity_column_type - Classification OR Multi-class OR Continuous
activity_column_type (index) - 0 = Classification, 1 = Multi-class, 2 = Continuous
smiColumn - column containing SMILES

## **Exploratory analysis**

missing_values_count - number of missing values
show_missing_values_table - 0 = do not show, 1 = show
unreadable_count - number of unreadable chemicals
total_chemicals_count - total number of chemicals

# **Select molecular descriptors**

binary/cont descriptor-selection - Binary
descriptors OR Continuous descriptors
binary/cont descriptor-selection (index) - 0 = Binary descriptors, 1 = Continuous descriptors
morgan - 0 = unselected, 1 = selected
featmorgan - 0 = unselected, 1 = selected
maccs - 0 = unselected, 1 = selected
padel-descriptors - 0 = unselected, 1 = selected
rdkit-descriptors - 0 = unselected, 1 = selected
mordred - 0 = unselected, 1 = selected

## **Calculate Binary molecular descriptors**

morgan-fplen - morgan number of bits
radius-morgan - morgan radius
featmorgan-fplen - featmorgan number of bits
radius-featmorgan - featmorgan radius

## **Calculate Continuous molecular descriptors**

normalize - 0 = do not normalize the descriptors, 1 = normalize the descriptors
normalization_method - Min-max scaling / Z-score normalization / Normalization by decimal scaling
padel_2d_select - selected PaDEL 2D descriptors

# **Dimensionality reduction**

aut/man_dimensionality-selection - automated OR manual dimensionality reduction
high_correlated_filter - 0 = unselected, 1 = selected
low_variance_filter - 0 = unselected, 1 = selected
low_variance_threshold - the low-variance threshold value
high_correlated_threshold - the correlation threshold value
dataset_split_method - Single split (80/20) | Cross-validation (5-fold)

## **ML method**

ml_algorithm-selection - LightGBM | Random Forest | Support Vector Machine | Logistic Regression | K-Nearest Neighbors | Naive Bayes
supervised_ML_selected_before_yes_no - flag that records whether the supervised ML method was selected before: 0 = No | 1 = Yes

# **IF Switch Supervised/Unsupervised**

supervised/unsupervised-selection - Supervised OR Unsupervised
new_supervised/unsupervised-selection - Supervised OR Unsupervised

# **Variable Selection - Labeled**

aut/man_feature-selection - Automated | Manual | Use all variables
automated_var_method-selection - Recursive Feature Elimination | Genetic Algorithm | Simulated Annealing
count_columns_after_dimensionaty - number of columns after dimensionality reduction
ga_cv - genetic algorithm cross-validation generator or an iterable -
determines the cross-validation splitting strategy
ga_max_features - maximum number of features selected by the genetic algorithm
ga_n_population - population size for the genetic algorithm
ga_crossover_proba - probability of crossover for the genetic algorithm
ga_n_generations - number of generations for the genetic algorithm

# **Variable selection Unlabeled or Unsupervised**

aut/man_feature-selection - Manual | Use all variables

# **IF SHAP Yes/No**

SHAP-selection - Yes | No
num_features_viz_global - number of features to visualize in the SHAP graph
Number Columns - number of features (descriptors) available

# **Select clustering options**

clustering-selection - Select the clustering algorithm - K-means | Hierarchical clustering | K-medoids | HDBSCAN | DBSCAN
clustering-selection (index) - index of the selected clustering algorithm
visualization-selection - Select the visualization method - UMAP | PCA | t-SNE
visualization-selection (index) - index of the selected visualization method
projection-selection - Select projection cluster - yes | no
projection-selection (index) - index of the selected projection cluster option (yes or no)

# **K-means**

## **UMAP**

### **Select Hyperparameters**

### **Automated/Manual**

aut/man_hyperparameter_selection - automated or manual hyperparameter tuning: 0 = Manual, 1 = Automated

### **Automated**

*K-means hyperparameters:*
min-clusters - Minimum number of clusters
max-clusters - Maximum number of clusters

*UMAP hyperparameters:*
min_dist_max - Maximum value of min_dist
min_dist_min - Minimum value of min_dist
n_neighbors_max - Maximum value of n_neighbors
n_neighbors_min - Minimum value of n_neighbors
umap_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
umap_aut_checkbox_metric_cont_euclidean - 1 = checked, 0 = unchecked
umap_aut_checkbox_metric_cont_manhattan - 1 = checked, 0 = unchecked
umap_aut_checkbox_metric_bin_jaccard - 1 = checked, 0 = unchecked
umap_aut_checkbox_metric_bin_dice - 1 = checked, 0 = unchecked

*Optuna*:
n_trials -
Number of trials for Optuna hyperparameter optimization

### **Select number of clusters**

*K-means hyperparameters*:
n_clusters - best number of clusters identified

*UMAP hyperparameters*:
min_dist - best value identified
n_neighbors - best value identified
silhouette_score - silhouette score for the combination of hyperparameters
umap_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice

### **Manual**

*K-means hyperparameters*:
n_clusters - manually entered value for the number of clusters

*UMAP hyperparameters*:
min_dist - manually entered value
n_neighbors - manually entered value
umap_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
umap_man_selection_metric_cont - 0 = Euclidean, 1 = Manhattan
umap_man_selection_metric_bin - 0 = Tanimoto/Jaccard, 1 = Dice

### **Silhouette Analysis**

umap_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
silhouette_score - silhouette score for the combination of hyperparameters

## **PCA**

### **Select Hyperparameters**

### **Automated/Manual**

aut/man_hyperparameter_selection - automated or manual hyperparameter tuning: 0 = Manual, 1 = Automated

### **Automated**

*K-means hyperparameters*:
min-clusters - Minimum number of clusters
max-clusters - Maximum number of clusters

*Optuna*:
n_trials - Number of trials for Optuna hyperparameter optimization

### **Select number of clusters**

*K-means hyperparameters*:
n_clusters - best number of clusters identified
silhouette_score - silhouette score for the combination of hyperparameters

### **Manual**

*K-means hyperparameters*:
n_clusters - manually entered value for the number of clusters

### **Silhouette Analysis**

silhouette_score - silhouette score for the combination of hyperparameters

## **t-SNE**

### **Select Hyperparameters**

### **Automated/Manual**

aut/man_hyperparameter_selection - automated or manual hyperparameter tuning: 0 = Manual, 1 = Automated

### **Automated**

*K-means hyperparameters*:
min-clusters - Minimum number of clusters
max-clusters - Maximum number of clusters

*t-SNE
hyperparameters*:
perplexity_max - Maximum value for the perplexity hyperparameter
perplexity_min - Minimum value for the perplexity hyperparameter
learning_rate_max - Maximum value for the learning rate hyperparameter
learning_rate_min - Minimum value for the learning rate hyperparameter
tsne_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
tsne_aut_checkbox_metric_cont_euclidean - 1 = checked, 0 = unchecked
tsne_aut_checkbox_metric_cont_manhattan - 1 = checked, 0 = unchecked
tsne_aut_checkbox_metric_bin_jaccard - 1 = checked, 0 = unchecked
tsne_aut_checkbox_metric_bin_dice - 1 = checked, 0 = unchecked

*Optuna*:
n_trials - Number of trials for Optuna hyperparameter optimization

### **Select number of clusters**

*K-means hyperparameters*:
n_clusters - best number of clusters identified

*t-SNE hyperparameters*:
learning_rate - learning rate for the selected trial in the hyperparameter search or manual entry
perplexity - perplexity for the selected trial in the hyperparameter search or manual entry
tsne_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
silhouette_score - silhouette score for the combination of hyperparameters

### **Manual**

*K-means hyperparameters*:
n_clusters - manually entered value for the number of clusters

*t-SNE hyperparameters*:
learning_rate - learning rate for the selected trial in the hyperparameter search or manual entry
perplexity - perplexity for the selected trial in the hyperparameter search or manual entry
tsne_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
tsne_man_selection_metric_cont - 0 = Euclidean, 1 = Manhattan
tsne_man_selection_metric_bin - 0 = Tanimoto/Jaccard, 1 = Dice

### **Silhouette Analysis**

tsne_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
silhouette_score - silhouette score for the combination of hyperparameters

## Visualize clusters

Column 0 - First coordinate of a dimensionality reduction method - UMAP_1 | PCA dimension 0 | t-SNE_1
Column 1 - Second coordinate of a
dimensionality reduction method - UMAP_2 | PCA dimension 1 | t-SNE_2

# Hierarchical clustering

## **UMAP**

### **Select Hyperparameters**

### **Automated/Manual**

aut/man_hyperparameter_selection - automated or manual hyperparameter tuning: 0 = Manual, 1 = Automated

### **Automated**

*Hierarchical clustering hyperparameters:*
min-clusters - Minimum number of clusters
max-clusters - Maximum number of clusters
hierarchical_affinity_bin_or_cont - defines whether the affinity is binary (= 0) or continuous (= 1)
hierarchical_aut_checkbox_affinity_euclidean - 1 = checked, 0 = unchecked
hierarchical_aut_checkbox_affinity_manhattan - 1 = checked, 0 = unchecked
hierarchical_aut_checkbox_linkage_complete - 1 = checked, 0 = unchecked
hierarchical_aut_checkbox_linkage_single - 1 = checked, 0 = unchecked
hierarchical_aut_checkbox_linkage_average - 1 = checked, 0 = unchecked

*UMAP hyperparameters:*
min_dist_max - Maximum value of min_dist
min_dist_min - Minimum value of min_dist
n_neighbors_max - Maximum value of n_neighbors
n_neighbors_min - Minimum value of n_neighbors
umap_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
umap_aut_checkbox_metric_cont_euclidean - 1 = checked, 0 = unchecked
umap_aut_checkbox_metric_cont_manhattan - 1 = checked, 0 = unchecked
umap_aut_checkbox_metric_bin_jaccard - 1 = checked, 0 = unchecked
umap_aut_checkbox_metric_bin_dice - 1 = checked, 0 = unchecked

*Optuna*:
n_trials - Number of trials for Optuna hyperparameter optimization

### **Select number of clusters**

*Hierarchical clustering hyperparameters*:
n_clusters - value for the number of clusters
affinity - euclidean | manhattan
linkage - single | complete | average

*UMAP hyperparameters*:
min_dist - best value identified
n_neighbors - best value identified
silhouette_score - silhouette score for the combination of hyperparameters
umap_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice

### **Manual**

*Hierarchical clustering hyperparameters*:
n_clusters - manually entered value for the number of
clusters
hierarchical_affinity_bin_or_cont - defines whether the affinity is binary (= 0) or continuous (= 1)
hierarchical_man_selection_affinity_cont - Euclidean | Manhattan
hierarchical_man_selection_affinity_bin - Tanimoto | Dice
hierarchical_man_selection_linkage - Single | Complete | Average

*UMAP hyperparameters*:
min_dist - manually entered value
n_neighbors - manually entered value
umap_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
umap_man_selection_metric_cont - 0 = Euclidean, 1 = Manhattan
umap_man_selection_metric_bin - 0 = Tanimoto/Jaccard, 1 = Dice

### **Silhouette Analysis**

linkage_python - "single" | "complete" | "average"
linkage_wf - "Single Linkage" | "Complete Linkage" | "Average Linkage"
affinity_python - euclidean | manhattan
affinity_wf_select - 0 = euclidean | 1 = manhattan
affinity - euclidean | manhattan
linkage - single | complete | average
umap_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
silhouette_score - silhouette score for the combination of hyperparameters

## **PCA**

### **Select Hyperparameters**

### **Automated/Manual**

aut/man_hyperparameter_selection - automated or manual hyperparameter tuning: 0 = Manual, 1 = Automated

### **Automated**

*Hierarchical clustering hyperparameters*:
min-clusters - Minimum number of clusters
max-clusters - Maximum number of clusters
hierarchical_affinity_bin_or_cont - defines whether the affinity is binary (= 0) or continuous (= 1)
hierarchical_aut_checkbox_affinity_euclidean - 1 = checked, 0 = unchecked
hierarchical_aut_checkbox_affinity_manhattan - 1 = checked, 0 = unchecked
hierarchical_aut_checkbox_linkage_complete - 1 = checked, 0 = unchecked
hierarchical_aut_checkbox_linkage_single - 1 = checked, 0 = unchecked
hierarchical_aut_checkbox_linkage_average - 1 = checked, 0 = unchecked

*Optuna*:
n_trials - Number of trials for Optuna hyperparameter optimization

### **Select number of clusters**

*Hierarchical clustering hyperparameters*:
n_clusters - value for the number of clusters
affinity - euclidean |
manhattan
linkage - single | complete | average
silhouette_score - silhouette score for the combination of hyperparameters

### **Manual**

*Hierarchical clustering hyperparameters*:
n_clusters - manually entered value for the number of clusters
hierarchical_affinity_bin_or_cont - defines whether the affinity is binary (= 0) or continuous (= 1)
hierarchical_man_selection_affinity_cont - Euclidean | Manhattan
hierarchical_man_selection_affinity_bin - Tanimoto | Dice
hierarchical_man_selection_linkage - Single | Complete | Average

### **Silhouette Analysis**

affinity - euclidean | manhattan
linkage - single | complete | average
silhouette_score - silhouette score for the combination of hyperparameters

## **t-SNE**

### **Select Hyperparameters**

### **Automated/Manual**

aut/man_hyperparameter_selection - automated or manual hyperparameter tuning: 0 = Manual, 1 = Automated

### **Automated**

*Hierarchical clustering hyperparameters*:
min-clusters - Minimum number of clusters
max-clusters - Maximum number of clusters
hierarchical_affinity_bin_or_cont - defines whether the affinity is binary (= 0) or continuous (= 1)
hierarchical_aut_checkbox_affinity_euclidean - 1 = checked, 0 = unchecked
hierarchical_aut_checkbox_affinity_manhattan - 1 = checked, 0 = unchecked
hierarchical_aut_checkbox_linkage_complete - 1 = checked, 0 = unchecked
hierarchical_aut_checkbox_linkage_single - 1 = checked, 0 = unchecked
hierarchical_aut_checkbox_linkage_average - 1 = checked, 0 = unchecked

*Optuna*:
n_trials - Number of trials for Optuna hyperparameter optimization

*t-SNE hyperparameters*:
perplexity_max - Maximum value for the perplexity hyperparameter
perplexity_min - Minimum value for the perplexity hyperparameter
learning_rate_max - Maximum value for the learning rate hyperparameter
learning_rate_min - Minimum value for the learning rate hyperparameter
tsne_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
tsne_aut_checkbox_metric_cont_euclidean - 1 = checked, 0 = unchecked
tsne_aut_checkbox_metric_cont_manhattan - 1 =
checked, 0 = unchecked
tsne_aut_checkbox_metric_bin_jaccard - 1 = checked, 0 = unchecked
tsne_aut_checkbox_metric_bin_dice - 1 = checked, 0 = unchecked

*Optuna*:
n_trials - Number of trials for Optuna hyperparameter optimization

### **Select number of clusters**

*Hierarchical clustering hyperparameters*:
n_clusters - value for the number of clusters
affinity - euclidean | manhattan
linkage - single | complete | average
silhouette_score - silhouette score for the combination of hyperparameters

*t-SNE hyperparameters*:
learning_rate - learning rate for the selected trial in the hyperparameter search or manual entry
perplexity - perplexity for the selected trial in the hyperparameter search or manual entry
tsne_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
silhouette_score - silhouette score for the combination of hyperparameters

### **Manual**

*Hierarchical clustering hyperparameters*:
n_clusters - manually entered value for the number of clusters
hierarchical_affinity_bin_or_cont - defines whether the affinity is binary (= 0) or continuous (= 1)
hierarchical_man_selection_affinity_cont - Euclidean | Manhattan
hierarchical_man_selection_affinity_bin - Tanimoto | Dice
hierarchical_man_selection_linkage - Single | Complete | Average

*t-SNE hyperparameters*:
learning_rate - learning rate for the selected trial in the hyperparameter search or manual entry
perplexity - perplexity for the selected trial in the hyperparameter search or manual entry
tsne_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
tsne_man_selection_metric_cont - 0 = Euclidean, 1 = Manhattan
tsne_man_selection_metric_bin - 0 = Tanimoto/Jaccard, 1 = Dice

### **Silhouette Analysis**

affinity - euclidean | manhattan
linkage - single | complete | average
tsne_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
silhouette_score - silhouette score for the combination of hyperparameters

## Visualize clusters

Column 0 - First coordinate of a dimensionality reduction method - UMAP_1 | PCA dimension 0
| t-SNE_1
Column 1 - Second coordinate of a dimensionality reduction method - UMAP_2 | PCA dimension 1 | t-SNE_2

# **K-medoids**

## **UMAP**

### **Select Hyperparameters**

### **Automated/Manual**

aut/man_hyperparameter_selection - automated or manual hyperparameter tuning: 0 = Manual, 1 = Automated

### **Automated**

*K-medoids hyperparameters:*
min-clusters - Minimum number of clusters
max-clusters - Maximum number of clusters

*UMAP hyperparameters:*
min_dist_max - Maximum value of min_dist
min_dist_min - Minimum value of min_dist
n_neighbors_max - Maximum value of n_neighbors
n_neighbors_min - Minimum value of n_neighbors
umap_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
umap_aut_checkbox_metric_cont_euclidean - 1 = checked, 0 = unchecked
umap_aut_checkbox_metric_cont_manhattan - 1 = checked, 0 = unchecked
umap_aut_checkbox_metric_bin_jaccard - 1 = checked, 0 = unchecked
umap_aut_checkbox_metric_bin_dice - 1 = checked, 0 = unchecked

*Optuna*:
n_trials - Number of trials for Optuna hyperparameter optimization

### **Select number of clusters**

*K-medoids hyperparameters*:
n_clusters - best number of clusters identified

*UMAP hyperparameters*:
min_dist - best value identified
n_neighbors - best value identified
silhouette_score - silhouette score for the combination of hyperparameters
umap_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice

### **Manual**

*K-medoids hyperparameters*:
n_clusters - manually entered value for the number of clusters

*UMAP hyperparameters*:
min_dist - manually entered value
n_neighbors - manually entered value
umap_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
umap_man_selection_metric_cont - 0 = Euclidean, 1 = Manhattan
umap_man_selection_metric_bin - 0 = Tanimoto/Jaccard, 1 = Dice

### **Silhouette Analysis**

umap_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
silhouette_score - silhouette score for the combination of hyperparameters

## **PCA**

### **Select Hyperparameters**

###
**Automated/Manual**

aut/man_hyperparameter_selection - automated or manual hyperparameter tuning: 0 = Manual, 1 = Automated

### **Automated**

*K-medoids hyperparameters*:
min-clusters - Minimum number of clusters
max-clusters - Maximum number of clusters

*Optuna*:
n_trials - Number of trials for Optuna hyperparameter optimization

### **Select number of clusters**

*K-medoids hyperparameters*:
n_clusters - best number of clusters identified
silhouette_score - silhouette score for the combination of hyperparameters

### **Manual**

*K-medoids hyperparameters*:
n_clusters - manually entered value for the number of clusters

### **Silhouette Analysis**

silhouette_score - silhouette score for the combination of hyperparameters

## **t-SNE**

### **Select Hyperparameters**

### **Automated/Manual**

aut/man_hyperparameter_selection - automated or manual hyperparameter tuning: 0 = Manual, 1 = Automated

### **Automated**

*K-medoids hyperparameters*:
min-clusters - Minimum number of clusters
max-clusters - Maximum number of clusters

*t-SNE hyperparameters*:
perplexity_max - Maximum value for the perplexity hyperparameter
perplexity_min - Minimum value for the perplexity hyperparameter
learning_rate_max - Maximum value for the learning rate hyperparameter
learning_rate_min - Minimum value for the learning rate hyperparameter
tsne_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
tsne_aut_checkbox_metric_cont_euclidean - 1 = checked, 0 = unchecked
tsne_aut_checkbox_metric_cont_manhattan - 1 = checked, 0 = unchecked
tsne_aut_checkbox_metric_bin_jaccard - 1 = checked, 0 = unchecked
tsne_aut_checkbox_metric_bin_dice - 1 = checked, 0 = unchecked

*Optuna*:
n_trials - Number of trials for Optuna hyperparameter optimization

### **Select number of clusters**

*K-medoids hyperparameters*:
n_clusters - best number of clusters identified

*t-SNE hyperparameters*:
learning_rate - learning rate for the selected trial in the hyperparameter search or manual entry
perplexity - perplexity for the selected trial in
the hyperparameter search or manual entry
tsne_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
silhouette_score - silhouette score for the combination of hyperparameters

### **Manual**

*K-medoids hyperparameters*:
n_clusters - manually entered value for the number of clusters

*t-SNE hyperparameters*:
learning_rate - learning rate for the selected trial in the hyperparameter search or manual entry
perplexity - perplexity for the selected trial in the hyperparameter search or manual entry
tsne_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
tsne_man_selection_metric_cont - 0 = Euclidean, 1 = Manhattan
tsne_man_selection_metric_bin - 0 = Tanimoto/Jaccard, 1 = Dice

### **Silhouette Analysis**

tsne_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
silhouette_score - silhouette score for the combination of hyperparameters

## Visualize clusters

Column 0 - First coordinate of a dimensionality reduction method - UMAP_1 | PCA dimension 0 | t-SNE_1
Column 1 - Second coordinate of a dimensionality reduction method - UMAP_2 | PCA dimension 1 | t-SNE_2

# HDBSCAN

## **UMAP**

### **Select Hyperparameters**

### **Automated/Manual**

aut/man_hyperparameter_selection - automated or manual hyperparameter tuning: 0 = Manual, 1 = Automated

### **Automated**

*HDBSCAN hyperparameters:*
max-min_cluster_size - Maximum value of min_cluster_size
min-min_cluster_size - Minimum value of min_cluster_size
max-min_samples - Maximum value of min_samples
min-min_samples - Minimum value of min_samples
max-cluster_selection_epsilon - Maximum value of cluster_selection_epsilon
min-cluster_selection_epsilon - Minimum value of cluster_selection_epsilon
max-alpha - Maximum value of alpha
min-alpha - Minimum value of alpha
hdbscan_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
hdbscan_aut_checkbox_metric_cont_euclidean - 1 = checked, 0 = unchecked
hdbscan_aut_checkbox_metric_cont_manhattan - 1 = checked, 0 = unchecked
hdbscan_aut_checkbox_metric_bin_jaccard - 1 = checked, 0 =
unchecked
hdbscan_aut_checkbox_metric_bin_dice - 1 = checked, 0 = unchecked

*UMAP hyperparameters:*
min_dist_max - Maximum value of min_dist
min_dist_min - Minimum value of min_dist
n_neighbors_max - Maximum value of n_neighbors
n_neighbors_min - Minimum value of n_neighbors
umap_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
umap_aut_checkbox_metric_cont_euclidean - 1 = checked, 0 = unchecked
umap_aut_checkbox_metric_cont_manhattan - 1 = checked, 0 = unchecked
umap_aut_checkbox_metric_bin_jaccard - 1 = checked, 0 = unchecked
umap_aut_checkbox_metric_bin_dice - 1 = checked, 0 = unchecked

*Optuna*:
n_trials - Number of trials for Optuna hyperparameter optimization

### **Select number of clusters**

*HDBSCAN hyperparameters*:
min_cluster_size - value of min_cluster_size
min_samples - value of min_samples
cluster_selection_epsilon - value of cluster_selection_epsilon
alpha - value of alpha
cluster_selection_method - eom | leaf
hdbscan_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice

*UMAP hyperparameters*:
min_dist - best value identified
n_neighbors - best value identified
silhouette_score - silhouette score for the combination of hyperparameters
umap_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice

### **Manual**

*HDBSCAN hyperparameters*:
min_cluster_size - value of min_cluster_size
min_samples - value of min_samples
cluster_selection_epsilon - value of cluster_selection_epsilon
alpha - value of alpha
hdscan_man_selection_cluster_selection_method - eom | leaf
hdbscan_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
hdbscan_man_selection_metric_cont - 0 = Euclidean, 1 = Manhattan
hdbscan_man_selection_metric_bin - 0 = Tanimoto/Jaccard, 1 = Dice

*UMAP hyperparameters*:
min_dist - manually entered value
n_neighbors - manually entered value
umap_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
umap_man_selection_metric_cont - 0 = Euclidean, 1 = Manhattan
umap_man_selection_metric_bin - 0 = Tanimoto/Jaccard, 1 =
Dice

### **Silhouette Analysis**

hdbscan_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
cluster_selection_method - eom | leaf
umap_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
silhouette_score - silhouette score for the combination of hyperparameters

## **PCA**

### **Select Hyperparameters**

### **Automated/Manual**

aut/man_hyperparameter_selection - automated or manual hyperparameter tuning: 0 = Manual, 1 = Automated

### **Automated**

*HDBSCAN hyperparameters*:
max-min_cluster_size - Maximum value of min_cluster_size
min-min_cluster_size - Minimum value of min_cluster_size
max-min_samples - Maximum value of min_samples
min-min_samples - Minimum value of min_samples
max-cluster_selection_epsilon - Maximum value of cluster_selection_epsilon
min-cluster_selection_epsilon - Minimum value of cluster_selection_epsilon
max-alpha - Maximum value of alpha
min-alpha - Minimum value of alpha
hdbscan_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
hdbscan_aut_checkbox_metric_cont_euclidean - 1 = checked, 0 = unchecked
hdbscan_aut_checkbox_metric_cont_manhattan - 1 = checked, 0 = unchecked
hdbscan_aut_checkbox_metric_bin_jaccard - 1 = checked, 0 = unchecked
hdbscan_aut_checkbox_metric_bin_dice - 1 = checked, 0 = unchecked

*Optuna*:
n_trials - Number of trials for Optuna hyperparameter optimization

### **Select number of clusters**

*HDBSCAN hyperparameters*:
min_cluster_size - value of min_cluster_size
min_samples - value of min_samples
cluster_selection_epsilon - value of cluster_selection_epsilon
alpha - value of alpha
cluster_selection_method - eom | leaf
hdbscan_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
silhouette_score - silhouette score for the combination of hyperparameters

### **Manual**

*HDBSCAN hyperparameters*:
min_cluster_size - value of min_cluster_size
min_samples - value of min_samples
cluster_selection_epsilon - value of cluster_selection_epsilon
alpha - value of alpha
hdscan_man_selection_cluster_selection_method - eom |
leaf
hdbscan_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
hdbscan_man_selection_metric_cont - 0 = Euclidean, 1 = Manhattan
hdbscan_man_selection_metric_bin - 0 = Tanimoto/Jaccard, 1 = Dice

### **Silhouette Analysis**

hdbscan_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
cluster_selection_method - eom | leaf
silhouette_score - silhouette score for the combination of hyperparameters

## **t-SNE**

### **Select Hyperparameters**

### **Automated/Manual**

aut/man_hyperparameter_selection - automated or manual hyperparameter tuning: 0 = Manual, 1 = Automated

### **Automated**

*HDBSCAN hyperparameters*:
max-min_cluster_size - Maximum value of min_cluster_size
min-min_cluster_size - Minimum value of min_cluster_size
max-min_samples - Maximum value of min_samples
min-min_samples - Minimum value of min_samples
max-cluster_selection_epsilon - Maximum value of cluster_selection_epsilon
min-cluster_selection_epsilon - Minimum value of cluster_selection_epsilon
max-alpha - Maximum value of alpha
min-alpha - Minimum value of alpha
hdbscan_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
hdbscan_aut_checkbox_metric_cont_euclidean - 1 = checked, 0 = unchecked
hdbscan_aut_checkbox_metric_cont_manhattan - 1 = checked, 0 = unchecked
hdbscan_aut_checkbox_metric_bin_jaccard - 1 = checked, 0 = unchecked
hdbscan_aut_checkbox_metric_bin_dice - 1 = checked, 0 = unchecked

*t-SNE hyperparameters*:
perplexity_max - Maximum value for the perplexity hyperparameter
perplexity_min - Minimum value for the perplexity hyperparameter
learning_rate_max - Maximum value for the learning rate hyperparameter
learning_rate_min - Minimum value for the learning rate hyperparameter
tsne_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
tsne_aut_checkbox_metric_cont_euclidean - 1 = checked, 0 = unchecked
tsne_aut_checkbox_metric_cont_manhattan - 1 = checked, 0 = unchecked
tsne_aut_checkbox_metric_bin_jaccard - 1 = checked, 0 =
unchecked
tsne_aut_checkbox_metric_bin_dice - 1 = checked, 0 = unchecked

*Optuna*:
n_trials - Number of trials for Optuna hyperparameter optimization

### **Select number of clusters**

*HDBSCAN hyperparameters*:
min_cluster_size - value of min_cluster_size
min_samples - value of min_samples
cluster_selection_epsilon - value of cluster_selection_epsilon
alpha - value of alpha
cluster_selection_method - eom | leaf
hdbscan_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice

*t-SNE hyperparameters*:
learning_rate - learning rate for the selected trial in the hyperparameter search or manual entry
perplexity - perplexity for the selected trial in the hyperparameter search or manual entry
tsne_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
silhouette_score - silhouette score for the combination of hyperparameters

### **Manual**

*HDBSCAN hyperparameters*:
min_cluster_size - value of min_cluster_size
min_samples - value of min_samples
cluster_selection_epsilon - value of cluster_selection_epsilon
alpha - value of alpha
hdscan_man_selection_cluster_selection_method - eom | leaf
hdbscan_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
hdbscan_man_selection_metric_cont - 0 = Euclidean, 1 = Manhattan
hdbscan_man_selection_metric_bin - 0 = Tanimoto/Jaccard, 1 = Dice

*t-SNE hyperparameters*:
learning_rate - learning rate for the selected trial in the hyperparameter search or manual entry
perplexity - perplexity for the selected trial in the hyperparameter search or manual entry
tsne_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
tsne_man_selection_metric_cont - 0 = Euclidean, 1 = Manhattan
tsne_man_selection_metric_bin - 0 = Tanimoto/Jaccard, 1 = Dice

### **Silhouette Analysis**

hdbscan_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
cluster_selection_method - eom | leaf
tsne_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
silhouette_score - silhouette score for the combination of hyperparameters

## Visualize
clusters

Column 0 - First coordinate of a dimensionality reduction method - UMAP_1 | PCA dimension 0 | t-SNE_1
Column 1 - Second coordinate of a dimensionality reduction method - UMAP_2 | PCA dimension 1 | t-SNE_2

# DBSCAN

## **UMAP**

### **Select Hyperparameters**

### **Automated/Manual**

aut/man_hyperparameter_selection - automated or manual hyperparameter tuning: 0 = Manual, 1 = Automated

### **Automated**

*DBSCAN hyperparameters:*
max-min_samples - Maximum value of min_samples
min-min_samples - Minimum value of min_samples
max-eps - Maximum value of eps
min-eps - Minimum value of eps
dbscan_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
dbscan_aut_checkbox_metric_cont_euclidean - 1 = checked, 0 = unchecked
dbscan_aut_checkbox_metric_cont_manhattan - 1 = checked, 0 = unchecked
dbscan_aut_checkbox_metric_bin_jaccard - 1 = checked, 0 = unchecked
dbscan_aut_checkbox_metric_bin_dice - 1 = checked, 0 = unchecked

*UMAP hyperparameters:*
min_dist_max - Maximum value of min_dist
min_dist_min - Minimum value of min_dist
n_neighbors_max - Maximum value of n_neighbors
n_neighbors_min - Minimum value of n_neighbors
umap_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
umap_aut_checkbox_metric_cont_euclidean - 1 = checked, 0 = unchecked
umap_aut_checkbox_metric_cont_manhattan - 1 = checked, 0 = unchecked
umap_aut_checkbox_metric_bin_jaccard - 1 = checked, 0 = unchecked
umap_aut_checkbox_metric_bin_dice - 1 = checked, 0 = unchecked

*Optuna*:
n_trials - Number of trials for Optuna hyperparameter optimization

### **Select number of clusters**

*DBSCAN hyperparameters*:
min_samples - value of min_samples
eps - value of eps
dbscan_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice

*UMAP hyperparameters*:
min_dist - best value identified
n_neighbors - best value identified
silhouette_score - silhouette score for the combination of hyperparameters
umap_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice

### **Manual**

*DBSCAN hyperparameters*:
min_samples -
value of min_samples
eps - value of eps
dbscan_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
dbscan_man_selection_metric_cont - 0 = Euclidean, 1 = Manhattan
dbscan_man_selection_metric_bin - 0 = Tanimoto/Jaccard, 1 = Dice

*UMAP hyperparameters*:
min_dist - manually entered value
n_neighbors - manually entered value
umap_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
umap_man_selection_metric_cont - 0 = Euclidean, 1 = Manhattan
umap_man_selection_metric_bin - 0 = Tanimoto/Jaccard, 1 = Dice

### **Silhouette Analysis**

dbscan_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
umap_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
silhouette_score - silhouette score for the combination of hyperparameters

## **PCA**

### **Select Hyperparameters**

### **Automated/Manual**

aut/man_hyperparameter_selection - automated or manual hyperparameter tuning - 0 = Manual, 1 = Automated

### **Automated**

*DBSCAN hyperparameters*:
max-min_samples - Maximum value of min_samples
min-min_samples - Minimum value of min_samples
max-eps - Maximum value of eps
min-eps - Minimum value of eps
dbscan_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
dbscan_aut_checkbox_metric_cont_euclidean - 1 = checked, 0 = unchecked
dbscan_aut_checkbox_metric_cont_manhattan - 1 = checked, 0 = unchecked
dbscan_aut_checkbox_metric_bin_jaccard - 1 = checked, 0 = unchecked
dbscan_aut_checkbox_metric_bin_dice - 1 = checked, 0 = unchecked

*Optuna*:
n_trials - Number of trials for Optuna hyperparameter optimization

### **Select number of clusters**

*DBSCAN hyperparameters*:
min_samples - value of min_samples
eps - value of eps
silhouette_score - silhouette score for the combination of hyperparameters

### **Manual**

*DBSCAN hyperparameters*:
min_samples - value of min_samples
eps - value of eps
dbscan_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
dbscan_man_selection_metric_cont - 0 = Euclidean, 1 =
Manhattan
dbscan_man_selection_metric_bin - 0 = Tanimoto/Jaccard, 1 = Dice

### **Silhouette Analysis**

dbscan_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
silhouette_score - silhouette score for the combination of hyperparameters

## Visualize clusters

Column 0 - First coordinate of a dimensionality reduction method - UMAP_1 | PCA dimension 0 | t-SNE_1
Column 1 - Second coordinate of a dimensionality reduction method - UMAP_2 | PCA dimension 1 | t-SNE_2

## **t-SNE**

### **Select Hyperparameters**

### **Automated/Manual**

aut/man_hyperparameter_selection - automated or manual hyperparameter tuning - 0 = Manual, 1 = Automated

### **Automated**

*DBSCAN hyperparameters*:
max-min_samples - Maximum value of min_samples
min-min_samples - Minimum value of min_samples
max-eps - Maximum value of eps
min-eps - Minimum value of eps
dbscan_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
dbscan_aut_checkbox_metric_cont_euclidean - 1 = checked, 0 = unchecked
dbscan_aut_checkbox_metric_cont_manhattan - 1 = checked, 0 = unchecked
dbscan_aut_checkbox_metric_bin_jaccard - 1 = checked, 0 = unchecked
dbscan_aut_checkbox_metric_bin_dice - 1 = checked, 0 = unchecked

*t-SNE hyperparameters*:
perplexity_max - Maximum value for the perplexity hyperparameter
perplexity_min - Minimum value for the perplexity hyperparameter
learning_rate_max - Maximum value for the learning rate hyperparameter
learning_rate_min - Minimum value for the learning rate hyperparameter
tsne_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
tsne_aut_checkbox_metric_cont_euclidean - 1 = checked, 0 = unchecked
tsne_aut_checkbox_metric_cont_manhattan - 1 = checked, 0 = unchecked
tsne_aut_checkbox_metric_bin_jaccard - 1 = checked, 0 = unchecked
tsne_aut_checkbox_metric_bin_dice - 1 = checked, 0 = unchecked

*Optuna*:
n_trials - Number of trials for Optuna hyperparameter optimization

### **Select number of clusters**

*DBSCAN hyperparameters*:
min_samples - value of min_samples
eps - value of
eps
dbscan_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice

*t-SNE hyperparameters*:
learning_rate - learning rate for the selected trial in the hyperparameter search, or the manually entered value
perplexity - perplexity for the selected trial in the hyperparameter search, or the manually entered value
tsne_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
silhouette_score - silhouette score for the combination of hyperparameters

### **Manual**

*DBSCAN hyperparameters*:
min_samples - value of min_samples
eps - value of eps
dbscan_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
dbscan_man_selection_metric_cont - 0 = Euclidean, 1 = Manhattan
dbscan_man_selection_metric_bin - 0 = Tanimoto/Jaccard, 1 = Dice

*t-SNE hyperparameters*:
learning_rate - learning rate for the selected trial in the hyperparameter search, or the manually entered value
perplexity - perplexity for the selected trial in the hyperparameter search, or the manually entered value
tsne_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
tsne_man_selection_metric_cont - 0 = Euclidean, 1 = Manhattan
tsne_man_selection_metric_bin - 0 = Tanimoto/Jaccard, 1 = Dice

### **Silhouette Analysis**

dbscan_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
tsne_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
silhouette_score - silhouette score for the combination of hyperparameters

## Visualize clusters

Column 0 - First coordinate of a dimensionality reduction method - UMAP_1 | PCA dimension 0 | t-SNE_1
Column 1 - Second coordinate of a dimensionality reduction method - UMAP_2 | PCA dimension 1 | t-SNE_2

# OpenAI Use and Key

use_gpt_interpretation - Yes | No
string-input-test - OpenAI API Key

# Visualize Feature importance

n_clusters - Number of clusters
new_cluster_number - Convert -1 (Outliers) to a number higher than the highest cluster number
supervised_ML_selection_for_unsupervised - variable to select the supervised ML algorithm for cluster interpretation - for cases of manual dimensionality reduction and
unsupervised clustering - LightGBM will be used
ml_algorithm-selection (index) - 0 = LightGBM | 1 = Random Forest | 2 = Support Vector Machine | 3 = Logistic Regression | 4 = K-Nearest Neighbors | 5 = Naive Bayes
cluster_with_outlier_selection - column name to process if the clustering algorithm produces outlier output
Number Columns - total number of descriptors available
Number Rows - total number of chemicals
binary_or_multiclass - defines whether the supervised ML algorithm used for the interpretation is binary (fewer than 3 clusters) or multiclass (more than 2 clusters)
num_features_viz - number of descriptors selected for the importance visualization
check_combination_binary_descriptors:
(1) -> Nothing
(2) -> Only MACCS
(3) -> Only FeatMorgan
(4) -> FeatMorgan and MACCS
(5) -> Only Morgan
(6) -> Morgan and MACCS
(7) -> Morgan and FeatMorgan
(8) -> Morgan and FeatMorgan and MACCS

# Workflow switches

IF clustering method (SHAP values or original descriptors) - Output 1: K-means | Output 2: Hierarchical clustering | Output 3: K-medoids | Output 4: HDBSCAN | Output 5: DBSCAN
IF Switch Labeled/Unlabeled - Output 1: Labeled | Output 2: Unlabeled
IF Switch Supervised/Unsupervised - Port 1: Supervised | Port 2: Unsupervised
IF Switch past_input-data-type - Port 1: labeled | Port 2: unlabeled
IF Switch past_supervised/unsupervised-selection - Port 1: Supervised | Port 2: Unsupervised
IF Switch past_SHAP-selection - Port 1: Yes | Port 2: No
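The index conventions listed in the flow variables can be easier to follow as code. The sketch below is only an illustration in plain Python: the function names (`resolve_metric`, `relabel_outliers`, `combination_code`, `binary_or_multiclass`) are hypothetical and do not come from the workflow, whose actual implementation is not shown here. It assumes that the `*_metric_bin_or_cont` flag chooses between the two selector variables, that outliers labeled -1 receive the next free cluster number, and that the `check_combination_binary_descriptors` codes follow the arithmetic pattern visible in the table (1 plus 4 for Morgan, 2 for FeatMorgan, 1 for MACCS), which reproduces all eight listed values.

```python
from typing import List, Sequence

# Hypothetical lookup tables built from the documented index conventions.
CONTINUOUS_METRICS = ("Euclidean", "Manhattan")   # *_man_selection_metric_cont
BINARY_METRICS = ("Tanimoto/Jaccard", "Dice")     # *_man_selection_metric_bin


def resolve_metric(bin_or_cont: int, cont_index: int, bin_index: int) -> str:
    """Turn a *_metric_bin_or_cont flag and its two selectors into a metric name."""
    if bin_or_cont == 0:      # 0 = binary metric
        return BINARY_METRICS[bin_index]
    if bin_or_cont == 1:      # 1 = continuous metric
        return CONTINUOUS_METRICS[cont_index]
    raise ValueError(f"unexpected bin_or_cont value: {bin_or_cont}")


def relabel_outliers(labels: Sequence[int]) -> List[int]:
    """Replace the -1 outlier label with a number above the highest cluster label."""
    new_cluster_number = max(labels) + 1   # assumes at least one label is present
    return [new_cluster_number if label == -1 else label for label in labels]


def combination_code(morgan: bool, featmorgan: bool, maccs: bool) -> int:
    """Reproduce the (1)..(8) table of check_combination_binary_descriptors."""
    return 1 + 4 * int(morgan) + 2 * int(featmorgan) + int(maccs)


def binary_or_multiclass(n_clusters: int) -> str:
    """Fewer than 3 clusters is treated as binary, otherwise multiclass."""
    return "binary" if n_clusters < 3 else "multiclass"


if __name__ == "__main__":
    print(resolve_metric(bin_or_cont=0, cont_index=0, bin_index=1))    # Dice
    print(relabel_outliers([0, 1, -1, 1, -1]))                         # [0, 1, 2, 1, 2]
    print(combination_code(morgan=True, featmorgan=False, maccs=True)) # 6
    print(binary_or_multiclass(2))                                     # binary
```

Keeping the decoding in small pure functions like these makes each documented convention checkable on its own, independent of the clustering or visualization method selected.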
Start CASE Switch End CASE Switch Start CASE Switch End CASE Switch End CASE Switch Start CASE Switch Start CASE Switch Start CASE Switch Start CASE Switch End CASE Switch End Joiner CASE Switch End CASE Switch Start Select moleculardescriptors Call WorkflowService CSV-XLS_read Call WorkflowService Call WorkflowService Call WorkflowService Calculate Binarymolecular descriptors Calculate Continuousmolecular descriptors K-MEANS Clustering - Hyperparameters Search If projection cluster If projection cluster - YES If visualization mode UMAP Feature importance for each cluster Data processing Download/Report Clustering options Unlabeled/Unsupervised Labeled/Supervised Input files SDF CSV, XLS, XLSX # **Flow variables**# **Upload**new/past_analysis - Start New Analysis OR View Past Results OR Start New Analysis with Prior Configurationnew/past_analysis (index) - 0 = Start New Analysis 1= View Past Results 2= Start New Analysis with Prior Configuration****file-upload - Path to the uploaded filefile-upload_variables - Path to the variables uploaded file in .variables extensionfilePath - path to the local input filestruct_label_file_name - name of the chemical structures/labels uploaded filefile_extension - extension of the o the uploaded fileextension - extension of the uploaded file("sdf")=>1("smi")=>2("csv")=>3("xls","lsx")=>4input-data-type - Labeled or Unlabeledinput-data-type (index) - 0 = labeled, 1 = unlabeledis_prior - New analysis OR Visualize a past analysis OR Start New Analysis with Prior Configurationis_prior (index) - 0 = New analysis, > 0 = Visualize a past analysis OR Start New Analysis with Prior Configurationimage - MoVIZ logopast_input-data-type - 0 = labeled, 1 = unlabeledpast_supervised/unsupervised-selection - 0 = Supervised | 1= Unsupervisednew_input-data-type - 0 = labeled, 1 = unlabeledpast_SHAP-selection - 0 = Yes | 1 = No## **SDF/SMI/CSV/XLS/XLSX**activity_column - column containing the labelsactivity_column_type - Classification OR Multi-class 
OR Continuousactivity_column_type (index) - 0= Classification 1= Multi-class 2= ContinuoussmiColumn - column containing SMILES## **Exploratory analysis**missing_values_count - number of missing valuesshow_missing_values_table - 0 = no show 1 = showunreadable_count - number of unreadable chemicalstotal_chemicals_count - total number of chemicals# **Select molecular descriptors**binary/cont descriptor-selection - Binary descriptors OR Continuous descriptorsbinary/cont descriptor-selection (index) - 0= Binary descriptors 1= Continuous descriptorsmorgan - 0= unselected 1= selectedfeatmorgan - 0= unselected 1= selectedmaccs - 0= unselected 1= selectedpadel-descriptors - 0= unselected 1= selectedrdkit-descriptors - 0= unselected 1= selectedmordred - 0= unselected 1= selected## **Calculate Binary molecular descriptors**morgan-fplen - morgan number of bitsradius-morgan - Morgan radiusfeatmorgan-fplen - featmorgan number of bitsradius-featmorgan - featmorgan radius## **Calculate Continuous molecular descriptors**normalize - 0= No normalize the descriptors 1= normalize the descriptorsnormalization_method - Min-max scaling / Z-score normalization / Normalization by decimal scalingpadel_2d_select - selected PaDEL 2D descriptors# **Dimensionality reduction**aut/man_dimensionality-selection - automated OR manual dimensionality reductionhigh_correlated_filter - 0= unselected 1= selectedlow_variance_filter - 0= unselected 1= selectedlow_variance_threshold - The low-variance threshold valuehigh_correlated_threshold - The correlation threshold valuedataset_split_method - Single split (80/20) | Cross-validation (5-fold)## **ML method**ml_algorithm-selection - LightGBM | Random Forest | Support Vector Machine | Logistic Regression | K-Nearest Neighbors | Naive Bayessupervised_ML_selected_before_yes_no - Var to check if the supervised ML method was selected before - 0=No | 1=Yes# **IF Switch Supervised/Unsupervised**supervised/unsupervised-selection - Supervised OR 
Unsupervisednew_supervised/unsupervised-selection - Supervised OR Unsupervised# **Variable Selection - Labeled**aut/man_feature-selection - Automated | Manual | Use all variablesautomated_var_method-selection - Recursive Feature Elimination | Genetic Algorithm | Simulated Annealingcount_columns_after_dimensionaty - number of columns after dimensionality reduction.ga_cv - genetic algorithm cross-validation generator or an iterable - determines the cross-validation splitting strategy.ga_max_features - genetic algorithm maximum number of features selected.ga_n_population - number of population for the genetic algorithm.ga_crossover_proba - Probability of crossover for the genetic algorithm.ga_n_generations - Number of generations for the genetic algorithm.# **Variable selection Unlabeled or Unsupervised**aut/man_feature-selection - Manual | Use all variables# **IF SHAP Yes/No**SHAP-selection - Yes | Nonum_features_viz_global - Number of features to visualize in SHAP graphNumber Columns - number of features (descriptors) available# **Select clustering options**clustering-selection - Select the clustering algorithm - K-means | Hierarchical clustering | K-medoids | HDBSCAN | DBSCANclustering-selection - (index) - index of the selected clustering algorithmvisualization-selection - Select the visualization method - UMAP | PCA | t-SNEvisualization-selection - (index) - index of the selected visualization methodprojection-selection - Select projection cluster - yes | noprojection-selection - (index) - index of the selected projection cluster yes or no# **K-means**## **UMAP**### **Select Hyperparamenters**### **Automated/Manual**aut/man_hyperparameter_selection - automated or manual hyperparameter tuning - 0 = Manual 1 = automated### **Automated***K-means hyperparameters:*min-clusters - Minimum number of clustersmax-clusters - Maximum number of clusters*UMAP hyperparameters:*min_dist_max - Maximum number of min_distmin_dist_min - Minimum number of min_distn_neighbors_max - 
Minimum number of n_neighborsn_neighbors_min - Maximum number of n_neighborsumap_metric_bin_or_cont - define if the metric is binary (= 0) or continuous (= 1)umap_aut_checkbox_metric_cont_euclidean - 1 = checked 0 = uncheckedumap_aut_checkbox_metric_cont_manhattan - 1 = checked 0 = uncheckedumap_aut_checkbox_metric_bin_jaccard - 1 = checked 0 = uncheckedumap_aut_checkbox_metric_bin_dice - 1 = checked 0 = unchecked*Optuna*:n_trials - Number of trials for optuna hyperparameter optimization### **Select number of clusters***K-means hyperparameters*:n_clusters - best number of clusters identified*UMAP hyperparameters*:min_dist - best value identifiedn_neighbors - best value identifiedsilhouette_score - silhouette score for the combination of hyperparametersumap_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice### **Manual***K-means hyperparameters*:n_clusters - manually entered value for the number of clusters*UMAP hyperparameters*:min_dist - manually entered valuen_neighbors - manually entered valueumap_metric_bin_or_cont - define if the metric is binary (= 0) or continuous (= 1)umap_man_selection_metric_cont - 0 = Euclidean 1 = Manhattanumap_man_selection_metric_bin - 0 = Tanimoto/Jaccard 1 = Dice### **Silhouette Analysis**umap_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dicesilhouette_score - silhouette score for the combination of hyperparameters## **PCA**### **Select Hyperparamenters**### **Automated/Manual**aut/man_hyperparameter_selection - automated or manual hyperparameter tuning - 0 = Manual 1 = automated### **Automated***K-means hyperparameters*:min-clusters - Minimum number of clustersmax-clusters - Maximum number of clusters*Optuna*:n_trials - Number of trials for optuna hyperparameter optimization### **Select number of clusters***K-means hyperparameters*:n_clusters - best number of clusters identifiedsilhouette_score - silhouette score for the combination of hyperparameters### **Manual***K-means hyperparameters*:n_clusters - manually entered 
value for the number of clusters### **Silhouette Analysis**silhouette_score - silhouette score for the combination of hyperparameters## **t-SNE**### **Select Hyperparamenters**### **Automated/Manual**aut/man_hyperparameter_selection - automated or manual hyperparameter tuning - 0 = Manual 1 = automated### **Automated***K-means hyperparameters*:min-clusters - Minimum number of clustersmax-clusters - Maximum number of clusters*t-SNE hyperparameters*:perplexity_max - Maximum value for the perplexity hyperparameterperplexity_min - Minimum value for the perplexity hyperparameterlearning_rate_max - Maximum value for the learning rate hyperparameterlearning_rate_min - Minimum value for the learning rate hyperparametertsne_metric_bin_or_cont - define if the metric is binary (= 0) or continuous (= 1)tsne_aut_checkbox_metric_cont_euclidean - 1 = checked 0 = uncheckedtsne_aut_checkbox_metric_cont_manhattan - 1 = checked 0 = uncheckedtsne_aut_checkbox_metric_bin_jaccard - 1 = checked 0 = uncheckedtsne_aut_checkbox_metric_bin_dice - 1 = checked 0 = unchecked*Optuna*:n_trials - Number of trials for optuna hyperparameter optimization### **Select number of clusters***K-means hyperparameters*:n_clusters - best number of clusters identified*t-SNE hyperparameters*:learning_rate - learning rate for the selected trial in the hyperparameter search or manual enteringperplexity - perplexity for the selected trial in the hyperparameter search or manual enteringtsne_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dicesilhouette_score - silhouette score for the combination of hyperparameters### **Manual***K-means hyperparameters*:n_clusters - best number of clusters identified*t-SNE hyperparameters*:learning_rate - learning rate for the selected trial in the hyperparameter search or manual enteringperplexity - perplexity for the selected trial in the hyperparameter search or manual enteringtsne_metric_bin_or_cont - define if the metric is binary (= 0) or continuous (= 
1)tsne_man_selection_metric_cont - 0 = Euclidean 1 = Manhattantsne_man_selection_metric_bin - 0 = Tanimoto/Jaccard 1 = Dice### **Silhouette Analysis**tsne_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dicesilhouette_score - silhouette score for the combination of hyperparameters## Visualize clustersColumn 0 - First coordinate of a dimensionality reduction method - UMAP_1 | PCA dimension 0 | t-SNE_1Column 1 - Second coordinate of a dimensionality reduction method - UMAP_2 | PCA dimension 1 | t-SNE_2# Hierarchical clustering## **UMAP**### **Select Hyperparamenters**### **Automated/Manual**aut/man_hyperparameter_selection - automated or manual hyperparameter tuning - 0 = Manual 1 = automated### **Automated***Hierarchical clustering hyperparameters:*min-clusters - Minimum number of clustersmax-clusters - Maximum number of clustershierarchical_affinity_bin_or_cont - define if the affinity is binary (= 0) or continuous (= 1)hierarchical_aut_checkbox_affinity_euclidean - 1 = checked 0 = uncheckedhierarchical_aut_checkbox_affinity_manhattan - 1 = checked 0 = uncheckedhierarchical_aut_checkbox_linkage_complete - 1 = checked 0 = uncheckedhierarchical_aut_checkbox_linkage_single - 1 = checked 0 = uncheckedhierarchical_aut_checkbox_linkage_average- 1 = checked 0 = unchecked*UMAP hyperparameters:*min_dist_max - Maximum number of min_distmin_dist_min - Minimum number of min_distn_neighbors_max - Minimum number of n_neighborsn_neighbors_min - Maximum number of n_neighborsumap_metric_bin_or_cont - define if the metric is binary (= 0) or continuous (= 1)umap_aut_checkbox_metric_cont_euclidean - 1 = checked 0 = uncheckedumap_aut_checkbox_metric_cont_manhattan - 1 = checked 0 = uncheckedumap_aut_checkbox_metric_bin_jaccard - 1 = checked 0 = uncheckedumap_aut_checkbox_metric_bin_dice - 1 = checked 0 = unchecked*Optuna*:n_trials - Number of trials for optuna hyperparameter optimization### **Select number of clusters***Hierarchical clustering hyperparameters*:n_clusters - value 
for the number of clustersaffinity - euclidean | manhattanlinkage - single | complete | average*UMAP hyperparameters*:min_dist - best value identifiedn_neighbors - best value identifiedsilhouette_score - silhouette score for the combination of hyperparametersumap_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice### **Manual***Hierarchical clustering hyperparameters*:n_clusters - manually entered value for the number of clustershierarchical_affinity_bin_or_cont - define if the affinity is binary (= 0) or continuous (= 1)hierarchical_man_selection_affinity_cont - Euclidean | Manhattanhierarchical_man_selection_affinity_bin - Tanimoto | Dicehierarchical_man_selection_linkage - Single | Complete | Average*UMAP hyperparameters*:min_dist - manually entered valuen_neighbors - manually entered valueumap_metric_bin_or_cont - define if the metric is binary (= 0) or continuous (= 1)umap_man_selection_metric_cont - 0 = Euclidean 1 = Manhattanumap_man_selection_metric_bin - 0 = Tanimoto/Jaccard 1 = Dice### **Silhouette Analysis**linkage_python - “single” | “complete” | “average”linkage_wf - "Single Linkage” | "Complete Linkage”| "Average Linkage”affinity_python - euclidean | manhattanaffinity_wf_select - 0 = euclidean | 1 = manhattanaffinity - euclidean | manhattanlinkage - single | complete | averageumap_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dicesilhouette_score - silhouette score for the combination of hyperparameters## **PCA**### **Select Hyperparamenters**### **Automated/Manual**aut/man_hyperparameter_selection - automated or manual hyperparameter tuning - 0 = Manual 1 = automated### **Automated***Hierarchical clustering hyperparameters*:min-clusters - Minimum number of clustersmax-clusters - Maximum number of clustershierarchical_affinity_bin_or_cont - define if the affinity is binary (= 0) or continuous (= 1)hierarchical_aut_checkbox_affinity_euclidean - 1 = checked 0 = uncheckedhierarchical_aut_checkbox_affinity_manhattan - 1 = checked 0 = 
uncheckedhierarchical_aut_checkbox_linkage_complete - 1 = checked 0 = uncheckedhierarchical_aut_checkbox_linkage_single - 1 = checked 0 = uncheckedhierarchical_aut_checkbox_linkage_average- 1 = checked 0 = unchecked*Optuna*:n_trials - Number of trials for optuna hyperparameter optimization### **Select number of clusters***Hierarchical clustering hyperparameters*:n_clusters - value for the number of clustersaffinity - euclidean | manhattanlinkage - single | complete | averagesilhouette_score - silhouette score for the combination of hyperparameters### **Manual***Hierarchical clustering hyperparameters*:n_clusters - manually entered value for the number of clustershierarchical_affinity_bin_or_cont - define if the affinity is binary (= 0) or continuous (= 1)hierarchical_man_selection_affinity_cont - Euclidean | Manhattanhierarchical_man_selection_affinity_bin - Tanimoto | Dicehierarchical_man_selection_linkage - Single | Complete | Average**Silhouette Analysis**affinity - euclidean | manhattanlinkage - single | complete | averagesilhouette_score - silhouette score for the combination of hyperparameters## **t-SNE**### **Select Hyperparamenters**### **Automated/Manual**aut/man_hyperparameter_selection - automated or manual hyperparameter tuning - 0 = Manual 1 = automated### **Automated***Hierarchical clustering hyperparameters*:min-clusters - Minimum number of clustersmax-clusters - Maximum number of clustershierarchical_affinity_bin_or_cont - define if the affinity is binary (= 0) or continuous (= 1)hierarchical_aut_checkbox_affinity_euclidean - 1 = checked 0 = uncheckedhierarchical_aut_checkbox_affinity_manhattan - 1 = checked 0 = uncheckedhierarchical_aut_checkbox_linkage_complete - 1 = checked 0 = uncheckedhierarchical_aut_checkbox_linkage_single - 1 = checked 0 = uncheckedhierarchical_aut_checkbox_linkage_average- 1 = checked 0 = unchecked*Optuna*:n_trials - Number of trials for optuna hyperparameter optimization*t-SNE hyperparameters*:perplexity_max - Maximum 
value for the perplexity hyperparameterperplexity_min - Minimum value for the perplexity hyperparameterlearning_rate_max - Maximum value for the learning rate hyperparameterlearning_rate_min - Minimum value for the learning rate hyperparametertsne_metric_bin_or_cont - define if the metric is binary (= 0) or continuous (= 1)tsne_aut_checkbox_metric_cont_euclidean - 1 = checked 0 = uncheckedtsne_aut_checkbox_metric_cont_manhattan - 1 = checked 0 = uncheckedtsne_aut_checkbox_metric_bin_jaccard - 1 = checked 0 = uncheckedtsne_aut_checkbox_metric_bin_dice - 1 = checked 0 = unchecked*Optuna*:n_trials - Number of trials for optuna hyperparameter optimization### **Select number of clusters***Hierarchical clustering hyperparameters*:n_clusters - value for the number of clustersaffinity - euclidean | manhattanlinkage - single | complete | averagesilhouette_score - silhouette score for the combination of hyperparameters*t-SNE hyperparameters*:learning_rate - learning rate for the selected trial in the hyperparameter search or manual enteringperplexity - perplexity for the selected trial in the hyperparameter search or manual enteringtsne_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dicesilhouette_score - silhouette score for the combination of hyperparameters### **Manual***Hierarchical clustering hyperparameters*:n_clusters - manually entered value for the number of clustershierarchical_affinity_bin_or_cont - define if the affinity is binary (= 0) or continuous (= 1)hierarchical_man_selection_affinity_cont - Euclidean | Manhattanhierarchical_man_selection_affinity_bin - Tanimoto | Dicehierarchical_man_selection_linkage - Single | Complete | Average*t-SNE hyperparameters*:learning_rate - learning rate for the selected trial in the hyperparameter search or manual enteringperplexity - perplexity for the selected trial in the hyperparameter search or manual enteringtsne_metric_bin_or_cont - define if the metric is binary (= 0) or continuous (= 
1)tsne_man_selection_metric_cont - 0 = Euclidean 1 = Manhattantsne_man_selection_metric_bin - 0 = Tanimoto/Jaccard 1 = Dice### **Silhouette Analysis**affinity - euclidean | manhattanlinkage - single | complete | averagetsne_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dicesilhouette_score - silhouette score for the combination of hyperparameters## Visualize clustersColumn 0 - First coordinate of a dimensionality reduction method - UMAP_1 | PCA dimension 0 | t-SNE_1Column 1 - Second coordinate of a dimensionality reduction method - UMAP_2 | PCA dimension 1 | t-SNE_2# **K-medoids**## **UMAP**### **Select Hyperparamenters**### **Automated/Manual**aut/man_hyperparameter_selection - automated or manual hyperparameter tuning - 0 = Manual 1 = automated### **Automated***K-medoids hyperparameters:*min-clusters - Minimum number of clustersmax-clusters - Maximum number of clusters*UMAP hyperparameters:*min_dist_max - Maximum number of min_distmin_dist_min - Minimum number of min_distn_neighbors_max - Minimum number of n_neighborsn_neighbors_min - Maximum number of n_neighborsumap_metric_bin_or_cont - define if the metric is binary (= 0) or continuous (= 1)umap_aut_checkbox_metric_cont_euclidean - 1 = checked 0 = uncheckedumap_aut_checkbox_metric_cont_manhattan - 1 = checked 0 = uncheckedumap_aut_checkbox_metric_bin_jaccard - 1 = checked 0 = uncheckedumap_aut_checkbox_metric_bin_dice - 1 = checked 0 = unchecked*Optuna*:n_trials - Number of trials for optuna hyperparameter optimization### **Select number of clusters***K-medoids hyperparameters*:n_clusters - best number of clusters identified*UMAP hyperparameters*:min_dist - best value identifiedn_neighbors - best value identifiedsilhouette_score - silhouette score for the combination of hyperparametersumap_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice### **Manual***K-medoids hyperparameters*:n_clusters - manually entered value for the number of clusters*UMAP hyperparameters*:min_dist - manually entered 
valuen_neighbors - manually entered valueumap_metric_bin_or_cont - define if the metric is binary (= 0) or continuous (= 1)umap_man_selection_metric_cont - 0 = Euclidean 1 = Manhattanumap_man_selection_metric_bin - 0 = Tanimoto/Jaccard 1 = Dice### **Silhouette Analysis**umap_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dicesilhouette_score - silhouette score for the combination of hyperparameters## **PCA**### **Select Hyperparamenters**### **Automated/Manual**aut/man_hyperparameter_selection - automated or manual hyperparameter tuning - 0 = Manual 1 = automated### **Automated***K-medoids hyperparameters*:min-clusters - Minimum number of clustersmax-clusters - Maximum number of clusters*Optuna*:n_trials - Number of trials for optuna hyperparameter optimization### **Select number of clusters***K-medoids hyperparameters*:n_clusters - best number of clusters identifiedsilhouette_score - silhouette score for the combination of hyperparameters### **Manual***K-medoids hyperparameters*:n_clusters - manually entered value for the number of clusters### **Silhouette Analysis**silhouette_score - silhouette score for the combination of hyperparameters## **t-SNE**### **Select Hyperparamenters**### **Automated/Manual**aut/man_hyperparameter_selection - automated or manual hyperparameter tuning - 0 = Manual 1 = automated### **Automated***K-medoids hyperparameters*:min-clusters - Minimum number of clustersmax-clusters - Maximum number of clusters*t-SNE hyperparameters*:perplexity_max - Maximum value for the perplexity hyperparameterperplexity_min - Minimum value for the perplexity hyperparameterlearning_rate_max - Maximum value for the learning rate hyperparameterlearning_rate_min - Minimum value for the learning rate hyperparametertsne_metric_bin_or_cont - define if the metric is binary (= 0) or continuous (= 1)tsne_aut_checkbox_metric_cont_euclidean - 1 = checked 0 = uncheckedtsne_aut_checkbox_metric_cont_manhattan - 1 = checked 0 = 
uncheckedtsne_aut_checkbox_metric_bin_jaccard - 1 = checked 0 = uncheckedtsne_aut_checkbox_metric_bin_dice - 1 = checked 0 = unchecked*Optuna*:n_trials - Number of trials for optuna hyperparameter optimization### **Select number of clusters***K-medoids hyperparameters*:n_clusters - best number of clusters identified*t-SNE hyperparameters*:learning_rate - learning rate for the selected trial in the hyperparameter search or manual enteringperplexity - perplexity for the selected trial in the hyperparameter search or manual enteringtsne_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dicesilhouette_score - silhouette score for the combination of hyperparameters### **Manual***K-medoids hyperparameters*:n_clusters - best number of clusters identified*t-SNE hyperparameters*:learning_rate - learning rate for the selected trial in the hyperparameter search or manual enteringperplexity - perplexity for the selected trial in the hyperparameter search or manual enteringtsne_metric_bin_or_cont - define if the metric is binary (= 0) or continuous (= 1)tsne_man_selection_metric_cont - 0 = Euclidean 1 = Manhattantsne_man_selection_metric_bin - 0 = Tanimoto/Jaccard 1 = Dice### **Silhouette Analysis**tsne_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dicesilhouette_score - silhouette score for the combination of hyperparameters## Visualize clustersColumn 0 - First coordinate of a dimensionality reduction method - UMAP_1 | PCA dimension 0 | t-SNE_1Column 1 - Second coordinate of a dimensionality reduction method - UMAP_2 | PCA dimension 1 | t-SNE_2# HDBSCAN## **UMAP**### **Select Hyperparamenters**### **Automated/Manual**aut/man_hyperparameter_selection - automated or manual hyperparameter tuning - 0 = Manual 1 = automated### **Automated***HDBSCAN hyperparameters:*max-min_cluster_size - Maximum value of min_cluster_sizemin-min_cluster_size - Minimum value of min_cluster_sizemax-min_samples - Maximum value of min_samplesmin-min_samples - Minimum value of 
min_samples
max-cluster_selection_epsilon - Maximum value of cluster_selection_epsilon
min-cluster_selection_epsilon - Minimum value of cluster_selection_epsilon
max-alpha - Maximum value of alpha
min-alpha - Minimum value of alpha
hdbscan_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
hdbscan_aut_checkbox_metric_cont_euclidean - 1 = checked, 0 = unchecked
hdbscan_aut_checkbox_metric_cont_manhattan - 1 = checked, 0 = unchecked
hdbscan_aut_checkbox_metric_bin_jaccard - 1 = checked, 0 = unchecked
hdbscan_aut_checkbox_metric_bin_dice - 1 = checked, 0 = unchecked
*UMAP hyperparameters*:
min_dist_max - Maximum value of min_dist
min_dist_min - Minimum value of min_dist
n_neighbors_max - Maximum number of n_neighbors
n_neighbors_min - Minimum number of n_neighbors
umap_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
umap_aut_checkbox_metric_cont_euclidean - 1 = checked, 0 = unchecked
umap_aut_checkbox_metric_cont_manhattan - 1 = checked, 0 = unchecked
umap_aut_checkbox_metric_bin_jaccard - 1 = checked, 0 = unchecked
umap_aut_checkbox_metric_bin_dice - 1 = checked, 0 = unchecked
*Optuna*:
n_trials - Number of trials for Optuna hyperparameter optimization
### **Select number of clusters**
*HDBSCAN hyperparameters*:
min_cluster_size - value of min_cluster_size
min_samples - value of min_samples
cluster_selection_epsilon - value of cluster_selection_epsilon
alpha - value of alpha
cluster_selection_method - eom | leaf
hdbscan_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
*UMAP hyperparameters*:
min_dist - best value identified
n_neighbors - best value identified
silhouette_score - silhouette score for the combination of hyperparameters
umap_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
### **Manual**
*HDBSCAN hyperparameters*:
min_cluster_size - value of min_cluster_size
min_samples - value of min_samples
cluster_selection_epsilon - value of cluster_selection_epsilon
alpha - value of alpha
hdscan_man_selection_cluster_selection_method - eom 
| leaf
hdbscan_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
hdbscan_man_selection_metric_cont - 0 = Euclidean, 1 = Manhattan
hdbscan_man_selection_metric_bin - 0 = Tanimoto/Jaccard, 1 = Dice
*UMAP hyperparameters*:
min_dist - manually entered value
n_neighbors - manually entered value
umap_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
umap_man_selection_metric_cont - 0 = Euclidean, 1 = Manhattan
umap_man_selection_metric_bin - 0 = Tanimoto/Jaccard, 1 = Dice
### **Silhouette Analysis**
hdbscan_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
cluster_selection_method - eom | leaf
umap_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
silhouette_score - silhouette score for the combination of hyperparameters

## **PCA**
### **Select Hyperparameters**
### **Automated/Manual**
aut/man_hyperparameter_selection - automated or manual hyperparameter tuning - 0 = Manual, 1 = Automated
### **Automated**
*HDBSCAN hyperparameters*:
max-min_cluster_size - Maximum value of min_cluster_size
min-min_cluster_size - Minimum value of min_cluster_size
max-min_samples - Maximum value of min_samples
min-min_samples - Minimum value of min_samples
max-cluster_selection_epsilon - Maximum value of cluster_selection_epsilon
min-cluster_selection_epsilon - Minimum value of cluster_selection_epsilon
max-alpha - Maximum value of alpha
min-alpha - Minimum value of alpha
hdbscan_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
hdbscan_aut_checkbox_metric_cont_euclidean - 1 = checked, 0 = unchecked
hdbscan_aut_checkbox_metric_cont_manhattan - 1 = checked, 0 = unchecked
hdbscan_aut_checkbox_metric_bin_jaccard - 1 = checked, 0 = unchecked
hdbscan_aut_checkbox_metric_bin_dice - 1 = checked, 0 = unchecked
*Optuna*:
n_trials - Number of trials for Optuna hyperparameter optimization
### **Select number of clusters**
*HDBSCAN hyperparameters*:
min_cluster_size - value of min_cluster_size
min_samples - value of 
min_samples
cluster_selection_epsilon - value of cluster_selection_epsilon
alpha - value of alpha
cluster_selection_method - eom | leaf
hdbscan_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
silhouette_score - silhouette score for the combination of hyperparameters
### **Manual**
*HDBSCAN hyperparameters*:
min_cluster_size - value of min_cluster_size
min_samples - value of min_samples
cluster_selection_epsilon - value of cluster_selection_epsilon
alpha - value of alpha
hdscan_man_selection_cluster_selection_method - eom | leaf
hdbscan_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
hdbscan_man_selection_metric_cont - 0 = Euclidean, 1 = Manhattan
hdbscan_man_selection_metric_bin - 0 = Tanimoto/Jaccard, 1 = Dice
### **Silhouette Analysis**
hdbscan_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
cluster_selection_method - eom | leaf
silhouette_score - silhouette score for the combination of hyperparameters

## **t-SNE**
### **Select Hyperparameters**
### **Automated/Manual**
aut/man_hyperparameter_selection - automated or manual hyperparameter tuning - 0 = Manual, 1 = Automated
### **Automated**
*HDBSCAN hyperparameters*:
max-min_cluster_size - Maximum value of min_cluster_size
min-min_cluster_size - Minimum value of min_cluster_size
max-min_samples - Maximum value of min_samples
min-min_samples - Minimum value of min_samples
max-cluster_selection_epsilon - Maximum value of cluster_selection_epsilon
min-cluster_selection_epsilon - Minimum value of cluster_selection_epsilon
max-alpha - Maximum value of alpha
min-alpha - Minimum value of alpha
hdbscan_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
hdbscan_aut_checkbox_metric_cont_euclidean - 1 = checked, 0 = unchecked
hdbscan_aut_checkbox_metric_cont_manhattan - 1 = checked, 0 = unchecked
hdbscan_aut_checkbox_metric_bin_jaccard - 1 = checked, 0 = unchecked
hdbscan_aut_checkbox_metric_bin_dice - 1 = checked, 0 = unchecked
*t-SNE hyperparameters*:
perplexity_max - Maximum value for the 
perplexity hyperparameter
perplexity_min - Minimum value for the perplexity hyperparameter
learning_rate_max - Maximum value for the learning rate hyperparameter
learning_rate_min - Minimum value for the learning rate hyperparameter
tsne_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
tsne_aut_checkbox_metric_cont_euclidean - 1 = checked, 0 = unchecked
tsne_aut_checkbox_metric_cont_manhattan - 1 = checked, 0 = unchecked
tsne_aut_checkbox_metric_bin_jaccard - 1 = checked, 0 = unchecked
tsne_aut_checkbox_metric_bin_dice - 1 = checked, 0 = unchecked
*Optuna*:
n_trials - Number of trials for Optuna hyperparameter optimization
### **Select number of clusters**
*HDBSCAN hyperparameters*:
min_cluster_size - value of min_cluster_size
min_samples - value of min_samples
cluster_selection_epsilon - value of cluster_selection_epsilon
alpha - value of alpha
cluster_selection_method - eom | leaf
hdbscan_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
*t-SNE hyperparameters*:
learning_rate - learning rate for the selected trial in the hyperparameter search or manual entry
perplexity - perplexity for the selected trial in the hyperparameter search or manual entry
tsne_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
silhouette_score - silhouette score for the combination of hyperparameters
### **Manual**
*HDBSCAN hyperparameters*:
min_cluster_size - value of min_cluster_size
min_samples - value of min_samples
cluster_selection_epsilon - value of cluster_selection_epsilon
alpha - value of alpha
hdscan_man_selection_cluster_selection_method - eom | leaf
hdbscan_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
hdbscan_man_selection_metric_cont - 0 = Euclidean, 1 = Manhattan
hdbscan_man_selection_metric_bin - 0 = Tanimoto/Jaccard, 1 = Dice
*t-SNE hyperparameters*:
learning_rate - learning rate for the selected trial in the hyperparameter search or manual entry
perplexity - perplexity for the selected trial in the hyperparameter search or 
manual entry
tsne_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
tsne_man_selection_metric_cont - 0 = Euclidean, 1 = Manhattan
tsne_man_selection_metric_bin - 0 = Tanimoto/Jaccard, 1 = Dice
### **Silhouette Analysis**
hdbscan_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
cluster_selection_method - eom | leaf
tsne_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
silhouette_score - silhouette score for the combination of hyperparameters

## Visualize clusters
Column 0 - First coordinate of a dimensionality reduction method - UMAP_1 | PCA dimension 0 | t-SNE_1
Column 1 - Second coordinate of a dimensionality reduction method - UMAP_2 | PCA dimension 1 | t-SNE_2

# DBSCAN

## **UMAP**
### **Select Hyperparameters**
### **Automated/Manual**
aut/man_hyperparameter_selection - automated or manual hyperparameter tuning - 0 = Manual, 1 = Automated
### **Automated**
*DBSCAN hyperparameters*:
max-min_samples - Maximum value of min_samples
min-min_samples - Minimum value of min_samples
max-eps - Maximum value of eps
min-eps - Minimum value of eps
dbscan_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
dbscan_aut_checkbox_metric_cont_euclidean - 1 = checked, 0 = unchecked
dbscan_aut_checkbox_metric_cont_manhattan - 1 = checked, 0 = unchecked
dbscan_aut_checkbox_metric_bin_jaccard - 1 = checked, 0 = unchecked
dbscan_aut_checkbox_metric_bin_dice - 1 = checked, 0 = unchecked
*UMAP hyperparameters*:
min_dist_max - Maximum value of min_dist
min_dist_min - Minimum value of min_dist
n_neighbors_max - Maximum number of n_neighbors
n_neighbors_min - Minimum number of n_neighbors
umap_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
umap_aut_checkbox_metric_cont_euclidean - 1 = checked, 0 = unchecked
umap_aut_checkbox_metric_cont_manhattan - 1 = checked, 0 = unchecked
umap_aut_checkbox_metric_bin_jaccard - 1 = checked, 0 = unchecked
umap_aut_checkbox_metric_bin_dice - 1 = checked, 0 = unchecked
*Optuna*:
n_trials - Number of 
trials for Optuna hyperparameter optimization
### **Select number of clusters**
*DBSCAN hyperparameters*:
min_samples - value of min_samples
eps - value of eps
dbscan_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
*UMAP hyperparameters*:
min_dist - best value identified
n_neighbors - best value identified
silhouette_score - silhouette score for the combination of hyperparameters
umap_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
### **Manual**
*DBSCAN hyperparameters*:
min_samples - value of min_samples
eps - value of eps
dbscan_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
dbscan_man_selection_metric_cont - 0 = Euclidean, 1 = Manhattan
dbscan_man_selection_metric_bin - 0 = Tanimoto/Jaccard, 1 = Dice
*UMAP hyperparameters*:
min_dist - manually entered value
n_neighbors - manually entered value
umap_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
umap_man_selection_metric_cont - 0 = Euclidean, 1 = Manhattan
umap_man_selection_metric_bin - 0 = Tanimoto/Jaccard, 1 = Dice
### **Silhouette Analysis**
dbscan_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
umap_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
silhouette_score - silhouette score for the combination of hyperparameters

## **PCA**
### **Select Hyperparameters**
### **Automated/Manual**
aut/man_hyperparameter_selection - automated or manual hyperparameter tuning - 0 = Manual, 1 = Automated
### **Automated**
*DBSCAN hyperparameters*:
max-min_samples - Maximum value of min_samples
min-min_samples - Minimum value of min_samples
max-eps - Maximum value of eps
min-eps - Minimum value of eps
dbscan_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
dbscan_aut_checkbox_metric_cont_euclidean - 1 = checked, 0 = unchecked
dbscan_aut_checkbox_metric_cont_manhattan - 1 = checked, 0 = unchecked
dbscan_aut_checkbox_metric_bin_jaccard - 1 = checked, 0 = unchecked
dbscan_aut_checkbox_metric_bin_dice - 1 = checked, 0 = unchecked
*Optuna*:
n_trials - 
Number of trials for Optuna hyperparameter optimization
### **Select number of clusters**
*DBSCAN hyperparameters*:
min_samples - value of min_samples
eps - value of eps
silhouette_score - silhouette score for the combination of hyperparameters
### **Manual**
*DBSCAN hyperparameters*:
min_samples - value of min_samples
eps - value of eps
dbscan_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
dbscan_man_selection_metric_cont - 0 = Euclidean, 1 = Manhattan
dbscan_man_selection_metric_bin - 0 = Tanimoto/Jaccard, 1 = Dice
### **Silhouette Analysis**
dbscan_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
silhouette_score - silhouette score for the combination of hyperparameters

## Visualize clusters
Column 0 - First coordinate of a dimensionality reduction method - UMAP_1 | PCA dimension 0 | t-SNE_1
Column 1 - Second coordinate of a dimensionality reduction method - UMAP_2 | PCA dimension 1 | t-SNE_2

## **t-SNE**
### **Select Hyperparameters**
### **Automated/Manual**
aut/man_hyperparameter_selection - automated or manual hyperparameter tuning - 0 = Manual, 1 = Automated
### **Automated**
*DBSCAN hyperparameters*:
max-min_samples - Maximum value of min_samples
min-min_samples - Minimum value of min_samples
max-eps - Maximum value of eps
min-eps - Minimum value of eps
dbscan_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
dbscan_aut_checkbox_metric_cont_euclidean - 1 = checked, 0 = unchecked
dbscan_aut_checkbox_metric_cont_manhattan - 1 = checked, 0 = unchecked
dbscan_aut_checkbox_metric_bin_jaccard - 1 = checked, 0 = unchecked
dbscan_aut_checkbox_metric_bin_dice - 1 = checked, 0 = unchecked
*t-SNE hyperparameters*:
perplexity_max - Maximum value for the perplexity hyperparameter
perplexity_min - Minimum value for the perplexity hyperparameter
learning_rate_max - Maximum value for the learning rate hyperparameter
learning_rate_min - Minimum value for the learning rate hyperparameter
tsne_metric_bin_or_cont - defines whether the metric is binary (= 0) or 
continuous (= 1)
tsne_aut_checkbox_metric_cont_euclidean - 1 = checked, 0 = unchecked
tsne_aut_checkbox_metric_cont_manhattan - 1 = checked, 0 = unchecked
tsne_aut_checkbox_metric_bin_jaccard - 1 = checked, 0 = unchecked
tsne_aut_checkbox_metric_bin_dice - 1 = checked, 0 = unchecked
*Optuna*:
n_trials - Number of trials for Optuna hyperparameter optimization
### **Select number of clusters**
*DBSCAN hyperparameters*:
min_samples - value of min_samples
eps - value of eps
dbscan_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
*t-SNE hyperparameters*:
learning_rate - learning rate for the selected trial in the hyperparameter search or manual entry
perplexity - perplexity for the selected trial in the hyperparameter search or manual entry
tsne_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
silhouette_score - silhouette score for the combination of hyperparameters
### **Manual**
*DBSCAN hyperparameters*:
min_samples - value of min_samples
eps - value of eps
dbscan_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
dbscan_man_selection_metric_cont - 0 = Euclidean, 1 = Manhattan
dbscan_man_selection_metric_bin - 0 = Tanimoto/Jaccard, 1 = Dice
*t-SNE hyperparameters*:
learning_rate - learning rate for the selected trial in the hyperparameter search or manual entry
perplexity - perplexity for the selected trial in the hyperparameter search or manual entry
tsne_metric_bin_or_cont - defines whether the metric is binary (= 0) or continuous (= 1)
tsne_man_selection_metric_cont - 0 = Euclidean, 1 = Manhattan
tsne_man_selection_metric_bin - 0 = Tanimoto/Jaccard, 1 = Dice
### **Silhouette Analysis**
dbscan_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
tsne_metric - Euclidean | Manhattan | Tanimoto/Jaccard | Dice
silhouette_score - silhouette score for the combination of hyperparameters

## Visualize clusters
Column 0 - First coordinate of a dimensionality reduction method - UMAP_1 | PCA dimension 0 | t-SNE_1
Column 1 - Second coordinate of a dimensionality 
reduction method - UMAP_2 | PCA dimension 1 | t-SNE_2

# OpenAI Use and Key
use_gpt_interpretation - Yes | No
string-input-test - OpenAI API Key

# Visualize Feature importance
n_clusters - Number of clusters
new_cluster_number - Convert -1 (Outliers) to a number higher than the highest cluster number
supervised_ML_selection_for_unsupervised - variable to select the supervised ML algorithm for cluster interpretation; for manual dimensionality reduction with unsupervised clustering, LightGBM will be used
ml_algorithm-selection (index) - 0 = LightGBM | 1 = Random Forest | 2 = Support Vector Machine | 3 = Logistic Regression | 4 = K-Nearest Neighbors | 5 = Naive Bayes
cluster_with_outlier_selection - Processing column name if the clustering algorithm has outlier output
Number Columns - total number of descriptors available
Number Rows - total number of chemicals
binary_or_multiclass - defines whether the supervised ML algorithm for the interpretation will be binary (fewer than 3 clusters) or multiclass (more than 2 clusters)
num_features_viz - selected number of descriptors for which to visualize the importance
check_combination_binary_descriptors:
(1) -> Nothing
(2) -> Only MACCS
(3) -> Only FeatMorgan
(4) -> FeatMorgan and MACCS
(5) -> Only Morgan
(6) -> Morgan and MACCS
(7) -> Morgan and FeatMorgan
(8) -> Morgan and FeatMorgan and MACCS

If Labeled/Unlabeled If Labeled/Unlabeled Molecular descriptors Dimensionality reduction Cluster training and visualization IF clustering method (SHAP values or original descriptors) Output 1: K-means Output 2: Hierarchical clustering Output 3: K-medoids Output 4: HDBSCAN Output 5: DBSCAN Binary descriptors Continuous descriptors IF Switch Labeled/Unlabeled Output 1 - Labeled Output 2 - Unlabeled IF end Unlabeled or Unsupervised Variable Selection - Labeled IF Switch Supervised/Unsupervised Port 1 - Supervised Port 2 - Unsupervised Variable selection Unlabeled or Unsupervised IF Switch SHAP END IF Labeled/Unlabeled Select clustering options Original descriptors Handling error with 
global feature importance for unsupervised

▼ STEP ONE ▼ ▼ STEP TWO ▼ ▼ STEP THREE ▼ ▼ STEP FOUR ▼ ▼ STEP FIVE ▼ ▼ STEP SIX ▼ ▼ STEP SEVEN ▼ ▼ STEP EIGHT ▼

SMI
If Labeled/Unlabeled
Unreadable structures
All columns and rows of the input
Exploratory Analysis
Structures column
QSAR-Ready Standardization

IF New analysis past config
IF Switch past_input-data-type: Port 1 - labeled, Port 2 - unlabeled
IF Switch past_supervised/unsupervised-selection: Port 1 - Supervised, Port 2 - Unsupervised
IF Switch past_SHAP-selection: Port 1 - Yes, Port 2 - No

Node labels: SDF, Read-SDF, SMI/CSV or XLS, CSV-XLS_read, xls - Structure checker, CHATGPT, New/Past Analysis, Manual/automated, Upload, Exploratory analysis, Select SMILES column, Select SMILES and activity columns, select activity column, SMILES, RowID, Column Filter, Joiner, Select molecular descriptors, Calculate Binary molecular descriptors, Calculate Continuous molecular descriptors, Variable selection, Variable selection labeled, IF Supervised/Unsupervised, IF SHAP Yes/No, Dimensionality reduction, Select clustering options, Select Hyperparameters, Hyperparameter search, Select number of clusters, Silhouette Analysis, Visualize clusters, Visualize Feature importance, OpenAI Use and Key, Data processing, Metanode, Report, Call Workflow Service, Flow Variable IF Switch (FlowVariable Value) (deprecated), End IF, CASE Switch Start, CASE Switch End
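
The metric flow variables follow the same pattern for every method prefix (`umap`, `tsne`, `hdbscan`, `dbscan`): `<prefix>_metric_bin_or_cont` selects the binary or continuous family, `<prefix>_man_selection_metric_bin` / `<prefix>_man_selection_metric_cont` pick the metric in manual mode, and the `<prefix>_aut_checkbox_metric_*` flags mark the candidates for the automated search. The sketch below shows one way to decode them in Python. It assumes the flow variables are available as a plain dictionary of integers; the function names and the dictionary layout are hypothetical and not part of the workflow, and reading only the checkboxes of the selected family is likewise an assumption.

```python
# Index-to-name tables copied from the flow-variable descriptions.
BINARY_METRICS = {0: "Tanimoto/Jaccard", 1: "Dice"}
CONTINUOUS_METRICS = {0: "Euclidean", 1: "Manhattan"}


def resolve_manual_metric(flow_vars, prefix):
    """Return the metric name selected in manual mode for `prefix`.

    `flow_vars` is a hypothetical dict mapping flow-variable names to ints;
    `prefix` is one of "umap", "tsne", "hdbscan", "dbscan".
    """
    family = flow_vars[f"{prefix}_metric_bin_or_cont"]  # 0 = binary, 1 = continuous
    if family == 0:
        return BINARY_METRICS[flow_vars[f"{prefix}_man_selection_metric_bin"]]
    if family == 1:
        return CONTINUOUS_METRICS[flow_vars[f"{prefix}_man_selection_metric_cont"]]
    raise ValueError(f"unexpected {prefix}_metric_bin_or_cont value: {family}")


def checked_automated_metrics(flow_vars, prefix):
    """List the metrics ticked for the automated search (1 = checked).

    Assumption: only the checkboxes of the selected family are considered.
    """
    family = flow_vars[f"{prefix}_metric_bin_or_cont"]
    if family == 0:
        candidates = [("bin_jaccard", "Tanimoto/Jaccard"), ("bin_dice", "Dice")]
    else:
        candidates = [("cont_euclidean", "Euclidean"), ("cont_manhattan", "Manhattan")]
    return [
        name
        for suffix, name in candidates
        if flow_vars.get(f"{prefix}_aut_checkbox_metric_{suffix}", 0) == 1
    ]


manual = {"umap_metric_bin_or_cont": 0, "umap_man_selection_metric_bin": 1}
print(resolve_manual_metric(manual, "umap"))  # Dice

automated = {
    "hdbscan_metric_bin_or_cont": 1,
    "hdbscan_aut_checkbox_metric_cont_euclidean": 1,
    "hdbscan_aut_checkbox_metric_cont_manhattan": 0,
}
print(checked_automated_metrics(automated, "hdbscan"))  # ['Euclidean']
```

Keeping the two lookup tables separate mirrors the way the workflow splits the choice into a family flag and a per-family index.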

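Three of the feature-importance variables encode simple rules: `new_cluster_number` replaces the outlier label -1, `binary_or_multiclass` depends on the cluster count, and the eight `check_combination_binary_descriptors` codes match 1 plus a three-bit mask (MACCS = 1, FeatMorgan = 2, Morgan = 4). The Python sketch below illustrates these rules. The function names are hypothetical, and using the highest cluster label plus one for the outliers is an assumption; the description only requires a number higher than the highest cluster number.

```python
def relabel_outliers(labels):
    """Replace the outlier label -1 with a label above the highest cluster.

    Assumption: highest cluster label + 1 is used as the new label.
    """
    new_cluster_number = max((l for l in labels if l != -1), default=-1) + 1
    return [new_cluster_number if l == -1 else l for l in labels]


def interpretation_task(n_clusters):
    """Binary for fewer than 3 clusters, multiclass otherwise."""
    return "binary" if n_clusters < 3 else "multiclass"


def descriptor_combination_code(maccs, featmorgan, morgan):
    """Reproduce the (1)..(8) codes listed for check_combination_binary_descriptors."""
    return 1 + 1 * int(maccs) + 2 * int(featmorgan) + 4 * int(morgan)


print(relabel_outliers([0, 1, -1, 2, -1]))  # [0, 1, 3, 2, 3]
print(interpretation_task(2))  # binary
print(descriptor_combination_code(maccs=True, featmorgan=False, morgan=True))  # 6
```

The last call returns 6, which corresponds to "(6) -> Morgan and MACCS" in the list of combinations.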
Links