Hierarchical Clustering Report

Hierarchical Clustering based on molecular fingerprints

Available linkage types:

  • single
  • complete
  • average
  • centroid
  • mcquitty
  • ward
  • weightedcentroid
  • flexiblebeta
  • schrodinger

Reports details for each level of hierarchical clustering based on a pairwise distance matrix. The output table can be used to determine what level of clustering is needed (i.e. the number of clusters that should be used in the Hierarchical Clustering node).
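To illustrate what "each level" of a hierarchical clustering means, here is a minimal pure-Python sketch that agglomerates items from a full pairwise distance matrix and records the partition at every possible number of clusters. It is not the canvasHCBuild implementation: the function name agglomerate, the toy distance values, and the restriction to three of the linkage types (single, complete, average) are assumptions made for illustration only.

```python
def agglomerate(dist, linkage="average"):
    """Naive agglomerative clustering on a full symmetric distance matrix.

    Returns {n: list of clusters} for every n from len(dist) down to 1,
    where each cluster is a sorted list of item indices.
    """
    # How the distance between two clusters is derived from member distances.
    combine = {
        "single": min,
        "complete": max,
        "average": lambda ds: sum(ds) / len(ds),
    }[linkage]

    clusters = [[i] for i in range(len(dist))]
    levels = {len(clusters): [list(c) for c in clusters]}

    while len(clusters) > 1:
        best = None
        # Find the closest pair of clusters under the chosen linkage.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ds = [dist[i][j] for i in clusters[a] for j in clusters[b]]
                d = combine(ds)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        levels[len(clusters)] = [list(c) for c in clusters]
    return levels


# Toy distance matrix for five items (illustrative values only).
dist = [
    [0.0, 0.1, 0.8, 0.9, 0.7],
    [0.1, 0.0, 0.7, 0.8, 0.6],
    [0.8, 0.7, 0.0, 0.2, 0.5],
    [0.9, 0.8, 0.2, 0.0, 0.4],
    [0.7, 0.6, 0.5, 0.4, 0.0],
]

levels = agglomerate(dist, "average")
for n in sorted(levels):
    print(n, levels[n])
```

Each key of levels corresponds to one row of a per-level report: the statistics described below would be computed once per value of n.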

The statsFile contains data relating to the cluster efficiency for each possible number of clusters (n).

Definition of each statistic used in statsFile

R-Squared (RSQ) is 1.0 - (W/T), where:

W is the within-cluster variance summed over all n clusters, and

T is the total variance of the data set.
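The RSQ definition can be sketched directly from a distance matrix. The example below assumes that "variance" means the sum of squared deviations from a centroid, that W is accumulated within each cluster, and that the distances behave like Euclidean distances, so that the sum of squares of a group equals the sum of its squared pairwise distances divided by the group size. The helper names sum_of_squares and r_squared are hypothetical, and this is not necessarily how canvasHCBuild computes the value.

```python
def sum_of_squares(dist, members):
    """Sum of squared deviations from the centroid, from pairwise distances.

    Uses the identity SS = (1/m) * sum over pairs i<j of d_ij^2, which is
    exact when the distances are Euclidean (an assumption of this sketch).
    """
    m = len(members)
    total = 0.0
    for x in range(m):
        for y in range(x + 1, m):
            total += dist[members[x]][members[y]] ** 2
    return total / m


def r_squared(dist, clusters):
    """RSQ = 1 - W/T for one partition of the items into clusters."""
    everything = [i for c in clusters for i in c]
    t = sum_of_squares(dist, everything)                 # total variance T
    w = sum(sum_of_squares(dist, c) for c in clusters)   # summed over clusters W
    return 1.0 - w / t


# Four points on a line, so the distances are trivially Euclidean.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(a - b) for b in points] for a in points]

print(r_squared(dist, [[0, 1], [2, 3]]))      # two tight clusters
print(r_squared(dist, [[0, 1, 2, 3]]))        # one cluster: W equals T
print(r_squared(dist, [[0], [1], [2], [3]]))  # all singletons: W is zero
```

Under these assumptions RSQ runs from 0 for a single cluster to 1 when every item is its own cluster, which is why it is read together with the other statistics rather than maximized on its own.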

Semipartial R-Squared (SPRSQ) represents the gradient of the above metric, i.e. the change in RSQ between successive values of n.
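One way to read "gradient" is as a finite difference of RSQ over n. The sketch below assumes SPRSQ at n is RSQ(n + 1) - RSQ(n), that is, the R-Squared lost by merging from n + 1 clusters down to n. The exact convention used in statsFile is not stated here, and the function name and the RSQ values are made up for illustration.

```python
def semipartial_r_squared(rsq_by_n):
    """Finite-difference 'gradient' of RSQ with respect to n.

    rsq_by_n maps n (number of clusters) to RSQ. SPRSQ at n is taken here
    as RSQ(n + 1) - RSQ(n). The largest n has no successor, so it gets no
    entry in the result.
    """
    return {
        n: rsq_by_n[n + 1] - rsq_by_n[n]
        for n in sorted(rsq_by_n)
        if n + 1 in rsq_by_n
    }


# Made-up RSQ values for n = 1..5, for illustration only.
rsq_by_n = {1: 0.0, 2: 0.70, 3: 0.85, 4: 0.95, 5: 1.0}
sprsq = semipartial_r_squared(rsq_by_n)

for n in sorted(sprsq):
    print(n, round(sprsq[n], 3))
```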

SPRSQRank is the rank of each SPRSQ value over all possible choices of n (for clarity, only the top sqrt(n) ranks are listed). It is useful for choosing a locally optimal n within a desired range.
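The ranking step can be sketched as follows. Two points are assumptions of this example rather than documented behavior: rank 1 is given to the largest SPRSQ value, and the number of ranks kept is the integer square root of the number of candidate values. The function name sprsq_ranks and the input values are hypothetical.

```python
import math


def sprsq_ranks(sprsq_by_n):
    """Rank SPRSQ values and keep only the top sqrt(count) of them.

    Returns {n: rank} where rank 1 is the largest SPRSQ (an assumption of
    this sketch). Values of n outside the top ranks are omitted.
    """
    ordered = sorted(sprsq_by_n, key=lambda n: sprsq_by_n[n], reverse=True)
    keep = max(1, math.isqrt(len(ordered)))
    return {n: rank for rank, n in enumerate(ordered[:keep], start=1)}


# Made-up SPRSQ values for nine candidate values of n.
sprsq_by_n = {1: 0.40, 2: 0.05, 3: 0.20, 4: 0.02, 5: 0.10,
              6: 0.01, 7: 0.03, 8: 0.015, 9: 0.005}
ranks = sprsq_ranks(sprsq_by_n)
print(ranks)
```

To pick a locally optimal n inside a desired range, one would look for the ranked entries whose n falls in that range.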

Kelley Penalty is Kelley's clustering efficiency metric (Kelley et al., Protein Engineering 9(11), pp. 1063-1065 (1996)).

IsKelleyMinimum indicates whether this value of n gives the global minimum of the Kelley Penalty. It is useful for choosing a globally optimal n.
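The sketch below follows the commonly cited form of the Kelley penalty: the average cluster spread at each level is rescaled to lie between 1 and N - 1 (N being the number of items), and the number of clusters is added to it. Whether canvasHCBuild follows this form exactly is an assumption, as are the function names, the input values, and the choice to supply the average spreads directly instead of deriving them from a distance matrix.

```python
def kelley_penalties(avg_spread_by_n, n_items):
    """Penalty per number of clusters n, in the commonly cited Kelley form.

    avg_spread_by_n maps n to the average within-cluster spread at that
    level. The spread is rescaled to the range [1, n_items - 1] and n is
    added, so both loose clusters and many clusters are penalized.
    """
    lo = min(avg_spread_by_n.values())
    hi = max(avg_spread_by_n.values())
    span = hi - lo
    penalties = {}
    for n, spread in avg_spread_by_n.items():
        if span == 0:
            norm = 1.0  # all levels equally spread; avoid dividing by zero
        else:
            norm = (n_items - 2) * (spread - lo) / span + 1.0
        penalties[n] = norm + n
    return penalties


def kelley_minimum_flags(penalties):
    """Return {n: True/False}, True only at the global minimum penalty."""
    best = min(penalties, key=penalties.get)
    return {n: n == best for n in penalties}


# Made-up average spreads for six items, for illustration only.
avg_spread_by_n = {1: 5.0, 2: 2.0, 3: 1.0, 4: 0.8, 5: 0.5}
penalties = kelley_penalties(avg_spread_by_n, n_items=6)
flags = kelley_minimum_flags(penalties)

for n in sorted(penalties):
    print(n, round(penalties[n], 3), flags[n])
```

If several values of n tie for the minimum, this sketch flags only the first one encountered; how ties are reported in statsFile is not specified here.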

Backend implementation

utilities/canvasHCBuild
canvasHCBuild is used to implement this node.

Input Ports

Pairwise distance matrix in binary format

Output Ports

Report designed to help select an appropriate number of clusters.

Views

Std output/error of Build Hierarchical Clustering
