HOE_7_StarterFile

Clustering - Solution

- Filter rows
- Train a k-Means model
- Visualize clustered entries on Scatter plot and OSM Map
- Calculate Silhouette Coefficients
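
For readers who want to see what the Silhouette Coefficient measures, here is a minimal pure-Python sketch. It is not how KNIME computes it; the function name `silhouette_scores`, the use of Euclidean distance, and the toy points are assumptions made for illustration. For each observation, `a` is the mean distance to the other members of its own cluster and `b` is the lowest mean distance to the members of any other cluster; the coefficient is `(b - a) / max(a, b)`.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def silhouette_scores(points, labels):
    """Return one silhouette coefficient per point (hypothetical helper)."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # Distances to the other members of the point's own cluster.
        own = [dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
               if l == lab and j != i]
        # Mean distance to each of the other clusters.
        others = []
        for other in set(labels) - {lab}:
            members = [dist(p, q) for q, l in zip(points, labels) if l == other]
            others.append(sum(members) / len(members))
        if not own or not others:
            scores.append(0.0)  # convention used here for singleton or single-cluster cases
            continue
        a = sum(own) / len(own)
        b = min(others)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Made-up toy data: two tight, well-separated groups.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_scores(toy_points, [0, 0, 1, 1])
bad = silhouette_scores(toy_points, [0, 1, 0, 1])
print([round(s, 2) for s in good])
print([round(s, 2) for s in bad])
```

Values close to 1 indicate an observation that sits well inside its cluster, while negative values suggest it would fit better in a neighboring cluster.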

URL: Guide to Intelligent Data Science https://www.datascienceguide.org/

HOE_8 - Clustering

In this HOE, we will use the k-Means algorithm to cluster census data (census2000.csv). We have used this dataset in our in-class exercises as well. You are provided a KNIME starter workflow file (HOE_8_StarterFile.knwf). You have also been provided a corresponding solution file (HOE_8_SolutionFile.knwf) for this HOE, which you can use as a reference to tackle the objectives below. You can also use the class example to help you complete this HOE. Once you have finished all of the tasks below (1-8), add your name to the annotation box on the right of the solution file.

1) Read the dataset census2000.csv.
2) Add a Normalizer node to normalize the interval values so that they are on the same scale. Because the k-Means algorithm relies on distance metrics, all variables need to be on the same scale.
3) Train a k-Means model with k=3. Use all variables other than the position data (latitude: locX and longitude: locY) for clustering. The k-Means node outputs two tables. The first table is a combination of the input table and the cluster assignment for each observation in the input table. We will use this table to conduct further analysis of each cluster.
4) Add a Denormalizer node to the workflow and connect the first table from the k-Means node to it. Also supply it with the model output from the Normalizer node.
5) Next, add a Color Manager node and assign each cluster a unique color. This node passes the color information to all subsequent visualization nodes, which helps in differentiating between clusters.
6) Use the Scatter Plot node to plot the observations (colored by cluster) by the latitude and longitude variables.
7) Use 4 Conditional Box Plot nodes to visualize the distribution of the 4 clustering variables by cluster. This will give you a good idea of the contents of each cluster and can help with naming the clusters.
8) Lastly, generate a table of the mean values of each variable by cluster. This table lets you view the variable means of each cluster side by side, and it helps in defining the clusters according to their contents (observations). To do so, use a GroupBy node to group by cluster and then manually aggregate each of the four variables by its mean. For example: Cluster_2 is defined as the cluster whose observations have the highest Density, Population, and Household Income.
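
The data flow of tasks 2, 3, 4, and 8 can also be sketched outside KNIME. The following is a minimal pure-Python illustration, not a reproduction of the KNIME nodes: the helper names, the min-max scaling choice, the "first k rows" centroid initialization, and the six made-up rows (density, population, household income, household size) are all assumptions for the example. The position columns are simply left out of the toy rows, mirroring the instruction to exclude them from clustering.

```python
from statistics import mean


def fit_min_max(rows):
    """Per-column (min, max): the 'model' a normalizer passes to a denormalizer."""
    return [(min(col), max(col)) for col in zip(*rows)]


def normalize(rows, model):
    """Scale every column to [0, 1] so no variable dominates the distance."""
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, (lo, hi) in zip(row, model)] for row in rows]


def denormalize(rows, model):
    """Invert normalize() using the same model."""
    return [[v * (hi - lo) + lo for v, (lo, hi) in zip(row, model)] for row in rows]


def k_means(points, k, max_iter=100):
    """Plain Lloyd iterations; centroids start at the first k rows (an assumption)."""
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assign each point to the nearest centroid (squared Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [mean(col) for col in zip(*members)]
    return labels


def cluster_means(rows, labels):
    """Group by cluster label and aggregate every column by its mean."""
    return {c: [mean(col) for col in zip(*(r for r, lab in zip(rows, labels) if lab == c))]
            for c in sorted(set(labels))}


# Made-up rows: [density, population, household income, household size].
rows = [
    [100, 1000, 30000, 2.0], [5000, 40000, 60000, 3.0], [20000, 90000, 90000, 4.0],
    [200, 2000, 32000, 2.2], [5500, 42000, 62000, 3.1], [21000, 95000, 95000, 4.2],
]
model = fit_min_max(rows)                 # task 2: fit the normalizer
scaled = normalize(rows, model)
labels = k_means(scaled, k=3)             # task 3: cluster the scaled values
restored = denormalize(scaled, model)     # task 4: back to original units
means = cluster_means(restored, labels)   # task 8: mean of each variable per cluster
for c, m in means.items():
    print(f"cluster_{c}", [round(v, 1) for v in m])
```

Clustering on the scaled values and reporting means on the denormalized values keeps the distance computation fair across variables while leaving the summary table in readable, original units.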
Workflow nodes: CSV Reader, Normalizer, k-Means, Denormalizer, Color Manager, Scatter Plot, Box Plot (x4), GroupBy
