HOE_7_StarterFile

HOE_8 - Clustering

In this HOE, we will use the k-Means algorithm to cluster census data (census2000.csv). We have used this dataset in our in-class exercises as well. You are provided a KNIME starter workflow file (HOE_8_StarterFile.knwf). You have also been provided a corresponding solution file (HOE_8_SolutionFile.knwf) for this HOE, which you can use as a reference while working through the objectives below. You can also use the class example to help you complete this HOE. Once you have finished all of the tasks below (1-8), add your name to the annotation box on the right of the solution file.

1) Read the dataset census2000.csv.
2) Add a Normalizer node to normalize the interval values so that they are on the same scale. Because the k-Means clustering algorithm relies on distance metrics, all variables need to be on the same scale; otherwise variables with large numeric ranges dominate the distance calculation.
3) Train a k-Means model with k=3. Use all variables other than the position data (latitude: locX and longitude: locY) for clustering. The k-Means node outputs two tables. The first table combines the input table with the cluster assignment for each observation. We will use this table to analyze each cluster further.
4) Add a Denormalizer node to the workflow and connect the first output table of the k-Means node to it. Also supply it with the model output from the Normalizer node, so that the values are restored to their original scale.
5) Next, add a Color Manager node and assign each cluster a unique color. This node passes the color information to all subsequent visualization nodes, which helps in telling the clusters apart.
6) Use the Scatter Plot node to plot the observations (colored by cluster) by the latitude and longitude variables.
7) Use four Conditional Box Plot nodes to visualize the distribution of each of the four clustering variables by cluster. This will give you a good idea of the contents of each cluster and can help with naming the clusters.
8) Lastly, generate a table of the mean values of each variable by cluster. This table lets you view the variable means of each cluster side by side and helps in defining the clusters according to their contents (observations). To build it, use a GroupBy node to group by cluster and then manually aggregate each of the four variables by its mean. For example: Cluster_2 is defined as the cluster with observations that have the highest Density, Population, and Household Income.
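For readers who want to see the logic behind the workflow outside of KNIME, the steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the course solution: it assumes min-max normalization (the Normalizer node offers other methods too), seeds the centroids with the first k rows so the run is repeatable, and uses a tiny made-up table with two hypothetical columns instead of census2000.csv. All function names are invented for this sketch.

```python
from statistics import mean

def fit_min_max(rows):
    """Per-column (min, max) pairs; plays the role of the Normalizer's model output."""
    return [(min(col), max(col)) for col in zip(*rows)]

def normalize(rows, model):
    """Rescale every column to [0, 1] so no variable dominates the distances."""
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, (lo, hi) in zip(row, model)] for row in rows]

def denormalize(rows, model):
    """Invert normalize() using the same model (the Denormalizer step)."""
    return [[v * (hi - lo) + lo for v, (lo, hi) in zip(row, model)] for row in rows]

def sq_dist(a, b):
    """Squared Euclidean distance between two rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(rows, k, max_iter=100):
    """Lloyd's algorithm. Assumption: centroids are seeded with the first k rows."""
    centroids = [list(r) for r in rows[:k]]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each row joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(row, centroids[c]))
                      for row in rows]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [mean(col) for col in zip(*members)]
    return labels

def cluster_means(rows, labels):
    """Mean of every column per cluster (the GroupBy step)."""
    groups = {}
    for row, lab in zip(rows, labels):
        groups.setdefault(lab, []).append(row)
    return {lab: [mean(col) for col in zip(*members)]
            for lab, members in sorted(groups.items())}

# Made-up data: two hypothetical columns (a density-like and an income-like value).
rows = [
    [10.0, 20000.0],
    [500.0, 60000.0],
    [5000.0, 90000.0],
    [12.0, 22000.0],
    [520.0, 58000.0],
    [5100.0, 95000.0],
]

model = fit_min_max(rows)              # step 2: learn the scaling
scaled = normalize(rows, model)        # step 2: apply it
labels = k_means(scaled, k=3)          # step 3: cluster on the scaled values
restored = denormalize(scaled, model)  # step 4: back to original units
means = cluster_means(restored, labels)  # step 8: mean of each variable by cluster

for cluster, values in means.items():
    print(cluster, [round(v, 2) for v in values])
```

Clustering runs on the scaled values, while the per-cluster means are computed on the restored values so that they can be read in the original units. This is the reason the workflow places the Denormalizer between the k-Means node and the summary table.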
