Clustering Models

Clustering models focus on identifying groups of similar records and labeling the records according to the group to which they belong. This is done without the benefit of prior knowledge about the groups and their characteristics. In fact, you may not even know exactly how many groups to look for. This is what distinguishes clustering models from the other machine-learning techniques available in Clementine--there is no predefined output or target field for the model to predict. These models are often referred to as unsupervised learning models, since there is no external standard by which to judge the model's classification performance. There are no right or wrong answers for these models. Their value is determined by their ability to capture interesting groupings in the data and provide useful descriptions of those groupings.

Clustering methods are based on measuring distances between records and between clusters. Records are assigned to clusters in a way that tends to minimize the distance between records belonging to the same cluster.
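As a rough illustration of the distance idea, the sketch below compares two ways of grouping the same four records and totals the distance from each record to its cluster center. The Euclidean distance measure, the sample records, and the function names are all assumptions made for this example; they are not taken from Clementine.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(records):
    """Field-by-field mean of the records in one cluster."""
    n = len(records)
    return tuple(sum(col) / n for col in zip(*records))

def within_cluster_distance(clusters):
    """Total distance from every record to the center of its own cluster."""
    total = 0.0
    for records in clusters:
        center = centroid(records)
        total += sum(euclidean(r, center) for r in records)
    return total

# Hypothetical records with two numeric fields each.
records = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]

# Grouping nearby records together versus mixing distant records.
good_grouping = [records[:2], records[2:]]
poor_grouping = [[records[0], records[2]], [records[1], records[3]]]

good_total = within_cluster_distance(good_grouping)
poor_total = within_cluster_distance(poor_grouping)
print(good_total < poor_total)
```

The grouping that keeps nearby records together yields the smaller total, which is the property a clustering method tries to achieve.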

Clementine includes three methods for clustering.

You have already seen how Kohonen networks can be used for clustering. See Kohonen Networks for more information.

K-Means clustering works by defining a fixed number of clusters and iteratively assigning records to clusters and adjusting the cluster centers. This process of reassignment and cluster center adjustment continues until further refinement can no longer improve the model appreciably.

TwoStep clustering works by first compressing the data into a manageable number of small subclusters, then using a statistical clustering method to progressively merge the subclusters into clusters, then merging the clusters into larger clusters, and so on, until the minimum desired number of clusters is reached. TwoStep clustering has the advantage of automatically estimating the optimal number of clusters for the training data.
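The K-Means procedure described above can be sketched in a few lines. This is a minimal illustration and not Clementine's implementation: the Euclidean distance measure, the caller-supplied initial centers, the tolerance used to decide that the centers have stopped moving appreciably, and all names are assumptions made for the example.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(records, initial_centers, max_iter=100, tol=1e-6):
    """Alternate between assigning records and adjusting cluster centers."""
    centers = [tuple(c) for c in initial_centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each record joins the cluster with the nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: euclidean(r, centers[k]))
            for r in records
        ]
        # Adjustment step: move each center to the mean of its assigned records.
        new_centers = []
        for k, old_center in enumerate(centers):
            members = [r for r, lab in zip(records, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Keep an empty cluster's center where it was.
                new_centers.append(old_center)
        # Stop when no center moved by more than the tolerance.
        shift = max(euclidean(o, n) for o, n in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return labels, centers

# Hypothetical records with two numeric fields each, and two starting centers.
records = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
           (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centers = k_means(records, initial_centers=[records[0], records[3]])
print(labels)
print(centers)
```

The number of clusters is fixed by the number of initial centers supplied, which mirrors the requirement that K-Means be given the number of clusters in advance.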

Clustering models are often used to create clusters or segments that are then used as inputs in subsequent analyses. A common example is market segmentation, in which marketers partition their overall market into homogeneous subgroups. Each segment has special characteristics that affect the success of marketing efforts targeted toward it. If you are using data mining to optimize your marketing strategy, you can usually improve your model significantly by identifying the appropriate segments and using that segment information in your predictive models.