William D. Shannon, Tsvika Klein, and Robert Culverhouse (2003), A Likelihood Approach for Determining Cluster Number, Computing Science and Statistics, 35, I2003Proceedings/ShannonBill/ShannonBill.paper.pdf
Deciding where to cut the dendrogram produced by a hierarchical cluster analysis is known as the stopping rule problem. Heuristic approaches proposed for solving this problem have been based on statistics such as the proportion of variance accounted for by the clusters. Such statistics are reasonable but ad hoc; they are not derived from a probability model of cluster distributions. The statistic is calculated on each of the sets of clusters produced by cutting the dendrogram at successive heights, and the number of clusters in the set that optimizes the statistic is taken as the estimate of the true number of clusters. In this presentation we propose a novel stopping rule based on a probability model for graphical objects. The application of probability models to hierarchical trees is highly speculative, but is based on prior published work (Shannon and Banks 1999; Banks and Constantine 1999; McMorris and Major 1990). We propose to extend this prior work to derive a likelihood or likelihood-ratio test (LRT) for determining the number of clusters in a dataset. We are aware that the criteria for the LRT (Lehman 1999) are not fully met, so P values based on it will be approximations at best, though bootstrap P values might easily be estimated. We are beginning to contrast the likelihood and likelihood-ratio test stopping rules with other existing ad hoc approaches. In our talk we present this method for the first time and show some very preliminary results.
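To make the heuristic procedure described above concrete, here is a minimal Python sketch of an ad hoc stopping rule. It illustrates the existing variance-based approach, not the likelihood method proposed in this abstract. Several choices are assumptions made for the example: the statistic is the Calinski-Harabasz variance-ratio index (a common variance-based criterion, swapped in because the raw proportion of variance explained increases monotonically with the number of clusters), the tree is built with a naive Ward-style agglomeration, the data are simulated, and all function names are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def within_ss(clusters):
    """Total within-cluster sum of squares."""
    return sum(sq_dist(p, centroid(c)) for c in clusters for p in c)

def agglomerate(points):
    """Naive Ward-style agglomeration.

    Returns a dict mapping each cluster count k to the partition obtained
    by cutting the tree at the level that leaves k clusters.
    """
    clusters = [[p] for p in points]
    partitions = {len(clusters): [list(c) for c in clusters]}
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                # Ward's criterion: increase in within-cluster SS on merging.
                cost = (len(a) * len(b) / (len(a) + len(b))
                        * sq_dist(centroid(a), centroid(b)))
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        partitions[len(clusters)] = [list(c) for c in clusters]
    return partitions

def calinski_harabasz(clusters, total_ss):
    """Variance-ratio index: between-cluster over within-cluster mean squares."""
    n = sum(len(c) for c in clusters)
    k = len(clusters)
    w = within_ss(clusters)
    b = total_ss - w
    return (b / (k - 1)) / (w / (n - k))

def choose_k(points, k_max):
    """Score every cut from 2 to k_max clusters; return the best k and all scores."""
    partitions = agglomerate(points)
    total_ss = within_ss([points])
    scores = {k: calinski_harabasz(partitions[k], total_ss)
              for k in range(2, k_max + 1)}
    return max(scores, key=scores.get), scores

# Simulated data (an assumption for illustration): three well-separated groups.
rng = random.Random(0)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
          for cx, cy in centers for _ in range(15)]

best_k, scores = choose_k(points, k_max=6)
print("estimated number of clusters:", best_k)
```

The statistic is computed once per cut and the cut with the largest value is chosen, which is the general pattern the abstract describes. A model-based rule would replace `calinski_harabasz` with a likelihood or likelihood-ratio criterion while keeping the same loop over cuts.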