Effects of Some Design Factors on the Distribution of Similarity Indices in Cluster Analysis
Khan, H. M.
This article investigates the effects of the number of clusters, cluster size, and correction for chance agreement on the distribution of two similarity indices, the Jaccard and Rand indices. Skewness and kurtosis are calculated for the two indices and their corrected forms and then compared with those of the normal distribution. Three clustering algorithms are implemented: complete linkage, Ward's method, and K-means. Data were randomly generated from bivariate normal distributions with specified means and variance-covariance matrices. A three-way ANOVA is performed to assess the significance of the design factors, using the skewness and kurtosis of the indices as responses. Test statistics for skewness and kurtosis and the observed power are calculated. Simulation results showed that, regardless of the clustering algorithm or similarity index used, the cluster size × number of clusters interaction and the main effects of cluster size and number of clusters were always significant for both skewness and kurtosis. The three-way interaction of cluster size × correction × number of clusters was significant for skewness of the Rand and Jaccard indices using all clustering algorithms, but was not significant using Ward's method for both the Rand and Jaccard indices, while it was significant for the Jaccard index only using the complete linkage and K-means algorithms. The correction for chance agreement was significant for skewness and kurtosis of the Rand and Jaccard indices when the complete linkage method was used. Hence, these design factors must be taken into consideration when studying the distribution of such indices.
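For readers unfamiliar with the two indices, the following is a minimal sketch of how the uncorrected Rand and Jaccard indices can be computed from two partitions of the same objects by counting object pairs. It is an illustration of the standard pair-counting definitions, not the article's own code; the function names (`pair_counts`, `rand_index`, `jaccard_index`) and the example label vectors are assumptions made here, and the chance-corrected forms studied in the article are not shown.

```python
from itertools import combinations


def pair_counts(labels_a, labels_b):
    """Classify every pair of objects by whether each partition groups it together.

    Returns (both, only_a, only_b, neither). Names are illustrative assumptions.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    both = only_a = only_b = neither = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
        else:
            neither += 1
    return both, only_a, only_b, neither


def rand_index(labels_a, labels_b):
    """Proportion of pairs on which the two partitions agree."""
    both, only_a, only_b, neither = pair_counts(labels_a, labels_b)
    total = both + only_a + only_b + neither
    return (both + neither) / total if total else float("nan")


def jaccard_index(labels_a, labels_b):
    """Like the Rand index, but ignoring pairs separated in both partitions."""
    both, only_a, only_b, _ = pair_counts(labels_a, labels_b)
    denom = both + only_a + only_b
    return both / denom if denom else float("nan")


# Hypothetical example: six objects, true grouping vs. a recovered grouping.
true_labels = [0, 0, 0, 1, 1, 1]
found_labels = [0, 0, 1, 1, 1, 1]
print(rand_index(true_labels, found_labels))     # 10 of 15 pairs agree
print(jaccard_index(true_labels, found_labels))  # 4 of 9 relevant pairs agree
```

In a simulation of the kind described above, such indices would be computed for each generated data set, and the skewness and kurtosis of the resulting values would then be examined.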