
Dissertation Defense


Candidate: Ahmed N. Albatineh

Degree of: Doctor of Philosophy

Department: Statistics

Title: On Similarity Measures for Cluster Analysis

Date: Monday, May 17, 2004 4:00-6:00 p.m.
Alavi Commons Room, Everett Tower


Committee: Dr. Daniel Mihalko, Chair
Dr. Magdalena Niewiadomska-Bugaj
Dr. Michael Stoline
Dr. Jung Chao Wang
Dr. Robert Buck

Abstract: This study examines the relationships among similarity measures that quantify the agreement between two clusterings of the same data set. It identifies a family of similarity measures that are identical when corrected for chance agreement. In particular, it proves that the measures of Rand (R), Hubert (H), and Czekanowski (CZ) are identical when corrected for chance agreement. It also proves that the measures of McConnaughey (MC) and Kulczynski (K) are identical when corrected for chance agreement. Moreover, if the two algorithms produce the same number of clusters and the clusters are of equal size, then all similarity measures in the family that attain a maximum value of one are identical when corrected for chance agreement.
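The equivalence of R, H, and CZ after correction can be checked numerically on a small example. The sketch below is illustrative only and is not taken from the dissertation. It assumes the common pair-counting definitions (R as the share of agreeing pairs, H = 2R - 1, CZ = 2a / (P + Q)), the correction (S - E[S]) / (1 - E[S]), and the expectation E[a] = PQ / N for fixed marginal totals. All function names are hypothetical.

```python
from collections import Counter
from math import comb


def pair_counts(labels_a, labels_b):
    """Return (a, p, q, total) pair counts for two clusterings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    cells = Counter(zip(labels_a, labels_b))      # contingency table entries m_ij
    rows = Counter(labels_a)                      # row totals m_i.
    cols = Counter(labels_b)                      # column totals m_.j
    a = sum(comb(m, 2) for m in cells.values())   # pairs joined in both clusterings
    p = sum(comb(m, 2) for m in rows.values())    # pairs joined in the first clustering
    q = sum(comb(m, 2) for m in cols.values())    # pairs joined in the second clustering
    return a, p, q, comb(len(labels_a), 2)


def corrected_measures(labels_a, labels_b):
    """Chance-corrected R, H, and CZ under the assumptions stated in the lead-in.

    Requires p + q > 0 and an expected value below 1 for each measure.
    """
    a, p, q, total = pair_counts(labels_a, labels_b)
    expected_a = p * q / total  # assumed E[a] with fixed marginal totals

    # With the marginals fixed, each measure is a linear function of a,
    # so its expectation is the same function evaluated at E[a].
    def rand(x):
        return (total - p - q + 2 * x) / total

    def hubert(x):
        return 2 * rand(x) - 1

    def czekanowski(x):
        return 2 * x / (p + q)

    def corrected(measure):
        return (measure(a) - measure(expected_a)) / (1 - measure(expected_a))

    return {"R": corrected(rand), "H": corrected(hubert), "CZ": corrected(czekanowski)}


if __name__ == "__main__":
    first = [0, 0, 0, 1, 1, 1]
    second = [0, 0, 1, 1, 2, 2]
    for name, value in corrected_measures(first, second).items():
        print(name, round(value, 6))
```

Under these assumptions all three corrected values reduce to (a - PQ/N) / ((P + Q)/2 - PQ/N), which is why they agree up to floating-point rounding.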

Fowlkes and Mallows (FM) derived the mean and variance of their measure and of R under the assumptions of fixed marginal totals m_i. and m_.j and independence of the clustering algorithms. This study generalizes that derivation to any member of the family under the same assumptions. A simulation study shows that R corrected for chance agreement is not the only measure suited to recovering clustering structure: the corrected FM and Wallace (W) measures can perform as well, and in general any measure should be corrected for chance agreement. The two expectation formulas proposed by Morey and Agresti (MA) and by Hubert and Arabie (HA) will be investigated while varying the sample size, number of clusters, and clustering algorithm.
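The role of the fixed-marginals and independence assumptions can be illustrated with a small simulation. The sketch below is not the dissertation's derivation. It assumes the usual pair-counting form FM = a / sqrt(PQ) and models independence by randomly permuting one set of labels, which keeps both sets of marginal totals fixed. Under that model E[a] = PQ / N, so the mean of FM is sqrt(PQ) / N. The function names, number of draws, and seed are hypothetical choices.

```python
import random
from collections import Counter
from math import comb, sqrt


def joined_pairs(labels):
    """Number of item pairs placed in the same cluster by one clustering."""
    return sum(comb(m, 2) for m in Counter(labels).values())


def fm_index(labels_a, labels_b):
    """FM measure, assumed here to be a / sqrt(P * Q); requires P, Q > 0."""
    cells = Counter(zip(labels_a, labels_b))
    a = sum(comb(m, 2) for m in cells.values())  # pairs joined in both clusterings
    return a / sqrt(joined_pairs(labels_a) * joined_pairs(labels_b))


def fm_null_mean(labels_a, labels_b):
    """Mean of FM under fixed marginals and independence: sqrt(P * Q) / N."""
    p, q = joined_pairs(labels_a), joined_pairs(labels_b)
    return sqrt(p * q) / comb(len(labels_a), 2)


def simulated_fm_mean(labels_a, labels_b, draws=20000, seed=0):
    """Average FM over random permutations of the second labelling."""
    rng = random.Random(seed)
    shuffled = list(labels_b)
    total = 0.0
    for _ in range(draws):
        rng.shuffle(shuffled)  # permuting labels preserves the cluster sizes
        total += fm_index(labels_a, shuffled)
    return total / draws


if __name__ == "__main__":
    first = [0] * 4 + [1] * 4 + [2] * 4
    second = [0] * 6 + [1] * 6
    print("formula:   ", fm_null_mean(first, second))
    print("simulation:", simulated_fm_mean(first, second))
```

The two printed values should be close, with the gap shrinking as the number of draws grows.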

Finally, a method for determining the number of clusters in a given data set will be investigated through simulations and its performance will be compared to some existing methods. Two real data examples, namely the protein consumption in 25 European countries and the birth and death rates for 74 countries in 1974, will be discussed to show the effectiveness of the proposed method.




 

 






Updated May 14, 2004
Copyright © 2002-2004, Western Michigan University
Contact
The Graduate College, 260 W. Walwood Hall, Kalamazoo, MI 49008-5456 Phone: 269 387-8212