In this work we evaluate the effect of missing data on a k-means method used for partitioning variables. The partition method is as follows: we start by computing a dissimilarity matrix between variables; multidimensional scaling ([BG05]) then provides components, and we use these components as input to a k-means method. Data are generated with the aim of obtaining different types of partitions of twenty-five variables (the data have a multinormal distribution). We then simulate missing data as in [Sil05], in different percentages. We determine the new partitions in the presence of missing data using three methods: the listwise method, simple imputation methods and a multiple imputation method. We compare the partitions obtained in the three situations with those obtained with the original complete data, using a Rand index as in [YS04] and an affinity coefficient. We conclude on the effect of missing data and of the imputation methods on this partition method under the established conditions.
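The missing-data handling and the partition comparison described above can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: missing values are coded as `None`, mean imputation stands in for the "simple imputation methods", and the classical pairwise Rand index is used, which may differ from the variant in [YS04]. The function names `listwise_delete`, `mean_impute` and `rand_index` are hypothetical.

```python
from itertools import combinations


def listwise_delete(rows):
    """Keep only the observations (rows) that have no missing value (None)."""
    return [r for r in rows if all(v is not None for v in r)]


def mean_impute(rows):
    """Replace each missing value by the mean of the observed values
    of the same variable (column). Assumed stand-in for 'simple imputation'."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        observed = [r[c] for r in rows if r[c] is not None]
        if not observed:
            raise ValueError(f"variable {c} has no observed value")
        means.append(sum(observed) / len(observed))
    return [[means[c] if v is None else v for c, v in enumerate(r)]
            for r in rows]


def rand_index(labels_a, labels_b):
    """Classical Rand index: share of item pairs that both partitions
    treat the same way (together in both, or separated in both)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two items")
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Toy data: 3 observations on 2 variables, one value missing.
data = [[1.0, None], [3.0, 4.0], [5.0, 6.0]]
complete_cases = listwise_delete(data)   # drops the first observation
imputed = mean_impute(data)              # fills the gap with the mean 5.0

# Toy partitions of 5 variables: complete data vs. data with missing values.
reference = [0, 0, 1, 1, 2]
with_missing = [1, 1, 0, 0, 0]
score = rand_index(reference, with_missing)  # 8 of 10 pairs agree
```

In a study like the one described, the two label vectors would come from running the same partition method (dissimilarities, multidimensional scaling, k-means) once on the complete data and once on the data after deletion or imputation. A Rand index close to 1 would then indicate that the missing-data treatment left the partition of the variables largely unchanged.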
@article{hal-01125931,
author = {Lorga da Silva, Ana and Saporta, Gilbert and Bacelar-Nicolau, Helena},
title = {Dealing with missing data in a k-means method - A simulation based approach},
journal = {HAL},
volume = {2006},
number = {0},
year = {2006},
language = {en},
url = {http://dml.mathdoc.fr/item/hal-01125931}
}
Lorga da Silva, Ana; Saporta, Gilbert; Bacelar-Nicolau, Helena. Dealing with missing data in a k-means method - A simulation based approach. HAL, Volume 2006 (2006) no. 0. http://gdmltest.u-ga.fr/item/hal-01125931/