Dealing with missing data in a k-means method - A simulation based approach
Lorga da Silva, Ana ; Saporta, Gilbert ; Bacelar-Nicolau, Helena
HAL, hal-01125931 / Harvested from HAL
In this work we propose to evaluate the effect of missing data on a k-means method used for partitioning variables. The partition method is as follows: we start by computing a dissimilarity matrix between variables; multidimensional scaling ([BG05]) provides components, and we use these components as input to a k-means method. Data are generated with the aim of obtaining different types of partitions from twenty-five variables (the data have a multinormal distribution). We then simulate missing data as in [Sil05], in different percentages. We determine the new partitions in the presence of missing data using three methods: the listwise method, simple imputation methods, and a multiple imputation method. We compare the partitions obtained in the three situations with those obtained from the original complete data, using a Rand index as in [YS04] and an affinity coefficient. We conclude on the effect of missing data and imputation methods on this partition method under the established conditions.
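The pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the dissimilarity 1 - |Pearson correlation|, classical (Torgerson) scaling with two components, a plain Lloyd-style k-means, a block-correlated covariance with five groups of five variables, a 5% rate of values deleted completely at random, and column-mean imputation as the simple imputation method. Multiple imputation and the affinity coefficient are omitted. All function names are hypothetical.

```python
import numpy as np

def variable_dissimilarity(X):
    """Dissimilarity between columns of X: 1 - |Pearson r| (assumed choice)."""
    D = 1.0 - np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(D, 0.0)
    return D

def classical_mds(D, n_components=2):
    """Classical scaling: eigendecomposition of the double-centred squared distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    w, V = np.linalg.eigh(B)
    top = np.argsort(w)[::-1][:n_components]
    return V[:, top] * np.sqrt(np.clip(w[top], 0.0, None))

def kmeans(Z, k, n_init=10, n_iter=100, seed=0):
    """Lloyd's algorithm with random restarts; returns the lowest-inertia labelling."""
    rng = np.random.default_rng(seed)
    best_labels, best_inertia = None, np.inf
    for _ in range(n_init):
        centers = Z[rng.choice(len(Z), size=k, replace=False)]
        for _ in range(n_iter):
            d = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            # Keep the old centre if a cluster becomes empty.
            new_centers = np.array([
                Z[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        d = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels, inertia = d.argmin(axis=1), d.min(axis=1).sum()
        if inertia < best_inertia:
            best_labels, best_inertia = labels, inertia
    return best_labels

def rand_index(a, b):
    """Fraction of item pairs on which two partitions agree (together or apart)."""
    a, b = np.asarray(a), np.asarray(b)
    iu = np.triu_indices(len(a), k=1)
    same_a = (a[:, None] == a[None, :])[iu]
    same_b = (b[:, None] == b[None, :])[iu]
    return float(np.mean(same_a == same_b))

def partition_variables(X, k):
    """Dissimilarity matrix -> MDS components -> k-means on the variables."""
    return kmeans(classical_mds(variable_dissimilarity(X)), k)

# Simulated multinormal data: 25 variables in 5 correlated blocks (assumed design).
rng = np.random.default_rng(42)
n_obs, n_vars, k = 500, 25, 5
cov = np.kron(np.eye(k), np.full((5, 5), 0.8)) + 0.2 * np.eye(n_vars)
X = rng.multivariate_normal(np.zeros(n_vars), cov, size=n_obs)

# Delete 5% of the values completely at random (assumed mechanism and rate).
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.05] = np.nan

# Listwise method: keep only fully observed rows.
X_listwise = X_miss[~np.isnan(X_miss).any(axis=1)]
# Simple imputation: replace each missing value by its column mean.
X_mean = np.where(np.isnan(X_miss), np.nanmean(X_miss, axis=0), X_miss)

labels_full = partition_variables(X, k)
ri_listwise = rand_index(labels_full, partition_variables(X_listwise, k))
ri_mean = rand_index(labels_full, partition_variables(X_mean, k))
print("Rand index, listwise vs complete data:", ri_listwise)
print("Rand index, mean imputation vs complete data:", ri_mean)
```

A Rand index close to 1 means the partition of the variables obtained after handling the missing data agrees with the one obtained from the complete data; repeating the run over several missing-data rates and random seeds would give a small-scale version of the kind of comparison the abstract describes.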
Published: 2006-08-28
Classification: Imputation methods, Missing data, Multidimensional scaling, k-means, [INFO]Computer Science [cs], [MATH.MATH-ST]Mathematics [math]/Statistics [math.ST]
@article{hal-01125931,
     author = {Lorga da Silva, Ana and Saporta, Gilbert and Bacelar-Nicolau, Helena},
     title = {Dealing with missing data in a k-means method - A simulation based approach},
     journal = {HAL},
     volume = {2006},
     number = {0},
     year = {2006},
     language = {en},
     url = {http://dml.mathdoc.fr/item/hal-01125931}
}
Lorga da Silva, Ana; Saporta, Gilbert; Bacelar-Nicolau, Helena. Dealing with missing data in a k-means method - A simulation based approach. HAL, Volume 2006 (2006) no. 0. http://gdmltest.u-ga.fr/item/hal-01125931/