Lookahead selective sampling for incomplete data
Loai Abdallah; Ilan Shimshoni
International Journal of Applied Mathematics and Computer Science, Volume 26 (2016), pp. 871-884 / Harvested from The Polish Digital Mathematics Library

Missing values are common in real-world data. Several methods deal with this problem. In this paper we present lookahead selective sampling (LSS) algorithms for datasets with missing values. We developed two versions of selective sampling. The first integrates, within the framework of the LSS algorithm, a distance function that measures the similarity between pairs of incomplete points. The second uses ensemble clustering to represent the data in a cluster matrix without missing values and then runs the LSS algorithm in the ensemble clustering instance space (LSS-EC). To construct the cluster matrix, we use the k-means and mean shift clustering algorithms, specially modified to deal with incomplete datasets. We tested our algorithms on six standard numerical datasets from different fields, on which we simulated missing values, and compared the performance of the LSS and LSS-EC algorithms for incomplete data with two other basic methods. Our experiments show that the suggested selective sampling algorithms outperform the other methods.
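The first algorithm above relies on a distance function defined directly on incomplete points. As a rough illustration of that idea (not the authors' exact metric), the following sketch computes a partial Euclidean distance: it averages squared differences over the coordinates observed in both points and rescales to the full dimension, so distances remain comparable across pairs with different missingness patterns. The name `partial_distance` and the `None`-as-missing convention are assumptions for this example.

```python
import math

def partial_distance(x, y):
    """Distance between two points with missing values (None marks a
    missing coordinate). Averages squared differences over coordinates
    observed in both points, rescaled to the full dimension.
    A generic partial-distance sketch, not the paper's metric."""
    diffs = [(a - b) ** 2 for a, b in zip(x, y)
             if a is not None and b is not None]
    if not diffs:
        return float('inf')  # no jointly observed coordinates
    # rescale by len(x)/len(diffs) so sparsely co-observed pairs
    # are not systematically closer than fully observed ones
    return math.sqrt(len(x) * sum(diffs) / len(diffs))

# on a fully observed pair this reduces to ordinary Euclidean distance
print(partial_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```

On a pair such as `[1.0, None]` and `[4.0, None]`, only the first coordinate contributes and the result is scaled up accordingly.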

Published: 2016-01-01
EUDML-ID : urn:eudml:doc:287176
@article{bwmeta1.element.bwnjournal-article-amcv26i4p871bwm,
     author = {Loai Abdallah and Ilan Shimshoni},
     title = {Lookahead selective sampling for incomplete data},
     journal = {International Journal of Applied Mathematics and Computer Science},
     volume = {26},
     year = {2016},
     pages = {871-884},
     language = {en},
     url = {http://dml.mathdoc.fr/item/bwmeta1.element.bwnjournal-article-amcv26i4p871bwm}
}
Loai Abdallah; Ilan Shimshoni. Lookahead selective sampling for incomplete data. International Journal of Applied Mathematics and Computer Science, Volume 26 (2016), pp. 871-884. http://gdmltest.u-ga.fr/item/bwmeta1.element.bwnjournal-article-amcv26i4p871bwm/

[000] Abdallah, L. and Shimshoni, I. (2013). An ensemble-clustering-based distance metric and its applications, International Journal of Business Intelligence and Data Mining 8(3): 264-287.

[001] Abdallah, L. and Shimshoni, I. (2014). Mean shift clustering algorithm for data with missing values, 14th International Conference of DaWaK, Munich, Germany, pp. 426-438.

[002] Abdallah, L. and Shimshoni, I. (2016). k-means over incomplete datasets using mean Euclidean distance, 12th International Conference on Machine Learning and Data Mining, New York, NY, pp. 113-127.

[003] Bai, X., Zhang, M., Wu, Q., Zheng, R., Zhao, H. and Wei, W. (2015). A novel data filling algorithm for incomplete information system based on valued limited tolerance relation, International Journal of Database Theory and Application 8(6): 149-164.

[004] Clark, P.G., Grzymala-Busse, J.W. and Rzasa, W. (2013). Consistency of incomplete data, 2nd International Conference on Data Technologies and Applications, Marrakech, Morocco, pp. 80-87.

[005] Clustering datasets (2008). http://cs.joensuu.fi/sipu/datasets/, University of Eastern Finland, Joensuu.

[006] Dasgupta, S. and Hsu, D. (2008). Hierarchical sampling for active learning, 25th International Conference on Machine Learning, Helsinki, Finland, pp. 208-215.

[007] Dekel, O., Gentile, C. and Sridharan, K. (2012). Selective sampling and active learning from single and multiple teachers, Journal of Machine Learning Research 13(1): 2655-2697. | Zbl 06276195

[008] Donders, A.R.T., van der Heijden, G.J., Stijnen, T. and Moons, K.G. (2006). Review: A gentle introduction to imputation of missing values, Journal of Clinical Epidemiology 59(10): 1087-1091.

[009] Grzymala-Busse, J. and Hu, M. (2001). A comparison of several approaches to missing attribute values in data mining, in W. Ziarko et al. (Eds.), Rough Sets and Current Trends in Computing, Springer, Berlin/Heidelberg, pp. 378-385. | Zbl 1014.68558

[010] Grzymala-Busse, J.W. (2006). A rough set approach to data with missing attribute values, in J.F. Peters and Y. Yao (Eds.), Rough Sets and Knowledge Technology, Springer, Berlin/Heidelberg, pp. 58-67.

[011] Hospedales, T.M., Gong, S. and Xiang, T. (2013). Finding rare classes: Active learning with generative and discriminative models, IEEE Transactions on Knowledge and Data Engineering 25(2): 374-386.

[012] Ibrahim, J.G., Chen, M.-H., Lipsitz, S.R. and Herring, A.H. (2005). Missing-data methods for generalized linear models: A comparative review, Journal of the American Statistical Association 100(469): 332-346. | Zbl 1117.62360

[013] Lewis, D. and Gale, W. (1994). A sequential algorithm for training text classifiers, 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, pp. 3-12.

[014] Li, H., Shi, Y., Liu, Y., Hauptmann, A.G. and Xiong, Z. (2012). Cross-domain video concept detection: A joint discriminative and generative active learning approach, Expert Systems with Applications 39(15): 12220-12228.

[015] Lindenbaum, M., Markovitch, S. and Rusakov, D. (2004). Selective sampling for nearest neighbor classifiers, Machine Learning 54(2): 125-152. | Zbl 1057.68087

[016] Little, R.J. (1988). Missing-data adjustments in large surveys, Journal of Business & Economic Statistics 6(3): 287-296.

[017] Little, R.J. and Rubin, D.B. (2014). Statistical Analysis with Missing Data, John Wiley & Sons, Hoboken, NJ. | Zbl 0665.62004

[018] Lughofer, E. (2012). Hybrid active learning for reducing the annotation effort of operators in classification systems, Pattern Recognition 45(2): 884-896.

[019] MacQueen, J.B. (1967). Some methods for classification and analysis of multivariate observations, 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, pp. 281-297. | Zbl 0214.46201

[020] Magnani, M. (2004). Techniques for dealing with missing data in knowledge discovery tasks, Obtido 15(01): 2007.

[021] Nowicki, R.K. (2010). On classification with missing data using rough-neuro-fuzzy systems, International Journal of Applied Mathematics and Computer Science 20(1): 55-67, doi: 10.2478/v10006-010-0004-8. | Zbl 1300.93106

[022] Nowicki, R.K., Nowak, B.A. and Woźniak, M. (2016). Application of rough sets in k nearest neighbours algorithm for classification of incomplete samples, in S. Kunifuji et al. (Eds.), Knowledge, Information and Creativity Support Systems, Springer, Berlin/Heidelberg, pp. 243-257.

[023] Stefanowski, J. and Tsoukias, A. (2001). Incomplete information tables and rough classification, Computational Intelligence 17(3): 545-566.

[024] Strehl, A. and Ghosh, J. (2002). Cluster ensembles: A knowledge reuse framework for combining multiple partitions, Journal of Machine Learning Research 3: 583-617. | Zbl 1084.68759

[025] Tan, M. and Schlimmer, J. (1990). Two case studies in cost-sensitive concept acquisition, 8th National Conference on Artificial Intelligence, Boston, MA, USA, pp. 854-860.

[026] Turney, P. (1995). Cost-sensitive classification: Empirical evaluation of a hybrid genetic decision tree induction algorithm, Journal of Artificial Intelligence Research 2(1): 369-409.

[027] Xu, Z., Akella, R. and Zhang, Y. (2007). Incorporating diversity and density in active learning for relevance feedback, in G. Amati et al. (Eds.), Advances in Information Retrieval, Springer, Berlin/Heidelberg, pp. 246-257.

[028] Zhang, S., Qin, Z., Ling, C. and Sheng, S. (2005). Missing is useful: Missing values in cost-sensitive decision trees, IEEE Transactions on Knowledge and Data Engineering 17(12): 1689-1693.

[029] Zhang, Y., Wen, J., Wang, X. and Jiang, Z. (2014). Semi-supervised learning combining co-training with active learning, Expert Systems with Applications 41(5): 2372-2378.