Nonparametric statistical analysis for multiple comparison of machine learning regression algorithms
Bogdan Trawiński ; Magdalena Smętek ; Zbigniew Telec ; Tadeusz Lasota
International Journal of Applied Mathematics and Computer Science, Volume 22 (2012), pp. 867-881 / Harvested from The Polish Digital Mathematics Library

In this paper we present guidelines for applying nonparametric statistical tests and post-hoc procedures devised to perform multiple comparisons of machine learning algorithms. We emphasize that it is necessary to distinguish between pairwise and multiple comparison tests. We show that the pairwise Wilcoxon test, when used for multiple comparisons, leads to overoptimistic conclusions. We carry out an intensive normality examination employing ten different tests, showing that the output of machine learning algorithms for regression problems does not satisfy normality requirements. We conduct experiments on nonparametric statistical tests and post-hoc procedures designed for multiple 1 × N and N × N comparisons with six different neural regression algorithms over 29 benchmark regression data sets. Our investigation demonstrates the usefulness and strength of multiple comparison statistical procedures for analysing and selecting machine learning algorithms.
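The overoptimism of repeated pairwise testing can be illustrated with a little arithmetic. The sketch below is not taken from the paper: it assumes independent tests at a significance level of 0.05, uses hypothetical p-values, and applies the Holm step-down correction as one representative post-hoc adjustment. The function names are likewise illustrative.

```python
from math import comb


def familywise_error_rate(alpha: float, num_comparisons: int) -> float:
    """Probability of at least one false rejection, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** num_comparisons


def holm_adjust(p_values: list[float]) -> list[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted p-values must not decrease along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Six algorithms compared all-against-all (N x N) give 15 pairwise tests.
num_algorithms = 6
num_pairs = comb(num_algorithms, 2)
fwer = familywise_error_rate(0.05, num_pairs)

# Hypothetical unadjusted p-values, e.g. from pairwise tests against a control (1 x N).
raw_p = [0.001, 0.012, 0.020, 0.041, 0.300]
adjusted_p = holm_adjust(raw_p)

print(f"pairwise tests: {num_pairs}, family-wise error rate: {fwer:.3f}")
for raw, adj in zip(raw_p, adjusted_p):
    print(f"raw p = {raw:.3f}  Holm-adjusted p = {adj:.3f}")
```

Under these assumptions, fifteen uncorrected tests carry a family-wise error rate above 50%, and four of the five hypothetical raw p-values fall below 0.05, while only two remain below it after the Holm adjustment.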

Published: 2012-01-01
EUDML-ID : urn:eudml:doc:244548
@article{bwmeta1.element.bwnjournal-article-amcv22z4p867bwm,
     author = {Bogdan Trawi\'nski and Magdalena Sm\k etek and Zbigniew Telec and Tadeusz Lasota},
     title = {Nonparametric statistical analysis for multiple comparison of machine learning regression algorithms},
     journal = {International Journal of Applied Mathematics and Computer Science},
     volume = {22},
     year = {2012},
     pages = {867-881},
     zbl = {1283.93331},
     language = {en},
     url = {http://dml.mathdoc.fr/item/bwmeta1.element.bwnjournal-article-amcv22z4p867bwm}
}
Bogdan Trawiński; Magdalena Smętek; Zbigniew Telec; Tadeusz Lasota. Nonparametric statistical analysis for multiple comparison of machine learning regression algorithms. International Journal of Applied Mathematics and Computer Science, Vol. 22 (2012), pp. 867-881. http://gdmltest.u-ga.fr/item/bwmeta1.element.bwnjournal-article-amcv22z4p867bwm/
