An algorithm for reducing the dimension and size of a sample for data exploration procedures
Piotr Kulczycki ; Szymon Łukasik
International Journal of Applied Mathematics and Computer Science, Volume 24 (2014), pp. 133-149 / Harvested from The Polish Digital Mathematics Library

The paper deals with reducing the dimension and size of a data set (random sample) for exploratory data analysis procedures. The algorithm investigated here is based on a linear transformation to a space of lower dimension that preserves, as far as possible, the distances between particular elements. The elements of the transformation matrix are computed using the metaheuristic of parallel fast simulated annealing. Moreover, data set elements whose location changes significantly with respect to the others during the transformation are eliminated or assigned reduced importance. The presented method can be applied universally across a wide range of data exploration problems, offering flexible customization, suitability for dynamic data environments, and performance comparable to or better than principal component analysis. Its positive features were verified in detail for the domain's fundamental tasks of clustering, classification and detection of atypical elements (outliers).
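The core idea described above, searching for a distance-preserving linear map with a simulated-annealing loop using heavy-tailed (Cauchy) perturbations, can be sketched as follows. This is a minimal illustrative example only, not the authors' parallel implementation: the function names, the raw-stress cost, the perturbation scale and the cooling schedule are all assumptions made for the sketch.

```python
import numpy as np

def pairwise_distances(X):
    """Full matrix of Euclidean distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(A, X, D_orig):
    """Raw stress: sum of squared differences between pairwise
    distances in the original and in the projected space."""
    D_proj = pairwise_distances(X @ A)
    return np.sum((D_orig - D_proj) ** 2)

def anneal_projection(X, target_dim, n_iter=2000, t0=1.0, cooling=0.995, seed=0):
    """Search for a linear map A (n_features x target_dim) that
    approximately preserves pairwise distances, via a simple
    sequential simulated-annealing loop with Cauchy-distributed
    neighbour generation (a stand-in for the paper's parallel
    fast simulated annealing)."""
    rng = np.random.default_rng(seed)
    D_orig = pairwise_distances(X)
    A = rng.standard_normal((X.shape[1], target_dim))
    cost = stress(A, X, D_orig)
    best_A, best_cost = A.copy(), cost
    t = t0
    for _ in range(n_iter):
        # Heavy-tailed (Cauchy) perturbation, scaled by temperature,
        # allows occasional long jumps out of local minima.
        candidate = A + 0.01 * t * rng.standard_cauchy(A.shape)
        c_cost = stress(candidate, X, D_orig)
        # Metropolis acceptance rule: always accept improvements,
        # accept deteriorations with probability decaying in 1/t.
        if c_cost < cost or rng.random() < np.exp((cost - c_cost) / max(t, 1e-12)):
            A, cost = candidate, c_cost
            if cost < best_cost:
                best_A, best_cost = A.copy(), cost
        t *= cooling  # geometric cooling schedule
    return best_A, best_cost
```

Down-weighting of sample elements that moved significantly relative to the others, as the abstract describes, would be a separate post-processing step applied to the projected points `X @ A`.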

Published: 2014-01-01
EUDML-ID : urn:eudml:doc:271922
@article{bwmeta1.element.bwnjournal-article-amcv24i1p133bwm,
     author = {Piotr Kulczycki and Szymon \L ukasik},
     title = {An algorithm for reducing the dimension and size of a sample for data exploration procedures},
     journal = {International Journal of Applied Mathematics and Computer Science},
     volume = {24},
     year = {2014},
     pages = {133-149},
     zbl = {1292.93044},
     language = {en},
     url = {http://dml.mathdoc.fr/item/bwmeta1.element.bwnjournal-article-amcv24i1p133bwm}
}
Piotr Kulczycki; Szymon Łukasik. An algorithm for reducing the dimension and size of a sample for data exploration procedures. International Journal of Applied Mathematics and Computer Science, Volume 24 (2014), pp. 133-149. http://gdmltest.u-ga.fr/item/bwmeta1.element.bwnjournal-article-amcv24i1p133bwm/

[000] Aarts, E., Korst, J. and van Laarhoven, P. (1997). Simulated annealing, in E. Aarts and J. Lenstra (Eds.), Local Search in Combinatorial Optimization, Wiley, Chichester, pp. 91-120. | Zbl 0905.90140

[001] Alba, E. (2005). Parallel Metaheuristics: A New Class of Algorithms, Wiley, New York, NY. | Zbl 1094.90052

[002] Aswani Kumar, C. and Srinivas, S. (2006). Latent semantic indexing using eigenvalue analysis for efficient information retrieval, International Journal of Applied Mathematics and Computer Science 16(4): 551-558. | Zbl 1122.68047

[003] Aswani Kumar, C. (2009). Analysis of unsupervised dimensionality reduction techniques, Computer Science and Information Systems 6(2): 217-227.

[004] Azencott, R. (1992). Simulated Annealing: Parallelization Techniques, Wiley, New York, NY.

[005] Bartenhagen, C., Klein, H.-U., Ruckert, C., Jiang, X. and Dugas, M. (2010). Comparative study of unsupervised dimension reduction techniques for the visualization of microarray gene expression data, BMC Bioinformatics 11, paper no. 567.

[006] Bartkuté, V. and Sakalauskas, L. (2009). Statistical inferences for termination of Markov type random search algorithms, Journal of Optimization Theory and Applications 141(3): 475-493. | Zbl 1183.62081

[007] Ben-Ameur, W. (2004). Computing the initial temperature of simulated annealing, Computational Optimization and Applications 29(3): 367-383. | Zbl 1062.90073

[008] Borg, I. and Groenen, P. (2005). Modern Multidimensional Scaling. Theory and Applications, Springer-Verlag, Berlin. | Zbl 1085.62079

[009] Camastra, F. (2003). Data dimensionality estimation methods: A survey, Pattern Recognition 36(12): 2945-2954. | Zbl 1059.68100

[010] Charytanowicz, M., Niewczas, J., Kulczycki, P., Kowalski, P., Łukasik, S. and Żak, S. (2010). Complete gradient clustering algorithm for features analysis of X-ray images, in E. Piętka and J. Kawa (Eds.), Information Technologies in Biomedicine, Vol. 2, Springer-Verlag, Berlin, pp. 15-24.

[011] Cortez, P., Cerdeira, A., Almeida, F., Matos, T. and Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties, Decision Support Systems 47(4): 547-553.

[012] Cox, T. and Cox, M. (2000). Multidimensional Scaling, Chapman and Hall, Boca Raton, FL. | Zbl 1147.68460

[013] Cunningham, P. (2007). Dimension reduction, Technical report, UCD School of Computer Science and Informatics, Dublin.

[014] Czarnowski, I. and Jędrzejowicz, P. (2011). Application of agent-based simulated annealing and tabu search procedures to solving the data reduction problem, International Journal of Applied Mathematics and Computer Science 21(1): 57-68, DOI: 10.2478/v10006-011-0004-3. | Zbl 1221.68191

[015] David, H. and Nagaraja, H. (2003). Order Statistics, Wiley, New York, NY. | Zbl 1053.62060

[016] Deng, Z., Chung, F.-L. and Wang, S. (2008). FRSDE: Fast reduced set density estimator using minimal enclosing ball approximation, Pattern Recognition 41(4): 1363-1372. | Zbl 1131.68086

[017] François, D., Wertz, V. and Verleysen, M. (2007). The concentration of fractional distances, IEEE Transactions on Knowledge and Data Engineering 19(7): 873-886.

[018] Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Transactions on Pattern Analysis and Machine Intelligence 6: 721-741. | Zbl 0573.62030

[019] Gendreau, M. and Potvin, J.-Y. (2010). Handbook of Metaheuristics, Springer, New York, NY. | Zbl 1198.90002

[020] Han, J. and Kamber, M. (2006). Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, CA. | Zbl 05951239

[021] Ingber, L. (1996). Adaptive simulated annealing (ASA): Lessons learned, Control and Cybernetics 25(1): 33-54. | Zbl 0860.93035

[022] Inza, I., Larrañaga, P., Etxeberria, R. and Sierra, B. (2000). Feature subset selection by Bayesian network-based optimization, Artificial Intelligence 123(1-2): 157-184. | Zbl 0952.68118

[023] Ishibuchi, H., Nakashima, T. and Murata, T. (2001). Three-objective genetics-based machine learning for linguistic rule extraction, Information Sciences 136(1-4): 109-133. | Zbl 0996.68175

[024] Kerdprasop, K., Kerdprasop, N. and Sattayatham, P. (2005). Weighted k-means for density-biased clustering, in A. Tjoa and J. Trujillo (Eds.), Data Warehousing and Knowledge Discovery, Lecture Notes in Computer Science, Vol. 3589, Springer-Verlag, Berlin, pp. 488-497.

[025] Kulczycki, P. (2005). Kernel Estimators in System Analysis, WNT, Warsaw, (in Polish). | Zbl 1122.93077

[026] Kulczycki, P. (2008). Kernel estimators in industrial applications, in B. Prasad (Ed.), Soft Computing Applications in Industry, Springer-Verlag, Berlin, pp. 69-91.

[027] Kulczycki, P. and Charytanowicz, M. (2010). A complete gradient clustering algorithm formed with kernel estimators, International Journal of Applied Mathematics and Computer Science 20(1): 123-134, DOI: 10.2478/v10006-010-0009-3. | Zbl 1300.62043

[028] Kulczycki, P. and Kowalski, P. (2011). Bayes classification of imprecise information of interval type, Control and Cybernetics 40(1): 101-123. | Zbl 1318.62212

[029] Kulczycki, P. and Łukasik, S. (2014). Reduction of dimension and size of data set by parallel fast simulated annealing, in L.T. Koczy, C.R. Pozna, R. Claudiu and J. Kacprzyk (Eds.), Issues and Challenges of Intelligent Systems and Computational Intelligence, Springer-Verlag, Berlin, pp. 273-292.

[030] Kuo, Y. (2010). Using simulated annealing to minimize fuel consumption for the time-dependent vehicle routing problem, Computers & Industrial Engineering 59(1): 157-165.

[031] Łukasik, S. and Kulczycki, P. (2011). An algorithm for sample and data dimensionality reduction using fast simulated annealing, in J. Tang, I. King, L. Chen and J. Wang (Eds.), Advanced Data Mining and Applications, Lecture Notes in Computer Science, Vol. 7120, Springer-Verlag, Berlin, pp. 152-161.

[032] Łukasik, S. and Kulczycki, P. (2013). Using topology preservation measures for multidimensional intelligent data analysis in the reduced feature space, in L. Rutkowski, M. Korytkowski, R. Scherer, R. Tadeusiewicz, L. Zadeh and J. Zurada (Eds.), Artificial Intelligence and Soft Computing, Lecture Notes in Computer Science, Vol. 7895, Springer-Verlag, Berlin, pp. 184-193.

[033] van der Maaten, L. (2009). Feature Extraction from Visual Data, Ph.D. thesis, Tilburg University, Tilburg.

[034] Mangasarian, O. and Wolberg, W. (1990). Cancer diagnosis via linear programming, SIAM News 23(5): 1-18.

[035] Mitra, P., Murthy, C. and Pal, S. (2002). Density-based multiscale data condensation, IEEE Transactions on Pattern Analysis and Machine Intelligence 24(6): 734-747.

[036] Nam, D., Lee, J.-S. and Park, C. (2004). n-dimensional Cauchy neighbor generation for the fast simulated annealing, IEICE Transactions on Information and Systems E87-D(11): 2499-2502.

[037] Oliveira, J. and Pedrycz, W. (Eds.) (2007). Advances in Fuzzy Clustering and Its Applications, Wiley, Chichester.

[038] Pal, S. and Mitra, P. (2004). Pattern Recognition Algorithms for Data Mining, Chapman and Hall, Boca Raton, FL. | Zbl 1099.68091

[039] Rand, W.M. (1971). Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association 66(336): 846-850.

[040] Parvin, H., Alizadeh, H. and Minati, B. (2010). A modification on k-nearest neighbor classifier, Global Journal of Computer Science and Technology 10(14): 37-41.

[041] Sait, S. and Youssef, H. (2000). Iterative Computer Algorithms with Applications in Engineering: Solving Combinatorial Optimization Problems, IEEE Computer Society Press, Los Alamitos, CA. | Zbl 0933.68151

[042] Sammon, J. (1969). A nonlinear mapping for data structure analysis, IEEE Transactions on Computers 18(5): 401-409.

[043] Saxena, A., Pal, N. and Vora, M. (2010). Evolutionary methods for unsupervised feature selection using Sammon's stress function, Fuzzy Information and Engineering 2(3): 229-247.

[044] Strickert, M., Teichmann, S., Sreenivasulu, N. and Seiffert, U. (2005). DIPPP online self-improving linear map for distance-preserving data analysis, 5th Workshop on Self-Organizing Maps, WSOM'05, Paris, France, pp. 661-668.

[045] Sumi, S.M., Zaman, M.F. and Hirose, H. (2012). A rainfall forecasting method using machine learning models and its application to the Fukuoka city case, International Journal of Applied Mathematics and Computer Science 22(4): 841-854, DOI: 10.2478/v10006-012-0062-1. | Zbl 1283.68305

[046] Szu, H. and Hartley, R. (1987). Fast simulated annealing, Physics Letters A 122(3-4): 157-162.

[047] Tian, T., Wilcox, R. and James, G. (2010). Data reduction in classification: A simulated annealing based projection method, Statistical Analysis and Data Mining 3(5): 319-331.

[048] UC Irvine Machine Learning Repository (2013). http://archive.ics.uci.edu/ml/.

[049] Vanstrum, M. and Starks, S. (1981). An algorithm for optimal linear maps, Southeastcon Conference, Huntsville, AL, USA, pp. 106-110.

[050] Wand, M. and Jones, M. (1995). Kernel Smoothing, Chapman and Hall, London. | Zbl 0854.62043

[051] Wilson, D. and Martinez, T. (2000). Reduction techniques for instance-based learning algorithms, Machine Learning 38(3): 257-286. | Zbl 0954.68126

[052] Xu, R. and Wunsch, D. (2009). Clustering, Wiley, Hoboken, NJ.

[053] Zhigljavsky, A. and Žilinskas, A. (2008). Stochastic Global Optimization, Springer-Verlag, Berlin. | Zbl 1136.90003