Data mining methods for gene selection on the basis of gene expression arrays

Michał Muszyński; Stanisław Osowski

International Journal of Applied Mathematics and Computer Science, Volume 24 (2014), pp. 657-668 / Harvested from The Polish Digital Mathematics Library

Abstract

The paper presents data mining methods for gene selection, applied to the recognition of a particular type of prostate cancer on the basis of gene expression arrays. Several gene selection methods, including the Fisher method, correlation of a gene with the class, the support vector machine and statistical hypothesis testing, are compared on the basis of clustering measures. The results of the individual selection methods are combined to identify the most often selected genes, which form the pattern best associated with the cancerous cases. This pattern of selected genes serves as the input to the classifier that performs the final recognition. Numerical results for distinguishing prostate cancer from normal (reference) cases using the selected genes and the support vector machine confirm the good performance of the proposed gene selection approach.
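The idea of combining several selection methods and keeping the most often selected genes can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the formulas, so the Fisher-type score |m1 - m0| / (s1 + s0), the use of absolute Pearson correlation with a 0/1 class label, the toy data, and the parameters `k` and `min_votes` are all assumptions made for illustration. The SVM-based and hypothesis-test-based selectors mentioned in the abstract are omitted.

```python
from collections import Counter
from statistics import mean, pstdev


def fisher_score(values, labels):
    """Assumed two-class Fisher-type score |m1 - m0| / (s1 + s0) for one gene."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    spread = pstdev(pos) + pstdev(neg)
    return abs(mean(pos) - mean(neg)) / spread if spread > 0 else 0.0


def class_correlation(values, labels):
    """Absolute Pearson correlation between one gene and the 0/1 class label."""
    mv, ml = mean(values), mean(labels)
    cov = sum((v - mv) * (y - ml) for v, y in zip(values, labels))
    norm = (sum((v - mv) ** 2 for v in values)
            * sum((y - ml) ** 2 for y in labels)) ** 0.5
    return abs(cov) / norm if norm > 0 else 0.0


def top_genes(expression, labels, score, k):
    """Indices of the k genes ranked highest by the given score function."""
    scores = [score(gene, labels) for gene in expression]
    order = sorted(range(len(expression)), key=lambda g: scores[g], reverse=True)
    return order[:k]


def most_often_selected(expression, labels, scorers, k, min_votes):
    """Genes that appear in the top-k list of at least min_votes scorers."""
    votes = Counter()
    for score in scorers:
        votes.update(top_genes(expression, labels, score, k))
    return sorted(g for g, n in votes.items() if n >= min_votes)


# Hypothetical toy data: rows are genes, columns are samples (0 = normal, 1 = cancer).
labels = [0, 0, 0, 1, 1, 1]
expression = [
    [1.0, 1.1, 0.9, 3.0, 3.2, 2.9],  # gene 0: separates the classes
    [2.0, 2.5, 1.5, 2.1, 2.4, 1.6],  # gene 1: uninformative
    [5.0, 4.8, 5.1, 1.0, 1.2, 0.9],  # gene 2: separates the classes
    [0.5, 0.7, 0.6, 0.6, 0.5, 0.7],  # gene 3: uninformative
]

selected = most_often_selected(
    expression, labels, [fisher_score, class_correlation], k=2, min_votes=2
)
print(selected)  # prints [0, 2]
```

In a full pipeline of the kind the abstract describes, the rows of `expression` indexed by `selected` would then be passed to a classifier such as a support vector machine.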

Published: 2014-01-01

Zbl 1322.92036

EUDML-ID : urn:eudml:doc:271873

@article{bwmeta1.element.bwnjournal-article-amcv24i3p657bwm,
     author = {Micha\l\ Muszy\'nski and Stanis\l aw Osowski},
     title = {Data mining methods for gene selection on the basis of gene expression arrays},
     journal = {International Journal of Applied Mathematics and Computer Science},
     volume = {24},
     year = {2014},
     pages = {657-668},
     zbl = {1322.92036},
     language = {en},
     url = {http://dml.mathdoc.fr/item/bwmeta1.element.bwnjournal-article-amcv24i3p657bwm}
}

Michał Muszyński; Stanisław Osowski. Data mining methods for gene selection on the basis of gene expression arrays. International Journal of Applied Mathematics and Computer Science, Volume 24 (2014), pp. 657-668. http://gdmltest.u-ga.fr/item/bwmeta1.element.bwnjournal-article-amcv24i3p657bwm/
