Classification on imbalanced datasets is a pervasive problem in many data mining domains, and imbalanced classification has been a hot topic in the academic community. Many solutions, from the data level to the algorithm level, have been proposed to tackle the problems resulting from imbalanced datasets. SMOTE is the most popular data-level method, and many derivatives of it have been developed to alleviate the problem of class imbalance. Our investigation indicates that SMOTE has severe flaws. We propose a new oversampling method, SNOCC, that compensates for the defects of SMOTE. In SNOCC, we increase the number of seed samples, so that new samples are no longer confined to the line segment between two seed samples, as they are in SMOTE. We also employ a novel algorithm for finding the nearest neighbors of samples, which differs from previous ones. These two improvements allow the new samples created by SNOCC to naturally reproduce the distribution of the original seed samples. Our experimental results show that SNOCC outperforms SMOTE and CBSO (a SMOTE-based method).
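The geometric point in the abstract can be illustrated with a minimal sketch. The first function is the standard SMOTE interpolation step between a sample and one neighbor. The second is only an assumption about what "more seed samples" could look like: a random convex combination of several seeds. The abstract does not specify SNOCC's actual generation rule or its neighbor search, so the names `smote_sample` and `multi_seed_sample` and the weighting scheme are hypothetical.

```python
import random


def smote_sample(x, neighbor, rng):
    """Standard SMOTE step: a random point on the segment from x to one neighbor."""
    gap = rng.random()  # interpolation factor in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]


def multi_seed_sample(seeds, rng):
    """Hypothetical multi-seed step (not taken from the paper):
    a random convex combination of all given seed samples."""
    # 1 - random() lies in (0, 1], so every raw weight is strictly positive.
    raw = [1.0 - rng.random() for _ in seeds]
    total = sum(raw)
    weights = [r / total for r in raw]  # positive weights that sum to 1
    dim = len(seeds[0])
    return [sum(w * s[d] for w, s in zip(weights, seeds)) for d in range(dim)]


rng = random.Random(0)
seeds = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # toy minority-class samples

on_segment = smote_sample(seeds[0], seeds[1], rng)   # stays on the x-axis
in_triangle = multi_seed_sample(seeds, rng)          # can fall inside the triangle
```

With two seeds the synthetic point is restricted to a one-dimensional segment; with three or more seeds it can land anywhere in their convex hull, which is one way new samples could follow the spread of the original data more closely.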
Published: 2016-03-01
Classification: Knowledge and Information Engineering, Classification, imbalanced dataset, oversampling, SMOTE, SNOCC, 68T10
@article{cai1277,
     author = {Zhuoyuan Zheng and Yunpeng Cai and Ye Li},
     note = {Affiliations: Zhuoyuan Zheng, Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin; Yunpeng Cai, Shenzhen Institutes of Advanced Technology, Key Laboratory for Biomedical Informatics and Health Engineering, Chinese Academy of Sciences, Shenzhen; Ye Li, Shenzhen Institutes of Advanced Technology, Key Laboratory for Biomedical Informatics and Health Engineering, Chinese Academy of Sciences, Shenzhen},
     title = {Oversampling Method for Imbalanced Classification},
     journal = {Computing and Informatics},
     volume = {34},
     number = {4},
     year = {2016},
     language = {en},
     url = {http://dml.mathdoc.fr/item/cai1277}
}
Zhuoyuan Zheng (Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin); Yunpeng Cai (Shenzhen Institutes of Advanced Technology, Key Laboratory for Biomedical Informatics and Health Engineering, Chinese Academy of Sciences, Shenzhen); Ye Li (Shenzhen Institutes of Advanced Technology, Key Laboratory for Biomedical Informatics and Health Engineering, Chinese Academy of Sciences, Shenzhen). Oversampling Method for Imbalanced Classification. Computing and Informatics, Volume 34 (2016) no. 4. http://gdmltest.u-ga.fr/item/cai1277/