

An adaptive rule-based classifier for mining big biological data
Affiliation:1. Computational Modeling Lab, Department of Computer Science, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium;2. Department of Population Medicine & Diagnostic Sciences, College of Veterinary Medicine, Cornell University, Ithaca, NY 14850, USA;1. Dipartimento di Ingegneria dell’Informazione, Università Politecnica delle Marche, via Brecce Bianche, 60131 Ancona, Italy;2. Universitá degli Studi eCampus, via Isimbardi 10, 22060 Novedrate, Italy;1. Universidade Federal Fluminense, Brazil;2. Departamento de Computação, R. Recife s/n, Jardim Bela Vista, Rio das Ostras-RJ, Brazil;3. Departamento de Ciência de Computação, Av. Gal. Milton Tavares de Souza, s/n, Sao Domingos, Niterói-RJ, Brazil;1. Electronic Engineering Department/Graduate School at Shenzhen, Tsinghua University, Beijing 100084, China;2. Biometrics Research Centre and the Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong;3. Biocomputing Research Center, Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen 518055, China
Abstract: In this paper, we introduce a new adaptive rule-based classifier for multi-class classification of biological data that addresses several problems in classifying such data: overfitting, noisy instances and class imbalance. Rules are well known to be an effective way of representing data in a human-interpretable form. The proposed rule-based classifier combines the random subspace and boosting approaches with an ensemble of decision trees to construct a set of classification rules without involving global optimisation. It uses the random subspace approach to avoid overfitting, the boosting approach to classify noisy instances, and the ensemble of decision trees to deal with the class-imbalance problem. The classifier uses two popular classification techniques: decision tree and k-nearest-neighbor algorithms. Decision trees are used to evolve classification rules from the training data, while k-nearest-neighbor is used to analyse the misclassified instances and to resolve ambiguity between contradictory rules. The classifier runs a series of k iterations to develop a set of classification rules from the training data and, in a boosting-like manner, pays more attention in each subsequent iteration to the instances that were misclassified. This paper focuses in particular on devising an optimal ensemble classifier that helps improve prediction accuracy in DNA variant identification and classification. The performance of the proposed classifier is compared with that of well-established machine learning and data mining algorithms on genomic data (148 exome data sets) of Brugada syndrome and on 10 real benchmark life sciences data sets from the UCI (University of California, Irvine) machine learning repository. The experimental results indicate that the proposed classifier achieves exemplary classification accuracy on different types of biological data.
Overall, the proposed classifier offers good prediction accuracy for the classification of new DNA variants, where noisy and misclassified variants are optimised to increase test performance.
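
The abstract gives no pseudocode, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes depth-1 decision trees (stumps) as a stand-in for full decision trees, a SAMME-style weight update as the boosting step, and a k-nearest-neighbor fallback whenever the weighted rule votes are close. The function names, the `margin` parameter and the toy data are all hypothetical, and the class-imbalance handling mentioned in the abstract is not modelled.

```python
import math
import random
from collections import Counter

def fit_stump(X, y, weights, features):
    """Best weighted one-split rule using only the sampled feature subspace."""
    best, total = None, sum(weights)
    for f in features:
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left, right = Counter(), Counter()
            for row, label, w in zip(X, y, weights):
                (left if row[f] <= t else right)[label] += w
            l_label = left.most_common(1)[0][0]
            r_label = right.most_common(1)[0][0]
            err = total - left[l_label] - right[r_label]
            if best is None or err < best[0]:
                best = (err, f, t, l_label, r_label)
    return best

def train(X, y, rounds=10, subspace=2, seed=0):
    """Boosting loop: each round fits a rule on a random feature subspace."""
    rng = random.Random(seed)
    n, n_classes = len(X), len(set(y))  # assumes at least two classes
    weights = [1.0 / n] * n
    rules = []
    for _ in range(rounds):
        feats = rng.sample(range(len(X[0])), subspace)
        stump = fit_stump(X, y, weights, feats)
        if stump is None:
            continue
        err, f, t, left, right = stump
        err = max(err, 1e-10)  # avoid division by zero for perfect rules
        if err >= 1 - 1 / n_classes:
            continue  # rule is no better than chance, skip it
        alpha = math.log((1 - err) / err) + math.log(n_classes - 1)
        rules.append((alpha, f, t, left, right))
        # Increase the weight of the instances this rule misclassified.
        weights = [
            w * math.exp(alpha) if (left if row[f] <= t else right) != label else w
            for row, label, w in zip(X, y, weights)
        ]
        norm = sum(weights)
        weights = [w / norm for w in weights]
    return rules

def predict(rules, X_train, y_train, row, k=3, margin=0.1):
    """Weighted rule vote; fall back to k-NN when the rules contradict."""
    votes = Counter()
    for alpha, f, t, left, right in rules:
        votes[left if row[f] <= t else right] += alpha
    ranked = votes.most_common(2)
    if ranked and (
        len(ranked) == 1
        or (ranked[0][1] - ranked[1][1]) / sum(votes.values()) > margin
    ):
        return ranked[0][0]
    nearest = sorted(zip(X_train, y_train), key=lambda p: math.dist(p[0], row))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Made-up toy data: features 0 and 1 separate the classes, feature 2 is noise.
X_toy = [
    [1.0, 1.1, 0.9], [1.5, 0.8, 0.2], [2.0, 1.4, 0.7], [2.5, 1.0, 0.4],
    [6.0, 5.2, 0.8], [6.5, 5.9, 0.1], [7.0, 5.5, 0.6], [7.5, 6.1, 0.3],
]
y_toy = ["neg"] * 4 + ["pos"] * 4
rules = train(X_toy, y_toy)
print(predict(rules, X_toy, y_toy, [1.2, 0.9, 0.5]))
```

Each learned rule reads as "if feature f is at most t then class A, otherwise class B", which keeps the model human-interpretable in the sense the abstract describes. A fuller version would extract multi-condition rules from deeper trees.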
This article is indexed in ScienceDirect and other databases.