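An Improved KNN Text Categorization Algorithm by Adopting Cluster Technology
Cite this article: ZHANG Xiao-Fei, HUANG He-Yan. An Improved KNN Text Categorization Algorithm by Adopting Cluster Technology [J]. Pattern Recognition and Artificial Intelligence, 2009, 22(6).
Authors: ZHANG Xiao-Fei, HUANG He-Yan
Affiliation: Research Center of Computer and Language Information Engineering, Chinese Academy of Sciences, Beijing 100097
Funding: National Natural Science Foundation of China; National High Technology Research and Development Program of China (863 Program)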
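Abstract: The KNN algorithm is stable and accurate, but because its time complexity is proportional to the number of samples, it classifies slowly and is hard to apply effectively to large-scale information processing. This paper proposes an improved KNN text classification method. The basic idea is to use text clustering to merge each group of similar sample documents into a single center document, and to build the classification model from these center documents in place of the original samples. This reduces the number of documents against which similarity must be computed, and so increases classification speed. Experiments show that, measured by precision, recall, and F-score, the proposed method performs comparably to the classic KNN algorithm while classifying considerably faster.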

Keywords: k-Nearest Neighbor (KNN); Text Categorization; Text Clustering; Cluster Center; Natural Language Processing

An Improved KNN Text Categorization Algorithm by Adopting Cluster Technology
ZHANG Xiao-Fei, HUANG He-Yan. An Improved KNN Text Categorization Algorithm by Adopting Cluster Technology [J]. Pattern Recognition and Artificial Intelligence, 2009, 22(6).
Authors: ZHANG Xiao-Fei, HUANG He-Yan
Abstract: The k-Nearest Neighbor (KNN) algorithm is accurate and stable. However, its time complexity is directly proportional to the sample size, so it classifies slowly and is difficult to apply in large-scale information processing. An improved KNN text categorization algorithm is proposed that classifies faster than traditional KNN. First, similar sample documents are merged into center documents by automatic text clustering. Then, the large set of original samples is replaced by the much smaller set of cluster centers. The computation required by KNN is thereby greatly reduced and classification is accelerated. Experimental results show that the time complexity of the proposed algorithm decreases by one order of magnitude, while its accuracy is approximately equal to that of SVM and traditional KNN.
Keywords: k-Nearest Neighbor (KNN); Text Categorization; Text Clustering; Cluster Center; Natural Language Processing (NLP)
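
The idea summarized in the abstract (merge similar training documents into center documents, then run KNN against the centers instead of the full sample set) can be illustrated with a minimal Python sketch. The abstract does not state which clustering algorithm, similarity measure, or voting rule the authors used, so everything specific here is an assumption made for illustration: single-pass clustering within each class, cosine similarity on sparse term-weight vectors, a similarity-weighted vote, and the hypothetical names `cosine`, `build_centers`, `knn_classify`, `threshold`, and `k`.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_centers(samples, threshold=0.5):
    """Merge similar same-class documents into center documents.

    samples: list of (vector, label). Returns a list of (center, label, size).
    Single-pass clustering is an assumption of this sketch, not the paper's
    documented procedure.
    """
    clusters = []  # each entry: [sum_vector, label, member_count]
    for vec, label in samples:
        best, best_sim = None, threshold
        for cluster in clusters:
            if cluster[1] != label:
                continue
            # Cosine is scale-invariant, so the sum vector stands in for the mean.
            sim = cosine(vec, cluster[0])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([dict(vec), label, 1])  # start a new cluster
        else:
            for term, weight in vec.items():
                best[0][term] = best[0].get(term, 0.0) + weight
            best[2] += 1
    # Convert each sum vector into a mean vector (the center document).
    return [
        ({t: w / n for t, w in total.items()}, label, n)
        for total, label, n in clusters
    ]


def knn_classify(query, centers, k=3):
    """Label `query` by a similarity-weighted vote over its k nearest centers."""
    ranked = sorted(
        ((cosine(query, center), label) for center, label, _ in centers),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = defaultdict(float)
    for sim, label in ranked[:k]:
        votes[label] += sim
    return max(votes, key=votes.get)


# Toy data, invented for this example only.
samples = [
    ({"goal": 2.0, "match": 1.0}, "sports"),
    ({"goal": 1.0, "match": 2.0}, "sports"),
    ({"stock": 2.0, "market": 1.0}, "finance"),
    ({"stock": 1.0, "market": 2.0}, "finance"),
    ({"stock": 1.0, "bank": 3.0}, "finance"),
]
centers = build_centers(samples, threshold=0.5)
print(len(samples), "samples reduced to", len(centers), "centers")
print(knn_classify({"goal": 1.0}, centers, k=1))
```

Each query is compared against the centers only, so the per-query cost scales with the number of centers rather than the number of original samples. In this sketch the `threshold` parameter controls that trade-off: a lower value merges more documents and speeds up classification, at the risk of blurring distinct documents into one center.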
This article is indexed in Wanfang Data and other databases.