
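A Categorical Incremental Data Labeling Algorithm
Cite this article: LI Yan-hong, LI De-yu, WANG Su-ge. A Categorical Incremental Data Labeling Algorithm [J]. Computer Science (计算机科学), 2015, 42(6): 223-227.
Authors: LI Yan-hong, LI De-yu, WANG Su-ge
Affiliations: 1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006
2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Taiyuan 030006
Funding: This work was supported by the National Natural Science Foundation of China (61272095, 61175067, 61303091, 61202365, 61100138, 61403238), the Natural Science Foundation of Shanxi Province (2012061015), the Shanxi Province Science and Technology Key Project (20110321027-02), and the Shanxi Province Research Project for Returned Overseas Scholars (2013-014).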
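Abstract: Data labeling is a simple and effective way to improve the efficiency of incremental data clustering. It is the process of assigning each newly arriving data point to the cluster most similar to it. One of the difficulties in categorical data analysis is the lack of an appropriate way to define the similarity between a data point and a cluster. To address this, the cluster representative is defined as a list of the values of every attribute in the cluster together with their frequencies in the cluster, and the point-cluster dissimilarity is defined through the change in information entropy. Based on this dissimilarity measure, a categorical incremental data labeling algorithm is designed to assign unlabeled data to the appropriate cluster. Comparative experiments on public data sets and a text corpus show that the algorithm not only achieves high labeling accuracy and low time cost, but also scales well.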

Keywords: Clustering; Data labeling; Incremental data; Categorical data; Information entropy

Categorical Incremental Data Labeling Algorithm
LI Yan-hong, LI De-yu and WANG Su-ge. Categorical Incremental Data Labeling Algorithm [J]. Computer Science, 2015, 42(6): 223-227.
Authors: LI Yan-hong, LI De-yu and WANG Su-ge
Affiliation: School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China; Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Taiyuan 030006, China
Abstract: Data labeling has become a simple yet effective way to improve the efficiency of incremental data clustering. It assigns each newly arriving data point to the cluster closest to that point. One of the main difficulties in categorical data analysis, however, is the lack of an appropriate way to define the similarity between a data point and a cluster. To overcome this difficulty, we define the representative of a cluster as a list of all attribute values, together with their frequencies, in each attribute domain of the cluster, and then define a point-cluster dissimilarity measure based on the change in information entropy. Based on this dissimilarity measure, we design a categorical incremental data labeling algorithm that allocates each unlabeled data point to the appropriate cluster. Comparative experiments on several public data sets and a text corpus show that the proposed algorithm not only achieves higher labeling accuracy with less execution time, but also scales better.
Keywords: Clustering; Data labeling; Incremental data; Categorical data; Information entropy
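
The abstract describes the cluster representative and the entropy-based point-cluster dissimilarity only in words, so the following Python sketch is one possible reading rather than the authors' formulation. It assumes that the representative is a per-attribute table of value counts, that a cluster's entropy is the sum of the Shannon entropies of its attributes, and that the dissimilarity is the size-weighted entropy increase caused by adding the point. All function names here are hypothetical.

```python
import math
from collections import Counter


def build_representative(points):
    """Cluster representative: per-attribute value counts plus the cluster size."""
    n_attrs = len(points[0])
    counts = [Counter(p[j] for p in points) for j in range(n_attrs)]
    return counts, len(points)


def attribute_entropy(counter, n):
    """Shannon entropy (in bits) of one attribute's value distribution."""
    return -sum((c / n) * math.log2(c / n) for c in counter.values())


def cluster_entropy(counts, n):
    """Assumption: cluster entropy is the sum of its per-attribute entropies."""
    return sum(attribute_entropy(c, n) for c in counts)


def dissimilarity(point, rep):
    """Assumed measure: size-weighted entropy increase when `point` joins the cluster."""
    counts, n = rep
    before = n * cluster_entropy(counts, n)
    # Counts after hypothetically adding the point; the representative is not mutated.
    after_counts = [c + Counter([v]) for c, v in zip(counts, point)]
    after = (n + 1) * cluster_entropy(after_counts, n + 1)
    return after - before


def label(point, reps):
    """Assign the point to the cluster whose entropy it disturbs the least."""
    return min(reps, key=lambda name: dissimilarity(point, reps[name]))


# Toy data with two categorical attributes (colour, size); purely illustrative.
reps = {
    "A": build_representative([("red", "small"), ("red", "small"), ("red", "large")]),
    "B": build_representative([("blue", "large"), ("blue", "large"), ("green", "large")]),
}
print(label(("red", "small"), reps))  # → A
```

Because only value counts are stored, labeling a point costs time proportional to the number of clusters times the number of distinct attribute values, independent of how many points each cluster already holds. Whether the representative is updated after each assignment is not stated in the abstract, so this sketch leaves it unchanged.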
This article is indexed in Wanfang Data (万方数据) and other databases.