Categorical Incremental Data Labeling Algorithm (一种符号型增量数据标签算法)

Cite this article:	LI Yan-hong, LI De-yu, WANG Su-ge. Categorical Incremental Data Labeling Algorithm[J]. Computer Science, 2015, 42(6): 223-227.

Authors:	LI Yan-hong, LI De-yu, WANG Su-ge

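Affiliations:	1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006; 2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Taiyuan 030006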

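Funding:	This work was supported by the National Natural Science Foundation of China (61272095, 61175067, 61303091, 61202365, 61100138, 61403238), the Natural Science Foundation of Shanxi Province (2012061015), the Shanxi Province Science and Technology Research Project (20110321027-02), and the Shanxi Province Research Project for Returned Overseas Scholars (2013-014).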

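Abstract:	Data labeling is a simple and effective way to improve the efficiency of clustering incremental data. It is the process of assigning each newly arriving data point to the cluster most similar to it. One difficulty in categorical data analysis is the lack of an appropriate way to define the similarity between a data point and a cluster. To address this, the representative of a cluster is defined as a list of the attribute values of every attribute in the cluster together with their frequencies in the cluster, and a point-cluster dissimilarity is defined by the change in information entropy. Based on this dissimilarity measure, a categorical incremental data labeling algorithm is designed to assign unlabeled data to the appropriate cluster. Comparative experiments on public data sets and a text corpus show that the algorithm achieves high labeling accuracy and low time cost, and also scales well.
Keywords:	clustering; data labeling; incremental data; categorical data; information entropy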
Categorical Incremental Data Labeling Algorithm

LI Yan-hong, LI De-yu and WANG Su-ge. Categorical Incremental Data Labeling Algorithm[J]. Computer Science, 2015, 42(6): 223-227.

Authors:	LI Yan-hong, LI De-yu and WANG Su-ge

Affiliation:	School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China; Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Taiyuan 030006, China

Abstract:	Data labeling has become a simple but effective way to improve the efficiency of incremental data clustering. Data labeling assigns each newly arriving data point to the cluster closest to it. One of the main difficulties in categorical data analysis, however, is the lack of an appropriate way to define the similarity between a data point and a cluster. To overcome this difficulty, we define the representative of a cluster as a list of all attribute values, with their frequencies, in each attribute domain of the cluster, and then define a point-cluster dissimilarity measure by means of the change in information entropy. Based on this dissimilarity measure, we design a categorical incremental data labeling algorithm that allocates each unlabeled data point to the appropriate cluster. Comparative experiments on several public data sets and a text corpus show that the proposed algorithm achieves higher labeling accuracy, shorter execution time, and better scalability.
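
The labeling step described in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the abstract does not give the exact dissimilarity formula, so the sketch assumes the dissimilarity is the increase in the cluster's size-weighted entropy (summed over attributes) caused by adding the point. The names `ClusterSummary`, `weighted_entropy`, `dissimilarity` and `label`, and the optional `update` flag, are hypothetical.

```python
import math
from collections import Counter


class ClusterSummary:
    """Cluster representative: one value-frequency table per attribute."""

    def __init__(self, points):
        points = list(points)
        if not points:
            raise ValueError("a cluster summary needs at least one point")
        self.size = 0
        self.counts = [Counter() for _ in points[0]]
        for point in points:
            self.add(point)

    def add(self, point):
        """Fold one data point (a tuple of categorical values) into the tables."""
        for counter, value in zip(self.counts, point):
            counter[value] += 1
        self.size += 1


def weighted_entropy(counts, n):
    """Return n times the sum of per-attribute Shannon entropies (in bits)."""
    total = 0.0
    for counter in counts:
        total -= sum((c / n) * math.log2(c / n) for c in counter.values())
    return n * total


def dissimilarity(summary, point):
    """Assumed measure: entropy increase caused by adding the point."""
    before = weighted_entropy(summary.counts, summary.size)
    # Build the frequency tables the cluster would have with the point added.
    grown = [c + Counter([v]) for c, v in zip(summary.counts, point)]
    after = weighted_entropy(grown, summary.size + 1)
    return after - before


def label(point, summaries, update=False):
    """Return the index of the cluster with the smallest dissimilarity."""
    best = min(range(len(summaries)),
               key=lambda k: dissimilarity(summaries[k], point))
    if update:  # optionally let the labeled point refine the representative
        summaries[best].add(point)
    return best


if __name__ == "__main__":
    clusters = [
        ClusterSummary([("red", "small"), ("red", "small"), ("red", "large")]),
        ClusterSummary([("blue", "large"), ("blue", "large"), ("green", "large")]),
    ]
    for p in [("red", "small"), ("blue", "large")]:
        print(p, "->", label(p, clusters))
```

Because only per-attribute frequency tables are kept, labeling a point costs time proportional to the number of clusters times the number of distinct attribute values, independent of how many points each cluster already holds. Whether the paper updates the representative after each assignment is not stated in the abstract, which is why `update` is left off by default here.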

Keywords:	Clustering; Data labeling; Incremental data; Categorical data; Information entropy
This article is indexed in databases including Wanfang Data.