
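Clustering Algorithm with Kernel Density Estimation*
Cite this article: ZHU Jie, CHEN Lifei. Clustering Algorithm with Kernel Density Estimation*[J]. Pattern Recognition and Artificial Intelligence, 2017, 30(5): 439-447.
Authors: ZHU Jie  CHEN Lifei
Affiliations: 1. Southwest China Institute of Electronic Technology, Chengdu 610036
2. College of Mathematics and Computer Science, Fujian Normal University, Fuzhou 350117
Funding: Supported by the National Natural Science Foundation of China (No. 61672157) and the Natural Science Foundation of Fujian Province (No. 2015J01238)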
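Abstract: Similarity measurement is an important foundation of cluster analysis, and effectively measuring the similarity between categorical symbols is one of its difficulties. This paper measures the similarity between symbols according to the kernel probability density of discrete symbols. Unlike traditional simple symbol matching and symbol frequency estimation, this similarity measure, under the effect of the kernel bandwidth, no longer depends on the assumption that symbols of the same attribute are independent. A Bayesian clustering model for categorical data is then built, a likelihood-based similarity measure between categorical objects and clusters is defined, and a model-based clustering algorithm is given. Using leave-one-out estimation and maximum likelihood estimation, three solution methods are proposed to determine the optimal kernel bandwidth dynamically during clustering. Experiments show that, compared with clustering algorithms that use feature weighting or the simple-matching distance, the proposed algorithm achieves higher clustering accuracy, and the estimated kernel bandwidths are of practical significance in applications such as important-feature identification.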

Keywords: Categorical Data Clustering    Probability Model    Similarity Measure    Kernel Density Estimation (KDE)    Bandwidth Estimation
Received: 2016-09-30

Clustering Algorithm with Kernel Density Estimation
ZHU Jie, CHEN Lifei. Clustering Algorithm with Kernel Density Estimation[J]. Pattern Recognition and Artificial Intelligence, 2017, 30(5): 439-447.
Authors: ZHU Jie  CHEN Lifei
Affiliation: 1. Southwest China Institute of Electronic Technology, Chengdu 610036
2. College of Mathematics and Computer Science, Fujian Normal University, Fuzhou 350117
Abstract: Similarity measures are a foundation of cluster analysis, yet defining an effective similarity measure for discrete symbols (categories) remains difficult. This paper measures the similarity between categories in terms of their kernel probability density. Unlike traditional simple-matching or frequency-estimation methods, the proposed measure is controlled by the bandwidth of the kernel function and therefore no longer relies on the assumption that categories of the same attribute are statistically independent. A Bayesian clustering model for categorical data is then established on this kernel density estimate, a likelihood-based object-to-cluster similarity measure is defined, and a model-based clustering algorithm is derived. Three data-driven methods, based on leave-one-out estimation and maximum likelihood estimation, are proposed to determine the optimal kernel bandwidths dynamically during clustering. Experiments on real-world datasets show that the proposed algorithm achieves higher clustering accuracy than existing algorithms that use a simple-matching distance or attribute weighting, and that the estimated bandwidths are of practical use in applications such as identifying important features.
Keywords: Categorical Data Clustering  Probability Model  Similarity Measure  Kernel Density Estimation (KDE)  Bandwidth Estimation
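
The abstract does not state which kernel function or bandwidth estimators the paper uses, so the following is only an illustrative sketch of the general idea of kernel density estimation over categories. It assumes the Aitchison-Aitken kernel, a common choice for unordered discrete data, and selects the bandwidth on a simple grid by leave-one-out log-likelihood. The function names `aa_kernel_density`, `loo_log_likelihood`, and `select_bandwidth` are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log, inf


def aa_kernel_density(values, categories, bandwidth):
    """Kernel density estimate over a finite category set.

    Assumes the Aitchison-Aitken kernel (an illustrative choice, not
    confirmed by the abstract): weight 1 - bandwidth on a match and
    bandwidth / (m - 1) on a mismatch, where m is the number of categories.
    """
    m = len(categories)
    if m < 2:
        raise ValueError("need at least two categories")
    if not 0.0 <= bandwidth <= (m - 1) / m:
        raise ValueError("bandwidth must lie in [0, (m - 1) / m]")
    n = len(values)
    counts = Counter(values)
    density = {}
    for c in categories:
        f = counts.get(c, 0) / n  # plain relative frequency
        # Averaging the kernel over all observations smooths the frequency.
        density[c] = (1.0 - bandwidth) * f + bandwidth / (m - 1) * (1.0 - f)
    return density


def loo_log_likelihood(values, categories, bandwidth):
    """Leave-one-out log-likelihood of the sample for a given bandwidth."""
    m = len(categories)
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    total = 0.0
    for c, cnt in Counter(values).items():
        # Frequency of c among the other n - 1 observations.
        f_loo = (cnt - 1) / (n - 1)
        p = (1.0 - bandwidth) * f_loo + bandwidth / (m - 1) * (1.0 - f_loo)
        if p <= 0.0:
            return -inf
        total += cnt * log(p)
    return total


def select_bandwidth(values, categories, grid_size=50):
    """Pick the grid bandwidth that maximizes the leave-one-out likelihood."""
    upper = (len(categories) - 1) / len(categories)
    grid = [upper * (i / grid_size) for i in range(grid_size + 1)]
    return max(grid, key=lambda h: loo_log_likelihood(values, categories, h))
```

With a bandwidth of 0 the estimate reduces to the raw category frequencies, and at the upper bound `(m - 1) / m` it becomes uniform over the categories. A call such as `aa_kernel_density(["a", "a", "a", "b"], ["a", "b", "c"], 0.3)` assigns nonzero probability to the unseen category `"c"`, which is the smoothing effect a kernel bandwidth provides over plain frequency counting.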