Clustering Algorithm with Kernel Density Estimation^*

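Cite this article:	ZHU Jie, CHEN Lifei. Clustering Algorithm with Kernel Density Estimation[J]. Pattern Recognition and Artificial Intelligence, 2017, 30(5): 439-447.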

Authors:	ZHU Jie, CHEN Lifei

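Affiliations:	1. Southwest China Institute of Electronic Technology, Chengdu 610036; 2. College of Mathematics and Computer Science, Fujian Normal University, Fuzhou 350117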

Funding:	Supported by the National Natural Science Foundation of China (No. 61672157) and the Natural Science Foundation of Fujian Province (No. 2015J01238)

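Abstract:	Similarity measurement is an essential foundation of cluster analysis, and measuring the similarity between categorical symbols effectively is one of its difficulties. This paper measures the similarity between symbols by the kernel probability density of discrete symbols. Unlike traditional simple symbol matching and symbol frequency estimation, this measure, through the bandwidth of the kernel function, no longer depends on the assumption that symbols on the same attribute are independent. A Bayesian clustering model for categorical data is then established, a likelihood-based object-to-cluster similarity measure is defined, and a model-based clustering algorithm is given. Using leave-one-out estimation and maximum likelihood estimation, three methods are proposed to determine the optimal kernel bandwidth dynamically during clustering. Experiments show that, compared with clustering algorithms that use feature weighting or the simple matching distance, the proposed algorithm achieves higher clustering accuracy, and that the estimated kernel bandwidths are of practical value in applications such as identifying important features.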
Keywords:	categorical data clustering, probability model, similarity measure, kernel density estimation (KDE), bandwidth estimation
Received:	2016-09-30
Clustering Algorithm with Kernel Density Estimation

ZHU Jie, CHEN Lifei. Clustering Algorithm with Kernel Density Estimation[J]. Pattern Recognition and Artificial Intelligence, 2017, 30(5): 439-447.

Authors:	ZHU Jie, CHEN Lifei

Affiliation:	1. Southwest China Institute of Electronic Technology, Chengdu 610036; 2. College of Mathematics and Computer Science, Fujian Normal University, Fuzhou 350117

Abstract:	Similarity measures are a fundamental building block of cluster analysis, yet defining an effective similarity measure for discrete symbols (categories) is difficult. This paper proposes measuring the similarity between categories in terms of their kernel probability density. Unlike the traditional simple-matching and frequency-estimation methods, the proposed measure, through the bandwidth of the kernel function, no longer depends on the assumption that categories on the same attribute are statistically independent. A Bayesian clustering model is then established based on kernel density estimation of categories, and a clustering algorithm is derived to optimize the model using a likelihood-based object-to-cluster similarity measure. Finally, three data-driven approaches based on leave-one-out estimation and maximum likelihood estimation are proposed to determine the optimal kernel bandwidths dynamically during clustering. Experiments on real-world datasets show that the proposed algorithm achieves higher clustering accuracy than existing algorithms that use a simple-matching distance or attribute-weighting variants. The results also show that the estimated bandwidths are of practical use in applications such as identifying important features.

Keywords:	Categorical Data Clustering, Probability Model, Similarity Measure, Kernel Density Estimation (KDE), Bandwidth Estimation
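
The abstract does not state the kernel or the exact likelihood used in the paper, so the following is only a minimal sketch of the general idea, under two assumptions that are not taken from the source: an Aitchison-Aitken-style kernel for unordered categories, and attributes treated as independent given the cluster. The names `kernel_probabilities` and `log_likelihood` are hypothetical. With this kernel, the estimated probability of a category is its cluster frequency shrunk toward the uniform distribution by the bandwidth, and an object's log-likelihood under a cluster serves as an object-to-cluster similarity.

```python
import math
from collections import Counter


def kernel_probabilities(values, categories, bandwidth):
    """Kernel-smoothed probability of each category on one attribute.

    Assumed kernel (Aitchison-Aitken style, not confirmed by the source):
    bandwidth 0 gives the raw frequencies, bandwidth 1 gives the uniform
    distribution over `categories`.
    """
    n = len(values)
    m = len(categories)
    counts = Counter(values)
    return {
        c: (1.0 - bandwidth) * counts[c] / n + bandwidth / m
        for c in categories
    }


def log_likelihood(obj, cluster, categories_per_attr, bandwidths):
    """Log-likelihood of `obj` under `cluster`.

    Assumes attributes are independent given the cluster and that every
    bandwidth is in (0, 1], so no smoothed probability is zero.
    """
    total = 0.0
    for d, value in enumerate(obj):
        column = [row[d] for row in cluster]
        probs = kernel_probabilities(column, categories_per_attr[d], bandwidths[d])
        total += math.log(probs[value])
    return total


# Toy cluster with two categorical attributes (illustrative data only).
cluster = [("a", "x"), ("a", "y"), ("b", "x")]
categories_per_attr = [["a", "b", "c"], ["x", "y"]]
bandwidths = [0.3, 0.3]

typical = log_likelihood(("a", "x"), cluster, categories_per_attr, bandwidths)
atypical = log_likelihood(("c", "y"), cluster, categories_per_attr, bandwidths)
```

In this toy example the object `("a", "x")` scores higher than `("c", "y")`, yet the category `"c"`, which never occurs in the cluster, still receives a nonzero probability because of the bandwidth. An assignment step in a model-based clustering loop could pick the cluster with the largest such score; how the paper selects the bandwidths is not reproduced here.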
