Multi-Source Text Topic Model Based on DMA and Feature Division

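Cite this article:	XU Weijia, QIN Yongbin, HUANG Ruizhang, CHEN Yanping. Multi-Source Text Topic Model Based on DMA and Feature Division[J]. Computer Engineering, 2021, 47(7): 59-66. DOI: 10.19678/j.issn.1000-3428.0058372

Authors:	XU Weijia, QIN Yongbin, HUANG Ruizhang, CHEN Yanping

Affiliation:	1. College of Computer Science and Technology, Guizhou University, Guiyang 550025, China; 2. State Key Laboratory of Public Big Data, Guiyang 550025, China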

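Funding:	Key Program of the Joint Fund of the National Natural Science Foundation of China (U1836205); Major Research Plan of the National Natural Science Foundation of China (91746116); Major Special Project of the Science and Technology Department of Guizhou Province (Qiankehe Major Special Project No. 2017-3002); Key Project of the Science and Technology Foundation of Guizhou Province (Qiankehe Jichu No. 2020-1Z055).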

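Abstract:	To address the poor topic discovery performance of traditional topic models when mining multi-source text data sets, a multi-source text topic model based on Dirichlet Multinomial Allocation (DMA) and feature division is designed. Building on the DMA model, it relaxes the requirement that the number of topics be supplied in advance, assigns each data source its own topic distribution parameter, and uses the Gibbs sampling algorithm to estimate the number of topics for each data source. In addition, each data source is assigned its own noise word distribution parameter to...
Keywords:	multi-source text topic model; text clustering; Dirichlet Multinomial Allocation; feature division; Gibbs sampling
Received:	2020-05-19
Revised:	2020-07-04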
Multi-Source Text Topic Model Based on DMA and Feature Division

XU Weijia, QIN Yongbin, HUANG Ruizhang, CHEN Yanping. Multi-Source Text Topic Model Based on DMA and Feature Division[J]. Computer Engineering, 2021, 47(7): 59-66. DOI: 10.19678/j.issn.1000-3428.0058372

Authors:	XU Weijia, QIN Yongbin, HUANG Ruizhang, CHEN Yanping

Affiliation:	1. College of Computer Science and Technology, Guizhou University, Guiyang 550025, China; 2. State Key Laboratory of Public Big Data, Guiyang 550025, China

Abstract:	To address the poor performance of existing topic models in mining multi-source text data sets, a multi-source text topic model based on Dirichlet Multinomial Allocation (DMA) and feature division is designed. The model relaxes the requirement that the number of topics be specified in advance, assigns a dedicated topic distribution parameter to each data source, and automatically estimates the number of topics for each data source using the Gibbs sampling algorithm. In addition, the model assigns a dedicated noise word distribution parameter and topic-word distribution parameter to each data source. The feature words and noise words of each data source are distinguished using the feature division method, and the word features of each data source are learned so that the noise word set does not affect clustering. Experimental results show that, compared with existing topic models, the proposed model preserves the unique word features of each data source and achieves better topic discovery performance and improved robustness.
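
The idea of estimating the number of topics with Gibbs sampling, rather than fixing it in advance, can be illustrated with a minimal sketch. The code below is not the authors' model: it assumes a plain finite Dirichlet-multinomial mixture with one topic per document, a symmetric Dirichlet(alpha / k_max) prior on the mixture weights, and a symmetric Dirichlet(beta) prior on words. It leaves out the per-source parameters and the feature word / noise word division described in the abstract. The function name `gibbs_dmm`, its parameters, and the toy corpus are all hypothetical.

```python
import math
import random
from collections import Counter


def gibbs_dmm(docs, k_max=6, alpha=1.0, beta=0.1, iters=30, seed=0):
    """Collapsed Gibbs sampler for a finite Dirichlet-multinomial mixture.

    docs is a list of token lists; returns one component label per document.
    All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_counts = [Counter(doc) for doc in docs]

    z = [rng.randrange(k_max) for _ in docs]   # component of each document
    m = [0] * k_max                            # documents per component
    n = [0] * k_max                            # tokens per component
    n_w = [Counter() for _ in range(k_max)]    # word counts per component
    for d, k in enumerate(z):
        m[k] += 1
        n[k] += len(docs[d])
        n_w[k].update(doc_counts[d])

    for _ in range(iters):
        for d, counts in enumerate(doc_counts):
            length = len(docs[d])
            k_old = z[d]
            # Take document d out of its current component.
            m[k_old] -= 1
            n[k_old] -= length
            n_w[k_old].subtract(counts)

            # Log of the collapsed conditional p(z_d = k | all other labels).
            log_p = []
            for k in range(k_max):
                lp = math.log(m[k] + alpha / k_max)
                lp += math.lgamma(n[k] + vocab_size * beta)
                lp -= math.lgamma(n[k] + length + vocab_size * beta)
                for w, c in counts.items():
                    lp += math.lgamma(n_w[k][w] + c + beta)
                    lp -= math.lgamma(n_w[k][w] + beta)
                log_p.append(lp)

            # Sample a new component and put the document back.
            top = max(log_p)
            weights = [math.exp(lp - top) for lp in log_p]
            k_new = rng.choices(range(k_max), weights=weights)[0]
            z[d] = k_new
            m[k_new] += 1
            n[k_new] += length
            n_w[k_new].update(counts)
    return z


# Toy corpus (invented for illustration only).
docs = [
    ["topic", "model", "gibbs", "sampling"],
    ["gibbs", "sampling", "topic", "dirichlet"],
    ["topic", "model", "dirichlet"],
    ["football", "match", "goal"],
    ["goal", "match", "league", "football"],
    ["league", "football", "goal"],
]
labels = gibbs_dmm(docs)
n_topics = len(set(labels))  # components that still hold at least one document
print(labels, n_topics)
```

In this sketch `k_max` is only an upper bound: components that end up with no documents are ignored, so the count of occupied components serves as the topic-number estimate. A multi-source variant along the lines of the abstract would keep separate count tables and priors per data source, which this example does not attempt.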

Keywords:	multi-source text topic model; text clustering; Dirichlet Multinomial Allocation (DMA); feature division; Gibbs sampling
This article is indexed in Wanfang Data and other databases.