
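Multi-Source Text Topic Model Based on DMA and Feature Division
Cite this article: XU Weijia, QIN Yongbin, HUANG Ruizhang, CHEN Yanping. Multi-Source Text Topic Model Based on DMA and Feature Division[J]. Computer Engineering, 2021, 47(7): 59-66. DOI: 10.19678/j.issn.1000-3428.0058372
Authors: XU Weijia  QIN Yongbin  HUANG Ruizhang  CHEN Yanping
Affiliation: 1. College of Computer Science and Technology, Guizhou University, Guiyang 550025; 2. State Key Laboratory of Public Big Data, Guiyang 550025
Funding: Key Project of the Joint Fund of the National Natural Science Foundation of China (U1836205); Major Research Plan of the National Natural Science Foundation of China (91746116); Major Special Project of the Guizhou Provincial Department of Science and Technology (Qiankehe Major Special Project No. 2017-3002); Key Project of the Guizhou Provincial Science and Technology Foundation (Qiankehe Jichu No. 2020-1Z055).
Abstract: To address the poor topic discovery performance of traditional topic models when mining multi-source text data sets, a multi-source text topic model based on Dirichlet Multinomial Allocation (DMA) and feature division is designed. Building on the DMA model, it relaxes the restriction on the number of topics supplied in advance, assigns each data source its own topic distribution parameter, and uses the Gibbs sampling algorithm to estimate the number of topics in each data source. In addition, each data source is assigned its own noise-word distribution parameter to...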

Keywords: multi-source text topic model  text clustering  Dirichlet Multinomial Allocation  feature division  Gibbs sampling
Received: 2020-05-19
Revised: 2020-07-04

Multi-Source Text Topic Model Based on DMA and Feature Division
XU Weijia, QIN Yongbin, HUANG Ruizhang, CHEN Yanping. Multi-Source Text Topic Model Based on DMA and Feature Division[J]. Computer Engineering, 2021, 47(7): 59-66. DOI: 10.19678/j.issn.1000-3428.0058372
Authors: XU Weijia  QIN Yongbin  HUANG Ruizhang  CHEN Yanping
Affiliation: 1. College of Computer Science and Technology, Guizhou University, Guiyang 550025, China; 2. State Key Laboratory of Public Big Data, Guiyang 550025, China
Abstract: To address the poor topic discovery performance of existing topic models on multi-source text data sets, a multi-source text topic model based on Dirichlet Multinomial Allocation (DMA) and feature division is designed. The model relaxes the requirement that the number of topics be specified in advance, assigns a dedicated topic distribution parameter to each data source, and automatically estimates the number of topics for each data source using the Gibbs sampling algorithm. In addition, the model assigns a dedicated noise-word distribution parameter and topic-word distribution parameter to each data source. The feature words and noise words of each data source are distinguished using the feature division method, and the word features of each data source are learned, which avoids the influence of the noise word set on model clustering. Experimental results show that, compared with existing topic models, the proposed model preserves the unique word features of each data source and achieves better topic discovery performance and robustness.
Keywords: multi-source text topic model  text clustering  Dirichlet Multinomial Allocation (DMA)  feature division  Gibbs sampling
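
The abstract mentions estimating the number of topics with Gibbs sampling on top of a DMA model. As a rough illustration only, the sketch below runs a collapsed Gibbs sampler for a plain single-source Dirichlet multinomial mixture, where `K` is only an upper bound and the topics left non-empty give the estimated topic count. It is not the authors' model: it has no per-source parameters and no feature/noise word division, and the function name `gibbs_dma`, the hyperparameters `alpha` and `beta`, and the toy documents are all assumptions made for this example.

```python
import math
import random
from collections import Counter

def gibbs_dma(docs, vocab_size, K=10, alpha=1.0, beta=0.1, iters=30, seed=0):
    """Collapsed Gibbs sampling for a single-source Dirichlet multinomial
    mixture (one topic per document). K is an upper bound on the topic count.
    Hypothetical sketch; not the multi-source model from the paper."""
    rng = random.Random(seed)
    z = [rng.randrange(K) for _ in docs]      # topic of each document
    m = [0] * K                               # documents per topic
    n = [0] * K                               # tokens per topic
    nw = [Counter() for _ in range(K)]        # word counts per topic
    counts = [Counter(d) for d in docs]       # word counts per document

    def move(d, k, sign):
        # Add (sign=+1) or remove (sign=-1) document d from topic k.
        m[k] += sign
        n[k] += sign * len(docs[d])
        for w, c in counts[d].items():
            nw[k][w] += sign * c

    for d, k in enumerate(z):
        move(d, k, +1)

    for _ in range(iters):
        for d in range(len(docs)):
            move(d, z[d], -1)                 # exclude d from the statistics
            logp = []
            for k in range(K):
                # Prior term: finite approximation with Dirichlet(alpha / K).
                lp = math.log(m[k] + alpha / K)
                # Likelihood term: Dirichlet-multinomial predictive of d.
                for w, c in counts[d].items():
                    for j in range(c):
                        lp += math.log(nw[k][w] + beta + j)
                for i in range(len(docs[d])):
                    lp -= math.log(n[k] + vocab_size * beta + i)
                logp.append(lp)
            mx = max(logp)                    # stabilise before exponentiating
            weights = [math.exp(lp - mx) for lp in logp]
            z[d] = rng.choices(range(K), weights=weights)[0]
            move(d, z[d], +1)
    return z

# Toy documents as lists of word ids (assumed data, for illustration only).
docs = [[0, 1, 2, 0], [1, 2, 0], [3, 4, 5, 3], [4, 5, 3]]
z = gibbs_dma(docs, vocab_size=6, K=4)
print("non-empty topics:", len(set(z)))
```

Counting the distinct values in `z` after sampling gives the number of topics that remain in use. Extending this toward the described model would require, at minimum, separate topic and word distribution parameters per data source and a per-word feature/noise indicator, which are not attempted here.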
This article is indexed in Wanfang Data and other databases.