
EntropyRank: Keyphrase Extraction Algorithm Based on Topic Entropy
Cite this article: YIN Hong, CHEN Yan, LI Ping. EntropyRank: Keyphrase Extraction Algorithm Based on Topic Entropy[J]. Journal of Chinese Information Processing, 2019, 33(11): 107-114.
Authors: YIN Hong  CHEN Yan  LI Ping
Affiliation: Center of Intelligence and Networked System, School of Computer Science, Southwest Petroleum University, Chengdu, Sichuan 610500, China
Funding: National Natural Science Foundation of China, Young Scientists Fund (61503312)
Abstract: Keyphrase extraction is an important subtask of natural language processing that aims to automatically identify the important phrases in a text. Existing methods mainly emphasize how the relations between words and the influence of individual words affect extraction quality. Considering that keyphrases should accurately represent a document's topics, this paper proposes a keyphrase extraction algorithm based on topic entropy. The algorithm first uses Latent Dirichlet Allocation to train the topic distributions of documents and of words, and combines the two to represent each word's topic distribution within a specific document. It then computes the information entropy of that word-topic distribution, called the topic entropy, to represent the word's own influence. Finally, it runs a random walk on the word co-occurrence network to compute a score for each candidate phrase. Experiments on six public datasets show that the algorithm improves the F1 score by 2.61% to 6.98% over existing unsupervised keyphrase extraction algorithms.

Keywords: keyphrase extraction  random walk  topic model  word influence
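
The abstract names the steps of the method but not its formulas, so the following is only a rough sketch of how such a pipeline might fit together, not the authors' implementation. Everything specific in it is an assumption made for illustration: the word-topic distribution within a document is taken as proportional to p(z|d) * p(w|z), a word's influence is taken as one minus its normalized topic entropy (so topic-focused words weigh more), the random walk is a personalized PageRank whose teleport step is biased by that influence, and a phrase's score is the sum of its words' scores. The topic distributions are hypothetical toy numbers standing in for a trained LDA model.

```python
import math
from collections import defaultdict

def word_topic_in_doc(doc_topic, word_given_topic):
    """Assumed combination: p(z | w, d) proportional to p(z | d) * p(w | z)."""
    joint = [pd * pw for pd, pw in zip(doc_topic, word_given_topic)]
    total = sum(joint)
    k = len(joint)
    return [j / total for j in joint] if total > 0 else [1.0 / k] * k

def topic_entropy(dist):
    """Shannon entropy (natural log) of a topic distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def influence(dist):
    """Assumed mapping: low-entropy (topic-focused) words get weight near 1."""
    return 1.0 - topic_entropy(dist) / math.log(len(dist))

def cooccurrence_graph(tokens, window=2):
    """Undirected weighted graph linking words at most `window` positions apart."""
    graph = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            v = tokens[j]
            if v != w:
                graph[w][v] += 1.0
                graph[v][w] += 1.0
    return {w: dict(neighbors) for w, neighbors in graph.items()}

def biased_random_walk(graph, prior, damping=0.85, iters=100):
    """Personalized PageRank: the teleport step follows `prior`."""
    z = sum(prior[w] for w in graph)
    if z > 0:
        tele = {w: prior[w] / z for w in graph}
    else:  # fall back to a uniform teleport if every prior is zero
        tele = {w: 1.0 / len(graph) for w in graph}
    out_weight = {w: sum(graph[w].values()) for w in graph}
    score = dict(tele)
    for _ in range(iters):
        score = {
            w: (1 - damping) * tele[w]
            + damping * sum(score[v] * graph[v][w] / out_weight[v] for v in graph[w])
            for w in graph
        }
    return score

def rank_phrases(tokens, candidates, doc_topic, word_given_topic, window=2):
    """Score each candidate phrase as the sum of its words' walk scores."""
    prior = {
        w: influence(word_topic_in_doc(doc_topic, word_given_topic[w]))
        for w in set(tokens)
    }
    score = biased_random_walk(cooccurrence_graph(tokens, window), prior)
    ranked = [(" ".join(p), sum(score.get(w, 0.0) for w in p)) for p in candidates]
    return sorted(ranked, key=lambda item: item[1], reverse=True)

# Hypothetical toy inputs: two topics, one short tokenized document.
tokens = ["topic", "entropy", "random", "walk", "topic", "entropy", "graph"]
doc_topic = [0.7, 0.3]                      # stands in for LDA's p(z | d)
word_given_topic = {                        # stands in for LDA's p(w | z)
    "topic": [0.20, 0.01],
    "entropy": [0.15, 0.02],
    "random": [0.05, 0.05],
    "walk": [0.04, 0.06],
    "graph": [0.03, 0.10],
}
candidates = [("topic", "entropy"), ("random", "walk"), ("graph",)]

ranked = rank_phrases(tokens, candidates, doc_topic, word_given_topic)
for phrase, value in ranked:
    print(f"{phrase}: {value:.4f}")
```

In a real setting the two distributions would come from an LDA model trained on a corpus, and the candidate phrases from a part-of-speech or n-gram filter; both are outside what the abstract specifies.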