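Hot key phrase extraction based on TF*PDF
Cite this article: MA Pei-xun, GAO Yan. Hot key phrase extraction based on TF*PDF[J]. Application Research of Computers, 2013, 30(12): 3610-3613.
Authors: MA Pei-xun  GAO Yan
Affiliation: 1. Dept. of Software, Changsha Social Work College, Changsha 410082; 2. College of Information Science & Engineering, Central South University, Changsha 410000
Funding: New Teacher Fund for Doctoral Programs, Ministry of Education of China (20090162120087); Hunan Provincial Science and Technology Plan Project (2009FJ3053)
Abstract: Key phrases extracted by the traditional TF*PDF method can describe a topic precisely and support the tracking of news reports, but the method sometimes mistakes noise data for key phrases. This paper proposes a two-stage key phrase extraction method based on position-weighted TF*PDF to filter out such noise. The method combines the traditional TF*PDF algorithm with position weights to compute the weights of terms and phrases and obtain a list of candidate key phrases; the burst value of each candidate is then used to filter noise from the list. A key phrase identification process combines hot terms into phrases according to position information, frequency information, and similar cues. The position-weighted TF*PDF algorithm is also used to assign weights to the phrases, and the top K phrases are taken as the hot key phrases. Experiments on real Web data show that, compared with the traditional TF*PDF extraction method, the proposed method removes absolute noise from the key phrases more effectively and improves the accuracy of hot topic detection.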

Keywords: TF*PDF  TDT  extraction  burst value  key phrase

Hot keyphrase extraction based on TF*PDF
MA Pei-xun,GAO Yan.Hot keyphrase extraction based on TF*PDF[J].Application Research of Computers,2013,30(12):3610-3613.
Authors:MA Pei-xun  GAO Yan
Affiliation:1. Dept. of Software, Changsha Social Work College, Changsha 410082, China; 2. College of Information Science & Engineering, Central South University, Changsha 410000, China
Abstract: Key phrases extracted by the traditional TF*PDF method can represent topics accurately and track reports effectively, but noise data is sometimes also recognized as key phrases. This paper proposes a two-step key phrase extraction method based on an improved TF*PDF to filter noise data. In the first step, the method combines traditional TF*PDF with position weights to compute the weights of words and phrases and obtain a candidate hot term list, then uses the burst value of each term to filter noise from the list. In the second step, a phrase identification process combines hot terms into phrases using position information, frequency information, etc. Finally, the position-weighted TF*PDF algorithm is also used to weight the phrases, and the top K phrases are chosen as hot key phrases. Experiments on real Web data indicate that this extraction method filters noise data more effectively than traditional TF*PDF and improves the quality of topic tracking.
Keywords:TF*PDF  TDT  extraction  burst value  key phrase
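
The term-weighting step described in the abstract can be illustrated with a minimal sketch. It assumes the commonly cited TF*PDF form, in which a term's weight is the sum over news channels of its normalized in-channel frequency multiplied by exp(n/N), where n is the number of documents in the channel containing the term and N is the number of documents in the channel. The abstract does not say how position weights enter the formula, so the `title_weight` multiplier for title occurrences, the input layout, and the function names are hypothetical choices made for illustration, not the authors' implementation. Burst-value filtering and phrase identification are not sketched because the abstract does not define them.

```python
import math
from collections import Counter, defaultdict


def position_weighted_tf_pdf(channels, title_weight=2.0):
    """Score terms with TF*PDF, counting title occurrences more heavily.

    channels: dict mapping a channel name to a list of documents; each
    document is a dict with "title" and "body" token lists (assumed layout).
    title_weight: hypothetical position weight applied to title tokens.
    """
    scores = defaultdict(float)
    for docs in channels.values():
        if not docs:
            continue
        freq = Counter()       # position-weighted frequency of each term in the channel
        doc_count = Counter()  # number of documents in the channel containing the term
        for doc in docs:
            for tok in doc["title"]:
                freq[tok] += title_weight
            for tok in doc["body"]:
                freq[tok] += 1.0
            for tok in set(doc["title"]) | set(doc["body"]):
                doc_count[tok] += 1
        # Normalize frequencies by the channel's Euclidean norm so that
        # channels of different sizes contribute on a comparable scale.
        norm = math.sqrt(sum(f * f for f in freq.values()))
        if norm == 0:
            continue
        n_docs = len(docs)
        for tok, f in freq.items():
            scores[tok] += (f / norm) * math.exp(doc_count[tok] / n_docs)
    return dict(scores)


def top_k(scores, k):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Calling `top_k(position_weighted_tf_pdf(channels), k)` on tokenized documents grouped by source returns the k candidate hot terms; in the method the abstract describes, such candidates would then pass through burst-value filtering and phrase identification before the final ranking.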