Text Keyword Extraction Method Based on BERT and LightGBM
Cite as: HE Chuanpeng, YIN Ling, HUANG Bo, WANG Mingsheng, GUO Ruyan, ZHANG Shuai, JU Jiaji. Text Keyword Extraction Method Based on BERT and LightGBM[J]. Electronic Science and Technology, 2023, 36(3): 7-13.
Authors: HE Chuanpeng  YIN Ling  HUANG Bo  WANG Mingsheng  GUO Ruyan  ZHANG Shuai  JU Jiaji
Affiliation: School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
Funding: National Natural Science Foundation of China (61802251)
Received: 2021-08-21
Abstract: Traditional text keyword extraction methods ignore contextual semantic information and cannot resolve polysemy, so their extraction results are unsatisfactory. Building on the LDA and BERT models, this study proposes the LDA-BERT-LightGBM (LB-LightGBM) model. An LDA topic model is used to obtain the topic of each review and its word distribution, and candidate keywords are selected according to a threshold. The selected words are concatenated with the original review text and fed into the BERT model, which is trained to produce word vectors that incorporate the text topic. Keyword extraction is thereby converted into a binary classification problem, which is solved with the LightGBM algorithm. Experiments compare the TextRank algorithm, the LDA algorithm, the LightGBM algorithm, and the proposed LB-LightGBM model in terms of precision P, recall R, and F1 for text keyword extraction. The results show that when Top N ranges from 3 to 6, the average F1 is 3.5% higher than that of the best comparison method. Overall, the proposed method extracts keywords better than the comparison methods selected in the experiments and identifies text keywords more accurately.
Keywords: topic model; word vector; BERT; LightGBM; candidate keywords; keyword extraction; text topic; keywords
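
Two steps mentioned in the abstract, threshold-based candidate selection and Top N evaluation with P, R, and F1, can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes the threshold applies to a word's probability within its LDA topic and that "splicing" means pairing each candidate with the original text, neither of which the abstract specifies. The function names (`select_candidates`, `build_pairs`, `precision_recall_f1`) and the sample data are hypothetical. The BERT encoding and LightGBM classification stages are omitted.

```python
from typing import Dict, List, Sequence, Set, Tuple


def select_candidates(topic_word_probs: Dict[str, float], threshold: float) -> List[str]:
    """Keep words whose topic probability is at least `threshold`, highest first.

    Assumption: the threshold is applied to the topic-word probability.
    """
    kept = [(word, prob) for word, prob in topic_word_probs.items() if prob >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [word for word, _ in kept]


def build_pairs(candidates: Sequence[str], text: str) -> List[Tuple[str, str]]:
    """Pair each candidate with the original text.

    Assumption: each (candidate, text) pair would later be encoded by BERT
    and scored by a binary classifier; that part is not shown here.
    """
    return [(candidate, text) for candidate in candidates]


def precision_recall_f1(
    ranked: Sequence[str], gold: Set[str], top_n: int
) -> Tuple[float, float, float]:
    """Compute P, R, and F1 for the first `top_n` predicted keywords.

    `ranked` is assumed to contain no duplicate words.
    """
    predicted = list(ranked[:top_n])
    hits = sum(1 for word in predicted if word in gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical topic-word distribution for one review.
    probs = {"battery": 0.30, "screen": 0.25, "price": 0.20, "the": 0.05}
    review = "The battery lasts long and the price is fair."

    candidates = select_candidates(probs, threshold=0.1)
    pairs = build_pairs(candidates, review)
    p, r, f1 = precision_recall_f1(candidates, {"battery", "price", "camera"}, top_n=3)
    print(candidates, len(pairs), round(p, 3), round(r, 3), round(f1, 3))
```

In a fuller pipeline, the word probabilities would come from a trained LDA model and the ranking from the classifier's predicted scores rather than from the topic probabilities used here.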