Text Keyword Extraction Method Based on BERT and LightGBM

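Cite this article:	HE Chuanpeng, YIN Ling, HUANG Bo, WANG Mingsheng, GUO Ruyan, ZHANG Shuai, JU Jiaji. Text Keyword Extraction Method Based on BERT and LightGBM[J]. Electronic Science and Technology, 2023, 36(3): 7-13.

Authors:	HE Chuanpeng, YIN Ling, HUANG Bo, WANG Mingsheng, GUO Ruyan, ZHANG Shuai, JU Jiaji

Affiliation:	School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620

Funding:	National Natural Science Foundation of China (61802251)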

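Abstract:	Traditional text keyword extraction methods ignore contextual semantic information and cannot handle polysemy, so their extraction results are unsatisfactory. Building on the LDA and BERT models, this paper proposes the LDA-BERT-LightGBM (LB-LightGBM) model. The method uses the LDA topic model to obtain the topics of each review and their word distributions, and selects candidate keywords according to a threshold. The selected words are concatenated with the original review text and fed into the BERT model for word-vector training, which yields word vectors that carry the text's topic information. The LightGBM algorithm then turns keyword extraction into a binary classification problem. Experiments compare the TextRank algorithm, the LDA algorithm, the LightGBM algorithm, and the proposed LB-LightGBM model on precision P, recall R, and F1 for text keyword extraction. The results show that when Top N is 3 to 6, the average F1 is 3.5% higher than that of the best of the compared methods. Overall, the proposed method extracts keywords better than the comparison methods chosen for the experiments and identifies text keywords more accurately.
Keywords:	topic model; word vector; BERT; LightGBM; candidate keywords; keyword extraction; text topic keywords
Received:	2021-08-21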
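
The candidate-selection and concatenation steps described in the abstract can be illustrated with a minimal sketch. The abstract does not state the threshold value, the comparison rule, or the exact input format, so the function names `select_candidates` and `build_model_input`, the `>=` comparison, the `[SEP]` separator string, and the toy probabilities below are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List


def select_candidates(topic_word_probs: Dict[str, float], threshold: float) -> List[str]:
    """Keep words whose topic probability is at least `threshold`.

    `topic_word_probs` stands in for the word distribution of one LDA topic.
    The `>=` rule is an assumption; the abstract only says "according to a threshold".
    """
    kept = [(word, prob) for word, prob in topic_word_probs.items() if prob >= threshold]
    # Highest-probability words first; ties keep their original order.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in kept]


def build_model_input(candidates: List[str], text: str, sep: str = " [SEP] ") -> str:
    """Join candidate words with the original text into one input string.

    The separator is a hypothetical choice; the abstract only says the
    words and the review text are concatenated before entering BERT.
    """
    return " ".join(candidates) + sep + text


# Toy topic-word distribution (made-up numbers, for illustration only).
probs = {"battery": 0.12, "screen": 0.08, "the": 0.01, "price": 0.05}
candidates = select_candidates(probs, threshold=0.05)
model_input = build_model_input(candidates, "The battery lasts long")
print(candidates)
print(model_input)
```

In a full pipeline, the concatenated string would be tokenized and encoded by BERT, and the resulting vectors would become features for a binary keyword / non-keyword classifier.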
Text Keyword Extraction Method Based on BERT and LightGBM

HE Chuanpeng,YIN Ling,HUANG Bo,WANG Mingsheng,GUO Ruyan,ZHANG Shuai,JU Jiaji.Text Keyword Extraction Method Based on BERT and LightGBM[J].Electronic Science and Technology,2023,36(3):7-13.

Authors:	HE Chuanpeng YIN Ling HUANG Bo WANG Mingsheng GUO Ruyan ZHANG Shuai JU Jiaji

Affiliation:	School of Electronic and Electrical Engineering,Shanghai University of Engineering Science,Shanghai 201620,China

Abstract:	Traditional text keyword extraction methods ignore contextual semantic information and cannot handle polysemy, so their extraction results are unsatisfactory. Building on the LDA and BERT models, this study proposes the LDA-BERT-LightGBM (LB-LightGBM) model. The LDA topic model is used to obtain the topics of each review and their word distributions, candidate keywords are selected according to a threshold, and the selected words are concatenated with the original review text and fed into the BERT model. Word-vector training then yields word vectors that carry the text's topic information, and the LightGBM algorithm converts keyword extraction into a binary classification problem. The TextRank algorithm, the LDA algorithm, the LightGBM algorithm, and the proposed LB-LightGBM model are compared experimentally on precision P, recall R, and F1 for text keyword extraction. The results show that when Top N is 3 to 6, the average F1 is 3.5% higher than that of the best of the compared methods, indicating that the proposed method generally extracts keywords better than the comparison methods chosen for the experiments and finds text keywords more accurately.

Keywords:	topic model; word vector; BERT; LightGBM; candidate keywords; keyword extraction; text topic keywords
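
The evaluation named in the abstract (precision P, recall R, and F1 at a given Top N) can be sketched with the standard definitions. The abstract does not say how extracted words are matched against reference keywords, so exact string matching, the function name `prf_at_n`, and the toy word lists are assumptions for illustration only.

```python
from typing import List, Sequence, Tuple


def prf_at_n(ranked: Sequence[str], gold: Sequence[str], n: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 of the top-n ranked words against gold keywords.

    Assumes `ranked` contains no duplicate words and uses exact string matching.
    """
    predicted: List[str] = list(ranked[:n])
    gold_set = set(gold)
    hits = sum(1 for word in predicted if word in gold_set)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Toy example (made-up words, not data from the paper).
ranked_words = ["battery", "screen", "price", "color", "size"]
gold_keywords = ["battery", "price", "camera"]

for n in range(3, 7):  # Top N from 3 to 6, the range reported in the abstract
    p, r, f1 = prf_at_n(ranked_words, gold_keywords, n)
    print(f"N={n}: P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Averaging the F1 values over N = 3 to 6 for each method gives the kind of summary figure the abstract compares across methods.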
