
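Time-Weighted TF-LDA Topic Analysis of Academic Literature Abstracts
Cite this article: WU Zhe, YANG Fang. Time-Weighted TF-LDA Topic Analysis of Academic Literature Abstracts[J]. Computer Technology and Development, 2020(1): 194-200.
Authors: WU Zhe  YANG Fang
Affiliation: School of Computer Science, Xi'an University of Posts and Telecommunications
Funding: Special Scientific Research Program of the Shaanxi Provincial Department of Education (15JK1679); Xi'an Science and Technology Innovation Guidance Project (201805040YD18CG24(7))
Abstract: As the web has grown, topic extraction has found increasingly wide application, particularly for academic literature. Although abstracts of academic papers are short texts, their high dimensionality makes them hard for text topic models to handle, and their time sensitivity means that topic mining tends to overlook the time factor, producing uneven and unclear topic distributions. To address these problems, a topic clustering model for academic literature abstracts based on TTF-LDA (time + tf-idf + latent Dirichlet allocation) is proposed. TF-IDF feature extraction is applied to the abstracts to select feature words, which effectively reduces the dimensionality of the LDA input. The publication time of each paper is incorporated by establishing a time window that bounds the period under analysis, and feature words receive time weights derived from publication time; the summed time weights of the feature words, together with a topic-guiding feature lexicon, serve as the influence factor for LDA. In experiments on a crawled dataset, compared with standard LDA and MVC-LDA at the same number of topics, the model shows lower perplexity and greater separation between topics, and better matches the characteristics of academic literature.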

Keywords: LDA  topic model  academic literature  TF-IDF  time factor
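
The preprocessing described in the abstract (restrict documents to a time window, keep only high-scoring TF-IDF feature words, then accumulate a time weight per feature word) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the weighting formula, so the linear `time_weight` function, the `top_k` cutoff, and all function names here are assumptions, and the step that feeds the weights into LDA is left out.

```python
import math
from collections import Counter


def tfidf_top_terms(docs, top_k):
    """Keep the top_k highest-scoring TF-IDF terms of each tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    selected = []
    for doc in docs:
        counts = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        selected.append(ranked[:top_k])
    return selected


def time_weight(year, window_start, window_end):
    """Assumed linear weight in (0, 1]: newer papers inside the window weigh more."""
    if not window_start <= year <= window_end:
        return 0.0
    return (year - window_start + 1) / (window_end - window_start + 1)


def term_time_weights(docs, years, window_start, window_end, top_k=3):
    """Sum the time weights of all in-window documents that keep a given term."""
    in_window = [(d, y) for d, y in zip(docs, years)
                 if window_start <= y <= window_end]
    features = tfidf_top_terms([d for d, _ in in_window], top_k)
    weights = Counter()
    for terms, (_, year) in zip(features, in_window):
        for term in terms:
            weights[term] += time_weight(year, window_start, window_end)
    return dict(weights)


docs = [
    ["lda", "topic", "model", "topic"],
    ["lda", "time", "weight"],
    ["lda", "old", "paper"],
]
years = [2019, 2018, 2010]
weights = term_time_weights(docs, years, 2015, 2019, top_k=2)
```

In this toy run the 2010 document falls outside the window and is dropped, and `lda` is discarded because it appears in every remaining document and so has an IDF of zero. The surviving feature words could then be passed to any LDA implementation as a reduced vocabulary.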

A Thematic Analysis Method of Academic Documents Based on TF-IDF and LDA
WU Zhe, YANG Fang. A Thematic Analysis Method of Academic Documents Based on TF-IDF and LDA[J]. Computer Technology and Development, 2020(1): 194-200.
Authors:WU Zhe  YANG Fang
Affiliation: School of Computer Science, Xi'an University of Posts and Telecommunications, Xi'an 710121, China
Abstract: With the growth of the web, topic extraction has been applied more and more widely, especially to academic literature. Although abstracts of academic literature are short texts, their high dimensionality makes them difficult for text topic models to process, and their timeliness means the time factor is easily ignored in topic mining, resulting in uneven and unclear topic distributions. To solve these problems, a topic clustering model of academic literature abstracts based on TTF-LDA (time + tf-idf + latent Dirichlet allocation) is proposed. Extracting feature words from the abstracts with TF-IDF effectively reduces the input dimensionality of the LDA model. The publication time of the literature is integrated by establishing a time window that limits the period of topic analysis. Publication time is also used to add time weights to the feature words, and the sum of these weights, together with a topic-guiding feature lexicon, acts as the influencing factor of LDA. In experiments on a crawled dataset, compared with standard LDA and MVC-LDA at the same number of topics, the model has lower perplexity and higher distinction between topics, which is more in line with the characteristics of academic literature itself.
Keywords: LDA  topic model  academic literature  TF-IDF  time factor
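
The comparison reported in the abstract rests on how well each model predicts the text. Assuming the measure is the standard perplexity used to evaluate topic models (the abstract does not spell out its definition), it can be computed from a fitted model's document-topic matrix and topic-word distributions as in the sketch below. The names `theta` and `phi` and the toy values are illustrative only and are not results from the paper.

```python
import math


def perplexity(docs, theta, phi):
    """Return exp(-sum log p(w|d) / N), where p(w|d) = sum_k theta[d][k] * phi[k][w].

    docs  : list of tokenized documents
    theta : theta[d][k], probability of topic k in document d
    phi   : phi[k], dict mapping word -> probability under topic k
    Assumes every token has nonzero probability under the model.
    """
    log_likelihood = 0.0
    n_tokens = 0
    for d, doc in enumerate(docs):
        for word in doc:
            p = sum(theta[d][k] * phi[k].get(word, 0.0) for k in range(len(phi)))
            log_likelihood += math.log(p)
            n_tokens += 1
    return math.exp(-log_likelihood / n_tokens)


docs = [["a", "b"], ["c", "d"]]

# A model that spreads probability evenly over the four-word vocabulary.
uniform_phi = [{w: 0.25 for w in "abcd"}, {w: 0.25 for w in "abcd"}]
uniform_theta = [[0.5, 0.5], [0.5, 0.5]]

# A model whose two topics each concentrate on one document's words.
sharp_phi = [{"a": 0.5, "b": 0.5}, {"c": 0.5, "d": 0.5}]
sharp_theta = [[1.0, 0.0], [0.0, 1.0]]

uniform_score = perplexity(docs, uniform_theta, uniform_phi)
sharp_score = perplexity(docs, sharp_theta, sharp_phi)
```

Lower values are better: the uniform model scores the vocabulary size, while the model with well-separated topics scores lower on the same text.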
This article is indexed in VIP (维普) and other databases.